User:NikkiBot/Lexicographical coverage

Summary

Task: Update the statistics on Wikidata:Lexicographical coverage
Schedule: Runs at 05:00 UTC on Wednesdays.
Source code: https://github.com/nikkiwd/lexcover

More information

The statistics on Wikidata:Lexicographical coverage are generated using the weekly JSON lexeme dump and corpus files from https://download.wmcloud.org/corpora/ and the Leipzig Corpora Collection.

The JSON lexeme dump typically becomes available on Wednesdays between 03:00 UTC and 04:00 UTC. The bot downloads and processes it automatically before generating the statistics. The corpus data files are downloaded and processed once, when the bot is set up, and are not updated automatically.
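
As a rough illustration of how a coverage figure could be derived from these two inputs, here is a minimal sketch. It is not taken from the bot's source code. It assumes the dump has one JSON lexeme per line inside a JSON array, with form spellings under `forms[].representations[].value`, and that the corpus word list is tab-separated as `id`, `word`, `frequency`. The function names `collect_forms`, `parse_word_list` and `coverage` are hypothetical, and the sketch ignores language filtering, which real statistics would need.

```python
import json


def collect_forms(dump_lines):
    """Collect lowercased form spellings from lexeme dump lines.

    Assumes one JSON object per line, wrapped in '[' and ']' lines,
    with a trailing comma after each object except the last.
    """
    forms = set()
    for line in dump_lines:
        line = line.strip().rstrip(",")
        if line in ("", "[", "]"):
            continue
        lexeme = json.loads(line)
        for form in lexeme.get("forms", []):
            for rep in form.get("representations", {}).values():
                forms.add(rep["value"].lower())
    return forms


def parse_word_list(lines):
    """Read 'id<TAB>word<TAB>frequency' lines into a word -> count dict."""
    freqs = {}
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 3:
            continue  # skip malformed lines
        word = parts[1].lower()
        freqs[word] = freqs.get(word, 0) + int(parts[2])
    return freqs


def coverage(forms, freqs):
    """Fraction of corpus tokens whose spelling matches a known form."""
    total = sum(freqs.values())
    if total == 0:
        return 0.0
    covered = sum(count for word, count in freqs.items() if word in forms)
    return covered / total
```

With real data, `dump_lines` and the word list would be read from the decompressed dump and corpus files. Matching is case-insensitive here purely as a simplifying assumption.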

The bot has a fixed list of languages because it needs a corpus for each language. A language that is not yet supported can easily be added if the Leipzig Corpora Collection has a corpus for it. Other sources of freely available corpora could potentially be used but would need more work.