Wikidata:WikiProject Limits of Wikidata

This WikiProject aims to catalogue the current limits of Wikidata and to extrapolate their development until about 2030.
The formula depicted here describes the resolution limit of the light microscope. After it had served science for about a century, it was set in stone for a monument. The image was taken yet more years later; two months after it was taken, it was announced that the 2014 Nobel Prize in Chemistry would be awarded for overcoming this limit using fluorescent molecules and lasers.
Which of the limits of Wikidata are set in stone, and which ones should we strive to overcome?

About

This WikiProject aims to bring together various strands of conversation that touch upon the respective limits of Wikidata, in both technical and social terms. The aim is not to duplicate existing documentation but to collect pointers to the places where the limits covered by a given section are described or discussed.

Timeframe

While fundamental limits exist in nature, the technical and social limits we are discussing here are likely to shift over time, so any discussion of such limits will have to come with some indication of an applicable timeframe. Since the Wikimedia community has used the year 2030 as a reference point for its Movement Strategy, we will use it here as the default for projections into the future and contrast these projections with current values (which may be available via Wikidata's Grafana dashboards). If other timeframes make more sense in specific contexts, please indicate this.

Design limits

"Design limits" are the limits which exist by intentional design of the infrastructure of our systems. As design choices, they have benefits and drawbacks. Such infrastructure limits are not necessarily problems to address and may instead be environmental conditions for using the Wikidata platform.

Software

Knowledge graphs in general

MediaWiki

maxlag

See mw:Manual:Maxlag parameter.
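
As a rough illustration of how a client can respect this limit, the sketch below passes maxlag with every API call and backs off when the servers report too much replication lag. Python and the requests library are assumptions (this page prescribes no particular tooling), and the 5-second threshold and retry count are arbitrary example values, not recommendations.

  import time
  import requests

  API = "https://www.wikidata.org/w/api.php"

  def api_get(params, maxlag=5, retries=3):
      """Call the Wikidata API, backing off whenever the maxlag error is returned."""
      params = dict(params, format="json", maxlag=maxlag)
      for _ in range(retries):
          r = requests.get(API, params=params, timeout=30)
          data = r.json()
          if data.get("error", {}).get("code") == "maxlag":
              # The servers are lagged; wait as suggested by Retry-After, then try again.
              time.sleep(int(r.headers.get("Retry-After", "5")))
              continue
          return data
      raise RuntimeError("Replication lag stayed above the maxlag threshold")

  # Example call: fetch the labels of Q42.
  print(api_get({"action": "wbgetentities", "ids": "Q42", "props": "labels"}))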

Page load performance

Reduced loading times

See mw:Wikimedia Performance Team/Page load performance.

Wikibase

Generic Wikibase Repository
  • By design, the repository stores statements that *could* be true. There is no score yet that describes the validity or "common sense agreement" of a statement.
Data types
  • Item
  • Monolingual string
  • Single-value storage only; there is no time-series data type for KPIs
Data formats
  • JSON
  • RDF
  • etc.
Generic Wikibase Client
Wikidata's Wikibase Repository
Wikidata's Wikibase Client
Wikibase Repositories other than Wikidata
Wikibase Clients other than Wikidata
Wikidata bridge
Wikimedia wikis
Non-Wikimedia wikis

Wikidata Query Service

See also Future-proof WDQS.

Triple store
Blazegraph
Virtuoso
JANUS
Apache Rya

Apache Rya (Q28915769), source code (no development activity since 2020), manual

Oxigraph
Frontend
Timeout limit

Queries to the Wikidata Query Service time out after a certain period, which is a configurable parameter.

There are multiple related timeouts, e.g. a queryTimeout behind Blazegraph's SPARQL LOAD command or a timeout parameter for the WDQS GUI build job.
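
As an illustration of how the client-facing timeout manifests, the following sketch runs a deliberately heavy query against query.wikidata.org. Python with requests is assumed, and the heuristic of treating an HTTP 500 response containing "TimeoutException" as the server-side timeout is an assumption about the public endpoint's current behaviour, not a documented contract.

  import requests

  WDQS = "https://query.wikidata.org/sparql"

  # Asking for every instance of "human" (Q5) is usually large enough to hit the timeout.
  HEAVY_QUERY = "SELECT ?item WHERE { ?item wdt:P31 wd:Q5 }"

  def run_query(query):
      r = requests.get(
          WDQS,
          params={"query": query},
          headers={"Accept": "application/sparql-results+json",
                   "User-Agent": "LimitsOfWikidata-example/0.1 (example script)"},
          timeout=120,  # client-side timeout, separate from the server-side one
      )
      if r.status_code == 500 and "TimeoutException" in r.text:
          raise RuntimeError("WDQS server-side timeout reached")
      r.raise_for_status()
      return r.json()

  try:
      run_query(HEAVY_QUERY)
  except RuntimeError as e:
      print(e)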

JavaScript

The default UI is heavy on JavaScript, and so are many customizations. This creates problems with pages that have many statements: they load more slowly or freeze the browser.

Python

SPARQL

Hardware

"Firstly we need a machine to hold the data and do the needed processing. This blog post will use a “n1-highmem-16” (16 vCPUs, 104 GB memory) virtual machine on the Google Cloud Platform with 3 local SSDs held together with RAID 0."
"This should provide us with enough fast storage to store the raw TTL data, munged TTL files (where extra triples are added) as well as the journal (JNL) file that the blazegraph query service uses to store its data.
"This entire guide will work on any instance size with more than ~4GB memory and adequate disk space of any speed."

Functional limits

A "functional limit" exists when the system design encourages an activity, but engaging in that activity at a large scale exceeds the system's capacity to support it. For example, by design Wikidata encourages users to share data and make queries, but it cannot accommodate users doing a mass import of huge amounts of data or billions of quick queries.

A March 2019 report considered the extent to which various functions on Wikidata can scale with increased use - wikitech:WMDE/Wikidata/Scaling.

Wikidata editing

Edits by nature of account

Edits by human users
Manual edits
  • ...
Tool-assisted edits
  • ...
Edits by bots
  • ...

Edits by nature of edit

Page creations
Page modifications
Page merges
Reverts
Page deletions

Edits by size

Edits by frequency

WDQS querying

A clear example of where we encounter problems is SPARQL queries against the WDQS that ask for all things of some type (P31) when that involves a large number of hits, for example querying all scholarly article titles. Queries that involve fewer items of a type typically do not run into these issues.
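
For a concrete (illustrative, not measured) contrast, the two queries below both combine instance of (P31), scholarly article (Q13442814) and title (P1476); the unrestricted one typically exceeds the timeout, while the LIMIT-ed variant normally completes. They could be run with a helper like the run_query sketch under "Timeout limit" above.

  # Unrestricted: titles of all scholarly articles; tens of millions of results,
  # so it usually runs into the WDQS timeout.
  ALL_TITLES = """
  SELECT ?article ?title WHERE {
    ?article wdt:P31 wd:Q13442814 ;
             wdt:P1476 ?title .
  }
  """

  # Restricted: the same pattern with a small LIMIT normally completes quickly.
  SOME_TITLES = """
  SELECT ?article ?title WHERE {
    ?article wdt:P31 wd:Q13442814 ;
             wdt:P1476 ?title .
  }
  LIMIT 100
  """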

Query timeout

This is a design limit discussed under #Timeout limit above. It manifests itself as an error when a query takes more time to run than the timeout allows.

Queries by usage

One-off or rarely used queries
Showcase queries
Maintenance queries
Constraint checks

Queries by user type

Manually run queries
Queries run through tools
Queries run by bots

Queries by visualization

  • Table
  • Map
  • Bubble chart
  • Graph
  • etc.

Multiple simultaneous queries

Wikidata dumps

Creating dumps

Using dumps

Ingesting dumps
Ingesting dumps into a Wikibase instance
Ingesting dumps into the Wikidata Toolkit

Updating Triple Store Content

Creating large numbers of new items does not itself seem to cause problems (beyond the aforementioned WDQS querying issue). However, there is frequently a lag between edits to Wikidata's wiki pages and the propagation of those updates to the Wikidata Query Service servers.
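
One commonly used way to observe this lag is to ask WDQS for the schema:dateModified value it stores for wikidata.org and compare it with the current time. The sketch below (Python with requests) assumes that this triple is exposed by the public endpoint, as it has been historically; if that changes, the query returns nothing.

  from datetime import datetime, timezone
  import requests

  WDQS = "https://query.wikidata.org/sparql"

  # WDQS keeps a last-updated timestamp for the dataset as a whole.
  LAG_QUERY = """
  SELECT ?dateModified WHERE {
    <http://www.wikidata.org> schema:dateModified ?dateModified .
  }
  """

  r = requests.get(
      WDQS,
      params={"query": LAG_QUERY},
      headers={"Accept": "application/sparql-results+json",
               "User-Agent": "LimitsOfWikidata-example/0.1 (example script)"},
      timeout=60,
  )
  r.raise_for_status()
  value = r.json()["results"]["bindings"][0]["dateModified"]["value"]
  last_update = datetime.fromisoformat(value.replace("Z", "+00:00"))
  lag = datetime.now(timezone.utc) - last_update
  print(f"Approximate WDQS update lag: {lag.total_seconds():.0f} seconds")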

Edits to large items

Performance issues

One bottleneck is the editing of existing Wikidata items that have many properties. The underlying issue is that, for each edit, RDF for the full item is generated and the WDQS needs to update that full RDF. Therefore, independent of the size of the edit, edits to large items stress the system more than edits to small items. There is a Phabricator ticket to change how the WDQS triple store is updated.

Page size limits

Pages at the top of Special:LongPages are often at the size limit for a wiki page, which is set via $wgMaxArticleSize.
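
To see how close a given item is to this limit, its current page length can be compared with the configured maximum. In the sketch below (Python with requests), the 2048 KiB figure is only an assumption about the typical $wgMaxArticleSize setting on Wikimedia wikis, and Q2 (Earth) is used merely as an example of a large item.

  import requests

  API = "https://www.wikidata.org/w/api.php"
  ASSUMED_MAX_ARTICLE_SIZE = 2048 * 1024  # assumed $wgMaxArticleSize of 2048 KiB, in bytes

  def page_length(title):
      """Return the stored page size in bytes, as reported by prop=info."""
      r = requests.get(API, params={
          "action": "query",
          "prop": "info",
          "titles": title,
          "format": "json",
      }, timeout=30)
      page = next(iter(r.json()["query"]["pages"].values()))
      return page["length"]

  size = page_length("Q2")
  print(f"Q2 uses {size} bytes, {ASSUMED_MAX_ARTICLE_SIZE - size} bytes below the assumed limit")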

Merged QuickStatements edits

The current QuickStatements website is not always efficient in making edits: adding a single statement with references can result in multiple edits. This behaviour makes QuickStatements a very visible source of the large-item edits issue described above.

Human engagement limits

"Human engagement limits" cover everything to do with the ability and attention humans have to engage with Wikidata. In general, Wikidata is more successful when people of diverse talents and abilities enjoy putting more attention and time into their engagement with Wikidata.

Limits in this space include the number of contributors Wikidata has, how much time each one gives, and the capacity of Wikidata to invite more human participants to spend more time on the platform.

Wikidata users

Human users

Human Wikidata readers
Human Wikidata contributors
  • The data format is machine-friendly but not human-friendly, which makes it hard for new editors to understand. Yet human contributors are necessary to ensure that Wikidata brings in data that may not already be on the internet.
  • It is difficult for college classes and instructors to know how to organize mass contributions from their students; see for example Wikidata_talk:WikiProject_Chemistry#Edits_from_University_of_Cambridge.
  • Effective description of each type of entity requires guidance for the users who are entering a new item: What properties need to be used for each instance of tropical cyclone (Q8092)? How do we inform each user entering a new book item that they ought to create both a version, edition, or translation (Q3331189) and a written work (Q47461344) entity for that book (per Wikidata:WikiProject_Books)? In other words, how do we make the interface self-documenting for unfamiliar users? And where we have failed to do so, how do we clean up well-intentioned but non-standard edits by hundreds or thousands of editors operating without a common framework?
Human curation
  • Human curation of massive automated inputs of data: is a tool needed to ensure that data taken from large databases are reliable? Can we harness the power of human curators, who may identify different errors than machine-based checks do?

Tools

Tools for reading Wikidata
Tools for contributing to Wikidata
Tools for curating Wikidata
  • "Wikidata vandalism dashboard". Wikimedia Toolforge.
  • "Author Disambiguator". Wikimedia Toolforge.

Bots

Bots that read Wikidata
Bots that contribute to Wikidata

Users of Wikidata client wikis

Users of Wikidata data dumps

Users of dynamic data from Wikidata

API

SPARQL

Linked Data Fragments

Other

Users of Wikibase repositories other than Wikidata

Content limits

"Content limits" describe how much data Wikidata can meaningfully hold. Of special concern is limits on growth. Wikidata hosts a certain amount of content now, but limits on adding additional content impede the future development of the project.

A March 2019 report considered the rate of growth for Wikidata's content — wikitech:WMDE/Wikidata/Growth.

Generic

How many triples can we manage?

How many languages should be supported?

How to link to individual statements?

Items

Timeline of Wikidata item creation

How many items should there be?

The Gaia project has so far released data on over 1.6 billion stars in our galaxy; it would be nice if Wikidata could handle that. OpenStreetMap has about 540 million "ways". The number of scientific papers and their authors is on the order of 100-200 million. The total number of books ever published is probably over 130 million. OpenCorporates lists over 170 million companies. en:CAS Registry Numbers have been assigned to over 200 million substances or sequences. There are over 100 large art museums in the world, each with hundreds of thousands of items in their collections, so there are likely at least tens of millions of artworks or other artifacts that could be listed. According to en:Global biodiversity there may be as few as a few million or as many as a trillion species on Earth; at the low end we are already close, but if the real number is at the high end, could Wikidata handle it? Genealogical databases provide information on billions of deceased persons who have left some record of themselves; could we allow them all here?

From all these different sources, it seems likely there would be demand for at least 1 billion items within the next decade or so, perhaps many times more than that.

How many statements should an item have?

  • The top-listed items on Special:LongPages have over 5000 statements. This slows down operations like editing and display; the sketch below shows one way to count an item's statements.
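
As a rough way to check where a particular item stands relative to such numbers, the statements returned by wbgetentities can simply be counted per property and summed. This sketch (Python with requests) uses Q2 (Earth) purely as an example item.

  import requests

  API = "https://www.wikidata.org/w/api.php"

  def statement_count(qid):
      """Count the statements on an item by summing the claims listed per property."""
      r = requests.get(API, params={
          "action": "wbgetentities",
          "ids": qid,
          "props": "claims",
          "format": "json",
      }, timeout=30)
      claims = r.json()["entities"][qid]["claims"]
      return sum(len(statements) for statements in claims.values())

  print(statement_count("Q2"))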

Properties

How many properties should there be?

How many statements should a property have?

Lexemes

Overview of lexicographical data as of May 2019. Does not discuss limits other than those of QuickStatements.

How many lexemes should there be?

English Wiktionary has about 6 million entries (see wikt:Wiktionary:Statistics); according to en:Wiktionary there are about 26 million entries across all language editions. These numbers give a rough idea of potential scale; however, they cannot be translated directly to expected lexeme counts due to the structural differences between Wikidata lexemes and Wiktionary entries. Lexemes have a single language, lexical category and (general) etymology, while Wiktionary entries depend only on spelling and include all languages, lexical categories and etymologies in a single page. On the other hand, each lexeme includes a variety of spellings, due both to the various forms associated with a single lexeme and to spelling variations across language varieties. Very roughly, then, one might expect the eventual number of lexemes in Wikidata to be on the order of 10 million, while the number of forms might be 10 times as large. The vast majority of lexemes will likely have only one sense, though common lexemes may have 10 or more senses, so the expected number of senses would be somewhere in between the number of lexemes and the number of forms, probably closer to the number of lexemes.

How many statements should a lexeme have?

So far there are only a handful of properties relevant to lexemes, each likely to have only one or a very small number of values for a given lexeme. So something on the order of 1 to 10 statements per lexeme/form/sense seems to be expected. However, if we add more identifiers for dictionaries and link them, we may have a much larger number of external ID links per lexeme in the long run, perhaps on the order of the number of dictionaries that have been published in each language?

References

How many references should there be?

How many references should a statement have?

Where should references be stored?

Subpages

Participants

The participants listed below can be notified using the following template in discussions:
{{Ping project|Limits of Wikidata}}