Wikidata:Recoin

Recoin ("Relative Completeness Indicator") is a script that extends Wikidata entity pages with an indication of how complete an item's data is. Relative completeness refers to the extent of information found on an item in comparison with other, similar items.

Recoin adds a status indicator (top right) and two expandable lists of important absent properties and IDs (center) to Wikidata entity pages. Shown here for Abbey Road, whose data is very detailed.

The indicator aggregates the extent of information into a colored progress bar, showing 5 possible color-coded levels of completeness that range from very detailed information to very basic information.

Recoin is intended both to help authors decide where to focus their attention and to make data consumers aware of how much information a specific item holds.

Max Planck Institute for Informatics: Detailed information
Arno Kompatscher: Basic information


Motivation

Recoin is intended to assist both authors and consumers of Wikidata.

For users (consumers), it provides a handy summary of how complete the information in Wikidata is, which may help them decide whether to rely on Wikidata to satisfy their information need. Judging purely by the length of an item page is not always a good idea. For instance, the chess player Jeff Sarwer (Q3494327) has a long page because of many statements about his Elo rating, but until recently it lacked even very basic information such as citizenship or family name.

For authors, it similarly shows which persons have more complete information than others, allowing them to focus their attention on the less complete ones. For an individual person, it shows the most important missing properties, which authors can then fill in or, if no value exists for a property, mark with a novalue assertion.

What it shows

Recoin can add two kinds of information to Wikidata pages:

  • A 5-level status indicator icon, ranging from very detailed to very basic, summarizing the extent of information compared with other, similar entities;
  • Two expandable lists of the most relevant absent properties and external IDs, added to the top of entity pages.

How it works

Architecture

 
Architecture of Recoin as of December 2017

As shown in the architecture figure, the two JavaScript modules recoin-core.js and recoin-explanations.js send requests to the script getmissingattributes.php, hosted on Toolforge. This PHP script performs the computation: it first queries the Wikidata SPARQL endpoint to get the occupations of the given entity, and then queries databases on ToolsDB to retrieve the attribute frequencies for those occupations (humans) or for the entity's class (all non-humans). The results (completeness level and missing properties) are returned as JSON and used by the JavaScript modules to render the page.

Computation

So far, the script covers all classes contained in the table wikidatawiki_p.wbs_propertypairs [1]. In addition, it gives more refined results for humans by treating the 1000 most frequent professions like classes.

Determination of absent properties and IDs

We first describe the case of an entity belonging to a single class/profession, and discuss membership in multiple classes below.

Given an entity that belongs to a certain class, we compute the properties that occur most frequently in that class and check which of them are absent from the entity. The top 10 missing properties are shown by the core script (a second script also shows external IDs). For classes contained in wikidatawiki_p.wbs_propertypairs, we use all properties available there. For professions of humans, we use the 100 most frequent properties per profession.

For instance, Jimmy Wales (Q181) is missing, among other things, the properties languages spoken, written or signed (P1412), member of political party (P102) and position held (P39), which are specified for 13.435%, 9.347% and 8.376% of people with the same occupation, respectively.
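The single-class selection described above can be sketched as follows. This is a minimal illustration, not Recoin's actual code: the function name topMissingProperties and the Map/Set data shapes are assumptions, and the frequencies for P31 and P21 are invented for the example (the other three values are the ones quoted for Jimmy Wales).

```javascript
// Hypothetical helper, not part of Recoin itself.
// classFrequencies: Map of property ID -> share of class members that have the property (0..1)
// entityProperties: Set of property IDs already present on the entity
function topMissingProperties(classFrequencies, entityProperties, n = 10) {
  return [...classFrequencies.entries()]
    .filter(([property]) => !entityProperties.has(property)) // keep absent properties only
    .sort((a, b) => b[1] - a[1])                             // most frequent first
    .slice(0, n)                                             // top n
    .map(([property, frequency]) => ({ property, frequency }));
}

// P31 and P21 frequencies are made up; the rest come from the Jimmy Wales example.
const frequencies = new Map([
  ["P31", 0.99],
  ["P21", 0.95],
  ["P102", 0.09347],
  ["P1412", 0.13435],
  ["P39", 0.08376],
]);
const present = new Set(["P31", "P21"]);

console.log(topMissingProperties(frequencies, present, 3));
```

Sorting by class frequency means the list leads with the properties that similar entities most often have, which is what makes the suggestions relative to the class.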

Status indicator computation

To determine the relative completeness on the 5-level scale, we compute the average frequency of the top 5 missing properties (if fewer than 5 properties are missing, the remaining slots are counted with frequency zero). We then set the level as follows:

  • Level 5 (most complete): 0%-5% average frequency of the top 5 missing properties
  • Level 4 (quite complete): 5%-10% average frequency of the top 5 missing properties
  • Level 3 (medium complete): 10%-25% average frequency of the top 5 missing properties
  • Level 2 (low completeness): 25%-50% average frequency of the top 5 missing properties
  • Level 1 (least complete): 50%+ average frequency of the top 5 missing properties

For example, Arno Kompatscher (Q15074414) is missing

  • P39 (position held) - 54.33%
  • P1412 (languages spoken, written or signed) - 49.93%
  • P102 (member of political party) - 46.62%
  • P1559 (name in native language) - 31.14%
  • P937 (work location) - 30.67%

Thus, the average frequency of the top 5 missing properties is about 42.54% (212.69% / 5), which falls in the 25%-50% band, so his level of completeness is 2 (low).
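The level computation can be sketched as below. This is an illustrative sketch under stated assumptions: the function names are hypothetical, and because the text does not say which level a value exactly on a boundary (for example 5%) belongs to, the code assumes that each band includes its lower bound and excludes its upper bound.

```javascript
// Hypothetical helpers, not Recoin's actual code. Frequencies are percentages (0..100).
function averageTopMissing(missingFrequencies, k = 5) {
  const top = [...missingFrequencies].sort((a, b) => b - a).slice(0, k);
  const sum = top.reduce((acc, frequency) => acc + frequency, 0);
  // Dividing by k (not by top.length) counts absent slots as frequency zero.
  return sum / k;
}

function completenessLevel(averagePercent) {
  // Assumption: lower bound inclusive, upper bound exclusive.
  if (averagePercent < 5) return 5;  // most complete
  if (averagePercent < 10) return 4; // quite complete
  if (averagePercent < 25) return 3; // medium complete
  if (averagePercent < 50) return 2; // low completeness
  return 1;                          // least complete
}

// Frequencies of the top 5 missing properties from the Arno Kompatscher example.
const kompatscher = [54.33, 49.93, 46.62, 31.14, 30.67];
const kompatscherAverage = averageTopMissing(kompatscher);
console.log(kompatscherAverage, completenessLevel(kompatscherAverage)); // level 2
```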

Treatment of multi-class-membership

For entities belonging to multiple classes (see e.g. Dresden (Q1731)) or persons with multiple occupations (e.g. Arno Kompatscher (Q15074414)), Recoin bases the computation on property frequencies weighted by the size of each class/profession.

For instance, Arno Kompatscher (Q15074414) is both a politician and a jurist. There are 297,370 politicians and 12,635 jurists in Wikidata. If 40% of politicians have the property position held (P39) set, while only 20% of jurists do, the final computed frequency is the weighted average (297,370 × 40% + 12,635 × 20%) / 310,005, which is about 39%.[2]
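As a rough sketch of this weighting, assuming each class is described by its size and by the share of its members that have the property (the function name and object shape are hypothetical, not taken from Recoin's code):

```javascript
// Hypothetical helper: size-weighted average of a property's frequency across classes.
// classes: array of { size: number of entities in the class, frequency: share in 0..1 }
function weightedFrequency(classes) {
  const totalSize = classes.reduce((acc, c) => acc + c.size, 0);
  if (totalSize === 0) return 0; // no data for any class
  const weightedSum = classes.reduce((acc, c) => acc + c.size * c.frequency, 0);
  return weightedSum / totalSize;
}

// Numbers from the politician/jurist example for position held (P39).
const positionHeld = weightedFrequency([
  { name: "politician", size: 297370, frequency: 0.4 },
  { name: "jurist", size: 12635, frequency: 0.2 },
]);
console.log((100 * positionHeld).toFixed(1) + "%"); // roughly 39%
```

Because politicians outnumber jurists by more than 20 to 1, the result stays close to the politicians' 40%.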

Special cases

  • For humans, the properties place of death (P20) and date of death (P570) are always filtered out, as they are frequent overall yet do not apply to living humans;
  • In the case of an entity belonging to a single class that does not have data in wikidatawiki_p.wbs_propertypairs, nothing is shown;
  • In the case of an entity belonging to multiple classes or professions, with one having no data, the frequency of properties in that class is assumed to be zero;
  • Properties having a frequency of less than 0.01% in a class are assumed to have frequency zero;
  • For entities that have a profession that is not among the 1000 most frequent ones, missing properties are computed based on humans in general.
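Two of these rules (the death-property filter for humans and the 0.01% threshold) lend themselves to a short sketch. The code below is an assumption about how such filtering could look, not Recoin's implementation; the function and constant names are hypothetical, and the property ID P9999 and all frequency values in the usage example other than P39 are invented.

```javascript
// Hypothetical constants and helper, for illustration only.
const EXCLUDED_FOR_HUMANS = new Set(["P20", "P570"]); // place of death, date of death
const MIN_FREQUENCY_PERCENT = 0.01;

// frequenciesPercent: Map of property ID -> frequency in percent (0..100)
function applySpecialCases(frequenciesPercent, isHuman) {
  const result = new Map();
  for (const [property, frequency] of frequenciesPercent) {
    if (isHuman && EXCLUDED_FOR_HUMANS.has(property)) continue; // never suggested for humans
    // Frequencies below the threshold are treated as zero.
    result.set(property, frequency < MIN_FREQUENCY_PERCENT ? 0 : frequency);
  }
  return result;
}

const rawFrequencies = new Map([
  ["P20", 60],      // invented value
  ["P570", 58],     // invented value
  ["P39", 54.33],
  ["P9999", 0.005], // invented property ID and value
]);
console.log(applySpecialCases(rawFrequencies, true));
```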

Multilinguality

By default, Recoin shows property labels in the language defined in the user settings or, where no label is available, in English. The same holds for the strings of the tool itself (the caption at the top of the page and the altLabels of the status indicator icon). Translations can be added here.

Installation

Main Gadget

Recoin can be enabled at Special:Preferences under the section "Gadgets/Wikidata-centric".

Special version: IDs only

A special version only showing ID properties can be enabled by adding the following line to Special:MyPage/common.js:

 importScript('User:Vvekbv/recoin_id.js');

If you maintain a global script file, use the following code in m:Special:MyPage/global.js instead:

 mw.loader.load('//www.wikidata.org/w/index.php?title=User:Vvekbv/recoin_id.js&action=raw&ctype=text/javascript');

APIs

Per-entity access

Recoin can also be accessed via an API available at

 https://tools.wmflabs.org/recoin/getmissingattributes.php?lang=en&subject=Q15074414&n=10

and

 https://tools.wmflabs.org/recoin/getmissingattributes_id.php?lang=en&subject=Q15074414&n=10

(substitute the desired entity Q-code for subject, the language for lang (default: English) and the number of properties to return for n (default: 10)).
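A minimal sketch of calling these endpoints from JavaScript follows. The base URL, script names and the parameters lang, subject and n are taken from the URLs above; the function names and the options object are hypothetical. The structure of the returned JSON is not documented here, so the sketch returns it unchanged rather than assuming field names. It assumes an environment with a global fetch (a browser or a recent Node.js).

```javascript
// Hypothetical helpers built around the documented per-entity URLs.
const RECOIN_BASE = "https://tools.wmflabs.org/recoin/";

function missingAttributesUrl(subject, { lang = "en", n = 10, idsOnly = false } = {}) {
  // idsOnly selects the external-ID variant of the endpoint.
  const script = idsOnly ? "getmissingattributes_id.php" : "getmissingattributes.php";
  const url = new URL(script, RECOIN_BASE);
  url.search = new URLSearchParams({ lang, subject, n: String(n) }).toString();
  return url.toString();
}

async function fetchMissingAttributes(subject, options) {
  const response = await fetch(missingAttributesUrl(subject, options));
  if (!response.ok) {
    throw new Error(`Recoin request failed: HTTP ${response.status}`);
  }
  return response.json(); // returned as-is; the JSON layout is not specified in this page
}

console.log(missingAttributesUrl("Q15074414"));
console.log(missingAttributesUrl("Q15074414", { idsOnly: true }));
```

A call such as fetchMissingAttributes("Q15074414").then(console.log) would then print whatever the service returns for that entity.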

Per-class access

To obtain a list of most frequent properties for a specific class, the following API can be used

 https://tools.wmflabs.org/recoin/getbyclassid.php?subject=Q185351&n=200

(substitute the desired class Q-code for subject; n is the number of results returned (default: 200)).
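The per-class URL can be assembled in the same spirit. In this sketch the function name is hypothetical, and the Q-code check is an added convenience rather than something the service is known to require.

```javascript
// Hypothetical helper that builds the documented per-class URL.
function classPropertiesUrl(classId, n = 200) {
  // Added safeguard (an assumption): accept only identifiers of the form Q<digits>.
  if (!/^Q\d+$/.test(classId)) {
    throw new Error(`Not a Q-code: ${classId}`);
  }
  const url = new URL("https://tools.wmflabs.org/recoin/getbyclassid.php");
  url.searchParams.set("subject", classId);
  url.searchParams.set("n", String(n));
  return url.toString();
}

console.log(classPropertiesUrl("Q185351"));
```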

Data Dumps

An August 22, 2019 dump of property frequencies for classes and occupations is available here.

Besides the API above, fresh data on property frequencies for classes can be obtained via Quarry (example: most frequent properties for films: query).

Further information

Contact:

  • Vevake Balaraman - vevake.balaraman@gmail.com
  • Simon Razniewski - srazniew@mpi-inf.mpg.de
  • Werner Nutt - nutt@inf.unibz.it

Further reading:

  • Scientific paper "Recoin: Relative Completeness in Wikidata" by Vevake Balaraman, Simon Razniewski, Werner Nutt, Wiki Workshop at The Web Conference 2018 (link)
  • Talk at WikidataCon 2017 "How to know what Wikidata knows"
  • Scientific paper "Assessing the Completeness of Entities in Knowledge Bases" by Albin Ahmeti, Simon Razniewski, Axel Polleres, ESWC P&D 2017 (link)

Related projects:

  • Wikipedia article quality assessment using ORES
  • Wikidata property suggester, a tool that uses aggregated association rules for the suggestion of properties to add
  • COOL-WD, a tool for asserting the completeness of individual properties directly inside Wikidata.

Acknowledgment: This work is partially supported by the project TaDaQua, funded by the Free University of Bozen-Bolzano.

  1. 42078 as of November 15, 2017; query
  2. This is not the most precise approach, since entities that are both politicians and jurists are counted twice and therefore carry twice the weight of other entities. However, precomputing all combinations of professions/classes is infeasible both on the fly and a priori, and this weighting is a reasonable approximation.