Wikidata:WikidataCon 2017/Notes/Data completeness: How to know what Wikidata knows?

Title: Data completeness: How to know what Wikidata knows?

Note-taker(s): Sannita, Bene*

Speaker(s)

Name or username: Ls1g
Contact (email, Twitter, etc.): razniewski@inf.unibz.it

Abstract

Wikidata is a great project towards mapping structured information about the world, and exhibits a high degree of correctness. Its degree of completeness, in turn, is much less understood. Anecdotal evidence suggests that it covers many popular topics quite well, but there are few standard means that help in this assessment: At present, editors and consumers have to analyze largely on a case-by-case basis whether given information might be complete or not. This session concerns the automated assessment of the completeness of Wikidata, and consists of two parts:

In the first 35 minutes I will survey techniques to assess the completeness of parts of Wikidata. I will talk about three aspects of completeness, values, properties and entities:

For values, I will discuss no-value statements and predicates that talk about object counts, and the COOL-WD tool for asserting metadata.
For properties, I will review mandatory properties (like P1963), the completeness status indicator icon via Recoin, and tabular views, as discussed here and exemplified here.
For entities, I will look at what is currently possible with the Class Browser, SQID, and what faceted browsing should hopefully make possible in the future.

The second part of the session (15 minutes) shall be an open discussion, guided by the questions

What kind of (anecdotal) knowledge about completeness of parts of Wikidata do participants have?
What kind of structured knowledge about completeness would participants like to obtain?
What tools could help towards this?

Collaborative notes of the session

What does Wikidata really know? Some areas are "complete" (Physics Nobel Prizes, children of Obama...), others are not (how many stops the Berlin S1 has, ...).

Knowledge base engineers have so far only tried to enrich the KBs, but now the point is to understand what they are trying to approximate, or "the unknown unknowns" (cit. Donald Rumsfeld).

Wikidata is quite complete on some topics, but is missing data in other domains
We are constantly adding more data to Wikidata

Where is Wikidata going? Nobody really knows. It is easier to say for specific topics.

on specific topics, completeness is well defined, but in general (with respect to everything) it is hard to tell

Bookkeeping: 1) values; 2) properties; 3) entities

value completeness: James A. Garfield (Q34597) has 4 children listed; are they all? (No, he had 7, but only 4 have items)
property completeness: does Arno Kompatscher (Q15074414) have all the properties it should have? (No, it is missing birth/death date, political party, position held, education-related properties, and so on)
entity completeness: run a query on Ministry of Foreign Affairs (Q6867013) and only one result is returned; is that all? (We don't know)
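
The value completeness case can be sketched in code. The following is only an illustration, not a tool presented in the session: it assumes the statements of an item are available in the usual Wikibase JSON shape (mainsnak with snaktype and datavalue), and compares the values listed for a property (here P40, child) with a "number of ..." statement (here P1971, number of children). The function names and the sample data are hypothetical; the sample mirrors the Garfield example (4 children listed, 7 stated) and uses placeholder ids for the children.

```python
from typing import Optional


def listed_values(claims: dict, prop: str) -> list:
    """Statements for `prop` whose main snak carries an actual value."""
    return [s for s in claims.get(prop, [])
            if s["mainsnak"]["snaktype"] == "value"]


def has_no_value(claims: dict, prop: str) -> bool:
    """True if `prop` carries an explicit no-value statement."""
    return any(s["mainsnak"]["snaktype"] == "novalue"
               for s in claims.get(prop, []))


def stated_count(claims: dict, count_prop: str) -> Optional[int]:
    """Number given by a "number of ..." statement, if there is one."""
    for s in claims.get(count_prop, []):
        snak = s["mainsnak"]
        if snak["snaktype"] == "value":
            # Quantity amounts are strings such as "+7".
            return int(snak["datavalue"]["value"]["amount"])
    return None


def value_completeness(claims: dict, prop: str, count_prop: str) -> str:
    """Classify `prop` as complete, incomplete, contradiction or unknown."""
    listed = len(listed_values(claims, prop))
    if has_no_value(claims, prop):
        # no-value together with listed values contradicts itself
        return "contradiction" if listed else "complete"
    expected = stated_count(claims, count_prop)
    if expected is None:
        return "unknown"
    if listed == expected:
        return "complete"
    # more listed values than the stated count is also a contradiction
    return "incomplete" if listed < expected else "contradiction"


def item_statement(item_id: str) -> dict:
    """Hypothetical helper: a statement pointing to another item."""
    return {"mainsnak": {"snaktype": "value",
                         "datavalue": {"type": "wikibase-entityid",
                                       "value": {"id": item_id}}}}


def quantity_statement(n: int) -> dict:
    """Hypothetical helper: a statement holding a plain number."""
    return {"mainsnak": {"snaktype": "value",
                         "datavalue": {"type": "quantity",
                                       "value": {"amount": f"+{n}", "unit": "1"}}}}


# Hand-written sample, not fetched from Wikidata; child ids are placeholders.
garfield_claims = {
    "P40": [item_statement(f"child-{i}") for i in range(1, 5)],
    "P1971": [quantity_statement(7)],
}

print(value_completeness(garfield_claims, "P40", "P1971"))  # incomplete
```

The "contradiction" outcome corresponds to the challenge noted for this approach: a count statement and the listed values are maintained separately and can disagree.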

"Solutions" and tools

value completeness: no-values + "number of ..." statements
challenges: possibility of contradiction + complex querying
should completeness only be considered for "important" properties (and how would "important" be defined?)
COOL-WD: http://cool-wd.inf.unibz.it/ (mark the statements for one property as complete)
property completeness: pre-defined schemata (P1963, manually defined tables)
Recoin: https://www.wikidata.org/wiki/User:Ls1g/Recoin (completeness indicator for a whole item, depends on statements on similar items)
ProWD (mockup)
entity completeness: ???
Is a specific query complete?
Class browser for a specific class
SQID browser
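
The idea behind a completeness indicator that "depends on statements on similar items" can be sketched as follows. This is a simplified illustration and not Recoin's actual implementation: it assumes each item is reduced to the set of property ids it uses, counts how often each property occurs among similar items, and reports the frequent ones that the item under inspection lacks. The function name, the threshold, and the sample sets are all hypothetical.

```python
from collections import Counter
from typing import List, Set, Tuple


def missing_properties(item_props: Set[str],
                       similar_items: List[Set[str]],
                       min_share: float = 0.5,
                       top_n: int = 5) -> List[Tuple[str, float]]:
    """Properties common among similar items but absent from the item.

    Returns (property id, share of similar items using it), most
    frequent first; ties are broken by property id.
    """
    total = len(similar_items)
    if total == 0:
        return []
    counts = Counter(p for props in similar_items for p in props)
    candidates = [(p, c / total) for p, c in counts.items()
                  if p not in item_props and c / total >= min_share]
    candidates.sort(key=lambda pair: (-pair[1], pair[0]))
    return candidates[:top_n]


# Hand-written sample: property sets of four hypothetical politician items.
similar = [
    {"P31", "P21", "P569", "P102", "P39"},
    {"P31", "P21", "P569", "P102"},
    {"P31", "P569", "P39", "P69"},
    {"P31", "P21", "P569", "P102", "P39"},
]
# An item with only "instance of" (P31) and "sex or gender" (P21).
sparse_item = {"P31", "P21"}

for prop, share in missing_properties(sparse_item, similar):
    print(prop, share)
```

With this sample the suggestions are date of birth (P569), member of political party (P102) and position held (P39), which matches the kind of gaps noted for the Arno Kompatscher example; P69 (educated at) falls below the assumed 50% threshold.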

Summary

Without a goal, it's difficult to understand how Wikidata is faring
bookkeeping is possible on a small scale
the tools and solutions explained above make it possible to approximate what is missing and what is not

Related tickets on Phabricator: T150938, T150116

Questions / Answers

Data can only be complete as of a certain date (e.g. somebody has 2 children, then becomes a parent to a third child)
Granularity of data that changes -> open question
What about time-dependent data that should not have big gaps? --> e.g. presidents of a country
Choose which features should be kept complete (e.g. Wikipedia defines that all articles about humans should have information about gender)
...
Link to session at 9:30 --> https://www.wikidata.org/wiki/Wikidata:WikidataCon_2017/Submissions/Well_structured_political_data_for_the_whole_world:_impossible_utopia,_or_Wikidata_at_its_best%3F
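
The question above about time-dependent data without big gaps (e.g. presidents of a country) can be illustrated with a small sketch. It is not something shown in the session: it assumes each office holder is given as a (name, start date, end date) tuple, with None as the end date for the current holder, and reports periods not covered by any term. The names and dates in the sample are invented.

```python
from datetime import date, timedelta
from typing import List, Optional, Tuple

Term = Tuple[str, date, Optional[date]]


def find_gaps(terms: List[Term],
              max_gap: timedelta = timedelta(days=1)) -> List[Tuple[date, date]]:
    """Return (gap start, gap end) periods longer than `max_gap`."""
    gaps = []
    covered_until = None
    for _holder, start, end in sorted(terms, key=lambda t: t[1]):
        if covered_until is not None and start - covered_until > max_gap:
            gaps.append((covered_until, start))
        # An open-ended term covers everything from its start onwards.
        term_end = end if end is not None else date.max
        if covered_until is None or term_end > covered_until:
            covered_until = term_end
    return gaps


# Invented sample: holder "B" is followed by "C" only four years later.
terms = [
    ("A", date(2000, 1, 1), date(2004, 1, 1)),
    ("C", date(2012, 1, 1), None),
    ("B", date(2004, 1, 1), date(2008, 1, 1)),
]

print(find_gaps(terms))
```

A reported gap does not say whether the office was really vacant or whether an item or statement is missing; it only points to where someone should look.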