Wikidata:WikidataCon 2017/Notes/Wikidata quality: a data consumers' perspective

Title: Wikidata quality: A data consumers' perspective

Note-taker(s): Sannita and Thiemo

Speaker(s)

Name or username: Alessandro Piscopo

Contact (email, Twitter, etc.): A.Piscopo soton.ac.uk

Abstract

Data quality is an important topic for Wikidata, as the number of initiatives and projects around this topic testifies. To name just a few, the Item quality campaign relied on the work of the community to evaluate Items using a single-grading label scheme. The CoolWD project focuses on a particular dimension of quality, providing users with information about the completeness of the results of a query and allowing them to add this information to Wikidata. Furthermore, I have previously sought in a Request for Comment (Data quality framework for Wikidata) to gather opinions from the Wikidata community in order to create an appropriate data quality framework for this platform, which would be rooted in prior scientific literature and distinguish several quality dimensions.

All these projects focus either on measuring data quality from various viewpoints or on generating a conceptualisation of data quality in Wikidata. They are essential to our understanding of Wikidata, as they explore different aspects of its quality. Nevertheless, data quality is most commonly defined as "fitness for purpose". As such, it is seen from the point of view of data consumers. What may be an acceptable degree of completeness or accuracy for, e.g., providing tourist information is not enough when it comes to using the data to provide medical advice. Therefore, for a comprehensive understanding of what data quality means in Wikidata, we need a clear overview of how it is used as a resource.
Specifically, the aims of this session will be to:

  • identify typologies of data consumers for Wikidata;
  • gain an overview about the needs of each data consumer type and of the quality issues they experience.


This session is open to everyone interested in Wikidata. However, it would be ideal to have a mixed audience, with members of the Wikidata community and professionals using this project as a data resource, in order to facilitate the exchange of different points of view. The presence of both practitioners using Wikidata as individuals and members of organisations would be highly beneficial.
The session will be structured in three parts:

  1. Short introduction by the author of the submission about data quality-related projects concerning Wikidata (10-15 min.);
  2. Open discussion, where the attendees will be invited to report their experiences and express their ideas about the topic (35 min.);
  3. Summing up of the discussion and final remarks (10-15 min.).

Collaborative notes of the session

This session is intended to be a workshop where participants provide feedback and discuss different aspects of data quality in Wikidata from the point of view of data consumers, i.e. with the needs of their task at hand in mind when they use Wikidata as a knowledge resource.

In the first part of the session, I will give a short presentation about data quality in Wikidata, to summarise the current approaches to the topic.

So, why quality? Because it is important to know how successful Wikidata is, but it is also important to know where there is room for improvement.

Data quality is *fitness for purpose* (see below). What are the main tools now for this? Examples:

  • ORES
  • Property constraint warnings
  • CoolWD judges the "completeness" of Wikidata entities
  • Research on taxonomic hierarchies

Problem: P31 and P279 (instance of and subclass of) get mixed up too many *damn* times.
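Why this mix-up hurts consumers: most queries traverse the class tree, so a query gives different results depending on whether items were typed with P31 or P279. A minimal sketch in Python building the two SPARQL patterns (the query shape is illustrative; Q11424 is the real Wikidata item for "film"):

```python
# Build two SPARQL queries over the Wikidata class tree. When P31
# (instance of) and P279 (subclass of) are mixed up, these two return
# different, inconsistent result sets. Illustrative sketch only.

def film_query(follow_subclasses: bool) -> str:
    """Query for items typed as film (Q11424)."""
    path = "wdt:P31/wdt:P279*" if follow_subclasses else "wdt:P31"
    return f"SELECT ?item WHERE {{ ?item {path} wd:Q11424 . }}"

direct = film_query(False)       # direct instances of "film" only
transitive = film_query(True)    # instances of "film" or any subclass
```

The transitive pattern is the usual workaround, but it silently breaks when editors use P31 where P279 was meant, or vice versa.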

Analysis of external references: around 60% are relevant and authoritative; there is room for a (semi-)automated approach here.

So given all these dimensions, what is quality? Items must be accurate, complete, etc., but data quality is a complex concept still to be defined.

From a data consumer perspective: which dimensions are the most relevant? What is the level of their quality at the moment?

Afterwards, participants will be asked to form groups. This is important to better structure the contributions of the attendees in the limited time available. Ideally, groups should be formed by people with similar data use requirements. Each group should include 4 to 6 people.

Please add a short description of the data needs of each group in the following (~5 words).

Group 1 - Developers who use Wikidata data to enhance other data

Group 2 - People who use Wikidata data for other purposes every day

Group 3 - Researchers who use Wikidata for their research

Group 4 - People who work with GLAMs (aka Wikidatian Anonymous)

The central part of the session is structured as a discussion about different data quality dimensions, in order to gather from groups the importance they assign to each dimension, the requirements they have, whether Wikidata fulfils these requirements, and which issues they most commonly find.

Please briefly add your observations regarding each dimension we select for discussion.

1) Why is this dimension important for you(r group)? 2) What do you expect from WD to deal with this dimension? 3) Examples

Dimension 1: Accuracy

Group 1

Why is accuracy important for us?

What could Wikidata do to satisfy your needs?

Examples of issues.

"We are doing timelines", and accuracy on dates means it must be precise to at least the day.

Scientific thinking of "accuracy" is …?

We have "political data" users in the group.

"We just want it to be right."

Accuracy ties into completeness.

"I prefer a subset of data that's 100% accurate."

Can we tell *which* 5% are not accurate?

There are many types of being inaccurate. Some are obvious and easy to ignore, some have a bad impact.

Exposing the data more should help increase accuracy, because errors become much more visible.

Can we have the accuracy + error boundaries with the data?

Quantities already have error boundaries, but this feature is barely used. Use it!
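As a sketch of what "use it" could mean for a consumer: quantity datavalues in Wikidata's JSON serialisation can already carry `lowerBound` and `upperBound` fields alongside `amount` (those field names are the real ones; the helper and the sample numbers below are invented for illustration):

```python
# Read the error boundaries that Wikidata quantity values already carry.
# The dict mimics a quantity datavalue from the entity JSON; the sample
# figures are made up.

def format_quantity(value: dict) -> str:
    """Render a quantity as 'amount ± uncertainty' when bounds are present."""
    amount = float(value["amount"])
    if "upperBound" in value and "lowerBound" in value:
        plus = float(value["upperBound"]) - amount
        return f"{amount} \u00b1 {plus}"
    return str(amount)

sample = {"amount": "+8848", "unit": "1",
          "lowerBound": "+8847", "upperBound": "+8849"}
print(format_quantity(sample))  # prints: 8848.0 ± 1.0
```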

Tools that identify outliers and help with fixing/improving.

In general, more help with identifying the quality of the data.

Confidence scores, maybe?

Group 3

Accuracy

We need to trust the data; if you don't trust Wikidata's data, then Wikidata has a problem.

It could cause legal problems.

Completeness

We have different applications that have different needs

We need to be able to identify the data.

I would like to have volume number, etc. so I can use it

Would like to see the object/property and a reference so I can check it

Consistency

You would like to have the same level of detail.

Example: all horror movies should have the same level of detail.

For me, accuracy is important, not completeness.

Group 4

If an item is not accurate, people cannot build things on it.

Accuracy is not just a binary value (right/wrong), but also a matter of how precise a value can be (e.g. a date or a coordinate).

Too much precision is not accuracy but false precision, e.g. many decimal places for the geographic coordinates of a city.

Source databases often have wide margins of error, e.g. historical dates that are only narrowed down to a range, or to "Iron Age". Wikidata needs to be able to express that uncertainty and not imply more precision than the source gives.

accuracy should be as accurate as it makes sense for the context -- but this is difficult to define

We can only be as accurate as the source is: we don't want to lose the accuracy of the source, but we cannot be more accurate than the source itself.

We should find a way to define when an item, or even a single statement, is accurate, so as to at least discourage the import of "junk data" from Wikipedia (i.e. somebody already did an accurate job, then a bot replicates the data badly by extracting it from a Wikipedia infobox).

Dimension 2: Consistency

Group 1

The top example is the class tree. It's very fuzzy.

It makes it impossible to write one simple query.

Suggested: another layer on top of the raw data that unifies or somehow cleans up the tree.

Data from different language wikis, even if internally consistent, is consistent in very different ways. One query does not fit all. Example: German movies are assigned to their genre in a different way than Spanish movies.

Group 2
Group 4

Consistency as ontology, as values, as linking, as naming...

If data is not consistent, you can't find the value; it is basically as if the data did not exist.

Ideally, query results would not depend on how you phrase the query, e.g. whether in terms of a property or its inverse.

There have been cases where querying for positions held by people (presidents) gives different results from holders of that office.
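The presidents case above can be sketched concretely: P39 (position held) points from the person to the office, while P1308 (officeholder) points from the office to the person, and the two directions are edited independently. A defensive query unions both; the property and item IDs are real (Q11696 is "President of the United States"), but the query itself is an illustrative sketch:

```python
# Defensive SPARQL pattern for a property and its inverse. Because
# P39 (position held) and P1308 (officeholder) are maintained
# independently, querying only one direction can miss people.

QUERY = """
SELECT DISTINCT ?person WHERE {
  { ?person wdt:P39 wd:Q11696 . }     # person -> position held -> office
  UNION
  { wd:Q11696 wdt:P1308 ?person . }   # office -> officeholder -> person
}
"""
```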

It is important to find out whether some data is missing.

Dimension 3: Completeness

Group 1

Why is completeness important for us?

What could Wikidata do to satisfy your needs?

Examples of issues.

The group understands incompleteness as "items missing" as well as "properties on items missing".

Very important when dealing with political data.

Best practice: start with a critical subset that is easy to check, and you know very well.

Visualizing makes completeness visible.

Copyright makes it impossible to be "complete" when it comes to movie posters and screenshots.

Comparing to other databases is common practice to check for completeness.

We wish Wikidata made "completeness" visible. Can we have a completeness score?

Suggest properties that are crucial, when some other properties are already there. Some might just be suggestions, but some might be super-important. "Don't even bother to create this item if you can't at least provide this and that."

The same completeness guidelines should apply to crucial qualifiers.

Example: when president number 44 is modelled, where is number 43?

Group 4

Are we considering the example of children? What if we don't know the precise number of offspring of a certain notable person?

having all the items in a catalogue

Problem with living people: the date of death is null, but what does that mean? Is the person living, or do we just not know the date of death?

unknown values

Some negative queries are interesting, e.g. prime ministers who didn't go to university.

think of the false positives :)

Sometimes there is ambiguity between an unknown value and a value known not to exist. E.g. if someone has one occupation in WD, did they do that one occupation all their life, or is it the only occupation they are notable for?
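In the data model this distinction is explicit: a statement's snak has type `value` (a concrete value), `somevalue` (unknown value) or `novalue` (known not to exist). A small sketch of interpreting a child (P40) snak (the snaktype names are the real ones; the helper function is hypothetical):

```python
# Interpret the three Wikidata snak types for "child" (P40):
#   "novalue"   -> known to have no children
#   "somevalue" -> has children, but they are unknown
#   "value"     -> a concrete child item is recorded

def describe_children(snak: dict) -> str:
    kind = snak["snaktype"]
    if kind == "novalue":
        return "known to have no children"
    if kind == "somevalue":
        return "has children, but we do not know who they are"
    return "has a specific recorded child"

print(describe_children({"snaktype": "novalue", "property": "P40"}))
# prints: known to have no children
```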

We have a way to say that someone had no children, but no way to say they had no more than two children.

Sometimes data is there, it's just missing from the relevant item

Being the first person to do something can be a notable fact in itself: without completeness, being the first in WD is ambiguous.

Summary

Even if we know there are mistakes, how do we spot them?

Issues with instance of and subclass of stick out.

We need a consistent/stable ontology.

Overview of the session

"Donald Trump" instance of: "terrible president" -- qualifier: disputed by: <no value>

"Your honour, I killed all those people for the sake of completeness of data on Wikidata!"