User:GerardM/Quality

Quality in Wikidata is a many-faceted thing and it is definitely not an absolute. Some items have few or no articles in Wikimedia projects but are still extremely valuable because of a large number of links. Some statements are important because they help in the disambiguation of people: things like date of birth and perhaps date of death. Quality is a goal to strive for.

Perfection

Quality does not mean that we have to be perfect in everything. As it is an aspiration, the quality sought in our data depends on the requirements that are expressed. When we do not provide a specified quality, there are two aspects to consider: the extent to which we fall short of expectations, and the way the quality is expressed internally in Wikidata and perhaps externally as well.

Existing quality functionality of Wikidata

  • When Wikidata was started it was an immediate success because it replaced the "interwiki" links in Wikipedia. The old system was unreliable, took a lot of effort away from the editing process, and the many obvious problems with these links were not easy to find or to remedy. There has been no quantitative research into this quality improvement, even though it is generally accepted as really important.
  • It allowed for the linking of items. This was done by making statements like "Barack Obama" "position held" "United States Senator".
  • It allowed for the labelling of items. All labels for items and properties can be expressed in all languages the WMF supports. This allows for "Barack Obama" "tehtävä tai virka" "Yhdysvaltain senaatin jäsen", the Finnish rendering of the same statement.
  • It allows for the qualification of such statements: "From January 3, 2005 until November 16, 2008" is what you will find in the Reasonator (see the sketch after this list).
  • It allows for linking to sources external to the Wikimedia Foundation. There are many properties for this, many people are actively adding these identifiers, and there are tools that help with doing so.
  • Wikidata has no single focus. In the true spirit of the Wikimedia Foundation, it aims to support the provision of the sum of all data.
  • There is an active, knowledgeable community of people involved in many aspects of Wikidata. They all help Wikidata become a better resource. Wikidata is a work in progress.
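
To make the data model behind these bullets concrete, here is a minimal sketch in Python that reads an item's labels, statements and qualifiers from the public Special:EntityData endpoint. It only illustrates the structures described above; Q76 (Barack Obama), P39 (position held) and P580/P582 (start and end time) are the usual Wikidata identifiers for this example, and the script is not part of any official tooling.

```python
import requests

# Read the public JSON record for one item; Q76 is Barack Obama.
URL = "https://www.wikidata.org/wiki/Special:EntityData/Q76.json"
entity = requests.get(URL, timeout=30).json()["entities"]["Q76"]

# Labels exist per language; this is where the Finnish rendering above comes from.
print(entity["labels"]["en"]["value"])
print(entity["labels"]["fi"]["value"])

# Statements ("claims") link items to other items; P39 is "position held".
for claim in entity["claims"].get("P39", []):
    if claim["mainsnak"].get("snaktype") != "value":
        continue  # skip "unknown value" / "no value" statements
    held = claim["mainsnak"]["datavalue"]["value"]["id"]  # Q-id of the position
    # Qualifiers refine a statement: P580 (start time) and P582 (end time) are
    # what a tool such as the Reasonator renders as "From ... until ...".
    quals = claim.get("qualifiers", {})
    start = quals.get("P580", [{}])[0].get("datavalue", {}).get("value", {}).get("time")
    end = quals.get("P582", [{}])[0].get("datavalue", {}).get("value", {}).get("time")
    print(held, start, end)
```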

What Wikidata is not

  • It is not complete and it is not completely accurate.
  • It is not a "professionally" maintained resource and its function is not how a professional would do it. As Wikidata has no single focus, there is no fixed vocabulary for including statements. It is a crowd sourced project.
  • It is not perfect. It is assumed that everyone who edits a project like this makes errors at a rate of at least 4 to 6%. Much of the data comes from Wikipedia lists and categories, and these have a similar error rate.

Measuring quality

Wikidata consists of data. It seems obvious that, with 25,391,205 items, an individual approach to quality is not realistic. When you consider the Wikidata statistics that Magnus provides us with, it becomes obvious how much effort is still needed just to achieve at least one statement and one label per item.

The quality of Wikidata is best seen in the light of "set theory". Every Wikidata contributor has his or her own sets of data that are of interest to them. There may be multiple sets of data, but they are what make people act. An example: when all the people featured in 100 Women (BBC) are known to Wikidata, and not just the people with articles in English, it changes the reporting on these people considerably. There is a project to write about them, so adding ALL these people is of relevance. Similarly, people who once won an award are likely to be notable, and there may be an item linking to an article in another language, or the person won another award as well, or studied at a university, or was a professor at a university, etc. This may make these persons of relevance to people interested in another set of data. A sketch of a query along these lines follows.
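
As a rough illustration of working with such a set, the sketch below asks the Wikidata Query Service for people who received a particular award (P166, "award received") but have no date of birth (P569). The award used, Q35637, is assumed to be the Nobel Peace Prize and is only an example; the pattern of "my set, minus the statement I care about" is what matters, and the same shape of query would serve the 100 Women (BBC) set once it is known how that membership is recorded.

```python
import requests

ENDPOINT = "https://query.wikidata.org/sparql"

# People with a given award (P166) but no date of birth (P569).
# Q35637 is assumed to be the Nobel Peace Prize; substitute the award
# (or other statement) that defines the set you care about.
QUERY = """
SELECT ?person ?personLabel WHERE {
  ?person wdt:P166 wd:Q35637 .
  FILTER NOT EXISTS { ?person wdt:P569 ?dob . }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
LIMIT 100
"""

reply = requests.get(ENDPOINT, params={"query": QUERY, "format": "json"}, timeout=60)
for row in reply.json()["results"]["bindings"]:
    print(row["person"]["value"], row.get("personLabel", {}).get("value", ""))
```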

Practical applications

  • When the English Wikipedia adds the {{authority control}} template to its articles, it is obvious that every author with a book in a library somewhere in the world is registered by that library, both for the book and as an author. The OCLC brings these registrations together and provides a VIAF identifier. When a Wikidatan adds a VIAF identifier to Wikidata, within a month VIAF will pick up the identifier and add it to its registry. This in turn enables librarians and library users to link through Wikidata to Wikipedia articles in any language.
    • When we identify all authors and their subclasses, we find a substantial number of items with no VIAF identifier. When we narrow this down, for instance to all authors with an article in Turkish, the resulting subset is relevant to someone who knows Turkish. It is more likely that volunteers will be found to work on a subset of data like this (a sketch of such a query appears after this list).
  • As Wikidata registers both VIAF and "Open Library" identifiers, we can provide both OCLC and the Internet Archive with a dataset containing Wikidata, VIAF and OL identifiers. As a result it is feasible for the Internet Archive to include a link to VIAF (and to Wikidata).
    • Because of this interest, the Internet Archive will identify its redirects in the dataset we provide, and we can remove these redundant entries.
    • When a Wikipedia community is interested, they can include "Open Library" in their "authority control" template and provide a gateway to freely licensed books by the authors known to Open Library.
    • VIAF is considering whether to include a link to "Open Library". This would provide librarians and library users with a gateway to free e-books to read.
  • NB: this is not based on a complete dataset. It is possible because Wikidata has the capability to bring together authors from any system and identify them. There is not necessarily one identifier per system per author, but it is based on the CURRENT data, data that will improve because of its relevance.
  • The "Black Lunch Table" is a project that documents artists from the "African diaspora". Identifying these people using particular statements is not possible. The property "catalog" is (ab)used to identify all the people who the project leaders identified as part of the project.
    • Instead of maintaining a spreadsheet and two Wikipedia lists, the data is maintained only once, in Wikidata.
    • The result is that many errors in these lists surfaced and were remedied.
    • It was found that several artists had an article, just not one in English.
    • Data for artists with existing articles was improved, and this allows for queries that are considered relevant to the people managing the "Black Lunch Table".
  • Infoboxes on Wikimedia projects use the data from Wikidata. As the information is added, the infoboxes become more complete.
    • Are there queries that help a community pinpoint the items with missing data for these infoboxes?
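
A sketch of the VIAF query mentioned above, against the Wikidata Query Service: writers with an article on the Turkish Wikipedia but no VIAF identifier (P214). Treating "authors" as items with occupation (P106) writer (Q36180) is an assumption made for the example; queries of this shape also speak to the infobox question above, since any property an infobox expects can take the place of P214 in the FILTER NOT EXISTS clause.

```python
import requests

ENDPOINT = "https://query.wikidata.org/sparql"

# Writers (occupation P106 = Q36180) with an article on the Turkish Wikipedia
# but no VIAF identifier (P214) -- a language-specific maintenance subset.
QUERY = """
SELECT ?author ?authorLabel WHERE {
  ?author wdt:P106 wd:Q36180 .
  ?article schema:about ?author ;
           schema:isPartOf <https://tr.wikipedia.org/> .
  FILTER NOT EXISTS { ?author wdt:P214 ?viaf . }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "tr,en" . }
}
LIMIT 200
"""

reply = requests.get(ENDPOINT, params={"query": QUERY, "format": "json"}, timeout=60)
for row in reply.json()["results"]["bindings"]:
    print(row["author"]["value"], row.get("authorLabel", {}).get("value", ""))
```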
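
The Black Lunch Table set can be selected along the same lines. The sketch below assumes that membership is recorded with the "catalog" property (P972) pointing at an item for the project; the Q-id in the query is a placeholder that has to be replaced with whatever item the project actually uses. The extra filter reproduces the finding above that several artists in the set have no article on the English Wikipedia.

```python
import requests

ENDPOINT = "https://query.wikidata.org/sparql"

# Members of the Black Lunch Table set, selected through the "catalog"
# property (P972), that have no article on the English Wikipedia.
# wd:Q00000000 is a placeholder for the project's own item.
QUERY = """
SELECT ?artist ?artistLabel WHERE {
  ?artist wdt:P972 wd:Q00000000 .
  FILTER NOT EXISTS {
    ?enwiki schema:about ?artist ;
            schema:isPartOf <https://en.wikipedia.org/> .
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
LIMIT 500
"""

reply = requests.get(ENDPOINT, params={"query": QUERY, "format": "json"}, timeout=60)
for row in reply.json()["results"]["bindings"]:
    print(row["artist"]["value"], row.get("artistLabel", {}).get("value", ""))
```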

Quality projects

Given that Wikidata is immature, any help we can get from any project that helps us identify where some added focus is of relevance is vitally important. The help does not need to be perfect; it only needs to make a noticeable difference. As a community we tend to think in terms of absolutes. These absolutes are often inherited from our Wikipedia sister projects. Their focus is on articles, and this does not transfer to items. When they have an article, we have an item; it does not need statements and it is still something we keep. When not enough sources are available, they may consider deletion, but that is not a consideration for us. When the same person won an award, the item stays even when an associated Wikipedia deletes its article.

The Quality project

At this time a project is underway that rates items against a set of quality levels. This project is run by a student who does not have the time to wait for us to develop a set of quality levels that is acceptable to us; this is part of a thesis he is to write.

It does not really matter if the quality levels are perfect. What matters is that this is a first attempt to grade the quality of all our items in an automated way. It enables us to identify items that are in need of some "tender loving care". When this works well, it will interface with queries, and we will have a proof of concept of a more intelligent way of engaging and growing our community.

This is the first phase of an iterative process in which we will gain an instrument that needs more development and refinement in the future. What it is not is an instrument to label items with absolutes such as "ready for deletion" or "of absolute quality". This iterative process will likely take us many years, or, when it proves a success, it will see quick iterations in the future. That will rely heavily on the quality of the engine used.

Proposal

Quality is important. Engaging people is important. What we need is:

  • more attention to how we can apply Wikidata for practical purposes
  • more queries that help us identify where the involvement of our community leads to practical improvements
  • support for all activities that bring us quality in a measurable way.