Wikidata:Events/Data Quality Days 2021/Outcomes

Data Quality Days
8-15 September 2021

On this page, let's list all the things that have been created, worked on or improved during the Data Quality Days. It doesn't have to be a big achievement! Did you improve some Items, update documentation or fix a tool? Feel free to add it here.

Template to add something:

* Short description of the outcome and what you did, with links if possible ([[user:X|X]])

Contributions

If you made contributions to the content of Wikidata in order to improve data quality, feel free to add a summary here.

Tools & development

If you created a new data quality related tool, or added improvements to an existing one, we want to know! Also, don't forget to add the new tools to this page.

Discussions

Important discussions and decisions made during the Data Quality Days can be summarized here.

Community

For new discussion pages or groups that were created.

  • After the shape expressions introduction, a dedicated Telegram group was created, as well as a page to ask for help with Schemas (link TBA)

Queries

Any interesting query that you wrote or adapted for your work on data quality can go here!

Documentation

Did you create or improve documentation pages during the Data Quality Days? Awesome! Please add the links below!

Notes of the events

The collaborative notes, when taken, have been copied from the Etherpad and pasted here.

Bibliometric-Enhanced Information Retrieval: A new alternative for the validation and enrichment of Wikidata Statements

Slides: https://commons.wikimedia.org/wiki/File:Wikidata_Data_Quality_Days_-_BIR.pdf

🖊️ Notes:

  • Speakers from University of Sfax (Tunisia)
  • Biomedical knowledge in Wikidata (uneven)
  • Bibliographic metadata i.e. title, authors, references, affiliations, external id, abstract, keywords, publication type, etc.
  • Bibliographic Databases
  • OpenCitations - CC 0
  • DBLP - CC 0
  • PubMed (biomedical & life sciences) - FairUse License
  • Parsing bibliographic databases
  • Usefulness of Bibliographic Information
  • Useful resources to generate "main subject" (P921) statements
  • Topic Modelling
  • Semantic Annotation
  • Word and Graph Embeddings
  • But also: Keywords and Controlled Keywords / can be used alongside titles (HOW?)
  • Citations and co-citations -- The citation and co-citation networks for a given topic should form a single cluster: papers about a topic should be cited by, or co-cited with, other papers about that topic. A paper that is neither cited nor co-cited with another paper on the same topic is probably an outlier unrelated to the topic.
  • Section and Source title as P1433
  • Publication Type (articles can vary in rigor and quality), Publication Year (articles can become outdated) and Status (articles can be retracted) -- all of this metadata needs to be considered
  • The RefB bot is presented
  • Code: https://github.com/Data-Engineering-and-Semantics/refb
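
The single-cluster idea from the citation notes above can be sketched as a connected-components check. This is only an illustration, not RefB's actual code: the function name, the input shapes and the choice to treat the largest connected cluster as "the topic" are assumptions made for this example.

```python
from collections import defaultdict


def find_odd_papers(papers, links):
    """Return the papers that fall outside the largest connected cluster.

    papers: iterable of paper ids tagged with the topic (hypothetical ids).
    links:  iterable of (a, b) pairs meaning "a cites b" or "a and b are
            co-cited"; direction is ignored in this sketch.
    """
    topic = set(papers)
    neighbours = defaultdict(set)
    for a, b in links:
        # Only links between two papers on the same topic count.
        if a in topic and b in topic:
            neighbours[a].add(b)
            neighbours[b].add(a)

    seen = set()
    clusters = []
    for start in sorted(topic):
        if start in seen:
            continue
        # Depth-first walk to collect one connected cluster.
        stack, cluster = [start], set()
        while stack:
            node = stack.pop()
            if node in cluster:
                continue
            cluster.add(node)
            stack.extend(neighbours[node] - cluster)
        seen |= cluster
        clusters.append(cluster)

    if not clusters:
        return set()
    main_cluster = max(clusters, key=len)
    return topic - main_cluster
```

Papers returned by `find_odd_papers` would be candidates for manual review of their "main subject" statement rather than for automatic removal.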

💬 Questions & answers:

ORES: Using AI for quality control in Wikidata

🖊️ Notes:

  • tools mentioned:
  • https://item-quality-evaluator.toolforge.org/
  • https://wdvd.toolforge.org/

EntitySchemas and Shape Expressions on Wikidata

🖥️ Slides: https://docs.google.com/presentation/d/1mEFklF2DX6aQxsMzqlnrW4VPlSNzwrf_8_2K_z2vg_4/edit?usp=sharing

https://www.w3.org/2021/Talks/0908-shex-wikidata/

Tools used:

  • YASHE: https://www.weso.es/YASHE/

🖊️ Notes:

  • canonical examples of instance data are a good starting point

Mismatch finder and beyond: How can we incorporate feedback from our biggest data re-users at scale?

🖊️ Notes:

  • Tool code: https://github.com/wmde/wikidata-mismatch-finder
  • The Tool itself: https://mismatch-finder.toolforge.org/
  • How to upload mismatches: https://github.com/wmde/wikidata-mismatch-finder/blob/main/docs/UserGuide.md
  • Idea: "Looking forward to having the gadget showing probable mismatches in items!"
    • Wikidata VS Wikidata in the tool: this would help find mismatches within Wikidata
  • As the history of mismatches is kept, a later version could rate the reliability of mismatch uploads and use that to prioritise them or save work in checking them
  • We should distinguish between small and big fishes when thinking of data re-users: big fishes should ideally provide their error dumps, while small fishes can fix Wikidata on the fly
  • tech-savvy community members can work on automatic ways to detect mismatches

💬 Questions & answers:

  • Should constraint violations also go to the mismatch store, to have one starting point for processing errors?
    • This looks like an open question that will depend on which constraints are violated. We talked about the same identifier in different Wikidata items, which is a good signal of duplicates in both directions (Wikidata & target database)
  • Should diverging data preferably be uploaded to mismatch store or added as alternative values?
    • Depends on reason for divergence: Is it a typo or another point of view?
    • Concerning biographical data such as birth dates, we agreed that we should first upload mismatches to the store, then when they get curated, we can upload them to Wikidata
    • If curated but incorrect data is not kept as deprecated statements, another volunteer contributor will eventually have to repeat the curation effort
  • How does the mismatch store store the mismatches? Any RDF vocabulary for data quality (like https://www.w3.org/TR/vocab-dqv/ ) used?
    • Currently, it's just a plain table with the affected items, values and sources. Upload could be done as CSV.
  • How does the mismatch store handle repeated reports of the same (not actual) mismatches? (Or: How to avoid keeping people busy by checking the same mismatches again and again?)
    • The history of mismatches is kept, and uploads could be checked against it.
  • Couldn't the mismatch store provide views to other platforms showing only mistakes related to them, as a supporting infrastructure?
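
Checking a CSV upload against the mismatch history, as discussed in the answers above, might look roughly like this. The column names are hypothetical and chosen only for this example; the real import format is the one described in the Mismatch Finder UserGuide linked in the notes.

```python
import csv
import io

# Hypothetical column names, NOT the Mismatch Finder's actual import format.
FIELDS = ("item_id", "property_id", "wikidata_value", "external_value", "source")


def parse_mismatches(csv_text):
    """Parse CSV text into a list of tuples, one per reported mismatch."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [tuple(row[field].strip() for field in FIELDS) for row in reader]


def filter_new(upload, history):
    """Keep only the rows that are not already in the review history."""
    reviewed = set(history)
    fresh = []
    for row in upload:
        if row not in reviewed:
            fresh.append(row)
            reviewed.add(row)  # also drops duplicates within the same upload
    return fresh
```

A reviewer-facing tool could then queue only the rows returned by `filter_new`, so that a mismatch already judged "not actual" is not shown again.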

Overview of ontology issues

💬 Questions & answers:

  • How to build a bulk of rules for people to easily apply?
  • Do you have the capability to control the assignment of classes/sub-classes?
  • Is there a disadvantage to separating items for conceptual ambiguity issues (for example: two different items for a museum (building) and a museum (organization))?
    • sitelinks are the usual problem, plus stopping people from merging them again
    • for many uses and classes, it's not desirable to split a building (or buildings), its physical site, the organization it forms, the owner of the buildings and the technical operators into 5 separate items
    • Is there a way to enhance the Wikidata interface to make it easier to view sub-items?
  • Are redundant statements a problem?
    • Redundant statements aren't really a big problem and might sometimes be desirable if, say, there's a reference for the more general statement but not the more specific ones
    • Making items instances of too specific subclasses isn't really desirable. Sample: use P31=Q5 for people
    • Somehow that paper considered people being instances of their occupation
  • Topical queries on these structural bugs would be valuable to WikiProjects. It's in their interest to clean it up, and it's conceivable they have the knowledge to do it too.
  • Is the increase in the number of items really a problem, if it contributes to clarity and accuracy?
  • Missing issues: can items having neither instance nor subclass be considered a missing issue? "Phantom items"?
    • Yes, maybe we should include this.
    • Items with neither instance nor subclass are 2.7M
      • https://www.wikidata.org/wiki/Special:Search/-haswbstatement:P31%20-haswbstatement:P279
      • Phantom items are about 2.8% and if you bear in mind that a huge chunk of the database is scholarly articles and stars and genes and stuff, the proportion of the rest without either property is even higher
      • Editors should be encouraged to add P31 and/or P279. Maybe an editathon (dividing the items according to the sitelinks present)
      • Maybe fix the property suggester to actually invite people to add P31 or P279 to these?
  • Overloading
    • heritage from Wikipedia
  • What is an upper ontology? (And could we get rid of it?)
    • https://www.wikidata.org/wiki/Wikidata:WikiProject_Ontology
    • https://angryloki.github.io/wikidata-graph-builder/?property=P279&item=Q35120&iterations=2&mode=reverse
    • Is there any benefit of having one?
    • Most contributors working on this seem highly specialized, but they have problems coming up with one.
  • Is that perhaps a general quality measure? % of items that have P31 or P279
  • "Semantic drift" seems like a misnomer, given the examples I saw. It seems like the problem is people misusing vocabulary, not meanings shifting in time.
  • Who manages the WD ontology? A group of people within WMDE?
    • the ontology comes from all Items -> the Wikidata community manages it bottom-up
    • Much of Wikidata assumes that items need to be an instance of another item. Genes even end up being both instances and subclasses of the same.
    • Does it even make sense to try to do that? (without even having the big picture at hand)
    • Some assumptions about ontology flow into Wikibase design decisions
  • Tool suggestion: A plugin that auto-suggests subclasses & superclasses (a tree of classes - based on queries/) as you start typing a particular "instance of" value. So that a user can already get a little preview of the class structure and can make a decision on the fly what is the most appropriate class / subclass relation.
    • comment was inspired by this talk: https://www.youtube.com/watch?v=dOMNzY5rUlE&ab_channel=NFDIDirektorat ... it's meant to address the inconsistency and confusion (even if there are no formal W3C standards to pull from in WD).
  • Are there any samples of issues that create problems for small and medium users? Why aren't these discussed on wikidata.org?
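
The "% of items that have P31 or P279" measure suggested above can be sketched as follows. The input shape (a mapping from item id to the set of properties used on the item) and the sample ids are assumptions for illustration; real numbers would have to come from a dump or from the search link given in the notes.

```python
TYPING_PROPERTIES = {"P31", "P279"}  # instance of, subclass of


def typed_share(items):
    """Share of items that carry at least one of P31 or P279.

    items: mapping of item id -> set of property ids used on that item.
    """
    if not items:
        return 0.0
    typed = sum(1 for props in items.values() if props & TYPING_PROPERTIES)
    return typed / len(items)


def phantom_items(items):
    """Ids of items with neither P31 nor P279, sorted for stable output."""
    return sorted(qid for qid, props in items.items() if not props & TYPING_PROPERTIES)


# Made-up sample: two typed items, two "phantom" items.
sample = {
    "item_a": {"P31", "P18"},
    "item_b": {"P279"},
    "item_c": {"P18"},
    "item_d": set(),
}
print(typed_share(sample))    # 0.5
print(phantom_items(sample))  # ['item_c', 'item_d']
```

Tracking `typed_share` per class or per WikiProject, rather than globally, would account for the point above that scholarly articles, stars and genes dominate the overall count.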

Clarifying property application for effective SPARQL queries

🖥️ Slides: Link to Google presentation

Problem statements

Using SPARQL queries to inform and formulate best practices for applying properties and qualifiers in Wikidata

At the Smithsonian we are trying to model top leadership with the job title “director” -- may be director of a named museum or a unit within a museum or an independent organization such as the Center for Folklife and Cultural Heritage. Currently in Wikidata people with similar positions are modeled under “museum director” (Q22132694) and “director” (Q1162163) interchangeably, part of statements with properties including occupation (P106), position held (P39), and as qualifiers under the employer (P108) organization. There are multiple ways of representing the same information depending on the editor’s understanding of the definition of various related properties, and this complicates the writing of SPARQL queries. Attempts to compile a list of such directors proved very complex given the wide variety of ways these individuals are recorded in Wikidata. We are seeking to get a better sense of the consensus around how this type of individual’s job title should be described in Wikidata.


A sampling of SPARQL queries to find directors of museums:

  • Simple SPARQL query for occupation = museum director: https://w.wiki/43Kj  [not all museum directors have this statement in their items]
  • SPARQL query for occupation=museum director, employed by the Smithsonian: https://w.wiki/44TF
  • SPARQL query for director/managers of museums: https://w.wiki/43MD [many individuals listed as directors/managers in the items for specific museums lack reciprocal position information in their Wikidata item]
  • SPARQL query for director/managers of museums with position held qualifiers under employer statement: https://w.wiki/43ML [not many individuals have this information, and the title varies widely if present]
  • We could definitely write more complex queries that combine some of these formats (occupation, position held, employer qualified by position held, etc.) but, as seen from the presentation, capturing every variation would be onerous and complex, and the query would most likely time out.
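
To make the "combined query" point concrete, here is a rough sketch of how three of the patterns named above (occupation, position held, employer qualified by position held) could be joined with UNION. It is not one of the session's queries and has not been run against the Wikidata Query Service; the helper name and the `?model` label are illustrative, and the prefixes are assumed to be the ones the query service predefines.

```python
MUSEUM_DIRECTOR = "Q22132694"  # "museum director", as cited in the notes
DIRECTOR = "Q1162163"          # "director", as cited in the notes


def build_director_query(limit=100):
    """Build a SPARQL string covering three ways a directorship is recorded."""
    roles = f"wd:{MUSEUM_DIRECTOR} wd:{DIRECTOR}"
    return f"""
SELECT DISTINCT ?person ?model WHERE {{
  VALUES ?role {{ {roles} }}
  {{
    ?person wdt:P106 ?role .
    BIND("occupation" AS ?model)
  }} UNION {{
    ?person wdt:P39 ?role .
    BIND("position held" AS ?model)
  }} UNION {{
    ?person p:P108 ?employment .
    ?employment pq:P39 ?role .
    BIND("employer qualified by position held" AS ?model)
  }}
}}
LIMIT {limit}
""".strip()


print(build_director_query(limit=10))
```

Even this sketch leaves out the director/manager statements on the museum items themselves and any restriction to a particular employer, which is where the complexity and the risk of timeouts mentioned above come from.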


🖊️ Notes:

  • This has been a long-time mess in Wikidata, & is worth tackling
  • Which model do people like the most?
  • Start and end dates for leadership of museums etc. are important for creating timelines
  • Occupation qualified by employer, with position held for named directorship (not for generic "museum director"); would require creating items for each directorship though?
  • Model the occupation in a way similar to public office items? E.g. "Mayor of [place name]"?
  • Should employer be a qualifier under occupation?
  • Occupation makes more sense as a "root" description
  • "At universities you are being promoted so it makes sense that the positions are shown within the employer - and that's where I've generally seen position held"
  • "Thinking about it CV's are often grouped by employer"
  • ""director" is a kind of "position held", yes ?"
  • https://phabricator.wikimedia.org/T97566 -- though the Phabricator community is small, so it is hard to give specific issues much attention
  • No best practices yet; good way to start addressing it by having a group discussion (but we can't resolve this today)
  • Change the property constraints to help guide the modeling
  • A lot of clean-up, manual work mostly (maybe can move some things in bulk), needed once a model is agreed upon by consensus
  • Wikidata discussion pages don't usually provide a recommended approach to take
  • Maybe discuss more at Wikiconference North America, or Wikidata Conference; engage wider community discussion
  • Occupation/employer/position held -- add this topic and link to etherpad notes to property discussion pages to continue the group discussion
  • "It would be useful to look at the way RiC addresses the issues of Occupation/Position held https://www.ica.org/sites/default/files/ric-cm-02_july2021_0.pdf"

💬 Questions & answers:

  • How to best reach consensus on these problems? Often, discussion on property talk pages does not ultimately clarify the issues in a definite way. There are merely <discussions>.
  • Without constraints / constraint violation flags there is nothing to prompt the editor to correct these types of statements to conform to consensus. What is the mechanism to better communicate consensus in property talk pages to the wider Wikidata community?
  • How to best retrospectively implement quality control following new consensus?
  • Is there really any interest in being able to query "all current museum directors"? What would be the educational value of such queries?
    • You may not want to query all current museum directors, but you may want to query female museum directors or Italian museum directors, for example, and with the way the position is currently modeled you wouldn't be able to query an accurate subset of museum directors. For our purposes, we originally wished to query all directors of Smithsonian Institution organizations/museums currently in Wikidata so that we can improve existing items and add missing ones, but we were unable to do so in a simple way because of the way the information is currently recorded in items for these people.

From the Smithsonian Team notes:

One example of the complex ontology issues in Wikidata is the inconsistent modeling of individuals who hold the position of director of an organization. In this case we specifically looked at directors of museums and cultural institutions. Currently in Wikidata people with similar positions are modeled under “museum director” (Q22132694) and “director” (Q1162163) interchangeably, part of statements with properties including occupation (P106), position held (P39), and as qualifiers under the employer (P108) organization. A survey of known international museum directors produced a list of at least 15 potential ways this information can be modeled in Wikidata, and many individual items include a combination of these different statement types.

There are currently no best practices in the Wikidata community for modelling this type of complex data, and the issue is a long-time mess in Wikidata that is worth tackling. There may be a variety of sometimes conflicting arguments and use cases for how and why the data should be modeled a certain way. This session served as the beginning of a discussion that must continue before consensus is reached on recommendations for best practices.

Potential data models

Two major data modeling proposals came out of this session:

  1. Model via occupation property, with position held only if named directorship. Occupation as a better root description than employer or position held.
    1. Person has occupation [director/museum director]
      1. Qualified by employer [museum]
      2. Qualified by dates
      3. Qualified by replaces [former director]
      4. Qualified by followed by [subsequent director]
    2. Person has position held [named directorship only]
    3. Example of named directorship: Jean-Luc Martinez
    4. Example of modeling via occupation property with employer qualifier: Richard Owen
  2. Model similarly to CV, via employer property -- allows for multiple positions/career within an institution to be traced; this may model university careers better.
    1. Person has employer [museum]
      1. Qualified by position held [director/museum director]
      2. Qualified by dates
      3. Qualified by replaces [former director]
      4. Qualified by followed by [subsequent director]
    2. Examples of CV-style modeling via employer property:
      1. Rudi Fuchs
      2. Elisabeth Tietmeyer (shows career progression within same institution)

Further discussion

Current paths for raising the issue, collecting feedback, recording discussion and consensus, and establishing best practices are:

  • Including a summary of this session and a link to the Etherpad notes in the property talk pages for occupation (P106), position held (P39), and employer (P108) properties
  • After consensus is reached, make better use of Wikidata Usage Instructions properties, property documentation, item usage instructions, property constraints, and entity schemas to guide best practices?

Cross-checking on-wiki: visibility, duplication, migration

🖥️ Slides: https://docs.google.com/presentation/d/1MoxDJeSegpmA6NzklzrNO4OWd8MNlwNDE9Jbd4HJVyM/edit#slide=id.g1f43ddb61a_0_74

🖊️ Notes:

  • https://en.wikipedia.org/wiki/Template:Commons_category

Discover patrolling & quality tools

🖊️ Notes:

  • Item Quality Evaluator (Lydia Pintscher) https://item-quality-evaluator.toolforge.org/
  • SpeedPatrolling (Lucas Werkmeister) https://speedpatrolling.toolforge.org/
  • Wikidata Vandalism Dashboard (Ladsgroup) https://wdvd.toolforge.org/
  • Constraint Violation Checker (Lydia Pintscher) https://github.com/wmde/wikidata-constraints-violation-checker
  • slides: https://drive.google.com/file/d/1zk8M1PqiBSWj23LR7RwBRU4KJitvGjMh/view?usp=sharing
  • justified vs unjustified imbalance
  • slides: https://docs.google.com/presentation/d/1Hb5q5a2CC2XgXk_lvTkAQrc9SBtvi0MhBUwwAXbn5m0/edit?usp=sharing
  • Curious Facts (Lydia Pintscher) https://wikidata-analytics.wmcloud.org/app/CuriousFacts
  • soweego (Hjfocs) https://soweego.readthedocs.io/
  • RECH https://pltools.toolforge.org/rech/

💬 Questions & answers:

  • SpeedPatrolling: Can we use the tool to patrol only certain property changes etc? Not really, you could try other tools for that, such as https://pltools.toolforge.org/rech/. SpeedPatrolling is meant to be simpler.
  • SpeedPatrolling: does it only work with rollback rights? If you’re not a rollbacker, the rollback button will be disabled (there’s no “undo” option because that’s often not the right thing to do when there are several bad edits in a row).
  • ProWD: do you have something like high score, ranked list? most unequal group of entities? A student did that, but there can be trivial heavily imbalanced classes (e.g. with few entities, of which only one has many statements), so it’s not necessarily representative. Don’t have the data right now.
  • Would it be worth adding some of the scripts mentioned during this call as gadgets in the preferences?