Wikidata:WikiProject Ontology/Cleaning Task Force

This page serves as the main page for the ontology cleaning task force's efforts to clean up the Wikidata ontology.

The task force arose from a presentation (https://commons.wikimedia.org/wiki/File:Wikidata_Challenges_in_Semantic_Web_Community.pdf) by Andrea Westerinen at the 2023 Wikidata Data Modelling Days (https://www.wikidata.org/wiki/Wikidata:Events/Data_Modelling_Days_2023).

Anyone interested in working to improve the Wikidata ontology is welcome to join the task force.

Scope

The scope of the task force has not yet been determined, but just about any effort to improve the Wikidata ontology will probably be considered in scope. This ranges from theoretical analyses of the ontology, to techniques for reasoning in Wikidata, to implementation of tools that help improve the ontology, to documentation of problems with the ontology, to direct editing of the ontology. There is a welcome page that gives some background on the Wikidata ontology and the task force.

Participation

Anyone interested in working to improve the Wikidata ontology is welcome. Add your user identification below to join the group. Add issues that you are interested in or currently work on to the sections below.

Participants

  1. Peter Patel-Schneider
  2. Lectrician1
  3. Andrea Westerinen
  4. PKM (talk)
  5. Chris Mungall
  6. CV213

Meetings

The task force uses Google Meet for meetings, as described in the task force's calendar. Meetings are currently held on Tuesdays at noon ET (currently GMT-5). Members of the task force are also active in the WikiProject Ontology Telegram group.

Next meeting agenda

Task force members should add agenda items for topics they wish to discuss. The default agenda has reports on the current efforts. If there is significant work on any effort please add something to its agenda item.


Tuesday, 16 April 2024, 12pm ET
  • Our meeting length is limited to 60 minutes by Google so meetings may be terminated abruptly.
  • Welcome new members
    • If you can't make meetings please add information to this page about your interests and activities.
  • Interests of new members

Meeting notes

Tuesday, 26 March 2024, 12pm ET

We discussed Andrea's paper, available at https://github.com/AndreaWesterinen/Wikidata-and-OWL/blob/main/papers/Understanding%20Wikidata.pdf. The paper is still a work in progress.

Tuesday, 20 February 2024, 12pm ET

Participants: Peter, Andrea, Ege

Andrea discussed running TFT (from BorderCloud) as a potential test for new systems to support WDQS. The BorderCloud repositories were forked and modified, including aligning all the tests with the current W3C repository (https://github.com/w3c/rdf-tests). TFT executes the W3C RDF tests for SPARQL 1.1. See https://wikitech.wikimedia.org/wiki/User:AndreaWest/WDQS_Testing/Running_TFT for a complete description of how the code base and tests were modified. The TFT code is found at https://github.com/AndreaWesterinen/TFT, and the updated RDF tests at https://github.com/AndreaWesterinen/rdf-tests.

Andrea also discussed dumping item constraint violations by scraping the Wikidata mandatory constraint violations page and constructing RDF triples from them. The notebook to do this is found in GitHub at https://github.com/AndreaWesterinen/Wikidata-and-OWL/blob/main/notebooks/Wikidata-Constraint-Violations.ipynb. This should make it easier to understand the "noisiness" of the item data, and enable bots or KG code to selectively delete triples (such as properties with erroneous values), or to insert triples (for example, adding inverse relationships where they are missing). At a minimum, knowledge about the violations can be used to indicate reduced confidence in the information conveyed by the violating item.
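As a rough illustration of the triple-construction step (this is not the code from the notebook), the sketch below turns already-scraped violation records into N-Triples lines. The record layout, the example.org namespace, and the predicate names are all assumptions made for this example, and the sample IDs are illustrative only.

```python
from typing import NamedTuple

WD = "http://www.wikidata.org/entity/"
EX = "http://example.org/violation#"  # hypothetical namespace, not a Wikidata vocabulary


class Violation(NamedTuple):
    item: str        # item carrying the violating statement
    prop: str        # property whose constraint is violated
    constraint: str  # item identifying the constraint type


def to_ntriples(index: int, v: Violation) -> list[str]:
    """Return N-Triples lines describing one violation as its own node."""
    node = f"<{EX}v{index}>"
    return [
        f"{node} <{EX}item> <{WD}{v.item}> .",
        f"{node} <{EX}property> <{WD}{v.prop}> .",
        f"{node} <{EX}constraintType> <{WD}{v.constraint}> .",
    ]


# Sample record with illustrative IDs (not taken from the violations page).
records = [Violation("Q42", "P21", "Q21503250")]
lines = [line for i, r in enumerate(records) for line in to_ntriples(i, r)]
```

Keeping each violation as a separate node, rather than attaching a flag directly to the item, would let a bot or KG pipeline select violations by property or constraint type before deciding which triples to delete or insert.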

Tuesday, 30 January 2024, 12pm ET

Participants: Peter, Andrea

  • Reports on status of current efforts.
    • Links to other ontologies (Andrea Westerinen): Andrea is working on translations of Wikidata constraints to OWL. Andrea is continuing work on mapping to schema.org and connections to other ontologies.
    • Test suite: Andrea is working on a test suite for SPARQL 1.1 to see how well replacements for Blazegraph perform.

Tuesday, 23 January 2024, 12 noon ET

Participants: Peter, Seth, Ege, Ozge, +

  • Welcome new members
    • Ozge Yalcin
  • Existing efforts.
    • Connections to other ontologies: Andrea is working on querying all external links to external ontologies. There is progress, but some hurdles still need to be overcome. Andrea is also working on mapping from Wikidata classes and properties to OWL RDF properties so that an OWL reasoner could work on the Wikidata ontology.
  • Questions from Ege: Ege had questions about the ontology, notably about levels and metaclasses. Peter put together a short summary of what was discussed:

The Wikidata ontology is large and has a complex organization. Wikidata items can be roughly categorized as individuals, such as Douglas Adams (Q42), or classes, such as human (Q5). About four million Wikidata items can be considered to be classes. The defining characteristic of a class is that it can have instances via instance of (P31), as in Douglas Adams (Q42) instance of (P31) human (Q5).

Classes are organized into a generalization hierarchy using subclass of (P279), for example, human (Q5) subclass of (P279) mammal (Q110551885) subclass of (P279) vertebrate (Q110551902) subclass of (P279) animal (Q729). The most general class is entity (Q35120), the class of all entities.

Classes are themselves instances of other classes, as in human (Q5) instance of (P31) first-order class (Q104086571) instance of (P31) second-order class (Q24017414). Classes none of whose instances are classes are first-order classes, for example human (Q5). Classes whose instances are all classes are metaclasses. The most general metaclass is class (Q16889133).

One can think of the Wikidata ontology as being organized in two dimensions: one being generalization, using subclass of (P279), and the other being order, using instance of (P31). So human (Q5) is a moderately general first-order class and animal (Q729) is a quite general first-order class, while ship type (Q2235308) is a second-order class and vehicle functional class (Q124315169) is a more general second-order class. But not all classes fit into a particular order, with prominent exceptions like entity (Q35120), class (Q16889133), and metaclass (Q19478619). Also, not all of the instance of (P31) and subclass of (P279) links that should be in Wikidata are actually present.
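The two dimensions in this summary can be tried out on a toy copy of the statements quoted in it. The sketch below is only an illustration: the dictionaries hold a hand-copied fragment rather than live Wikidata data, and the function names are made up for this example.

```python
# Toy fragment of the statements quoted in the summary (not a live query).
SUBCLASS_OF = {  # subclass of (P279)
    "Q5": {"Q110551885"},          # human -> mammal
    "Q110551885": {"Q110551902"},  # mammal -> vertebrate
    "Q110551902": {"Q729"},        # vertebrate -> animal
}
INSTANCE_OF = {  # instance of (P31)
    "Q42": {"Q5"},                # Douglas Adams -> human
    "Q5": {"Q104086571"},         # human -> first-order class
    "Q104086571": {"Q24017414"},  # first-order class -> second-order class
}


def superclasses(cls):
    """All classes reachable from cls by following P279 upwards."""
    seen, stack = set(), [cls]
    while stack:
        for parent in SUBCLASS_OF.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def classes_of(item):
    """Direct P31 classes of item plus all of their superclasses."""
    direct = INSTANCE_OF.get(item, set())
    return direct.union(*(superclasses(c) for c in direct))
```

Here `superclasses` walks the generalization dimension, while `classes_of` takes one step along the order dimension and then generalizes, so Douglas Adams (Q42) comes out as an instance of human, mammal, vertebrate, and animal.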

Tuesday, 16 January 2024, 12pm ET

Present: Peter Patel-Schneider, Seth Deegan, Ege Atacan Dogan, Daniele Santini

  • Our meeting length is limited to 60 minutes by Google, so meetings may be terminated abruptly.
  • The WikiProject Ontology homepage has been updated by Seth Deegan to include information on the task force.
  • Initial list of potential ontology cleaning tasks (Peter F. Patel-Schneider)
    • See Ontology Cleaning Tasks. Feel free to add new tasks there.
    • There is currently duplication between the task page and this page. Peter will work on eliminating duplications as much as reasonable.
    • This page may be moved, or Phabricator tasks may be created for each task.

Tuesday, 9 January 2024, 1:30pm ET

Present: Peter Patel-Schneider, Andrea Westerinen, Seth Deegan, Gabriel Lopes, Yiwen Peng, Mehwish Alam, and Carl Mattocks

  • Meetings will be held on Tuesdays at noon ET (GMT-5) and will last just under an hour. (Google Meet may start the hour countdown when the first person joins.)
  • The meeting ran out of time.

Tuesday, 2 January 2024, 1pm ET

22 December 2023, 1pm ET

Discussion on what will be done over the break. There are ongoing efforts on disjointness, mapping to schema.org, and subclasses of entity.

15 December 2023, 1pm ET

How to better publicize the effort.

  • Andrea wants to have more semantic web people involved. Peter will send messages to appropriate lists.
  • Peter wants to have more Wikidata people involved. Peter will ask Lydia about good channels.
    • Email sent.
  • Peter will see about opening a Phabricator ticket to increase the maximum project size for pinging.

There is a great tool to see the superclasses of a class and also to see disjointness problems. To use it see https://www.wikidata.org/wiki/Wikidata:Tools/Enhance_user_interface#Classification.js

8 December 2023
  • The group will start work on several smaller efforts.
    • As part of improving the upper part of the Wikidata ontology we will look at the subclasses of entity (Q35120) and determine whether any of them should be moved or modified. Classes that have no superclasses will also be considered.
    • As part of improving the middle part of the Wikidata ontology we will look at the Wikidata classes that map to high-level classes in schema.org and see whether these Wikidata classes are correctly placed in the Wikidata ontology, using the organization of the schema.org ontology as a guide.
    • As part of improving support for ontologies in Wikidata the group will prepare descriptions of how to improve disjointness support in the Wikidata ontology and a proposal to add a constraint preventing values from belonging to specified classes.
  • The RDF dumps do not include constraint violations. Is there a way to augment the dumps?

7 December 2023

The initial (preliminary) meeting on 7 December 2023 came up with the following preliminary decisions, which will be reexamined in the meeting on 8 December 2023.

  • The task force will try to record information in a minimal number of pages, with the task force's main page linking to all group information pages.
  • The task force will provide links to relevant information created by others on the task force's main page.
  • The task force will try to get access to Phabricator to track tasks undertaken by the task force.
  • The task force may consider ontological ideology issues as related to the Wikidata upper ontology, including what the goal of the Wikidata ontology is.
  • The task force will consider volunteering resources to improve Wikimedia, for example to expand constraints.

Current efforts

Feel free to add efforts that you are participating in here and report on your progress.

See the task force Phabricator board for current and potential efforts being tracked in Phabricator.

Disjointness

Peter Patel-Schneider

Wikidata has disjoint union of (P2738), which states that a class is the disjoint union of a list of other classes. This implies that the classes in the list are pairwise disjoint. I am preparing a fuller report on violations of these disjointnesses, expanding on the information below. See User:Peter F. Patel-Schneider/disjoint_violations for a report on violations of the disjointnesses resulting from disjoint union of (P2738) statements.
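The step from disjoint union of statements to pairwise disjointness, and the check for violations, can be sketched as follows. This is a minimal illustration on made-up toy data, not the code behind the report: the class names stand in for real item IDs, and the class memberships are assumed to have been computed already (directly stated or inherited).

```python
from itertools import combinations


def implied_disjoint_pairs(disjoint_unions):
    """disjoint_unions: iterable of lists of class IDs, one list per
    disjoint union of (P2738) statement. Returns the implied pairs."""
    pairs = set()
    for members in disjoint_unions:
        for a, b in combinations(sorted(set(members)), 2):
            pairs.add((a, b))
    return pairs


def violations(pairs, classes_of_item):
    """classes_of_item: item -> set of classes it belongs to.
    Returns (item, a, b) for every item that falls in both classes of a pair."""
    return {
        (item, a, b)
        for item, classes in classes_of_item.items()
        for a, b in pairs
        if a in classes and b in classes
    }


# Made-up toy data: one three-way disjoint union and two items.
pairs = implied_disjoint_pairs([["gas", "liquid", "solid"]])
found = violations(pairs, {"liquid helium": {"liquid", "gas"}, "ice": {"solid"}})
```

A list of n classes yields n(n-1)/2 pairs, which is why a few long disjoint union of lists can dominate the total number of implied pairs.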

Enumeration

There are 740 disjoint union of (P2738) statements on a total of 603 classes, creating 6995 implied disjoint pairs of classes. After excluding all those related to day (Q573), only 1455 remain. Excluding several other not very interesting groupings results in 870 implied disjoint pairs. For 123 of these pairs there are violations: in total, 18329 items are either subclasses of both classes in a pair (with no superclass also being a subclass of both) or instances of both classes (that have no subclass in common). Some of them appear to be simple mistakes about other classes or individuals, like liquid helium being a gas and Horse Grenadiers being a person. Others appear to be mistakes about disjointness, like game of skill stated as disjoint from game of chance, some perhaps resulting from confusion about disjointness between classes, like female given name stated as disjoint from male given name. Someone should probably go through this list and try to cut it down. Some large parts of the list appear to be caused by a single questionable subclass of link or a single questionable disjoint union of statement.

High-level disjointness

The Wikidata ontology does not make many high-level distinctions. Nonetheless, there are disjoint union of statements that make these distinctions, including disjointness of concrete object and abstract object under object, disjointness of artificial object and natural object under object, and disjointness of abstract entity and concrete object under entity. Each of these disjointnesses has many classes that are subclasses of both disjoint classes.

What should be done here? My suggestion is to look at each of the high-level disjointnesses and determine whether the disjointness is correct in the Wikidata ontology. If the disjointness is correct, then edits should be made to both classes and individuals to remove violations of the disjointness. If the disjointness is not correct, it should be removed and replaced with appropriate disjointness between more-specific classes.

Class order

Peter Patel-Schneider

A class is first-order if it has no classes as instances. A class is second-order if it has only first-order classes as instances. Similarly for higher orders. Wikidata has several metaclasses - first-order class (Q104086571), second-order class (Q24017414), third-order class (Q24017465), fourth-order class (Q24027474), and fifth-order class (Q24027515) - that allow stating that a class has a particular order. There are lots of classes that are instances of one (or more) of these metaclasses but that violate the requirements for the class order. Some of these are individual errors but many are caused by a general confusion on orders (e.g., diseases and colors).
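The definition of class order can be made concrete with a small recursive check. The sketch below is an illustration on toy data rather than the method used for the report. It makes one simplifying assumption that should be kept in mind: an entity with no recorded instances is treated as an individual (order 0), so a class that merely lacks instances in the data would be misread as an individual.

```python
def order(entity, instances, _seen=frozenset()):
    """Return 0 for an entity with no recorded instances, n if all of its
    instances have order n-1, and None if its instances have mixed orders
    or the instance data is cyclic."""
    if entity in _seen:
        return None  # cycle in the instance-of data
    members = instances.get(entity, ())
    if not members:
        return 0
    orders = {order(m, instances, _seen | {entity}) for m in members}
    if len(orders) != 1 or None in orders:
        return None
    return orders.pop() + 1


# Toy data with made-up names: class -> set of its instances.
instances = {
    "ship type": {"cargo ship", "warship"},
    "cargo ship": {"ship A"},
    "warship": {"ship B"},
    "mixed": {"cargo ship", "ship A"},  # one class and one individual
}
```

With this data, `order("cargo ship", instances)` is 1 and `order("ship type", instances)` is 2, while `order("mixed", instances)` is `None`. A class declared as an instance of first-order class (Q104086571) whose computed order is not 1 would be a candidate violation.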

See https://www.wikidata.org/wiki/User:Peter_F._Patel-Schneider/order_violations for a longer description of class orders and a long list of violations as of 8 January 2024.

Alignment with schema.org

Andrea Westerinen

This would involve finding or creating Wikidata classes for each class in schema.org and checking that the generalizations in one are exactly the generalizations in the other.
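One way to read "the generalizations in one are exactly the generalizations in the other" is that, for any two mapped classes, one generalizes the other in Wikidata exactly when the same holds for their schema.org counterparts. The sketch below checks this on toy data. The identifiers, the mapping, and the deliberately wrong Wikidata link are all made up for the example, and this reading of the requirement is itself an assumption.

```python
def ancestors(cls, parents):
    """All classes reachable from cls through the parents mapping."""
    seen, stack = set(), [cls]
    while stack:
        for p in parents.get(stack.pop(), ()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen


def generalization_mismatches(mapping, wd_parents, schema_parents):
    """mapping: Wikidata class -> schema.org class.
    Returns (a, b, in_wikidata, in_schema) wherever 'b generalizes a'
    holds on one side of the mapping but not on the other."""
    mismatches = []
    for a in sorted(mapping):
        for b in sorted(mapping):
            if a == b:
                continue
            in_wd = b in ancestors(a, wd_parents)
            in_schema = mapping[b] in ancestors(mapping[a], schema_parents)
            if in_wd != in_schema:
                mismatches.append((a, b, in_wd, in_schema))
    return mismatches


# Toy data; "wd:person" under "wd:work" is a deliberately planted error.
mapping = {"wd:book": "schema:Book", "wd:work": "schema:CreativeWork",
           "wd:person": "schema:Person"}
wd_parents = {"wd:book": {"wd:work"}, "wd:person": {"wd:work"}}
schema_parents = {"schema:Book": {"schema:CreativeWork"}}
result = generalization_mismatches(mapping, wd_parents, schema_parents)
```

Comparing transitive ancestors rather than direct parents means that extra intermediate classes on either side do not count as mismatches.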

Another possibility would be to review and update the GitHub Wikidata-schema.org mapping work.

Output of RDF/OWL

Andrea Westerinen

The current RDF output is valuable but insufficient to discover (and possibly correct) errors in Wikidata. There are two main reasons for this: 1) the output does not allow reasoning or consistency checking; and 2) it does not provide a mechanism to view constraint violations for items.

The work involves programmatically converting the RDF to RDF/OWL and including information on mandatory constraint violations.

Review subclasses of entity (Q35120)

Lectrician1

Wikidata Graph Builder

ISO conformance

BFO and several ISO-ID standards are used in Wikidata.

ISO IDs

ISNI

CV213

Items that have a value for ISNI (P213) should be instances of something, stated using instance of (P31), and should never have a statement using subclass of (P279). They should also not have certain other ISO identifiers, such as DOI, ISWC, or ISRC.
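The rule above can be written as a simple per-item check. The sketch below works on the set of property IDs used on one item and is only an illustration. The property IDs given for DOI, ISWC, and ISRC are assumptions that should be verified against Wikidata before use, and the set of conflicting identifiers is left as a parameter because the rule says "certain other" identifiers without listing them all.

```python
ISNI, INSTANCE_OF, SUBCLASS_OF = "P213", "P31", "P279"

# Assumed property IDs for DOI, ISWC, and ISRC; verify before relying on them.
CONFLICTING_IDS = frozenset({"P356", "P1827", "P1243"})


def isni_problems(statements, conflicting_ids=CONFLICTING_IDS):
    """statements: set of property IDs used on one item.
    Returns a list of rule violations; empty if the item has no ISNI."""
    if ISNI not in statements:
        return []
    problems = []
    if INSTANCE_OF not in statements:
        problems.append("missing instance of (P31)")
    if SUBCLASS_OF in statements:
        problems.append("has subclass of (P279)")
    for pid in sorted(conflicting_ids & statements):
        problems.append(f"has conflicting identifier {pid}")
    return problems
```

For example, `isni_problems({"P213", "P279"})` reports both the missing instance of statement and the unwanted subclass of statement, while an item without an ISNI is never reported.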

Possible efforts

Add efforts that you are interested in pursuing or that you think should be pursued.

See the task force Phabricator board for current and potential efforts being tracked in Phabricator.

  • Adopt some existing upper ontology as the basis for the Wikidata upper ontology. Care has to be taken here as some upper ontologies are quite prescriptive and Wikidata has to support multiple modeling styles.
    • As opposed to adopting an existing upper ontology, it may be beneficial to collect the concept distinctions that are most valuable across several (BFO, SUMO, DOLCE, etc.) and then map the Wikidata concept model onto these distinctions.
    • It appears that a lot of BFO has already been added, sometimes creating problems for Wikidata.
  • Add facilities to Wikidata that help prevent problems with the ontology.
    • Is the property constraint none-of sufficient to forbid values from a particular class?
    • Disjointness of concepts is also important to define.
  • Suggest an enhanced EntitySchema design to aid in use of the upper and middle ontologies.

Some Specific Questions to Answer

  • When aligning with schema.org, what is the right set of properties to use to indicate the mapping?
    • Some possibilities are P1709 (equivalent class), P3950 (narrower external class), P4900 (broader concept, but does this work for external classes?), P2888 (exact match), P4070 (identifier shared with), and P1628 (equivalent property).
    • Or is there a single mapping property, perhaps with a qualifier indicating the nature of the mapping?
  • Should meta-class be included in the upper ontology? If so, why?
    • Some users don't like the idea of metaclasses. If one accepts that Q16521 (taxon) is a metaclass, then Q876500 (western lowland gorilla) is a class and Q12038481 (Moja) is an instance of (P31) Q876500. People who don't like metaclasses, however, created P10241 (instance of taxon) to get around using P31 in this way.
  • How much should concepts rely on multiple inheritance to capture ambiguities (such as a geopolitical entity being both an agent/actor and a location)?
  • Is disjointness like a constraint that can easily be violated, or should it be considered inviolable?
  • There are temporally qualified ontology links (and they cause disjointness violations). Should it be required that the best one of these be preferred? There is an example of this for the population of Berlin in some Wikidata tutorial material.
  • Some content pulled from other ontologies makes assumptions that are not true in Wikidata, e.g., the disjointness of object and property pulled from BFO. Should these be deprecated or removed?
  • The Wikidata ontology does not make firm decisions between some very high-level categorizations. For example, many classes are subclasses of both artificial object and natural object or both concrete object and abstract entity. Nevertheless these distinctions are important and both pairs are in disjoint unions under classes very high in the ontology. What should be done about this?

Assorted Topics

is metaclass for

Peter F. Patel-Schneider

Some thoughts on metaclass relationships. Where can this be publicized?

The Wikidata ontology has a number of situations that are not easily captured with just subclass and instance links.

One example is invasive amphibian (Q111535327), described in English as "amphibian that is spreading outside its original habitat" but whose instances are species. Hence it is a subclass of (P279) invasive species (Q183368). But it is also stated to be a subclass of (P279) Amphibia (Q10908), whose instances are individual amphibians, so this subclass link is incorrect.

Instead, instances of instances of invasive amphibian are instances of Amphibia, i.e., instances of invasive amphibian are subclasses of Amphibia. There is an existing property in Wikidata for this relationship: is metaclass for (P8225).

But is metaclass for (P8225) is lacking support in Wikidata, as far as I know. Implications of statements using is metaclass for are not added to Wikidata. I don't know of any place that records those implications that are not already in Wikidata.
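The implication carried by is metaclass for (P8225) is simple to state: if M is metaclass for C and X is an instance of M, then X is a subclass of C. The sketch below computes the implied subclass links that are not yet stated, on toy data. The species name is a made-up placeholder, and only directly stated subclass of links are consulted, so links that are present only through inheritance or through parent taxon (P171) would still be reported.

```python
def implied_subclass_links(is_metaclass_for, instance_of, subclass_of):
    """Each argument maps an entity to a set of entities.
    Returns the implied (X, C) subclass links that are not stated directly."""
    implied = set()
    for x, metaclasses in instance_of.items():
        for m in metaclasses:
            for c in is_metaclass_for.get(m, ()):
                if c not in subclass_of.get(x, ()):
                    implied.add((x, c))
    return implied


# Toy data: invasive amphibian (Q111535327) is metaclass for Amphibia (Q10908).
is_metaclass_for = {"Q111535327": {"Q10908"}}
instance_of = {"some frog species": {"Q111535327"}}  # hypothetical species item
missing = implied_subclass_links(is_metaclass_for, instance_of, subclass_of={})
```

A report of such missing links would be one place to record the implications that are not already in Wikidata.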

I would like to use is metaclass for when it is appropriate, and remove the incorrect subclass links. I would also like to have is metaclass for better supported.

One extra complication for using is metaclass for in biology is that the stated relationship used there is parent taxon (P171), a subproperty of subclass of (P279). This makes querying for existing subclass of relationships very difficult.

Related Information

There is a page on what might be done to improve the Wikidata ontology at https://www.wikidata.org/wiki/Wikidata:Ontology_issues_prioritization. This was a result of a survey of Wikidata users on what problems they encountered when using the Wikidata ontology.

https://www.wikidata.org/wiki/Wikidata:Tools/Enhance_user_interface#Classification.js is a useful tool to show parts of the Wikidata ontology.