Wikidata:WikiProject Ontology/Problems

Problem cases in existing knowledge representation by Wikidata

The purpose of this page is to identify specific classes, metaclasses, or other groups of classes within Wikidata that appear to have significant ontological problems, and to work out how we should resolve them. Difficult problems should probably be discussed on their own subpage.

items used as classes

Statements with instance of (P31) should only have classes as values. This rule is expressed as a property constraint, so violations should be listed at Wikidata:Database_reports/Constraint_violations/P31.

subclass/instance of loops

Ontologies should generally be tree-like. That is, they should have a partial order: "instance of" should point from an item to a group (class) of items, while "subclass of" should point from a smaller class to a larger class. Treating the instance-of and subclass-of relationships as edges of a graph, the resulting graph should be acyclic; there should be no loops. Unfortunately, a few loops have made it into Wikidata; this section is to document and address them.

Subclass/Instance of loops in Wikidata - autogenerated lists

General looping instance/subclass combination

This SPARQL query could in principle catch all the general loop cases, but it times out on the current WDQS:

    SELECT DISTINCT ?item WHERE { ?item (wdt:P31|wdt:P279)+ ?item . }
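Since the loop check times out as a live query, the same idea can be tried offline on an extracted edge list. The following is only a minimal sketch: the function name `find_loop_items` and the toy edges are hypothetical, and instance-of and subclass-of edges are treated alike, as in the property path above.

```python
def find_loop_items(edges):
    """Return the items that can reach themselves via one or more edges."""
    graph = {}
    for source, target in edges:
        graph.setdefault(source, set()).add(target)

    def reaches_itself(start):
        # Iterative depth-first search over outgoing edges from `start`.
        stack = list(graph.get(start, ()))
        seen = set()
        while stack:
            node = stack.pop()
            if node == start:
                return True
            if node not in seen:
                seen.add(node)
                stack.extend(graph.get(node, ()))
        return False

    return {item for item in graph if reaches_itself(item)}


# Hypothetical toy data: (item, target) pairs, P31 and P279 edges mixed.
edges = [
    ("A", "B"),
    ("B", "C"),
    ("C", "A"),  # closes the loop A -> B -> C -> A
    ("D", "A"),  # D leads into the loop but is not part of it
]
loop_items = find_loop_items(edges)
```

This runs one search per item, which is fine for small extracts; for a full dump a strongly-connected-components algorithm would likely be a better fit.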

spurious high-order metaclasses

Ontologies should generally not be very deep. That is, if the instances of a class are themselves classes, that makes the class a "metaclass". First-order metaclasses are not unusual. Second-order metaclasses (whose instances are first-order metaclasses) should be quite rare. Higher-order metaclasses may be needed but should be extremely rare; variable-order metaclasses may also be helpful (some definition of "class" would be one) but also should be rare.

Problems with higher order metaclasses

A class can be identified as such by itself being a subclass of another class, by having another class be its subclass, or most directly by having instances. This provides three different mechanisms for detecting higher-order classes as well, as the following queries illustrate.

These are only looking at direct instance-of relationships up the hierarchy. The most general query along these lines would look like, for example:

select DISTINCT ?item WHERE
  { ?metametaclass wdt:P31 ?item .
    ?metaclass wdt:P31/wdt:P279* ?metametaclass .
    ?class wdt:P31/wdt:P279* ?metaclass .
    ?otherclass wdt:P279 ?class . }

However, this times out in WDQS.
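The notion of metaclass order can also be illustrated offline. The sketch below is a simplification, not the query above: it follows direct instance-of links only (no subclass-of propagation), assumes the input has no loops, and uses hypothetical item names. An item with no instances gets depth 0, an ordinary class depth 1, a first-order metaclass depth 2, and so on.

```python
from functools import lru_cache

# Hypothetical toy data: direct instance-of (P31) pairs as (instance, class).
instance_of = [
    ("some_item", "some_class"),
    ("some_class", "first_order_metaclass"),
    ("first_order_metaclass", "second_order_metaclass"),
    ("other_item", "some_class"),
]

# Index the pairs by class so the instances of a class can be looked up.
instances = {}
for inst, cls in instance_of:
    instances.setdefault(cls, set()).add(inst)


@lru_cache(maxsize=None)
def instance_depth(item):
    """Length of the longest chain of direct instance-of links below `item`.

    Assumes the instance-of graph is acyclic; a loop would recurse forever.
    """
    below = instances.get(item, ())
    if not below:
        return 0
    return 1 + max(instance_depth(child) for child in below)


# Items at depth 3 or more are second-order metaclasses or higher.
high_order = {item for item in instances if instance_depth(item) >= 3}
```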

'concept'

concept (Q151885) comes up as a high-level metaclass in many subclass of/instance of trees; for example:

but concept then has two more levels above it that cross the instance-of (metaclass) leap:

symbol (Q80071) itself also appears frequently near the top of the ontology trees. This should probably be cleaned up.

classes with too many subclasses

Classes can have millions of instances (human (Q5) being a typical example in Wikidata). But in order to be useful abstractions, subclasses of a given class should be relatively limited in number. This should produce a reasonably understandable tree of groupings of whatever the class contains. Here is a list of classes with more than 1000 direct subclasses:

Anti-patterns from Multi-Level Modeling Theory

See Applying a Multi-Level Modeling Theory to Assess Taxonomic Hierarchies in Wikidata by F. Brasileiro et al. This paper was discussed on English Project Chat in March 2016. It lists several specific anti-patterns to check for, with associated SPARQL queries:

Anti-pattern 1

An item is an instance of a class, but is also classified (perhaps via several intermediate classes) as a subclass of the same class. This often indicates that "instance of" has been used where "subclass of" makes more sense; alternatively it may mean the class in question should be considered a metaclass whose instances are classes. There are a lot of issues like this in Wikidata right now.

The most general form to find these problems is:

select ?metaclass ?metaclassLabel (count(*) as ?count) WHERE {
    ?class wdt:P31 ?metaclass ;
           wdt:P279+ ?metaclass .
  service wikibase:label {
     bd:serviceParam wikibase:language "en" .
  }
} group by ?metaclass ?metaclassLabel order by DESC(?count)

but this general query times out. Some of the specific autogenerated lists may also be empty due to time-outs; when the queries work they all show many problems of this sort.
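The check behind anti-pattern 1 can be sketched offline on a small extract. This is only an illustration under assumptions: the helper names and the toy data are hypothetical, and the dictionaries map each item to its direct instance-of or subclass-of targets.

```python
def superclasses(item, subclass_of):
    """All classes reachable from `item` via one or more subclass-of links."""
    found, stack = set(), list(subclass_of.get(item, ()))
    while stack:
        cls = stack.pop()
        if cls not in found:
            found.add(cls)
            stack.extend(subclass_of.get(cls, ()))
    return found


def anti_pattern_1(instance_of, subclass_of):
    """Pairs (item, cls): item is an instance and a transitive subclass of cls."""
    return {
        (item, cls)
        for item, classes in instance_of.items()
        for cls in classes & superclasses(item, subclass_of)
    }


# Hypothetical toy data: X is an instance of Z and, via Y, also a subclass of Z.
instance_of = {"X": {"Z"}, "W": {"Z"}}
subclass_of = {"X": {"Y"}, "Y": {"Z"}}
hits = anti_pattern_1(instance_of, subclass_of)
```

Here only X is flagged; W is a plain instance of Z and is left alone.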

Anti-pattern 2

This is where a class C is a subclass (directly or via intermediate classes) of two classes A and B, and B is itself an instance of A.

The general form for this query (which again times out) is:

select ?classA ?classALabel (count(*) as ?count) WHERE {
    ?classC wdt:P279+ ?classA ;
            wdt:P279+ ?classB .
    ?classB wdt:P31 ?classA .
  service wikibase:label {
     bd:serviceParam wikibase:language "en" .
  }
} group by ?classA ?classALabel order by desc(?count)
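As with the other patterns, the check can be sketched offline on a small extract. The helper names and toy data below are hypothetical; the dictionaries map each item to its direct instance-of or subclass-of targets.

```python
def superclasses(item, subclass_of):
    """All classes reachable from `item` via one or more subclass-of links."""
    found, stack = set(), list(subclass_of.get(item, ()))
    while stack:
        cls = stack.pop()
        if cls not in found:
            found.add(cls)
            stack.extend(subclass_of.get(cls, ()))
    return found


def anti_pattern_2(instance_of, subclass_of):
    """Triples (c, a, b): c is a subclass of both a and b, and b is an instance of a."""
    hits = set()
    for c in subclass_of:
        supers = superclasses(c, subclass_of)
        for b in supers:
            # Keep only those classes of b that are also superclasses of c.
            for a in instance_of.get(b, set()) & supers:
                hits.add((c, a, b))
    return hits


# Hypothetical toy data: C is a subclass of A and B, while B is an instance of A.
instance_of = {"B": {"A"}}
subclass_of = {"C": {"A", "B"}, "D": {"A"}}
hits = anti_pattern_2(instance_of, subclass_of)
```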

Also note this inconclusive RFC on color class relationships from 2016 (color (Q1075) is one of the classes appearing most often in these lists).

Anti-pattern 3

Conflicting instance-of relations: C is an instance of A and B, but B is also an instance of A. The following query would fetch these cases:

SELECT ?classA (count(*) as ?count) WHERE {
  ?classC wdt:P31 ?classA;
          wdt:P31 ?classB .
  ?classB wdt:P31 ?classA .
  } group by ?classA order by desc(?count)

but again it times out. However, the paper mentioned above does list some specific cases to look into, and reports that there were over 7000 cases in all:

Central Park (Q160409) is considered an instance of both urban park (Q22746) and park (Q22698), while urban park is also an instance of park. This anti-pattern often occurs in chains with terms such as: award (Q618779), Chinese surname (Q1093580), family name (Q101352), Voivodeship road (Q1259617), Mikroregion (Q11781066) and natural region (Q1970725).
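The Central Park case can be used to sketch the check offline. The statements below are taken from the description above as reported in the paper and may no longer match live Wikidata; the function name `anti_pattern_3` is hypothetical.

```python
def anti_pattern_3(instance_of):
    """Triples (c, a, b): c is an instance of a and b, and b is an instance of a."""
    hits = set()
    for c, classes in instance_of.items():
        for b in classes:
            # Classes of b that c is also a direct instance of.
            for a in instance_of.get(b, set()) & classes:
                hits.add((c, a, b))
    return hits


# Direct instance-of statements as described in the text above.
instance_of = {
    "Central Park (Q160409)": {"urban park (Q22746)", "park (Q22698)"},
    "urban park (Q22746)": {"park (Q22698)"},
}
hits = anti_pattern_3(instance_of)
```

Each hit names the item, the more general class A and the intermediate class B, which is usually enough to decide which instance-of statement to drop or convert to subclass-of.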

Other noted problems

From Wikidata project chat, April 20 2016: "SQID as a tool for editors"