Wikidata:WikiProject Ontology/Problems
Problem cases in existing knowledge representation by Wikidata
The purpose of this page is to identify specific classes, metaclasses, or other groups of classes within Wikidata that appear to have significant ontological problems, and to work out how to resolve them. Difficult problems should probably be discussed on their own subpage.
items used as classes
Statements with instance of (P31) should only have classes as values. This rule is expressed as a property constraint, so violations should be listed at Wikidata:Database_reports/Constraint_violations/P31.
subclass/instance of loops
Ontologies should generally be tree-like. That is, they should have a partial order: "instance of" should point from an item to a group (class) of items, while "subclass of" should point from a smaller class to a larger class. Treating the instance-of and subclass-of relationships as edges of a directed graph, the resulting graph should be acyclic; there should be no loops. Unfortunately, a few loops have made it into Wikidata; this section documents them so they can be addressed.
Subclass/Instance of loops in wikidata - autogenerated lists
- Wikidata:WikiProject Ontology/Problems/subclass of self
- Wikidata:WikiProject Ontology/Problems/subclass of subclass of self
- Wikidata:WikiProject Ontology/Problems/3rd-order subclass of self
- Wikidata:WikiProject Ontology/Problems/4th-order subclass of self
- Wikidata:WikiProject Ontology/Problems/5th-order subclass of self
- Wikidata:WikiProject Ontology/Problems/6th-order subclass of self
- Wikidata:WikiProject Ontology/Problems/instance of self
- Wikidata:WikiProject Ontology/Problems/subclass of instance of self
- Wikidata:WikiProject Ontology/Problems/instance of 2nd-order subclass of self
- Wikidata:WikiProject Ontology/Problems/instance of 3rd-order subclass of self
- Wikidata:WikiProject Ontology/Problems/instance of 4th-order subclass of self
- Wikidata:WikiProject Ontology/Problems/instance of 5th-order subclass of self
General looping instance/subclass combination
This SPARQL query could in principle catch all the general loop cases, but it times out on the current WDQS:
# Any item that can reach itself through one or more P31/P279 edges
SELECT ?item WHERE {
  ?item (wdt:P31|wdt:P279)+ ?item .
}
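A narrower check is more likely to finish. The sketch below asks only whether one chosen item lies on a loop, instead of scanning every item. It assumes the prefixes that WDQS predefines (wd:, wdt:), and concept (Q151885) is used purely as an illustrative anchor, not because it is known to be on a loop. Whether it completes within the time limit will depend on how large the item's upward hierarchy is.

```sparql
# Is a single, chosen item part of any instance-of/subclass-of loop?
# wd:Q151885 is only an example anchor; substitute the item under investigation.
ASK {
  wd:Q151885 (wdt:P31|wdt:P279)+ wd:Q151885 .
}
```

The result is a plain true/false answer, so this is suited to confirming a suspected loop rather than discovering new ones.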
spurious high-order metaclasses
Ontologies should generally not be very deep. That is, if the instances of a class are themselves classes, that makes the class a "metaclass". First-order metaclasses are not unusual. Second-order metaclasses (whose instances are first-order metaclasses) should be quite rare. Higher-order metaclasses may be needed but should be extremely rare; variable-order metaclasses may also be helpful (some definition of "class" would be one) but also should be rare.
Problems with higher order metaclasses
A class can be identified as such by itself being a subclass of another class, by having another class be its subclass, or most directly by having instances. This provides three different mechanisms for detecting higher-order classes as well, as the following queries illustrate.
- Wikidata:WikiProject Ontology/Problems/3rd order metaclasses by subclass
- Wikidata:WikiProject Ontology/Problems/3rd order metaclasses by superclass
- Wikidata:WikiProject Ontology/Problems/3rd order metaclasses by instance
These lists only look at direct instance-of relationships up the hierarchy. The most general query along these lines would look like this, for example:
SELECT DISTINCT ?item WHERE {
  ?metametaclass wdt:P31 ?item .
  ?metaclass wdt:P31/wdt:P279* ?metametaclass .
  ?class wdt:P31/wdt:P279* ?metaclass .
  ?otherclass wdt:P279 ?class .
}
However, this times out in WDQS.
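A fixed-depth variant that follows only direct instance-of statements is cheaper. The sketch below is presumably close to what the "by subclass" list does, but that is an assumption: the actual queries behind the autogenerated lists are not shown on this page. The LIMIT is also an addition, included only to keep the result small.

```sparql
# Third-order metaclass candidates, using direct P31 links only.
# ?class counts as a class because some other item is a subclass of it.
SELECT DISTINCT ?item WHERE {
  ?otherclass    wdt:P279 ?class .
  ?class         wdt:P31  ?metaclass .
  ?metaclass     wdt:P31  ?metametaclass .
  ?metametaclass wdt:P31  ?item .
}
LIMIT 100
```

Because the wdt:P279* steps are dropped, this misses cases where the instance-of link is attached to a superclass rather than to the class itself.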
'concept'
concept (Q151885) comes up as a high-level metaclass in many subclass of/instance of trees; for example:
- champagne (Q134862) instance of wine (Q282) subclass of ... liquid (Q11435) instance of fundamental state of matter (Q15831576) subclass of ... state (Q3505845) instance of concept (Q151885)
but concept then has two more levels above it that cross the instance-of (metaclass) leap:
- concept (Q151885) subclass of mental representation (Q2145290) instance of symbol (Q80071) subclass of ... depiction (Q1166770) instance of physical object (Q223557)
symbol (Q80071) itself also appears frequently near the top of the ontology trees. This should probably be cleaned up.
classes with too many subclasses
Classes can have millions of instances (human (Q5) being a typical example in Wikidata). But in order to be useful abstractions, subclasses of a given class should be relatively limited in number. This should produce a reasonably understandable tree of groupings of whatever the class contains. Here is a list of classes with more than 1000 direct subclasses:
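A query along the following lines could regenerate such a list. It is a minimal sketch assuming the WDQS predefined prefixes; it counts only direct subclass of (P279) statements, and since it aggregates over every P279 statement it may itself run into the time limit.

```sparql
# Classes with more than 1000 direct subclasses, largest first.
SELECT ?class (COUNT(?sub) AS ?subclasses) WHERE {
  ?sub wdt:P279 ?class .
}
GROUP BY ?class
HAVING (COUNT(?sub) > 1000)
ORDER BY DESC(?subclasses)
```

Labels are deliberately left out to keep the query cheap; they can be looked up afterwards for the few classes returned.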
Anti-patterns from Multi-Level Modeling Theory
See "Applying a Multi-Level Modeling Theory to Assess Taxonomic Hierarchies in Wikidata" by F. Brasileiro et al. This paper was discussed on English Project Chat in March 2016. It lists several specific anti-patterns to check for, with associated SPARQL queries:
Anti-pattern 1
An item is an instance of a class, but is also classified (perhaps via several intermediate classes) as a subclass of the same class. This often indicates that "instance of" has been used where "subclass of" makes more sense; alternatively, it may mean the class in question should be considered a metaclass whose instances are classes. There are many issues like this in Wikidata right now.
- Wikidata:WikiProject Ontology/Problems/instance and subclass of same class
- Wikidata:WikiProject Ontology/Problems/instance and subclass of subclass of same class
- Wikidata:WikiProject Ontology/Problems/instance and 3-level subclass of same class
- Wikidata:WikiProject Ontology/Problems/instance and 4-level subclass of same class
- Wikidata:WikiProject Ontology/Problems/instance and 5-level subclass of same class
- Wikidata:WikiProject Ontology/Problems/instance and 6-level subclass of same class
The most general form to find these problems is:
SELECT ?metaclass ?metaclassLabel (COUNT(*) AS ?count) WHERE {
  ?class wdt:P31 ?metaclass ;
         wdt:P279+ ?metaclass .
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "en" .
  }
}
GROUP BY ?metaclass ?metaclassLabel
ORDER BY DESC(?count)
but this general query times out. Some of the specific autogenerated lists may also be empty because of time-outs; when the queries work, they all show many problems of this sort.
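The simplest case, where the item is a direct instance and a direct subclass of the same class, avoids the unbounded wdt:P279+ path. The sketch below presumably resembles the query behind the first list above, though that is an assumption; the label service is omitted here to reduce the load further.

```sparql
# Items that are both a direct instance and a direct subclass of one class,
# counted per class so the worst offenders appear first.
SELECT ?metaclass (COUNT(*) AS ?count) WHERE {
  ?class wdt:P31  ?metaclass ;
         wdt:P279 ?metaclass .
}
GROUP BY ?metaclass
ORDER BY DESC(?count)
```

Deeper cases can be reached by chaining a fixed number of wdt:P279 steps in place of the single one, at increasing cost.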
Anti-pattern 2
This is where a subclass C has two superclasses A and B that are related to one another by an instance-of relationship.
- Wikidata:WikiProject Ontology/Problems/pattern 2 direct superclasses
- Wikidata:WikiProject Ontology/Problems/pattern 2 indirect superclasses case 1
- Wikidata:WikiProject Ontology/Problems/pattern 2 indirect superclasses case 2
The general form for this query (which again times out) is:
SELECT ?classA ?classALabel (COUNT(*) AS ?count) WHERE {
  ?classC wdt:P279+ ?classA ;
          wdt:P279+ ?classB .
  ?classB wdt:P31 ?classA .
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "en" .
  }
}
GROUP BY ?classA ?classALabel
ORDER BY DESC(?count)
Also note this inconclusive RFC on color class relationships from 2016 (color (Q1075) is one of the classes appearing most often in these lists).
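Fixing class A to a single item is one way to make the general query tractable. The sketch below does this for color (Q1075), chosen only because it is named above as a frequent offender; the LIMIT is an added safeguard, and there is no guarantee the query finishes on WDQS.

```sparql
# Anti-pattern 2 restricted to classA = color (Q1075):
# ?classC is a subclass of color and also a subclass of some ?classB
# that is itself a direct instance of color.
SELECT DISTINCT ?classC ?classB WHERE {
  ?classB wdt:P31 wd:Q1075 .
  ?classC wdt:P279+ ?classB ;
          wdt:P279+ wd:Q1075 .
}
LIMIT 100
```

Starting from the instances of the fixed class keeps the search small, since only their subclass trees need to be explored.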
Anti-pattern 3
Conflicting instance-of relations: C is an instance of A and B, but B is also an instance of A. The following query would fetch these cases:
SELECT ?classA (COUNT(*) AS ?count) WHERE {
  ?classC wdt:P31 ?classA ;
          wdt:P31 ?classB .
  ?classB wdt:P31 ?classA .
}
GROUP BY ?classA
ORDER BY DESC(?count)
but again it times out. However, the paper mentioned above lists some specific cases to look into and reports over 7000 cases in all:
Central Park (Q160409) is considered an instance of both urban park (Q22746) and park (Q22698), while urban park is also an instance of park. This anti-pattern often occurs in chains with terms such as: award (Q618779), Chinese surname (Q1093580), family name (Q101352), Voivodeship road (Q1259617), Mikroregion (Q11781066) and natural region (Q1970725).
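Individual suspects can be checked one class at a time, which sidesteps the timeout of the general query. The sketch below fixes class A to park (Q22698) from the Central Park example; the LIMIT is an addition to keep the output manageable, and the same shape should work for the other classes listed above by swapping the item ID.

```sparql
# Anti-pattern 3 restricted to classA = park (Q22698):
# ?item is a direct instance of park and of ?classB,
# while ?classB is itself a direct instance of park.
SELECT ?item ?classB WHERE {
  ?classB wdt:P31 wd:Q22698 .
  ?item   wdt:P31 ?classB ;
          wdt:P31 wd:Q22698 .
}
LIMIT 100
```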
Other noted problems
From wikidata project chat April 20 2016: "SQID as a tool for editors"
- Classes that have large numbers of direct subclasses (say >300) and also have a small number of direct instances. These seem to indicate modelling issues in almost all cases (in the case of large numbers of both direct classes and instances, this is again the problem of items that are subclasses and instances of another class at the same time). Moreover, almost all cases where a class has more than 100 direct subclasses suggest that some more subclasses could be useful to hierarchically group things into smaller collections.
- Subclasses of Q5 that have an instance. You can see them in the class browser, or on the Q5 class page. Most of them should be changed, e.g., using occupation (P106).
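The second point can also be checked directly on WDQS. The following is a minimal sketch that considers only direct subclasses of human (Q5); extending it to indirect subclasses with wdt:P279+ is possible but would be more expensive.

```sparql
# Direct subclasses of human (Q5) that have at least one direct instance,
# with the number of instances for each.
SELECT ?class (COUNT(?instance) AS ?instances) WHERE {
  ?class    wdt:P279 wd:Q5 .
  ?instance wdt:P31  ?class .
}
GROUP BY ?class
ORDER BY DESC(?instances)
```

Each row is a candidate for cleanup: the instances would typically be changed to instance of human (Q5), with the former class expressed through a property such as occupation (P106).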