User:TweetsFactsAndQueries/Problems

This is a list of various WDQS queries that point to potential problems in Wikidata. Some of them can be fixed automatically, others need manual review.

(Most of these queries are also mirrored somewhere on mw:Wikibase/Indexing/SPARQL Query Examples.)

Automatic

Actors whose Spanish label ends with “ (actor)”

Query

Wikipedia article titles must be unique, so when multiple items have the same title, the article title is often disambiguated by adding a property of the item in parentheses, e. g. en:Mercury (element) vs. en:Mercury (mythology) vs. en:Mercury (planet). For persons, their (main) profession is frequently used for this.

On Wikidata, labels do not need to be unique (the unique identifier is the Q number), and so these additions are unnecessary. However, they are often still present, since the label was imported (by a bot) from the title of the corresponding Wikipedia article. In English, these labels mostly seem to have been fixed, but other languages retain high numbers of such titles.

The above query finds all actors whose Spanish label ends with “ (actor)”. Since it is vanishingly unlikely that this label addition is actually intentional, a bot could remove this suffix from all labels that the query returns. Of course, the query can also easily be adapted for other languages and professions.
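The label fix itself is simple string handling. The following is a minimal sketch of what such a bot might do to each label; the function name and the example labels are made up for illustration, and actually writing the label back to Wikidata is left out.

```python
SUFFIX = " (actor)"  # adapt for other languages and professions


def strip_suffix(label: str, suffix: str = SUFFIX) -> str:
    """Return the label without the trailing disambiguation suffix, if present."""
    if label.endswith(suffix):
        # Cut off the suffix and any whitespace left in front of it.
        return label[: -len(suffix)].rstrip()
    return label


print(strip_suffix("Juan Ejemplo (actor)"))  # hypothetical label
```

Only a suffix at the very end of the label is removed, so a label with “(actor)” in the middle is left alone for manual review.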

Labels containing HTML escape sequences

Query

This query finds all items where the label contains the text “&quot;”, which is the HTML entity for the double quotation mark. This is probably a bug in whatever bot created the item, and can be fixed automatically by replacing the entity with its value (the double quote). Other entities (amp, apos, lt, gt, etc.) can also be fixed.
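As a rough sketch of the replacement step, assuming only the five entities named above should be touched: the function and table names here are invented for the example. Python’s standard html.unescape would also work, but it decodes many more entities than these five, which may be more than a bot should change without review.

```python
import re

# Only the entities mentioned above; extend the table if other ones turn up.
ENTITIES = {"quot": '"', "amp": "&", "apos": "'", "lt": "<", "gt": ">"}
ENTITY_PATTERN = re.compile(r"&(quot|amp|apos|lt|gt);")


def unescape_label(label: str) -> str:
    """Replace each listed HTML entity with its character, in a single pass."""
    return ENTITY_PATTERN.sub(lambda match: ENTITIES[match.group(1)], label)


print(unescape_label("&quot;Example&quot; (song)"))  # hypothetical label
```

Because the substitution runs in a single pass, a doubly escaped label such as “&amp;quot;” becomes “&quot;” and will simply show up in the query again, rather than being silently unescaped twice.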

Mathematical formulae containing HTML escape sequences

Query

Same problem as above, this time in mathematical formula values: most likely the result of a broken data import.

British English spelling in English labels

Query

Wikidata’s “en” language code is usually taken to mean American English, as far as I understand (though I can’t find a reference for this). An English label that uses the British spelling “colour” should be changed to “color”, and a separate British English label with “colour” should be added.

URLs in page(s) (P304) references

Query

When a URL has been entered in a reference under the property page(s) (P304), it should be changed to reference URL (P854), which was probably the intention.
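Deciding whether a page(s) value is really a URL can be approximated with a simple pattern check. This is only a sketch: the function name and the pattern (a leading http://, https:// or www.) are my own assumptions about what counts as a URL here, not what the query itself uses.

```python
import re

# Assumption: a value is treated as a URL if it starts with a scheme or "www.".
URL_PATTERN = re.compile(r"^\s*(https?://|www\.)", re.IGNORECASE)


def looks_like_url(page_value: str) -> bool:
    """Return True if a page(s) (P304) value looks like a URL rather than pages."""
    return bool(URL_PATTERN.match(page_value))


print(looks_like_url("https://example.org/article"))
print(looks_like_url("12–15"))
```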

Manual

Person labels containing parentheses

Query

This is a more general version of #Actors whose Spanish label ends with “ (actor)”. It does not limit the search to actors (merely to humans), and matches any parentheses in the label. Since labels with parentheses are sometimes correct, this should not be fixed by a bot. However, a human could look over this list and fix any cases that stand out.

Instances of weapon

Query

This query finds all items that are an instance of (some subclass of) weapon (Q728). Most of the results are not actually instances (individual objects); for example, the Carcano (Q858434) is only a class of rifles (one instance would be the John F. Kennedy assassination rifle (Q2012291)). These results should be changed to use subclass of (P279) instead of instance of (P31).

Non-integer populations

Query

Populations should generally be whole numbers. The above query finds populations that have a fractional component; one likely explanation is that the decimal separator and the thousands separator are switched between some locales (for example, English writes 1,000.0, whereas German writes 1.000,0), and someone entered the population with thousands separators which were then misinterpreted as decimal separators.
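To illustrate the separator mix-up: a value such as 12.345 with exactly three decimal places was plausibly meant as 12345. The sketch below flags fractional values and suggests a correction under that assumption; the function names and the three-decimal heuristic are mine, and any suggestion still needs a human to check it against the source.

```python
from decimal import Decimal
from typing import Optional


def has_fraction(value: Decimal) -> bool:
    """Return True if the value is not a whole number."""
    return value != value.to_integral_value()


def guess_intended(value: Decimal) -> Optional[int]:
    """Suggest a whole number if the value looks like a misread thousands separator.

    Heuristic (an assumption): exactly three decimal places means the
    separator was probably meant as a thousands separator.
    """
    if has_fraction(value) and value.as_tuple().exponent == -3:
        return int(value * 1000)
    return None


print(guess_intended(Decimal("12.345")))
print(guess_intended(Decimal("12.5")))
```

A value like 12.5 gets no suggestion, since it does not fit the separator explanation and is more likely a different kind of error.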

Odd countries

Query

This query finds “odd” country (P17) statements. Many of these are arguably correct, because politics is simply complicated; however, some are also obvious mistakes, such as a country Persian (Q9168), which is the item of the Persian language, or JA (Q224881), which is a disambiguation and not an abbreviation for Japan (Q17).

Paintings on taxons

Query

This query finds paintings where the painting surface is some taxon, e. g. Populus (Q25356). Usually, the intended statement is that the painting surface is this taxon’s wood, e. g. poplar wood (Q291034).

Authors who have worked together but whose Erdős numbers are more than 1 apart

Query

One’s Erdős number is the lowest Erdős number of all scientists one has collaborated with, plus one. It follows that authors who have published a paper together should not be more than one apart in Erdős number. This query finds papers whose authors have Erdős numbers more than 1 apart.
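The consistency condition can be written down directly: if the smallest Erdős number among a paper’s authors is n, every co-author can have at most n + 1. A minimal sketch of that check follows, with a made-up function name and example numbers.

```python
from typing import Sequence


def erdos_numbers_inconsistent(numbers: Sequence[int]) -> bool:
    """Return True if co-authors' Erdős numbers differ by more than 1.

    Expects at least one number, one per author of the same paper.
    """
    return max(numbers) - min(numbers) > 1


print(erdos_numbers_inconsistent([1, 2, 2]))  # consistent
print(erdos_numbers_inconsistent([1, 3]))     # the author with 3 should have at most 2
```

The check cannot say which of the numbers is wrong (or whether the authorship statement is), so the results still need manual review.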

Descriptions that are just the default description

Query 1, Query 2

When you edit an item’s label, description, and aliases, the text box for the description displays a default text (in English, “enter a description in English”). A few items, for whatever reason, have this default text entered and saved as actual description. This is almost certainly an error. (You can adapt the query for other languages.)

Items that are simultaneously instance and subclass of the same class

Query

The decision whether to use instance of (P31) or subclass of (P279) can often be difficult; however, it is usually an error if an item is both an instance and a subclass of the same class.

Language statements that point to a country

Query

There are a variety of statements whose object should always be a language; if it’s a country instead, that’s probably an easy-to-correct mistake.

People with statements where start and end time are over 100 years apart

Query

Since humans only rarely live for over 100 years, it is likely that a statement about a person where the start time (P580) and end time (P582) qualifiers are over 100 years apart is an error (for example, entering 21999 instead of 1999, or 20013 instead of 2013). (Note that this is not always the case: according to the Japanese traditional order of succession, Emperor Kōan (Q312821) was actually in office for about 101 years.)
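When post-processing such results outside the query service, the comparison is easiest on plain year numbers. This is a sketch with a hypothetical function name; note that Python’s datetime.date cannot represent a year like 21999 (its maximum year is 9999), which is why integers are used here.

```python
def span_is_suspicious(start_year: int, end_year: int, limit: int = 100) -> bool:
    """Return True if the end year is more than `limit` years after the start year."""
    return end_year - start_year > limit


print(span_is_suspicious(1990, 21999))  # typo for 1999
print(span_is_suspicious(1990, 1999))
```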

Capitals that aren’t capitals

Query

An item with instance of (P31) capital city (Q5119) should probably also be the capital (P36) of something. (For a lot more results with potentially more false positives, add subclasses of capital: query.)

Statements with reason for deprecated rank (P2241) that aren’t deprecated

Query

A deprecated statement can have the reason for its deprecation specified with a reason for deprecated rank (P2241) qualifier. If a statement with this qualifier isn’t deprecated, something is probably amiss – either the statement is correctly no longer deprecated, in which case the qualifier should perhaps be removed, or the statement should be deprecated but isn’t for some reason, in which case its rank should perhaps be adjusted. (Notable exception: the Wikidata property example (P1855) statement on reason for deprecated rank (P2241) itself.)

Nonhuman CEOs

Query

Some CEO statements have an object that isn’t a human. Most of the time, this is a misuse of the property – it should go “company – CEO – person”, but these cases are entered as “person – CEO – company” (with the intention of “CEO of”).

People who have a date as place of birth

Query

Some items have a date item (e. g. January 1 (Q2150) or October 19 (Q2961)) as place of birth (P19), probably as a result of incorrectly parsing Wikipedia first sentences.

Humans with male / female creature in statements

Query

Some human items have male organism (Q44148) or female organism (Q43445) in statements. In sex or gender (P21), that should be male (Q6581097) or female (Q6581072); in other statements, it’s probably a mistake or vandalism.

Dates of birth with unknown year

Query

“point in time” properties, such as date of birth (P569), cannot contain a month and day with no year. If “February 10” is entered, it is interpreted as the month of February of the year 10 AD (with no day; precision: month). For the case of date of birth (P569), the property birthday (P3150) has been created, which can be used instead to link to an item for the birthday if the birthday is known but the year isn’t.
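In the Wikibase JSON representation, “February 10” entered this way ends up as the time string +0010-02-00T00:00:00Z with precision 10 (month). A rough way to spot such values is to look for month precision combined with a year that could be a day of the month. The function name and the 1 to 31 threshold below are my own heuristic, not necessarily what the query does.

```python
MONTH_PRECISION = 10  # Wikibase time precision: 9 = year, 10 = month, 11 = day


def looks_like_day_as_year(time: str, precision: int) -> bool:
    """Return True if a time value is month-precise with a year between 1 and 31.

    `time` is a Wikibase time string such as "+0010-02-00T00:00:00Z".
    """
    if precision != MONTH_PRECISION or not time.startswith("+"):
        return False
    # The year is everything between the sign and the first hyphen.
    year = int(time[1:].split("-", 1)[0])
    return 1 <= year <= 31


print(looks_like_day_as_year("+0010-02-00T00:00:00Z", 10))  # "February 10"
print(looks_like_day_as_year("+1910-02-00T00:00:00Z", 10))  # February 1910
```

Genuine dates of birth in the first 31 years AD with month precision would be false positives, but those should be rare enough to review by hand.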