User:SCIdude/Protein bugs

Bugs in protein (Q8054) or gene (Q7187) objects, and their fixes

Unresolved

De-merge RNA and gene

  • 88,855 inst-of ncRNA, 88,621 also inst-of gene, having gene ids and RNA ids (P639) mixed, also inst-of ncRNA
  • 6,249 of them have no RefSeq RNA ID, cannot be created from the gene item alone, just remove the (4,331) P31s for now (1) and the (6,249) P279 (2). Later fix the descriptions.

Cell component duplicates from FMA

SELECT DISTINCT ?item ?label WHERE {
  ?item wdt:P1402 [].
  MINUS { ?item wdt:P31 wd:Q5058355 }
  MINUS { ?item wdt:P279 wd:Q5058355 }
  MINUS { ?item schema:description ?desc.
        FILTER(lang(?desc) = 'en') }
  ?item wdt:P279/wdt:P279* wd:Q66557947.
  ?item rdfs:label ?label.
  FILTER(lang(?label) = 'en')
}

Also filter those that match names with GO entities.

Possible misplacements of MeSH ID

  • MeSH entries are usually species-independent, but more than 1,400 protein items carry a MeSH ID:
SELECT ?item ?itemLabel ?mname
WHERE 
{
    ?item wdt:P31 wd:Q8054.
    ?item p:P486 ?mesh.
    OPTIONAL { ?mesh pq:P1810 ?mname. }
  	SERVICE wikibase:label { bd:serviceParam wikibase:language "en" } .
}

Reactome R-MTU imports marked as human

  • 2019-Oct-01
  • 21 proteins, complexes, sets from M.tuberculosis have got "found in taxon"-->human
SELECT ?p ?pLabel ?tLabel
 WHERE
 {
   ?p wdt:P703 wd:Q15978631 .
   OPTIONAL{ ?p wdt:P31 ?t . }
   ?p wdt:P2888 ?url .
   FILTER ( STRSTARTS(STR(?url), 'https://identifiers.org/reactome:R-MTU') ).
   SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
 }

Reactome protein complex duplicates

Preproproteins (human) pt. I

  • 2019-Jul-30
  • Description: Proteins that in UniProt have the keyword "Cleavage on pair of basic residues [KW-0165]" but the WD object misses the protein precursor (Q258658) class. NOTE: reviewed UniProt (SwissProt) entries contain precursors AND their products, while in TrEMBL the products have different entries.
  • Example: [1]
  • Talk page(s): 1
  • Reason (guess): complete UniProt import was too difficult
  • Number of human proteins affected: unknown
  • Code to view UniProt-associated items:
SELECT ?item ?itemLabel ?uniprotid
WHERE
{
	?item wdt:P352 ?uniprotid ;
          wdt:P703 wd:Q15978631 .
  
	SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }
}
  • Proposed fix: for each member of a given set of preproprotein UniProtIDs 1. find the corr. WD item, and 2. add the instance statement. If the corr. item does not exist, add it.
  • Script that produces a QuickStatement batch from above data and the UniProt search TSV:
  • Efforts after this description was made: manual work ongoing. Actually, we started to create fragment objects for every nontrivial fragment
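
The proposed fix above could be sketched roughly as follows. This is only a minimal sketch, not the script that was actually used: it assumes the query result is available as CSV with the columns `item` and `uniprotid`, and that the preproprotein accessions from the UniProt search have already been collected into a set. The function names, the demo Q-ids and the demo accessions are hypothetical. It neither creates missing items nor checks for an existing P31 statement.

```python
import csv
import io

PROTEIN_PRECURSOR = "Q258658"  # protein precursor class named in the description


def qid_from_url(url):
    """Return the part after the last '/' of an entity URL (the Q-id)."""
    return url[url.rfind("/") + 1:]


def precursor_statements(query_csv, uniprot_ids):
    """Yield one QuickStatements line per row whose UniProt ID is in uniprot_ids."""
    for row in csv.DictReader(query_csv):
        if row["uniprotid"] in uniprot_ids:
            yield "{}|P31|{}".format(qid_from_url(row["item"]), PROTEIN_PRECURSOR)


# Hypothetical demo data standing in for the real query result and UniProt list.
demo_csv = io.StringIO(
    "item,itemLabel,uniprotid\n"
    "http://www.wikidata.org/entity/Q1000001,example A,P00001\n"
    "http://www.wikidata.org/entity/Q1000002,example B,P00002\n"
)
demo_lines = list(precursor_statements(demo_csv, {"P00002"}))
for line in demo_lines:
    print(line)
```

Accessions that never show up in the query result are silently skipped here; those would be the items that still need to be created.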

arbitrary symbols on unlocated Entrez genes

There are gene entries from Entrez where only the chromosome position is known, and so no official gene name can be given. It seems Entrez then just took symbols from OMIM, regardless of whether that was a gene or phenotype entry, and used them as the gene symbol. Example: TEC (Q26241247) from Entrez 100124696, where the symbol is from OMIM 227050 (Transient erythroblastopenia of childhood) and collides with TEC (Q18031939).

Such items should be marked somehow, maybe genomic start/end ---> unknown

Practically irrelevant or misguided

UniProt ID but not instance of protein/peptide

SELECT ?item ?itemLabel 
WHERE
{
  ?item wdt:P352 ?uid .
  MINUS { ?item wdt:P31 [] } .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" . }
}
  • Note: there may be protein groups covered by a UniProt ID
  • number of hits: 23,303 (2019-Aug-15), 1 (2019-Aug-17)
  • code to output QS:
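import csv
import sys

# Read the query result (CSV) from stdin and emit one QuickStatements line per item.
reader = csv.DictReader(sys.stdin, delimiter=',')
for item in reader:
    iturl = item.get('item')
    it = iturl[iturl.rfind('/')+1:]   # keep only the Q-id of the entity URL
    print("{}|P31|Q8054".format(it))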

Determination method on refs

#SELECT (COUNT(?item) AS ?count) WHERE {
SELECT DISTINCT ?item ?stmt1 ?meth ?ref WHERE {
  ?item wdt:P31 wd:Q8054 .
  ?item p:P680 ?stmt1 .
  ?stmt1 prov:wasDerivedFrom [ pr:P459 ?ref ] .
  ?stmt1 pq:P459 ?meth .
}
  • Proposed fix:
  • Script that produces a QuickStatement batch from above data:
  • Efforts after this description was made:

Aliases

  • see Wikidata:Bot_requests#Protein_aliases
  • alias identical to label, remove
  • alias: "hypothetical protein", move/append to description (UniProt actually lists this as name...)
  • also aliases of form Uniprot:xyz (insulin)
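
The three cases listed above could be told apart with a small helper like the one below. This is only a sketch for triaging the aliases of a single item: the function name and the category names are made up for illustration, and the case-insensitive matching is an assumption.

```python
def triage_aliases(label, aliases):
    """Sort aliases into the cases listed above; everything else is kept."""
    result = {"remove": [], "to_description": [], "uniprot_prefixed": [], "keep": []}
    for alias in aliases:
        if alias == label:
            # alias identical to label
            result["remove"].append(alias)
        elif alias.lower() == "hypothetical protein":
            # candidate for moving/appending to the description
            result["to_description"].append(alias)
        elif alias.lower().startswith("uniprot:"):
            # aliases of the form Uniprot:xyz
            result["uniprot_prefixed"].append(alias)
        else:
            result["keep"].append(alias)
    return result


demo = triage_aliases(
    "insulin", ["insulin", "Uniprot:xyz", "hypothetical protein", "INS"]
)
print(demo)
```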

GeneDB ID as label

  • 2019-Aug-02
  • Description: Proteins having (en)labels identical to their GeneDB ID.
  • Example: https://www.wikidata.org/w/index.php?title=Q62305547&oldid=990962483 got "EmuJ_001072400.1" instead of "hypothetical protein" (which is the name given by UniProt)
  • Talk page(s): 1
  • Reason (guess): unknown
  • Problem recognized by bot/batch maintainer: unknown
  • Bug in bot fixed: unknown
  • In the following, the example species is Echinococcus multilocularis. It makes sense to partition the task by species because at the moment it appears that only proteins scraped from GeneDB are affected, and these exist only for a few invertebrates. Also, the query server has much to do, and we want to be nice.
  • Number of proteins affected: 10659 for Emu alone...
  • Code to view affected proteins:
SELECT DISTINCT ?item ?itemLabel ?itemAl WHERE {
  ?item wdt:P31 wd:Q8054 .
  ?item wdt:P703 wd:Q669922 .
  ?item wdt:P3382 ?str2 .
  ?item rdfs:label ?itemLabel .
  FILTER (STR(?itemLabel) = ?str2) .
  ?item skos:altLabel ?itemAl .
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "en" .
  }
}
  • Proposed fix: for each affected species, manually check affected items and replace the label with the alias.
  • Script that produces a QuickStatement batch from above data:
from sys import *
import csv

reader = csv.DictReader(stdin, delimiter='\t')
for item in reader:
    itemstr = item.get('item')
    itemid = itemstr[itemstr.rfind('/')+1:]
    label = item.get('itemLabel')
    al = item.get('itemAl')
    if (al[:9] != "expressed"
            and al[:9] != "conserved"
            and al[:12] != "hypothetical"):
        print("{}|Len|\"{}\"".format(itemid, al),
            file=stdout)
        print("{}|Aen|\"{}\"".format(itemid, label),
            file=stdout)
  • Efforts after this description was made: [2](Emu), [3] (Lin)
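
Partitioning the task by species, as described above, amounts to running the same query with a different taxon each time. A minimal sketch of a query builder follows. The template repeats the query above without the label service, on the assumption that `rdfs:label` already binds `?itemLabel`; the function name is hypothetical.

```python
QUERY_TEMPLATE = """SELECT DISTINCT ?item ?itemLabel ?itemAl WHERE {{
  ?item wdt:P31 wd:Q8054 .
  ?item wdt:P703 wd:{taxon} .
  ?item wdt:P3382 ?str2 .
  ?item rdfs:label ?itemLabel .
  FILTER (STR(?itemLabel) = ?str2) .
  ?item skos:altLabel ?itemAl .
}}"""


def query_for_taxon(taxon_qid):
    """Return the affected-proteins query restricted to one taxon item."""
    if not (taxon_qid.startswith("Q") and taxon_qid[1:].isdigit()):
        raise ValueError("not a Q-id: {!r}".format(taxon_qid))
    return QUERY_TEMPLATE.format(taxon=taxon_qid)


# Q669922 is the taxon used in the query above.
demo_query = query_for_taxon("Q669922")
print(demo_query)
```

Running one such query per species keeps each request small, which is the point of being nice to the query server.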

UniProt ID but no encoding gene

  • 2019-Aug-15
  • Code to view UniProt-associated items:
SELECT ?item ?itemLabel 
WHERE
{
  ?item wdt:P352 ?uid .
  #?item wdt:P703 wd:Q15978631 .
  MINUS { ?item wdt:P702 [] } .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" . }
}
  • number of hits: 29730 (2019-Aug-16)
  • reason is mainly that the organism/gene is uninteresting. Restricting to human (331), mouse (13), rat (1)
  • reasons for those: gene/protein is in doubt/unknown, UniProt on gene entry, mobile element (=no database gene entry), part of antibody (why?)
  • manual fixes necessary for some of these. Remaining are: human (323), mouse (8)

instance + subclass of protein

SELECT ?stmt
WHERE 
{
  ?item wdt:P31 wd:Q8054.
  ?item p:P279 ?stmt. 
  ?stmt ps:P279 wd:Q8054.
}

Permanent maintenance jobs

obsolete UniProt IDs not marked

obsolete GO entries

Duplicate UniProt IDs

inexact GO synonyms

proteins with instance-of a main family

  • potential idiotic mergers
SELECT DISTINCT ?item
{
  VALUES ?class { wd:Q84467700 wd:Q67015883 wd:Q417841 wd:Q7251477 wd:Q67101749 wd:Q68461428 }
  ?item wdt:P31 ?class.
  ?item wdt:P31 wd:Q8054.
}

Fixed in database

Objects being instance of both gene and protein

  • 2019-Jul-31
  • Number of objects affected: 29 (2019-Jul-31), 0 (2019-Aug-01)
  • Code to view affected proteins:
SELECT (COUNT(?item) AS ?count) WHERE {
#SELECT DISTINCT ?item WHERE {
  ?item wdt:P31 wd:Q8054 .
  ?item wdt:P31 wd:Q7187 .
}
  • Efforts after this description was made: manual edits around 29 items (until 2019-aug-01)

Natural peptides without encoding genes/taxon/UniProt

  • 2019-Aug-05
  • Description: see title. Mostly they have en-WP articles with UniProt/Interpro/Pfam entries. Peptides are niche articles, were all scraped from WP anyway (14 with, 4 without en-WP article).
  • can be manually resolved
SELECT DISTINCT ?item ?article ?itemLabel WHERE {
  ?item wdt:P31 wd:Q172847 .
#  ?article schema:about ?item  ; schema:inLanguage "en"
#  FILTER NOT EXISTS { ?wen schema:about ?item ; schema:inLanguage "en" }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}

Gene/protein associations with different taxa

  • 2019-Aug-01
  • CAUTION: closely related bacterial strains may have identical proteins; these should be excluded
  • Number of objects affected: 3368 (2019-Aug-01) (C.botulinum diff. strains: 2758, B.anthracis diff. strains: 606)
  • Code to view affected proteins:
SELECT (COUNT(?itemp) AS ?count) WHERE {
#SELECT DISTINCT ?itemp ?itemg ?taxp ?taxg WHERE {
  ?itemp wdt:P31 wd:Q8054 .
  ?itemp wdt:P702 ?itemg .
  ?itemp wdt:P703 ?taxp .
  ?itemg wdt:P703 ?taxg .
  FILTER (?taxp != ?taxg)
  }
  • Efforts after this description was made: one case human/mouse edited (2019-Aug-01)

UniProt ID but no taxon

  • 2019-Aug-15
  • Code to view UniProt-associated items:
SELECT ?item ?itemLabel 
WHERE
{
  ?item wdt:P352 ?uid .
  MINUS { ?item wdt:P703 [] } .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" . }
}
  • number of hits: 28, manually resolved

Duped labels (proteins)

  • 2019-Jul-26
  • Description: Proteins having labels consisting of two words, differing only by case, or not at all. Usually one of the words is identical to an alias.
  • Example: [4] has the label "lmo0063 lmo0063"
  • Talk page(s): 1 2
  • Reason (guess): unknown
  • Problem recognized by bot/batch maintainer: unknown
  • Bug in bot fixed: unknown
  • Number of proteins affected: 11434 (2019-Jul-27), 0 (2019-Aug-17)
  • Code to view affected proteins:

NOTE: this gives a query timeout now (2019-Aug-16), so should be checked differently before resolving!

#SELECT (COUNT(?item) AS ?count) WHERE {
SELECT DISTINCT ?item ?itemLabel (lang(?itemLabel) AS ?itemLabel_lang) ?str1 ?str2 WHERE {
  ?item wdt:P31 wd:Q8054 .
  ?item rdfs:label ?itemLabel .
  FILTER CONTAINS(?itemLabel, " ") .
  BIND (STRBEFORE(?itemLabel, " ") AS ?str1) .
  BIND (STRAFTER(?itemLabel, " ") AS ?str2) .
  FILTER (STRLEN(?str1) = STRLEN(?str2)) .
  FILTER (?str1 = ?str2 || LCASE(?str1) = LCASE(?str2)) .
}
  • Proposed fix: for each such (object,label) tuple, replace the label with the uppercase version of the two words. Rationale: protein symbols are always uppercase (Ref: 1).
  • Script that produces a QuickStatement batch from above data:
from sys import *
import csv

def upcase(s): return s[:1].upper() + s[1:]

reader = csv.DictReader(stdin, delimiter=',')
for item in reader:
    itemstr = item.get('item')
    itemid = itemstr[itemstr.rfind('/')+1:]
    lang = item.get('itemLabel_lang')
    str1 = upcase(item.get('str1'))
    str2 = upcase(item.get('str2'))
    if (str1 == str2):
        print("{}|L{}|\"{}\"".format(itemid, lang, str1),
            file=stdout)
  • Efforts after this description was made: QS #16460 #16461
  • Alternative way to check database: get all protein labels
SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31 wd:Q8054 .
  ?item rdfs:label ?itemLabel. FILTER( LANG(?itemLabel)="en" )
}

and run through:

from sys import *
import csv

reader = csv.DictReader(stdin, delimiter='\t')
for item in reader:
    itemstr = item.get('itemLabel')
    if itemstr is None:
        continue
    pos = itemstr.find(' ')
    if pos<0:
        continue
    str1 = itemstr[:pos]
    str2 = itemstr[pos+1:]
    if (str1 == str2):
        print(itemstr)
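
The script above only catches labels whose two words are exactly identical, while the description also covers words differing only by case. A case-insensitive variant of the test might look like the sketch below; the function name is hypothetical.

```python
def is_duped_label(label):
    """True if label is two words around the first space that are equal ignoring case."""
    first, sep, second = label.partition(" ")
    # 'sep' is empty when the label has no space; 'first' is empty when it starts with one.
    return bool(first) and bool(sep) and first.lower() == second.lower()


for demo_label in ["lmo0063 lmo0063", "Lmo0063 lmo0063", "lmo0063", "a b"]:
    print(demo_label, is_duped_label(demo_label))
```

Like the SPARQL query, this splits at the first space only, so labels of three or more words never match.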

Duped labels (genes)

  • 2019-Jul-30
  • Description: Genes having labels consisting of two identical words. Usually one of the words is identical to an alias.
  • Example: [5] has the label "lmo0819 lmo0819"
  • Talk page(s): 1 2
  • Reason (guess): unknown
  • Problem recognized by bot/batch maintainer: unknown
  • Bug in bot fixed: unknown
  • Number of genes affected: 11560 (2019-Jul-30), 0 (2019-Aug-17)
  • Code to view affected items:

NOTE: this gives a query timeout now (2019-Aug-16), so should be checked differently before resolving!

#SELECT (COUNT(?item) AS ?count) WHERE {
SELECT DISTINCT ?item ?itemLabel (lang(?itemLabel) AS ?itemLabel_lang) ?str1 ?str2 WHERE {
  ?item wdt:P31 wd:Q7187 .
  ?item rdfs:label ?itemLabel .
  FILTER CONTAINS(?itemLabel, " ") .
  BIND (STRBEFORE(?itemLabel, " ") AS ?str1) .
  BIND (STRAFTER(?itemLabel, " ") AS ?str2) .
  FILTER (STRLEN(?str1) = STRLEN(?str2)) .
  FILTER (?str1 = ?str2) .
} # TIMEOUT!
  • Proposed fix: for each such (object,label) tuple, replace the label with the single word.
  • Script that produces a QuickStatement batch from above data:
import csv
import sys

reader = csv.DictReader(sys.stdin, delimiter=',')
for item in reader:
    itemstr = item.get('item')
    itemid = itemstr[itemstr.rfind('/')+1:]
    lang = item.get('itemLabel_lang')
    str1 = item.get('str1')   # the two halves of the label, as returned by the query
    str2 = item.get('str2')
    if (str1 == str2):
        print("{}|L{}|\"{}\"".format(itemid, lang, str1),
            file=sys.stdout)
  • Efforts after this description was made: #16494
  • Alternative way to check database: get all gene labels
SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31 wd:Q7187 .
  ?item rdfs:label ?itemLabel. FILTER( LANG(?itemLabel)="en" )
}

and run through:

from sys import *
import csv

reader = csv.DictReader(stdin, delimiter='\t')
for item in reader:
    itemstr = item.get('itemLabel')
    if itemstr is None:
        continue
    pos = itemstr.find(' ')
    if pos<0:
        continue
    str1 = itemstr[:pos]
    str2 = itemstr[pos+1:]
    if (str1 == str2):
        print(itemstr)

Human genes without Entrez gene ID

  • 2019-08-19
SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31 wd:Q7187 .
  ?item wdt:P703 wd:Q15978631 .
  MINUS { ?item wdt:P351 [] } .
  ?item rdfs:label ?itemLabel. FILTER( LANG(?itemLabel)="en" )
}

showed 4 hits which were resolved manually

Objects with HGNC symbol but not instance of anything

  • 2019-Aug-19
SELECT ?item ?itemLabel WHERE {
  ?item wdt:P353 ?dum .
  MINUS { ?item wdt:P31 [] } .
  ?item rdfs:label ?itemLabel. FILTER( LANG(?itemLabel)="en" )
}
  • Hits: 186 (2019-Aug-19), 0 (2019-Aug-20)
  • wildly different objects, all have sitelinks, quick creations?
  • manually resolved

Genes: missing OMIM gene entry

  • 2019-Aug-21
  • Using an OMIM/Entrez index download and the output of
SELECT ?item ?itemLabel ?geneid WHERE {
  ?item wdt:P31 wd:Q277338 .
#  ?item wdt:P31 wd:Q7187 .
  ?item wdt:P703 wd:Q15978631 .
  ?item wdt:P351 ?geneid .
  ?item rdfs:label ?itemLabel. FILTER( LANG(?itemLabel)="en" )
}

we used the script https://gist.github.com/rwst/760ee4454d306c4f619053bf5798becd to create the QS batches #17650, #17652, #17675, #17676, #17679, #17680, #17682

diseases with "anatomical location" instead of "location"

  • 2019-Aug-23
SELECT DISTINCT ?item ?itemLabel WHERE {
	?statement wikibase:hasViolationForConstraint wds:P927-22a699be-4c01-5ee5-4295-81f6ac028a65 .
	?item ?p ?statement .
       ?item wdt:P279+ wd:Q12136 .
	FILTER( ?item NOT IN ( wd:Q4115189, wd:Q13406268, wd:Q15397819 ) ) .
	SERVICE wikibase:label { bd:serviceParam wikibase:language "en" } .
}
  • 12 hits, manually resolved

Reactome pathway labels with semicolon substitution

  • 2019-Aug-23... that's what happens when you pipe strings containing commas through csv
SELECT ?item ?itemLabel
WHERE
{
  ?item wdt:P31 wd:Q4915012 .
  ?item rdfs:label ?itemLabel .
  FILTER( LANG(?itemLabel)="en" ) .
  FILTER( CONTAINS(?itemLabel, ";") ) .
}
  • 81 hits, script:
from sys import *
import csv

reader = csv.DictReader(open('t.tab', 'r'), delimiter='\t')
for item in reader:
    iturl = item.get('item')
    it = iturl[iturl.rfind('/')+1:]
    l = item.get('itemLabel')
    print('{}|Len|"{}"'.format(it, l.replace(';',',')))
  • #17750
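
For illustration, labels containing commas survive a CSV round trip when both the writing and the reading side use a real CSV parser, which quotes such fields. The sketch below shows this with Python's `csv` module. The Q-id and the label are made up, and whether the original pipeline could have used this approach is an assumption.

```python
import csv
import io


def write_rows(rows):
    """Serialize rows as CSV; fields containing commas are quoted automatically."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


def read_rows(text):
    """Parse CSV text back into a list of rows."""
    return list(csv.reader(io.StringIO(text)))


# Hypothetical item and label, chosen only because the label contains commas.
demo_rows = [["Q1000003", "Signaling by A, B and C"]]
demo_text = write_rows(demo_rows)
print(demo_text, end="")
print(read_rows(demo_text) == demo_rows)
```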

superfluous subclass-of-gene without ref

SELECT ?item
WHERE
{
  ?item wdt:P279 wd:Q277338 .
  ?item p:P279 ?c .
  ?c ps:P279 wd:Q7187 .
  MINUS {?c prov:wasDerivedFrom [pr:P248 ?ref]} .
}
  • #17833 (201)

ncRNA with "encodes"

SELECT ?item ?itemLabel
WHERE
{
  ?item wdt:P279 wd:Q427087 .
  ?item wdt:P688 ?d .
}
  • 55 hits, manually resolved, some unresolvable like Q18051378

specialize subclass claims

general case

SELECT ?item ?itemLabel ?inst ?instLabel
WHERE
{
  ?item wdt:P279 wd:Q8054 .
  ?item wdt:P31 ?inst .
  ?inst wdt:P279 wd:Q8054 .
  ?item rdfs:label ?itemLabel . FILTER( LANG(?itemLabel)="en" )
  ?inst rdfs:label ?instLabel . FILTER( LANG(?instLabel)="en" )
}
  • #17740

proteins: specialized from GOA function (transporter activity --> transport protein)

SELECT ?item ?itemLabel ?inst ?instLabel
WHERE
{
  ?item wdt:P279 wd:Q8054 .
  ?item wdt:P31 wd:Q8054 .
  ?item wdt:P680 ?inst .
  ?inst wdt:P279+ wd:Q14864384 .
  ?item rdfs:label ?itemLabel . FILTER( LANG(?itemLabel)="en" )
  ?inst rdfs:label ?instLabel . FILTER( LANG(?instLabel)="en" )
}
  • #17745, #17746, #17772, #17805, #17826, #17844, #17882, #17884

subclass of proteins and encoded without instance of protein

s2 = set(open('wd-subc-of-prot', 'r').readlines())
s3 = set(open('wd-inst-of-prot', 'r').readlines())
s4 = set(open('wd-encoded', 'r').readlines())
for i in s2.intersection(s4).difference(s3):
    print(i)
  • 50 cases, manually fixed 2019-Aug-31, leaving 3 valid exceptions

RefSeq protein missing valid UniProt

  • 2019-Sep-01
import csv

# Map each Wikidata item to its RefSeq protein ID; items with conflicting IDs are dropped.
reader = csv.DictReader(open('refseqp-wd.tab', 'r'), delimiter='\t')
refs = {}
dups = set()
for item in reader:
    uid = item.get('refseq')
    iturl = item.get('item')
    it = iturl[iturl.rfind('/')+1:]
    git = refs.get(it)
    if git is None or git == uid:
        refs[it] = uid
    else:
        dups.add(it)
for k in dups:
    refs.pop(k)

# Map each RefSeq ID (version suffix stripped) to its UniProt ID; ambiguous ones are dropped.
reader = csv.DictReader(open('uniprot-refseq.tab', 'r'), delimiter='\t')
unips = {}
dups = set()
for item in reader:
    uid = item.get('uniprot')
    if '-' in uid:   # skip accessions containing '-' (isoform identifiers)
        continue
    ref = item.get('refseq')
    if ref.find('.') > -1:
        ref = ref[:ref.find('.')]
    git = unips.get(ref)
    if git is None or git == uid:
        unips[ref] = uid
    else:
        dups.add(ref)
for k in dups:
    unips.pop(k)

# Emit a P352 statement for every listed item that has an unambiguous mapping.
ids = set(l.rstrip() for l in open('wd-refseq-without-uniprot', 'r').readlines())
for it in ids:
    r = refs.get(it)
    if r is not None:
        u = unips.get(r)
        if u is not None:
            print('{}|P352|"{}"'.format(it, u))
  • hits: 4,598 all from MicrobeBot imports of Myxococcus and Chlamydia; that leaves those without valid UniProt
  • batch: #17948

genes with EC number

SELECT ?item ?ec WHERE {
 ?item wdt:P591 ?ec .
 ?item wdt:P2888 ?url .
 FILTER CONTAINS(STR(?url), 'ncbigene')
}
  • 272 hits, script:
import csv

reader = csv.DictReader(open('wd-genes-with-ec.tab', 'r'), delimiter='\t')
for item in reader:
    ec = item.get('ec')
    iturl = item.get('item')
    it = iturl[iturl.rfind('/')+1:]
    # the leading '-' tells QuickStatements to remove the statement
    print('-{}|P591|"{}"'.format(it, ec))
  • batch #17966

Stubs from early days

  • 2019-Dec-02
  • The following queries come up with a lot of items dumped from enwiki people; common theme: "found in taxon" "Homo sapiens"
SELECT DISTINCT ?p ?pLabel WHERE {
 ?p wdt:P703 wd:Q15978631 .
 ?article 	schema:about ?p ;
			schema:isPartOf <https://en.wikipedia.org/> .
 MINUS { 
   ?p p:P703 ?stmt.
   ?stmt prov:wasDerivedFrom []
 }
 MINUS { ?p wdt:P31 wd:Q16521 }
 MINUS { ?p wdt:P31 wd:Q11173 }
 MINUS { ?p wdt:P31 wd:Q420927 }
 MINUS { ?p wdt:P31 wd:Q37748 }
 MINUS { ?p wdt:P31 wd:Q7187 }
 MINUS { ?p wdt:P31 wd:Q78782478 }
 MINUS { ?p wdt:P352 [] }
 MINUS { ?p wdt:P639 [] }
 SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }
}

and "subclass+ protein"

SELECT DISTINCT ?p ?pLabel WHERE {
 ?p wdt:P279+ wd:Q8047 .
 ?article 	schema:about ?p ;
			schema:isPartOf <https://en.wikipedia.org/> .
 MINUS { ?p wdt:P31 wd:Q417841 }
 MINUS { ?p wdt:P31 wd:Q67015883 }
 MINUS { ?p wdt:P31 wd:Q67101749 }
 MINUS { ?p wdt:P31 wd:Q49695242 }
 MINUS { ?p wdt:P31 wd:Q68461428 }
 MINUS { ?p wdt:P352 [] }
 SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }
}
  • about 500 items were manually integrated, finished 2019-12-29

Stubs from early days II

SELECT DISTINCT ?item ?itemLabel
{
  ?article 	schema:about ?item ;
			schema:isPartOf <https://en.wikipedia.org/> .
  ?item  wdt:P279+  wd:Q8054.
  MINUS {
    ?item wdt:P31 []
    }
  ?item ?prop ?val.
  FILTER (STRSTARTS(STR(?prop), 'http://www.wikidata.org/prop/direct/') && ?prop != wdt:P646 && ?prop != wdt:P279)
  	SERVICE wikibase:label { bd:serviceParam wikibase:language "en" } .
}
  • finished 2020-Feb-19

Stubs from early days III

SELECT DISTINCT ?p ?pLabel ?ec
{
    ?p wdt:P591 ?ec.
    MINUS {
      ?p wdt:P31 [].
    }
    SERVICE wikibase:label { bd:serviceParam wikibase:language "en" } .
}
  • finished 2020-Mar-11

Duplicate exact external ids

GO functions

SELECT DISTINCT ?item1 ?item1Label ?funcLabel ?item2 ?item2Label 
{
    ?item1 p:P680 [ ps:P680 ?func; pq:P4390 wd:Q39893449; ].
    ?item2 p:P680 [ ps:P680 ?func; pq:P4390 wd:Q39893449; ].
    FILTER (?item1 != ?item2 && STR( ?item1 ) < STR( ?item2 )).
  	SERVICE wikibase:label { bd:serviceParam wikibase:language "en" } .
}
  • finished 2020-Feb-29

GO entities without P31

SELECT DISTINCT ?item ?goid
{
  ?item wdt:P686 ?goid.
  MINUS {
    { ?item wdt:P31 wd:Q5058355 } UNION { ?item wdt:P31 wd:Q2996394 } UNION { ?item wdt:P31 wd:Q14860489 }
  }
}
  • most of them obsolete, should be either merged or tagged; finished 2020-Jul-1

Silly nl descriptions by Edoderoobot

  • 2019-Aug-02
  • Description: Proteins having (nl)descriptions consisting of "proteïne in XYZ" where XYZ is a GeneDB protein ID, usually one of the ids of the protein item itself.
  • Example: https://www.wikidata.org/w/index.php?title=Q62305547&oldid=990962798 got "proteïne in EmuJ_001072400.1"
  • Talk page(s): 1
  • Reason (guess): unknown
  • Problem recognized by bot/batch maintainer: unknown
  • Bug in bot fixed: unknown
  • In the following, the example species is Echinococcus multilocularis. It makes sense to partition the task by species because at the moment it appears that only proteins scraped from GeneDB are affected, and these exist only for a few invertebrates. Also, the query server has much to do, and we want to be nice.
  • Number of proteins affected: 10668 for Emu alone...
  • Code to view affected proteins:
#SELECT (COUNT(?item) AS ?count) WHERE {
SELECT DISTINCT ?item ?itemLabel ?itemDesc ?str2 WHERE {
 ?item wdt:P31 wd:Q8054 .
 ?item wdt:P703 wd:Q669922 .
 ?item schema:description ?itemDesc .
 FILTER CONTAINS(?itemDesc, "proteïne in ") .
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "nl" .
  }
 }
  • Proposed fix: for each affected species, manually check affected items and replace the description with "proteïne in species XYZ".
  • Script that produces a QuickStatement batch from above data:
from sys import *
import csv

reader = csv.DictReader(stdin, delimiter='\t')
for item in reader:
    itemstr = item.get('item')
    itemid = itemstr[itemstr.rfind('/')+1:]
    label = item.get('itemLabel')
    desc = item.get('itemDesc')
    if desc[:12] == "proteïne in " and desc[12:] == label:
        print("{}|Dnl|\"{}\"".format(itemid, "proteïne in Echinococcus multilocularis"),
            file=stdout)
  • Efforts after this description was made: [6]
  • I think these were fixed by Edoderoobot

WP Orphans (was: MS IDs on early entries)

  • mostly domains which should have been placed on IPR domain items
SELECT ?item ?itemLabel WHERE {
  ?item wdt:P279 wd:Q8054.
  ?item wdt:P6366 ?ms.
  MINUS { ?item wdt:P31 [] }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
  • this was resolved by editing all orphans (2260 in en, de?), see catscan.py