User:SCIdude/Protein bugs

Bugs and their fixes of protein (Q8054) or gene (Q7187) objects

Unresolved

De-merge RNA and gene

  • 88,855 items are instance-of ncRNA; 88,621 of them are also instance-of gene, with gene IDs and RNA IDs (P639) mixed on the same item.
  • 6,249 of them have no RefSeq RNA ID, so the RNA item cannot be created from the gene item alone. For now just remove the (4,331) P31s (1) and the (6,249) P279s (2); fix the descriptions later.

Cell component duplicates from FMA

SELECT DISTINCT ?item ?label WHERE {
  ?item wdt:P1402 [].
  MINUS { ?item wdt:P31 wd:Q5058355 }
  MINUS { ?item wdt:P279 wd:Q5058355 }
  MINUS { ?item schema:description ?desc.
        FILTER(lang(?desc) = 'en') }
  ?item wdt:P279/wdt:P279* wd:Q66557947.
  ?item rdfs:label ?label.
  FILTER(lang(?label) = 'en')
}

Also filter those that match names with GO entities.
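
Matching names against GO entities can be done offline on two label downloads. The sketch below assumes the labels are already loaded into Python lists and compares them case-insensitively; the function name `split_by_go_name` and the sample labels are illustrative only and not part of the original workflow.

```python
def split_by_go_name(fma_labels, go_labels):
    """Split FMA-derived labels into those matching a GO label and the rest."""
    # Normalise GO labels once so that the comparison ignores case and padding.
    go = {label.strip().lower() for label in go_labels}
    matching = [l for l in fma_labels if l.strip().lower() in go]
    rest = [l for l in fma_labels if l.strip().lower() not in go]
    return matching, rest


# Made-up sample labels standing in for the two query downloads.
matching, rest = split_by_go_name(
    ["Nucleolus", "example part"], ["nucleolus", "cytosol"]
)
print(matching)
print(rest)
```

Whether the matching items are then dropped from the list or handled first is a separate decision; the function returns both groups for that reason.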

Possible misplacements of MeSH ID

  • MeSH entries are usually species-independent but there are >1,400 of:
SELECT ?item ?itemLabel ?mname
WHERE 
{
    ?item wdt:P31 wd:Q8054.
    ?item p:P486 ?mesh.
    OPTIONAL { ?mesh pq:P1810 ?mname. }
  	SERVICE wikibase:label { bd:serviceParam wikibase:language "en" } .
}

Reactome R-MTU imports marked as human

  • 2019-Oct-01
  • 21 proteins, complexes, sets from M.tuberculosis have got "found in taxon"-->human
SELECT ?p ?pLabel ?tLabel
 WHERE
 {
   ?p wdt:P703 wd:Q15978631 .
   OPTIONAL{ ?p wdt:P31 ?t . }
   ?p wdt:P2888 ?url .
   FILTER ( STRSTARTS(STR(?url), 'https://identifiers.org/reactome:R-MTU') ).
   SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
 }

Reactome protein complex duplicates

Preproproteins (human) pt. I

  • 2019-Jul-30
  • Description: Proteins that in UniProt have the keyword "Cleavage on pair of basic residues [KW-0165]" but the WD object misses the protein precursor (Q258658) class. NOTE: reviewed UniProt (SwissProt) entries contain precursors AND their products, while in TrEMBL the products have different entries.
  • Example: [1]
  • Talk page(s): 1
  • Reason (guess): complete UniProt import was too difficult
  • Number of human proteins affected: unknown
  • Code to view UniProt-associated items:
SELECT ?item ?itemLabel ?uniprotid
WHERE
{
	?item wdt:P352 ?uniprotid ;
          wdt:P703 wd:Q15978631 .
  
	SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }
}
  • Proposed fix: for each member of a given set of preproprotein UniProt IDs, 1. find the corresponding WD item, and 2. add the instance statement. If the corresponding item does not exist, add it.
  • Script that produces a QuickStatement batch from above data and the UniProt search TSV:
  • Efforts after this description was made: manual work ongoing. Actually, we started to create fragment objects for every nontrivial fragment
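
The proposed fix above could be scripted roughly as follows. This is only a sketch: it assumes the query result was saved as CSV with the columns `item` and `uniprotid`, and that the preproprotein UniProt IDs are available as a plain set. The function name `precursor_statements`, the placeholder item IDs and the sample UniProt IDs are made up for illustration. It emits one QuickStatements line adding instance of protein precursor (Q258658) per matched item and reports the IDs for which no item was found.

```python
import csv
import io

PRECURSOR = "Q258658"  # protein precursor, as named in the description above


def precursor_statements(rows, precursor_ids):
    """Return (QuickStatements lines, UniProt IDs without a WD item)."""
    found = set()
    lines = []
    for row in rows:
        uid = row.get("uniprotid")
        if uid not in precursor_ids:
            continue
        uri = row.get("item", "")
        qid = uri[uri.rfind("/") + 1:]  # last path segment of the entity URI
        lines.append("{}|P31|{}".format(qid, PRECURSOR))
        found.add(uid)
    return lines, sorted(precursor_ids - found)


# Hypothetical sample data standing in for the query download.
sample = io.StringIO(
    "item,uniprotid\n"
    "http://www.wikidata.org/entity/Q_PLACEHOLDER_1,P00001\n"
    "http://www.wikidata.org/entity/Q_PLACEHOLDER_2,P00002\n"
)
lines, missing = precursor_statements(
    csv.DictReader(sample), {"P00001", "P00003"}
)
print("\n".join(lines))
print("no item for:", missing)
```

IDs reported as missing would still need new items created by hand or by a separate batch.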

arbitrary symbols on unlocated Entrez genes

There are gene entries from Entrez where only the chromosome position is known, and so no official gene name can be given. It seems Entrez then just took symbols from OMIM, regardless of whether that was a gene or a phenotype entry, and used that as the symbol. Example: TEC (Q26241247) from Entrez 100124696, where the symbol is from OMIM 227050 (Transient erythroblastopenia of childhood) and collides with TEC (Q18031939).

Such items should be marked somehow, maybe genomic start/end ---> unknown

Practically irrelevant or misguided

UniProt ID but not instance of protein/peptide

SELECT ?item ?itemLabel 
WHERE
{
  ?item wdt:P352 ?uid .
  MINUS { ?item wdt:P31 [] } .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" . }
}
  • Note: there may be protein groups covered by a UniProt ID
  • number of hits: 23,303 (2019-Aug-15), 1 (2019-Aug-17)
  • code to output QS:
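import csv
import sys

# Read the query result (CSV with an "item" column) from stdin and emit
# one QuickStatements line per item: instance of (P31) protein (Q8054).
reader = csv.DictReader(sys.stdin, delimiter=',')
for item in reader:
    iturl = item.get('item')
    it = iturl[iturl.rfind('/')+1:]  # QID = last path segment of the entity URI
    print("{}|P31|Q8054".format(it))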
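
Several scripts on this page cut the QID out of the entity URI with `rfind('/')`. A slightly more defensive variant is sketched below; the helper name `qid_from_uri` is an assumption and not part of the original scripts. It rejects values that do not look like a QID, so a malformed row does not end up in a QuickStatements batch.

```python
import re

_QID = re.compile(r"Q[1-9][0-9]*")


def qid_from_uri(uri):
    """Return the trailing QID of an entity URI, or None if there is none."""
    if not uri:
        return None
    tail = uri[uri.rfind("/") + 1:]
    return tail if _QID.fullmatch(tail) else None


print(qid_from_uri("http://www.wikidata.org/entity/Q8054"))  # Q8054
print(qid_from_uri("not a uri"))  # None
```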

Determination method on refs

#SELECT (COUNT(?item) AS ?count) WHERE {
SELECT DISTINCT ?item ?stmt1 ?meth ?ref WHERE {
  ?item wdt:P31 wd:Q8054 .
  ?item p:P680 ?stmt1 .
  ?stmt1 prov:wasDerivedFrom [ pr:P459 ?ref ] .
  ?stmt1 pq:P459 ?meth .
}
  • Proposed fix:
  • Script that produces a QuickStatement batch from above data:
  • Efforts after this description was made:

Aliases

  • see Wikidata:Bot_requests#Protein_aliases
  • alias identical to label, remove
  • alias: "hypothetical protein", move/append to description. UniProt actually lists this as name...
  • also aliases of form Uniprot:xyz (insulin)

GeneDB ID as label

  • 2019-Aug-02
  • Description: Proteins having (en)labels identical to their GeneDB ID.
  • Example: https://www.wikidata.org/w/index.php?title=Q62305547&oldid=990962483 got "EmuJ_001072400.1" instead of "hypothetical protein" (which is the name given by UniProt)
  • Talk page(s): 1
  • Reason (guess): unknown
  • Problem recognized by bot/batch maintainer: unknown
  • Bug in bot fixed: unknown
  • In the following the example species is Echinococcus multilocularis. It makes sense to partition the task by species because at the moment it appears that only proteins scraped from GeneDB are affected, and these exist only for a few invertebrates. Also, the query server has much to do, and we want to be nice.
  • Number of proteins affected: 10659 for Emu alone...
  • Code to view affected proteins:
SELECT DISTINCT ?item ?itemLabel ?itemAl WHERE {
  ?item wdt:P31 wd:Q8054 .
  ?item wdt:P703 wd:Q669922 .
  ?item wdt:P3382 ?str2 .
  ?item rdfs:label ?itemLabel .
  FILTER (STR(?itemLabel) = ?str2) .
  ?item skos:altLabel ?itemAl .
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "en" .
  }
}
  • Proposed fix: for each affected species, manually check affected items and replace the label with the alias.
  • Script that produces a QuickStatement batch from above data:
from sys import *
import csv

reader = csv.DictReader(stdin, delimiter='\t')
for item in reader:
    itemstr = item.get('item')
    itemid = itemstr[itemstr.rfind('/')+1:]
    label = item.get('itemLabel')
    al = item.get('itemAl')
    if (al[:9] != "expressed"
            and al[:9] != "conserved"
            and al[:12] != "hypothetical"):
        print("{}|Len|\"{}\"".format(itemid, al),
            file=stdout)
        print("{}|Aen|\"{}\"".format(itemid, label),
            file=stdout)
  • Efforts after this description was made: [2](Emu), [3] (Lin)
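
Partitioning by species, as suggested above, amounts to running the same query once per taxon. The following is a minimal sketch that only builds the query strings: `Q669922` is the taxon used in the query above, the second ID is a placeholder, and `queries_per_taxon` is a hypothetical helper name. The label service clause is left out here on the assumption that the label is already bound through `rdfs:label`. Sending the queries, and pausing between them, is not shown.

```python
# Doubled braces are literal braces for str.format(); {taxon} is the only field.
TEMPLATE = """SELECT DISTINCT ?item ?itemLabel ?itemAl WHERE {{
  ?item wdt:P31 wd:Q8054 .
  ?item wdt:P703 wd:{taxon} .
  ?item wdt:P3382 ?str2 .
  ?item rdfs:label ?itemLabel .
  FILTER (STR(?itemLabel) = ?str2) .
  ?item skos:altLabel ?itemAl .
}}"""


def queries_per_taxon(taxa):
    """Return one query string per taxon QID."""
    return {taxon: TEMPLATE.format(taxon=taxon) for taxon in taxa}


# "Q_OTHER_TAXON" is a placeholder, not a real item ID.
queries = queries_per_taxon(["Q669922", "Q_OTHER_TAXON"])
print(queries["Q669922"])
```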

UniProt ID but no encoding gene

  • 2019-Aug-15
  • Code to view UniProt-associated items:
SELECT ?item ?itemLabel 
WHERE
{
  ?item wdt:P352 ?uid .
  #?item wdt:P703 wd:Q15978631 .
  MINUS { ?item wdt:P702 [] } .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" . }
}
  • number of hits: 29730 (2019-Aug-16)
  • reason is mainly that the organism/gene is uninteresting. Restricting to human (331), mouse (13), rat (1)
  • reasons for those: gene/protein is in doubt/unknown, UniProt on gene entry, mobile element (=no database gene entry), part of antibody (why?)
  • manual fixes necessary for some of these. Remaining are: human (323), mouse (8)

instance + subclass of protein

SELECT ?stmt
WHERE 
{
  ?item wdt:P31 wd:Q8054.
  ?item p:P279 ?stmt. 
  ?stmt ps:P279 wd:Q8054.
}

Permanent maintenance jobs

obsolete UniProt IDs not marked

obsolete GO entries

Duplicate UniProt IDs

inexact GO synonyms

proteins with instance-of a main family

  • potential idiotic mergers
SELECT DISTINCT ?item
{
  VALUES ?class { wd:Q84467700 wd:Q67015883 wd:Q417841 wd:Q7251477 wd:Q67101749 wd:Q68461428 }
  ?item wdt:P31 ?class.
  ?item wdt:P31 wd:Q8054.
}

Fixed in database

Objects being instance of both gene and protein

  • 2019-Jul-31
  • Number of objects affected: 29 (2019-Jul-31), 0 (2019-Aug-01)
  • Code to view affected proteins:
SELECT (COUNT(?item) AS ?count) WHERE {
#SELECT DISTINCT ?item WHERE {
  ?item wdt:P31 wd:Q8054 .
  ?item wdt:P31 wd:Q7187 .
}
  • Efforts after this description was made: manual edits around 29 items (until 2019-aug-01)

Natural peptides without encoding genes/taxon/UniProt

  • 2019-Aug-05
  • Description: see title. Mostly they have en-WP articles with UniProt/Interpro/Pfam entries. Peptides are niche articles, were all scraped from WP anyway (14 with, 4 without en-WP article).
  • can be manually resolved
SELECT DISTINCT ?item ?article ?itemLabel WHERE {
  ?item wdt:P31 wd:Q172847 .
#  ?article schema:about ?item  ; schema:inLanguage "en"
#  FILTER NOT EXISTS { ?wen schema:about ?item ; schema:inLanguage "en" }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}

Gene/protein associations with different taxa

  • 2019-Aug-01
  • CAUTION: closely related bacterial strains may have identical proteins; these should be excluded
  • Number of objects affected: 3368 (2019-Aug-01) (C.botulinum diff. strains: 2758, B.anthracis diff. strains: 606)
  • Code to view affected proteins:
SELECT (COUNT(?itemp) AS ?count) WHERE {
#SELECT DISTINCT ?itemp ?itemg ?taxp ?taxg WHERE {
  ?itemp wdt:P31 wd:Q8054 .
  ?itemp wdt:P702 ?itemg .
  ?itemp wdt:P703 ?taxp .
  ?itemg wdt:P703 ?taxg .
  FILTER (?taxp != ?taxg)
  }
  • Efforts after this description was made: one case human/mouse edited (2019-Aug-01)

UniProt ID but no taxon

  • 2019-Aug-15
  • Code to view UniProt-associated items:
SELECT ?item ?itemLabel 
WHERE
{
  ?item wdt:P352 ?uid .
  MINUS { ?item wdt:P703 [] } .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" . }
}
  • number of hits: 28, manually resolved

Duped labels (proteins)

  • 2019-Jul-26
  • Description: Proteins having labels consisting of two words, differing only by case, or not at all. Usually one of the words is identical to an alias.
  • Example: [4] has the label "lmo0063 lmo0063"
  • Talk page(s): 1 2
  • Reason (guess): unknown
  • Problem recognized by bot/batch maintainer: unknown
  • Bug in bot fixed: unknown
  • Number of proteins affected: 11434 (2019-Jul-27), 0 (2019-Aug-17)
  • Code to view affected proteins:

NOTE: this gives a query timeout now (2019-Aug-16), so should be checked differently before resolving!

#SELECT (COUNT(?item) AS ?count) WHERE {
SELECT DISTINCT ?item ?itemLabel (lang(?itemLabel) AS ?itemLabel_lang) ?str1 ?str2 WHERE {
  ?item wdt:P31 wd:Q8054 .
  ?item rdfs:label ?itemLabel .
  FILTER CONTAINS(?itemLabel, " ") .
  BIND (STRBEFORE(?itemLabel, " ") AS ?str1) .
  BIND (STRAFTER(?itemLabel, " ") AS ?str2) .
  FILTER (STRLEN(?str1) = STRLEN(?str2)) .
  FILTER (?str1 = ?str2 || LCASE(?str1) = LCASE(?str2)) .
}
  • Proposed fix: for each such (object,label) tuple, replace the label with the uppercase version of the two words. Rationale: protein symbols are always uppercase (Ref: 1).
  • Script that produces a QuickStatement batch from above data:
from sys import *
import csv

def upcase(s): return s[:1].upper() + s[1:]

reader = csv.DictReader(stdin, delimiter=',')
for item in reader:
    itemstr = item.get('item')
    itemid = itemstr[itemstr.rfind('/')+1:]
    lang = item.get('itemLabel_lang')
    str1 = upcase(item.get('str1'))
    str2 = upcase(item.get('str2'))
    if (str1 == str2):
        print("{}|L{}|\"{}\"".format(itemid, lang, str1),
            file=stdout)
  • Efforts after this description was made: QS #16460 #16461
  • Alternative way to check database: get all protein labels
SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31 wd:Q8054 .
  ?item rdfs:label ?itemLabel. FILTER( LANG(?itemLabel)="en" )
}

and run through:

from sys import *
import csv

reader = csv.DictReader(stdin, delimiter='\t')
for item in reader:
    itemstr = item.get('itemLabel')
    if itemstr is None:
        continue
    pos = itemstr.find(' ')
    if pos<0:
        continue
    str1 = itemstr[:pos]
    str2 = itemstr[pos+1:]
    if (str1 == str2):
        print(itemstr)

Duped labels (genes)

  • 2019-Jul-30
  • Description: Genes having labels consisting of two identical words. Usually one of the words is identical to an alias.
  • Example: [5] has the label "lmo0819 lmo0819"
  • Talk page(s): 1 2
  • Reason (guess): unknown
  • Problem recognized by bot/batch maintainer: unknown
  • Bug in bot fixed: unknown
  • Number of genes affected: 11560 (2019-Jul-30), 0 (2019-Aug-17)
  • Code to view affected items:

NOTE: this gives a query timeout now (2019-Aug-16), so should be checked differently before resolving!

#SELECT (COUNT(?item) AS ?count) WHERE {
SELECT DISTINCT ?item ?itemLabel (lang(?itemLabel) AS ?itemLabel_lang) ?str1 ?str2 WHERE {
  ?item wdt:P31 wd:Q7187 .
  ?item rdfs:label ?itemLabel .
  FILTER CONTAINS(?itemLabel, " ") .
  BIND (STRBEFORE(?itemLabel, " ") AS ?str1) .
  BIND (STRAFTER(?itemLabel, " ") AS ?str2) .
  FILTER (STRLEN(?str1) = STRLEN(?str2)) .
  FILTER (?str1 = ?str2) .
} # TIMEOUT!
  • Proposed fix: for each such (object,label) tuple, replace the label with the single word.
  • Script that produces a QuickStatement batch from above data:
import csv
import sys

reader = csv.DictReader(sys.stdin, delimiter=',')
for item in reader:
    itemstr = item.get('item')
    itemid = itemstr[itemstr.rfind('/')+1:]
    lang = item.get('itemLabel_lang')
    # str1/str2 are the two halves of the label, as returned by the query
    str1 = item.get('str1')
    str2 = item.get('str2')
    if (str1 == str2):
        print("{}|L{}|\"{}\"".format(itemid, lang, str1),
            file=sys.stdout)
  • Efforts after this description was made: #16494
  • Alternative way to check database: get all gene labels
SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31 wd:Q7187 .
  ?item rdfs:label ?itemLabel. FILTER( LANG(?itemLabel)="en" )
}

and run through:

from sys import *
import csv

reader = csv.DictReader(stdin, delimiter='\t')
for item in reader:
    itemstr = item.get('itemLabel')
    if itemstr is None:
        continue
    pos = itemstr.find(' ')
    if pos<0:
        continue
    str1 = itemstr[:pos]
    str2 = itemstr[pos+1:]
    if (str1 == str2):
        print(itemstr)

Human genes without Entrez gene ID

  • 2019-08-19
SELECT ?item ?itemLabel ?geneid WHERE {
  ?item wdt:P31 wd:Q7187 .
  ?item wdt:P703 wd:Q15978631 .
  MINUS { ?item wdt:P351 [] } .
  ?item rdfs:label ?itemLabel. FILTER( LANG(?itemLabel)="en" )
}

showed 4 hits which were resolved manually

Objects with HGNC symbol but not instance of anything

  • 2019-Aug-19
SELECT ?item ?itemLabel ?geneid WHERE {
  ?item wdt:P353 ?dum .
  MINUS { ?item wdt:P31 [] } .
  ?item rdfs:label ?itemLabel. FILTER( LANG(?itemLabel)="en" )
}
  • Hits: 186 (2019-Aug-19), 0 (2019-Aug-20)
  • wildly different objects, all have sitelinks, quick creations?
  • manually resolved

Genes: missing OMIM gene entry

  • 2019-Aug-21
  • Using an OMIM/Entrez index download and the output of
SELECT ?item ?itemLabel ?geneid WHERE {
  ?item wdt:P31 wd:Q277338 .
#  ?item wdt:P31 wd:Q7187 .
  ?item wdt:P703 wd:Q15978631 .
  ?item wdt:P351 ?geneid .
  ?item rdfs:label ?itemLabel. FILTER( LANG(?itemLabel)="en" )
}

we used the script https://gist.github.com/rwst/760ee4454d306c4f619053bf5798becd to create the QS batches #17650, #17652, #17675, #17676, #17679, #17680, #17682

diseases with "anatomical location" instead of "location"

  • 2019-Aug-23
SELECT DISTINCT ?item ?itemLabel WHERE {
	?statement wikibase:hasViolationForConstraint wds:P927-22a699be-4c01-5ee5-4295-81f6ac028a65 .
	?item ?p ?statement .
       ?item wdt:P279+ wd:Q12136 .
	FILTER( ?item NOT IN ( wd:Q4115189, wd:Q13406268, wd:Q15397819 ) ) .
	SERVICE wikibase:label { bd:serviceParam wikibase:language "en" } .
}
  • 12 hits, manually resolved

Reactome pathway labels with semicolon substitution

  • 2019-Aug-23... that's what happens when you pipe strings containing commas through csv
SELECT ?item ?itemLabel
WHERE
{
  ?item wdt:P31 wd:Q4915012 .
  ?item rdfs:label ?itemLabel .
  FILTER( LANG(?itemLabel)="en" ) .
  FILTER( CONTAINS(?itemLabel, ";") ) .
}
  • 81 hits, script:
from sys import *
import csv

reader = csv.DictReader(open('t.tab', 'r'), delimiter='\t')
for item in reader:
    iturl = item.get('item')
    it = iturl[iturl.rfind('/')+1:]
    l = item.get('itemLabel')
    print('{}|Len|"{}"'.format(it, l.replace(';',',')))
  • #17750
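
The comma problem mentioned above can be avoided by letting the csv module do the quoting on both the writing and the reading side. The following sketch uses a made-up label and a placeholder item ID; it only shows that a label containing commas survives a round trip unchanged when it is written with `csv.writer` instead of being joined by hand.

```python
import csv
import io

label = "Example pathway, part one, part two"  # made-up label with two commas

buf = io.StringIO()
csv.writer(buf).writerow(["Q_PLACEHOLDER", label])  # the writer quotes the field
print(buf.getvalue().strip())

buf.seek(0)
row = next(csv.reader(buf))
print(row[1] == label)

# Joining by hand and splitting on commas breaks the label apart.
naive = ",".join(["Q_PLACEHOLDER", label]).split(",")
print(len(naive))
```

With tab-separated downloads, as used by most scripts on this page, the same round trip works by passing `delimiter='\t'` to both the writer and the reader.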

superfluous subclass-of-gene without ref

SELECT ?item
WHERE
{
  ?item wdt:P279 wd:Q277338 .
  ?item p:P279 ?c .
  ?c ps:P279 wd:Q7187 .
  MINUS {?c prov:wasDerivedFrom [pr:P248 ?ref]} .
}
  • #17833 (201)

ncRNA with "encodes"

SELECT ?item ?itemLabel
WHERE
{
  ?item wdt:P279 wd:Q427087 .
  ?item wdt:P688 ?d .
}
  • 55 hits, manually resolved, some unresolvable like Q18051378

specialize subclass claims

general case

SELECT ?item ?itemLabel ?inst ?instLabel
WHERE
{
  ?item wdt:P279 wd:Q8054 .
  ?item wdt:P31 ?inst .
  ?inst wdt:P279 wd:Q8054 .
  ?item rdfs:label ?itemLabel . FILTER( LANG(?itemLabel)="en" )
  ?inst rdfs:label ?instLabel . FILTER( LANG(?instLabel)="en" )
}
  • #17740

proteins: specialized from GOA function (transporter activity --> transport protein)

SELECT ?item ?itemLabel ?inst ?instLabel
WHERE
{
  ?item wdt:P279 wd:Q8054 .
  ?item wdt:P31 wd:Q8054 .
  ?item wdt:P680 ?inst .
  ?inst wdt:P279+ wd:Q14864384 .
  ?item rdfs:label ?itemLabel . FILTER( LANG(?itemLabel)="en" )
  ?inst rdfs:label ?instLabel . FILTER( LANG(?instLabel)="en" )
}
  • #17745, #17746, #17772, #17805, #17826, #17844, #17882, #17884

subclass of proteins and encoded without instance of protein

s2 = set(open('wd-subc-of-prot', 'r').readlines())
s3 = set(open('wd-inst-of-prot', 'r').readlines())
s4 = set(open('wd-encoded', 'r').readlines())
for i in s2.intersection(s4).difference(s3):
    print(i)
  • 50 cases, manually fixed 2019-Aug-31, leaving 3 valid exceptions

RefSeq protein missing valid UniProt

  • 2019-Sep-01
import csv

# Map each WD item to its RefSeq protein ID; drop items with conflicting IDs.
reader = csv.DictReader(open('refseqp-wd.tab', 'r'), delimiter='\t')
refs = {}
dups = set()
for item in reader:
    uid = item.get('refseq')
    iturl = item.get('item')
    it = iturl[iturl.rfind('/')+1:]
    git = refs.get(it)
    if git is None or git == uid:
        refs[it] = uid
    else:
        dups.add(it)
for k in dups:
    refs.pop(k)

# Map each RefSeq ID (version suffix stripped) to its UniProt ID;
# drop RefSeq IDs that map to more than one UniProt ID.
reader = csv.DictReader(open('uniprot-refseq.tab', 'r'), delimiter='\t')
unips = {}
dups = set()
for item in reader:
    uid = item.get('uniprot')
    if '-' in uid:  # skip isoform IDs
        continue
    ref = item.get('refseq')
    if ref.find('.') > -1:
        ref = ref[:ref.find('.')]
    git = unips.get(ref)
    if git is None or git == uid:
        unips[ref] = uid
    else:
        dups.add(ref)
for k in dups:
    unips.pop(k)

# Emit a QuickStatements line for every listed item that lacks a UniProt ID (P352).
ids = set(l.rstrip() for l in open('wd-refseq-without-uniprot', 'r').readlines())
for it in ids:
    r = refs.get(it)
    if r is not None:
        u = unips.get(r)
        if u is not None:
            print('{}|P352|"{}"'.format(it, u))
  • hits: 4,598 all from MicrobeBot imports of Myxococcus and Chlamydia; that leaves those without valid UniProt
  • batch: #17948

genes with EC number

 ?item wdt:P591 ?ec .
 ?item wdt:P2888 ?url .
 FILTER CONTAINS(STR(?url), 'ncbigene')
  • 272 hits, script:
import csv

reader = csv.DictReader(open('wd-genes-with-ec.tab', 'r'), delimiter='\t')
for item in reader:
    ec = item.get('ec')
    iturl = item.get('item')
    it = iturl[iturl.rfind('/')+1:]
    print('-{}|P591|"{}"'.format(it, ec))
  • batch #17966

Stubs from early days

  • 2019-Dec-02
  • The following queries come up with a lot of items dumped from enwiki people, common theme "found in taxon" "Homo sapiens"
 ?p wdt:P703 wd:Q15978631 .
 ?article 	schema:about ?p ;
			schema:isPartOf <https://en.wikipedia.org/> .
 MINUS { 
   ?p p:P703 ?stmt.
   ?stmt prov:wasDerivedFrom []
 }
 MINUS { ?p wdt:P31 wd:Q16521 }
 MINUS { ?p wdt:P31 wd:Q11173 }
 MINUS { ?p wdt:P31 wd:Q420927 }
 MINUS { ?p wdt:P31 wd:Q37748 }
 MINUS { ?p wdt:P31 wd:Q7187 }
 MINUS { ?p wdt:P31 wd:Q78782478 }
 MINUS { ?p wdt:P352 [] }
 MINUS { ?p wdt:P639 [] }

and "subclass+ protein"

 ?p wdt:P279+ wd:Q8047 .
 ?article 	schema:about ?p ;
			schema:isPartOf <https://en.wikipedia.org/> .
 MINUS { ?p wdt:P31 wd:Q417841 }
 MINUS { ?p wdt:P31 wd:Q67015883 }
 MINUS { ?p wdt:P31 wd:Q67101749 }
 MINUS { ?p wdt:P31 wd:Q49695242 }
 MINUS { ?p wdt:P31 wd:Q68461428 }
 MINUS { ?p wdt:P352 [] }
   	SERVICE wikibase:label { bd:serviceParam wikibase:language "en,en" }
}
  • about 500 items were manually integrated, finished 2019-12-29

Stubs from early days II

SELECT DISTINCT ?item ?itemLabel
{
  ?article 	schema:about ?item ;
			schema:isPartOf <https://en.wikipedia.org/> .
  ?item  wdt:P279+  wd:Q8054.
  MINUS {
    ?item wdt:P31 []
    }
  ?item ?prop ?val.
  FILTER (STRSTARTS(STR(?prop), 'http://www.wikidata.org/prop/direct/') && ?prop != wdt:P646 && ?prop != wdt:P279)
  	SERVICE wikibase:label { bd:serviceParam wikibase:language "en" } .
}

  • finished 2020-Feb-19

Stubs from early days III

SELECT DISTINCT ?p ?pLabel ?ec
{
    ?p wdt:P591 ?ec.
    MINUS {
      ?p wdt:P31 [].
    }
    SERVICE wikibase:label { bd:serviceParam wikibase:language "en" } .
}

  • finished 2020-Mar-11

Duplicate exact external ids

GO functions

SELECT DISTINCT ?item1 ?item1Label ?funcLabel ?item2 ?item2Label 
{
    ?item1 p:P680 [ ps:P680 ?func; pq:P4390 wd:Q39893449; ].
    ?item2 p:P680 [ ps:P680 ?func; pq:P4390 wd:Q39893449; ].
    FILTER (?item1 != ?item2 && STR( ?item1 ) < STR( ?item2 )).
  	SERVICE wikibase:label { bd:serviceParam wikibase:language "en" } .
}

  • finished 2020-Feb-29

GO entities without P31

SELECT DISTINCT ?item ?goid
{
  ?item wdt:P686 ?goid.
  MINUS {
    { ?item wdt:P31 wd:Q5058355 } UNION { ?item wdt:P31 wd:Q2996394 } UNION { ?item wdt:P31 wd:Q14860489 }
  }
}

  • most of them obsolete, should be either merged or tagged; finished 2020-Jul-1

Silly nl descriptions by Edoderoobot

  • 2019-Aug-02
  • Description: Proteins having (nl)descriptions consisting of "proteïne in XYZ" where XYZ is a GeneDB protein ID, usually one of the ids of the protein item itself.
  • Example: https://www.wikidata.org/w/index.php?title=Q62305547&oldid=990962798 got "proteïne in EmuJ_001072400.1"
  • Talk page(s): 1
  • Reason (guess): unknown
  • Problem recognized by bot/batch maintainer: unknown
  • Bug in bot fixed: unknown
  • In the following the example species is Echinococcus multilocularis. It makes sense to partition the task by species because at the moment it appears that only proteins scraped from GeneDB are affected, and these exist only for a few invertebrates. Also, the query server has much to do, and we want to be nice.
  • Number of proteins affected: 10668 for Emu alone...
  • Code to view affected proteins:
#SELECT (COUNT(?item) AS ?count) WHERE {
SELECT DISTINCT ?item ?itemLabel ?itemDesc ?str2 WHERE {
 ?item wdt:P31 wd:Q8054 .
 ?item wdt:P703 wd:Q669922 .
 ?item schema:description ?itemDesc .
 FILTER CONTAINS(?itemDesc, "proteïne in ") .
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "nl" .
  }
}
  • Proposed fix: for each affected species, manually check affected items and replace the description with "proteïne in species XYZ".
  • Script that produces a QuickStatement batch from above data:
from sys import *
import csv

reader = csv.DictReader(stdin, delimiter='\t')
for item in reader:
    itemstr = item.get('item')
    itemid = itemstr[itemstr.rfind('/')+1:]
    label = item.get('itemLabel')
    desc = item.get('itemDesc')
    if desc[:12] == "proteïne in " and desc[12:] == label:
        print("{}|Dnl|\"{}\"".format(itemid, "proteïne in Echinococcus multilocularis"),
            file=stdout)
  • Efforts after this description was made: [6]
  • I think these were fixed by Edoderoobot

WP Orphans (was: MS IDs on early entries)

  • mostly domains which should have been placed on IPR domain items
SELECT ?item ?itemLabel WHERE {
  ?item wdt:P279 wd:Q8054.
  ?item wdt:P6366 ?ms.
  MINUS { ?item wdt:P31 [] }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}

  • this was resolved by editing all orphans (2260 in en, de?), see catscan.py