User:SCIdude/Protein bugs

Bugs in protein (Q8054) or gene (Q7187) objects, and their fixes

Unresolved

De-merge RNA and gene

88,855 items are instance-of ncRNA; 88,621 of these are also instance-of gene, so gene IDs and RNA IDs (P639) are mixed on the same item.
6,249 of them have no RefSeq RNA ID, so the RNA item cannot be created from the gene item alone. For now, just remove (1) the 4,331 P31 statements and (2) the 6,249 P279 statements. Fix the descriptions later.

Cell component duplicates from FMA

SELECT DISTINCT ?item ?label WHERE {
  ?item wdt:P1402 [].
  MINUS { ?item wdt:P31 wd:Q5058355 }
  MINUS { ?item wdt:P279 wd:Q5058355 }
  MINUS { ?item schema:description ?desc.
        FILTER(lang(?desc) = 'en') }
  ?item wdt:P279/wdt:P279* wd:Q66557947.
  ?item rdfs:label ?label.
  FILTER(lang(?label) = 'en')
}

Also filter those that match names with GO entities.
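
The name filter mentioned above could be sketched roughly as follows. This is only an illustration: it assumes the query result is available as rows with `item` and `label` columns and that the GO entity names have been exported to a plain list. The helper name `split_by_go_name` and the case-insensitive exact match are assumptions, not part of the original workflow.

```python
def split_by_go_name(rows, go_names):
    """Partition query rows by whether their label equals a GO entity name."""
    # Case-insensitive exact match; this matching rule is an assumption.
    known = {name.strip().lower() for name in go_names}
    matching, other = [], []
    for row in rows:
        key = row["label"].strip().lower()
        (matching if key in known else other).append(row)
    return matching, other


# Made-up example rows and names, for illustration only.
rows = [
    {"item": "Q_A", "label": "Nucleus"},
    {"item": "Q_B", "label": "Example body"},
]
matching, other = split_by_go_name(rows, ["nucleus", "cytosol"])
print([r["item"] for r in matching], [r["item"] for r in other])
```

Returning both partitions leaves open whether the matches are the ones to keep or the ones to set aside.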

Possible misplacements of MeSH ID

MeSH entries are usually species-independent but there are >1,400 of:

SELECT ?item ?itemLabel ?mname
WHERE 
{
    ?item wdt:P31 wd:Q8054.
    ?item p:P486 ?mesh.
    OPTIONAL { ?mesh pq:P1810 ?mname. }
  	SERVICE wikibase:label { bd:serviceParam wikibase:language "en" } .
}

Reactome R-MTU imports marked as human

2019-Oct-01
21 proteins, complexes, sets from M.tuberculosis have got "found in taxon"-->human

SELECT ?p ?pLabel ?tLabel
 WHERE
 {
   ?p wdt:P703 wd:Q15978631 .
   OPTIONAL{ ?p wdt:P31 ?t . }
   ?p wdt:P2888 ?url .
   FILTER ( STRSTARTS(STR(?url), 'https://identifiers.org/reactome:R-MTU') ).
   SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
 }

Reactome protein complex duplicates

2019-Dec-01
Description: PathwayBot imported protein complex items, but Reactome assigns a different ID for each location, so we have lots of duplicates, e.g. https://www.wikidata.org/wiki/Special:WhatLinksHere/Q21106702.
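
One way to list candidate duplicates is to group the imported complex items by label. The sketch below is hypothetical: it assumes rows with `item` and `label` columns from some query, and it assumes that duplicates share the same label apart from case, which would need checking against real data.

```python
from collections import defaultdict


def duplicate_groups(rows):
    """Group item ids by normalized label and keep groups with more than one item."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["label"].strip().lower()].append(row["item"])
    return {label: sorted(items) for label, items in groups.items() if len(items) > 1}


# Made-up rows, for illustration only.
rows = [
    {"item": "Q_1", "label": "Complex X"},
    {"item": "Q_2", "label": "complex x"},
    {"item": "Q_3", "label": "Complex Y"},
]
print(duplicate_groups(rows))
```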

Preproproteins (human) pt. I

2019-Jul-30
Description: Proteins that in UniProt have the keyword "Cleavage on pair of basic residues [KW-0165]" but whose WD object lacks the protein precursor (Q258658) class. NOTE: reviewed UniProt (SwissProt) entries contain precursors AND their products, while in TrEMBL the products have different entries.
Example: [1]
Talk page(s): 1
Reason (guess): complete UniProt import was too difficult
Number of human proteins affected: unknown
Code to view UniProt-associated items:

SELECT ?item ?itemLabel ?uniprotid
WHERE
{
	?item wdt:P352 ?uniprotid ;
          wdt:P703 wd:Q15978631 .
  
	SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }
}

Proposed fix: for each member of a given set of preproprotein UniProt IDs, 1. find the corresponding WD item, and 2. add the instance statement. If the corresponding item does not exist, add it.
Script that produces a QuickStatement batch from above data and the UniProt search TSV:
Efforts after this description was made: manual work ongoing. Actually, we started to create fragment objects for every nontrivial fragment
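
The proposed fix could be sketched along these lines. This is not the script that was actually used: the function name `precursor_batch`, the `Entry` column of the UniProt TSV, and the sample rows are all assumptions made for illustration. The `item` and `uniprotid` columns are the ones returned by the query above.

```python
import csv
import io

PRECURSOR = "Q258658"  # protein precursor class named in the description above


def precursor_batch(wd_rows, keyword_ids):
    """Return (QuickStatements lines, UniProt IDs that have no Wikidata item)."""
    seen = set()
    lines = []
    for row in wd_rows:
        uid = row["uniprotid"]
        if uid in keyword_ids:
            qid = row["item"].rsplit("/", 1)[-1]  # entity URL -> Q-id
            lines.append("{}|P31|{}".format(qid, PRECURSOR))
            seen.add(uid)
    return lines, sorted(keyword_ids - seen)


# Tiny made-up inputs standing in for the query result and the UniProt TSV.
wd_tsv = io.StringIO(
    "item\tuniprotid\n"
    "http://www.wikidata.org/entity/Q111\tP00001\n"
    "http://www.wikidata.org/entity/Q222\tP00002\n"
)
uniprot_tsv = io.StringIO("Entry\nP00001\nP00003\n")

keyword_ids = {r["Entry"] for r in csv.DictReader(uniprot_tsv, delimiter="\t")}
lines, missing = precursor_batch(csv.DictReader(wd_tsv, delimiter="\t"), keyword_ids)
print("\n".join(lines))
print("no item yet:", missing)
```

The second return value lists the IDs for which an item would still have to be created by hand.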

arbitrary symbols on unlocated Entrez genes

There are gene entries from Entrez where only the chromosome position is known, and so no official gene name can be given. It seems Entrez then just took symbols from OMIM, regardless of whether that was a gene or phenotype entry, and gave that as symbol. Example: TEC (Q26241247) from Entrez 100124696, where the symbol is from OMIM 227050 (Transient erythroblastopenia of childhood) and collides with TEC (Q18031939).

Such items should be marked somehow, maybe genomic start/end ---> unknown

Practically irrelevant or misguided

UniProt ID but not instance of protein/peptide

2019-Aug-23 that was actually misguided and not necessary, see Wikidata_talk:WikiProject_Molecular_biology#usage_of_instance_on_genes/proteins
2019-Aug-15
Code to view UniProt-associated items:

SELECT ?item ?itemLabel 
WHERE
{
  ?item wdt:P352 ?uid .
  MINUS { ?item wdt:P31 [] } .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" . }
}

Note: there may be protein groups covered by a UniProt ID
number of hits: 23,303 (2019-Aug-15), 1 (2019-Aug-17)
code to output QS:

from sys import stdin
import csv

# Read the query result (CSV) and emit one QuickStatements line per item.
reader = csv.DictReader(stdin, delimiter=',')
for item in reader:
    iturl = item.get('item')
    it = iturl[iturl.rfind('/')+1:]
    print("{}|P31|Q8054".format(it))

batches: #17507

Determination method on refs

2019-Jul-29
Description: Proteins having function/process/loc statements with ref having a determination method, which triggers a scope violation. Proposed solution: Wikidata_talk:WikiProject_Molecular_biology#"determination_method"_property_on_GOA_references
Example: https://www.wikidata.org/w/index.php?title=Q27757881&oldid=988373683
Talk page(s): 1
Reason (guess): unknown
Problem recognized by bot/batch maintainer: unknown
Bug in bot fixed: unknown
Number of proteins affected:
Code to view affected proteins:

#SELECT (COUNT(?item) AS ?count) WHERE {
SELECT DISTINCT ?item ?stmt1 ?meth ?ref WHERE {
  ?item wdt:P31 wd:Q8054 .
  ?item p:P680 ?stmt1 .
  ?stmt1 prov:wasDerivedFrom [ pr:P459 ?ref ] .
  ?stmt1 pq:P459 ?meth .
}

Proposed fix:
Script that produces a QuickStatement batch from above data:
Efforts after this description was made:

Aliases

see Wikidata:Bot_requests#Protein_aliases
alias identical to label, remove
~~alias: "hypothetical protein", move/append to description~~UniProt actually lists this as name...
also aliases of form Uniprot:xyz (insulin)

GeneDB ID as label

2019-Aug-02
Description: Proteins having (en)labels identical to their GeneDB ID.
Example: https://www.wikidata.org/w/index.php?title=Q62305547&oldid=990962483 got "EmuJ_001072400.1" instead of "hypothetical protein" (which is the name given by UniProt)
Talk page(s): 1
Reason (guess): unknown
Problem recognized by bot/batch maintainer: unknown
Bug in bot fixed: unknown
In the following, the example species is Echinococcus multilocularis. It makes sense to partition the task by species because at the moment it appears that only proteins scraped from GeneDB are affected, and these exist only for a few invertebrates. Also the query server has much to do, and we want to be nice.
Number of proteins affected: 10659 for Emu alone...
Code to view affected proteins:

SELECT DISTINCT ?item ?itemLabel ?itemAl WHERE {
  ?item wdt:P31 wd:Q8054 .
  ?item wdt:P703 wd:Q669922 .
  ?item wdt:P3382 ?str2 .
  ?item rdfs:label ?itemLabel .
  FILTER (STR(?itemLabel) = ?str2) .
  ?item skos:altLabel ?itemAl .
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "en" .
  }
}

Proposed fix: for each affected species, manually check affected items and replace the label with the alias.
Script that produces a QuickStatement batch from above data:

from sys import *
import csv

reader = csv.DictReader(stdin, delimiter='\t')
for item in reader:
    itemstr = item.get('item')
    itemid = itemstr[itemstr.rfind('/')+1:]
    label = item.get('itemLabel')
    al = item.get('itemAl')
    if (al[:9] != "expressed"
            and al[:9] != "conserved"
            and al[:12] != "hypothetical"):
        print("{}|Len|\"{}\"".format(itemid, al),
            file=stdout)
        print("{}|Aen|\"{}\"".format(itemid, label),
            file=stdout)

Efforts after this description was made: [2](Emu), [3] (Lin)

UniProt ID but no encoding gene

2019-Aug-15
Code to view UniProt-associated items:

SELECT ?item ?itemLabel 
WHERE
{
  ?item wdt:P352 ?uid .
  #?item wdt:P703 wd:Q15978631 .
  MINUS { ?item wdt:P702 [] } .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" . }
}

number of hits: 29730 (2019-Aug-16)
reason is mainly that the organism/gene is uninteresting. Restricting to human (331), mouse (13), rat (1)
reasons for those: gene/protein is in doubt/unknown, UniProt on gene entry, mobile element (=no database gene entry), part of antibody (why?)
manual fixes necessary for some of these. Remaining are: human (323), mouse (8)

instance + subclass of protein

2019-Aug-05
Description: User_talk:GeneDBot#modeling_issue
but see Wikidata_talk:WikiProject_Molecular_biology#bulk_statement_deletion

SELECT ?stmt
WHERE 
{
  ?item wdt:P31 wd:Q8054.
  ?item p:P279 ?stmt. 
  ?stmt ps:P279 wd:Q8054.
}

Permanent maintenance jobs

obsolete UniProt IDs not marked

2019-Aug-30, hits: 22,427 (3.5%) of 645,897 items with UniProt ID
batches: #17928, #17929, #17930, #17935, #17946, #17947
2020-May-01: 6712 new cases, fixed via https://github.com/rwst/wikidata-molbio/blob/master/obsolete-uniprots.py so this becomes a maintenance job

obsolete GO entries

2019-Dec-11
89 F items have obsolete GO ids, fixed manually 2020-Jan-04
now a maintenance bot module, see https://github.com/rwst/wikidata-molbio/blob/master/obsolete-gos.py

Duplicate UniProt IDs

2019-Jul-31
now a maintenance bot module, see https://github.com/rwst/wikidata-molbio/blob/master/dedup-uniprot.py

inexact GO synonyms

2019-Dec-11
the bot adds/added all synonyms, not only the exact ones. Also, not all items have their GO id as alias. This lists the aliases:
User_talk:ProteinBoxBot#GO_"synonym"
function/process/component aliases: 132,441 (1,816 of these GO ids), all-language: 143,607. Exact aliases in GO: 90,393. Really? What are the remaining 40k?
also reported at https://github.com/SuLab/GeneWikiCentral/issues/131
now part of maintenance: https://github.com/rwst/wikidata-molbio/blob/master/go-sync-exact-aliases.py
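
To look at the aliases that are not exact GO synonyms, a set difference per GO id is enough. The sketch below is not the maintenance script linked above; the function name, the input shapes (dicts from GO id to a set of strings), and the rule that the GO id itself counts as an acceptable alias are assumptions for illustration.

```python
def non_exact_aliases(item_aliases, exact_synonyms):
    """Return, per GO id, the item aliases that are not exact GO synonyms."""
    extra = {}
    for go_id, aliases in item_aliases.items():
        # The GO id itself is treated as an acceptable alias (assumption).
        allowed = set(exact_synonyms.get(go_id, ())) | {go_id}
        leftover = set(aliases) - allowed
        if leftover:
            extra[go_id] = sorted(leftover)
    return extra


# Placeholder ids and names, not real GO data.
item_aliases = {
    "GO:placeholder1": {"GO:placeholder1", "alpha activity", "alpha-like"},
    "GO:placeholder2": {"beta binding"},
}
exact_synonyms = {
    "GO:placeholder1": {"alpha activity"},
    "GO:placeholder2": {"beta binding"},
}
print(non_exact_aliases(item_aliases, exact_synonyms))
```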

proteins with instance-of a main family

potential idiotic mergers

SELECT DISTINCT ?item
{
  VALUES ?class { wd:Q84467700 wd:Q67015883 wd:Q417841 wd:Q7251477 wd:Q67101749 wd:Q68461428 }
  ?item wdt:P31 ?class.
  ?item wdt:P31 wd:Q8054.
}

Fixed in database

Objects being instance of both gene and protein

2019-Jul-31
Number of objects affected: 29 (2019-Jul-31), 0 (2019-Aug-01)
Code to view affected proteins:

SELECT (COUNT(?item) AS ?count) WHERE {
#SELECT DISTINCT ?item WHERE {
  ?item wdt:P31 wd:Q8054 .
  ?item wdt:P31 wd:Q7187 .
}

Efforts after this description was made: manual edits around 29 items (until 2019-aug-01)

Natural peptides without encoding genes/taxon/UniProt

2019-Aug-05
Description: see title. Mostly they have en-WP articles with UniProt/Interpro/Pfam entries. Peptides are niche articles, were all scraped from WP anyway (14 with, 4 without en-WP article).
can be manually resolved

SELECT DISTINCT ?item ?article ?itemLabel WHERE {
  ?item wdt:P31 wd:Q172847 .
#  ?article schema:about ?item  ; schema:inLanguage "en"
#  FILTER NOT EXISTS { ?wen schema:about ?item ; schema:inLanguage "en" }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}

Gene/protein associations with different taxa

2019-Aug-01
CAUTION: closely related bacterial strains may have identical proteins; these should be excluded
Number of objects affected: 3368 (2019-Aug-01) (C.botulinum diff. strains: 2758, B.anthracis diff. strains: 606)
Code to view affected proteins:

SELECT (COUNT(?itemp) AS ?count) WHERE {
#SELECT DISTINCT ?itemp ?itemg ?taxp ?taxg WHERE {
  ?itemp wdt:P31 wd:Q8054 .
  ?itemp wdt:P702 ?itemg .
  ?itemp wdt:P703 ?taxp .
  ?itemg wdt:P703 ?taxg .
  FILTER (?taxp != ?taxg)
  }

Efforts after this description was made: one case human/mouse edited (2019-Aug-01)

UniProt ID but no taxon

2019-Aug-15
Code to view UniProt-associated items:

SELECT ?item ?itemLabel 
WHERE
{
  ?item wdt:P352 ?uid .
  MINUS { ?item wdt:P703 [] } .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" . }
}

number of hits: 28, manually resolved

Duped labels (proteins)

2019-Jul-26
Description: Proteins having labels consisting of two words, differing only by case, or not at all. Usually one of the words is identical to an alias.
Example: [4] has the label "lmo0063 lmo0063"
Talk page(s): 1 2
Reason (guess): unknown
Problem recognized by bot/batch maintainer: unknown
Bug in bot fixed: unknown
Number of proteins affected: 11434 (2019-Jul-27), 0 (2019-Aug-17)
Code to view affected proteins:

NOTE: this gives a query timeout now (2019-Aug-16), so should be checked differently before resolving!

#SELECT (COUNT(?item) AS ?count) WHERE {
SELECT DISTINCT ?item ?itemLabel (lang(?itemLabel) AS ?itemLabel_lang) ?str1 ?str2 WHERE {
  ?item wdt:P31 wd:Q8054 .
  ?item rdfs:label ?itemLabel .
  FILTER CONTAINS(?itemLabel, " ") .
  BIND (STRBEFORE(?itemLabel, " ") AS ?str1) .
  BIND (STRAFTER(?itemLabel, " ") AS ?str2) .
  FILTER (STRLEN(?str1) = STRLEN(?str2)) .
  FILTER (?str1 = ?str2 || LCASE(?str1) = LCASE(?str2)) .
}

Proposed fix: for each such (object,label) tuple, replace the label with the uppercase version of the two words. Rationale: protein symbols are always uppercase (Ref: 1).
Script that produces a QuickStatement batch from above data:

from sys import *
import csv

def upcase(s): return s[:1].upper() + s[1:]

reader = csv.DictReader(stdin, delimiter=',')
for item in reader:
    itemstr = item.get('item')
    itemid = itemstr[itemstr.rfind('/')+1:]
    lang = item.get('itemLabel_lang')
    str1 = upcase(item.get('str1'))
    str2 = upcase(item.get('str2'))
    if (str1 == str2):
        print("{}|L{}|\"{}\"".format(itemid, lang, str1),
            file=stdout)

Efforts after this description was made: QS #16460 #16461
Alternative way to check database: get all protein labels

SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31 wd:Q8054 .
  ?item rdfs:label ?itemLabel. FILTER( LANG(?itemLabel)="en" )
}

and run through:

from sys import *
import csv

reader = csv.DictReader(stdin, delimiter='\t')
for item in reader:
    itemstr = item.get('itemLabel')
    if itemstr is None:
        continue
    pos = itemstr.find(' ')
    if pos<0:
        continue
    str1 = itemstr[:pos]
    str2 = itemstr[pos+1:]
    if (str1 == str2):
        print(itemstr)

Duped labels (genes)

2019-Jul-30
Description: Genes having labels consisting of two identical words. Usually one of the words is identical to an alias.
Example: [5] has the label "lmo0819 lmo0819"
Talk page(s): 1 2
Reason (guess): unknown
Problem recognized by bot/batch maintainer: unknown
Bug in bot fixed: unknown
Number of genes affected: 11560 (2019-Jul-30), 0 (2019-Aug-17)
Code to view affected items:

NOTE: this gives a query timeout now (2019-Aug-16), so should be checked differently before resolving!

#SELECT (COUNT(?item) AS ?count) WHERE {
SELECT DISTINCT ?item ?itemLabel (lang(?itemLabel) AS ?itemLabel_lang) ?str1 ?str2 WHERE {
  ?item wdt:P31 wd:Q7187 .
  ?item rdfs:label ?itemLabel .
  FILTER CONTAINS(?itemLabel, " ") .
  BIND (STRBEFORE(?itemLabel, " ") AS ?str1) .
  BIND (STRAFTER(?itemLabel, " ") AS ?str2) .
  FILTER (STRLEN(?str1) = STRLEN(?str2)) .
  FILTER (?str1 = ?str2) .
} # TIMEOUT!

Proposed fix: for each such (object,label) tuple, replace the label with the single word.
Script that produces a QuickStatement batch from above data:

from sys import *
import csv

reader = csv.DictReader(stdin, delimiter=',')
for item in reader:
    itemstr = item.get('item')
    itemid = itemstr[itemstr.rfind('/')+1:]
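    lang = item.get('itemLabel_lang')
    # str1/str2 are the two label halves returned by the query above
    str1 = item.get('str1')
    str2 = item.get('str2')
    if (str1 == str2):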
        print("{}|L{}|\"{}\"".format(itemid, lang, str1),
            file=stdout)

Efforts after this description was made: #16494
Alternative way to check database: get all gene labels

SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31 wd:Q7187 .
  ?item rdfs:label ?itemLabel. FILTER( LANG(?itemLabel)="en" )
}

and run through:

from sys import *
import csv

reader = csv.DictReader(stdin, delimiter='\t')
for item in reader:
    itemstr = item.get('itemLabel')
    if itemstr is None:
        continue
    pos = itemstr.find(' ')
    if pos<0:
        continue
    str1 = itemstr[:pos]
    str2 = itemstr[pos+1:]
    if (str1 == str2):
        print(itemstr)

Human genes without Entrez gene ID

2019-08-19

SELECT ?item ?itemLabel ?geneid WHERE {
  ?item wdt:P31 wd:Q7187 .
  ?item wdt:P703 wd:Q15978631 .
  MINUS { ?item wdt:P351 [] } .
  ?item rdfs:label ?itemLabel. FILTER( LANG(?itemLabel)="en" )
}

showed 4 hits which were resolved manually

Objects with HGNC symbol but not instance of anything

2019-Aug-19

SELECT ?item ?itemLabel ?geneid WHERE {
  ?item wdt:P353 ?dum .
  MINUS { ?item wdt:P31 [] } .
  ?item rdfs:label ?itemLabel. FILTER( LANG(?itemLabel)="en" )
}

Hits: 186 (2019-Aug-19), 0 (2019-Aug-20)
wildly different objects, all have sitelinks, quick creations?
manually resolved

Genes: missing OMIM gene entry

2019-Aug-21
Using an OMIM/Entrez index download and the output of

SELECT ?item ?itemLabel ?geneid WHERE {
  ?item wdt:P31 wd:Q277338 .
#  ?item wdt:P31 wd:Q7187 .
  ?item wdt:P703 wd:Q15978631 .
  ?item wdt:P351 ?geneid .
  ?item rdfs:label ?itemLabel. FILTER( LANG(?itemLabel)="en" )
}

we used the script https://gist.github.com/rwst/760ee4454d306c4f619053bf5798becd to create the QS batches #17650, #17652, #17675, #17676, #17679, #17680, #17682

diseases with "anatomical location" instead of "location"

2019-Aug-23

SELECT DISTINCT ?item ?itemLabel WHERE {
	?statement wikibase:hasViolationForConstraint wds:P927-22a699be-4c01-5ee5-4295-81f6ac028a65 .
	?item ?p ?statement .
       ?item wdt:P279+ wd:Q12136 .
	FILTER( ?item NOT IN ( wd:Q4115189, wd:Q13406268, wd:Q15397819 ) ) .
	SERVICE wikibase:label { bd:serviceParam wikibase:language "en" } .
}

12 hits, manually resolved

Reactome pathway labels with semicolon substitution

2019-Aug-23... that's what happens when you pipe strings containing commas through csv

SELECT ?item ?itemLabel
WHERE
{
  ?item wdt:P31 wd:Q4915012 .
  ?item rdfs:label ?itemLabel .
  FILTER( LANG(?itemLabel)="en" ) .
  FILTER( CONTAINS(?itemLabel, ";") ) .
}

81 hits, script:

from sys import *
import csv

reader = csv.DictReader(open('t.tab', 'r'), delimiter='\t')
for item in reader:
    iturl = item.get('item')
    it = iturl[iturl.rfind('/')+1:]
    l = item.get('itemLabel')
    print('{}|Len|"{}"'.format(it, l.replace(';',',')))

#17750
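
The exact step at which the commas were replaced is not recorded here. As a general illustration of the pitfall mentioned at the top of this section, the sketch below (with a made-up label and placeholder item id) shows that joining fields with a bare comma breaks a label containing commas, while Python's csv module quotes such fields and restores them unchanged.

```python
import csv
import io

# Made-up row: placeholder item id and a label that contains a comma.
row = ["Q_X", "transport of A, B and C"]

# Naive joining loses the field boundary: the label is split in two.
naive = ",".join(row)
print(naive.split(","))

# csv.writer quotes the field, so csv.reader restores it unchanged.
buf = io.StringIO()
csv.writer(buf).writerow(row)
buf.seek(0)
restored = next(csv.reader(buf))
print(restored)
```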

superfluous subclass-of-gene without ref

SELECT ?item
WHERE
{
  ?item wdt:P279 wd:Q277338 .
  ?item p:P279 ?c .
  ?c ps:P279 wd:Q7187 .
  MINUS {?c prov:wasDerivedFrom [pr:P248 ?ref]} .
}

#17833 (201)

ncRNA with "encodes"

SELECT ?item ?itemLabel
WHERE
{
  ?item wdt:P279 wd:Q427087 .
  ?item wdt:P688 ?d .
}

55 hits, manually resolved, some unresolvable like Q18051378

specialize subclass claims

general case

SELECT ?item ?itemLabel ?inst ?instLabel
WHERE
{
  ?item wdt:P279 wd:Q8054 .
  ?item wdt:P31 ?inst .
  ?inst wdt:P279 wd:Q8054 .
  ?item rdfs:label ?itemLabel . FILTER( LANG(?itemLabel)="en" )
  ?inst rdfs:label ?instLabel . FILTER( LANG(?instLabel)="en" )
}

#17740

proteins: specialized from GOA function (transporter activity --> transport protein)

SELECT ?item ?itemLabel ?inst ?instLabel
WHERE
{
  ?item wdt:P279 wd:Q8054 .
  ?item wdt:P31 wd:Q8054 .
  ?item wdt:P680 ?inst .
  ?inst wdt:P279+ wd:Q14864384 .
  ?item rdfs:label ?itemLabel . FILTER( LANG(?itemLabel)="en" )
  ?inst rdfs:label ?instLabel . FILTER( LANG(?instLabel)="en" )
}

#17745, #17746, #17772, #17805, #17826, #17844, #17882, #17884

subclass of proteins and encoded without instance of protein

s2 = set(open('wd-subc-of-prot', 'r').readlines())
s3 = set(open('wd-inst-of-prot', 'r').readlines())
s4 = set(open('wd-encoded', 'r').readlines())
for i in s2.intersection(s4).difference(s3):
    print(i)

50 cases, manually fixed 2019-Aug-31, leaving 3 valid exceptions

RefSeq protein missing valid UniProt

2019-Sep-01

from sys import *
import csv

reader = csv.DictReader(open('refseqp-wd.tab', 'r'), delimiter='\t')
refs = {}
dups = set()
for item in reader:
    uid = item.get('refseq')
    iturl = item.get('item')
    it = iturl[iturl.rfind('/')+1:]
    # refs maps item -> RefSeq ID; an item seen with a second, different ID is a duplicate
    git = refs.get(it)
    if git is None or git == uid:
        refs[it] = uid
    else:
        dups.add(it)
for k in dups:
    refs.pop(k)

reader = csv.DictReader(open('uniprot-refseq.tab', 'r'), delimiter='\t')
unips = {}
dups = set()
for item in reader:
    uid = item.get('uniprot')
    if '-' in uid:
        continue
    ref = item.get('refseq')
    if ref.find('.') > -1:
        ref = ref[:ref.find('.')]
    # unips maps RefSeq ID -> UniProt ID; compare against the current UniProt ID
    git = unips.get(ref)
    if git is None or git == uid:
        unips[ref] = uid
    else:
        dups.add(ref)
for k in dups:
    unips.pop(k)

ids = set(l.rstrip() for l in open('wd-refseq-without-uniprot', 'r').readlines())
for it in ids:
    r = refs.get(it)
    if r is not None:
        u = unips.get(refs[it])
        if u is not None:
            print('{}|P352|"{}"'.format(it, u))

hits: 4,598 all from MicrobeBot imports of Myxococcus and Chlamydia; that leaves those without valid UniProt
batch: #17948

genes with EC number

SELECT ?item ?ec WHERE {
  ?item wdt:P591 ?ec .
  ?item wdt:P2888 ?url .
  FILTER CONTAINS(STR(?url), 'ncbigene')
}

272 hits, script:

import csv

reader = csv.DictReader(open('wd-genes-with-ec.tab', 'r'), delimiter='\t')
for item in reader:
    ec = item.get('ec')
    iturl = item.get('item')
    it = iturl[iturl.rfind('/')+1:]
    print('-{}|P591|"{}"'.format(it, ec))

batch #17966

Stubs from early days

2019-Dec-02
The following queries come up with a lot of items dumped from enwiki people; common theme: "found in taxon" "Homo sapiens"

SELECT DISTINCT ?p WHERE {
  ?p wdt:P703 wd:Q15978631 .
  ?article schema:about ?p ;
           schema:isPartOf <https://en.wikipedia.org/> .
  MINUS {
    ?p p:P703 ?stmt.
    ?stmt prov:wasDerivedFrom []
  }
  MINUS { ?p wdt:P31 wd:Q16521 }
  MINUS { ?p wdt:P31 wd:Q11173 }
  MINUS { ?p wdt:P31 wd:Q420927 }
  MINUS { ?p wdt:P31 wd:Q37748 }
  MINUS { ?p wdt:P31 wd:Q7187 }
  MINUS { ?p wdt:P31 wd:Q78782478 }
  MINUS { ?p wdt:P352 [] }
  MINUS { ?p wdt:P639 [] }
}

and "subclass+ protein"

SELECT DISTINCT ?p ?pLabel WHERE {
  ?p wdt:P279+ wd:Q8047 .
  ?article schema:about ?p ;
           schema:isPartOf <https://en.wikipedia.org/> .
  MINUS { ?p wdt:P31 wd:Q417841 }
  MINUS { ?p wdt:P31 wd:Q67015883 }
  MINUS { ?p wdt:P31 wd:Q67101749 }
  MINUS { ?p wdt:P31 wd:Q49695242 }
  MINUS { ?p wdt:P31 wd:Q68461428 }
  MINUS { ?p wdt:P352 [] }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }
}

about 500 items were manually integrated, finished 2019-12-29

Stubs from early days II

SELECT DISTINCT ?item ?itemLabel
{
  ?article 	schema:about ?item ;
			schema:isPartOf <https://en.wikipedia.org/> .
  ?item  wdt:P279+  wd:Q8054.
  MINUS {
    ?item wdt:P31 []
    }
  ?item ?prop ?val.
  FILTER (STRSTARTS(STR(?prop), 'http://www.wikidata.org/prop/direct/') && ?prop != wdt:P646 && ?prop != wdt:P279)
  	SERVICE wikibase:label { bd:serviceParam wikibase:language "en" } .
}

finished 2020-Feb-19

Stubs from early days III

SELECT DISTINCT ?p ?pLabel ?ec
{
    ?p wdt:P591 ?ec.
    MINUS {
      ?p wdt:P31 [].
    }
    SERVICE wikibase:label { bd:serviceParam wikibase:language "en" } .
}

finished 2020-Mar-11

Duplicate exact external ids

GO functions

SELECT DISTINCT ?item1 ?item1Label ?funcLabel ?item2 ?item2Label 
{
    ?item1 p:P680 [ ps:P680 ?func; pq:P4390 wd:Q39893449; ].
    ?item2 p:P680 [ ps:P680 ?func; pq:P4390 wd:Q39893449; ].
    FILTER (?item1 != ?item2 && STR( ?item1 ) < STR( ?item2 )).
  	SERVICE wikibase:label { bd:serviceParam wikibase:language "en" } .
}

finished 2020-Feb-29

GO entities without P31

SELECT DISTINCT ?item ?goid
{
  ?item wdt:P686 ?goid.
  MINUS {
    { ?item wdt:P31 wd:Q5058355 } UNION { ?item wdt:P31 wd:Q2996394 } UNION { ?item wdt:P31 wd:Q14860489 }
  }
}

most of them obsolete, should be either merged or tagged; finished 2020-Jul-1

Silly nl descriptions by Edoderoobot

2019-Aug-02
Description: Proteins having (nl)descriptions consisting of "proteïne in XYZ" where XYZ is a GeneDB protein ID, usually one of the ids of the protein item itself.
Example: https://www.wikidata.org/w/index.php?title=Q62305547&oldid=990962798 got "proteïne in EmuJ_001072400.1"
Talk page(s): 1
Reason (guess): unknown
Problem recognized by bot/batch maintainer: unknown
Bug in bot fixed: unknown
In the following, the example species is Echinococcus multilocularis. It makes sense to partition the task by species because at the moment it appears that only proteins scraped from GeneDB are affected, and these exist only for a few invertebrates. Also the query server has much to do, and we want to be nice.
Number of proteins affected: 10668 for Emu alone...
Code to view affected proteins:

#SELECT (COUNT(?item) AS ?count) WHERE {
SELECT DISTINCT ?item ?itemLabel ?itemDesc ?str2 WHERE {
 ?item wdt:P31 wd:Q8054 .
 ?item wdt:P703 wd:Q669922 .
 ?item schema:description ?itemDesc .
 FILTER CONTAINS(?itemDesc, "proteïne in ") .
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "nl" .
  }
 }

Proposed fix: for each affected species, manually check affected items and replace the description with "proteïne in species XYZ".
Script that produces a QuickStatement batch from above data:

from sys import *
import csv

reader = csv.DictReader(stdin, delimiter='\t')
for item in reader:
    itemstr = item.get('item')
    itemid = itemstr[itemstr.rfind('/')+1:]
    label = item.get('itemLabel')
    desc = item.get('itemDesc')
    if desc[:12] == "proteïne in " and desc[12:] == label:
        print("{}|Dnl|\"{}\"".format(itemid, "proteïne in Echinococcus multilocularis"),
            file=stdout)

Efforts after this description was made: [6]
I think these were fixed by Edoderoobot

WP Orphans (was: MS IDs on early entries)

mostly domains which should have been placed on IPR domain items

SELECT ?item ?itemLabel WHERE {
  ?item wdt:P279 wd:Q8054.
  ?item wdt:P6366 ?ms.
  MINUS { ?item wdt:P31 [] }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}

this was resolved by editing all orphans (2260 in en, de?), see catscan.py