User:SCIdude/Protein bugs
Bugs in protein (Q8054) or gene (Q7187) objects, and their fixes
Unresolved
De-merge RNA and gene
- 88,855 items are instance of ncRNA; 88,621 of them are also instance of gene, with gene IDs and RNA IDs (P639) mixed on the same item
- 6,249 of them have no RefSeq RNA ID, so a separate RNA item cannot be created from the gene item alone; for now just remove the (4,331) P31s (1) and the (6,249) P279s (2). Fix the descriptions later.
Cell component duplicates from FMA
SELECT DISTINCT ?item ?label WHERE {
?item wdt:P1402 [].
MINUS { ?item wdt:P31 wd:Q5058355 }
MINUS { ?item wdt:P279 wd:Q5058355 }
MINUS { ?item schema:description ?desc.
FILTER(lang(?desc) = 'en') }
?item wdt:P279/wdt:P279* wd:Q66557947.
?item rdfs:label ?label.
FILTER(lang(?label) = 'en')
}
Also filter those that match names with GO entities.
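The name filter mentioned above could be sketched roughly as follows. This assumes the query result has been exported as rows with `item` and `label` columns and that a list of GO term names is available from elsewhere; both inputs, the function name `split_by_go_name` and the example rows are hypothetical and not part of the original workflow. Matching is case-insensitive on the whole label.

```python
def split_by_go_name(rows, go_names):
    """Split query rows into (matching, non_matching) by GO term name.

    rows: iterable of dicts with 'item' and 'label' keys (assumed export format).
    go_names: iterable of GO term names (assumed to come from a GO download).
    """
    wanted = {name.strip().lower() for name in go_names}
    matching, non_matching = [], []
    for row in rows:
        label = (row.get('label') or '').strip().lower()
        (matching if label in wanted else non_matching).append(row)
    return matching, non_matching


# Made-up example rows, for illustration only.
rows = [
    {'item': 'http://www.wikidata.org/entity/Q_EXAMPLE_1', 'label': 'Nucleolus'},
    {'item': 'http://www.wikidata.org/entity/Q_EXAMPLE_2', 'label': 'some FMA-only part'},
]
dup_candidates, rest = split_by_go_name(rows, ['nucleolus', 'cytosol'])
for row in dup_candidates:
    print(row['item'], row['label'])
```

Items in `dup_candidates` would still need a manual look before any merge, since an identical name does not guarantee an identical concept.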
Possible misplacements of MeSH ID
- MeSH entries are usually species-independent but there are >1,400 of:
SELECT ?item ?itemLabel ?mname
WHERE
{
?item wdt:P31 wd:Q8054.
?item p:P486 ?mesh.
OPTIONAL { ?mesh pq:P1810 ?mname. }
SERVICE wikibase:label { bd:serviceParam wikibase:language "en" } .
}
Reactome R-MTU imports marked as human
- 2019-Oct-01
- 21 proteins, complexes, sets from M.tuberculosis have got "found in taxon"-->human
SELECT ?p ?pLabel ?tLabel
WHERE
{
?p wdt:P703 wd:Q15978631 .
OPTIONAL{ ?p wdt:P31 ?t . }
?p wdt:P2888 ?url .
FILTER ( STRSTARTS(STR(?url), 'https://identifiers.org/reactome:R-MTU') ).
SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
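A possible way to turn the result of the query above into a QuickStatements batch is sketched below. It is only an illustration: the function names are made up, the input is assumed to be the list of `?p` entity URLs from the query download, and the correct taxon id has to be supplied by the caller (the placeholder in the example is not a real id). The leading `-` for statement removal follows the QuickStatements syntax also used elsewhere on this page.

```python
HUMAN = 'Q15978631'  # "found in taxon" value matched by the query above


def item_id(url):
    """Return the bare Q-id from an entity URL such as .../entity/Q123."""
    return url[url.rfind('/') + 1:]


def taxon_swap_commands(item_urls, correct_taxon):
    """Yield QuickStatements lines: remove P703=human, then add P703=correct_taxon.

    Duplicate URLs (the query can return one row per P31 value) are collapsed.
    """
    for qid in sorted({item_id(u) for u in item_urls}):
        yield '-{}|P703|{}'.format(qid, HUMAN)
        yield '{}|P703|{}'.format(qid, correct_taxon)


# Hypothetical input; 'Q_MTB_PLACEHOLDER' stands for the M. tuberculosis taxon item.
urls = ['http://www.wikidata.org/entity/Q900010',
        'http://www.wikidata.org/entity/Q900010']
for line in taxon_swap_commands(urls, 'Q_MTB_PLACEHOLDER'):
    print(line)
```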
Reactome protein complex duplicates
- 2019-Dec-01
- Description: PathwayBot imported protein complex items, but Reactome assigns a different ID for each location, so we have lots of duplicates, e.g. https://www.wikidata.org/wiki/Special:WhatLinksHere/Q21106702.
Preproproteins (human) pt. I
- 2019-Jul-30
- Description: Proteins that in UniProt have the keyword "Cleavage on pair of basic residues [KW-0165]" but the WD object misses the protein precursor (Q258658) class. NOTE: reviewed UniProt (SwissProt) entries contain precursors AND their products, while in TrEMBL the products have different entries.
- Example: [1]
- Talk page(s): 1
- Reason (guess): complete UniProt import was too difficult
- Number of human proteins affected: unknown
- Code to view UniProt-associated items:
SELECT ?item ?itemLabel ?uniprotid WHERE { ?item wdt:P352 ?uniprotid ; wdt:P703 wd:Q15978631 . SERVICE wikibase:label { bd:serviceParam wikibase:language "en" } }
- Proposed fix: for each member of a given set of preproprotein UniProt IDs, 1. find the corresponding WD item, and 2. add the instance statement. If the corresponding item does not exist, add it.
- Script that produces a QuickStatement batch from above data and the UniProt search TSV:
- Efforts after this description was made: manual work ongoing. Actually, we started to create fragment objects for every nontrivial fragment
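Since no script is shown for the proposed fix, here is a minimal sketch of what it might look like. It assumes the query result above was downloaded as CSV with `item` and `uniprotid` columns, and that the set of preproprotein UniProt IDs was extracted separately from the UniProt search TSV; the function name `precursor_batch` and the demo rows are invented for illustration. Creating missing items is not attempted, they are only reported.

```python
import csv
import io

PRECURSOR = 'Q258658'  # protein precursor class named in the description above


def precursor_batch(query_rows, precursor_ids):
    """Return (commands, missing).

    commands: QuickStatements lines adding P31=Q258658 to matching items.
    missing: UniProt IDs from precursor_ids that have no item in query_rows.
    """
    by_uniprot = {}
    for row in query_rows:
        url = row['item']
        qid = url[url.rfind('/') + 1:]
        by_uniprot.setdefault(row['uniprotid'], set()).add(qid)
    commands, missing = [], []
    for uid in sorted(precursor_ids):
        qids = by_uniprot.get(uid)
        if not qids:
            missing.append(uid)
            continue
        for qid in sorted(qids):
            commands.append('{}|P31|{}'.format(qid, PRECURSOR))
    return commands, missing


# Made-up rows in the shape of the query download (not real data).
demo_csv = io.StringIO(
    'item,itemLabel,uniprotid\n'
    'http://www.wikidata.org/entity/Q900001,example A,X00001\n'
    'http://www.wikidata.org/entity/Q900002,example B,X00002\n'
)
commands, missing = precursor_batch(csv.DictReader(demo_csv), {'X00001', 'X00003'})
print('\n'.join(commands))
print('no item yet:', missing)
```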
arbitrary symbols on unlocated Entrez genes
There are gene entries from Entrez where only the chromosome position is known, and so no official gene name can be given. It seems Entrez then just took symbols from OMIM, regardless of whether that was a gene or phenotype entry, and used that as the symbol. Example: TEC (Q26241247) from Entrez 100124696, where the symbol is from OMIM 227050 (Transient erythroblastopenia of childhood); this symbol collides with TEC (Q18031939).
Such items should be marked somehow, maybe genomic start/end ---> unknown
Practically irrelevant or misguided
UniProt ID but not instance of protein/peptide
- 2019-Aug-23 that was actually misguided and not necessary, see Wikidata_talk:WikiProject_Molecular_biology#usage_of_instance_on_genes/proteins
- 2019-Aug-15
- Code to view UniProt-associated items:
SELECT ?item ?itemLabel WHERE { ?item wdt:P352 ?uid . MINUS { ?item wdt:P31 [] } . SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" . } }
- Note: there may be protein groups covered by a UniProt ID
- number of hits: 23,303 (2019-Aug-15), 1 (2019-Aug-17)
- code to output QS:
from sys import *
import csv

reader = csv.DictReader(stdin, delimiter=',')
items = {}
for item in reader:
    iturl = item.get('item')
    it = iturl[iturl.rfind('/')+1:]
    print("{}|P31|Q8054".format(it))
- batches: #17507
Determination method on refs
- 2019-Jul-29
- Description: Proteins having function/process/loc statements with ref having a determination method, which triggers a scope violation. Proposed solution: Wikidata_talk:WikiProject_Molecular_biology#"determination_method"_property_on_GOA_references
- Example: https://www.wikidata.org/w/index.php?title=Q27757881&oldid=988373683
- Talk page(s): 1
- Reason (guess): unknown
- Problem recognized by bot/batch maintainer: unknown
- Bug in bot fixed: unknown
- Number of proteins affected:
- Code to view affected proteins:
#SELECT (COUNT(?item) AS ?count) WHERE {
SELECT DISTINCT ?item ?stmt1 ?meth ?ref WHERE {
?item wdt:P31 wd:Q8054 .
?item p:P680 ?stmt1 .
?stmt1 prov:wasDerivedFrom [ pr:P459 ?ref ] .
?stmt1 pq:P459 ?meth .
}
- Proposed fix:
- Script that produces a QuickStatement batch from above data:
- Efforts after this description was made:
Aliases
- see Wikidata:Bot_requests#Protein_aliases
- alias identical to label, remove
- alias: "hypothetical protein", move/append to description
- UniProt actually lists this as name...
- also aliases of form Uniprot:xyz (insulin)
GeneDB ID as label
- 2019-Aug-02
- Description: Proteins having (en)labels identical to their GeneDB ID.
- Example: https://www.wikidata.org/w/index.php?title=Q62305547&oldid=990962483 got "EmuJ_001072400.1" instead of "hypothetical protein" (which is the name given by UniProt)
- Talk page(s): 1
- Reason (guess): unknown
- Problem recognized by bot/batch maintainer: unknown
- Bug in bot fixed: unknown
- In the following, the example species is Echinococcus multilocularis. It makes sense to partition the task by species because at the moment it appears that only proteins scraped from GeneDB are affected, and these exist only for a few invertebrates. Also, the query server is heavily loaded, and we want to be nice.
- Number of proteins affected: 10659 for Emu alone...
- Code to view affected proteins:
SELECT DISTINCT ?item ?itemLabel ?itemAl WHERE { ?item wdt:P31 wd:Q8054 . ?item wdt:P703 wd:Q669922 . ?item wdt:P3382 ?str2 . ?item rdfs:label ?itemLabel . FILTER (STR(?itemLabel) = ?str2) . ?item skos:altLabel ?itemAl . SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . } }
- Proposed fix: for each affected species, manually check affected items and replace the label with the alias.
- Script that produces a QuickStatement batch from above data:
from sys import *
import csv

reader = csv.DictReader(stdin, delimiter='\t')
for item in reader:
    itemstr = item.get('item')
    itemid = itemstr[itemstr.rfind('/')+1:]
    label = item.get('itemLabel')
    al = item.get('itemAl')
    if (al[:9] != "expressed" and al[:9] != "conserved" and al[:12] != "hypothetical"):
        print("{}|Len|\"{}\"".format(itemid, al), file=stdout)
        print("{}|Aen|\"{}\"".format(itemid, label), file=stdout)
UniProt ID but no encoding gene
- 2019-Aug-15
- Code to view UniProt-associated items:
SELECT ?item ?itemLabel WHERE {
?item wdt:P352 ?uid .
#?item wdt:P703 wd:Q15978631 .
MINUS { ?item wdt:P702 [] } .
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" . }
}
- number of hits: 29730 (2019-Aug-16)
- reason is mainly that the organism/gene is uninteresting. Restricting to human (331), mouse (13), rat (1)
- reasons for those: gene/protein is in doubt/unknown, UniProt on gene entry, mobile element (=no database gene entry), part of antibody (why?)
- manual fixes necessary for some of these. Remaining are: human (323), mouse (8)
instance + subclass of protein
- 2019-Aug-05
- Description: User_talk:GeneDBot#modeling_issue
- but see Wikidata_talk:WikiProject_Molecular_biology#bulk_statement_deletion
SELECT ?stmt
WHERE
{
?item wdt:P31 wd:Q8054.
?item p:P279 ?stmt.
?stmt ps:P279 wd:Q8054.
}
Permanent maintenance jobs
obsolete UniProt IDs not marked
- 2019-Aug-30, hits: 22,427 (3.5%) of 645,897 items with UniProt ID
- batches: #17928, #17929, #17930, #17935, #17946, #17947
- 2020-May-01: 6712 new cases, fixed via https://github.com/rwst/wikidata-molbio/blob/master/obsolete-uniprots.py so this becomes a maintenance job
obsolete GO entries
- 2019-Dec-11
- 89 F items have obsolete GO ids, fixed manually 2020-Jan-04
- now a maintenance bot module, see https://github.com/rwst/wikidata-molbio/blob/master/obsolete-gos.py
Duplicate UniProt IDs
- 2019-Jul-31
- now a maintenance bot module, see https://github.com/rwst/wikidata-molbio/blob/master/dedup-uniprot.py
inexact GO synonyms
- 2019-Dec-11
- the bot adds/added all synonyms, not only the exact ones. Also, not all items have their GO id as alias. This lists the aliases:
- User_talk:ProteinBoxBot#GO_"synonym"
- function/process/component aliases: 132,441 (1,816 of these GO ids), all-language: 143,607. Exact aliases in GO: 90,393. Really? What are the remaining 40k?
- also reported at https://github.com/SuLab/GeneWikiCentral/issues/131
- now part of maintenance: https://github.com/rwst/wikidata-molbio/blob/master/go-sync-exact-aliases.py
proteins with instance-of a main family
- potential idiotic mergers
SELECT DISTINCT ?item
{
VALUES ?class { wd:Q84467700 wd:Q67015883 wd:Q417841 wd:Q7251477 wd:Q67101749 wd:Q68461428 }
?item wdt:P31 ?class.
?item wdt:P31 wd:Q8054.
}
Fixed in database
Objects being instance of both gene and protein
- 2019-Jul-31
- Number of objects affected: 29 (2019-Jul-31), 0 (2019-Aug-01)
- Code to view affected proteins:
SELECT (COUNT(?item) AS ?count) WHERE {
#SELECT DISTINCT ?item WHERE {
?item wdt:P31 wd:Q8054 .
?item wdt:P31 wd:Q7187 .
}
- Efforts after this description was made: manual edits around 29 items (until 2019-aug-01)
Natural peptides without encoding genes/taxon/UniProt
- 2019-Aug-05
- Description: see title. Mostly they have en-WP articles with UniProt/Interpro/Pfam entries. Peptides are niche articles, were all scraped from WP anyway (14 with, 4 without en-WP article).
- can be manually resolved
SELECT DISTINCT ?item ?article ?itemLabel WHERE {
?item wdt:P31 wd:Q172847 .
# ?article schema:about ?item ; schema:inLanguage "en"
# FILTER NOT EXISTS { ?wen schema:about ?item ; schema:inLanguage "en" }
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
Gene/protein associations with different taxa
- 2019-Aug-01
- CAUTION: closely related bacterial strains may have identical proteins; these should be excluded
- Number of objects affected: 3368 (2019-Aug-01) (C.botulinum diff. strains: 2758, B.anthracis diff. strains: 606)
- Code to view affected proteins:
SELECT (COUNT(?itemp) AS ?count) WHERE {
#SELECT DISTINCT ?itemp ?itemg ?taxp ?taxg WHERE {
?itemp wdt:P31 wd:Q8054 .
?itemp wdt:P702 ?itemg .
?itemp wdt:P703 ?taxp .
?itemg wdt:P703 ?taxg .
FILTER (?taxp != ?taxg)
}
- Efforts after this description was made: one case human/mouse edited (2019-Aug-01)
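One conceivable way to apply the caution about bacterial strains is to post-filter the rows of the (uncommented) SELECT DISTINCT variant of the query. The sketch below assumes a separately obtained mapping from each taxon to its parent taxon; that mapping, the function names and the example ids are all hypothetical. A pair is treated as "near" when one taxon is the parent of the other or when both share the same direct parent, which is a deliberately crude criterion.

```python
def is_near_strain_pair(taxp, taxg, parent_of):
    """True if the two taxa are parent/child or share a direct parent."""
    parent_p = parent_of.get(taxp)
    parent_g = parent_of.get(taxg)
    if taxp == parent_g or taxg == parent_p:
        return True
    return parent_p is not None and parent_p == parent_g


def drop_near_strains(rows, parent_of):
    """Keep only rows whose protein and gene taxa are not closely related."""
    return [r for r in rows
            if not is_near_strain_pair(r['taxp'], r['taxg'], parent_of)]


# Made-up identifiers, for illustration only.
parent_of = {'strain_A': 'species_X', 'strain_B': 'species_X'}
rows = [
    {'itemp': 'protein_1', 'itemg': 'gene_1', 'taxp': 'strain_A', 'taxg': 'strain_B'},
    {'itemp': 'protein_2', 'itemg': 'gene_2', 'taxp': 'species_X', 'taxg': 'strain_A'},
    {'itemp': 'protein_3', 'itemg': 'gene_3', 'taxp': 'taxon_H', 'taxg': 'taxon_M'},
]
suspicious = drop_near_strains(rows, parent_of)
for r in suspicious:
    print(r['itemp'], r['itemg'], r['taxp'], r['taxg'])
```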
UniProt ID but no taxon
- 2019-Aug-15
- Code to view UniProt-associated items:
SELECT ?item ?itemLabel WHERE { ?item wdt:P352 ?uid . MINUS { ?item wdt:P703 [] } . SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" . } }
- number of hits: 28, manually resolved
Duped labels (proteins)
- 2019-Jul-26
- Description: Proteins having labels consisting of two words, differing only by case, or not at all. Usually one of the words is identical to an alias.
- Example: [4] has the label "lmo0063 lmo0063"
- Talk page(s): 1 2
- Reason (guess): unknown
- Problem recognized by bot/batch maintainer: unknown
- Bug in bot fixed: unknown
- Number of proteins affected: 11434 (2019-Jul-27), 0 (2019-Aug-17)
- Code to view affected proteins:
NOTE: this gives a query timeout now (2019-Aug-16), so should be checked differently before resolving!
#SELECT (COUNT(?item) AS ?count) WHERE {
SELECT DISTINCT ?item ?itemLabel (lang(?itemLabel) AS ?itemLabel_lang) ?str1 ?str2 WHERE {
?item wdt:P31 wd:Q8054 .
?item rdfs:label ?itemLabel .
FILTER CONTAINS(?itemLabel, " ") .
BIND (STRBEFORE(?itemLabel, " ") AS ?str1) .
BIND (STRAFTER(?itemLabel, " ") AS ?str2) .
FILTER (STRLEN(?str1) = STRLEN(?str2)) .
FILTER (?str1 = ?str2 || LCASE(?str1) = LCASE(?str2)) .
}
- Proposed fix: for each such (object,label) tuple, replace the label with the uppercase version of the two words. Rationale: protein symbols are always uppercase (Ref: 1).
- Script that produces a QuickStatement batch from above data:
from sys import *
import csv

def upcase(s):
    # capitalize the first character only
    return s[:1].upper() + s[1:]

reader = csv.DictReader(stdin, delimiter=',')
for item in reader:
    itemstr = item.get('item')
    itemid = itemstr[itemstr.rfind('/')+1:]
    lang = item.get('itemLabel_lang')
    str1 = upcase(item.get('str1'))
    str2 = upcase(item.get('str2'))
    if (str1 == str2):
        print("{}|L{}|\"{}\"".format(itemid, lang, str1), file=stdout)
- Efforts after this description was made: QS #16460 #16461
- Alternative way to check database: get all protein labels
SELECT ?item ?itemLabel WHERE { ?item wdt:P31 wd:Q8054 . ?item rdfs:label ?itemLabel. FILTER( LANG(?itemLabel)="en" ) }
and run through:
from sys import *
import csv

reader = csv.DictReader(stdin, delimiter='\t')
for item in reader:
    itemstr = item.get('itemLabel')
    if itemstr is None:
        continue
    pos = itemstr.find(' ')
    if pos<0:
        continue
    str1 = itemstr[:pos]
    str2 = itemstr[pos+1:]
    if (str1 == str2):
        print(itemstr)
Duped labels (genes)
- 2019-Jul-30
- Description: Genes having labels consisting of two identical words. Usually one of the words is identical to an alias.
- Example: [5] has the label "lmo0819 lmo0819"
- Talk page(s): 1 2
- Reason (guess): unknown
- Problem recognized by bot/batch maintainer: unknown
- Bug in bot fixed: unknown
- Number of genes affected: 11560 (2019-Jul-30), 0 (2019-Aug-17)
- Code to view affected items:
NOTE: this gives a query timeout now (2019-Aug-16), so should be checked differently before resolving!
#SELECT (COUNT(?item) AS ?count) WHERE {
SELECT DISTINCT ?item ?itemLabel (lang(?itemLabel) AS ?itemLabel_lang) ?str1 ?str2 WHERE {
?item wdt:P31 wd:Q7187 .
?item rdfs:label ?itemLabel .
FILTER CONTAINS(?itemLabel, " ") .
BIND (STRBEFORE(?itemLabel, " ") AS ?str1) .
BIND (STRAFTER(?itemLabel, " ") AS ?str2) .
FILTER (STRLEN(?str1) = STRLEN(?str2)) .
FILTER (?str1 = ?str2) .
}
# TIMEOUT!
- Proposed fix: for each such (object,label) tuple, replace the label with the single word.
- Script that produces a QuickStatement batch from above data:
from sys import *
import csv

reader = csv.DictReader(stdin, delimiter=',')
for item in reader:
    itemstr = item.get('item')
    itemid = itemstr[itemstr.rfind('/')+1:]
    lang = item.get('itemLabel_lang')
    # the two halves of the label, as returned by the query above
    str1 = item.get('str1')
    str2 = item.get('str2')
    if (str1 == str2):
        print("{}|L{}|\"{}\"".format(itemid, lang, str1), file=stdout)
- Efforts after this description was made: #16494
- Alternative way to check database: get all gene labels
SELECT ?item ?itemLabel WHERE { ?item wdt:P31 wd:Q7187 . ?item rdfs:label ?itemLabel. FILTER( LANG(?itemLabel)="en" ) }
and run through:
from sys import *
import csv

reader = csv.DictReader(stdin, delimiter='\t')
for item in reader:
    itemstr = item.get('itemLabel')
    if itemstr is None:
        continue
    pos = itemstr.find(' ')
    if pos<0:
        continue
    str1 = itemstr[:pos]
    str2 = itemstr[pos+1:]
    if (str1 == str2):
        print(itemstr)
Human genes without Entrez gene ID
- 2019-08-19
SELECT ?item ?itemLabel ?geneid WHERE { ?item wdt:P31 wd:Q7187 . ?item wdt:P703 wd:Q15978631 . MINUS { ?item wdt:P351 [] } . ?item rdfs:label ?itemLabel. FILTER( LANG(?itemLabel)="en" ) }
showed 4 hits which were resolved manually
Objects with HGNC symbol but not instance of anything
- 2019-Aug-19
SELECT ?item ?itemLabel ?geneid WHERE { ?item wdt:P353 ?dum . MINUS { ?item wdt:P31 [] } . ?item rdfs:label ?itemLabel. FILTER( LANG(?itemLabel)="en" ) }
- Hits: 186 (2019-Aug-19), 0 (2019-Aug-20)
- wildly different objects, all have sitelinks, quick creations?
- manually resolved
Genes: missing OMIM gene entry
- 2019-Aug-21
- Using an OMIM/Entrez index download and the output of
SELECT ?item ?itemLabel ?geneid WHERE {
?item wdt:P31 wd:Q277338 .
# ?item wdt:P31 wd:Q7187 .
?item wdt:P703 wd:Q15978631 .
?item wdt:P351 ?geneid .
?item rdfs:label ?itemLabel.
FILTER( LANG(?itemLabel)="en" )
}
we used the script https://gist.github.com/rwst/760ee4454d306c4f619053bf5798becd to create the QS batches #17650, #17652, #17675, #17676, #17679, #17680, #17682
diseases with "anatomical location" instead of "location"
- 2019-Aug-23
SELECT DISTINCT ?item ?itemLabel WHERE { ?statement wikibase:hasViolationForConstraint wds:P927-22a699be-4c01-5ee5-4295-81f6ac028a65 . ?item ?p ?statement . ?item wdt:P279+ wd:Q12136 . FILTER( ?item NOT IN ( wd:Q4115189, wd:Q13406268, wd:Q15397819 ) ) . SERVICE wikibase:label { bd:serviceParam wikibase:language "en" } . }
- 12 hits, manually resolved
Reactome pathway labels with semicolon substitution
- 2019-Aug-23... that's what happens when you pipe strings containing commas through csv
SELECT ?item ?itemLabel WHERE { ?item wdt:P31 wd:Q4915012 . ?item rdfs:label ?itemLabel . FILTER( LANG(?itemLabel)="en" ) . FILTER( CONTAINS(?itemLabel, ";") ) . }
- 81 hits, script:
from sys import *
import csv

reader = csv.DictReader(open('t.tab', 'r'), delimiter='\t')
for item in reader:
    iturl = item.get('item')
    it = iturl[iturl.rfind('/')+1:]
    l = item.get('itemLabel')
    print('{}|Len|"{}"'.format(it, l.replace(';',',')))
- #17750
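How exactly the semicolons got into the labels is not documented here. As a general illustration of the comma problem, the following sketch shows that Python's `csv` module round-trips a comma-containing field when it is used for both writing and reading, whereas a naive split on `,` cuts the label apart. The item id and label are made up.

```python
import csv
import io

label = 'Signaling by A, B and C'  # made-up label containing a comma

# Writing through csv.writer quotes the field, so the comma survives.
buf = io.StringIO()
csv.writer(buf).writerow(['Q900003', label])
line = buf.getvalue()

# Reading back with csv.reader restores the original two fields ...
parsed = next(csv.reader(io.StringIO(line)))

# ... whereas a naive split on ',' breaks the label into pieces.
naive = line.rstrip('\r\n').split(',')

print(parsed)
print(naive)
```

Using a tab-separated download, as the script above does with `delimiter='\t'`, sidesteps the issue for labels that contain commas but no tabs.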
superfluous subclass-of-gene without ref
SELECT ?item WHERE { ?item wdt:P279 wd:Q277338 . ?item p:P279 ?c . ?c ps:P279 wd:Q7187 . MINUS {?c prov:wasDerivedFrom [pr:P248 ?ref]} . }
- #17833 (201)
ncRNA with "encodes"
SELECT ?item ?itemLabel WHERE { ?item wdt:P279 wd:Q427087 . ?item wdt:P688 ?d . }
- 55 hits, manually resolved, some unresolvable like Q18051378
specialize subclass claims
general case
SELECT ?item ?itemLabel ?inst ?instLabel WHERE { ?item wdt:P279 wd:Q8054 . ?item wdt:P31 ?inst . ?inst wdt:P279 wd:Q8054 . ?item rdfs:label ?itemLabel . FILTER( LANG(?itemLabel)="en" ) ?inst rdfs:label ?instLabel . FILTER( LANG(?instLabel)="en" ) }
- #17740
proteins: specialized from GOA function (transporter activity --> transport protein)
SELECT ?item ?itemLabel ?inst ?instLabel WHERE { ?item wdt:P279 wd:Q8054 . ?item wdt:P31 wd:Q8054 . ?item wdt:P680 ?inst . ?inst wdt:P279+ wd:Q14864384 . ?item rdfs:label ?itemLabel . FILTER( LANG(?itemLabel)="en" ) ?inst rdfs:label ?instLabel . FILTER( LANG(?instLabel)="en" ) }
- #17745, #17746, #17772, #17805, #17826, #17844, #17882, #17884
subclass of proteins and encoded without instance of protein
s2 = set(open('wd-subc-of-prot', 'r').readlines())
s3 = set(open('wd-inst-of-prot', 'r').readlines())
s4 = set(open('wd-encoded', 'r').readlines())
for i in s2.intersection(s4).difference(s3):
    print(i)
- 50 cases, manually fixed 2019-Aug-31, leaving 3 valid exceptions
RefSeq protein missing valid UniProt
- 2019-Sep-01
from sys import *
import csv

# Map each Wikidata item to its RefSeq protein ID; drop items with conflicting IDs.
reader = csv.DictReader(open('refseqp-wd.tab', 'r'), delimiter='\t')
refs = {}
dups = set()
for item in reader:
    uid = item.get('refseq')
    iturl = item.get('item')
    it = iturl[iturl.rfind('/')+1:]
    git = refs.get(it)
    if git is None or git == uid:
        refs[it] = uid
    else:
        dups.add(it)
for k in dups:
    refs.pop(k)

# Map each RefSeq ID (version suffix stripped) to its UniProt ID; drop ambiguous ones.
reader = csv.DictReader(open('uniprot-refseq.tab', 'r'), delimiter='\t')
unips = {}
dups = set()
for item in reader:
    uid = item.get('uniprot')
    if '-' in uid:
        continue
    ref = item.get('refseq')
    if ref.find('.') > -1:
        ref = ref[:ref.find('.')]
    git = unips.get(ref)
    if git is None or git == uid:
        unips[ref] = uid
    else:
        dups.add(ref)
for k in dups:
    unips.pop(k)

# Emit a P352 statement for every item lacking a UniProt ID that can be mapped.
ids = set(l.rstrip() for l in open('wd-refseq-without-uniprot', 'r').readlines())
for it in ids:
    r = refs.get(it)
    if r is not None:
        u = unips.get(r)
        if u is not None:
            print('{}|P352|"{}"'.format(it, u))
- hits: 4,598 all from MicrobeBot imports of Myxococcus and Chlamydia; that leaves those without valid UniProt
- batch: #17948
genes with EC number
?item wdt:P591 ?ec . ?item wdt:P2888 ?url . FILTER CONTAINS(STR(?url), 'ncbigene')
- 272 hits, script:
import csv

reader = csv.DictReader(open('wd-genes-with-ec.tab', 'r'), delimiter='\t')
for item in reader:
    ec = item.get('ec')
    iturl = item.get('item')
    it = iturl[iturl.rfind('/')+1:]
    # leading '-' removes the statement in QuickStatements
    print('-{}|P591|"{}"'.format(it, ec))
- batch #17966
Stubs from early days
- 2019-Dec-02
- The following queries come up with a lot of items dumped by enwiki people; common theme: "found in taxon" "Homo sapiens"
?p wdt:P703 wd:Q15978631 .
?article schema:about ?p ;
schema:isPartOf <https://en.wikipedia.org/> .
MINUS { ?p p:P703 ?stmt. ?stmt prov:wasDerivedFrom [] }
MINUS { ?p wdt:P31 wd:Q16521 }
MINUS { ?p wdt:P31 wd:Q11173 }
MINUS { ?p wdt:P31 wd:Q420927 }
MINUS { ?p wdt:P31 wd:Q37748 }
MINUS { ?p wdt:P31 wd:Q7187 }
MINUS { ?p wdt:P31 wd:Q78782478 }
MINUS { ?p wdt:P352 [] }
MINUS { ?p wdt:P639 [] }
and "subclass+ protein"
?p wdt:P279+ wd:Q8047 .
?article schema:about ?p ;
schema:isPartOf <https://en.wikipedia.org/> .
MINUS { ?p wdt:P31 wd:Q417841 }
MINUS { ?p wdt:P31 wd:Q67015883 }
MINUS { ?p wdt:P31 wd:Q67101749 }
MINUS { ?p wdt:P31 wd:Q49695242 }
MINUS { ?p wdt:P31 wd:Q68461428 }
MINUS { ?p wdt:P352 [] }
SERVICE wikibase:label { bd:serviceParam wikibase:language "en,en" }
}
- about 500 items were manually integrated, finished 2019-12-29
Stubs from early days II
SELECT DISTINCT ?item ?itemLabel
{
?article schema:about ?item ;
schema:isPartOf <https://en.wikipedia.org/> .
?item wdt:P279+ wd:Q8054.
MINUS {
?item wdt:P31 []
}
?item ?prop ?val.
FILTER (STRSTARTS(STR(?prop), 'http://www.wikidata.org/prop/direct/') && ?prop != wdt:P646 && ?prop != wdt:P279)
SERVICE wikibase:label { bd:serviceParam wikibase:language "en" } .
}
- finished 2020-Feb-19
Stubs from early days III
SELECT DISTINCT ?p ?pLabel ?ec
{
?p wdt:P591 ?ec.
MINUS {
?p wdt:P31 [].
}
SERVICE wikibase:label { bd:serviceParam wikibase:language "en" } .
}
- finished 2020-Mar-11
Duplicate exact external ids
GO functions
SELECT DISTINCT ?item1 ?item1Label ?funcLabel ?item2 ?item2Label
{
?item1 p:P680 [ ps:P680 ?func; pq:P4390 wd:Q39893449; ].
?item2 p:P680 [ ps:P680 ?func; pq:P4390 wd:Q39893449; ].
FILTER (?item1 != ?item2 && STR( ?item1 ) < STR( ?item2 )).
SERVICE wikibase:label { bd:serviceParam wikibase:language "en" } .
}
- finished 2020-Feb-29
GO entities without P31
SELECT DISTINCT ?item ?goid
{
?item wdt:P686 ?goid.
MINUS {
{ ?item wdt:P31 wd:Q5058355 } UNION { ?item wdt:P31 wd:Q2996394 } UNION { ?item wdt:P31 wd:Q14860489 }
}
}
- most of them obsolete, should be either merged or tagged; finished 2020-Jul-1
Silly nl descriptions by Edoderoobot
- 2019-Aug-02
- Description: Proteins having (nl)descriptions consisting of "proteïne in XYZ" where XYZ is a GeneDB protein ID, usually one of the ids of the protein item itself.
- Example: https://www.wikidata.org/w/index.php?title=Q62305547&oldid=990962798 got "proteïne in EmuJ_001072400.1"
- Talk page(s): 1
- Reason (guess): unknown
- Problem recognized by bot/batch maintainer: unknown
- Bug in bot fixed: unknown
- In the following, the example species is Echinococcus multilocularis. It makes sense to partition the task by species because at the moment it appears that only proteins scraped from GeneDB are affected, and these exist only for a few invertebrates. Also, the query server is heavily loaded, and we want to be nice.
- Number of proteins affected: 10668 for Emu alone...
- Code to view affected proteins:
#SELECT (COUNT(?item) AS ?count) WHERE {
SELECT DISTINCT ?item ?itemLabel ?itemDesc ?str2 WHERE {
?item wdt:P31 wd:Q8054 .
?item wdt:P703 wd:Q669922 .
?item schema:description ?itemDesc .
FILTER CONTAINS(?itemDesc, "proteïne in ") .
SERVICE wikibase:label { bd:serviceParam wikibase:language "nl" . }
}
#}
- Proposed fix: for each affected species, manually check affected items and replace the description with "proteïne in species XYZ".
- Script that produces a QuickStatement batch from above data:
from sys import *
import csv

reader = csv.DictReader(stdin, delimiter='\t')
for item in reader:
    itemstr = item.get('item')
    itemid = itemstr[itemstr.rfind('/')+1:]
    label = item.get('itemLabel')
    desc = item.get('itemDesc')
    if desc[:12] == "proteïne in " and desc[12:] == label:
        print("{}|Dnl|\"{}\"".format(itemid, "proteïne in Echinococcus multilocularis"), file=stdout)
- Efforts after this description was made: [6]
- I think these were fixed by Edoderoobot
WP Orphans (was: MS IDs on early entries)
- mostly domains which should have been placed on IPR domain items
SELECT ?item ?itemLabel WHERE {
?item wdt:P279 wd:Q8054.
?item wdt:P6366 ?ms.
MINUS { ?item wdt:P31 [] }
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
- this was resolved by editing all orphans (2260 in en, de?), see catscan.py