Hello, I'm trying to improve the molbio part of Wikidata by manual and batch editing. Although I am a software developer (main language C++), I have also prepared many books for Project Gutenberg (Q22673), contributed to German Wikipedia (Q48183) from 2006 to 2012 (as User:Ayacop), and biocurated extensively for UniProt-GOA (Q28018111) and Reactome (Q2134522).
- User:SCIdude/Protein bugs
- MeSH protein entries are usually species-independent. Check and use
- check duplicate exact molecular functions / EC / TCDB
- if TCDB fam X subclass-of TCDB fam Y --> add reference dbhierarchy heuristics
- families associated with repeats, conserved sites
- auxiliary families are not transport families
- for every GO complex, list parts and make subunit families
- families without ipr may be groups
- ipr enzyme families with skos.broad should not have MeSH, MICKEY...
- interpro and superfamily have description "InterPro Domain"
- MSCRAP, Pfam, Foundational Model of Anatomy ID without inst
- use IPR for cazypedia exact matches: eg http://www.cazypedia.org/index.php/Glycoside_Hydrolase_Family_101
- check IPR items for correct Pfam (via IPR)
- complete IPR molfunc
- instance of protein fragment with "of"
- for all proteins having-part a domain, make them part-of the associated family
- check for new molbio WP articles by having a weekly query, diffs
- construct accessible pipe for verifying TCDB ID of proteins / families
- aliases bot
- import https://www.ebi.ac.uk/complexportal/home for linking reactome complexes
- use cpx-homo.tab and UniProt IDs to associate CPX ids with Reactome complex ids
- complexes from PRotein Ontology
- ChEBI import completion, with class hierarchy
- use "substrate of"
- UniProt protein families
- multifunctional enzymes?
- IUPHAR IDs without Wikidata, anyone?
- IUPHAR family IDs, anyone?
- membranome classes https://membranome.org/
- BindingDB ids?
- dbSNP import?
- missing OMIM phenotypes, e.g. 1?
- OMIM phenotypic series, see their FAQ
- OMA orthology group ids/groups, see Property_talk:P684#not_efficient_spacewise
- next MONDO sync?
- German labels from Brockhaus
- items with dewiki but without enwiki and en-label
- industry processes don't have-parts all reactants/modifiers
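The to-do item about items with a dewiki sitelink but no enwiki sitelink and no English label could be checked with a query like the one built below. This is only a sketch under assumptions: it reads the item as "both the enwiki sitelink and the en label are missing", it restricts the search to instances of one class (Q8054, protein, by default) to keep the query small, and it relies on the prefixes that the Wikidata Query Service predefines. The function name `build_missing_en_query` is made up for illustration.

```python
import re


def build_missing_en_query(class_qid="Q8054", limit=100):
    """Return a SPARQL query string for the Wikidata Query Service.

    Finds items of the given class that have a German Wikipedia sitelink
    but neither an English Wikipedia sitelink nor an English label.
    The default class Q8054 (protein) is an assumption for illustration.
    """
    if not re.fullmatch(r"Q[1-9]\d*", class_qid):
        raise ValueError(f"not a Wikidata item ID: {class_qid!r}")
    if not isinstance(limit, int) or limit < 1:
        raise ValueError("limit must be a positive integer")
    # Doubled braces are literal braces inside the f-string.
    return f"""SELECT ?item ?article WHERE {{
  ?item wdt:P31 wd:{class_qid} .
  ?article schema:about ?item ;
           schema:isPartOf <https://de.wikipedia.org/> .
  FILTER NOT EXISTS {{
    ?enArticle schema:about ?item ;
               schema:isPartOf <https://en.wikipedia.org/> .
  }}
  FILTER NOT EXISTS {{
    ?item rdfs:label ?enLabel .
    FILTER(LANG(?enLabel) = "en")
  }}
}}
LIMIT {limit}"""


if __name__ == "__main__":
    print(build_missing_en_query())
```

The printed query can be pasted into the query service as is. Subclasses are not followed here, so a broader search would need a property path on P31/P279, which may be slow on large classes.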
In my manual attempt to create/curate WD items for cleavage products (fragments) of proteins, I have worked on Insulin (Q7240673), Angiotensinogen (Q267200), Ghrelin and obestatin prepropeptide (Q66216544), Glucagon (Q66310097), Proopiomelanocortin (Q418896), Cerebellin 1 precursor (Q21115606), Natriuretic peptide B (Q422288), Endothelin 1 (Q66361339), Apelin (Q2386988), Tachykinin precursor 1 (Q21123080), Secretogranin II (Q21105303), Thymosin beta 4 X-linked (Q7799643), Vasoactive intestinal peptide (Q66499176), VGF nerve growth factor inducible (Q21122290), augurin precursor (Q66535298), Chromogranin A (Q3698322), and Cathelicidin antimicrobial peptide (Q411181).
What I'm doing is roughly this:
- if gene and protein are in one item, duplicate it to get separate items (moving sitelinks to the protein first)
- remove wrong statements on either (e.g. no PDB/protein IDs/GOA function/localization annotations on genes), make sure the gene has at most GO process annotations
- create/check all relevant fragment objects, move statements to the resp. item: EnsemblP should be on prepro/pro
- separate out aliases to resp. objects
- add "has part" with all fragments to prepro object
- complete "encodes/encoded by" everywhere
- add "exact match" qualifier to the fragment's UniProt ID, e.g. https://www.uniprot.org/uniprot/Q9UBU3#PRO_0000019202
- add Reactome, ChEBI, ChEMBL, IUPHAR IDs to fragment if existing (Reactome labels like GENE(1-100) also to fragment aliases)
- add "part of" Reactome process or reaction if missing
- (maybe) move GOA function annotations to resp. fragment if applicable
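A few of the steps above lend themselves to small helper functions. The following is a minimal sketch, not the tooling actually used: all function names are hypothetical, the label pattern is inferred only from the "GENE(1-100)" example, and the property IDs in the table (P352, P638, P680, P681) are stated from memory and should be checked against Wikidata before use.

```python
import re

# Assumed label shape, taken from the "GENE(1-100)" example above.
_LABEL_RE = re.compile(r"^(?P<gene>[^\s()]+)\((?P<start>\d+)-(?P<end>\d+)\)$")

# Properties that, per the workflow, belong on proteins and not on genes.
# The IDs are assumptions to verify on Wikidata.
PROTEIN_ONLY_PROPERTIES = {
    "P352": "UniProt protein ID",
    "P638": "PDB structure ID",
    "P680": "molecular function",
    "P681": "cell component",
}


def parse_reactome_label(label):
    """Split a label like 'GENE(1-100)' into (gene, start, end).

    Returns None if the label does not match the assumed shape
    or if the residue range is reversed.
    """
    m = _LABEL_RE.match(label.strip())
    if m is None:
        return None
    start, end = int(m.group("start")), int(m.group("end"))
    if start > end:
        return None
    return m.group("gene"), start, end


def uniprot_chain_url(accession, chain_id):
    """Build a fragment URL in the form used in the example above."""
    return f"https://www.uniprot.org/uniprot/{accession}#{chain_id}"


def misplaced_on_gene(property_ids):
    """Return the sorted protein-only property IDs found on a gene item."""
    return sorted(p for p in set(property_ids) if p in PROTEIN_ONLY_PROPERTIES)


if __name__ == "__main__":
    print(parse_reactome_label("GENE(1-100)"))
    print(uniprot_chain_url("Q9UBU3", "PRO_0000019202"))
    print(misplaced_on_gene(["P682", "P638", "P352"]))
```

The helpers only work on strings and ID lists, so they can be tried offline before any batch edit is prepared.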