"things, not strings"
Amit Singhal
Hello,
I'm trying to improve the molecular biology (molbio) part of Wikidata by manual and batch editing. Although I am a software dev (main language C++), I have prepared many books for Project Gutenberg (Q22673), contributed to German Wikipedia (Q48183) from 2006 to 2012 (as User:Ayacop), and biocurated extensively for UniProt-GOA (Q28018111) and Reactome (Q2134522).
Ralf Stephan (Q67363620)
Current ideas:
- check WP links in GO
- for every GO complex, list parts and make subunit families
- eyeball GO components with 'complex', equal to subc-of-complex?
- Pfam with func carbohydrate binding --> physically interacts with carbohydrate (GO: intersection_of: GO:0005488 ! binding | intersection_of: has_input CHEBI:16646 ! carbohydrate)
- instance of protein fragment with 'of'
- construct accessible pipe for verifying TCDB ID of proteins / families
- complexes from Protein Ontology
- MeSH protein entries are usually species-independent. Check heuristically and use
- connect Reactome entities with existing families
- Arabidopsis and Dictyostelium import
- incomplete ChEBI: add reference for all InChI keys, InChI, isomeric SMILES, canonical SMILES
- incomplete ChEBI: add reference for all (subst is-a class)
- incomplete ChEBI: add reference for all (class subc-of class)
- ChEBI: import completion, full class hierarchy
- ChEBI: import completion, all substances
- ChEBI: check all substances are in their classes
- PMCREF: use https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=pmc&linkname=pmc_refs_pubmed&retmode=json&id=2685584
- use "substrate of"
- subc-of-enzyme inhibitor + phys. interact. with XY --> inhibitor of XY
- subc-of-agonist + phys. interact. with XY --> agonist of XY
- use subc to describe pgroups exactly
- UniProt protein families
- sync prot-->part of-->enzfam if exact molfunc is annotated
- multifunctional enzymes?
- some proteins encoded by same gene, mark as variants
- interpro and superfamily with description "InterPro Domain"---> really are domain superfamilies
- check IPR items for correct Pfam (via IPR), also move Pfam from other item
- GO items with a changed label are suspected to be the result of Wikipedians fumbling
- if TCDB fam X subclass-of TCDB fam Y --> missing reference dbhierarchy heuristics
- Reactome candidate sets missing "has part"
- peptidases with endopep func
- IUPHAR IDs without Wikidata, anyone?
- IUPHAR family IDs, anyone?
- Membranome classes https://membranome.org/
- add "stated as" qual. to ChEBI ids of amino acids / their zwitterions; make special constraint including this
- BindingDB ids?
- dbSNP import?
- missing OMIM phenotypes, e.g. 1?
- OMIM phenotypic series, see their FAQ
- orthology group ids/groups, see bot issue
- do all ions have charge in their label?
- next MONDO sync?
- German labels from Brockhaus
- remove em-dashes from labels
- items with dewiki but without enwiki and en-label
- industrial processes don't list all reactants/modifiers as "has part"
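The two inference ideas in the list above (subc-of enzyme inhibitor or agonist, plus a physical interaction with XY) could be sketched as a rule over plain triples. This is only an illustration under assumptions: property labels are written as plain strings rather than real property IDs, and `drugA`, `drugB` and `XY` are hypothetical placeholders, not existing items.

```python
# Role class -> statement to derive. Labels are plain strings, not real property IDs.
ROLE_RULES = {
    "enzyme inhibitor": "inhibitor of",
    "agonist": "agonist of",
}

def infer_role_statements(triples):
    """Propose (item, 'inhibitor of' / 'agonist of', target) statements from
    'subclass of <role>' combined with 'physically interacts with <target>'."""
    roles = {}    # item -> set of property labels to derive
    targets = {}  # item -> set of interaction targets
    for subj, prop, obj in triples:
        if prop == "subclass of" and obj in ROLE_RULES:
            roles.setdefault(subj, set()).add(ROLE_RULES[obj])
        elif prop == "physically interacts with":
            targets.setdefault(subj, set()).add(obj)

    existing = set(triples)
    inferred = []
    for item in sorted(roles):
        for derived_prop in sorted(roles[item]):
            for target in sorted(targets.get(item, ())):
                candidate = (item, derived_prop, target)
                # Skip statements that are already present.
                if candidate not in existing:
                    inferred.append(candidate)
    return inferred

# Hypothetical example data.
example = [
    ("drugA", "subclass of", "enzyme inhibitor"),
    ("drugA", "physically interacts with", "XY"),
    ("drugB", "subclass of", "agonist"),
    ("drugB", "physically interacts with", "XY"),
    ("drugB", "agonist of", "XY"),  # already present, so not proposed again
]
print(infer_role_statements(example))
```

The function only proposes candidates; whether a physical interaction really implies inhibition or agonism of that particular target would still need a manual check before any batch edit.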
In a manual attempt to create/curate WD items for cleavage products (fragments) of proteins, I worked on insulin (Q7240673), Angiotensinogen (Q267200), Ghrelin and obestatin prepropeptide (Q66216544), proglucagon (Q66310097), Proopiomelanocortin (Q418896), Cerebellin 1 precursor (Q21115606), Natriuretic peptide B (Q422288), Endothelin 1 (Q66361339), Apelin (Q2386988), Tachykinin precursor 1 (Q21123080), Secretogranin II (Q21105303), Thymosin beta 4 X-linked (Q7799643), Vasoactive intestinal peptide (Q66499176), VGF nerve growth factor inducible (Q21122290), augurin precursor (Q66535298), Chromogranin A (Q3698322), and Cathelicidin antimicrobial peptide (Q411181).
What I'm doing is roughly this:
- if gene and protein are in one item, duplicate to get separate items (moving sitelinks first to the protein)
- remove wrong statements on either (e.g. no PDB/protein IDs/GOA function/localization annotations on genes), make sure the gene has at most GO process annotations
- create/check all relevant fragment objects, move statements to the resp. item: EnsemblP should be on prepro/pro
- separate out aliases to resp. objects
- add "has part" with all fragments to prepro object
- complete "encodes/encoded by" everywhere
- add "exact match" qualifier to fragment UniProt, e.g. https://www.uniprot.org/uniprot/Q9UBU3#PRO_0000019202
- add Reactome, ChEBI, ChEMBL, IUPHAR IDs to fragment if existing (Reactome labels like GENE(1-100) also to fragment aliases)
- add "part of" Reactome process or reaction if missing
- (maybe) move GOA function annotations to resp. fragment if applicable
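Two of the steps above deal with fixed string shapes: Reactome labels like GENE(1-100) and UniProt fragment links like the Q9UBU3#PRO_0000019202 example. A minimal sketch of helpers for both follows. The function names are hypothetical, and the regular expression assumes the label is exactly a name followed by one residue range in parentheses; real Reactome labels may deviate from that.

```python
import re

# Assumed label shape: NAME(start-end), e.g. GENE(1-100).
_LABEL_RE = re.compile(r"^(?P<gene>[^()\s]+)\((?P<start>\d+)-(?P<end>\d+)\)$")

def parse_fragment_label(label):
    """Return (name, start, end) for a label like 'GENE(1-100)', else None."""
    m = _LABEL_RE.match(label.strip())
    if m is None:
        return None
    start, end = int(m.group("start")), int(m.group("end"))
    if start > end:
        # A reversed range is treated as malformed.
        return None
    return m.group("gene"), start, end

def uniprot_fragment_url(accession, feature_id):
    """Build a fragment link in the same shape as the example URL above."""
    return f"https://www.uniprot.org/uniprot/{accession}#{feature_id}"

print(parse_fragment_label("GENE(1-100)"))
print(uniprot_fragment_url("Q9UBU3", "PRO_0000019202"))
```

Returning None for anything that does not match keeps unusual labels out of a batch run, so they can be handled by hand.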