"things, not strings"

Amit Singhal

Hello, I'm trying to improve the molecular biology part of Wikidata through manual and batch editing. Although I'm a software developer (main language C++), I have prepared many books for Project Gutenberg (Q22673), contributed to German Wikipedia (Q48183) from 2006 to 2012 (as User:Ayacop), and biocurated extensively for UniProt-GOA (Q28018111) and Reactome (Q2134522).

Ralf Stephan (Q67363620)

Authority control
Babel user information
de-N This user speaks German as a native language.
en-3 This user has advanced knowledge of English.
fr-1 This user has basic knowledge of French.
la-1 This user can contribute with simple Latin.
ru-0 This user has no knowledge of Russian (or understands it with difficulty).
it-0 This user cannot communicate in Italian (or understands it only with considerable difficulty).
This user is a member of WikiProject Microbiology.
This user is a member of WikiProject Molecular biology.
This user is a member of WikiProject Chemistry.
This user is a member of WikiProject COVID19.

Current ideas:

Illustration of Wikidata gene items properties (2019-08).svg
Illustration of Wikidata protein items properties (2019-08).svg

Current TODO list:

  • add refs to CoV-2 literature main subjects that we did
  • add missing GO entries
  • add refs to rotavirus literature main subjects / uses that we did


  • instance of protein fragment with "of"
  • construct accessible pipe for verifying TCDB ID of proteins / families
  • complexes from PRotein Ontology
  • MeSH protein entries are usually species-independent. Check heuristically and use
  • connect Reactome entities with existing families
  • Arabidopsis and Dictyostelium import
  • PMCREF: use https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=pmc&linkname=pmc_refs_pubmed&retmode=json&id=2685584
  • use "substrate of"
  • subc-of-enzyme inhibitor + phys. interact. with XY --> inhibitor of XY
  • subc-of-agonist + phys. interact. with XY --> agonist of XY
  • use subc to describe pgroups exactly
  • UniProt protein families
  • sync prot-->part of-->enzfam if exact molfunc is annotated
  • for every GO complex, list parts and make subunit families
  • multifunctional enzymes?
  • some proteins encoded by same gene, mark as variants
  • interpro and superfamily with description "InterPro Domain"---> really are domain superfamilies
  • check IPR items for correct Pfam (via IPR), also move Pfam from other item
  • GO items with a changed label are suspected to be the result of Wikipedians fumbling
  • if TCDB fam X subclass-of TCDB fam Y --> missing reference dbhierarchy heuristics
  • Reactome candidate sets missing "has part"
  • peptidases with endopep func
  • IUPHAR IDs without Wikidata, anyone?
  • IUPHAR family IDs, anyone?
  • membranome classes https://membranome.org/
  • add "stated as" qual. to ChEBI ids of amino acids / their zwitterions; make special constraint including this
  • BindingDB ids?
  • missing OMIM phenotypes, e.g. 1?
  • OMIM phenotypic series, see their FAQ
  • orthology group ids/groups, see bot issue
  • do all ions have charge in their label?
  • next MONDO sync?
  • German labels from Brockhaus
  • remove em-dashes from labels
  • items with dewiki but without enwiki and en-label
  • industrial processes don't list all reactants/modifiers under "has part"
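
For the PMCREF item above, a minimal Python sketch of how the E-utilities elink call could be wrapped. The URL and its parameters are taken from the TODO entry. The function names are made up for illustration, and the JSON layout (`linksets` / `linksetdbs` / `links`) is an assumption about the `retmode=json` response that should be checked against a live reply before relying on it.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

ELINK = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi"


def elink_url(pmc_id):
    """Build the elink URL listing the PubMed references of one PMC article."""
    params = {
        "dbfrom": "pmc",
        "linkname": "pmc_refs_pubmed",
        "retmode": "json",
        "id": pmc_id,
    }
    return ELINK + "?" + urlencode(params)


def cited_pmids(response):
    """Collect PubMed IDs from a decoded elink reply (assumed JSON layout)."""
    pmids = []
    for linkset in response.get("linksets", []):
        for linksetdb in linkset.get("linksetdbs", []):
            if linksetdb.get("linkname") == "pmc_refs_pubmed":
                pmids.extend(linksetdb.get("links", []))
    return pmids


def fetch_cited_pmids(pmc_id):
    """Network call: fetch and parse the reference list for one PMC ID."""
    with urlopen(elink_url(pmc_id)) as resp:
        return cited_pmids(json.load(resp))
```

`fetch_cited_pmids("2685584")` would issue the request from the TODO entry; the two helper functions are kept free of network access so they can be tried on saved replies first.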

While manually creating and curating Wikidata items for cleavage products (fragments) of proteins, I worked on the items around preproinsulin (Q7240673), angiotensinogen (Q267200), Ghrelin and obestatin prepropeptide (Q66216544), proglucagon (Q66310097), Proopiomelanocortin (Q418896), Cerebellin 1 precursor (Q21115606), Natriuretic peptide B (Q422288), Endothelin 1 (Q66361339), Apelin (Q2386988), Tachykinin precursor 1 (Q21123080), Secretogranin II (Q21105303), Thymosin beta 4 X-linked (Q7799643), Vasoactive intestinal peptide (Q66499176), VGF nerve growth factor inducible (Q21122290), augurin precursor (Q66535298), Chromogranin A (Q3698322), and Cathelicidin antimicrobial peptide (Q411181).

What I'm doing is roughly this:

  • if gene and protein are in one item, duplicate to get separate items (moving sitelinks first to the protein)
  • remove wrong statements on either (e.g. no PDB/protein IDs/GOA function/localization annotations on genes), make sure the gene has at most GO process annotations
  • create/check all relevant fragment objects, move statements to the resp. item: EnsemblP should be on prepro/pro
  • separate out aliases to resp. objects
  • add "has part" with all fragments to prepro object
  • complete "encodes/encoded by" everywhere
  • add "exact match" qualifier to fragment UniProt, e.g. https://www.uniprot.org/uniprot/Q9UBU3#PRO_0000019202
  • add Reactome, ChEBI, ChEMBL, IUPHAR IDs to fragment if existing (Reactome labels like GENE(1-100) also to fragment aliases)
  • add "part of" Reactome process or reaction if missing
  • (maybe) move GOA function annotations to resp. fragment if applicable
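
Two of the checks in the list above (gene and protein mixed in one item, and incomplete "encodes/encoded by" pairs) could be sketched in Python roughly as follows. This is only an illustration on an in-memory dictionary, not the procedure actually used. The property and class IDs are the Wikidata ones (P31 instance of, P688 encodes, P702 encoded by, Q7187 gene, Q8054 protein), while the function names and the item IDs in the usage example are placeholders.

```python
GENE, PROTEIN = "Q7187", "Q8054"
# encodes <-> encoded by
INVERSE = {"P688": "P702", "P702": "P688"}


def mixed_items(items):
    """Return IDs of items typed as both gene and protein (candidates for splitting)."""
    return sorted(
        qid
        for qid, claims in items.items()
        if {GENE, PROTEIN} <= set(claims.get("P31", []))
    )


def missing_inverses(items):
    """Return (subject, property, object) statements that are implied but absent."""
    missing = []
    for qid, claims in items.items():
        for prop, inverse in INVERSE.items():
            for target in claims.get(prop, []):
                if qid not in items.get(target, {}).get(inverse, []):
                    missing.append((target, inverse, qid))
    return sorted(missing)


# Toy data: item IDs here are placeholders, not real Wikidata items.
items = {
    "Q_MIXED": {"P31": [GENE, PROTEIN]},
    "Q_GENE": {"P31": [GENE], "P688": ["Q_PREPRO"]},
    "Q_PREPRO": {"P31": [PROTEIN]},
}
print(mixed_items(items))
print(missing_inverses(items))
```

On the toy data the first call flags `Q_MIXED` for splitting and the second reports that `Q_PREPRO` lacks an "encoded by" statement pointing back to `Q_GENE`.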


{{section resolved|~~~~}} {{Q|21105303}}