Hello, I'm trying to improve the molecular biology part of Wikidata through manual and batch editing. I am a software developer (main language C++), but I have also prepared many books for Project Gutenberg (Q22673), contributed to German Wikipedia (Q48183) from 2006 to 2012 (as User:Ayacop), and biocurated extensively for UniProt-GOA (Q28018111) and Reactome (Q2134522).

Ralf Stephan (Q67363620)

Babel user information
de-N This user speaks German as a native language.
en-3 This user has advanced knowledge of English.
fr-1 This user has basic knowledge of French.
la-1 This user can contribute with a simple level of Latin.
ru-0 This user has no knowledge of Russian (or understands it with considerable difficulty).
it-0 This user has no knowledge of Italian (or understands it with considerable difficulty).
This user is a member of WikiProject Microbiology.
This user is a member of WikiProject Molecular biology.
This user is a member of WikiProject Chemistry.

Current ideas:

Illustration of Wikidata gene items properties (2019-08).svg
Illustration of Wikidata protein items properties (2019-08).svg
  • https://www.wikidata.org/wiki/Special:Contributions/GeneDBot
  • User:SCIdude/Imports
  • User:SCIdude/Protein bugs
  • User:SCIdude/Modeling
  • MeSH protein entries are usually species-independent. Check and use
  • check duplicate exact molecular functions / EC / TCDB
  • if TCDB fam X subclass-of TCDB fam Y --> add reference dbhierarchy heuristics
  • families associated with repeats, conserved sites
  • auxiliary families are not transport families
  • for every GO complex, list parts and make subunit families
  • families without ipr may be groups
  • ipr enzyme families with skos.broad should not have MeSH, MICKEY...
  • interpro and superfamily have description "InterPro Domain"
  • MSCRAP, Pfam, Foundational Model of Anatomy ID without inst
  • use IPR for CAZypedia exact matches, e.g. http://www.cazypedia.org/index.php/Glycoside_Hydrolase_Family_101
  • check IPR items for correct Pfam (via IPR)
  • complete IPR molfunc
  • instance of protein fragment with "of"
  • for all proteins having-part a domain, make them part-of the associated family
  • check for new molbio WP articles by having a weekly query, diffs
  • construct accessible pipe for verifying TCDB ID of proteins / families
  • aliases bot
  • import https://www.ebi.ac.uk/complexportal/home for linking reactome complexes
    • use cpx-homo.tab and UniProt IDs to associate CPX ids with Reactome complex ids
  • complexes from PRotein Ontology
  • ChEBI import completion, with class hierarchy
  • use "substrate of"
  • UniProt protein families
  • multifunctional enzymes?
  • IUPHAR IDs without Wikidata, anyone?
  • IUPHAR family IDs, anyone?
  • membranome classes https://membranome.org/
  • BindingDB ids?
  • dbSNP import?
  • missing OMIM phenotypes, e.g. 1?
  • OMIM phenotypic series, see their FAQ
  • OMA orthology group ids/groups, see Property_talk:P684#not_efficient_spacewise
  • next MONDO sync?
  • German labels from Brockhaus
  • items with dewiki but without enwiki and en-label
  • industry processes don't have-parts all reactants/modifiers
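
One of the ideas above (items with dewiki but without enwiki and en-label) could be checked with a query against the Wikidata Query Service. The following is only a sketch: the function name and parameters are made up for illustration, and restricting the search to instances of protein (Q8054) is my assumption to keep the query small, not something stated in the list. It only builds the query text; pasting the result into the query service is left to the reader.

```python
from typing import Optional


def dewiki_without_enwiki_query(instance_of: Optional[str] = "Q8054",
                                limit: int = 100) -> str:
    """Build a SPARQL query for items that have a dewiki sitelink but
    neither an enwiki sitelink nor an English label.

    instance_of: optional class QID used to narrow the search
                 (Q8054 = protein is an assumed default); None disables it.
    """
    class_filter = f"  ?item wdt:P31 wd:{instance_of} .\n" if instance_of else ""
    return (
        "SELECT ?item ?article WHERE {\n"
        f"{class_filter}"
        # the item must have a German Wikipedia article
        "  ?article schema:about ?item ;\n"
        "           schema:isPartOf <https://de.wikipedia.org/> .\n"
        # ... but no English Wikipedia article
        "  FILTER NOT EXISTS {\n"
        "    ?en schema:about ?item ;\n"
        "        schema:isPartOf <https://en.wikipedia.org/> .\n"
        "  }\n"
        # ... and no English label
        "  FILTER NOT EXISTS {\n"
        "    ?item rdfs:label ?label .\n"
        "    FILTER(LANG(?label) = \"en\")\n"
        "  }\n"
        "}\n"
        f"LIMIT {limit}"
    )


if __name__ == "__main__":
    print(dewiki_without_enwiki_query())
```

Without a class restriction the query may well time out, which is why the sketch defaults to one class at a time.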

While manually creating and curating Wikidata items for cleavage products (fragments) of proteins, I worked through the items around Insulin (Q7240673), Angiotensinogen (Q267200), Ghrelin and obestatin prepropeptide (Q66216544), Glucagon (Q66310097), Proopiomelanocortin (Q418896), Cerebellin 1 precursor (Q21115606), Natriuretic peptide B (Q422288), Endothelin 1 (Q66361339), Apelin (Q2386988), Tachykinin precursor 1 (Q21123080), Secretogranin II (Q21105303), Thymosin beta 4 X-linked (Q7799643), Vasoactive intestinal peptide (Q66499176), VGF nerve growth factor inducible (Q21122290), augurin precursor (Q66535298), Chromogranin A (Q3698322), and Cathelicidin antimicrobial peptide (Q411181).

What I'm doing is roughly this:

  • if a gene and its protein share one item, duplicate it to get separate items (moving sitelinks to the protein item first)
  • remove wrong statements on either (e.g. no PDB/protein IDs/GOA function/localization annotations on genes), and make sure the gene has at most GO process annotations
  • create/check all relevant fragment objects, move statements to the resp. item: EnsemblP should be on prepro/pro
  • separate out aliases to resp. objects
  • add "has part" with all fragments to prepro object
  • complete "encodes/encoded by" everywhere
  • add "exact match" qualifier to fragment UniProt IDs, e.g. https://www.uniprot.org/uniprot/Q9UBU3#PRO_0000019202
  • add Reactome, ChEBI, ChEMBL, IUPHAR IDs to the fragment if they exist (Reactome labels like GENE(1-100) also go into the fragment aliases)
  • add "part of" Reactome process or reaction if missing
  • (maybe) move GOA function annotations to resp. fragment if applicable
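
A few of the steps above are mechanical enough to sketch in code. This is not the tooling actually used, just an illustration on plain Python data. The property IDs in `PROTEIN_ONLY_PROPS` and P527 for "has part" are my assumptions about the relevant Wikidata properties, all function names are hypothetical, and the tab-separated output assumes QuickStatements V1 syntax (item, property, value).

```python
from typing import Dict, Iterable, List

# Assumed property IDs for statements that belong on the protein item,
# not on the gene item (an illustrative, incomplete selection).
PROTEIN_ONLY_PROPS = {
    "P352": "UniProt protein ID",
    "P638": "PDB structure ID",
    "P705": "Ensembl protein ID",
    "P680": "molecular function",
    "P681": "cell component",
}


def misplaced_on_gene(gene_claims: Dict[str, list]) -> List[str]:
    """Return the property IDs on a gene item that should move to the protein.

    gene_claims maps property ID -> list of values, e.g. as extracted
    from an item's JSON beforehand.
    """
    return sorted(p for p in gene_claims if p in PROTEIN_ONLY_PROPS)


def fragment_exact_match_url(accession: str, chain_id: str) -> str:
    """Build the UniProt URL used as "exact match" for a cleavage product."""
    return f"https://www.uniprot.org/uniprot/{accession}#{chain_id}"


def has_part_lines(prepro_qid: str, fragment_qids: Iterable[str]) -> List[str]:
    """Emit one tab-separated "has part" line per fragment of the prepro item."""
    return [f"{prepro_qid}\tP527\t{qid}" for qid in fragment_qids]


if __name__ == "__main__":
    # Values here are placeholders apart from the UniProt example above.
    claims = {"P31": ["Q7187"], "P352": ["Q9UBU3"], "P682": ["..."]}
    print(misplaced_on_gene(claims))
    print(fragment_exact_match_url("Q9UBU3", "PRO_0000019202"))
```

The gene check deliberately leaves biological process annotations alone, matching the rule that a gene keeps at most GO process annotations.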

