SCIdude

"things, not strings"

Amit Singhal

Hello, I'm trying to improve the molbio part of Wikidata by manual and batch editing. Although I am a software developer (main language C++), I have prepared many books for Project Gutenberg (Q22673), contributed to German Wikipedia (Q48183) from 2006 to 2012 (as User:Ayacop), and biocurated extensively for GOA (Q28018111) and Reactome (Q2134522).

Ralf Stephan (Q67363620)

Authority control

Babel user information

de-N	Dieser Benutzer spricht Deutsch als Muttersprache.

en-3	This user has advanced knowledge of English.

fr-1	Cet utilisateur dispose de connaissances de base en français.

la-1	Hic usor simplici lingua Latina conferre potest.

ru-0	Этот участник не владеет русским языком (или понимает его с трудом).

it-0	Questo utente non è in grado di comunicare in italiano (o lo capisce solo con notevole difficoltà).

This user is a member of WikiProject Microbiology.

This user is a member of WikiProject Molecular biology.

This user is a member of WikiProject Chemistry.

This user is a member of WikiProject COVID19.

Current ideas:

Current TODO list:

add refs to rotavirus literature main subjects / uses that we did
use P10228 (facilitates flow of)

Also:

instance of protein fragment with "of"
construct accessible pipe for verifying TCDB ID of proteins / families
complexes from PRotein Ontology
MeSH protein entries are usually species-independent. Check heuristically and use
connect Reactome entities with existing families
Arabidopsis and Dictyostelium import
PMCREF: use https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=pmc&linkname=pmc_refs_pubmed&retmode=json&id=2685584
use "substrate of"
subc-of-enzyme inhibitor + phys. interact. with XY --> inhibitor of XY
subc-of-agonist + phys. interact. with XY --> agonist of XY
use subc to describe pgroups exactly
UniProt protein families
sync prot-->part of-->enzfam if exact molfunc is annotated
for every GO complex, list parts and make subunit families
multifunctional enzymes?
some proteins encoded by same gene, mark as variants
InterPro and superfamily items with description "InterPro Domain" --> really are domain superfamilies
check IPR items for correct Pfam (via IPR), also move Pfam from other item
GO items with a changed label are suspected to be the result of Wikipedians' fumbling
if TCDB fam X subclass-of TCDB fam Y --> missing reference dbhierarchy heuristics
Reactome candidate sets missing "has part"
peptidases with endopep func
IUPHAR IDs without Wikidata, anyone?
IUPHAR family IDs, anyone?
membranome classes https://membranome.org/
add "stated as" qual. to ChEBI ids of amino acids / their zwitterions; make special constraint including this
BindingDB ids?
missing OMIM phenotypes, e.g. 1?
OMIM phenotypic series, see their FAQ
orthology group ids/groups, see bot issue
do all ions have charge in their label?
next MONDO sync?
German labels from Brockhaus
remove em-dashes from labels
items with dewiki but without enwiki and en-label
industry processes don't list all reactants/modifiers under "has part"
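
A minimal sketch for the PMCREF item above, assuming the elink JSON response nests the cited PubMed IDs under "linksets" / "linksetdbs" / "links" (check this against a live response before relying on it). The function names are hypothetical. The URL builder reproduces the query from the list, and the parser works on an already decoded JSON object, so nothing here touches the network:

```python
from __future__ import annotations

from urllib.parse import urlencode

# Base URL taken from the TODO entry above.
EUTILS_ELINK = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi"


def pmc_refs_url(pmc_id: str) -> str:
    """Build the elink URL that asks for the PubMed references of a PMC article."""
    params = {
        "dbfrom": "pmc",
        "linkname": "pmc_refs_pubmed",
        "retmode": "json",
        "id": pmc_id,
    }
    return f"{EUTILS_ELINK}?{urlencode(params)}"


def extract_pubmed_ids(payload: dict) -> list[str]:
    """Collect PubMed IDs from a decoded elink response.

    Assumption: the layout is linksets -> linksetdbs -> links.
    Missing keys are treated as "no references".
    """
    pmids: list[str] = []
    for linkset in payload.get("linksets", []):
        for linksetdb in linkset.get("linksetdbs", []):
            if linksetdb.get("linkname") == "pmc_refs_pubmed":
                pmids.extend(str(link) for link in linksetdb.get("links", []))
    return pmids
```

The returned PMIDs could then be matched against existing items to fill in "cites work" statements; fetching and rate limiting are left out on purpose.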

In the manual attempt to create/curate WD items of cleavage products (fragments) of proteins I worked around preproinsulin (Q7240673), angiotensinogen (Q267200), Ghrelin and obestatin prepropeptide (Q66216544), proglucagon (Q66310097), proopiomelanocortin (Q418896), Cerebellin 1 precursor (Q21115606), Natriuretic peptide B (Q422288), Endothelin 1 (Q66361339), Apelin (Q2386988), Tachykinin precursor 1 (Q21123080), Secretogranin II (Q21105303), Thymosin beta 4 X-linked (Q7799643), Vasoactive intestinal peptide (Q66499176), VGF nerve growth factor inducible (Q21122290), augurin precursor (Q66535298), Chromogranin A (Q3698322), Cathelicidin antimicrobial peptide (Q411181)

What I'm doing is roughly this:

if gene and protein are in one item, duplicate to get separate items (moving sitelinks first to the protein)
remove wrong statements on either (e.g. no PDB/protein IDs/GOA function/localization annotations on genes), make sure the gene has at most GO process annotations
create/check all relevant fragment objects, move statements to the resp. item: EnsemblP should be on prepro/pro
separate out aliases to resp. objects
add "has part" with all fragments to prepro object
complete "encodes/encoded by" everywhere
add "exact match" qualifier to fragment UniProt like e.g. https://www.uniprot.org/uniprot/Q9UBU3#PRO_0000019202
add Reactome, ChEBI, ChEMBL, IUPHAR IDs to fragment if existing (Reactome labels like GENE(1-100) also to fragment aliases)
add "part of" Reactome process or reaction if missing
(maybe) move GOA function annotations to resp. fragment if applicable
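
The "has part" step above lends itself to batch editing. A rough sketch, assuming the tab-separated QuickStatements V1 format (item, property, value) and P527 for "has part"; the function name is hypothetical and the fragment QIDs have to be supplied by whoever runs it:

```python
import re

HAS_PART = "P527"  # Wikidata property "has part"
QID_PATTERN = re.compile(r"^Q[1-9][0-9]*$")


def has_part_statements(precursor_qid, fragment_qids):
    """Return one QuickStatements V1 line per fragment of the precursor item."""
    for qid in [precursor_qid, *fragment_qids]:
        if not QID_PATTERN.match(qid):
            raise ValueError(f"not a valid item ID: {qid!r}")
    # Each line reads: <prepro item> TAB P527 TAB <fragment item>
    return [f"{precursor_qid}\t{HAS_PART}\t{frag}" for frag in fragment_qids]
```

Joining the returned lines with newlines gives text that can be pasted into a batch. Duplicate checks and references are not handled here.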

misc edit

{{section resolved|~~~~}} {{Q|21105303}}

https://www.mediawiki.org/wiki/Wikibase/Indexing/RDF_Dump_Format#Data_model