User:ProteinBoxBot/Microbial gene and protein items

Introduction

ProteinBoxBot maintains information about genes, diseases, and drugs in Wikidata. Each of these three domains is handled by its own sub-process of the main bot.

The objective of the Microbial gene and protein sub-process is to add information about genes and proteins of microbial origin to Wikidata and to keep that information up to date. A discussion has been initiated on the Project Molecular Biology talk page.

 
Figure 1. A microbial gene item in Wikidata (blue) and the structure of its linkage (through QIDs and properties) to the organism item of origin (green) and the protein item it encodes (orange). Solid black lines indicate WD Properties and dashed black lines indicate WD Property Qualifiers.

Current Scope

The set of entities maintained by this bot is determined by their presence in the expert-curated NCBI Entrez Gene database.

At present, the bot is limited to genes and proteins from bacteria and will be expanded to include microbial genes of non-bacterial origin.

Items maintained by this bot

  • Bacterial genes. All of them can be listed with a query for items whose taxon is bacteria and that have some value for Entrez Gene ID.
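The query described in the list item above is not reproduced on this page. As a minimal sketch, assuming the Wikidata Query Service and its predefined `wd:` and `wdt:` prefixes, it could be written in SPARQL and built from Python as shown here. The function name `build_bacterial_gene_query` is hypothetical; the property and item IDs (P703, P351, Q10876) are the ones listed on this page. If gene items point at a species or strain item rather than at bacteria directly, the taxon pattern would need to be adapted.

```python
# IDs taken from this page: found in taxon, Entrez Gene ID, bacteria.
FOUND_IN_TAXON = "P703"
ENTREZ_GENE_ID = "P351"
BACTERIA = "Q10876"


def build_bacterial_gene_query(limit=None):
    """Return a SPARQL query for items with taxon bacteria and any Entrez Gene ID.

    Hypothetical helper: it only builds the query text and does not contact
    any server.
    """
    lines = [
        "SELECT ?gene ?entrez WHERE {",
        f"  ?gene wdt:{FOUND_IN_TAXON} wd:{BACTERIA} .",
        # Binding ?entrez to any value expresses "some value for Entrez Gene ID".
        f"  ?gene wdt:{ENTREZ_GENE_ID} ?entrez .",
        "}",
    ]
    if limit is not None:
        lines.append(f"LIMIT {int(limit)}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_bacterial_gene_query(limit=10))
```

The resulting text can be pasted into a SPARQL query editor; keeping a `limit` while experimenting avoids requesting every bacterial gene at once.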

Gene properties planned for this bot

| Property | Description | Datatype | Expected value (if not listed, see property definition) |
|---|---|---|---|
| P279 | subclass of | Item | Should always include gene (Q7187) |
| P351 | Entrez Gene ID | String | Should exist for EVERY item processed by this bot. Property will include concurrent Entrez IDs for each strain of bacterial species |
| P644 | Genomic start | String | Should exist for EVERY item processed by this bot. Property will include concurrent Genomic starts for each strain of bacterial species |
| P645 | Genomic end | String | Should exist for EVERY item processed by this bot. Property will include concurrent Genomic ends for each strain of bacterial species |
| P703 | found in taxon | Item | Currently should only include bacteria (Q10876) |
| P353 | Gene symbol | String | |
| P688 | encodes | Item | |
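The expectations in the table above can be turned into a simple consistency check. The following is a minimal sketch, not the bot's actual code: it assumes a gene item has already been reduced to a plain dictionary mapping property IDs to lists of values, and the name `check_gene_item` and that dictionary shape are hypothetical.

```python
# Properties the table marks as required for EVERY gene item.
REQUIRED_GENE_PROPERTIES = ("P351", "P644", "P645")
GENE_QID = "Q7187"       # expected among the "subclass of" (P279) values
BACTERIA_QID = "Q10876"  # currently the only expected "found in taxon" (P703) value


def check_gene_item(claims):
    """Return a list of problems found; an empty list means the item passes.

    `claims` is a hypothetical plain-dict view of one item:
    property ID -> list of values (strings or QIDs).
    """
    problems = []
    for pid in REQUIRED_GENE_PROPERTIES:
        if not claims.get(pid):
            problems.append(f"missing required property {pid}")
    if GENE_QID not in claims.get("P279", []):
        problems.append(f"P279 does not include {GENE_QID}")
    unexpected_taxa = [t for t in claims.get("P703", []) if t != BACTERIA_QID]
    if unexpected_taxa:
        problems.append(f"P703 has unexpected values: {unexpected_taxa}")
    return problems
```

Returning a list of messages rather than raising on the first failure lets a maintenance run report every issue with an item in one pass.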

The 'encodes' property links gene items to items specifically about the protein, RNA, or other 'product' of the gene. A single gene corresponds to a particular region of a genome that is related to some set of functions. These functions are carried out by the gene's products, and different products may perform vastly different functions. We therefore keep functional information off the gene item itself and attach it to the product items wherever possible. (See discussion.)

Protein properties planned for this bot

| Property | Description | Datatype | Expected value (if not listed, see property definition) |
|---|---|---|---|
| P279 | subclass of | Item | One of: Protein (Q8054), RNA (Q11053), non-coding RNA (Q427087), .. |
| P702 | encoded by | Item | Should exist for EVERY item processed by this bot |
| P352 | UniProt ID | String | Should exist for EVERY item processed by this bot |
| P638 | PDB ID | String | |
| P637 | RefSeq Protein ID | String | |
| P705 | Ensembl Protein ID | String | |
| P681 | Cell Component | Item | |
| P682 | Biological Process | Item | |
| P680 | Molecular Function | Item | |
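Because gene items point to their products through 'encodes' (P688) and product items point back through 'encoded by' (P702), one natural check is that each link has its counterpart. This page does not state that the bot enforces this, so the sketch below is only an illustration under that assumption. The function name and the two dictionary inputs are hypothetical in-memory views, not a Wikidata API.

```python
def find_unreciprocated_links(genes, products):
    """Yield (gene_qid, product_qid) pairs whose 'encodes' link has no match.

    `genes` maps a gene QID to the product QIDs in its P688 claims.
    `products` maps a product QID to the gene QIDs in its P702 claims.
    A pair is yielded when the product is missing from `products` or does
    not list the gene under 'encoded by'.
    """
    for gene_qid, encoded in genes.items():
        for product_qid in encoded:
            if gene_qid not in products.get(product_qid, []):
                yield gene_qid, product_qid
```

The same function can be run in the opposite direction by passing the P702 mapping as the first argument and the P688 mapping as the second.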

Data sources

The bot will retrieve its content from the following trusted sources:

References

  1. http://www.ncbi.nlm.nih.gov/pubmed/23175613