User:ProteinBoxBot/201410 sprint
Overall summary
Our goal is to make Wikidata the canonical resource for referencing and translating gene and protein identifiers. The goals of this sprint are to:
- Continue adding all identifiers for human genes and proteins from User:ProteinBoxBot/201408 sprint
- Create a generic workflow for additional resources
- Create an update bot
Participants
Background info
Human
In the last two sprints we focused on getting the genes of the human genome into Wikidata. In the end, the process consisted of two bot tasks. The first was a stub creator, which created a stub for each Entrez Gene entry. The stub contained the title of the gene, its aliases, its Entrez Gene identifier, and claims that the entry is a subclass of gene and that it is from the human species. The second bot enriched each entry with related identifiers and chromosomal positions. The overall process was quite slow because of the many API calls it made. To add a claim, the bot needs to check whether that claim already exists (1st call); if not, the claim needs to be made (2nd call), after which a reference is added (3rd call).
Creating a stub means checking whether the entry exists and, if not, creating it, adding a label, adding aliases, and stating that it is a gene from the human species with its Entrez Gene identifier, plus a reference to Entrez for each claim, which comes to 10 individual API calls. Each subsequent identifier needed 3 more API calls, so a gene with 5 identifiers would result in at least 25 API calls (10 + 5 × 3). Processing all human Entrez entries took us 6 weeks in total.
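The call count described above can be written down as a small calculation. This is only a sketch of the arithmetic: the constants come from the figures in the text, and the function name `api_calls_per_gene` is made up for illustration.

```python
# Figures taken from the description above; the names are illustrative only.
STUB_CALLS = 10           # API calls needed to create one gene stub
CALLS_PER_IDENTIFIER = 3  # existence check + claim + reference


def api_calls_per_gene(n_identifiers: int) -> int:
    """Lower bound on API calls for one gene with n additional identifiers."""
    return STUB_CALLS + CALLS_PER_IDENTIFIER * n_identifiers


print(api_calls_per_gene(5))  # 10 + 5 * 3 = 25
```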
Mouse
On the Wikidata mailing list it was suggested to use a different API call, "wbeditentity", where the whole data model of a Wikidata entry is submitted in one single API call. It is the objective of this sprint to adapt the ProteinBoxBot to use this "wbeditentity" call to increase its performance. As a test case, the genome of the house mouse will be added.
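As a rough illustration of the single-call approach, the sketch below assembles the request parameters for one new item without sending anything. It is not the ProteinBoxBot code. The helper names are hypothetical, the property and item IDs (P351 for the Entrez Gene ID, P279 for "subclass of", Q7187 for "gene") are assumptions that should be checked against Wikidata, the label and identifier in the usage lines are placeholders, and references are left out for brevity.

```python
import json

# Assumed IDs; verify against Wikidata before use.
ENTREZ_PROP = "P351"    # assumed: Entrez Gene ID
SUBCLASS_PROP = "P279"  # assumed: subclass of
GENE_ITEM = 7187        # assumed: numeric id of the item "gene" (Q7187)


def string_claim(prop, value):
    """Statement whose value is a plain string (e.g. an identifier)."""
    return {
        "mainsnak": {
            "snaktype": "value",
            "property": prop,
            "datavalue": {"value": value, "type": "string"},
        },
        "type": "statement",
        "rank": "normal",
    }


def item_claim(prop, numeric_id):
    """Statement whose value is another Wikidata item."""
    return {
        "mainsnak": {
            "snaktype": "value",
            "property": prop,
            "datavalue": {
                "value": {"entity-type": "item", "numeric-id": numeric_id},
                "type": "wikibase-entityid",
            },
        },
        "type": "statement",
        "rank": "normal",
    }


def build_stub_request(label, aliases, entrez_id, token):
    """Bundle label, aliases and claims into one wbeditentity request."""
    data = {
        "labels": {"en": {"language": "en", "value": label}},
        "aliases": {"en": [{"language": "en", "value": a} for a in aliases]},
        "claims": [
            string_claim(ENTREZ_PROP, entrez_id),
            item_claim(SUBCLASS_PROP, GENE_ITEM),
        ],
    }
    return {
        "action": "wbeditentity",
        "new": "item",
        "data": json.dumps(data),
        "token": token,
        "format": "json",
    }


# Placeholder values, not real gene data.
params = build_stub_request("example gene", ["eg1"], "0000", "EDIT-TOKEN")
print(params["action"], len(json.loads(params["data"])["claims"]))
```

The returned dictionary would be sent as a single POST to the wiki's `api.php`, in place of the separate calls for the label, the aliases and each claim.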
Disease ontology
- create stubs for all DO classes on Wikidata (Action Andra)
- create stubs for all SYMP classes on Wikidata (Action Andra)
- Manually complete the entry for Chapare hemorrhagic fever (Action Elvira)
Links
Original source data files
- Entrez Gene: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/
(all information comes in via http://mygene.info)
Game plan
Overview
- adaptation to wbeditentity API
- add mouse
- automation on schedule
- import DO
adaptation to wbeditentity API
- Refactor the existing ProteinBoxBot to use the wbeditentity API call.
add mouse
- test on 10 genes
- test on 100 genes
- test on 1000 genes
- full run
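The staged rollout above could be driven by a small helper like the following. This is a sketch under assumptions: `staged_batches` is a hypothetical name, and each stage simply re-processes a larger prefix of the gene list, which relies on the bot skipping entries that already exist.

```python
def staged_batches(gene_ids, stage_sizes=(10, 100, 1000)):
    """Yield growing prefixes of gene_ids for test runs, then the full list."""
    for size in stage_sizes:
        if size >= len(gene_ids):
            break  # the full run below already covers this stage
        yield gene_ids[:size]
    yield gene_ids  # full run


# Dummy identifiers stand in for real Entrez Gene IDs.
for batch in staged_batches(list(range(2500))):
    print(len(batch))
```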
automation on schedule
- investigate automation by Sentry
- investigate integration with the update cycle of mygene.info
import DO
- confirm all necessary properties exist
- list what is already captured in Wikidata
- prototype an example disease by hand
- write the bot
- test on 2 diseases
- test on 10 diseases
- full run