User:ProteinBoxBot/2020 complex portal
Overall summary
Build a bot that creates Wikidata pages for each Complex Portal entry.
There are already 22 existing entries that should act as examples. Of these, 11 are for SARS-CoV-2 and were created during the virtual COVID-19 BioHackathon in April 2020 using OpenRefine, followed by some manual curation. Preliminary ShEx schemas were also developed (see below).
Considerations
- Update method: Complex Portal releases occur roughly every 2 months
- location: Wikidata or EBI end?
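Given the roughly two-month release cycle, one possible update method is to compare two consecutive releases and touch only what changed. The sketch below assumes each release can be reduced to a mapping from accession (for example `CPX-5742`) to some content fingerprint. The function name `diff_releases` and the fingerprint idea are illustrative assumptions, not part of Complex Portal or of the bot described here.

```python
def diff_releases(previous, current):
    """Compare two releases given as {accession: fingerprint} dicts.

    Returns the accessions that are new, changed, or no longer present.
    """
    prev_ids = set(previous)
    curr_ids = set(current)
    to_create = sorted(curr_ids - prev_ids)
    # Entries that disappeared are flagged for review rather than deleted.
    to_review = sorted(prev_ids - curr_ids)
    to_update = sorted(
        acc for acc in curr_ids & prev_ids if previous[acc] != current[acc]
    )
    return {"create": to_create, "update": to_update, "review": to_review}
```

Entries missing from the newer release are only flagged for review, since what should happen to their Wikidata items is a curation decision.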
Status
Kickoff meeting
We had an initial kickoff meeting (minutes). Moving forward:
- Complex Portal is available under a CC BY 4.0 license. We assume that, since we are not importing all of Complex Portal but only creating references and pointers to the original content, this is eligible for inclusion in Wikidata. EBI Terms of Use
- The bot will be managed by the Complex Portal team, but built by the members of this sprint group
- One of the next steps is to finalize the semantic model (Entity Schema)
- This semantic model will then drive the bot development, which will be in Python and hosted primarily on GitHub.
Participants
Gameplan
- Define and write up when two items are the same, which is needed to determine whether a new item needs to be created (done)
- Update EntitySchema for Macromolecular complex & Complex Portal entity (Andra/Jose)
- Create a draft bot to populate Wikidata with information from Complex Portal (done)
- Run the bot on a single complex: CPX-5742 SARS-CoV-2 polymerase complex ("missing" SARS-CoV-2 complex) (done)
- Adapt the bot to handle other complexes: first other coronavirus complexes, then yeast (as a publication is in preparation)
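The first step in the gameplan (deciding when two items are the same) determines whether the bot creates a new item or updates an existing one. The sketch below assumes the rule is "same Complex Portal accession ID (P7718)", which this page does not spell out. The helper names and the `CPX-<digits>` pattern are illustrative assumptions. The query relies on the `wdt:` prefix that the Wikidata Query Service predefines, and the result is expected in the standard SPARQL JSON results format.

```python
import re

# Assumed accession format, based on examples such as CPX-5742.
ACCESSION_PATTERN = re.compile(r"^CPX-\d+$")


def build_lookup_query(accession):
    """Return a SPARQL query for items carrying this accession in P7718."""
    if not ACCESSION_PATTERN.match(accession):
        raise ValueError(f"unexpected accession format: {accession!r}")
    return f'SELECT ?item WHERE {{ ?item wdt:P7718 "{accession}" . }}'


def decide_action(sparql_json):
    """Decide between create, update and review from SPARQL JSON results."""
    bindings = sparql_json["results"]["bindings"]
    # Item URIs end in the QID, e.g. http://www.wikidata.org/entity/Q123
    qids = sorted({b["item"]["value"].rsplit("/", 1)[-1] for b in bindings})
    if not qids:
        return ("create", None)
    if len(qids) == 1:
        return ("update", qids[0])
    # Several items share one accession: leave the decision to a curator.
    return ("review", qids)
```

Returning a separate "review" outcome keeps the bot from guessing when duplicates already exist.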
Properties
Property label | Property ID |
---|---|
instance of (P31) | P31 |
found in taxon (P703) | P703 |
has part(s) (P527) | P527 |
.. | .. |
Property label | Property ID |
---|---|
Complex Portal accession ID (P7718) | P7718 |
RNACentral ID (P8697) | P8697 |
.. | .. |
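As an illustration of how the properties listed above could fit together, the sketch below maps one complex to plain (property, value) pairs. It deliberately uses no bot library. The function name is hypothetical, and the class, taxon and component QIDs are left as parameters because this page does not fix them.

```python
def complex_to_statements(accession, class_qid, taxon_qid, component_qids):
    """Map one complex to (property, value) pairs for a Wikidata item."""
    statements = [
        ("P31", class_qid),    # instance of
        ("P703", taxon_qid),   # found in taxon
        ("P7718", accession),  # Complex Portal accession ID
    ]
    # has part(s): one statement per distinct component, input order kept
    for qid in dict.fromkeys(component_qids):
        statements.append(("P527", qid))
    return statements
```

This sketch drops repeated components, so it does not capture stoichiometry. How copy numbers should be modelled is left to the Entity Schema.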
Proposed
Entity Schema
- E186 Macromolecular complex
- E194 Complex Portal entity
- Complex Portal accession ID (P7718)
Bot development
In progress
Example complexes
- SARS-CoV-2 primase complex (Q90012271) - manually curated after OpenRefine import (SARS-CoV-2 primase complex)
- Pyruvate dehydrogenase E1 heterotetramer (Q50265809) - created by pathwaybot (Pyruvate dehydrogenase E1 heterotetramer (human))
- Mitochondrial respiratory chain complex I (Q50265911) - created by pathwaybot (Mitochondrial respiratory chain complex I)
Example non-coding RNA
- long non-coding RNA NONMMUT046978.2 (Q99841998) - created by andrawaag and bmeldal for property proposal Wikidata:Property_proposal/Natural_science#RNACentral_ID
Results
In progress
WikiPathways SPARQL query to list yeast complexes
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX wp: <http://vocabularies.wikipathways.org/wp#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT DISTINCT (STR(?label) AS ?complex) ?wpIdentifier ?pathway ?page
WHERE {
  ?complex a wp:Complex ;
           dct:isPartOf ?pathway .
  OPTIONAL { ?complex rdfs:label ?label }
  ?pathway dc:title ?title ;
           foaf:page ?page ;
           dc:identifier ?wpIdentifier ;
           wp:organismName "Saccharomyces cerevisiae"^^xsd:string .
}
ORDER BY ?wpIdentifier
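Since the bot is planned in Python, the query could be run from a script along the following lines. This is a minimal sketch using only the standard library. The endpoint URL `https://sparql.wikipathways.org/sparql` is an assumption to verify against the WikiPathways documentation, and the function names are illustrative.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint; check the WikiPathways documentation before relying on it.
WIKIPATHWAYS_ENDPOINT = "https://sparql.wikipathways.org/sparql"


def build_request(query, endpoint=WIKIPATHWAYS_ENDPOINT):
    """Build a GET request that asks for SPARQL JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def rows_from_results(sparql_json):
    """Flatten SPARQL JSON results into a list of {variable: value} dicts."""
    return [
        {name: cell["value"] for name, cell in binding.items()}
        for binding in sparql_json["results"]["bindings"]
    ]


def run_query(query):
    """Send the query and return flattened rows (performs a network call)."""
    with urllib.request.urlopen(build_request(query), timeout=60) as response:
        return rows_from_results(json.load(response))
```

Calling `run_query` with the query text above should give one dict per row with the keys `wpIdentifier`, `pathway` and `page`. The `complex` key is missing for complexes without a label, because the label is matched in an `OPTIONAL` clause.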