Till Sauerwein

Master's thesis - Expose

High-throughput analysis of genomes and transcriptomes has generated vast amounts of data and has dramatically increased our knowledge of the molecular foundations of life within the last decade. Currently, this knowledge is distributed over several databases and the primary literature. Linking the different data sets, and cross-referencing entries within a single data set, therefore still leaves considerable room for improvement, especially in terms of machine-readability and community-driven data curation. This gap could be closed by Wikidata, a semantic and open database developed by the Wikimedia Foundation. So far, only a few projects in the field of biology have stored biological data in Wikidata (Burgstaller-Muehlbacher et al., 2015).

The data structure of Wikidata is based on objects called “items”, which are described by so-called “statements”. An “item” could be, for example, a gene, a transcript or an organism. A “statement” consists mainly of a “property” and an assigned “value”. A “statement” about a gene could be, for example, its transcription start site or the transcript that is encoded by the gene. The “value” of a “statement” can also be another “item” instead of a simple string; this is how different “items” are connected to each other.

This master's thesis aims to represent genome annotation data as well as information about molecular interactions (e.g. transcription factor-gene, RNA-RNA and RNA-protein interactions) in Wikidata, and to develop programs that can query this data from Wikidata for relevant features and visualize it. As a first step, a data model for the biological data, based on Wikidata's data structure, has to be developed. The data model should include basic genomic features of genes and the transcripts encoded by them, as well as information about interactions between genes, transcripts and proteins.
The data model will be built on the existing structure of “properties” and “items” used by the WikiProject Molecular biology (https://www.wikidata.org/wiki/Wikidata:WikiProject_Molecular_biology). There are also first examples of bacterial genomic data in Wikidata, e.g. for Chlamydia trachomatis, which can be used as a guideline (Putman et al., 2015). In addition, the data model should record which operons, transcription units and regulons a gene belongs to.

In the second step, a few example genes, transcripts and proteins, together with the features of the data model, will be implemented in Wikidata as examples, so that the Wikidata community can discuss which features to include and how to represent them within the given Wikidata structure. The need for new “properties” will also be discussed with the community. In the third step, bots will be developed that parse locally stored bacterial genome data and submit the information to Wikidata. In the fourth step, a simple front-end web page that can perform simple queries on the data in Wikidata will be created. In the fifth step, the web page will be extended with the capability to visualize the results of executed queries.

One advantage of Wikidata's structure is that conflicting data can coexist, because multiple “values” with different references can be added to one given “property”. Another advantage is Wikidata's powerful API, which provides good machine-readability and writability. Furthermore, new data sets can easily be added to Wikidata once the bots are available, and the data model can be extended by adding new “properties”. Last but not least, Wikidata is editable by everyone, which allows new knowledge to be stored without delay. A provisional list of the genomic features and interactions to be represented by the data model is given further below.
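The “item”/“statement” structure described above, including the coexistence of conflicting “values” with different references, can be illustrated with a minimal Python sketch. This is only an in-memory toy model, not Wikidata's actual data model or API: the class names, the property names ("encodes", "transcription start site") and all example values and references are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union


@dataclass
class Statement:
    """One claim about an item: a property, a value and an optional reference."""
    prop: str
    value: Union[str, int, "Item"]  # a value may itself be another item
    reference: Optional[str] = None


@dataclass
class Item:
    """An object such as a gene, a transcript or an organism."""
    label: str
    statements: List[Statement] = field(default_factory=list)

    def add(self, prop, value, reference=None):
        """Attach a new statement; existing statements are never overwritten."""
        self.statements.append(Statement(prop, value, reference))

    def values(self, prop):
        """Return all values stored for one property, in insertion order."""
        return [s.value for s in self.statements if s.prop == prop]


# Hypothetical example data, not taken from Wikidata.
gene = Item("example gene")
transcript = Item("example transcript")
gene.add("encodes", transcript)  # an item-valued statement links two items
gene.add("transcription start site", 1200, reference="hypothetical source A")
gene.add("transcription start site", 1215, reference="hypothetical source B")
```

Here `gene.values("transcription start site")` yields both positions, each traceable to its own reference, while the "encodes" statement points directly at the transcript item rather than at a string.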
Given that the master's thesis is limited in time, and that the implementation of new “properties” depends heavily on the Wikidata community and may be time-consuming, the implementation of the web page in steps four and five is optional.
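As an illustration of the kind of simple query planned for step four, the following sketch builds a SPARQL query and the matching request URL for the Wikidata Query Service. The exposé does not prescribe SPARQL; it is assumed here as one possible route. The function names are hypothetical, the query relies on the existing Wikidata identifiers P31 (instance of), Q7187 (gene) and P703 (found in taxon), and the caller has to supply the QID of the organism of interest. No network request is made.

```python
from urllib.parse import urlencode

# Public SPARQL endpoint of the Wikidata Query Service.
SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"


def build_gene_query(taxon_qid, limit=10):
    """Return a SPARQL query listing genes found in the given taxon."""
    return (
        "SELECT ?gene ?geneLabel WHERE {\n"
        "  ?gene wdt:P31 wd:Q7187 ;\n"            # instance of: gene
        f"        wdt:P703 wd:{taxon_qid} .\n"    # found in taxon: supplied QID
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }\n'
        "}\n"
        f"LIMIT {limit}"
    )


def build_request_url(query):
    """Return the GET URL that asks the endpoint for JSON results."""
    return SPARQL_ENDPOINT + "?" + urlencode({"query": query, "format": "json"})
```

A front end could pass the resulting URL to any HTTP client and render the returned JSON bindings; how the results are visualized (step five) is left open here.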

Provisional list of genomic features to be included

Basic genomic features:

- TSS / Transcripts (in some cases only the transcription start site is known)

- Processing sites

- UTRs

- Terminators

- sRNAs

Interaction features:

- Transcription factor-gene interactions

- Transcription factor-gene interaction locations (Binding sites)

- RNA-RNA interactions

- RNA-RNA interaction locations (Binding sites)

- RNA-protein interactions

- RNA-protein interaction locations (Binding sites)

Operons, Regulons, Transcription units
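Before a bot can submit features like those listed above, it has to read them from locally stored annotation files. The exposé does not name a file format; the sketch below assumes GFF3 (nine tab-separated columns) purely for illustration. The function name, the returned dictionary keys and the sample line are hypothetical.

```python
def parse_gff3_line(line):
    """Parse one GFF3 feature line into a dict; return None for comments or blanks."""
    if not line.strip() or line.startswith("#"):
        return None
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError("expected 9 tab-separated columns, got %d" % len(cols))
    seqid, source, ftype, start, end, score, strand, phase, attrs = cols
    # The attribute column holds semicolon-separated key=value pairs.
    attributes = dict(
        pair.split("=", 1) for pair in attrs.split(";") if "=" in pair
    )
    return {
        "seqid": seqid,
        "type": ftype,
        "start": int(start),
        "end": int(end),
        "strand": strand,
        "attributes": attributes,
    }


# Made-up sample line, not real annotation data.
SAMPLE = "chr1\texample\tgene\t100\t900\t.\t+\t.\tID=gene0001;Name=exampleGene"
feature = parse_gff3_line(SAMPLE)
```

Each parsed feature could then be mapped to an “item” with “statements” for its type, coordinates and strand before being submitted by the bot.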

References

Burgstaller-Muehlbacher, S.; Waagmeester, A.; Mitraka, E.; Turner, J.; Putman, T. E.; Leong, J.; Pavlidis, P.; Schriml, L.; Good, B. M. & Su, A. I. Wikidata as a semantic framework for the Gene Wiki initiative. bioRxiv, 2015, doi: http://dx.doi.org/10.1101/032144

Putman, T.; Burgstaller, S.; Waagmeester, A.; Wu, C.; Su, A. I. & Good, B. Centralizing content and distributing labor: a community model for curating the very long tail of microbial genomes. bioRxiv, 2015, doi: http://dx.doi.org/10.1101/031286