Wikidata talk:WikiProject Molecular biology/Archive 1

species

I read "human is a priority". At the Wikidata level, shouldn't we avoid this type of approach and try to be open to all organisms from the beginning? Or you should rename the task force to "Human Molecular biology task force" ;-) --Chandres (talk) 14:12, 4 March 2013 (UTC)

Yes, I think this task force should eventually target all organisms. We suggested starting with human since that nicely overlaps with our interests and expertise. But we are also interested in the Long Tail of microbial genomes too ([1]), so making sure the models/properties/tools are generalizable is definitely on our mind. And more to the point, if you and/or others are interested in other organisms, then by all means, let's actively move forward on multiple fronts! Cheers, Andrew Su (talk) 19:17, 4 March 2013 (UTC)

Let's get started?

Now that the string datatype has been implemented, seems like things are ready to move on getting some of our proposed properties approved. Anyone want to take the lead on proposing a few over on the property proposal page? Andrew Su (talk) 17:44, 7 March 2013 (UTC)

I had a little time and requested a couple of properties. All they need now is a few reviews. --Tobias1984 (talk) 10:39, 27 May 2013 (UTC)
See currently proposed properties. These properties require some specialized knowledge, but they are just string-type identifiers, and as such should not cause major structural problems. I think that if they have not been reviewed in a few days, they can safely be created. --Zolo (talk) 10:00, 29 May 2013 (UTC)
Although it would be nice if at least one person would check over the proposals and support them ;). Where are all the people of this task force? --Tobias1984 (talk) 21:14, 1 June 2013 (UTC)

Scope?

I read that you want to at least add data for 10k human genes, but we could easily imagine adding data for millions of genes in UniProt (for example). Is this the place to discuss the scope of the project? Cheers, --Dan Bolser (talk) 15:58, 12 March 2013 (UTC)

Hi Dan, absolutely, we want to scale up to all known genes and proteins in all organisms. Our local database here (a newer version of what's available at http://mygene.info) has loaded all ~8.7 million genes from NCBI's gene_info file ([2]), plus all known links to Ensembl, UniProt, PDB, GO, etc etc. So all of our infrastructure is converging on a complete loading of all knowledge. As a pilot project though, we are proposing to start with the 10k human genes that encompass the Gene Wiki, just because our team has interest and bandwidth to make sure all of that is high quality. As mentioned above, if there are eyeballs and hands to help us broaden our initial pilot, by all means, we're open to it... Cheers, Andrew Su (talk) 17:58, 12 March 2013 (UTC)

User:Ricordisamoa/AuthorityControl.js

As external links are probably important here, you may want to use and expand User:Ricordisamoa/AuthorityControl.js that automatically creates links for some string properties. --Zolo (talk) 10:15, 9 March 2013 (UTC)

Hi Zolo, just to confirm, the tool above only changes how string properties are rendered in the web interface, not how the data are represented in wikidata, correct? Assuming I'm understanding correctly, sounds great! Once we get a few of our proposed properties created, we'll add our identifiers. (For others, instructions on how to add this user script are at Wikidata:Tools#AuthorityControl.js.) Cheers, Andrew Su (talk) 23:04, 12 March 2013 (UTC)
Yes, it only changes the way things are displayed. In the longer term, it might make sense to store external IDs differently from other strings, as full statements with sources and qualifiers may not make much sense for them. Apparently, this is something the development team is thinking about, but if something gets done, it will probably not be in the next few months. --Zolo (talk) 08:15, 13 March 2013 (UTC)
Note: it is now called MediaWiki:Gadget-AuthorityControl.js and is activated by default for all users. --Zolo (talk) 17:31, 2 April 2013 (UTC)

Distinguishing between genes and proteins

On Wikipedia, proteins are covered in the same article as their corresponding gene; see BRCA1, APOE4, etc. What are others' thoughts on separating proteins and their corresponding genes into separate items on Wikidata, e.g. such that 'BRCA1' would be a subclass of gene, and 'breast cancer type 1 susceptibility protein' would be a subclass of protein? And how about distinguishing between homologous genes and proteins in different organisms? Emw (talk) 02:41, 19 March 2013 (UTC)

Great discussion point. With the Gene Wiki, we did make the decision (with en:WP:MCB) to lump all the data about the gene and the corresponding protein product(s) into a single page. That was mostly because most pages were sufficiently underdeveloped such that splitting them would have been non-productive fragmentation. I'm torn about whether to take this same strategy with Wikidata. On the one hand, splitting them into two items is probably the more "accurate" way of representing reality. However, I think consumers of the data generally won't care about the difference, so it would be one more step with doing integrative queries. For example, suppose I want to find all genes on chromosome 2 whose protein products have a kinase domain. If the gene and protein items were distinct, presumably it would require another "join" in the query. Anyway, very interested to hear others' thoughts... Cheers, Andrew Su (talk) 04:59, 19 March 2013 (UTC)
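To make the "extra join" concrete, here is a minimal Python sketch of the chromosome 2 / kinase domain query when genes and proteins are separate items. Everything in it is an assumption for illustration: the item keys, the "encodes" link, and the toy data are hypothetical and do not reflect real Wikidata content or any query API.

```python
# Hypothetical toy data: item keys, chromosome values and domain labels are
# made up for illustration and are not real Wikidata content.
genes = {
    "gene:A": {"chromosome": "2", "encodes": ["protein:A"]},
    "gene:B": {"chromosome": "2", "encodes": ["protein:B"]},
    "gene:C": {"chromosome": "7", "encodes": ["protein:C"]},
}
proteins = {
    "protein:A": {"domains": {"kinase"}},
    "protein:B": {"domains": {"zinc finger"}},
    "protein:C": {"domains": {"kinase"}},
}


def genes_with_domain(chromosome, domain):
    """Return gene keys on `chromosome` whose products carry `domain`."""
    hits = []
    for gene_id, gene in genes.items():
        if gene["chromosome"] != chromosome:
            continue
        # This lookup is the extra "join" from the gene item to its protein items.
        if any(domain in proteins[p]["domains"] for p in gene["encodes"]):
            hits.append(gene_id)
    return hits


print(genes_with_domain("2", "kinase"))
```

With a single combined item per gene/protein, the inner lookup would disappear, which is the trade-off being weighed here.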
From a semantic point of view, separating genes really makes a lot of sense to me. Especially since you can think of properties proteins have (function etc...) which do not really apply to genes. I can understand that persons who want to use the data don't really care if they look at a gene or a protein, but it doesn't seem like you would want to modify the raw data for it. Wouldn't it be possible to change the way genes and proteins are presented here on Wikidata through an extension or something like that? --TWillemsen (talk) 09:56, 25 May 2013 (UTC)
I am conflicted on this one. In this case, we have a natural immutable hierarchy that is a result of the central dogma (gene → protein → function). As Andrew has pointed out above, in the vast majority of cases, we have a single article about the gene and the protein encoded by that gene. This situation IMHO is unlikely to change since the subject of genes and proteins is so interrelated. Separate gene and protein articles, where they exist, have slowly been merged over time. Furthermore, I am not aware of a single example of a gene/protein article that has been split into two. On the other hand, gene and protein are distinct entities ... Boghog (talk) 10:59, 23 June 2013 (UTC)
Hum, the central dogma is wrong: there are a lot of examples where one gene can give different proteins (in terms of sequence, and sometimes even the name changes). You can also find examples of proteins with different functions depending on, for example, phosphorylation status. The rational solution would be to have the protein in the same item as the gene. But in the case of multiple proteins from one gene, or different functions (not multiple functions), it can be really difficult to link the different data. --Chandres (talk) 20:27, 23 June 2013 (UTC)
I also think that we can and should part ways with Wikipedia's structure in certain areas. We could for example manage the sitelinks in the gene items and create protein items that usually don't have sitelinks but can hold statements independent from the genes. Maybe we should have an RfC about this topic, and then we would have a guideline on how to manage similar cases. --Tobias1984 (talk) 21:00, 23 June 2013 (UTC)
Of course one gene can code for more than one protein (alternative splicing, post translational modification, etc.). This does not however invalidate the central dogma and the hierarchy (a one to many relationship). Boghog (talk) 21:05, 23 June 2013 (UTC)
Physically, the gene and proteins are separate entities, but I think it makes more sense to put them on one page. For the related properties it's clear if they belong to the gene or the protein. In an ideal world, we would have data on all the splice and PTM forms, but this is not the case and so pooling all the evidence (and perhaps adding qualifiers as to which form is meant) seems more manageable. MichaK (talk) 08:06, 24 June 2013 (UTC)
As some here have noted, genes and proteins are different things. They're strongly related, of course, but they're not the same thing. Wikidata is about things--its items are not encyclopedic articles. While it makes sense for Wikipedia to cover a gene and its proteins in a single article because of convenience for the humans that read the article, the same thing doesn't make sense in Wikidata. We can have as many items as we want, we can link them however we want, and we can display the data however we want. Wikidata's current user interface is not the final word, and better (and even domain-specific) interfaces can be built to display the data--in this case to put the data about genes and their proteins in one place. For an example of a different kind of Wikidata user interface, see the Reasonator. Silver hr (talk) 21:03, 24 June 2013 (UTC)
The two discussions about how Wikidata should handle gene-protein distinctions and how we should handle ortholog distinctions involve the same question. How should biological sequence information be partitioned on Wikidata? Should it be divided into many items such that one item represents one discrete type of biological entity, or divided into fewer items such that one item represents one gene product, including information like which entities the gene product derives from, etc?
Separating genes and their encoded proteins into separate items, and separating those by which organisms express them, seems like it would be more unwieldy initially. For an article like reelin, it would entail at least four Wikidata items: one for the RELN gene in humans, one for the RELN gene in mice, one for the reelin protein in humans, and one for the reelin protein in mice. To be consistent with this approach, Wikidata would need separate items for RNA (in humans and mice). And even when an EC number mapped to only one gene product, Wikidata would need a separate item for that enzyme class (see here and here for more detail). So Wikidata would need six items (and quite possibly more) to represent the information in one PBB template.
On the other hand, putting these distinct biological types into separate Wikidata items seems like it would make it easier to assign properties to each specific type. It's technically possible to distinguish between claims that apply to a gene or its product (or to its orthologs) in one large item, but that simply offloads the unwieldiness onto the claims. In other words, claims would need to be bloated with qualifiers in order to unambiguously specify which statement applied to which biological entity.
My impression is that it would be better to divide the articles in question such that one item represents one discrete type of biological entity. This seems like it might make the initial mapping of Wikidata items onto the PBB template less straightforward, but -- in time -- allow for greater expressivity about each biological entity within that template. Emw (talk) 23:38, 24 June 2013 (UTC)
Thank you for this summary, and you are right of course that the gene/protein question and the ortholog/paralog issue are highly related. It seems to me that the consensus is forming around separating all these conceptual entities into separate items (right?). I lean this way too, where my only hesitation comes from the fact that I'm largely ignorant of how the Wikidata querying system will work. Emw states that this design would "make the initial mapping of Wikidata items onto the PBB template less straightforward". I'm fine with "less straightforward" as long as it could be done (integrating data from multiple Wikidata items into a single Wikipedia template). Can someone more knowledgeable confirm that this is true? Cheers, Andrew Su (talk) 07:26, 25 June 2013 (UTC)
I forgot that I had a call with Denny this morning, so I asked the question I posed above. The answer is that currently only one-to-one mappings between wikipedia and wikidata are allowed (meaning we would not be able to import data from multiple wikidata items into a single wikipedia template), but that support for this feature is definitely planned and will likely come near the end of the year. So with that context, I think we separate out genes from proteins, and also the various orthologs. Long term that seems like the best solution to me. I'm going to make a specific proposal to this effect below... Cheers, Andrew Su (talk) 17:18, 25 June 2013 (UTC)

Human/mouse/... ID

How do we distinguish between IDs for humans and mice? Present Reelin P351 value is for humans, but the property does not describe that. Should a qualifier be used? (I know not much about non-human bioinformatics) — Finn Årup Nielsen (fnielsen) (talk) 14:55, 17 April 2013 (UTC)

Ooops, sorry I missed this post way back when. Yes, we need to better model how orthologs are handled. Personally, rather than putting it in a qualifier, I'd propose creating a separate topic for the mouse gene, and then relating them via a new property called "ortholog". There are too many functional differences to lump all orthologs into a single topic. Your thoughts? Cheers, Andrew Su (talk) 00:27, 25 May 2013 (UTC)
I just updated RELN (Q414043) with some information and I think that the P89 (P89) qualifier (mouse/human) works pretty well. We should still think about how the information should be sourced. --Tobias1984 (talk) 21:06, 13 June 2013 (UTC)
I'm still not 100% sure I agree with having the human and mouse genes represented in the same item. Many of a gene/protein's properties might be species-specific. For example suppose (completely hypothetical here) that reelin interacts with VLDL receptor in both human and mouse, but interacts with APP only in human. How would we model that? What happens if there are disagreements on what the true ortholog relationships should be? Again, I tend to favor creating a separate topic for each species-specific gene, and then linking them via an 'ortholog' property. Other thoughts? Cheers, Andrew Su (talk) 06:38, 21 June 2013 (UTC)
Different interaction in different species can be expressed with qualifiers, but it seems more tricky for disagreements over orthologs. I am no biologist, but I feel that separate items is a more manageable long-term solution. If so, should RELN (Q414043) be taken to mean "human reelin", or should "human reelin" have its own item ? I do not know if there would be much to say about reelin in general, perhaps things like its evolutionary history. --Zolo (talk) 07:12, 21 June 2013 (UTC)
Yes, I would agree that RELN (Q414043) would specifically refer to the human version (which would be noted with P89 (P89) and human (Q5)), and we would then create a new item for "mouse reelin". As far as evolutionary history, I think those reciprocal "ortholog" relationships can be encoded as statements on both items. Make sense? Cheers, Andrew Su (talk) 08:34, 21 June 2013 (UTC)
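A minimal Python sketch of the separate-items model described here, where each species-specific item carries its own statements and the two items point at each other through an "ortholog" property. The item keys, the "species" and "ortholog" keys, and the dict layout are hypothetical, and the interaction data simply mirror the made-up VLDL receptor / APP example above rather than curated facts.

```python
# Hypothetical item keys; the interaction data mirror the made-up example in
# this thread (VLDL receptor in both species, APP in human only).
items = {
    "human reelin": {
        "species": "human",
        "ortholog": {"mouse reelin"},
        "interacts with": {"VLDL receptor", "APP"},
    },
    "mouse reelin": {
        "species": "mouse",
        "ortholog": {"human reelin"},
        "interacts with": {"VLDL receptor"},
    },
}


def non_reciprocal_orthologs(items):
    """List (a, b) pairs where a names b as ortholog but b does not name a."""
    missing = []
    for name, item in items.items():
        for other in item["ortholog"]:
            if name not in items.get(other, {}).get("ortholog", set()):
                missing.append((name, other))
    return sorted(missing)


def interaction_partners(item_name):
    """Species-specific interactions need no qualifier: they sit on the item."""
    return sorted(items[item_name]["interacts with"])


print(non_reciprocal_orthologs(items))
print(interaction_partners("mouse reelin"))
```

A check like `non_reciprocal_orthologs` is the kind of maintenance a bot would need if the reciprocal statements are stored on both items.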
Hi Andrew! I made an example of how qualifiers could be used to show different interactions for different species (http://www.wikidata.org/w/index.php?title=Q4115189&oldid=51778169). The bottom two examples of VLDL protein show that additional qualifiers could be added to further describe the interaction (my weird example: Sandbox-item has physical interaction with VLDL protein in Mice, but only in 1980). --Tobias1984 (talk) 09:11, 21 June 2013 (UTC)
Hi Tobias, yes point well taken that qualifiers could be added there too. My concern though is that almost all of the properties would end up having species-specific qualifiers. For example, in addition to the ones already shown, I think regulates (molecular biology) (P128), HGNC gene symbol (P353), and RefSeq protein ID (P637) would all be candidates. In that case, it might be easier just to separate them. And on a philosophical level (even though I hate philosophical arguments), I tend to think that human reelin is in fact a different thing than mouse reelin. Your thoughts? Others' thoughts? Cheers, Andrew Su (talk) 16:52, 21 June 2013 (UTC)
The question of what would be the easier solution is difficult to answer at the moment. The problem has multiple-dimensions too. What is easier for people to view and edit; What is easier for bots to edit; What is easier to query? Just recently most people agreed that certain editions of books should have their own item. But then a query for "books written by author" will return all the editions that link to the author. That means that the query has to be more complicated than in the one-item solution. If there is something nature hates, it is parceling of information ;). Do you have time to look at Wikidata:Property_proposal/Term#RefSeq so the GSoC student can get up and running? --Tobias1984 (talk) 17:20, 21 June 2013 (UTC)
Yeah, good points. I'm personally not so concerned about the views/edits, but query complexity is definitely a downside to over-fragmentation. But I also worry that if we decide fragmentation is better later, it will be a big pain to "fix" things after we've already created 10000+ items. Ugh, no perfect solution... (And yes, I did add my support to Wikidata:Property_proposal/Term#RefSeq. Thanks for the reminder...) Cheers, Andrew Su (talk) 17:55, 21 June 2013 (UTC)
At the moment, I do not see why it would be harder to query. Presumably, most queries will be species-specific anyway.
Yes, I suppose it makes sense to link ortholog genes through an "ortholog" property, though of course, if we have many species and link all ortholog pairs across all species, we get something almost as redundant as the old interwiki system that Wikidata is supposed to avoid ;). --Zolo (talk) 07:36, 23 June 2013 (UTC)

Interesting discussion. I think this issue boils down to whether it is better to have a hierarchical vs relational data model for this type of data. A hierarchical model might make sense if the hierarchy were immutable and not subject to change over time. The ortholog hierarchy is a product of evolution, which implies that there may be some ambiguities. In cases where the corresponding orthologs within two species only have a single paralog within each of the species, there is no debate that these two proteins are orthologs. However, if there is more than one paralog of a given protein in either of the species, things can get complicated, particularly if the similarity between orthologs is comparable to the similarity between paralogs. I think looking at the HomoloGene database is instructive. Assignment of orthologs is based on a clustering algorithm. The exact assignments of orthologs can and do change slightly over time as more protein sequences from different organisms are added. Hence from a maintenance standpoint, as Andrew has already suggested, I think a relational model is better in this case (separate database entries for orthologs from different species linked by a new ortholog property). Boghog (talk) 09:41, 23 June 2013 (UTC)

Adding the paralog layer is important for the debate. In the future the system will be used for other species; when we start on the plant world, we will have orthologs with paralogs that have the same function in different localizations, or different functions in the same compartment. I really think that one entry per gene per species is more relevant in the long term. --Chandres (talk) 20:15, 23 June 2013 (UTC)
For small protein families, putting proteins of different species on one page might look feasible, but I don't think it scales well. So each protein should have its own page. Regarding the ortholog links: As we add more species, the number of necessary links explodes. I'm not really sure if there's a good representation of gene trees in the WikiData model. But in the beginning, adding mouse--human orthologs to the respective protein pages makes a lot of sense. MichaK (talk) 08:00, 24 June 2013 (UTC)
Andrew Su has alerted me to this discussion. We will have the Quest for Orthologs meeting in a few weeks, and I will try to get feedback then. In the meantime, my feeling is that creating ortholog groups makes the most sense: for a given taxonomic level (e.g., LCA of human and mouse) you get the orthologs plus any in-paralogs. This is notably implemented in a clear way in OMA; I'm sure that they can help you if needed. Marcrr (talk) 15:23, 1 July 2013 (UTC)
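As a rough back-of-the-envelope sketch of why pairwise ortholog links "explode" while ortholog groups do not, the Python below counts statements under two simplifying assumptions that are not from this discussion: exactly one ortholog per species (no paralogs), and a hypothetical group item that each gene item points to once.

```python
from math import comb


def pairwise_links(n_species):
    """Ortholog statements if every pair of items is linked in both directions."""
    return 2 * comb(n_species, 2)


def group_links(n_species):
    """Statements if each item points once at a shared ortholog-group item."""
    return n_species


for n in (2, 10, 100):
    print(n, pairwise_links(n), group_links(n))
```

Pairwise linking grows quadratically with the number of species, whereas group membership grows linearly, which is the redundancy concern raised above.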

Hi, when you are done with this decision, could you add a word about it in Help:Modeling#Molecular_biology? This help page is precisely intended to index and discuss such modeling decisions (it will probably be split in the future). It would be nice to have some use-case items, to list relevant properties, and even detail an example. And/or a link to the page on this project where this decision will be documented. TomT0m (talk) 14:03, 30 June 2013 (UTC)

Great idea. Though is there a formal way to describe a data model? We've so far been doing everything by example (e.g., RELN (Q414043)), but something more formal would certainly be good too... Cheers, Andrew Su (talk)
If we'd like to formalize the data model, I think a good approach would be to keep working out basic properties as we have been, then express that in OWL. OWL is the lingua franca for speaking about ontologies for the Semantic Web, which in my opinion is what we're building (often unwittingly) here and throughout Wikidata. Some relevant literature and resources on OWL and biological ontologies:
One of the high-level points I've taken from the literature is that the Gene Ontology's "is-a" relationship is modeled with the OWL property rdfs:subClassOf. Wikidata has a property explicitly based on that W3C recommendation: subclass of (P279). My initial impression is that it would work for all the gene and protein items to be created to have the statements "subclass of (P279) gene (Q7187)" and "subclass of (P279) protein (Q8054)". Emw (talk) 17:29, 30 June 2013 (UTC)
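A minimal Python sketch of what those "subclass of (P279)" statements would allow, treating statements as plain triples and following P279 transitively. P279, Q7187 and Q8054 are the IDs quoted above; the two subject labels are placeholders rather than real item IDs, and the triple layout is an assumption, not Wikidata's data format.

```python
# Statements as (subject, property, object) triples. The subject labels are
# placeholders, not real item IDs.
SUBCLASS_OF = "P279"
statements = [
    ("human RELN gene", SUBCLASS_OF, "Q7187"),       # gene
    ("human reelin protein", SUBCLASS_OF, "Q8054"),  # protein
]


def is_subclass_of(item, target, statements):
    """Follow P279 links transitively from `item`, looking for `target`."""
    seen, stack = set(), [item]
    while stack:
        current = stack.pop()
        if current == target:
            return True
        if current in seen:
            continue
        seen.add(current)
        stack.extend(
            obj for subj, prop, obj in statements
            if subj == current and prop == SUBCLASS_OF
        )
    return False


print(is_subclass_of("human RELN gene", "Q7187", statements))
print(is_subclass_of("human RELN gene", "Q8054", statements))
```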
Good idea to try to formalize the model. I'll start a RfC to try to establish a common language on Wikidata on how to use instance of and subclass to establish a formal type model, this will be a support for this discussion. I have a few ideas but it might take a few days before it is really on. I think this will be a good point in the discussion you try to start for quite a while :) TomT0m (talk) 10:48, 1 July 2013 (UTC)
Concerning OWL and XML, please note that there is a workgroup of the Quest for Orthologs trying to establish standards to represent orthology relations: http://questfororthologs.org/standards Marcrr (talk) 15:25, 1 July 2013 (UTC)

Pathways

Hi everyone! New member of the community here (writing from the Amsterdam Wikimedia Hackathon!). I was wondering if I could propose some properties not only for genes and proteins, but also pathways. I really think this could be a good addition. A lot of pathway data is already available, but not very structured. What are your opinions on this? If you think it is a good idea, I can create some examples here. TWillemsen (talk) 18:37, 24 May 2013 (UTC)

I think pathway information would be fantastic! And there are a few structured resources for pathways -- Pathway Commons, Wikipathways, KEGG, etc. You're probably already familiar with them, but just to be sure... Let us know how it goes! Cheers, Andrew Su (talk) 00:25, 25 May 2013 (UTC)
Yeah, I've worked with those pathways. But in my experience those resources are not really data-oriented, for good reasons of course. Anyway, I'll be proposing some properties for pathways here this weekend. I'm still very new to Wikidata, so please let me know if I don't follow guidelines :) --TWillemsen (talk) 09:41, 25 May 2013 (UTC)

Re: Support for Property Creations

Hi everyone! For the GSoC Gene Wiki project [3], we have proposed a set of properties [4] to capture the fields of the infobox [5]. Kindly take part in the property proposals [6] through your comments and/or support. --Chinmay26 (talk) 19:28, 18 June 2013 (UTC)

I can create the property as soon as there is more support. Maybe a lot of people don't have watch-list-email-notifications turned on. --Tobias1984 (talk) 10:15, 21 June 2013 (UTC)

Proposal for handling genes and proteins, and species-specific orthologs

In an attempt to summarize the consensus that I think we're reaching here and here, I propose (well, mostly reiterating and formalizing Emw's proposal) that the data from a single PBB template on Wikipedia be separated out into four Wikidata items: the human gene, the human protein, the mouse gene, and the mouse protein. Later, I think we can consider separating out the RNAs as well, but I don't think this is justified at the moment since there are few (if any) RNA-specific statements. Please lodge your support or opposition below... Cheers, Andrew Su (talk) 17:35, 25 June 2013 (UTC)

Support

Oppose

More comments

Please make sure you've reviewed the comments already made here and here.

Just to make sure it is clear, please note that User:Chinmay26 is a GSoC intern this summer. So even if you think this basic model will need to be tweaked in the future, getting consensus around the plan above will allow Chinmay to get started building the basic pieces of infrastructure for uploading and maintaining genomic data... Cheers, Andrew Su (talk) 17:35, 25 June 2013 (UTC)

Distinguishing enzymes and gene products

There's a question about the nature of EC enzyme number (P591): Property_talk:P591#Distinguishing_enzymes_and_gene_products. Feedback is welcome! Emw (talk) 00:37, 27 June 2013 (UTC)

Using RefSeq (P656)

I've mocked up the use of RefSeq (P656) over on reelin (Q13569356) and RELN (Q414043), but I'm not convinced this is the best way of structuring things. Any thoughts? Cheers, Andrew Su (talk) 20:58, 29 June 2013 (UTC)

Is there a reason we need GenBank accessions for Wikidata items? The GenBank accessions used to derive a given RefSeq accession are noted in the latter's 'COMMENT' field, see e.g. NM_011261. So my impression is that it's probably extraneous and unnecessary to include GenBank accessions in 'RNA ID' or 'Protein ID' claims, in which case those properties could be renamed to 'RefSeq RNA ID' and 'RefSeq protein ID' and we could do away with RefSeq (P656) (which is currently used as a qualifier). Emw (talk) 05:36, 30 June 2013 (UTC)
Yeah, I'm not 100% sure that we need it either. Perhaps for organisms that don't have strong RefSeq support. Perhaps we shouldn't worry about that case for now? Are there other sources of non-RefSeq RNA and protein sequences that are important? Where should we put the Ensembl transcript IDs (ENST*) and Ensembl protein IDs (ENSP*)? Not sure about this... Cheers, Andrew Su (talk) 15:49, 30 June 2013 (UTC)
I agree that we probably don't need to worry about organisms that don't have strong RefSeq support for now. If we'd like Ensembl IDs for transcripts and proteins, then I think it would make sense to have separate properties for each of those. Emw (talk) 16:56, 30 June 2013 (UTC)
(And shouldn't RELN (Q414043) and reelin (Q13569356) be switched? The Wikipedia article is about the protein, but the Wikidata item with the sitelinks is about the gene.) Emw (talk) 05:49, 30 June 2013 (UTC)
Well, here's where Wikipedia's semantic ambiguity makes things difficult. The WP article combines information about the gene and the corresponding protein. That was a conscious decision that we made at WP:MCB. Ultimately the infobox template for reelin will need to draw from all four reelin-related wikidata items (human/mouse gene/protein). So I think the link as it stands is fine, but certainly open to more discussion on how best to handle things... Cheers, Andrew Su (talk) 15:49, 30 June 2013 (UTC)
I also think that the items should be switched. Even though Wikipedia infoboxes are both about the gene and the protein, the textual content is primarily about the protein, and that is true for all languages. Actually fr:Reelin hardly even mentions the gene. --Zolo (talk) 22:00, 1 July 2013 (UTC)
I moved the sitelinks to Q13569356. --Zolo (talk) 15:31, 3 July 2013 (UTC)

New properties needed

Given the format that has been agreed upon, we need some additional properties:

  • ortholog of
  • encoded by and the symmetric property "encodes". Or just one of them ?
  • any others ?

--Zolo (talk) 15:34, 3 July 2013 (UTC)

I think we need "Taxonomy ID" to note that human (Q5) is 9606. And yes, I do think we should have the reciprocal links. The brother and sister properties are used on both ends of that relationship, so that's a similar situation, right? Cheers, Andrew Su (talk) 23:06, 3 July 2013 (UTC)
I added them to Wikidata:Property proposal/Term#Biochemistry and molecular biology / Biochemie und Molekularbiologie / Biochimie et biologie moléculaire. -Zolo (talk) 07:26, 4 July 2013 (UTC)

Sourcing requirements for bots

There's an RFC that's relevant for the GSoC project: Wikidata:Requests_for_comment/Sourcing_requirements_for_bots. Emw (talk) 17:18, 6 July 2013 (UTC)

gene, RNA, and protein identifiers

All, as we move forward with modeling gene and protein items, I want to be sure we have consensus. Right now, we are generally following a model that has database-specific properties:

However, I think there is an argument to have just three properties ("Gene ID", "RNA ID", and "Protein ID"), where the different types of identifiers are differentiated by the "Source". For example, RELN (Q414043) could have a property "Gene ID" --> "5649" (Source: National Center for Biotechnology Information (Q82494)), and also "Gene ID" --> "ENSG00000189056" (Source: Ensembl genome database project (Q1344256)). I tend to like the simplicity of this system because it will prevent explosion of the number of properties, especially as we move to other organisms (Flybase, Wormbase, Pombase, RGD, etc...). Thoughts? Cheers, Andrew Su (talk) 18:27, 9 July 2013 (UTC)
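To compare the two layouts side by side, here is a minimal Python sketch that rewrites database-specific claims as generic "Gene ID" claims carrying a source. The identifier values and source items are the ones quoted in this comment; "Ensembl gene ID" is a hypothetical property label, and the dict layout is purely illustrative rather than how Wikidata stores claims.

```python
# "Ensembl gene ID" is a hypothetical label; the dict layout is only an
# illustration, not Wikidata's storage format.
SOURCE_FOR_PROPERTY = {
    "Entrez Gene ID": "Q82494",     # National Center for Biotechnology Information
    "Ensembl gene ID": "Q1344256",  # Ensembl genome database project
}

# Database-specific model: one property per identifier type.
database_specific = {
    "Entrez Gene ID": "5649",
    "Ensembl gene ID": "ENSG00000189056",
}


def to_generic(claims):
    """Rewrite database-specific claims as generic "Gene ID" claims with a source."""
    return [
        {"property": "Gene ID", "value": value, "source": SOURCE_FOR_PROPERTY[prop]}
        for prop, value in claims.items()
    ]


def gene_ids_from(generic_claims, source):
    """Select the "Gene ID" values that come from one source."""
    return [c["value"] for c in generic_claims if c["source"] == source]


generic = to_generic(database_specific)
print(gene_ids_from(generic, "Q82494"))
```

In the generic model, adding a new database means adding a source value rather than a new property, at the cost of filtering by source on every lookup.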

There is a RfC about this topic running right now: RfC:How to classify items. --Tobias1984 (talk) 19:43, 9 July 2013 (UTC)
Yes, in order to prevent an explosion of properties, I think it would make sense to have fewer, more fundamental properties. We would of course need to agree on a standard "base" name. For genes, the most natural would be Human Genome Project (Q192446) and proteins, UniProt protein ID (P352). Boghog (talk) 20:05, 9 July 2013 (UTC)
The Classification RFC linked above seems mostly unrelated to this discussion. That discussion concerns whether we should use many domain-specific "type of" properties or two such properties to construct instance relations and subsumption hierarchies. (I happen to think we should use the latter approach.) This discussion seems to be about how to handle identifier properties, not basic membership properties.
For identifier properties, there seems to be more precedent on Wikidata to have each separate ID as its own property, rather than grouping identifiers by type and then individuating them with qualifiers. For example, we've got several very popular properties VIAF ID (P214), GND ID (P227), Library of Congress authority ID (P244), NDL Authority ID (P349), and Bibliothèque nationale de France ID (P268). Each of those properties is really describing the same type of thing: an authority file ID. However, instead of having a single "authority ID" property with one qualifier per authority (e.g. OCLC, GND, LC, NDL, BNF), you can see in e.g. On the Origin of Species (Q20124) that each identifier has its own property.
That said, type-specific organization of properties as suggested by Andrew seems like it could be a good idea. Here are a few questions and comments:
  1. How would identifiers from different databases operated by the same organization be qualified? For example, the gene ID properties Entrez Gene ID (P351) and HomoloGene ID (P593) come from different databases -- Gene and HomoloGene -- operated by the same organization, National Center for Biotechnology Information (Q82494). So sourcing at least gene ID strictly by organization doesn't seem feasible. In cases where a biological sequence (e.g. a gene, RNA or protein) has multiple identifiers from the same organization, should that qualifier simply point to the database instead of the organization, for example Gene and HomoloGene (Q468215)? (The problem doesn't necessarily go away if we don't consider "HomoloGene ID" to be a "gene ID". If we wanted to represent both reelin's NCBI Gene ID 5649 and NCBI RefSeq ID NG_011877.1 with a "Gene ID" property, how would we do that?)
  2. Is having "RNA ID" and "Protein ID" on a gene item redundant with the proposed "encodes" property? This seems like it could have implications on how we organize our ID properties for biological sequence items.
I'll end my comment by pointing out that Andrew's proposal, or something like it, might enable an even simpler way of handling IDs. If we were to organize these sequence identifier properties by type, then we might be able to designate one "preferred ID" among the set of IDs for each property with Wikidata's upcoming claim ranking feature. This would allow the statement with the preferred ID to be shown to all users and displayed in Wikipedia infoboxes by default. The various ID properties are probably all of equivalent accuracy in themselves (so our usage of ranking would deviate slightly from the feature's official description), but this might be a nice added benefit. Emw (talk) 04:26, 10 July 2013 (UTC)
A trivial solution to question #1 above is to assign "source" to a specific database instead of an organization. For example:
Thanks for pointing out the "claim ranking" feature which looks very useful. To rephrase what I stated above, IMHO the "preferred id" sources for genes and proteins should be Human Genome Project (Q192446) and UniProt protein ID (P352) respectively. It is not so clear what the "preferred id" would be for mRNA however. Boghog (talk) 09:32, 10 July 2013 (UTC)
Sorry for the "slow" reply (only relative to you all!)... Just to get caught up quickly in bullet point style...
  1. Tobias, the linked RfC is interesting indeed. However, I personally find it to be too abstract. I'd propose that we focus on coming up with the best solution for this particular corner of Wikidata, and worry about the implications/relationship to the rest of Wikidata later. Otherwise, we run the risk of paralysis. Sound reasonable?
  2. I agree with Boghog that we should just create an item for every unique source.
  3. Boghog, I'm not sure I 100% understand what you mean by "base name" / "preferred ID". Especially as it relates to Human Genome Project (Q192446). Can you clarify?
  4. Regardless, I think Boghog and Emw are supportive of the alternate plan above. Tobias (and anyone else who's interested), do you have any objections or refinements? In tangible terms, I think the game plan would involve:
  • creating a "Gene ID" property (we already have RefSeq RNA ID (P639) and RefSeq protein ID (P637))
  • creating items for any database identifier providers that don't already exist (to start, "HUGO Gene Nomenclature Committee (HGNC)", "NCBI Entrez Gene")
  • migrate all the uses of the DB-specific properties (e.g., UniProt protein ID (P352), HGNC ID (P354)) to the new system.
  • eventually propose deletion of the DB-specific properties
Any thoughts/refinements/dissent? Cheers, Andrew Su (talk) 01:18, 11 July 2013 (UTC)
A few comments:
  • I think we should consider how the proposed "encodes" property relates to this alternate plan. It seems the alternate plan would have statements for "Gene ID", "RNA ID" and "Protein ID" in each gene item. However, isn't that information redundant with "encodes"?
  • Since Human Genome Project (Q192446) isn't a sequence database I don't think it makes sense to use it as a source for identifier properties. (Yes, HGP is a high-level project that has generated much of the underlying biological sequence data for the human genome, but it's not the entity asserting that, say, reelin has any particular identifier.)
I agree with Boghog's statement that these ID properties should be sourced to a biological database, not an organization. I think for our purposes we can consider items with Wikipedia pages using 'Infobox biodatabase' to be valid sources for these ID properties. Emw (talk) 02:07, 11 July 2013 (UTC)
Responding to Andrew's question above, what I meant by "base name" / "preferred ID" is for situations where multiple databases provide an equivalent data field (e.g., gene name), and where the value (e.g., reelin) may not be identical between databases. Therefore we should indicate which database provides preferred values for a given data field. For example, many databases provide gene names (HUGO, NCBI gene, etc.). Furthermore HUGO provides an approved gene name, which most other databases including NCBI gene replicate. However not all databases may use the currently approved HUGO gene name. Hence the need to specify a "preferred ID" (if I have interpreted Emw correctly). Does that make sense? Boghog (talk) 04:31, 11 July 2013 (UTC)
Regarding the "Encodes" / "Encoded by" proposed properties, I still think those are relevant for linking the gene item to the protein item. I didn't mean to suggest that statements for "Gene ID", "RNA ID" and "Protein ID" would all appear in the gene items. Rather, I think "Gene ID" would show up on gene items, and "Protein ID" would show up on protein items. (RNAs of course are the sticky one. I'd propose that in general, "RNA ID" shows up under the gene object, unless the RNA has a defined function in which case we break it out as its own item. But these would be edge cases.) Everything else you both stated makes sense to me, and I agree... Andrew Su (talk)
Thanks for the clarification -- that works for me. One very minor note, though: if "Gene ID" would only be used on gene items and "Protein ID" on protein items, then would it be simpler to just say "Sequence ID" when referring to the ID of the current gene or protein item? This seems like it would be similar to the approach most sequence databases take. For example, when referring to the ID of the "current" sequence on a given record, RefSeq, GenBank and UniProt simply say "accession" for the sequence ID, rather than "gene accession" or "protein accession" (see here, here and here). If this seems like it has notable drawbacks, then "Gene ID" and "Protein ID" seem fine to me. Emw (talk) 05:32, 11 July 2013 (UTC)


There are basically two extremes Wikidata could use (correct me if I'm wrong)

  • (1) Use a property "ID" for all identifiers.
  • Pro: Very few properties
  • Contra: No lists for constraint violations, no way of finding out if each item has every identifier, hard to query for Wikipedia infoboxes, hard to construct links to those databases
  • (2) Create properties for all identifiers.
  • Pro: Lists of constraint violations, lists that show if each item has each property, easy for Wikipedias to get infobox information, easy to construct a URL from "base-URL" + "identifier"
  • Contra: Very many properties
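Two of the advantages listed for option (2) can be sketched in a few lines: building a link from "base-URL" + "identifier", and listing which identifier properties an item is still missing. The dictionary layout and function names are hypothetical; the only base URL used is the NCBI Gene address quoted elsewhere in this discussion.

```python
# Hypothetical mapping from identifier property to its base URL.
# Only the NCBI Gene base URL is taken from this discussion.
BASE_URLS = {
    "Entrez Gene ID (P351)": "http://www.ncbi.nlm.nih.gov/gene/",
}


def build_url(prop, identifier):
    """Return base-URL + identifier, or None if no base URL is known."""
    base = BASE_URLS.get(prop)
    return base + identifier if base is not None else None


def missing_properties(item_claims, required):
    """List the required properties that an item does not yet have."""
    return [p for p in required if p not in item_claims]


# Simplified view of the claims on the RELN gene item.
reln = {"Entrez Gene ID (P351)": "5649"}

print(build_url("Entrez Gene ID (P351)", reln["Entrez Gene ID (P351)"]))
print(missing_properties(reln, ["Entrez Gene ID (P351)", "Ensembl gene ID (P594)"]))
```

With a single generic "ID" property, both functions would need extra information (a source qualifier) before they could do the same job.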

A possible solution would be to allow for properties to be nested too. So all the gene properties could be a subclass of "Gene ID" and "GeneID & RNA-ID & ProteinID" would be a subclass of "sequence ID" and "sequence ID" would be a subclass of "ID". I don't know if there are any plans to implement this here, but I think somebody once said in the ProjectChat that that was the way SemanticWeb handles these problems.
I personally think that the identifiers are not that important. The true potential of Wikidata is the item and number datatype that will allow us to create an unbelievable mesh of interlinked data that will in the end be more important than the 50+ identifier-properties that each item will receive sooner or later. --Tobias1984 (talk) 09:59, 11 July 2013 (UTC)

Tobias, you raise some good points about the limitations of the proposed system. I think (hope) some of them will end up being non-issues, but they are issues at the moment. So since this is not an undeniably positive move, let's just continue with the status quo and use database-specific properties. We can always make a change later. To help things move along then, I will:
I think that will cover all the identifier properties needed for the current gene infoboxes. Please discuss more if anyone disagrees with any of these changes! Cheers, Andrew Su (talk) 16:30, 11 July 2013 (UTC)

Gene and protein labels and descriptions

Since we're talking about identifiers for genes and proteins, I thought it'd be fitting to also discuss their labels and descriptions.

The proposal / guideline at Help:Label#Disambiguation says "When an article title includes disambiguation in it, either by placing it after a comma or by placing it in parenthesis, the disambiguation should be left out. Disambiguation information should instead be placed in the description field". Some of our items don't follow this styling, e.g. we've got items labeled "reelin (human gene)", "reelin (human protein)" and "Reln (mouse gene)".

Proposed label and description format:

  • Human genes:
Label: HGNC gene symbol, e.g. RELN
Description: human gene
  • Mouse genes:
Label: MGI gene symbol, e.g. Reln
Description: mouse gene
  • Human proteins:
Label: HGNC full name, e.g. reelin
Description: human protein
  • Mouse proteins:
Label: MGI name, e.g. reelin
Description: mouse protein
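As a rough illustration of the format above, the following sketch moves a trailing parenthetical out of a label and into the description. The function name and regular expression are assumptions made for this example, and it only handles the parenthesis style of disambiguation, not the comma style.

```python
import re

# Matches a title that ends in a parenthetical, e.g. "reelin (human gene)".
_DISAMBIGUATION = re.compile(r"^(?P<label>.*?)\s*\((?P<description>[^()]*)\)\s*$")


def split_disambiguation(title):
    """Split 'name (disambiguation)' into (label, description).

    Titles without a trailing parenthetical are returned unchanged,
    with an empty description.
    """
    match = _DISAMBIGUATION.match(title)
    if match is None:
        return title.strip(), ""
    return match.group("label"), match.group("description")


for title in ["reelin (human gene)", "reelin (human protein)", "Reln (mouse gene)"]:
    print(split_disambiguation(title))
```

Note that this only relocates the disambiguation text; choosing the HGNC or MGI symbol as the label would still need data from those databases.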

For convenience, the HGNC entry for RELN is here and the MGI entry for Reln is here. What do others think? Emw (talk) 03:22, 11 July 2013 (UTC)

I worry a bit that the label won't be interpretable to a non-scientist, but this is a pretty minor worry. The item is really defined by its statements, so the label and description (I think) are really there just for convenience only... So bottom line, I like this proposal... Cheers, Andrew Su (talk) 05:05, 11 July 2013 (UTC)

Hi, I'm not sure if this topic has been discussed before. The few times I made gene or protein databases for projects I used ENSEMBL IDs as primary identifiers as they are usually quite complete and convenient to parse from the data files. However, one problem with ENSEMBL IDs is important to keep in mind. Such an ID is rather meaningless without knowing which ENSEMBL database *version* it belongs to. ENSEMBL IDs change quite frequently across database versions, and keeping a local database up to date, so that a used ID still refers to the gene/protein it was originally assigned to, isn't trivial (although Ensembl maintains tables which record any such changes). I don't see any mention of a 'version' with the ENSEMBL IDs in Wikidata. The 'source' info of the 'ENSEMBL GENE ID' property links only to 'Ensembl' in general, not a specific database version, and the ID itself also links to the general Ensembl entry, which is the latest version. Would that imply that there is a bot that regularly updates the ENSEMBL IDs in the Wikidata database to resolve any conflicts that have arisen? Cheers, Optimale (talk) 11:42, 6 August 2013 (UTC)

Genome assembly database?

Hi,

I'm collecting some data on WP here: http://en.wikipedia.org/wiki/List_of_sequenced_plant_genomes

Can we store all those values in WD? --Dan Bolser (talk) 13:48, 10 July 2013 (UTC)

Yes! Putting genome metadata onto Wikidata is a great idea. Each genome should probably be represented as an item. How are these genomes represented, e.g. are most of them assemblies, sequence maps, or?
Here's a possible mapping for fields in the table at list of sequenced plant genomes; some based on relevant existing properties:
Organism strain: P89 (P89) (perhaps we should create a property "strain" to support assigning sub-species information through qualifiers)
Family: this is extraneous, since we should be able to deduce it from the above field
Relevance: maybe unnecessary?
Genome size: we should probably use a generic "length" property with units "Mbp" (or whichever order of magnitude of "bp" is most appropriate)
Number of genes predicted: we might want to propose a new property for this
Organization: new property, like above
Year of completion: I would suggest using publication date (P577) to specify the most precise date possible
Assembly status: what does this mean?
@Emw: Status of the assembly using a controlled vocabulary, described here: wikipedia:Talk:List_of_sequenced_plant_genomes --Dan Bolser (talk) 12:03, 20 May 2014 (UTC)
There are more interesting genome properties to consider, but this is a start. Emw (talk) 02:50, 11 July 2013 (UTC)
Agreed, these would be cool data to add! In addition, I can think of two things that might be nice to include that aren't already in your table. First is the sequence identifier for the sequenced genome. Second is the NCBI taxonomy ID (P685). Possible to add those columns? Cheers, Andrew Su (talk) 04:53, 11 July 2013 (UTC)
Hello! I started work on this today with lots of help from User:Magnus Manske (I was a bit clueless before). We jumped in and proposed one of the properties in the table (and suggested by User:Emw too), here: https://www.wikidata.org/wiki/Wikidata:Property_proposal/Natural_science#Genome_size
I totally agree that GCA and tax_id would be valuable properties to add to the table (in addition I'm planning some form of quality ontology). Are there already properties for these two identifiers? Cheers, --Dan Bolser (talk) 12:39, 15 April 2014 (UTC)

property support or comments please

Hello all, there are a few proposed properties that could use your support or comments please. Starting with this link, the properties in question are "Encoded by", "Ensembl Transcript ID", and "Ensembl Protein ID". Please contribute your thoughts! Cheers, Andrew Su (talk) 18:01, 16 July 2013 (UTC)

I've supported those properties, which all make sense to me. Are there properties that the GSoC project needs but doesn't have yet? Emw (talk) 03:22, 17 July 2013 (UTC)
In general we still need a lot of item-properties that link genes and proteins to other items. For example "mutation in gene 1 causes disease A". We could also have more links to neurological and physiological functions. --Tobias1984 (talk) 08:33, 17 July 2013 (UTC)
I think new proposals can be done on a separate WD:Property proposal/Biology or WD:Property proposal/Science, as WD:Property proposal/Term is annoyingly slow to load, and so diverse that there does not seem to be much point in having everything together. --Zolo (talk) 10:36, 17 July 2013 (UTC)
This discussion could take place in project chat or in a new RfC, and be generalised ... TomT0m (talk) 10:52, 17 July 2013 (UTC)
The reasoning behind the limited amount of subpages is that people should actually also review properties from other scientific fields and find properties that overlap. But I can see the problem with the page being huge at the moment (usually 100+ proposals since March). Maybe we should make a subpage for biology and life sciences? --Tobias1984 (talk) 10:56, 17 July 2013 (UTC)
To avoid a huge number of proposals, I would support grouping properties if they make sense together or if they are similar, and voting for a group of properties instead of property by property. TomT0m (talk) 11:02, 17 July 2013 (UTC)
I trimmed down the page to about 80 proposals. That should help with the load times. It would help to get some votes for Wikidata:Property_proposal/Term#Medicine_.2F_Medizin_.2F_M.C3.A9decine. --Tobias1984 (talk) 20:30, 17 July 2013 (UTC)

Infobox neuron is next. Please vote for the 6 proposals at Wikidata:Property_proposal/Term#presynaptic_connection_.28afferent.29 --Tobias1984 (talk) 09:39, 23 July 2013 (UTC)

Did these properties get accepted? If so can you link? Cheers, --Dan Bolser (talk) 12:42, 15 April 2014 (UTC)
@Dan Bolser: A hopefully complete list of the existing properties is at Wikidata:WikiProject_Molecular_biology/Properties Tobias1984 (talk) 15:24, 15 April 2014 (UTC)

New ProteinBoxBot Bot flag request

Hi everyone! I have submitted a request for bot flag here. The test runs are here. Kindly chime in with your thoughts. Chinmay26 (talk) 18:05, 19 July 2013 (UTC)

A week has passed. Ymblanter is waiting for final comments so the bot can be approved. --Tobias1984 (talk) 15:35, 26 July 2013 (UTC)
If you have experience with bots, please also review Wikidata:Requests_for_permissions/Bot/Chembot. We have a lot of overlapping properties with the chemistry task force. --Tobias1984 (talk) 14:22, 29 July 2013 (UTC)

ProteinBoxBot edits

Hi, I have run ProteinBoxBot for the first 3 proteins under human proteins. The bot does not handle EC Classification, Gene_atlas image, or Alias yet. Sample item: www.wikidata.org/wiki/Q411507. The bot also does not update appropriate qualifiers yet (working on it). I just wanted to run this by the community to confirm/clarify if there are any issues regarding the edits. Chinmay26 (talk) 22:32, 29 July 2013 (UTC)

Thanks Chinmay. I looked over ProteinBoxBot's recent contributions and it seems like things are coming along. A few comments on Cyp21a1 (Q14358793) that generalize to other gene/protein items:
  • The description field reads "Mouse Gene". This should be lowercase -- "mouse gene". Same for "human gene", "human protein", and "mouse protein", since none involve proper nouns.
  • The RefSeq RNA ID (P639) claims should only be used for RefSeq accessions. Per the "Distinguishing Features" section of http://www.ncbi.nlm.nih.gov/refseq/about/, RefSeq accessions all contain underscores -- for example, NM_009995 is a RefSeq accession, but AI323066 is not (it's a GenBank accession).
Below is a template I would suggest we use for these various RefSeq properties. The "valid accession prefixes" constraints are derived from the official table mapping RefSeq accession numbers to molecule types.
| Property | Valid accession prefixes | Molecule type | Should be used on Wikidata items that are subclasses of... | Example usage in reelin items |
|---|---|---|---|---|
| RefSeq (P656) | NG_, NT_, NC_, AW_, NW_, NS_, NZ_ | genomic DNA | gene (Q7187) | RELN (Q414043) RefSeq (P656) NG_011877.1 (per http://www.ncbi.nlm.nih.gov/gene/5649, Ctrl-F for "NG_") |
| RefSeq RNA ID (P639) | NM_, NR_, XM_, XR_ | RNA | gene (Q7187) | RELN (Q414043) RefSeq RNA ID (P639) NM_005045.3 (per http://www.ncbi.nlm.nih.gov/gene/5649, Ctrl-F for "NM_") |
| RefSeq protein ID (P637) | NP_, AP_, YP_, XP_, ZP_ | protein | protein (Q8054) | reelin (Q13569356) RefSeq protein ID (P637) NP_005036.2 (per http://www.ncbi.nlm.nih.gov/gene/5649, Ctrl-F for "NP_") |
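A bot could apply the prefix constraints from this template before writing a claim. The sketch below is one possible way to do that; the function name is hypothetical, and the prefix lists are copied from the template as given here rather than re-checked against the official RefSeq table.

```python
# Prefix lists copied from the template above.
PREFIXES = {
    "RefSeq (P656)": ("NG_", "NT_", "NC_", "AW_", "NW_", "NS_", "NZ_"),
    "RefSeq RNA ID (P639)": ("NM_", "NR_", "XM_", "XR_"),
    "RefSeq protein ID (P637)": ("NP_", "AP_", "YP_", "XP_", "ZP_"),
}


def refseq_property_for(accession):
    """Return the property whose prefix list matches the accession.

    Returns None when the accession carries no listed RefSeq prefix,
    e.g. for a GenBank accession.
    """
    for prop, prefixes in PREFIXES.items():
        # str.startswith accepts a tuple of candidate prefixes.
        if accession.startswith(prefixes):
            return prop
    return None


for accession in ["NM_009995", "AI323066", "NG_011877.1", "NP_005036.2"]:
    print(accession, "->", refseq_property_for(accession))
```

An accession that maps to None (such as AI323066) should not be added under any of the three RefSeq properties.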
Overall things seem to be making good progress! Emw (talk) 03:48, 30 July 2013 (UTC)
Small point, but could you also add description in other language:
en:human gene
de:menschliches Gen
es:gen humano
fr:gène humain
it:gene umano
zh:人类基因
I would like to add (Sarilho1 (talk) 14:36, 30 July 2013 (UTC)):
pt:gene humano
pt-br:gene humano

Properties

physically interacts with (P129) is still missing examples, constraints and a description of its scope. We should decide if this proposal (Wikidata:Property_proposal/Term#drug_interaction_.28en.29_.2F_Arzneimittelwechselwirkung_.28de.29) needs to be a separate property or not. --Tobias1984 (talk) 18:23, 31 July 2013 (UTC)

Another proposal needing attention drug action altered by --Tobias1984 (talk) 12:54, 2 August 2013 (UTC)

Discussion about drug-drug interaction qualifiers

I would like to invite the participants of this project to give their opinion in this discussion: Wikidata_talk:Medicine_task_force#drug-drug_interaction. --Tobias1984 (talk) 19:39, 6 August 2013 (UTC)

reelin and RELN

Human Gene:

Mouse Gene:

Human

Mouse:

Wouldn't it be a good idea to connect reelin (human) and reelin (mouse) to reelin (in general)? The sitelinks could go into that general item, and statements that are true for both could go only with the parent item. The same could be done for the gene. --Tobias1984 (talk) 11:33, 26 August 2013 (UTC)

Sorry I missed this post. I personally think we should not have items for "in general". Human reelin and mouse reelin are real and tangible things, and I think the abstraction will create more problems than it's worth. For example, what is true for "in general" undoubtedly differs depending on what species you're considering. One claim that is true for human, mouse and rat may not be true for fly, and of course reelin is probably not present in the genomes of many lower organisms. My two cents... Cheers, Andrew Su (talk) 23:04, 10 September 2013 (UTC)

Duplicates ProteinBoxBot

We probably need to discuss some items that ProteinBoxBot is currently creating. Some items might be duplicates we can merge, others are distinct concepts. Merging needs to be done carefully, in order not to break all the links. The first duplicate I found is: human ageing (Q14330657), human ageing (Q332154). In my opinion those two could be merged. Any opinions? --Tobias1984 (talk) 17:57, 10 September 2013 (UTC)

Hi Tobias, you raise a very good point. Chinmay has put in quite a few checks to avoid creating duplicates. Of course, no system is perfect (especially with the somewhat incomplete query API at the moment), so letting us know when you run across them is very useful. Speaking on those two examples in particular... I think the ageing/aging issue would be difficult to detect. The spelling difference prevents us from doing an exact string match, although I think that if it was added as an alias then Chinmay's program would have detected the existing item. (In theory, we may have been able to use the MeSH ID to make the match, and if that becomes a common theme then we can add that feature.) I think the second example for endoplasmic reticulum exists because Q14327640 was created before the redundancy checking was in place... I think that is not a problem in any of the most recent runs. Unless you have any objection, I'll go try to use the merge tool to fix these? Cheers, Andrew Su (talk) 23:00, 10 September 2013 (UTC)
(Edit conflict) I was about to say much the same. I added "aging" (American English) as an alias of the item labeled "ageing" (British English). Gene Ontology terms use American English, but the Wikipedia articles on a good proportion of biomedical subjects use British English titles. Adding the bolded text in article leads as an alias seems like it would solve this problem in one fell swoop, but given the GSOC time constraints I suspect we'll need to do more manual data input. Emw (talk) 23:13, 10 September 2013 (UTC)
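The ageing/aging case can be illustrated with a small duplicate check that compares a candidate name against both labels and aliases. This is only a sketch of the idea, not ProteinBoxBot's actual code; the `find_existing` function and the dictionary layout are hypothetical, and the item data is simplified.

```python
def find_existing(name, items):
    """Return the ID of an item whose label or alias equals name.

    Comparison is case-insensitive but otherwise exact, so spelling
    variants only match if one of them is recorded as an alias.
    """
    wanted = name.casefold()
    for qid, item in items.items():
        names = [item["label"], *item.get("aliases", [])]
        if any(wanted == candidate.casefold() for candidate in names):
            return qid
    return None


# Simplified existing item, before and after adding the alias.
items = {"Q332154": {"label": "ageing", "aliases": []}}
before = find_existing("aging", items)   # no match: spelling differs

items["Q332154"]["aliases"].append("aging")
after = find_existing("aging", items)    # match via the new alias

print(before, after)
```

Once the American English spelling is stored as an alias, an exact string match is enough to find the existing item instead of creating a duplicate.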
@Andrew Su: Will it mess up the bot if we delete them now, or will it use the other item automatically? --Tobias1984 (talk) 07:01, 11 September 2013 (UTC)
Another one? cell-cell signaling (Q14758911) and cell signaling (Q210973) --Tobias1984 (talk) 12:03, 11 September 2013 (UTC)
We should probably always keep the item with the lower number and we also have to change all the incoming links:
Changing the links is now pretty easy with User:BeneBot*/movelinks.js. --Tobias1984 (talk) 15:02, 9 December 2013 (UTC)
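The "keep the item with the lower number" rule is easy to get wrong with plain string sorting, so here is a minimal sketch that compares the numeric part of the IDs. The function name is hypothetical; the item IDs are the duplicate pairs mentioned in this section.

```python
def merge_target(qid_a, qid_b):
    """Return (keep, redirect), keeping the item with the lower number.

    The leading "Q" is stripped and the rest compared as an integer,
    because comparing the strings directly would rank "Q14330657"
    before "Q332154".
    """
    keep, redirect = sorted((qid_a, qid_b), key=lambda qid: int(qid[1:]))
    return keep, redirect


print(merge_target("Q14330657", "Q332154"))
print(merge_target("Q14758911", "Q210973"))
```

Incoming links to the second item of each pair would then need to be moved to the first before the duplicate is removed.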

ProteinBoxBot progress

Hi all, in case anyone is interested in how Chinmay's PBB project is going, you can see his recent efforts at https://www.wikidata.org/wiki/Special:ApiSandbox#action=wbsearchentities&format=json&search=entrez%3A&language=en&type=item&limit=50 or https://www.wikidata.org/w/index.php?title=Special:Contributions/ProteinBoxBot&offset=&limit=500&target=ProteinBoxBot. We're in the home stretch on his GSoC project (which is why we haven't been more active in the discussions) so the current priority is to make sure the code that he's written is robust to many different gene examples. Cheers, Andrew Su (talk) 23:08, 10 September 2013 (UTC)
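For anyone who wants to run the same search outside the API sandbox, the query parameters from the first link can be assembled into a request URL as sketched below. The sketch assumes the standard Wikidata endpoint https://www.wikidata.org/w/api.php, which is not stated in the post above, and it only builds the URL without sending a request.

```python
from urllib.parse import urlencode

# Assumed endpoint; the post above links to the API sandbox instead.
API_ENDPOINT = "https://www.wikidata.org/w/api.php"


def search_url(search, language="en", limit=50):
    """Build a wbsearchentities URL using the sandbox link's parameters."""
    params = {
        "action": "wbsearchentities",
        "format": "json",
        "search": search,
        "language": language,
        "type": "item",
        "limit": limit,
    }
    # urlencode percent-encodes the colon in "entrez:" as %3A.
    return API_ENDPOINT + "?" + urlencode(params)


print(search_url("entrez:"))
```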

Hormones and biological process (P682)

Couple of questions. Would it be ok to connect hormones with biological process (P682) to their functions or do we need another property? Example: progesterone (Q26963) = human pregnancy (Q11995). Also we need a property for "where the hormone is made in the body". Does anybody know what to call it or is there a property we can recycle? --Tobias1984 (talk) 17:53, 12 September 2013 (UTC)

I like this suggestion. It seems there are different kinds of relationships we'd want to use - inhibits, activates. This might tie in with a general discussion on representing pathways in Wikidata. Note that CHEBI includes a link between progesterone and the role 'contraceptive drug' (which might itself be linked to GO:pregnancy). As for where substances are made, in GO we include RO:occurs_in links between processes such as 'progesterone biosynthetic process' and structures or tissues in Uberon (we don't have a particular link for progesterone; this could be added, though we have strict criteria). In addition we have links between 'progesterone biosynthetic process' and CHEBI:progesterone. These could be chained together to get a chemical to tissue link. Cmungall (talk) 23:01, 5 February 2015 (UTC)

Endorse funding for wikidata query tool and more?

There is a proposal for funding to build a Wikidata toolkit for developers. If you like it, please head over there and let them know by adding an endorsement. https://meta.wikimedia.org/wiki/Grants:IEG/Wikidata_Toolkit#Endorsements:

Updates and additions to URL mappings for molecular biology properties

I've added a request to improve URL mappings for several of the ID properties we use in gene and protein items at MediaWiki_talk:Gadget-AuthorityControl.js#Updates_and_additions_for_molecular_biology_properties. That link also describes an interesting bug with the URL mapping for Ensembl gene ID (P594) -- it goes to a page expecting a human gene ID even when it's used on claims for mouse genes. Please take a glance over that and note any comments or questions there. Thanks, Emw (talk) 12:51, 15 October 2013 (UTC)

Property proposal: chromosome

See Wikidata:Property_proposal/Natural_science#chromosome. Emw (talk) 20:56, 14 December 2013 (UTC)

protein binding

Hi, I have trouble labelling and sorting out this item, which is in the list of most used items without a French label: protein binding (Q14633864). Is it a duplicate of peptide bond? Is it a (class of) bond between proteins? TomT0m (talk) 22:52, 10 February 2014 (UTC)

The concept protein binding (Q14633864) comes from the Gene Ontology (GO), specifically http://amigo.geneontology.org/cgi-bin/amigo/term_details?term=GO:0005515. It is a type of binding, which in turn is a type of molecular function. It isn't a synonym of 'peptide bond' nor a subclass of bond between proteins. More information is available in the GO link; more about GO in general is here: http://www.geneontology.org/GO.doc.shtml. Emw (talk) 00:44, 11 February 2014 (UTC)
I saw that link. I am more confused and curious because protein amino acid binding is cited as a synonym. Is it a bond between two proteins, formed by bonding some of their peptides? TomT0m (talk) 17:47, 11 February 2014 (UTC)
Let's compare definitions and a few subclasses of relevant GO terms:
  • peptide binding:
Definition: Interacting selectively and non-covalently with peptides, any of a group of organic compounds comprising two or more amino acids linked by peptide bonds.
Subclasses: beta-amyloid binding, oligopeptide binding, peptide hormone binding
  • protein binding:
Definition: Interacting selectively and non-covalently with any protein or protein complex (a complex of two or more proteins that may include other nonprotein molecules).
Subclasses: apolipoprotein binding, heat shock protein binding, p53 binding
The difference between 'protein binding' and 'peptide binding' is the difference between proteins and peptides: size. Page 85 of Lehninger: Principles of Biochemistry (4th edition) says "molecules referred to as polypeptides generally have molecular weights below 10,000 and those called proteins have higher molecular weights." And that's the case with the children of 'peptide binding' and 'protein binding'. Beta amyloid has 36-43 amino acids (for comparison, cytochrome C has 104 residues and a molecular weight of 13,000), and so is a peptide. Apolipoprotein E, an apolipoprotein, has 317 amino acids, and so is a protein. Beta amyloid would be an object of peptide binding, and apolipoprotein E would be an object of protein binding.
So both 'protein binding' and 'peptide binding' would involve binding amino acids, just amino acids in a bigger or smaller molecule. Importantly, the definitions for both terms explicitly note that they are non-covalent. A peptide bond is a type of covalent bond, so it would be quite incorrect to synonymize "peptide binding" with "peptide bond". Another important thing to consider is that, per GO, 'binding' is a type of activity, not a bond; 'binding' is a process, not an object. Emw (talk) 02:55, 12 February 2014 (UTC)
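The size rule quoted from Lehninger can be turned into a rough calculation. The sketch below assumes an average weight of about 110 per amino acid residue, a common rule of thumb that is not stated in this discussion, so the results are estimates only; the function names are hypothetical.

```python
AVERAGE_RESIDUE_WEIGHT = 110  # assumed rule-of-thumb average per residue
PROTEIN_THRESHOLD = 10_000    # cutoff from the Lehninger quote above


def classify_by_weight(molecular_weight):
    """Call a molecule a peptide below the threshold, otherwise a protein."""
    return "protein" if molecular_weight >= PROTEIN_THRESHOLD else "peptide"


def classify_by_length(residues):
    """Estimate the weight from the residue count, then classify."""
    return classify_by_weight(residues * AVERAGE_RESIDUE_WEIGHT)


print("beta amyloid (43 residues):", classify_by_length(43))
print("apolipoprotein E (317 residues):", classify_by_length(317))
print("cytochrome C (weight 13,000):", classify_by_weight(13_000))
```

The boundary is a convention rather than a sharp biochemical distinction, so molecules near the threshold could reasonably be described either way.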
Thanks, my high school courses were long ago, and not in English :) So to sum up, proteins and peptides are made of bonds between amino acids, whereas peptide binding and protein binding form peptide complexes and protein complexes respectively (which may include other kinds of molecules). TomT0m (talk) 11:55, 12 February 2014 (UTC)
Yup, pretty much. Emw (talk) 12:16, 12 February 2014 (UTC)

Wikidata Infobox on Czech Wikipedia

Czech Wikipedia is currently interested in adding protein data to their articles. This would be our chance to test our data in the field, by building a Lua-Infobox that uses Wikidata-data. By adding the infobox one at a time we can slowly work out problems, add sources and gather experience for further deployments. @Hypothalamus: would be our go-to person. Hypothalamus could choose an example page (ideally a page that isn't visited too much, so mistakes can be fixed without anybody noticing). We also still have to find somebody with some Lua-infobox experience. --Tobias1984 (talk) 11:26, 25 March 2014 (UTC)

Yes, the original suggestion was targeted to User:Andrew Su here. I suggest we experiment with infoboxes on a short article on Hepcidin. Is there anything I can do at this point? Do you want me to ask at Czech wiki's Village pump if there is anyone with Lua programming skills willing to help? Hypothalamus (talk) 19:35, 7 May 2014 (UTC)
  WikiProject Molecular biology has more than 50 participants and couldn't be pinged. Please post on the WikiProject's talk page instead. Tobias1984 (talk) 21:03, 7 May 2014 (UTC)
@Hypothalamus: - If you could find a Lua programmer on Czech wiki that would be great. There are some working examples of infoboxes that would just need adjustment, e.g. the taxobox on Hebrew wiki [7] and the module [8]. This bug still puts some limitations on pulling data from items other than the one that is connected to the Wikipedia article [9]. But I think that the most important information can be pulled from the item itself. Tobias1984 (talk) 11:34, 8 May 2014 (UTC)
I'd be happy to help. Can read Czech and Lua fine but am lousy at producing them properly. --Daniel Mietchen (talk) 12:05, 9 May 2014 (UTC)
@Daniel Mietchen, Hypothalamus: - I think another example is at https://fr.wikipedia.org/wiki/Module:Infobox and https://en.wikipedia.org/wiki/Wikipedia:Lua - I still have to read up on how this all works. Especially the modules that are required by Wikipedia still confuse me. Tobias1984 (talk) 18:48, 9 May 2014 (UTC)
@Daniel Mietchen, Hypothalamus: - Another example: fr:Module:Infobox/Composé_chimique used on fr:Undéc-1-ène - Daniel if you have time it would be great if you could help out Hypothalamus and czech-wiki. I still have trouble reading the code and don't really understand what goes where. Tobias1984 (talk) 13:20, 10 May 2014 (UTC)
This is a great initiative and I would be happy to help. I am a native Czech, molecular biologist, but cannot write in Lua. --Vojtěch Dostál (talk) 21:58, 13 May 2014 (UTC)
@Vojtěch Dostál: Welcome to the WikiProject. Help is of course always appreciated and needed at every corner of Wikipedia and Wikidata :) - I see that you haven't made any edits to Wikidata yet. You could start by just looking at some items of your favorite articles and see what kind of information you can add to them. It is good to understand the data structure, so when the infoboxes are put onto Czech wiki you will already know how to fix mistakes. Ping me if you need any help! -Tobias1984 (talk) 22:31, 13 May 2014 (UTC)
@Tobias1984: I am user Vojtech.dostal but have just renamed my account and did not merge all of them :-). Thank you for your kind offer - as long as I do not dig into the programming - I think I'll be all right. --Vojtěch Dostál (talk) 22:39, 13 May 2014 (UTC)

I can't work out how to use the Wikidata:WikiProject Molecular biology page

Sorry if it's just me, but I wanted to add my new proposed property (genome size) there but couldn't work out how to do it. --Dan Bolser (talk) 12:45, 15 April 2014 (UTC)

@Dan Bolser: Do you have trouble with the markup language? The template of the property list together with the table syntax can be confusing at times. I can help you if you describe your problem. Tobias1984 (talk) 15:26, 15 April 2014 (UTC)
@User:Tobias1984: Syntax is fine, I just don't know what properties to use with what templates. --Dan Bolser (talk) 11:33, 20 May 2014 (UTC)

Gene alias

Hello to everyone! I am a beginner on Wikidata, so I ask you to be patient with me :) I was wondering whether adding aliases (aka) to the names of the genes has already been considered (e.g. aliases reported by GeneCards or other similar databases). As an example I tried to edit the page FOSL2, but obviously it is not a task that can be carried out one by one by hand. Have you thought about developing a bot for this purpose? I am at your disposal! Amicobromo (talk) 10:38, 31 July 2014 (UTC)

@Amicobromo: - @Andrew Su: has run the previous bot edits for this project. Maybe he can incorporate aliases on the next run. But it also depends on whether the data is available to us. - If you like, you can also program your own PyWikiBot. But there is also still much work to be done by humans, for example checking constraint violations on properties. -Tobias1984 (talk) 10:07, 1 August 2014 (UTC)
@Tobias1984: Thank you for your reply. I will prepare a tsv (or csv) file with all the gene aliases I can find in public databases, so @Andrew Su: may use this information to add this task to his bot. Let me know, thank you! 79.38.219.90 10:14, 1 August 2014 (UTC)
@Amicobromo: Thanks for your interest in biological data on Wikidata! Gene aliases are of course very important. The primary source databases for that information are NCBI Entrez Gene and Ensembl, both of which provide alias information in downloadable files. As @Tobias1984: mentioned, we are writing a bot (currently under the care of @Andrawaag:), and we are taking those data sources into account. If you'd like to help, let's try to figure out how to get you involved with our bot development! I'll ask Andrawaag to chime in here too... Cheers, Andrew Su (talk) 18:07, 1 August 2014 (UTC)
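To make the alias file idea above concrete, here is a minimal sketch of reading such a file before handing it to a bot. The two-column layout (gene symbol, then pipe-separated aliases, with "-" meaning "no aliases"), the function name, and the placeholder names in the usage note are all assumptions for illustration; nobody in this thread has fixed a format.

```python
import csv
import io


def read_gene_aliases(tsv_text):
    """Parse a hypothetical two-column TSV: symbol <TAB> alias1|alias2|...

    Returns a dict mapping each gene symbol to a de-duplicated alias list.
    A "-" in the alias column is treated as "no aliases known".
    """
    aliases = {}
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        # Skip blank lines, short rows, and comment/header rows.
        if len(row) < 2 or row[0].startswith("#"):
            continue
        symbol = row[0].strip()
        raw = row[1].strip()
        names = [] if raw == "-" else [n.strip() for n in raw.split("|")]
        known = aliases.setdefault(symbol, [])
        for name in names:
            # An alias equal to the symbol itself adds nothing.
            if name and name != symbol and name not in known:
                known.append(name)
    return aliases
```

For example, a row `GENE1<TAB>ALIAS_A|ALIAS_B|GENE1` would yield `{"GENE1": ["ALIAS_A", "ALIAS_B"]}`; the names here are placeholders, not real gene data.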

ProteinBoxBot August 2014

Pinging because I hadn't seen this page, and maybe some other people didn't either: User:ProteinBoxBot/201408 sprint. The bot has done a few runs in August. Please check its work and update the constraint violation templates (e.g. Property talk:P639). New violations will show up after 1 or 2 days.

Some constraint violations also need fixing:

Don't hesitate to ping me for help with the constraint templates. -Tobias1984 (talk) 12:29, 27 August 2014 (UTC)

Yes, thank you Tobias1984! We welcome any and all input. The primary goal right now is to reimplement the infrastructure we previously created so that it is more robust. So far we've only done a few test edits. We will post here when we need feedback from the larger community. Hopefully soon! Cheers, Andrew Su (talk) 23:59, 28 August 2014 (UTC)

ProteinBoxBot September 2014

Hi all. For the last month I have been refactoring ProteinBoxBot. Progress is reported in User:ProteinBoxBot/201408 sprint. The workflow is as follows:

  1. For a gene label, check whether it is already covered in Wikidata.
  2. If so, obtain the Wikidata ID.
    1. Check whether the Wikidata entry already contains a "subclass of" property -- if it's anything other than what the bot would have added ("gene" or "protein"), the bot should throw a warning and skip that gene.
  3. If not, create a new page.
  4. With the gene identifier, extract related identifiers from http://mygene.info
  5. Add the related identifiers as statements to Wikidata.
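The decision logic in the steps above can be sketched roughly as follows. This is not ProteinBoxBot's actual code: the function name, the in-memory `existing_items` dict standing in for a Wikidata lookup, and the shape of `identifiers` (as they might come back from mygene.info) are assumptions made for illustration only.

```python
def plan_gene_edit(label, identifiers, existing_items,
                   expected_classes=("gene", "protein")):
    """Decide what the bot should do for one gene label.

    existing_items: hypothetical stand-in for a Wikidata lookup,
        mapping label -> {"id": str, "subclass_of": [str, ...]}.
    identifiers: related identifiers to add as statements.
    Returns (action, item_id, statements).
    """
    item = existing_items.get(label)
    if item is None:
        # Step 3: nothing found, so a new page would be created.
        return ("create", None, dict(identifiers))

    # Step 2.1: anything other than the classes the bot itself would
    # have added means the item may be about something else entirely.
    unexpected = [c for c in item.get("subclass_of", [])
                  if c not in expected_classes]
    if unexpected:
        return ("skip", item["id"], {})

    # Steps 4-5: safe to add the related identifiers as statements.
    return ("update", item["id"], dict(identifiers))
```

Returning a plan instead of editing directly keeps the "warn and skip" branch easy to log and review before any write happens.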

The refurbished ProteinBoxBot has been tested on about 2000 Entrez genes (see https://www.wikidata.org/wiki/Special:Contributions/ProteinBoxBot) and if no objections are raised I hope to launch the ProteinBoxBot on the remaining 40000 entries later this week. - AndraWaag (talk)

We have revised the bot approach a bit. Entrez gene entries and their related identifiers are now added in two steps. In the first step, only a stub page is created. This stub contains a label, a symbol, synonyms, the Entrez gene identifier, and its species. In the second step, the information from mygene.info is obtained and added as related identifiers. The first step (the stubs) is finished now and the second step is currently running. --Andrawaag (talk) 22:37, 19 September 2014 (UTC)

Royal Society of Chemistry - Wikimedian in Residence

Hi folks,

I've just started work as w:Wikimedian in Residence at the w:Royal Society of Chemistry. Over the coming year, I'll be working with RSC staff and members, to help them to improve the coverage of chemistry-related topics in Wikipedia and sister projects.

You can keep track of progress at w:Wikipedia:GLAM/Royal Society of Chemistry, and use the talk page if you have any questions or suggestions.

How can I and the RSC support your work to improve Wikipedia? Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 16:17, 24 September 2014 (UTC)

The key thing is to make sure we coordinate ongoing chemistry (especially drug-related) tasks (mainly bots) within Wikidata. Check out the work of User:AlepfuBot as an example. Curious what you are planning to do? One thing that would be really helpful in general is to produce some examples of how to get Wikidata content into Wikipedia. --Genewiki123 (talk) 16:30, 24 September 2014 (UTC)

Strategy to merge duplicates in WikiData

Hi All,

We are successfully adding entries from Entrez gene to Wikidata. Both the house mouse genome and the human genome have been processed once, and we are in the process of designing an update bot to keep the data up-to-date. The workflow is as follows. When entering a new entry from Entrez gene, the bot checks whether an entry already exists in Wikidata. Here the description of a gene entry is key, and the symbols and synonyms are added as aliases. However, Wikidata entries exist where the symbol is key. This has led to some interesting duplicates (e.g. Q15316328 and Q18254780). The question is what strategy to apply here.

My proposition would be to keep the description as key and add the symbol and synonyms as aliases. Currently if no description exists (i.e. a dash, "-"), the entry is ignored by ProteinBoxBot, but I would propose to use a symbol here in the next iteration of the bot. The question is then how to deal with merges like Q18046951 into Q1521757. Here the description got removed in the merge process. In my opinion the merge should have been in the other direction, where Q1521757 would be merged into Q18046951.

Any thoughts, objections, or approval on both the workflow where the description is key and the proposition to merge duplicates in the direction of the entry that contains this description? Andrawaag (talk) 10:57, 17 October 2014 (UTC)
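As a rough illustration of the proposed merge direction (keep the entry that carries a real description, fold the other one's label and aliases into it), here is a minimal sketch. The dict layout, the function names and the placeholder IDs are assumptions; this is not how ProteinBoxBot or Wikidata's merge tool is implemented.

```python
def has_description(item):
    """True if the item has a description other than empty or a dash."""
    text = (item.get("description") or "").strip()
    return text not in ("", "-")


def choose_merge(item_a, item_b):
    """Pick the merge direction for two suspected duplicates.

    Each item is a plain dict with "id", "label", "description" and
    "aliases" keys (a hypothetical stand-in for a Wikidata item).
    Returns None when the rule cannot decide, i.e. when both or
    neither item has a description and a human should look at it.
    """
    a_ok, b_ok = has_description(item_a), has_description(item_b)
    if a_ok == b_ok:
        return None
    keep, drop = (item_a, item_b) if a_ok else (item_b, item_a)

    # The dropped item's label (often the gene symbol) survives as an alias.
    aliases = set(keep.get("aliases", [])) | set(drop.get("aliases", []))
    aliases.add(drop["label"])
    aliases.discard(keep["label"])
    return {
        "keep": keep["id"],
        "merge_from": drop["id"],
        "description": keep["description"],
        "aliases": sorted(aliases),
    }
```

Returning `None` for the undecidable cases is a deliberate choice in this sketch: it leaves ambiguous pairs for manual review rather than guessing a direction.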

Let's Do SNPs!

i.e. dbSNP: http://www.ncbi.nlm.nih.gov/projects/SNP/

We could also add the GWAS catalog: http://www.genome.gov/gwastudies/

With all human genes now in wikidata, we could link each of those SNPs both to the gene they're in, and also to the disease the SNP is associated with.

In terms of properties, we'd need:

property: orientation, values "plus" and "minus" (depending on which strand is read, the value for the SNP might be reversed)

property: in gene, to show what gene the SNP is in.

property: implicated in (implicated in disease or trait), to show what disease or trait the SNP is implicated in.

genotype: values AA, AG, AT, AC, GA, GG, GT etc., and then each genotype could be annotated with a value, e.g. .2, and a description of that value, e.g. "diopters", to indicate that the SNP was associated with a change of .2 diopters in the case of myopia...

OR

nucleotide: value A, G, T or C.

This is actually the hardest bit: deciding how to represent the different values for the SNP. So in GWAS, we only have a single nucleotide associated with each disease; that's relatively easy to represent within one Wikidata entry. The problem is that some studies report only the association between one nucleotide (i.e. "the presence of T") and the magnitude of the effect, while others look at both nucleotides (i.e. AA, AT, or TT) and associate each genotype with an effect.

The difference, of course, is that you can have a dose responsive effect or dominance could be involved... so just doing the single nucleotide, as opposed to the genotype, doesn't give you the complete picture

And then the other issue is that a single SNP may have effects on multiple traits/diseases, so we need to make sure that, say, ".2" and "diopters" stay connected, because the same SNP might also cause, say, a 200% increase in glaucoma or something.

From a DB design perspective the normal way to do this is to put all associations into their own table, i.e. have a Wikidata entry for the SNP, and then have an entry for each genotype that points to the SNP. But I'm not sure what the WD way of doing things would be in this instance.

Any thoughts? Mvolz (talk) 21:02, 25 October 2014 (UTC)

Sample SNP entry here: rs8176058 (Q18341737). Did what I could with more general properties! Mvolz (talk) 21:35, 25 October 2014 (UTC)
One I found that was already here: rs267601217 (Q15304616)

While I'm all for it, I assume everyone has seen SNPedia? --Magnus Manske (talk) 23:48, 26 October 2014 (UTC)

Mvolz, thanks for starting this conversation. Structured data about genetic variation is clearly essential if we want to discuss the connection between genes and diseases on Wikidata. As you may know, dbSNP deals not only with SNPs, but also other types of small variants, e.g. insertions, deletions, indels, multiple nucleotide variants, microsatellites, etc. Starting with small genetic variants (i.e. variants generally < 50 bp in length), even perhaps just single nucleotide variants, seems like a good start.
Even that task is huge. The current version of dbSNP (dbSNP build 142) has over 112,000,000 RS's for human. As you note, each RS often represents multiple alleles -- the reference allele and one or more variant alleles. As can be seen in e.g. http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=429358, each RS is also mapped to two genome assemblies -- GRCh38 and GRCh37.p13. Representing clinical assertions about genetic variations is also complex, see e.g. http://www.ncbi.nlm.nih.gov/clinvar/?term=rs429358 and http://www.ncbi.nlm.nih.gov/clinvar/variation/17872/#clinical-assertions. This genetic variation and clinical assertion data is updated regularly.
I suggest starting off with a narrow subset of genetic variation: variants of clinical significance like those in SNPedia (thanks Magnus), which has data on some 57,000 genetic variants.
I also think we should adopt standards and conventions in the controlled vocabulary used to discuss genetic variations. Relevant ontologies, standards, etc.:
Regarding the proposed properties for representing SNPs:
  • orientation: Seems good to me
  • in gene: Good idea. It's worth noting that many clinically relevant variants occur upstream (or downstream) of a particular gene.
  • genotype: Let's clarify that "genotype" entails the sum of alleles at a given locus, e.g. two sequences for humans. I'd recommend using a slash to delimit each allele, which helps when considering alleles involving multiple nucleotides. That is, I recommend using "A/A" rather than "AA".
  • nucleotide: This should probably be called "allele".
  • implicated in: Implicated in and has effect (P1542) differ in evidentiality, but they both deal with causation, and I think statement ranks like "preferred" or "deprecated" could be used to capture that difference in sense. See also Help:Modeling_causes#Malaria, immediate cause of (P1536) and contributing factor of (P1537). In other words, my initial impression is that we should use one of those properties, not a new one.
  • clinical significance: would augment cause of (or contributing factor of or immediate cause of) statements with ACMG-recommended values noted in previous list -- pathogenic, likely pathogenic, unknown significance, likely benign, benign
I'd also suggest including which transcript(s) the variant occurs in. And let's also keep in mind that many medically relevant variants are structural variants, e.g. copy number variations. dbVar and DGVa store information on that. Finally, let's be sure to get our provenance statements reasonably precise, so we can track not only the organization making a particular statement, but also which build or release the statement is sourced in. Emw (talk) 04:27, 27 October 2014 (UTC)
Thanks Mvolz! This is something that has been on our group's radar for a while now and I'm happy to see the discussion starting here at Wikidata. A while back we hacked together a gene-SNP-disease Semantic MediaWiki with data extracted from Wikipedia and SNPedia. It might be useful to have a play with that to see how you like the structure. See http://genewikiplus.org/wiki/Main_Page . Also note that Chunlei Wu is leading the development of a service called http://myvariant.info that should be a great way to programmatically gain access to SNP annotations. If it all works out, we ought to be able to use it much like we are using http://mygene.info now to feed bots to populate Wikidata with this information. I agree with Emw though - we will want to stage this in a way that puts some useful content in here first without completely flooding the system with human variation data. (Though in the long run I would hope to see few if any limits on the amount of content like this that gets into Wikidata.) --Genewiki123 (talk) 16:56, 27 October 2014 (UTC)
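Two points from this thread lend themselves to a small sketch: writing genotypes with a slash between alleles ("A/A" rather than "AA"), and keeping an effect size, its unit and the trait together in one record so that ".2" and "diopters" cannot drift apart. The class and function names below are assumptions for illustration, not proposed Wikidata properties, and the sketch handles only A/C/G/T alleles (no deletions or other variant types).

```python
from dataclasses import dataclass

VALID_BASES = set("ACGT")


def normalize_genotype(genotype):
    """Return a slash-delimited genotype, e.g. 'ag' -> 'A/G'.

    Unslashed input is read as two single-nucleotide alleles; slashed
    input may carry multi-nucleotide alleles such as 'A/GT'.
    """
    text = genotype.strip().upper()
    alleles = text.split("/") if "/" in text else list(text)
    if len(alleles) != 2 or any(
        not allele or set(allele) - VALID_BASES for allele in alleles
    ):
        raise ValueError(f"cannot interpret genotype: {genotype!r}")
    return "/".join(alleles)


@dataclass
class SnpAssociation:
    """One row of the 'associations in their own table' idea."""
    rs_id: str           # the SNP this row points to
    genotype: str        # stored in normalized 'A/G' form
    trait: str           # disease or trait the effect refers to
    effect_value: float  # magnitude of the effect ...
    effect_unit: str     # ... and the unit that gives it meaning

    def __post_init__(self):
        self.genotype = normalize_genotype(self.genotype)
```

With this shape, one SNP can have several rows (one per genotype and trait), and each effect value always travels with its own unit and trait.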

Should these be merged?

KEL (Q1738190) and KEL (Q18028243)?

The English Wikipedia article mostly talks about the gene (and has a gene infobox) but technically the antigen system and the gene itself are not the same thing...

Thoughts?

A better way to link the two?

Mvolz (talk) 22:00, 25 October 2014 (UTC)

Infobox enzyme

Nobody has gathered the identifiers from Infobox enzyme yet:

5044 transclusions would make it well worth the effort. -Tobias1984 (talk) 20:12, 27 October 2014 (UTC)

General genes, specific diseases

In discussing a draft import of Disease Ontology classification, Andra explained that the claim "subclass of: disease" was added to all diseases because that's what was done for genes. Having suggested the "subclass of gene" approach for genes like RELN (Q414043) but complained about an analogous "subclass disease" approach for diseases like Alzheimer's disease (Q11081), I'd like to examine these approaches.

Consider how knowledge about diseases and genes is organized in sources. For example, Alzheimer's disease is said to be "subclass of tauopathy" in Disease Ontology (DO) and "subclass of other degenerative diseases of the nervous system" in ICD-10. The DO entry has 5 ancestors between "Alzheimer's disease" and "disease".

Now consider how genes are classified in the Ontology of Genes and Genomes (OGG), a modern ontology that aligns with major works like Gene Ontology. RELN is said to be "subclass of protein-coding gene of Homo sapiens" in OGG. There are 3 ancestors between "RELN" and "gene". This structure seems reasonable to me -- one layer accounts for the type of gene product (protein) and the remaining two account for organismic taxonomy (human genes, eukaryotic genes).

To compare apples to apples, it's necessary to account for the fact that DO does not include non-human diseases, but OGG does include non-human genes. If DO accounted for non-human diseases like OGG, that would add 2 ancestors to Alzheimer's disease. Non-human diseases are relevant for Wikidata -- e.g. rust (Q4273292) and scrapie (Q170102). So, Alzheimer's disease would have 6 or 7 ancestors, and RELN would have 2 or 3.

This is mostly just an exploration of how different domains do classification. I wouldn't oppose using OGG's approach; it may even be helpful, since OGG aligns with other major biomedical ontologies. Emw (talk) 19:02, 30 November 2014 (UTC)

I don't know enough about the wikidata technology stack to answer authoritatively here, but I would be a bit wary of directly replicating a realist OWL model in Wikidata triples. Liberal use of OWL classes (genes, proteins, diseases, pathways, chemicals) makes a certain amount of sense when your stack is based around DL reasoners. And it's also highly defensible from an ontological/philosophical point of view. But there are also good reasons for a more OWL-individual centric approach, some obvious, some subtle. The wikidata datamodel and stack may push things further towards individual-centric modeling. I think some kind of formal mapping to the OBO world could still be maintained, along the lines of prototype theories. This feels like a big issue that spans multiple WikiProjects. Cmungall (talk) 22:41, 5 February 2015 (UTC)
Cmungall, you mention there are also good reasons for a more OWL-individual centric approach, and that the Wikidata data model and stack may push things towards individual-centric modeling. Could you elaborate? Also, what makes you wary of adopting ontological realism in Wikidata?
Briefly: think of basic queries like 'how many genes in human?'. You'd end up having to bolt on some kind of metaclass system to constrain results to the desired hierarchical rank/layer. Also, does WD intend to commit to the same model theoretic semantics as OWL? How are existential restrictions mapped? I'm guessing these are all irrelevant to WD, in which case you don't end up buying anything with a class representation, you just make certain things harder. As for realism, it can lead to ontological hair-splitting distinctions; these may be v useful for precise reasoning and modeling, but I would guess WD users would prefer to see disease-qua-disposition, disease-qua-process, disease-qua-disorder etc lumped into an overarching concept. In terms of Ceusters and Smith's 3-level Granular Partition Theory of reality (http://www.jbiomedsem.com/content/1/1/10), ontologies are good for representing L1, but I'd argue WD would more naturally represent L2. (with the caveat that I am new to WD and may have misunderstood some of the goals) Cmungall (talk) 23:11, 7 February 2015 (UTC)
Are you aware of the Wikidata RDF exports? As seen in e.g. the 2015-01-26 dump, it includes OWL dumps for Wikidata's class hierarchy (wikidata-taxonomy.nt.gz) and instance layer (wikidata-instances.nt.gz). Emw (talk) 00:23, 6 February 2015 (UTC)

Launch of WikiProject Wikidata for research

Hi, this is to let you know that we've launched WikiProject Wikidata for research in order to stimulate a closer interaction between Wikidata and research, both on a technical and a community level. As a first activity, we are drafting a research proposal on the matter (cf. blog post). It would be great if you would see room for interaction! Thanks, --Daniel Mietchen (talk) 01:39, 9 December 2014 (UTC)

Replacing P643 with P1057

Per discussion in Property_talk:P643, Pasleim is planning to replace Genloc Chr (P643) with chromosome (P1057). Genloc Chr (P643) has been deprecated for over a year and this has been an outstanding maintenance task.

Pasleim, please be sure to only add a P1057 claim when the item has a P643 claim and no P1057 claim yet. Many genes, proteins etc. already have P1057 statements and we don't want two redundant P1057 claims on those items. Once there are no items with a P643 claim and without a P1057 claim, we should be able to delete the deprecated property. Thanks for taking this on! Emw (talk) 15:59, 27 December 2014 (UTC)
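The migration rule for retiring P643 in favour of chromosome (P1057) without creating redundant P1057 claims can be sketched per item as below. The function name and the plain-dict representation of an item's claims are assumptions, and converting a P643 value into a matching P1057 value is left out of scope.

```python
def plan_chromosome_migration(claims):
    """Decide what to do with one item's chromosome claims.

    claims: hypothetical dict mapping property id -> list of values.
    Returns one of "nothing", "drop_old", "copy_then_drop".
    """
    has_old = bool(claims.get("P643"))
    has_new = bool(claims.get("P1057"))
    if not has_old:
        return "nothing"         # item was never affected
    if has_new:
        return "drop_old"        # P1057 already present: do not add another
    return "copy_then_drop"      # add P1057 from P643, then remove P643


def ready_to_delete_property(all_item_claims):
    """P643 can go once no item still needs a copy step."""
    return all(
        plan_chromosome_migration(claims) != "copy_then_drop"
        for claims in all_item_claims
    )
```

The second function mirrors the stated condition for deleting the deprecated property: no item left that has a P643 claim but no P1057 claim.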

FYI, I have now deleted P643. --Pasleim (talk) 13:12, 30 December 2014 (UTC)

Identifier Syntax

I have a question and a mild paranoia about the identifier syntax. I have seen confusion and wasted effort before when people can't seem to agree how to decompose an identifier into a prefix and local portion. For example, consider MGI genes. I consider these to be a "local" part which is numeric, and the prefix "MGI". Some people understood "identifier" to mean the number, and others understood the "identifier" to include the prefix. This led to "identifiers" being created that were MGI:MGI:nnnnn. I would like to ensure this doesn't happen again. So for a field like "Gene Ontology identifier", are we meant to enter the numeric portion, or the full ID? The GO's position on identifiers is here: http://wiki.geneontology.org/index.php/Identifiers . I would also like to make sure the rules for translating Wikidata identifiers to OBO PURLs are clear, and that each identifier property is linked to a source such as identifiers.org. I can help with this but am not sure where to start. Cmungall (talk) 21:04, 7 February 2015 (UTC)
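One way to guard against doubled prefixes like MGI:MGI:nnnnn is to always pass input through a single normalizing step, whichever convention (local part only, or prefix plus local part) a given property ends up storing. The sketch below is only an illustration: the function names are made up, and it does not settle which form any Wikidata property should hold.

```python
def to_local_id(prefix, raw):
    """Strip every leading 'PREFIX:' from raw and return the local part.

    Matching is case-insensitive, so 'mgi:12345' and 'MGI:MGI:12345'
    both reduce to '12345'.
    """
    marker = prefix + ":"
    local = raw.strip()
    while local.upper().startswith(marker.upper()):
        local = local[len(marker):].strip()
    if not local:
        raise ValueError(f"no local identifier in {raw!r}")
    return local


def to_prefixed_id(prefix, raw):
    """Return exactly one prefix followed by the local part."""
    return f"{prefix}:{to_local_id(prefix, raw)}"
```

Because both helpers are idempotent, running them again on already-clean values is harmless, which makes them safe to apply in a bot before every write. The digits used when trying this out (e.g. "12345") are placeholders, not real accessions.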

Gene Disease Interactions

All genes and diseases have now been put into WD and they are kept up-to-date by bots. This is a great development. The next step is to introduce relations between genes and diseases. For this, we have built an OWL ontology with the classes 'gene' and 'disease' and several properties. It can be found here: [10]. We would like to put this out for community discussion now.

The general idea behind this approach would be to first collaboratively build an OWL-based ontology (e.g. with webprotege.stanford.edu) which represents all classes and properties necessary to represent a certain relation/topic/part of reality in WD. This could then be proposed as a whole for WD property creation, so all required properties are created via one request. The ontology created could then also serve as a basis for data export from WD and would enable partial or complete export of a certain topic covered by the ontology. The ontology would therefore serve as the relational scaffold for a certain part of WD, mediating import and export processes or just giving a user a quick overview of how things relate. For a start, it would be important to discuss the gene-disease interaction properties and get them created, based on the ontology shown in the link above. Sebotic (talk) 21:12, 3 April 2015 (UTC)

  Notified participants of WikiProject Medicine


New properties

Just created decreased expression in (P1910), increased expression in (P1911), gene deletion association with (P1912), gene duplication association with (P1913), gene insertion association with (P1914), gene inversion association with (P1915), gene substitution association with (P1916), posttranslational modification association with (P1917), altered regulation leads to (P1918). --Tobias1984 (talk) 10:26, 30 May 2015 (UTC)

I'm noticing that a lot of articles on the English Wikipedia are linked to Wikidata items about proteins, rather than to Wikidata items about genes. For example, the Wikipedia entry on VIPR1 links to the protein item Q6594360 rather than to the corresponding gene item Q18255322. Discussions above have settled on the approach of keeping gene product information (e.g. protein, RNA) in separate Wikidata items from genes. It seems to me that Wikipedia articles such as this should be linked to Wikidata items about genes (which can in turn be linked in a structured way to all the content attached to the protein/other gene product entries). As we begin to contemplate making use of Wikidata information in the Wikipedia infoboxes, this is an important issue to resolve. Thoughts? --Genewiki123 (talk) 19:04, 29 June 2015 (UTC) Ping @Boghog:@Emw:

Hi Genewiki123. Fascinating. Using Wikidata for the GNF protein boxes may eventually make producing reciprocal links with external databases like IUPHAR easier, but only after better tools for querying Wikidata become available. Concerning GeneWiki articles, the vast majority of these are about both the gene and gene product. I am still trying to get my head around Wikidata, so I do not yet have a strong opinion on how GeneWiki data should be organized in Wikidata. Is there any downside to having both gene item and protein item linked to the same Wikipedia article? Boghog (talk) 18:22, 1 July 2015 (UTC)
Hi Boghog! Wikidata items can only be linked to one Wikipedia article. It needs to be one to one. We should be able to get all of the content attached to protein records (images, GO terms, etc.) to render on a Wikipedia article about the gene. This would happen through queries that originate at the gene, request protein items connecting to it via the 'encodes' property (of which there could be multiple), and then getting the information from those items. This kind of query is now straightforward within the context of wikidata. On English Wikipedia, we can currently only display content that is available on the wikidata item that is linked to the Wikipedia page. However, when "arbitrary access" arrives on EN (as it already has on e.g. Dutch Wikipedia), we will be able to get to all the content we need. To make this technically feasible, we really need to unify the Wikipedia article to Wikidata entry mapping pattern in a predictable way and I think the gene items are the right way to do it. --I9606 (talk) 20:08, 1 July 2015 (UTC)
Since a protein may consist of subunits that are encoded by different genes, I think it makes more sense to link a Wikipedia page about that gene to a gene item in Wikidata. For example, Hemoglobin, alpha 1 and Hemoglobin, alpha 2 are different genes with different Wikipedia pages and different protein boxes, but they have the same UniProt ID. Yes, there are two different items for HBA1 and HBA2 in Wikidata as well, but that's one less property to disambiguate the two protein items (4 left). It makes more sense (to me anyway) to go with the item with more properties and more specific properties (in this case genes). Gtsulab (talk) 00:47, 2 July 2015 (UTC)
@Gtsulab: This is both a very important and a very annoying example. I think this is a relic of times when researchers were focused more on proteins because work with genes was not doable, and it should actually be split into two different UniProt IDs, simply because of genomic variation. If you look at the variation section on this UniProt page, it is not possible to discriminate between variation in HBA1 and HBA2. Although the protein sequence is the same for both, the UTRs are not, which can have significant effects on their expression, so variants/mutations can have different effects depending on the gene where the variation sits. I therefore strongly favor using genes as the Wikidata item which should be linked to the appropriate (GeneWiki) pages in Wikipedia. Moreover, although not all genes encode proteins, all proteins are in some form (even for trans-splicing) encoded by genes, so they still form the uppermost layer in the Central Dogma of Biology. Sebotic (talk) 18:23, 2 July 2015 (UTC)
Hi Ben. OK, if there can only be a one-to-one mapping of Wikidata items and Wikipedia articles, I would agree that the gene should be linked to the article, not the protein. Cheers. Boghog (talk) 10:01, 2 July 2015 (UTC)
I'm a little confused here. Are you advocating for linking Wikipedia articles that currently discuss both genes and their products to items in the class 'gene' on Wikidata as I proposed above (apologies for the username change mid-stream here)? It's not really the case that genes have more or more specific properties. For example, the Gene Ontology annotations and structural data/images really belong to the proteins. It's more important that this information can be represented in a way that is logical, consistent and useful for applications - with Wikipedia being the overwhelmingly most important near-term application to support. Nice example, by the way. I usually think of the opposite problem, typified by CDKN2A, which codes for multiple different proteins (P16, P14arf) with different functions. --I9606 (talk) 03:47, 2 July 2015 (UTC)
Yes. I think the Gene Wiki articles should be linked to the gene item in Wikidata, since the gene item has more properties tying it to different databases, and especially because of the issues of one protein being coded by multiple genes and one gene coding multiple isoforms of a protein (like FAS receptors). It makes more sense to me to tie the Wikipedia page to the gene item in Wikidata, and then draw the protein info from Wikidata links between protein items and gene items. It also makes it easier to be consistent if/when doing other gene products in the future, like microRNAs, where the precursors all have their own Entrez Gene ID and the fully matured MIR may not. Just my two cents. Gtsulab (talk) 14:25, 2 July 2015 (UTC)
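The lookup pattern described in this thread (start at the gene item, follow 'encodes' to each protein item, then read the wanted properties from those items) can be sketched over plain dicts. The property ID P688 for 'encodes', the function name and the placeholder item IDs are assumptions for illustration; a real infobox would do this through Lua and arbitrary access rather than Python.

```python
ENCODES = "P688"  # assumed property id for 'encodes'


def protein_data_for_gene(gene_id, items, wanted_properties):
    """Collect selected properties from every protein a gene encodes.

    items: hypothetical in-memory store mapping item id ->
        {property id: [values]}.
    Returns {protein item id: {property id: [values]}}; a gene may
    encode several proteins, so the result is keyed by protein item.
    """
    gene = items.get(gene_id, {})
    collected = {}
    for protein_id in gene.get(ENCODES, []):
        protein = items.get(protein_id, {})
        collected[protein_id] = {
            prop: protein[prop]
            for prop in wanted_properties
            if prop in protein
        }
    return collected
```

Keying the result by protein item keeps cases like a gene with multiple distinct products from being flattened into one ambiguous list.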


Proposal for bringing microbial genome, gene, and protein items to Wikidata

Greetings Project Molecular Biology!

I am new to the community and working with Andrew Su and the GeneWiki team. I am a microbial genomicist by training, and am here to build a centralized model organism database for microbial genome, gene and gene product information. This project will create in Wikidata a structured database that will include current genetic information for all microbial organisms, and make that data accessible for making the kinds of associations currently being made with human genes, diseases and drugs. To do this, pending community support, we will modify the previously established ProteinBoxBot infrastructure to aggregate the wealth of knowledge of microbial genetics into the Wikidata project, starting with bacteria.

The project will consist of two major stages:

1) Develop ProteinBoxBot to populate Wikidata with microbial gene models/annotations, genome features and gene product information, from reliable public repositories.

2) Create a generic genome browser for all microbial organisms that will take advantage of the consistent and computer readable format of Wikidata genetic items.

Data resources

It is a great time to bring microbes to Wikidata! In addition to the reliable and robust MyGene.info, the May 17th 2015 release of the updated NCBI bacterial RefSeq repository, Release 70, provides an excellent initial data source. From the ~6400 re- or newly annotated bacterial genome assemblies included, 3268 high quality reference and representative genomes (i.e. those that represent strain groupings for a species) have been selected to populate the bacterial genome database maintained by NCBI Bacterial RefSeq. This greatly reduces the noise of the redundant nature of microbial genomics, by presenting a representative set of genomes for each species of bacteria, and provides an ideal initial framework for our project. Initially, we will include 120 reference sequences, curated by NCBI, as the highest quality and most validated sequence data covering a wide range of microbes. Those genomes, their gene counts, and the total number of gene items being proposed for creation are viewable here: microbe genome and gene item table.

Gene item development

The first task (that we would greatly appreciate community input on) has been designing what microbial gene and protein items will look like. The task page for the project PBB/Microbial gene and protein items presents the structure of our candidate microbial gene item, displayed here in Figure 1.

 
Figure 1. A microbial gene item in Wikidata (blue) and the structure of its linkage (through QIDs and properties) to the organism item of origin (green) and the protein item it encodes (orange). Solid black lines indicate WD Properties and dashed black lines indicate WD Property Qualifiers.

This diagram displays the properties and QIDs that define an item, along with the linkage between the gene item, the organism it is found in, and the product it encodes. This structure contains the basic properties and associations that are relevant to a microbial gene and gene product item.

Links to the prototype gene item displayed in Figure 1 (a gene in the bacterial species Chlamydia trachomatis), and the related items in the structure are as follows:

Related items

(organism) Chlamydia trachomatis Q131065,

(representative strain A) C. trachomatis L2/434/BU Q20800254,

(representative strain B) C. trachomatis D/UW-3/CX Q20800373,

(gene item) translocated actin-recruiting phosphoprotein Q20797449,

(protein item) translocated actin-recruiting phosphoprotein Q17126483,

Properties

This structure takes advantage of existing gene properties and (at this point) requires no new properties be created. However, because of the multiple strain nature of microbial species, multiple values for a few single value properties are necessary. Attached to each value will be the qualifier “found in taxon” that points to the representative strain item that the gene or protein item originates from.

The properties that require multiple values include:

Gene items

Entrez Gene ID P351,

genomic start P644,

genomic end P645,

Protein items

UniProt ID P352
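The multiple-value idea described above can be illustrated with a minimal sketch. It is not ProteinBoxBot code: the `Statement` class and `values_for_strain` helper are hypothetical names, the statement values are placeholders rather than real identifiers, and it assumes that "found in taxon" is property P703. The strain QIDs are the two representative strains listed above.

```python
from dataclasses import dataclass, field

FOUND_IN_TAXON = "P703"  # assumed property ID for the "found in taxon" qualifier

@dataclass
class Statement:
    """One property value plus its qualifiers (hypothetical toy structure)."""
    prop: str
    value: str
    qualifiers: dict = field(default_factory=dict)

def values_for_strain(statements, prop, strain_qid):
    """Return the values of `prop` whose qualifier points at `strain_qid`."""
    return [
        s.value
        for s in statements
        if s.prop == prop and s.qualifiers.get(FOUND_IN_TAXON) == strain_qid
    ]

STRAIN_A = "Q20800254"  # C. trachomatis L2/434/BU
STRAIN_B = "Q20800373"  # C. trachomatis D/UW-3/CX

# Placeholder values only; real Entrez IDs and coordinates are not shown here.
gene_statements = [
    Statement("P351", "<Entrez Gene ID in strain A>", {FOUND_IN_TAXON: STRAIN_A}),
    Statement("P351", "<Entrez Gene ID in strain B>", {FOUND_IN_TAXON: STRAIN_B}),
    Statement("P644", "<genomic start in strain A>", {FOUND_IN_TAXON: STRAIN_A}),
    Statement("P644", "<genomic start in strain B>", {FOUND_IN_TAXON: STRAIN_B}),
    Statement("P645", "<genomic end in strain A>", {FOUND_IN_TAXON: STRAIN_A}),
    Statement("P645", "<genomic end in strain B>", {FOUND_IN_TAXON: STRAIN_B}),
]

print(values_for_strain(gene_statements, "P351", STRAIN_A))
```

The point of the qualifier is that a consumer can always recover the strain-specific value of an otherwise single-valued property by filtering on it.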

I am really curious to see what the community thinks of our model and would really appreciate feedback, so please add your suggestions or comments below. 
 Cheers,

Putmantime (talk) 23:58, 19 August 2015 (UTC)

Update

This project is now operating under the MicrobeBot user account. Current contributions in the microbial genetic data space are as follows:

Reference Genome | Genes | Proteins
Chlamydia trachomatis D/UW-3/CX (Q20800373) | 884 | 888
Chlamydia trachomatis 434/BU (Q20800254) | 850 | 852
Buchnera aphidicola str. APS (Acyrthosiphon pisum) (Q21065226) | 569 | 568
Borrelia burgdorferi B31 (Q21065227) | 1319 | 1282
Pseudomonas aeruginosa PAO1 (Q21065234) | 3981 | 3570
Helicobacter pylori 26695 (Q21065231) | 1421 | 1410
Thermus thermophilus HB8 (Q21065233) | 2267 | 2268
Francisella tularensis subsp. tularensis SCHU S4 (Q21065232) | 1488 | 1473
Total | 14060 | 12311
Total Items | 26371

These 8 bacterial strains are a subsample of the mentioned 120 NCBI reference genomes that will be the initial group of bacterial genomes loaded. We have started with this small group to show that our data model is solid: all of these gene and protein items are linked together via the 'encodes/encoded by' properties, and all statements are referenced to standard, citing either NCBI or UniProt. While we have approval to move forward, I wanted to bring our progress to the community to make sure that it is fully understood what we are doing in terms of notability, referencing and scope. The latter will reach ~940,000 items when all 120 reference genomes are loaded, and I want to make sure it is fully understood by the MB community before we load on this scale. Any comments or suggestions are welcome.

Cheers,

Putmantime (talk) 19:49, 6 January 2016 (UTC)


New properties

medical condition treated (P2175) and drug or therapy used for treatment (P2176) are ready. Some discussion about labels and descriptions still needed. --Tobias1984 (talk) 12:30, 5 October 2015 (UTC)

substrate of (P2414) is ready now. --Tobias1984 (talk) 16:55, 21 December 2015 (UTC)

Wikimania 2016

Only this week left for comments: Wikidata:Wikimania 2016 (Thank you for translating this message). --Tobias1984 (talk) 12:00, 25 November 2015 (UTC)

Question

What's the difference between Forkhead box P2 (Q227241) and FOXP2 protein (Q21142761)?--Kopiersperre (talk) 19:52, 13 February 2016 (UTC)

Just duplicates, I think. FOXP2 protein (Q21142761) corresponds to the uniprot entry B7ZLK5, which itself seems to be an unreviewed duplicate of O15409 “FOXP2_HUMAN” (to which Forkhead box P2 (Q227241) corresponds). —Tinm (d) 00:24, 15 February 2016 (UTC)

Adding transcripts to bacterial genome annotations on Wikidata

Hi,

My name is Till and I am studying biology at Julius-Maximilians University in Würzburg, Germany. In the course of my master's thesis, I want to store bacterial annotation data consisting of sRNAs derived from transcriptome studies and NCBI-derived gene and protein annotations in Wikidata. Furthermore, interactions between genomic entities should be deposited as well. (A more detailed description of my thesis can be found on my userpage User:Till_Sauerwein within the section Master's thesis - Expose.) Right now, I am working on a data model that fits the data structure of Wikidata and builds on the work by Burgstaller-Muehlbacher et al., 2015 (http://dx.doi.org/10.1101/032144) and Putman et al., 2015 (http://dx.doi.org/10.1101/031286).

In the following I will present my initial suggestions how to store the genomic information in Wikidata and I am open for feedback regarding this.

The model should be composed of three main levels: genes, transcripts and gene products. Besides basic information like the genomic start and end of genes and transcripts, the model should also contain interactions between and within these three levels, so that questions like “Where does transcription factor X bind to the DNA?” (interaction between gene product and genes), “Which sRNA interacts with transcript Y?” (interaction between gene product and transcript) or “Which proteins can bind to protein X?” (interaction between gene products and another gene product) can be answered afterwards. Because a single prokaryotic gene can be part of different transcription units, I did not want to explicitly associate a gene with its transcripts and a transcript with its genes. Whether a gene lies within the borders of a transcript and vice versa can be answered via positional information later on.
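To illustrate how gene/transcript membership could be derived from positions alone, here is a minimal sketch. It assumes that every feature carries a genomic start, a genomic end and a strand on the same replicon; the `Feature` type, the function names and all coordinates are hypothetical and only serve the example.

```python
from typing import NamedTuple

class Feature(NamedTuple):
    """A gene or transcript with genomic coordinates (toy structure)."""
    name: str
    start: int   # genomic start, start <= end
    end: int     # genomic end
    strand: str  # "+" or "-"; comparing strands is an assumption of this sketch

def lies_within(inner: Feature, outer: Feature) -> bool:
    """True if `inner` is fully contained in `outer` on the same strand."""
    return (
        inner.strand == outer.strand
        and outer.start <= inner.start
        and inner.end <= outer.end
    )

def genes_in_transcript(transcript: Feature, genes: list) -> list:
    """Names of all genes that lie within the borders of the transcript."""
    return [g.name for g in genes if lies_within(g, transcript)]

# Made-up coordinates: one transcription unit covering two of three genes.
genes = [
    Feature("geneA", 100, 400, "+"),
    Feature("geneB", 450, 900, "+"),
    Feature("geneC", 950, 1200, "+"),
]
operon_transcript = Feature("transcript1", 80, 920, "+")

print(genes_in_transcript(operon_transcript, genes))
```

With this approach a gene that belongs to several transcription units simply falls within several transcripts, so no explicit gene-to-transcript statements are needed.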

 
Figure 1: Data model proposal for bacterial genome annotation including transcripts on Wikidata

Figure 1 shows a first sketch of the model. The Wikidata items of the three levels are blue for transcripts, green for genes and orange for gene products. Below each item there is a preliminary list of Wikidata statements that are planned to be included. Whether, e.g., “Terminator” will be its own item or whether we generate properties like “genomic start of terminator” and “genomic end of terminator” can be discussed later. I would be pleased if you could tell me what you think of the basic structure of the model. If this is not the right place to discuss the model, please let me know.

best regards,

Till Sauerwein (talk) 12:50, 26 February 2016 (UTC)

Also a short "Hello" from my side. Till is doing this project with me and we had several discussions how to model the described relationships. Clearly there is more than one way to do this. We tried to follow NCBI's approach but this seems not to cover everything that we are intending to include. We would appreciate any feedback! --Konrad Foerstner (talk) 21:52, 26 February 2016 (UTC)


@Konrad Foerstner, Till Sauerwein: Sorry for the late answer. I can only say one thing: WD is not a reference, so don't expect to find more information here about how to do a classification. WD, like WP, doesn't want to be an authority. You can only find some experts who will provide personal advice. So start your classification based on your own work and external references, and once you have a complete view you can propose it to WD as a classification scheme. Snipre (talk) 13:10, 25 April 2016 (UTC)

Adding Drug-Target Information to Wikidata

Hi,

I work with the Dumontier Lab in the Biomedical Informatics Research Program at Stanford University and we are interested in incorporating drug-target information from DrugBank into Wikidata. Specifically, we would like to add to Andrew Su's work on importing drug/protein data. Also, because DrugBank contains information about drugs and their targets, we would be able to add relational data to both items.

I posted on the Wikidata Project chat and there seems to be some concern about the accuracy of the data in DrugBank. I'll be looking into this shortly.-Crowegian

References tab --> presentations

Just FYI, I changed the link in the header bar from Wikidata:WikiProject_Molecular_biology/References (which doesn't seem to be used) to Wikidata:WikiProject_Molecular_biology/Presentations. I hope we can use the latter to organize links to all of our bio-wikidata talks and posters... Andrew Su (talk) 17:05, 26 April 2016 (UTC)

Adding protein family and domains

Hey all, I'm part of Andrew Su's group at TSRI and we're proposing to add protein family and domain information for all proteins. This information would come from the InterPro Database and will build upon the ongoing work of incorporating all genes and proteins into Wikidata. Adding protein family information would allow several new use cases and would allow linking classes of proteins together across species; for example: finding all 5-hydroxytryptamine receptors (across any species), or finding all human proteins that are G-protein coupled receptors. Furthermore, protein domain information would be added, which would allow, for example, finding all proteins containing an Immunoglobulin-like fold.

An example of how this would look for Human HTR2A is shown in Figure 1. A new property "InterPro ID" would be required. Human HTR2A and mouse HTR2A are 'subclass of' the 5-Hydroxytryptamine 2A receptor family, which itself is a subclass of the 5-hydroxytryptamine receptor family, and so on. Both human and mouse 5HTR2A (only human shown for simplicity) contain a G protein-coupled receptor, rhodopsin-like domain which is located from residue 91-380. Is it appropriate to use genomic start and stop for this?
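The kind of cross-species lookup that the 'subclass of' chain enables can be sketched on a toy in-memory graph. This is only an illustration of the traversal, not a Wikidata query: the dictionaries, the function names and the taxon labels are hypothetical, and only the two 'subclass of' levels stated above are modelled.

```python
# Toy 'subclass of' edges, following the HTR2A example in the text.
SUBCLASS_OF = {
    "human HTR2A": ["IPR000455"],  # 5-Hydroxytryptamine 2A receptor
    "mouse HTR2A": ["IPR000455"],
    "IPR000455": ["IPR002231"],    # 5-hydroxytryptamine receptor family
}

# Toy taxon annotation for the protein items (labels are illustrative).
TAXON_OF = {
    "human HTR2A": "Homo sapiens",
    "mouse HTR2A": "Mus musculus",
}

def ancestors(item, subclass_of):
    """All classes reachable from `item` by following 'subclass of' edges."""
    seen = set()
    stack = list(subclass_of.get(item, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(subclass_of.get(parent, []))
    return seen

def members_of(family, subclass_of, taxon_of, taxon=None):
    """Proteins in `family` (directly or transitively), optionally per taxon."""
    return sorted(
        protein
        for protein in taxon_of
        if family in ancestors(protein, subclass_of)
        and (taxon is None or taxon_of[protein] == taxon)
    )

# All 5-hydroxytryptamine receptors across species, then human only.
print(members_of("IPR002231", SUBCLASS_OF, TAXON_OF))
print(members_of("IPR002231", SUBCLASS_OF, TAXON_OF, taxon="Homo sapiens"))
```

Because membership is resolved transitively, a protein only needs a statement for its most specific family; the broader families follow from the hierarchy.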

 
5-hydroxytryptamine protein families and domains

Links

WikiData: Human HTR2A Mouse HTR2A

InterPro Families: 5-Hydroxytryptamine 2A receptor (IPR000455) 5-hydroxytryptamine receptor family (IPR002231) G protein-coupled receptor, rhodopsin-like (IPR000276)

InterPro Domains: GPCR, rhodopsin-like, 7TM (IPR017452)

The InterPro page for this human protein is here

New Items Added:

Initially, we would just annotate human, and then move on to mouse and the microbial items that Putmantime is adding. Each InterPro family/domain would be added as a new item and then referenced to its related proteins.

Total number of InterPro families: 19788 (7615 of which are found in at least 1 human protein)

Total number of InterPro domains: 8439 (4125 of which are found in at least 1 human protein)

Items to be modified:

There are 27486 human proteins in Wikidata, of which 14976 have at least one InterPro family assigned, 9828 have at least one InterPro domain (but no families), and 24804 have either.

Comments and suggestions welcomed.

Protein length, mass and sequence

Hi, does the creation of the properties "protein length", "protein mass" and "protein sequence" make sense? And a subsequent import of the values from UniProt for all items that have a UniProt ID (P352)? They would be used in the German version of the infobox protein (see here). I can help with the data collection (table). Cheers, --Ghilt (talk) 19:45, 20 June 2016 (UTC)

@Ghilt: As mentioned, protein sequence will be difficult because the string datatype in Wikidata has a limit of 400 characters and there is a substantial number of proteins that are longer than 400 amino acids. The follow-up question here would be: if the amino acid sequence should be imported, why not also the nucleic acid sequences for genes? My guess would be that if somebody needs the sequence, they can follow the UniProt ID or NCBI gene ID and retrieve the sequences from the original source. The length and mass properties, I think, would be reasonable property proposals if they are required for infoboxes. Sebotic (talk) 01:29, 28 June 2016 (UTC)
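As a small illustration of the length constraint, the sketch below splits a set of sequences into those that would fit a 400-character string value and those that would not, while the length itself can always be stored. The 400-character limit is the figure quoted in this discussion; the function name, the accession keys and the sequences are made up.

```python
STRING_LIMIT = 400  # string datatype limit as quoted in the discussion

def split_by_string_limit(sequences, limit=STRING_LIMIT):
    """Partition {accession: sequence} into storable and too-long entries.

    Returns two dicts mapping accession -> protein length.
    """
    storable, too_long = {}, {}
    for accession, sequence in sequences.items():
        target = storable if len(sequence) <= limit else too_long
        target[accession] = len(sequence)
    return storable, too_long

# Hypothetical accessions with artificial sequences of known length.
example = {
    "short_protein": "M" + "A" * 119,  # 120 residues
    "long_protein": "M" + "L" * 849,   # 850 residues
}

storable, too_long = split_by_string_limit(example)
print(storable)
print(too_long)
```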

Relations between gene, protein, and pharmaceutical drug (or "general" item?)

Hello, everyone.

I've read the above discussions and understood that there are 4 items, (human, mouse)×(protein, gene), arranged in Wikidata, corresponding to Wikipedia's 1 article.

In the case of en:Oxytocin, however, in addition to Oxytocin/neurophysin I prepropeptide (Q11938629) (human protein), OXT (Q14820911) (human gene), Oxytocin (Q14820920) (mouse protein) and Oxt (Q14820914) (mouse gene), there is one more item, oxytocin (Q169960) (pharmaceutical drug or hormone). Wikipedia sitelinks are in oxytocin (Q169960). But there is no link between oxytocin (Q169960) and the others. How, and with what property, can we describe these relationships?

On the other hand, en:Vasopressin has only arginine vasopressin (Q183011) (human protein), AVP (Q14820686) (human gene), Arginine vasopressin (Q14820693) (mouse protein) and Avp (Q14820688) (mouse gene). Sitelinks are in AVP (Q14820686), but chemical or pharmacological properties, like DrugBank ID (P715), CAS Registry Number (P231) or PubChem CID (P662), are in arginine vasopressin (Q183011). This is inadequate and also ugly.

I think we should take en:Oxytocin for a model. oxytocin (Q169960) may be equivalent to what @Tobias1984 called "in general" on this page (11:33, 26 August 2013 (UTC)). How about restarting the discussion about "in general" items? It has already been pointed out that, if there are "general" items, sitelinks can be simply placed there. There may also be demands for "general" items in the context of medicine such as endocrinology and physiology, because people in that field often use animal models in order to describe the mechanisms of human diseases or physiologic functions, assuming that the functions of human gene (products) and mouse gene (products) are almost the same. For example, arginine vasopressin (Q183011) is not suitable for describing syndrome of Inappropriate antidiuretic hormone secretion (Q959457), I suppose. --Okkn (talk) 17:39, 20 August 2016 (UTC)

@Okkn: Usually, if a biotech drug and an in vivo protein have the same sequence, protein annotation and drug annotation should go to the same item. If they have a different sequence -> different items. Certainly, semantically, drug and protein are 2 different concepts. Sebotic (talk) 20:58, 31 August 2016 (UTC)

Protein-protein interactions

Dear all, I think it is high time that we import protein-protein interactions into Wikidata. Is someone working on it? @ProteinBoxBot, Sebotic:? If not, let's discuss how it should be achieved. There are many public databases that we can use, but I see danger in blindly importing all measured protein-protein interactions without accounting for the trustworthiness. All in all, all proteins interact with each other to some extent (the measure of which is the dissociation constant). Does anyone have a suggestion how to overcome this problem? --Vojtěch Dostál (talk) 19:31, 6 October 2016 (UTC)

@Vojtěch Dostál: I am working on protein-protein interaction import. I agree that we need to be careful what to import so that the data stays useful. We definitely want, as I did for Gene Ontology, a qualifier stating the molecular biology/biochemistry method used for the determination. We could rank on methods, e.g. I would be careful with importing too many of the yeast two-hybrid assay hits, for which a ton of open data exists. My starting point resource is BioGrid, because they have an enormous amount of well-curated data and it's CC0. Several other great resources, e.g. STITCH, are open but CC-BY and therefore cannot actually be used without special permission. The other great thing about BioGrid is that there is a PubMed ID for almost every interaction they have in their dataset. Any resources you have in mind for an import? Sebotic (talk) 20:14, 6 October 2016 (UTC)
@Sebotic: Yes, Biogrid is my favourite and I agree that it is well documented (not sure about the curation - do they somehow curate individual interactions or filter the data?). Still, the data comes from mainly 2-3 large MS studies and the Biogrid's dataset will continue to grow enormously if more large-scale MS studies are added there. You are not afraid that protein items will turn into lists of hundreds of interacting proteins over time? But if we want to do it, let's do it. Tell me if you'd like any assistance from my side, or keep us updated here if you want to do this on your own :). I am looking forward to querying protein protein interactions! --Vojtěch Dostál (talk) 21:28, 6 October 2016 (UTC)
As one of the people behind STRING and STITCH, I wonder if it is really useful to put PPI into WikiData. I think it's a good idea to curate information on the individual proteins and chemicals, but the O(N^2) space of interactions? What would be the advantage over dedicated resources like STRING or IntAct, or semantic web approaches that have imported the PPI? Sorry for being somewhat contrarian! MichaK (talk) 14:15, 13 October 2016 (UTC)
@MichaK To me, PPIs are interesting linked data and their presence in Wikidata would enable cool querying across different layers of information. For example, we could search for interactions which occur between two proteins of identical biological process (P682) or search for all protein-protein interactions occurring between GTPases and GAPs etc etc... I don't doubt that a computational biologist like you could run such queries using existing tools on the net but mortals like me are happy to have learned SPARQL and don't go any further in our bioinformatics efforts :-).
And last but not least, we could generate PPI data on Wikipedia from Wikidata - a much more flexible approach than having to manually update them. --Vojtěch Dostál (talk) 22:26, 29 October 2016 (UTC)
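The method-based filtering discussed in this thread could look roughly like the following sketch. It is not the actual import code and does not use BioGrid's file format: the `Interaction` record, its field names, the method labels and the PubMed IDs are all hypothetical. It keeps only interactions that have a literature reference and were not determined by an excluded method, and groups the remaining evidence per unordered protein pair so the method can later become a qualifier.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Interaction:
    """One reported interaction (hypothetical record layout)."""
    protein_a: str
    protein_b: str
    method: str               # determination method, later a qualifier
    pubmed_id: Optional[str]  # reference; None if the source gives none

# Methods to treat with caution; the label is an assumed, normalised name.
EXCLUDED_METHODS = {"two-hybrid"}

def select_for_import(interactions, excluded_methods=EXCLUDED_METHODS):
    """Group importable evidence by unordered protein pair."""
    kept = {}
    for record in interactions:
        if record.pubmed_id is None:
            continue  # no reference, nothing to cite in the statement
        if record.method.lower() in excluded_methods:
            continue  # skip methods judged too noisy for import
        pair = frozenset((record.protein_a, record.protein_b))
        kept.setdefault(pair, []).append(record)
    return kept

# Made-up example records.
records = [
    Interaction("P1", "P2", "Two-hybrid", "PMID-0001"),
    Interaction("P2", "P1", "Affinity Capture-MS", "PMID-0002"),
    Interaction("P1", "P3", "Co-crystal Structure", None),
    Interaction("P3", "P4", "Co-crystal Structure", "PMID-0003"),
]

selected = select_for_import(records)
for pair, evidence in selected.items():
    print(sorted(pair), [e.method for e in evidence])
```

Grouping by unordered pair also limits how many statements a protein item accumulates, since repeated reports of the same pair collapse into one entry with several pieces of evidence.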