Wikidata talk:WikiProject Molecular biology

This is the talk page for discussing improvements to WikiProject Molecular biology.

GO Term Provenance

Andrew Su
Marc Robinson-Rechavi
Pierre Lindenbaum
Michael Kuhn
Boghog
Emw
Chandres
Dan Bolser
Pradyumna
Chinmay
Timo Willemsen
Salvatore Loguercio
Tobias1984
Daniel Mietchen
Optimale
Mcnabber091
Ben Moore
Alex Bateman
Klortho
Hypothalamus
Vojtěch Dostál
Gtsulab
Andra Waagmeester
Sebotic
Mvolz
Toniher
Elvira Mitraka
David Bikard
Dan Lawson
Francesco Sirocco
Konrad U. Förstner (talk)
Chris Mungall (talk)
Kristina Hettne
Hardwigg
i9606
Putmantime
Tinm
Karima Rafes
Finn Årup Nielsen
Jasper Koehorst
Till Sauerwein
Crowegian
Nothingserious
Okkn
AlexanderPico
Amos Bairoch
Gstupp
DePiep
Was a bee
SarahKeating
Muhammad Elhossary
Ptolusque
Netha
Damian Szklarczyk
Kpjas
Thibdx
Juliansteinb
TiagoLubiana
SCIdude
Photocyte
Yusra Haider
  Notified participants of WikiProject Molecular biology

Hey all, I'm planning on adding/updating GO annotations to protein items using a new provenance pattern that preserves references to the original curator and the journal article the claim was sourced from. The pattern is as follows:

GO annotations will be referenced in a manner similar to how they are displayed in QuickGO (example; the format is described in detail here and here). Data from the "With" column is not captured at this time. Each GO term statement should have a qualifier stating the determination method (P459). A statement can have multiple determination methods and multiple references. The reference should include the following properties:

An example item is RNA-binding protein POP5 YAL033W (Q27553062) (right). Comments and suggestions welcome. Some notes:

Improper aliases

There are more than 300k items for individual proteins with "protein" as an alias, like [1]. These aliases make no sense. Maybe a bot can remove them.--GZWDer (talk) 14:18, 6 February 2017 (UTC)

See also this bot request. --SCIdude (talk) 14:44, 27 July 2019 (UTC)

Subclass of -> Instance of for Genes and Proteins

It's very useful for application building and querying to be able to know what an entity "is" without having to traverse class hierarchies. For each ontology term we maintain, we add an appropriate instance of relation. For example, blindness (Q10874) is a subclass of retinal disease (Q550455) and an instance of disease (Q12136). If there are no objections, we (ProteinBoxBot) will move the "subclass of" gene (Q7187) or protein (Q8054) to "instance of" for gene and protein items. Proteins that are in protein families will be a subclass of that family (e.g. Succinyl-CoA:glutarate-CoA transferase (Q21124586)). Gstupp (talk) 20:23, 2 April 2017 (UTC)

Soliciting suggestions of new data sources

Dear all, we on the Gene Wiki / ProteinBoxBot team are doing some planning and prioritization of future biomedical data sets to load, and we'd like to solicit suggestions from the broader Wikidata community. Historically, the scope of our bot loading effort has revolved around genes, proteins, drugs, diseases, and microbes. And more recently we've also helped related groups load data on genetic variants and pathways. We would welcome suggestions of either other related entity types that should be systematically loaded, or data sources that describe relationships between these entity types. Obviously, availability of a high-quality, CC0-licensed data source is essential. Please let us know if you have any suggestions. (Cross posting to WD:MB, WD:MED, and Wikidata:WikiProject_Chemistry.) Best, Andrew Su (talk) 20:03, 23 June 2017 (UTC)

Hi @Andrew Su:. How about "cytogenetic location" data? (e.g. ABO gene located at "9q34.2" [2]). When I was making Template:Genetics properties, I found that cytogenetic location data does not exist yet. As you know, all (or almost all?) genes already have genomic start (P644) and genomic end (P645) (base-pair location in a specific GRCh version). Thank you for your effort on that! --Was a bee (talk) 10:00, 5 July 2017 (UTC)
Currently there is a property proposal (Wikidata:Property proposal/Cytogenetic location). No opinions have come in yet (too technical...?) --Was a bee (talk) 10:20, 5 July 2017 (UTC)
@Was a bee: I created an issue for it here: https://github.com/SuLab/GeneWikiCentral/issues/38. I think it should be relatively straightforward to add, but I did tag it as "low priority". If there are compelling use cases or queries that would benefit from adding this info, let us know and we can look at upping the priority. Thanks for the suggestion! Best, Andrew Su (talk) 16:21, 5 July 2017 (UTC)
@Andrew Su: Yesterday I tried adding a new column to Infobox_gene (en:Module_talk:Infobox_gene#Gene_location_column_added). I don't know what you think about that addition, but my thinking is that it would be useful for general readers if band information were accessible through that column. What do you think?--Was a bee (talk) 05:03, 19 August 2017 (UTC)
@was a bee: Bravo, I love it! Added my support for the property proposal... If you create/enhance the visualization to include cytogenetic location, we will load and maintain the data using our bot. Nice work! Best, Andrew Su (talk) 19:20, 21 August 2017 (UTC)
Hi @Andrew Su:. How about "Open Targets" data? (e.g. for the F12 gene the current version of the Open Targets Platform (which is free to use, no need for registration) shows the association of that gene with 192 diseases [3]). The association is based on different types of information (or evidence) such as genetics (somatic or germline), drug information, text mining, affected biochemical pathways, RNA differential expression and mouse models. The opposite is also true: one can start from the disease point of view and find which genes are associated with that disease (e.g. there are 3206 genes, or targets, associated with Alzheimer's [4]). Wikidata could also link to the profile page of a gene (e.g. F12 [5]) or of a disease [6]. --Rejancar (talk) 10:00, 13 December 2017 (UTC)

classification of properties

I created Wikidata property to identify proteins (Q42415644) and Wikidata property to identify proteins (Q42415644) to organize all properties that uniquely identify (see as part of Wikidata:Identifier). This does not include genomic start (P644) and genomic end (P645) because they only identify if used together. Could you please have a look at this list (unless it has become empty) and also classify these properties? If the properties do not identify individual genes or individual proteins, they must be put into Wikidata property for authority control (Q18614948) or another of its subclasses. -- JakobVoss (talk) 15:03, 30 October 2017 (UTC)

Haplogroups

Hi, this may not be directly within the scope of this project. However, this project may still be the best place to ask for help. I would like to convert Template:Infobox haplogroup (Q10562645) in fiwiki to use Wikidata. At the template level I can do it, but I need help with choosing the correct Wikidata properties to store the data. It would be great if someone could, as an example, store the information from the w:en:Haplogroup N (mtDNA) infobox in the Haplogroup N (Q118710) Wikidata item so that it is in line with the current practices of this WikiProject. --Zache (talk) 09:43, 3 December 2017 (UTC)

OK, I made a WikiProject page for haplogroups, Wikidata:Haplogroups, and I tried to populate Haplogroup N (Q118710). So now I have some questions:

All other suggestions/comments are welcome too --Zache (talk) 10:13, 8 December 2017 (UTC)

Sometimes people confuse Y-DNA and mtDNA haplogroups. So how about different from (P1889)? --Was a bee (talk) 13:00, 8 December 2017 (UTC)
Added different from (P1889), thanks. --Zache (talk) 13:15, 8 December 2017 (UTC)

Help needed merging Gene Wiki pages

En:C1S has been merged into En:Complement component 1s. I now need to merge the corresponding Wikidata items (Q17854065 and Q5156403 respectively), but given there are corresponding articles in other languages, I am not sure how to go about merging the Wikidata items. Do the corresponding articles in other languages also need to be merged? Any pointers would be greatly appreciated. Cheers. Boghog (talk) 07:41, 20 December 2017 (UTC)

Boghog I merged the items. There weren't any conflicting articles in the same language (unlike English), so I just moved the corresponding wiki links over to one item and then merged it. Gstupp (talk) 20:42, 20 December 2017 (UTC)

Thanks for merging and for the pointers. Much appreciated. Boghog (talk) 20:47, 20 December 2017 (UTC)

Classification of the entities managed by this project in Wikidata

Hi, the project ontology detected problems in the classification of genes, proteins and processes, mainly that those entities are both subclasses and instances of the same item, which is ontologically a problem. There is also inconsistent usage of sources like the Gene Ontology compared with the actual statements created on Wikidata by ProteinBoxBot. The issue has been raised on Project chat, among others, in a discussion about issues in our class tree, and a discussion started there with the bot owners. They ask for a community consensus on a proposed solution; you’re welcome to comment there. If that’s too confusing, maybe I can write a subpage for comment here; please ask. author  TomT0m / talk page 11:55, 8 February 2018 (UTC)

Describing a molecular biological process

Hi,

I'm wondering how to describe a molecular biological process on Wikidata. For example:

  1. Substance T release
  2. Binds to substance T receptors on microglia
  3. Microglia activation
  4. Release alphaTNF

It could be an item that stores the process, but it would also be interesting to have the list of receptors and produced molecules for cells. Before I misuse the properties, what would be good practice? I guess there may be examples or guides somewhere that I missed?

Regards

-- Thibdx (talk) 00:33, 23 December 2018 (UTC)

It's not exactly clear to me the type of statements you'd like to add. Can you give a few examples? Best, Andrew Su (talk) 05:51, 18 January 2019 (UTC)
I think Reactome and BioCyc/MetaCyc are the respective bio databases that have years of experience building ontologies for these kinds of concepts. Please check first how they do it. --SCIdude (talk) 14:00, 26 July 2019 (UTC)

Possible merge required

Is Cell cycle regulator of NHEJ (Q21438637) a duplicate of CYREN (Q18045925)? w:en:C7orf49 would like to be attached to a wikidata item, but it's not clear to me which of these two is its friend. --Tagishsimon (talk) 04:57, 18 January 2019 (UTC)

@Tagishsimon:: those two items should not be merged. One describes the human gene and the other describes the human protein, and they have reciprocal statements based on encodes (P688) and encoded by (P702). The convention that is primarily used is to link the WP page to the gene item (CYREN (Q18045925) in this case). Hope that helps! Best, Andrew Su (talk) 05:48, 18 January 2019 (UTC)
Excellent; thank you, Andrew. --Tagishsimon (talk) 06:08, 18 January 2019 (UTC)

Molecular Reaction?

On the properties page (https://www.wikidata.org/wiki/Wikidata:WikiProject_Molecular_biology/Properties), in the subsection named "Proposed Properties linking genes to genes", you have a property that is named reaction. I think that is very interesting, and here (https://www.wikidata.org/wiki/Wikidata_talk:WikiProject_Chemistry#Modelling_a_chemical_reaction?) they are also talking about that. Could you please tell me what its description is? I could not figure that out on https://string-db.org. Juliansteinb (talk) 20:07, 10 April 2019 (UTC)

A possible Science/STEM User Group

There's a discussion about a possible User Group for STEM over at Meta:Talk:STEM_Wiki_User_Group. The idea would be to help coordinate, collaborate and network cross-subject, cross-wiki and cross-language to share experience and resources that may be valuable to the relevant wikiprojects. Current discussion includes preferred scope and structure. T.Shafee(evo&evo) (talk) 02:36, 26 May 2019 (UTC)

"determination method" property on GOA references

Hello, GOA refs provide statements backed by determination methods (DM); the certainty of statements about genes/proteins can be deduced from the determination method (P459) values provided by one or several references. The problem is that the same property determination method (P459) is both on the statement and its reference which makes no sense semantically---consequently there are scope violations. I'd propose a new property applicable to references, e.g. "uses determination method" or "provides evidence" that has the same value as the determination method (P459) at the moment. Batch changing this would be no problem I think. Comments? --SCIdude (talk) 13:42, 29 July 2019 (UTC)

Yeah, I agree that it's a bit semantically imprecise as it's currently implemented. In my opinion, the two options are to create a new property as you suggest (advantage: semantically precise, resolves scope violations; disadvantage: proliferation of highly-related properties may cause confusion) or to expand the scope of the existing property (advantage: simpler data model that still resolves scope violations; disadvantages: formalizes a semantically imprecise solution). I'm not passionate either way. We can certainly modify the bot to match whatever consensus emerges here... Best, Andrew Su (talk) 16:29, 29 July 2019 (UTC)

prepro property

There appears to have been no discussion/proposal about a property for precursor proteins, e.g. Insulin (Q39798) has precursor Insulin (Q7240673). The connected type would be protein precursor (Q258658). Possible labels:

  • "precursor"
  • "has precursor"
  • "cleaved from"

Possible qualifier: "catalyst" or "cleaving enzyme", another property?

Please comment on this, I'd like to make it my first property proposal. --SCIdude (talk) 05:53, 30 July 2019 (UTC)

I would support such a property. A number of precursor proteins serve as biomarkers for diseases, e.g. PIIINP in Vascular Ehlers Danlos and miscellaneous fibrosis diseases, or Pro-SFTPB for idiopathic pulmonary disease. My guess is that you'll need to justify why 'has part' or 'part of' isn't sufficient, so having an example ready will likely help. -- Gtsulab (talk) 18:25, 31 July 2019 (UTC)
To stay with insulin, the end product is a hexamer of two parts of the prepro. Also there may be post-translational modifications. --SCIdude (talk) 17:27, 1 August 2019 (UTC)

WD UniProt duplicates/fragments policy

From https://www.uniprot.org/help/redundancy:

  • UniProtKB/TrEMBL: one record for 100% identical full-length sequences in one species;
  • UniProtKB/Swiss-Prot: one record per gene in one species;

UniProt protein ID (P352) has distinct and single value constraints which means with:

  • UniProtKB/TrEMBL: fragments have different IDs, so there is no violation, but identical proteins from different bacterial strains (same species) get the same ID, violating the constraints
  • UniProtKB/Swiss-Prot: fragments of the same gene product have the same ID as the prepro, violating the constraints

To make the constraints more lax, one would need to check that the two proteins have a precursor relationship (=PASS), or that, if the taxa are different, the common ancestor of the two taxa is at species rank or below (=PASS). Is this possible? If not, the identical bacterial proteins could be merged, but not the different fragments of the same prepro. --SCIdude (talk) 17:12, 1 August 2019 (UTC)
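
The relaxed check described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the taxonomy tables, rank names and the `shared_id_allowed` helper are hypothetical toy stand-ins, not an existing Wikidata constraint mechanism.

```python
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical toy taxonomy: child taxon -> parent taxon, and taxon -> rank.
PARENT: Dict[str, Optional[str]] = {
    "strain_A": "species_X",
    "strain_B": "species_X",
    "species_X": "genus_G",
    "species_Y": "genus_G",
    "genus_G": None,
}
RANK: Dict[str, str] = {
    "strain_A": "strain",
    "strain_B": "strain",
    "species_X": "species",
    "species_Y": "species",
    "genus_G": "genus",
}
# Ranks counted as "species or below" (an assumption for this sketch).
SPECIES_OR_BELOW: Set[str] = {"species", "subspecies", "strain"}


def lineage(taxon: str) -> List[str]:
    """Return the taxon followed by all of its ancestors, nearest first."""
    chain: List[str] = []
    current: Optional[str] = taxon
    while current is not None:
        chain.append(current)
        current = PARENT.get(current)
    return chain


def common_ancestor(a: str, b: str) -> Optional[str]:
    """Return the nearest taxon found in both lineages, or None."""
    ancestors_of_a = set(lineage(a))
    for node in lineage(b):
        if node in ancestors_of_a:
            return node
    return None


def shared_id_allowed(item_a: str, taxon_a: str,
                      item_b: str, taxon_b: str,
                      precursor_pairs: Set[Tuple[str, str]]) -> bool:
    """PASS if two items may carry the same UniProt ID under the relaxed rule."""
    # Rule 1: the two items are in a precursor relationship.
    if (item_a, item_b) in precursor_pairs or (item_b, item_a) in precursor_pairs:
        return True
    # Same taxon and no precursor link: treated as a genuine duplicate.
    if taxon_a == taxon_b:
        return False
    # Rule 2: different taxa whose common ancestor is at species rank or below.
    ancestor = common_ancestor(taxon_a, taxon_b)
    return ancestor is not None and RANK.get(ancestor) in SPECIES_OR_BELOW
```

With this toy data, two strain-level items under `species_X` pass, while items from `species_X` and `species_Y` fail because their common ancestor is a genus.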

In their 2019-Nov release UniProt has made changes that allow several statements to be made specifically on isoforms/fragments/cleavage products associated with a full UniProt "protein" entry sequence, see https://www.uniprot.org/news/2019/12/18/release.
Clearly, many UniProt entries must now be considered as containers of subentries with associated statements, creating the need for separate WD items, which can be recognized by being instances of isoform (Q609809), protein variant (Q77030030), or protein fragment (Q78782478). I expect Gene Ontology moving to annotation of subentries in the future, as well. --SCIdude (talk) 16:49, 4 February 2020 (UTC)

Problems with PDB and GOA from UniProt imports

As mentioned, UniProt/SwissProt entries contain all gene products from a gene, including cleavage products. That means PDB and GOA statements get lumped together at the UniProt entry, and so they are imported and dumped into the first item that carries the UniProt ID. So we had, for example, 3D structures and GOA statements about angiotensin-2 (the main peptide hormone) in the angiotensinogen item. This must be sorted out manually, but worse, it will be re-dumped with the next bot run. This is a general heads-up for the issue. The cause for the GOA facet is at GOA because apparently it's not a rule to specify which peptide is referred to, just the UniProt ID. The PDB mixing happens at UniProt. If you're a bot dev, I know you cannot handle this easily, so I suggest to STOP the import of these statements FOR UniProt entries that have curated fragments (there are 883 for human with keyword "Cleavage on pair of basic residues [KW-0165]" out of roughly 50k human entries, so the percentage is less than 2%). --SCIdude (talk) 09:35, 6 August 2019 (UTC)

@SCIdude: Thanks for digging into this issue and also suggesting a practical solution -- much appreciated! Please bear with us if it takes us a little bit to implement the proposed solution or if it takes a couple iterations to get it precisely right. I'm going to jot down some notes here just to be sure I have it right. You specifically mention Angiotensinogen (Q267200) and I see you've removed a bunch of PDB IDs that appear to correspond to the peptide hormone. (Should 5E2Q also be removed then?) The Uniprot ID is P01019, and in the "Names & Taxonomy" section I see that it is cleaved into 8 fragments. So in that case, you are proposing that the bot would not import PDB IDs. Can you tell us how you did your query to get the 883 proteins you mention? Also, I see you removed some of the GO annotations. Anything we should be aware of there? (EDIT: of course, it's the same issue of annotations on the peptide being transferred to the parent protein. Got it...) Best, Andrew Su (talk) 16:01, 6 August 2019 (UTC)
Yes I added 5E2Q to angiotensin II (Q412999) but forgot to remove the original. The UniProt query was "human AND keyword:KW-0165" but I just see that "OS:9606 AND keyword:KW-0165" gives 308 hits (297 reviewed)---my query apparently added a bunch of "human" virus proteins---so even fewer affected entries. As to GOA, I copied ALL GO annotations from Angiotensinogen (Q267200) to angiotensins (Q65963433), removed those on Angiotensinogen (Q267200) that referred specifically to one of the peptides, and removed those on angiotensins (Q65963433) that referred specifically to Angiotensinogen (Q267200). I'm the author of most GOA annotations of M.tuberculosis so I'm feeling somewhat confident in my work. Nevertheless, I refrained from further detailing the angiotensins (Q65963433) annotations to the specific peptides. --SCIdude (talk) 16:39, 6 August 2019 (UTC)
(tangent) I have something of a peptide/protein-centric view, I mean GO annotations are practically about those entities. They are what has function, and what localizes somewhere. So I think the annotations should associate with those WD items that represent those entities. --SCIdude (talk) 16:46, 6 August 2019 (UTC)
I just see that the KW-0165 keyword is not on AGT so I have the numbers wrong. You can't trust UniProt to put that keyword everywhere needed. --SCIdude (talk) 17:24, 6 August 2019 (UTC)
@SCIdude: On second look at this issue, I'm actually not sure what the best solution here is. Originally I thought that GO annotators were annotating the peptide (eg angiotensin II (Q412999)) and that the gene-centric nature of our bots (or the protein-centric nature of uniprot) was corrupting that somehow. But actually it looks like it's the annotators themselves that are associating the annotations with the parent protein (eg [7]). While I can see that the changes you've made are more correct biologically, the referencing then becomes wrong. Technically, the original bot edits are correct from a Wikidata standpoint because they accurately reflect a source's claim (like the statement that the earth is flat). The ideal long-term situation I think is to get the GO annotators to be more precise on exactly what they're annotating, which would obviate the need for our bot to make any judgement calls. Your thoughts (or others' thoughts)? Best, Andrew Su (talk) 05:07, 8 August 2019 (UTC)
GO annotators associate UniProt IDs with PMIDs. When I curated I didn't realize there were sub-IDs like P01019#PRO_0000032457 (I think UniProt prefers PRO_0000032457 as ID, from their help page). Certainly the annotation tool's input mask wouldn't have allowed that, but that was 8 years ago. Of course technically it's all correct, but then no database can be trusted 100%, so we need to edit the remaining 2%. And I have seen lots of plain wrong GO annotations in UniProt entries too, apart from all electronic annotations (IEA) and traceable author statements (TAS) that never get tested in the lab... --SCIdude (talk) 05:59, 8 August 2019 (UTC)
As to the referencing becoming wrong, why do you say that---the annotation still associates with the same UniProt ID, just a subobject of that ID. Can you please give an example? Regards --SCIdude (talk) 06:09, 8 August 2019 (UTC)
However, I think your bot is buggy when it places all annotations in one protein item when there are several that have the same UniProt ID. I'd suggest in that case use the gene. Maybe also the PDBs. What do you say? --SCIdude (talk) 16:04, 8 August 2019 (UTC)
Sorry, I wasn't as precise as I should have been regarding referencing. You added the UniProt identifier P01019 to the new item for angiotensins (Q65963433) (and added it to angiotensin II (Q412999)), which is the same as for Angiotensinogen (Q267200). I definitely see your rationale, but I worry a bit about putting that same identifier on those related-but-different concepts. In my mind, the uniprot identifier should _only_ apply to Angiotensinogen (Q267200). If we removed that identifier from angiotensins (Q65963433) (and added, for example, PRO_0000032457 in a new property), then the reference URL (P854) that points to the Uniprot entry doesn't make sense any more.
On the more general point of database accuracy, I absolutely agree that no DB can be trusted 100%. But under the idea that wikidata is a collection of claims and not a database of "truth", I believe that our bot should simply reflect what the GO annotators say as precisely as possible, right or wrong. For whatever that is wrong, I think the ideal scenario is to 1) notify the GO annotators so it can be fixed at the source, and 2) add a Wikidata statement but use a relevant scientific article in the reference. Your thoughts?
(and replying to your last comment, which came after I wrote the above) I think that zeroes in on the issue -- should there be several items that have the same UniProt ID? I think you say yes while my gut says no (but again can see the rationale). Other folks have opinions here? Best, Andrew Su (talk) 16:15, 8 August 2019 (UTC)


Specifying PTM Type

I'm interested in adding a qualifier on the type of post-translational modification (phosphorylation, N-linked glycosylation, sumoylation, etc.) when using Property:P1917. Does anyone have suggestions on the best property to use for such a qualifier? For example, the phosphorylation of alpha-synuclein at S129 is associated with Parkinson's disease. In this case, I would create a statement on the page for alpha synuclein (Q288591): Post-translational modification associated with (P1917), value: Parkinson's disease (Q11085). Qualifier: property (need suggestions for this), value: protein phosphorylation (Q7251493). Thanks -- Gtsulab (talk) 19:54, 6 August 2019 (UTC)

A PTM property on the main object could be modeled as "has part" with an amino acid position added. So why not add the disease as qualifier to the PTM? --SCIdude (talk) 04:28, 7 August 2019 (UTC)
Thanks for the suggestion. I think the property that I'm using posttranslational modification association with (P1917) is already intended to link the protein subject with a disease object, so it doesn't make sense to add the disease as a qualifier when it would already be the main value. I think I can use your suggestion for 'has part' in the qualifier for this statement to link 'protein phosphorylation'. I don't see a good way for including the amino acid residue modified, since adding it as a value for a property would require that it first be added as an entity, no? I don't fancy creating entities for all amino acid residues that are modified. Gtsulab (talk) 20:08, 7 August 2019 (UTC)
RefSeq has a way to specify "Site"s, e.g. https://www.ncbi.nlm.nih.gov/protein/NP_000930.1?from=87&to=87. You could use a qualifier on the 'has part'-'protein phosphorylation' with the property connects with (P2789) giving the link as value. That was the only property I found; is there a better one? --SCIdude (talk) 16:25, 8 August 2019 (UTC)
Ah there is also applies to part (P518).. no, both need items, sigh --SCIdude (talk) 16:29, 8 August 2019 (UTC)
Thank you so much for your helpful suggestions, and thanks for proposing a property that would solve this issue. I'll wait for the community to decide on the best way to handle this before proceeding any further. Gtsulab (talk) 20:23, 13 August 2019 (UTC)

GOA "P" (process) annotations on genes

Genes participate in biological processes so I was surprised when seeing the type constraint to proteins of biological process (P682). Maybe bots expect that annotations go into a WD object representing a gene together with its products? But if there exists a separate object for the gene then certainly biological process (P682) should be there as superset of all annotations of its peptide/protein products. So I'm adding "gene" to the type constraints. Does this make sense? --SCIdude (talk) 07:28, 7 August 2019 (UTC)

I suspect this is because the biological process (P682) was derived from Gene Ontology biological processes, in which the "Annotations represent the normal functions of gene products" according to the principles of GO annotations. That said, I think it would make sense to expand the constraints to genes if there are GO annotations for long non-coding RNA genes and microRNA genes (like NEAT1 (Q18054071) or MIR155 (Q17553105)). Gtsulab (talk) 19:45, 7 August 2019 (UTC)

specifying aa position in a protein

(see also Wikidata:Property_proposal/Natural_science#amino_acid_position,_amino_acid_start_position,_amino_acid_end_position)

There is no way at the moment to define the position and extent of anything that is a part of a protein. Generally, this could be one amino acid (aa), an aa chain, or a set of non-overlapping aa chains. The reason for this generality is to be able to apply this to all of e.g. position of a mutation, span of protein domains, or position of cleaved peptides spanning one or more subchains. One important definition would be the start offset, which usually is 1 here, i.e. the number given to the first aa in a protein. The model would then have the use cases:

1. simple position property, e.g. "amino acid position" expecting an integer>0
2. single span
   a. "amino acid start position" expecting an integer>0
   b. "amino acid end position" expecting an integer>0
3. set of spans, i.e. multiple pairs of start/end
  • Examples
Phenylalanine hydroxylase (Q420604):
    has part--->protein phosphorylation (Q7251493)
        amino acid position--->16
    has part--->ACT domain (Q24745293)
        amino acid start position--->36
        amino acid end position--->114
    protein variant associated with--->phenylketonuria (Q194041)
        amino acid position--->39

Of course, when the site is not on protein parts then it's all simple.

  • Challenge example: defining a disease related mutation on a subchain. NOTE that the mutation position is usually given as position on the prepro (here 48 on preproinsulin) but the prepro is not part of the peptide, so it makes more sense to give it as the position on the sub-chain:
insulin:
    posttranslational modification association with (P1917)--->type 2 diabetes mellitus
        amino acid position--->24
        applies to part (P518)--->insulin B chain

The semantic problem in the above is that the position is not really associated with the B chain, both are with the statement about insulin

insulin A chain:
    has part--->disulfide bond
        amino acid start position--->7
        connects with (P2789)--->insulin B chain
        amino acid end position--->7
(need not be reciprocally defined?)

The semantic problem here is that the end position is not really associated with the B chain, both are with the statement about the A chain. I'm not sure if these are real problems, or if they can be resolved at all.

So with three new properties we could define the coordinates of anything that is an aa in a protein, or an aa chain as part of a protein.
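
As a rough illustration of the three use cases above (single position, single span, set of non-overlapping spans), here is a minimal sketch. The `Span` class and helper names are hypothetical and only model the proposed values; they are not an existing Wikidata or bot API. Positions are assumed to be 1-based and inclusive, as in the proposal.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Span:
    """A stretch of amino acids, 1-based and inclusive at both ends."""
    start: int  # would map to "amino acid start position"
    end: int    # would map to "amino acid end position"

    def __post_init__(self) -> None:
        if self.start < 1:
            raise ValueError("positions must be integers > 0")
        if self.end < self.start:
            raise ValueError("end position must not precede start position")


def position(p: int) -> Span:
    """Use case 1: a single "amino acid position" is a span of length one."""
    return Span(p, p)


def validate_spans(spans: List[Span]) -> List[Span]:
    """Use case 3: return the spans sorted by start, rejecting any overlap."""
    ordered = sorted(spans, key=lambda s: s.start)
    for previous, current in zip(ordered, ordered[1:]):
        if current.start <= previous.end:
            raise ValueError(f"overlapping spans: {previous} and {current}")
    return ordered


# Values taken from the Phenylalanine hydroxylase (Q420604) example above.
phosphorylation_site = position(16)
act_domain = Span(36, 114)
parts = validate_spans([act_domain, phosphorylation_site])
```

Treating a single position as a span of length one keeps the validation logic in one place, although the proposal itself keeps "amino acid position" as a separate property.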

Comments? --SCIdude (talk) 09:02, 13 August 2019 (UTC) P.S. "protein variant associated with" does not exist yet either...

It seems the proposal has stalled. I think the best alternative now would be a proposal that is very broad, like "position in sequence" to also get good support. --SCIdude (talk) 08:23, 23 August 2019 (UTC)

TCDB import done / UniProt coverage

With the new TCDB property import, six thousand proteins now have such an ID which is an exact correspondence, i.e. the TCDB entries mostly describe a single protein in UniProt. There are however 9,400 more in TCDB where the UniProt entry is not in Wikidata, e.g. P23586 from Arabidopsis. A quick search shows there are only two(!) protein items from A. thaliana with UniProt. Are there priorities for import, at all? It looks like imports are "up for grabs" since there is lots from prokaryotes---which is good. --SCIdude (talk) 14:38, 6 September 2019 (UTC)

WD enzymatic activities are GO

Thumbs up to ProteinBoxBot.

Using the EC-->GO mapping: http://current.geneontology.org/ontology/external2go/ec2go (data version: 'releases/2019-07-01') I checked that all 5,162 enzyme-related GO function IDs have a WD item; each of them has the correct EC; 50 of them had several ECs; all checked, they are already in ec2go. This means the EC number statements of catalytic activity GO items are exactly as given in ec2go, i.e. they are canonical. --SCIdude (talk) 16:20, 11 September 2019 (UTC)
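
A check of this kind could be sketched as below. This is not the script that was actually used. It assumes the usual external2go line layout (`EC:<number> > GO:<label> ; GO:<id>`, with comment lines starting with `!`), and the `wikidata_ecs` dictionary is a hypothetical stand-in for EC numbers retrieved from the WD items. The sample lines only illustrate the layout and were not copied from the 2019-07-01 release.

```python
import re
from collections import defaultdict
from typing import Dict, Set, Tuple

# Assumed layout of one mapping line: "EC:1.1.1.1 > GO:some label ; GO:0004022"
LINE_RE = re.compile(
    r"^EC:(?P<ec>\S+)\s*>\s*GO:(?P<label>.*?)\s*;\s*(?P<go>GO:\d{7})\s*$"
)


def parse_ec2go(text: str) -> Dict[str, Set[str]]:
    """Map each GO ID to the set of EC numbers listed for it."""
    go_to_ecs: Dict[str, Set[str]] = defaultdict(set)
    for line in text.splitlines():
        if not line.strip() or line.startswith("!"):
            continue  # skip blank lines and "!" comment lines
        match = LINE_RE.match(line)
        if match:
            go_to_ecs[match.group("go")].add(match.group("ec"))
    return dict(go_to_ecs)


def find_mismatches(
    expected: Dict[str, Set[str]], found: Dict[str, Set[str]]
) -> Dict[str, Tuple[Set[str], Set[str]]]:
    """Return GO IDs whose EC set in `found` differs from `expected`."""
    return {
        go: (ecs, found.get(go, set()))
        for go, ecs in expected.items()
        if found.get(go, set()) != ecs
    }


# Illustrative sample in the assumed layout.
SAMPLE = """!comment line
EC:1.1.1.1 > GO:alcohol dehydrogenase (NAD+) activity ; GO:0004022
EC:1.1.1.2 > GO:alcohol dehydrogenase (NADP+) activity ; GO:0008106
"""

ec2go = parse_ec2go(SAMPLE)
# Hypothetical EC numbers as they might be read from the Wikidata items.
wikidata_ecs = {"GO:0004022": {"1.1.1.1"}, "GO:0008106": {"1.1.1.2"}}
mismatches = find_mismatches(ec2go, wikidata_ecs)
```

An empty `mismatches` dictionary corresponds to the "canonical" outcome reported above.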

EC is sparsely mapped in WD

Wikipedia articles, from which we get almost all enzyme family items, tend to favor terms that are widely used. Example: "oxidoreductase, acting on the CH-CH group of donors" is not one of them, although it is a main general enzyme family (EC:1.3). This naturally leads to a sparse mapping of the EC categorization (tree without leaf nodes):

├── 1 oxidoreductases (Q407479)
│   ├── 1 alcohol oxidoreductase (Q4713306)
│   └── 12 hydrogenase (Q424135)
├── 2 transferase (Q407355)
│   ├── 1 
│   │   ├── 1 methyltransferase (Q415875)
│   │   └── 4 amidinotransferase (Q68688747)
│   ├── 3 acyltransferases (Q2609152)
│   ├── 4 glycosyltransferases (Q67201373)
│   │   └── 1 hexosyltransferase (Q5749058)
│   ├── 6 
│   │   └── 1 transaminase (Q424288)
│   ├── 7 
│   │   └── 6 Diphosphotransferase (Q5279763)
│   └── 8 
│       ├── 2 sulfotransferase (Q175950)
│       └── 3 CoA-transferase (Q68689639)
├── 3 hydrolase (Q96286)
│   ├── 1 esterases (Q418750)
│   │   ├── 1 Carboxylesterase (Q409840)
│   │   ├── 3 phosphatase (Q422476)
│   │   ├── 4 phosphoric diester hydrolase (Q67202883)
│   │   └── 2 thioesterase (Q7784664)
│   ├── 2 glycosidase (Q13527914)
│   │   └── 1 Glycoside hydrolase superfamily (Q375795)
│   ├── 4 peptidase (Q212410)
│   │   ├── 22 cysteine protease (Q419343)
│   │   ├── 11 Aminopeptidase (Q419527)
│   │   ├── 21 serine endopeptidase (Q420032)
│   │   ├── 24 metalloendopeptidase (Q6822865)
│   │   ├── 17 Metalloexopeptidase (Q6822868)
│   │   └── 25 Threonine protease (Q7798075)
│   ├── 5 
│   │   ├── 1 Amidohydrolase (Q4746164)
│   │   └── 2 Amidohydrolase (Q4746164)
│   └── 6 
│       └── 4 helicase (Q138864)
├── 4 lyase (Q407727)
│   ├── 1 
│   │   └── 1 carboxy-lyases (Q417781)
│   └── 2 
│       └── 1 hydro-lyase (Q16915067)
├── 5 isomerase (Q118026)
│   └── 2 Cis-trans isomerase (Q5122112)
├── 6 ligase (Q410221)
└── 7 transport protein (Q2449730)

Fortunately the GO enzymatic activities are complete in WD, so we have two different hierarchies. Ninety percent of the 5,000 families have a link to a GO enzymatic activity through molecular function (P680). I'm considering linking back using has cause (P828) if the link is one-to-one (i.e. exact). This would enable automatic creation of instance of (P31)/subclass of (P279) statements for single enzymes if they have such an enzymatic activity, finally placing single enzymes in the above hierarchy instead of just instance of (P31)/subclass of (P279)-->enzyme (Q8047) or protein (Q8054). I'm just not clear which of instance of (P31)/subclass of (P279) to use for this. --SCIdude (talk) 07:26, 29 September 2019 (UTC)
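
The "one-to-one" condition for the back-link could be checked roughly as in the following Python sketch. It assumes the family-to-activity links have already been exported as pairs; the identifiers are placeholders rather than real Wikidata items, and `exact_links` is a hypothetical helper name.

```python
from collections import Counter


def exact_links(pairs):
    """Keep (family, activity) pairs where each side occurs in exactly one pair."""
    unique_pairs = set(pairs)  # ignore repeated statements of the same link
    family_count = Counter(family for family, _ in unique_pairs)
    activity_count = Counter(activity for _, activity in unique_pairs)
    return sorted(
        (family, activity)
        for family, activity in unique_pairs
        if family_count[family] == 1 and activity_count[activity] == 1
    )


# Placeholder identifiers, not real Wikidata items.
pairs = [
    ("family_A", "activity_1"),  # exact: one family, one activity
    ("family_B", "activity_2"),  # family_B links to two activities
    ("family_B", "activity_3"),
    ("family_C", "activity_4"),  # activity_4 is shared by two families
    ("family_D", "activity_4"),
]

# Candidate back-links (activity -> family), only for exact pairs.
back_links = {activity: family for family, activity in exact_links(pairs)}
```

Only `family_A`/`activity_1` survives here; the other links are ambiguous in one direction or the other and would need manual review before any back-link is added.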

Correction of Wikidata descriptions of Wikipedia protein articles

Hi, about 95% of the 2.4k protein articles on de.wp display the wd description 'gene', and I guess this is similar in en.wp. The description is displayed on Wikipedia articles in the desktop and mobile versions, and in de.wp we sporadically receive complaints. There has been a discussion on this issue here: Wikidata_talk:WikiProject_Chemistry#Proteins. Can you help, with a bot run, to correct the wd item linking wp protein articles in different languages from gene to protein? Or change the description? All the best, --Ghilt (talk) 14:05, 6 November 2019 (UTC)

@Ghilt: Of course the description of gene items should not be changed. What has to be done is to move the item's sitelinks to a different (=protein) item, or to one item that covers everything that is covered in the WP article. This can be automated (QS can move sitelinks), so I think we just need to decide if the target item should be the protein, or a newly created item that can potentially cover more than gene + protein. As you can see from the insulin (Q70598743) example and from looking at the enwp and dewp articles, they are about gene, protein, preproteins, protein subchains, protein complex, protein family and superfamily, and pharmaceutical too. --SCIdude (talk) 14:34, 6 November 2019 (UTC)
Hmm, out of "gene, protein, preproteins, protein subchains, protein complex, protein family and superfamily, and pharmaceutical" only the gene is not a protein. But I can live with the gene+protein solution. --Ghilt (talk) 15:55, 6 November 2019 (UTC)
The pharmaceutical could be a mixture too, and wikipedians will still put it all in one article, you never know. What next? Actually there are 1093 such gene items on WD that encode proteins and link to dewp. In none of the cases is there a second sitelink on the protein. If there is no objection I'll move them to the proteins. PS: our estimates of 1K vs. 2.4K are quite different, please use this SPARQL query to see my list:
SELECT ?g ?gLabel ?p ?pLabel WHERE {
   ?g wdt:P31 wd:Q7187 .                                   # item is a gene
   ?g wdt:P688 ?p .                                        # gene encodes protein ?p
   ?article schema:about ?g .                              # the gene item has a sitelink ...
   ?article schema:isPartOf <https://de.wikipedia.org/> .  # ... to de.wp
   SERVICE wikibase:label {
      bd:serviceParam wikibase:language "de" .
   }
}
PPS: to be exact I would move all sitelinks of all languages together if they all have no second sitelink on the protein. Also I would inspect the list before starting QS, there are genes with multiple protein fragments encoded. --SCIdude (talk) 16:31, 6 November 2019 (UTC)
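
The inspection step described in the PPS could look roughly like the Python sketch below. It is an illustration under assumptions, not the process that was run: it supposes the query result is available as (gene, protein) pairs together with a sitelink count per protein, all identifiers are placeholders, and `movable_genes` is a hypothetical helper name.

```python
from collections import defaultdict


def movable_genes(rows, protein_sitelinks):
    """Split genes into those safe to move and those needing manual review.

    rows: (gene, protein) pairs, e.g. from the SPARQL result.
    protein_sitelinks: protein -> number of sitelinks already on that item.
    """
    proteins_by_gene = defaultdict(set)
    for gene, protein in rows:
        proteins_by_gene[gene].add(protein)

    move, review = {}, {}
    for gene, proteins in sorted(proteins_by_gene.items()):
        if len(proteins) != 1:
            # Several encoded proteins or fragments: no unambiguous target.
            review[gene] = "multiple proteins"
            continue
        (protein,) = proteins
        if protein_sitelinks.get(protein, 0) > 0:
            review[gene] = "protein already has sitelinks"
        else:
            move[gene] = protein
    return move, review


# Placeholder identifiers, not real Wikidata items.
rows = [
    ("gene_1", "protein_1"),
    ("gene_2", "protein_2a"),
    ("gene_2", "protein_2b"),
    ("gene_3", "protein_3"),
]
sitelinks = {"protein_3": 2}
move, review = movable_genes(rows, sitelinks)
```

Only the `move` mapping would be turned into QS commands; everything in `review` stays for inspection by hand.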
For some historical context on the mapping of Wikipedia pages to Wikidata, you can find the previous discussions here. Note that there was also more extensive discussion on Wikipedia about how the Wikipedia pages should be created and what should go in them. Gtsulab (talk) 18:58, 6 November 2019 (UTC)

@Ghilt: There were actually only 828 candidates; the process is running now and is documented here. So <5% of gene items with sitelinks are affected, if I'm counting correctly; enwiki has much greater coverage than dewiki. --SCIdude (talk) 16:08, 7 November 2019 (UTC)

What I forgot to mention: Insulin is somewhat of an exception, since its enwp article actually contains information on the gene, whereas by far most enwp protein articles don't. But hmm, where does the discrepancy of 828 via SPARQL vs. 2.4K via PetScan come from? By the way, the difference in coverage between enwp and dewp is that most protein articles in enwp were made by ProteinBoxBot, which copied the text from NCBI. There is no German-language protein database. But any change away from 'gene' towards 'protein' would be great. --Ghilt (talk) 18:23, 7 November 2019 (UTC)
Thanks for correcting! The discrepancy stems from the fact that a majority of protein articles don't have a description on wd. --Ghilt (talk) 21:42, 7 November 2019 (UTC)

Manuscript: Wikidata as a FAIR knowledge graph for the life sciences

Dear all: You may have seen that we recently published a preprint entitled "Wikidata as a FAIR knowledge graph for the life sciences". This manuscript was primarily spearheaded by the Gene Wiki team, which has been active in data modeling and data ingestion for a variety of biomedical resources.

Our goal was to write a manuscript that educated the general biological community about Wikidata and to drive more growth and participation. To do this, we selected and described a series of scientific vignettes -- identifier translation, integrative biomedical SPARQL queries, crowdsourced curation, Wikidata-backed application development, and phenotype-based disease diagnosis. Those vignettes were based on our own areas of interest as well as our guess at what would appeal to our target audience.

Of course, there are many possible vignettes that could fit under the broad title we chose. As a matter of practicality, we could not include them all while still creating a final product of reasonable length and focus.

However, upon further reflection and discussion with colleagues, we realized that while the selection of vignettes needed to be somewhat limited, the manuscript should reflect a more complete and inclusive representation of the people behind the larger movement, including those that worked on aspects that weren't directly highlighted as vignettes. Therefore, we'd like to invite anyone to add their name to the author list or acknowledgements by adding their name to Wikidata:WikiProject Molecular biology/FAIR_knowledge_graph. Note that due to journal policies, all authors must still meet the ICMJE standards, but interpreted according to the broadly-defined title of the manuscript. (That broader scope might also be summarized by the class-level diagram shown at right, which is included as Figure 1 in the manuscript.)

Finally, this message is being cross-posted to many places. We will monitor replies at Wikidata talk:WikiProject Molecular biology, or please {{Ping}} me to notify me of replies or discussion elsewhere. Best, Andrew Su (talk) 22:41, 18 December 2019 (UTC)

canonical databases

Please note the addition of Wikidata:WikiProject_Molecular_biology/Properties#Main_classes_and_their_canonical_database. --SCIdude (talk) 17:05, 18 January 2020 (UTC)

Wikidata at the Bioinformatics Community Conference 2020?

What do people think about going to the community-guided conference day at the Bioinformatics Community Conference?
I've drafted a suggestion for some Wikidata activities in the communal conference planning document.
Would others be interested in attending? Could be an ideal outreach opportunity to a community with closely aligned goals and interests. T.Shafee(evo&evo) (talk) 12:06, 11 May 2020 (UTC)
