Wikidata:Requests for permissions/Bot/ProteinBoxBot
The following discussion is closed. Please do not modify it. Subsequent comments should be made in a new section. A summary of the conclusions reached follows.
- Approved--Ymblanter (talk) 15:20, 28 July 2013 (UTC)[reply]
ProteinBoxBot (talk • contribs • new items • new lexemes • SUL • Block log • User rights log • User rights • xtools)
Operator:Chinmay26 (talk) 03:14, 19 July 2013 (UTC) Chinmay26 (talk • contribs • logs)[reply]
Task/s: Populate gene Wikidata items with gene properties. Runs are here. This bot is an update to the existing ProteinBoxBot.
Function details: There are more than 10,000 human protein templates, which are maintained by pygenewiki (the current bot). The bot will create gene Wikidata items and populate them with gene/protein-specific properties. Following discussions in the molecular biology community, each Wikipedia page will now be sourced from four separate Wikidata items: human protein, human gene, mouse gene, and mouse protein. The entire design of the new items and their properties is here. I am using the Pywikipedia rewrite branch, and the project is hosted on Bitbucket. --Chinmay26 (talk) 03:14, 19 July 2013 (UTC)[reply]
Two things:
- Does it make sense to use P89 (P89) rather than taxon? I suppose the taxon will usually be a species, but that does not sound like a logical necessity, and the property does not serve the same purpose as in Brassica oleracea var. botrytis (Q7537). If you agree with that and are in a hurry, I suppose we can create a new property without going through the whole proposal process.
- A much broader (and hackneyed, sorry) issue, but it appears clearly here: if human reelin is a subclass of protein, and enzyme is a subclass of protein, is there a way to know that "human reelin" is a protein in a different sense than "enzyme" is? --Zolo (talk) 18:43, 19 July 2013 (UTC)[reply]
- Thanks for the feedback! Your points in order:
- Can you clarify why you think P89 (P89) is serving a different purpose in Brassica oleracea var. botrytis (Q7537) and RELN (Q414043)? As near as I can tell, in both cases P89 (P89) links to items that in turn have properties for taxon name (P225), P74 (P74), taxon rank (P105), etc.
- Great point. On second look, I think Q13569356 should be an instance of (P31) protein (Q8054), not a subclass of (P279) it. Would that be more correct?
- Thanks! Cheers, Andrew Su (talk) 20:04, 19 July 2013 (UTC)[reply]
- The Wikidata items this bot is concerned with represent classes of things, not a particular, concrete object, so subclass of (P279) and not instance of (P31) is the correct membership property to use here. The scenario here is analogous to the quark example in Help:Basic_membership_properties -- Wikidata items about genes and proteins speak about their subjects as a class of things (in which case P279 is appropriate), not, say, a particular molecule of reelin in a specific mouse's hippocampus (in which case P31 would be appropriate). Classifying genes and proteins with P279 is also consistent with the convention used in OBO ontologies, which use rdfs:subClassOf (the basis of P279) to represent genes and proteins. (See e.g. http://owl.cs.manchester.ac.uk/goal/ontologies/MouseGOAL.tar.gz, discussed in http://www.ncbi.nlm.nih.gov/pubmed/22541594.)
- To Zolo's specific concern, I would note that enzyme (Q8047) is not a subclass of protein, because not all enzymes are proteins. Enzymes can be RNA molecules, too.
- But that just moves the goalpost: it's still worthwhile to consider how we can indicate that a specific kind of protein is, say, a biomolecule in a different sense than an enzyme is a biomolecule. This issue is a matter of semantic distinction independent of whether we call these things instances or classes. We can distinguish enzymes from proteins by keeping them in separate branches of the term hierarchy. For example, proteins and enzymes can both be considered biomolecules (or whatever we decide), but proteins are biomolecules that are defined by being sequences of amino acids, whereas enzymes are biomolecules that are defined strictly by having a catalytic activity. This distinction is discussed in more detail at Property_talk:P591#Distinguishing_enzymes_and_gene_products.
- WT:MBTF would probably be a better venue for this discussion, if folks feel like exploring it in depth. I don't think it should hold up the nomination of ProteinBoxBot. Emw (talk) 23:15, 19 July 2013 (UTC)[reply]
- I picked enzyme because it was the first thing I could think of, but the same would probably apply to things like DNA-binding protein (Q2252764). I agree that reelin seems more like a subclass than an instance of protein, and that if the bot task is urgent it can be approved before the issue is resolved. --Zolo (talk) 06:28, 20 July 2013 (UTC)[reply]
- Keeping the "species" claims as such is probably fine. I was concerned that a new "taxon" (with values that are items) would be better if anyone ever wants to work with ProteinBoxBot to classify microbial genes and proteins, but it seems that the infoboxes for viral proteins like env point to resources like http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=11676&rn=1, which, following the link for the organism to http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=11676&lvl=3&lin=f&keep=1&srchmode=1&unlock, assigns the microbial organism a taxonomic rank of "species". Entrez Gene also assigns records to a species taxon, e.g. http://www.ncbi.nlm.nih.gov/gene/155971.
- So in summary, keeping the "species" claims for genes and proteins in place is in line with reliable sources. Emw (talk) 23:44, 19 July 2013 (UTC)[reply]
- In Brassica oleracea var. botrytis (Q7537), "species" is essentially a subproperty of subclass of (P279): cauliflower is a type of Brassica oleracea. In Reelin, "species" means something different, like "is to be found in humans". Now, you may argue that the P89 (P89) of cauliflower is badly redundant with parent taxon (P171) and I would agree, but that is a different issue: currently the main use of the property is the one that is made in Brassica oleracea var. botrytis (Q7537), not in reelin. --Zolo (talk) 06:28, 20 July 2013 (UTC)[reply]
- I see. Thanks for clarifying -- I agree this is a problem that needs to be solved. In brief, claims involving P89 (P89) in items about taxa treat "species" as a subproperty of subclass of (P279) as you say, but claims involving P89 (P89) in items about genes and proteins treat "species" as a subproperty of part of (P361). This seems like it would cause problems with semantic reasoners that work with items about taxa and items about genes/proteins, which is entirely foreseeable.
- Per your comment, I think a new property "found in taxon" might solve this problem. The property would explicitly note that it is a (distant) subproperty of P361 and should not take on the semantics of P279. Thoughts? Emw (talk) 12:31, 20 July 2013 (UTC)[reply]
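The reasoner problem described above can be illustrated with a minimal sketch. It assumes a toy reasoner that walks claims transitively and treats "species" (P89) as if it carried subclass of (P279) semantics; the item labels, the claim list, and the `superclasses` helper are all hypothetical and stand in for real Wikidata items and tooling.

```python
# Toy claims as (subject, property, object) triples.
# Labels are used instead of Q-ids for readability; this is illustrative data only.
claims = [
    ("cauliflower", "P89", "Brassica oleracea"),   # species used like "subclass of"
    ("reelin", "P89", "Homo sapiens"),             # species used like "found in"
    ("reelin", "P279", "protein"),
]


def superclasses(item, claims, subclass_props):
    """Collect every class reachable from `item` via the given properties."""
    found = set()
    frontier = [item]
    while frontier:
        current = frontier.pop()
        for subject, prop, obj in claims:
            if subject == current and prop in subclass_props and obj not in found:
                found.add(obj)
                frontier.append(obj)
    return found


# A reasoner that lumps P89 together with P279 concludes reelin is a kind of Homo sapiens.
naive = superclasses("reelin", claims, {"P279", "P89"})

# Keeping the "found in" meaning out of the subclass hierarchy avoids that inference.
strict = superclasses("reelin", claims, {"P279"})
```

With the overloaded property, `naive` contains "Homo sapiens" alongside "protein"; restricting the walk to P279 leaves only "protein", which is the effect a separate "found in taxon" property would have.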
- Whew, glad we have a couple of data modeling heavyweights here to sort this out. I'm out of my league here, so suffice it to say that we'll follow whatever consensus emerges here. (And to echo Emw's previous comment, perhaps it's worth moving the discussion to WT:MBTF.) Cheers, Andrew Su 00:08, 21 July 2013 (UTC)[reply]
Comments from Emw: I've looked through ProteinBoxBot's test edits and compared them with the data model worked out at WT:MBTF. Below are some things I think would help to do before scaling up this bot's activity:
- For a few proteins, have ProteinBoxBot (PBB) create the additional 3 items expected per Gene Wiki article -- human gene, mouse gene, mouse protein -- and fill them in with the expected properties. The test edits only show how the bot works for human proteins.
- On protein items, have PBB link to EC enzyme classification (P660) instead of EC enzyme number (P591). Per Property_talk:P591#Distinguishing_enzymes_and_gene_products, "EC number" (P591) should be used on enzyme items, and "EC classification" (P660) should be used on protein items. This will require PBB to have some way to 1) link the "EC number" in the Wikipedia infobox to the enzyme's "EC accepted name" (e.g. EC number "3.4.21" -> EC accepted name "serine endopeptidase"), then 2) get the Wikidata ID of the enzyme item for the EC accepted name from step 1 (e.g. EC accepted name "serine endopeptidase" -> Wikidata ID Q420032).
- For step 1, you should be able to map EC number to EC accepted name by using the lists in the table at http://www.chem.qmul.ac.uk/iubmb/enzyme/#recommend. For step 2, you could look up the ID of the item with that EC accepted name using the Wikidata API's wbsearchentities module, e.g. http://www.wikidata.org/w/api.php?action=wbsearchentities&search=serine%20endopeptidase&language=en&format=jsonfm. If no Wikidata item has the EC accepted name as a label or alias, then that enzyme item would need to be created.
- Have PBB make labels and descriptions of gene and protein items match the pattern described at WT:MBTF#Gene_and_protein_labels_and_descriptions. I've modified the reelin items as examples: human protein reelin (Q13569356), human gene RELN (Q414043), mouse gene Reln (Q13567973), mouse protein Reelin (Q14155909). For human proteins (the only gene/protein items with site links) this would mean adding/changing their description, since their label already matches the convention.
Also, to my understanding it's convention to make test edits using the bot account, not the bot operator account. Let me know if I can clarify anything! Emw (talk) 13:35, 21 July 2013 (UTC)[reply]
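The two-step EC lookup suggested above could be sketched roughly as follows. This is not ProteinBoxBot's code: the `EC_ACCEPTED_NAMES` table is a hypothetical one-entry excerpt (a real table would be built from the IUBMB lists), the helper names are made up for illustration, and the response fields read here (`search`, `id`, `label`, `aliases`) are an assumption about the wbsearchentities result shape that should be checked against the API documentation.

```python
from urllib.parse import urlencode, quote

# Same api.php endpoint as in the example URL above, here over https.
API_URL = "https://www.wikidata.org/w/api.php"

# Hypothetical excerpt of an EC number -> EC accepted name table (step 1).
EC_ACCEPTED_NAMES = {"3.4.21": "serine endopeptidase"}


def build_search_url(name, language="en"):
    """Build a wbsearchentities request URL for an EC accepted name (step 2)."""
    params = {
        "action": "wbsearchentities",
        "search": name,
        "language": language,
        "format": "json",
    }
    # quote_via=quote encodes spaces as %20, matching the URL style used above.
    return API_URL + "?" + urlencode(params, quote_via=quote)


def pick_exact_match(response, name):
    """Return the ID of the first hit whose label or alias equals `name`, else None.

    `response` is the decoded JSON of a wbsearchentities call. A None result
    means the enzyme item may need to be created.
    """
    wanted = name.casefold()
    for hit in response.get("search", []):
        candidates = [hit.get("label", "")] + list(hit.get("aliases", []))
        if any(candidate.casefold() == wanted for candidate in candidates):
            return hit["id"]
    return None


# Hand-written sample response for illustration; not fetched from the live API.
sample_response = {"search": [{"id": "Q420032", "label": "serine endopeptidase"}]}
name = EC_ACCEPTED_NAMES["3.4.21"]
url = build_search_url(name)
item_id = pick_exact_match(sample_response, name)
```

Requiring an exact label or alias match, rather than taking the first search hit, is a deliberate choice here: the search is prefix-based, so the first hit is not guaranteed to be the enzyme item itself.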
- Thanks Zolo and Emw for the feedback. I am currently working on creating new human gene, mouse gene, and mouse protein items for the first few "human protein templates". I will follow the model of the "Reelin item" and add the necessary description field, EC classification, etc. as well. Regarding the "subclass of" property, should we go ahead with the current model? Chinmay26 (talk) 16:40, 21 July 2013 (UTC)[reply]
- Hi Chinmay, from my perspective, using the current model (all genes noted as subclass of 'gene', all proteins subclass of 'protein') makes sense. Emw (talk) 12:07, 22 July 2013 (UTC)[reply]
- I have created found in taxon (P703). I think it is ok to use subclass. If we need to adjust things later on, it should still be possible to differentiate proteins from groups of proteins based on the properties used in the items. --Zolo (talk) 20:17, 22 July 2013 (UTC)[reply]
- What is the actual situation with the bot? Have the objections been answered? Are we moving to approval? --Ymblanter (talk) 15:25, 26 July 2013 (UTC)[reply]
- Currently, the bot runs are for human protein items only. As Emw and Zolo suggested, I will extend the bot's functionality to include mouse protein, mouse gene, and human gene items. I am working on it (it will be done in two more days), and I will show test runs here under the bot account. Chinmay26 (talk) 12:25, 27 July 2013 (UTC)[reply]
- Ok, the bot will be approved in 24h provided there have been no objections raised.--Ymblanter (talk) 13:56, 27 July 2013 (UTC)[reply]