Wikidata talk:WikiProject Taxonomy/Archive/2015/10

This page is an archive. Please do not modify it. Use the current page, even to continue an old discussion.

OpenTree IDs

Hi - would it be useful to add a field for taxon IDs used by the recently published Open Tree of Life (http://tree.opentreeoflife.org/)? They have a downloadable list of IDs and corresponding NCBI / etc ids at http://files.opentreeoflife.org/ott/ HYanWong (talk) 11:43, 2 October 2015 (UTC)

I don't know. What is this site? I see no explanation, nor any detail? - Brya (talk) 16:25, 2 October 2015 (UTC)
There is a paper. And I found Proposal for OpenTree node stability. So maybe it's too early. --Succu (talk) 16:47, 2 October 2015 (UTC)
Yes. This does not tell me much: it is a composite Tree of Life, made up from many other trees. There seems to be an unending stream of these projects, it gets a little tiresome. Maybe, when it has stabilized, it will be possible to find out what exactly it is ... - Brya (talk) 16:56, 2 October 2015 (UTC)
AFAIK this is the main effort to construct a proper phylogenetic tree of life (declaration of interest: I've been involved a little on the sidelines). It hit the news last week, but has been in production for a few years now. Brya: out of interest, which other projects are you referring to? The only one I know is the recent (2015) Hedges paper, but that only has 50,000 nodes. All the other projects I know of (GBIF et al) are taxonomies, not phylogenies - a few are listed at GlobalNames (http://resolver.globalnames.org/data_sources). As for node number stability, the numbers for known and labelled nodes (those that appear in the "Open Tree Taxonomy") are essentially stable, although there are (necessarily) unlabelled nodes with numbers that change depending on the current 'best guess' synthetic phylogeny. HYanWong (talk) 17:06, 2 October 2015 (UTC)
I should say that I'm quite willing to write a bot that will automatically add Open Tree IDs (the stable type, not the unstable ones) to wikidata taxon entries. But I wouldn't want to do it if it was generally considered a bad idea. HYanWong (talk) 17:33, 2 October 2015 (UTC)
Adding the IDs with my bot should not be very problematic. But what will we gain from these IDs? Phylogenetic trees? I had a look at the Cactaceae subtree. It is incomplete and outdated, with silly references (compare A farewell to dated ideas and concepts: molecular phylogenetics and a revised suprageneric classification of the family Cactaceae (Q13521346)). And I'm not very comfortable with the selected taxonomic sources. IF is good, but GBIF? --Succu (talk) 18:10, 2 October 2015 (UTC)
The thing is that the IDs remain stable as the tree is (continually) updated. So once the source you cite is made available in Dryad or Treebase, the taxonomic change should become (semi) automatically incorporated into the OpenTree, but either way, any OpenTree ID references in Wikidata will remain valid. I see what you mean about GBIF, but most taxa have multiple sources (NCBI etc), and the taxonomy culled from (e.g.) GBIF is only a stand-in until phylogenetic studies become entered. Re: Index Fungorum - I see Hillis is involved in the fungal OpenTree, so I'm assuming that is a good sign. HYanWong (talk) 19:40, 2 October 2015 (UTC)
Is NCBI a taxonomic source? Another constructed taxonomy is Vertebrate Taxonomy Ontology. Should we incorporate this one too? Does the Open Tree of Life integrate standards like APG III system (Q156982) or proposals like A higher level classification of all living organisms (Q19858624)? --Succu (talk) 20:12, 2 October 2015 (UTC)
Working from Succu's Cactaceae example I see a little more. On a positive note I do see references to contemporary research papers; in itself it is worth importing this material (titles of papers, authors, links, etc). - Brya (talk) 05:16, 3 October 2015 (UTC)
So there is a distinction here between taxonomies and phylogenies. I see that there are a very large number of online projects providing taxonomies (aka 'classifications') of various sorts (as in your example of the Vertebrate Taxonomy Ontology). But I can't see many that provide attempts at a full evolutionary tree (aka phylogeny), which is why I think the OpenTree may be a useful additional link to provide. The simplest example of the distinction I'm talking about is seen in genera: a phylogeny should (ideally) show the branching order within genera, as in https://tree.opentreeoflife.org/opentree/opentree3.0@777553/Presbytis. In the Presbytis case there are 3 separate literature refs which are used to resolve this node below the species level. I presume this sort of resolution is not provided by most other sources. Of course, most genera are currently missing this sort of species-level resolution in the OpenTree (including the various genera of Cactaceae), so some sort of taxonomy is currently being used in the Open Tree to provide a framework. But as phylogenetic studies are added to open source databases, the idea is that these taxonomic constructs should be replaced with full phylogenies. I can't quite work out from what has been said above whether there is consensus to add OpenTree IDs to wikidata, though? HYanWong (talk) 09:50, 3 October 2015 (UTC)
p.s. the other thing to point out is that the open source nature of the Open Tree and the data requirements used to produce it (e.g. only using phylogenies whose data has been made available under CC-0) presumably fits well with the ethos of wikidata/pedia. But that probably applies to the majority of other ontology / taxonomy sources too, so it's not a very strong reason. HYanWong (talk) 09:57, 3 October 2015 (UTC)
I see nothing like consensus, in any direction. I have become convinced that there is some useful content there, but remain unclear how much (beyond literature cited) and where. It would not hurt (extra) to add OpenTree IDs (it is not worse than GBIF and EoL), but how much would it add? I would prefer a selection. - Brya (talk) 14:20, 3 October 2015 (UTC)
Here you'll find a literature list. At the moment only 484 are used to build the phylo tree. APGIII is not used. --Succu (talk) 16:10, 3 October 2015 (UTC)
I think APGIII is a classification, not a phylogeny, isn't it? So it probably is incorporated somehow, even though it isn't in the list of phylogenetic studies. But it's not my project, so I'd need to ask to be sure HYanWong (talk) 19:18, 3 October 2015 (UTC)
So read yourself: An update of the Angiosperm Phylogeny Group classification for the orders and families of flowering plants: APG III (Q13683362) --Succu (talk) 19:28, 3 October 2015 (UTC)
Thanks. So yes, APGIII is (as I thought) a phylogenetic classification, not a true phylogeny. Hence it is not in the OpenTree literature list, although it (may) be used as a taxonomy. NB to clarify, I meant that OpenTree is not my personal project: I didn't mean APGIII, of course! HYanWong (talk) 09:04, 4 October 2015 (UTC)
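
For the bot idea raised above, here is a rough sketch of how OTT IDs might be matched to existing items through a shared NCBI ID. It assumes the downloadable taxonomy is a text file whose columns are separated by `|` and include a source column of the form `ncbi:123,gbif:456`. That layout, the column order, the function names and the sample IDs are all assumptions made for illustration, not taken from the OTT documentation.

```python
def parse_sources(sourceinfo):
    """Turn 'ncbi:2002,gbif:3003' into {'ncbi': '2002', 'gbif': '3003'}."""
    sources = {}
    for part in sourceinfo.split(","):
        prefix, sep, ident = part.strip().partition(":")
        if sep and ident:
            # Keep the first ID seen for each source prefix.
            sources.setdefault(prefix, ident)
    return sources


def ncbi_to_ott(lines):
    """Map NCBI IDs to OTT IDs for every row that cites an NCBI source."""
    mapping = {}
    for line in lines:
        fields = [f.strip() for f in line.split("|")]
        # Assumed column order: uid, parent_uid, name, rank, sourceinfo, ...
        if len(fields) < 5 or not fields[0].isdigit():
            continue  # skip the header and malformed rows
        ncbi = parse_sources(fields[4]).get("ncbi")
        if ncbi:
            mapping[ncbi] = fields[0]
    return mapping


# Hypothetical rows; the IDs and names are made up.
sample_lines = [
    "uid | parent_uid | name | rank | sourceinfo | uniqname | flags |",
    "1001 | 1000 | Examplea | genus | ncbi:2002,gbif:3003 | | |",
    "1002 | 1001 | Examplea minor | species | gbif:3004 | | |",
]
ott_by_ncbi = ncbi_to_ott(sample_lines)
```

A bot could then look up each item's existing NCBI ID in such a mapping and propose the matching OTT ID, leaving rows without an NCBI source for manual review.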

@HYanWong: "This taxon is in our taxonomy but not in our tree synthesis database. This can happen for a variety of reasons, but the most probable is that it is flagged as incertae sedis." This is Collema pustulatum subsp. heterosporum from IF. Why is this subspecies marked as incertae_sedis and "no rank - terminal" in your dataset (V2.8)? The family Triperidiaceae is a nom. inval. but has OTT 5347948. --Succu (talk) 17:25, 3 October 2015 (UTC)

FWIW, here's the reply I got from the mailing list (https://groups.google.com/forum/#!topic/opentreeoflife):

The index fungorum dump that we used for making OTT 2.8 (the taxonomy used in the synthetic tree) is missing parent taxon information for many taxa; this is one of them. So it becomes an incertae sedis child of 'life'. Similarly it contains no rank information for this taxon. Here's all the information from IF that we had at the time:
438428 | | Collema pustulatum subsp. heterosporum | no rank - terminal | | | |
It appears we now have a parent pointer for this one (thanks to further work by Peter Midford); it's given as Collema, the genus. The new parent pointer is not in the latest taxonomy draft, but I'll get it into the next one. But even so, it will show up as major_rank_conflict_inherited, because our IF dump has an Incertae Sedis in the lineage, and so the taxon won't be available to synthesis.
Working with taxonomic sources has proven challenging. We have to do a lot of reverse engineering and data cleaning. In this case maybe we should be looking at the name string, noticing the 'subsp.', and assigning rank subspecies. Ranks don't get a whole lot of attention in this project, so this didn't get noticed. And perhaps we should notice that, being a subspecies, we should look for a species that it can be a subspecies of, even when the taxonomic source doesn't tell us that there is a connection. I can't say now whether that's a good idea.
Our IF dump is from early 2014. We're investigating how to get a more recent one.

Not sure that addresses any concerns here, but there we go HYanWong (talk) 09:01, 4 October 2015 (UTC)
Ah, the databaser's disease of "if it has an entry in the database, it must be a taxon" ... This is why CoL is so horrible, and The Plant List has so many no-go areas. This is also why problems get bigger and bigger. I foresee the day when databases will contain more out-and-out fiction than actual information. - Brya (talk) 10:13, 4 October 2015 (UTC)
Yes, I agree this is a general problem. But in this case, at least these taxa are being left out of the OpenTree (they are in the list of taxa for which there is information, but are omitted from the synthesised tree). So I guess they are doing the right general thing, although it results in taxa like Collema pustulatum subsp. heterosporum being omitted for the time being. That doesn't mean there aren't other areas of concern, though, of course. HYanWong (talk) 15:08, 4 October 2015 (UTC)
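
The heuristic floated in the mailing-list reply (look at the name string, notice the 'subsp.', assign a rank and look for a candidate species) can be sketched roughly as below. This is only an illustration: the column meanings of the pipe-separated row are assumed, the marker table is hypothetical apart from 'subsp.', and the reply itself leaves open whether this is a good idea.

```python
# Hypothetical marker table; only 'subsp.' is mentioned in the reply above.
INFRASPECIFIC_MARKERS = {"subsp.": "subspecies", "var.": "variety"}


def parse_row(line):
    """Split a pipe-separated row; the column meanings are assumed."""
    fields = [f.strip() for f in line.split("|")]
    return {
        "id": fields[0],
        "parent": fields[1] or None,  # empty column means no parent pointer
        "name": fields[2],
        "rank": fields[3],
    }


def infer_rank_and_parent(name):
    """Return (rank, candidate species name), or (None, None) if no marker is found."""
    tokens = name.split()
    for i, token in enumerate(tokens):
        # A marker only counts if a binomial precedes it and an epithet follows it.
        if token in INFRASPECIFIC_MARKERS and i >= 2 and i + 1 < len(tokens):
            return INFRASPECIFIC_MARKERS[token], " ".join(tokens[:2])
    return None, None


row = parse_row("438428 | | Collema pustulatum subsp. heterosporum | no rank - terminal | | | |")
rank, species = infer_rank_and_parent(row["name"])
```

The candidate species name would still have to be looked up in the taxonomy before any parent link is made, since the source itself does not state that connection.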

Darwin Core dump

Out of interest, has anyone tried to make a dump in DwC format of all the WikiData nodes with a Taxon Name? Is it easy to download a list of all such nodes using the API? I'm not sure how many there are, and if basically every taxon with a wikipedia taxon box has an equivalent WikiData node. HYanWong (talk) 09:17, 4 October 2015 (UTC)

At the moment we have 1951479 items with taxon name (P225). 3137 items have no parent taxon (P171). 43877 items are tagged as taxa but are still missing taxon name (P225). You can use the Wikidata:SPARQL query service to create lists with more information. I think it's too early to create a DwCA. --Succu (talk) 09:42, 4 October 2015 (UTC)
Great, thanks a lot. HYanWong (talk) 15:09, 4 October 2015 (UTC)
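
A minimal sketch of the approach suggested above, assuming the public query service endpoint at https://query.wikidata.org/sparql and a hypothetical mapping onto three Darwin Core terms (taxonID, scientificName, parentNameUsageID). The function names, the User-Agent string and the sample result are illustrative only, and the network call is defined but not executed here.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://query.wikidata.org/sparql"  # public Wikidata query service


def build_query(limit=100):
    """SPARQL for items with a taxon name (P225) and, if present, a parent taxon (P171)."""
    return (
        "SELECT ?item ?name ?parent WHERE { "
        "?item wdt:P225 ?name . "
        "OPTIONAL { ?item wdt:P171 ?parent . } "
        "} LIMIT %d" % int(limit)
    )


def fetch(query):
    """Run the query and return the decoded JSON result (not called in this sketch)."""
    url = ENDPOINT + "?" + urllib.parse.urlencode({"query": query, "format": "json"})
    request = urllib.request.Request(url, headers={"User-Agent": "dwc-sketch/0.1 (example)"})
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def qid(uri):
    """Reduce an entity URI to its bare Q-number."""
    return uri.rsplit("/", 1)[-1]


def to_dwc_rows(result):
    """Convert SPARQL JSON bindings into dicts keyed by Darwin Core term names."""
    rows = []
    for binding in result["results"]["bindings"]:
        rows.append({
            "taxonID": qid(binding["item"]["value"]),
            "scientificName": binding["name"]["value"],
            "parentNameUsageID": qid(binding["parent"]["value"]) if "parent" in binding else "",
        })
    return rows


# Hand-written sample in the SPARQL JSON result shape, so no network is needed.
sample_result = {"results": {"bindings": [
    {"item": {"value": "http://www.wikidata.org/entity/Q19686517"},
     "name": {"value": "Puto"}},
]}}
dwc_rows = to_dwc_rows(sample_result)
```

With close to two million items carrying P225, a single unbounded query would probably time out, so a full dump would likely need paging or a different data source.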

A few items that need looking at

I've come across a few things that look like they need some attention, so it would help if someone with more experience with taxonomy stuff could have a look at them:

  1. Luzonobasis glauca (Q2346341) - There are two names used by the sitelinks. I'm not sure if the item should be split, or if it's just an alternate name (if so, surely it should be an alias).
  2. Macrognathus fasciatus (Q20798054) - The great big blob of text in the description says it's a species, but it has no statements at all.
  3. puto (Q3543297) - This is an item about a type of steamed rice cake, but it looks like it's being used as a parent taxon too.

- Nikki (talk) 15:45, 8 October 2015 (UTC)

About puto (Q3543297): it is used as a parent taxon by mistake. The correct item is the genus Puto (Q19686517). --Termininja (talk) 20:11, 8 October 2015 (UTC)
Did #2. --Succu (talk) 21:20, 8 October 2015 (UTC)
Thank you for reporting this.
  1. Number #1 is basically all right, just incomplete. It has proved to work best if each name has its own item, while the sitelinks are kept together in one of the items. So in this case an extra item should be created and the two items connected by statements stating a relationship; these offer the opportunity to put in references for that relationship (these relationships tend to be dynamic). Existing statements should be checked to see if they are in the right item (the IUCN statement should be with Amphicnemis glauca).
  2. Number #2 is indeed pathetically wrong. Succu put it right.
  3. Number #3 is an error (as explained by Termininja) created by a bot run. Perhaps Succu can explain this? - Brya (talk) 05:17, 9 October 2015 (UTC)
Maybe this is the explanation: previously, puto (Q3543297) was an instance of taxon. In my opinion it would be easier to make it a taxon again. --Termininja (talk) 06:27, 9 October 2015 (UTC)
User:Wylve split the item, but did not check the backlinks. --Succu (talk) 09:23, 9 October 2015 (UTC)

Odd alternative names

Just looking at some classic examples used in Wikidata:Wikinews, in particular Q36611. Oddly, the 'In more languages' tab lists a large number of 'Also known as' entries under the 'Bengali' language, which are also seen in the API:

       "bn": [
           {
               "language": "bn",
               "value": "Gorillinae"
           },
           {
               "language": "bn",
               "value": "Silverbacks"
           },
           {
               "language": "bn",
               "value": "Black-back"
           },... etc etc.

Is this normal? HYanWong (talk) 20:48, 9 October 2015 (UTC)

@HYanWong: I agree that it looks odd, but those words are redirects in Bengali Wikipedia, as you can see in https://bn.wikipedia.org/w/index.php?title=%E0%A6%AC%E0%A6%BF%E0%A6%B6%E0%A7%87%E0%A6%B7:WhatLinksHere/%E0%A6%97%E0%A6%B0%E0%A6%BF%E0%A6%B2%E0%A6%BE&hidelinks=1 . I think that it would be a good idea to ask on the Bengali Wikipedia Embassy page whether such words are useful redirects there and whether they can be considered synonyms of the Bengali name for gorillas in any Bengali context. --Pere prlpz (talk) 21:21, 9 October 2015 (UTC)
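
To find similar cases on other items, something like the following sketch could flag aliases that are written purely in ASCII under a language where that is unexpected. It assumes entity JSON with an `aliases` map shaped like the API output quoted above. The helper names are hypothetical, and the last sample value is an added placeholder in Bengali script, not taken from the item.

```python
def aliases_for(entity, lang):
    """Return the alias strings recorded for one language code."""
    return [entry["value"] for entry in entity.get("aliases", {}).get(lang, [])]


def ascii_only(values):
    """Aliases written purely in ASCII, which is unexpected for a language such as Bengali."""
    return [value for value in values if value.isascii()]


# Sample shaped like the API output quoted above; the last alias is added for contrast.
sample_entity = {
    "aliases": {
        "bn": [
            {"language": "bn", "value": "Gorillinae"},
            {"language": "bn", "value": "Silverbacks"},
            {"language": "bn", "value": "Black-back"},
            {"language": "bn", "value": "গরিলা"},
        ]
    }
}

suspicious = ascii_only(aliases_for(sample_entity, "bn"))
```

An ASCII alias is not necessarily wrong (scientific names are Latin in every language), so the result is only a list of candidates to check by hand.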

Another item to look at

This one looks like a taxon mistake too, doesn't it: https://www.wikidata.org/wiki/Q19862317. HYanWong (talk) 11:35, 9 October 2015 (UTC)

My mistake - didn't spot the definition in German, sorry HYanWong (talk) 23:21, 12 October 2015 (UTC)

Global identifiers

Not wanting to trigger another debate about what IDs from other sources to include / exclude, but I thought I should point out that the 2nd prize in the recent GBIF contest went to a project that aims to unify biological identifiers from different databases. That might be an alternative to storing all those GBIF / NCBI / etc IDs on wikibase. I presume it isn't ready for use here yet, but it may be worth keeping an eye on. HYanWong (talk) 08:52, 14 October 2015 (UTC)

I have been aware of the project for a long time, but it has always seemed to be tied up in discussions about formats and connecting things. Not sure if it will ever reach the point where they start to weed out the errors. But I agree that it is very discouraging to have EOL and GBIF entries which only tell us that there is a CoL entry (which is all too likely to be completely fictitious). But something to keep an eye on, yes. - Brya (talk) 11:02, 14 October 2015 (UTC)