Wikidata:Requests for comment/Inheritance of taxon ranks

The following discussion is closed. Please do not modify it. Subsequent comments should be made in a new section. A summary of the conclusions reached follows.

An editor has requested the community to provide input on "Inheritance of taxon ranks" via the Requests for comment (RFC) process. This is the discussion page regarding the issue.

If you have an opinion regarding this issue, feel free to comment below. Thank you!

THIS RFC IS CLOSED. Please do NOT vote nor add comments.

Question 1: Avoid adding taxon rank statements that could be inherited from superclasses

IMHO we should avoid adding (manually or by bot) biological statements that could be "inherited" from superclasses: for example, adding "phylum" to a species that already has "genus".

Support

Support for the fundamental properties as defined by Cactus in the comments section. Redundancy in data only leads to a lot of ambiguity. I think it is only a bug (or a not-yet-implemented feature) that properties of linked items cannot be gathered. The best way to cope with this is to inform the developers about the importance of the feature and hope for a quick fix. Cluttering the database with redundant information is less helpful (especially when you think about adding references to it). FelixReimann (talk) 10:39, 21 May 2013 (UTC)
I support this option for the same reasons that I support "subclass of" and "instance of". (I will be responding to the comments below.) --Izno (talk) 15:24, 26 May 2013 (UTC)
On that note, actually, you could use subclass of to entirely describe the tree, come to think of it, making parent taxon entirely a duplicate. It's only for the reasons that were mentioned at PfD (too lazy to link it) that one might confuse it with the actual taxon rank subclass. --Izno (talk) 17:19, 26 May 2013 (UTC)
Redundancy should be avoided. --Succu (talk) 17:47, 26 May 2013 (UTC)
Support because data redundancy is a bad thing except in very special cases. To anyone who opposes, please read the article I linked and what it links to. We're not making stuff up here, this is a researched topic in computer science, and we'd be foolish to reject its results. Silver hr (talk) 20:29, 26 May 2013 (UTC)
Comment – It is a weak argument to link to some general article without explaining why it is a problem exactly here. Did you actually read the Wikipedia article and its references? Reference number 2 (Peter Rob; Carlos Coronel (2009). Database systems: design, implementation, and management. Cengage Learning. p. 88. ISBN 978-1-4239-0201-0) says "However [...] keep in mind that controlled redundancies are often designed as part of the system to ensure transaction speed and/or information requirements. Exclusive reliance on relational algebra to produce required information may lead to elegant designs that fail the test of practicality." So I am not sure that your references actually support your point of view. Byrial (talk) 21:31, 26 May 2013 (UTC)
It is similarly a weak argument to do exactly as Silver hr supposedly just did. :) In this case, why would a controlled redundancy (and I doubt we can really call this controlled; it's a wiki!) help us to satisfy those requirements? I have seen no indication that these will be problems in this instance. And it's only one of the many references; and in fact it is used to support the notion that generally, one should avoid redundancy. --Izno (talk) 14:36, 27 May 2013 (UTC)

I didn't link to a general article. I linked to the article on data redundancy, which is precisely the issue here. I did read the article. I didn't read its references, however I did read other books on the topic of data modelling which also cover data redundancy. Your quote falls under my "very special cases" statement--database denormalization and other forms of controlled redundancy are done for reasons of performance only in certain systems and where needed. Note the word "controlled"--if the data is kept in a redundant state, access to it is controlled such that its integrity remains constant. Wikidata is not a controlled environment, nor is there a proven performance problem that redundancy described here would solve. And finally, I must ask you if you read the Wikipedia article I linked--it says right there in simple-enough terms that "data redundancy leads to data anomalies and corruption", which very much supports my point of view. Silver hr (talk) 23:38, 30 May 2013 (UTC)
Support, per Silver hr. Emw (talk) 17:13, 8 June 2013 (UTC)
Support --Tobias1984 (talk) 09:46, 11 June 2013 (UTC)

Oppose

I copy what I said in WD:PC: "I disagree, because if you want to see it that way, nobody should add P107 (main type) to items which already have something that shows what the main type is (e.g. birth date or birth place). Figuring out properties of an item from some other properties of that item is not our job, and besides, Wikidata is not for Wikipedia exclusively. If Google wants to know what the phylum of a species is, we shouldn't make them write very complicated code to review and load so many other pages (and put pressure on the server) every time to understand what the phylum is." Amir (talk) 11:31, 19 May 2013 (UTC)
Re: "pressure on server": there are backup files available that put no pressure at all on the WMF servers. Regarding the complicated code, the larger risk is that the bots miss items that have to be updated. To give an example: the timezone has now been added to many items related to Sweden. It would have been enough if the timezone had been added to Sweden (Q34) instead, since the timezone that has now been added to every item related to Sweden is wrong, and I do not know how we are going to find them to remove or correct them. (And, yes, Wikipedia has been used as source.) -- Lavallen (block) 12:58, 19 May 2013 (UTC)

Re: Google has the Knowledge Graph, they can well write very very complicated code! --Ricordi samoa 14:52, 19 May 2013 (UTC)
@Amir What has Google to do with our data models? If our models work, I'm sure that Google's devs are able to work with them... -- Felix Reimann (talk) 15:22, 17 September 2013 (UTC)
Currently it isn't possible to use (in Wikipedia) properties that aren't "flat" linked to an item, even with the Lua API. he:יחידה:מיון (taxobox in hewiki; see also en:Module:Taxobox) uses these direct "flat" links, so adding kingdom (for example) to a species entry is needed by the template to get the correct kingdom. However, if it becomes possible in a later phase of the project to use non-directly linked properties (parent of parent of parent...), I'll remove my objection ;) Eran (talk) 19:48, 20 May 2013 (UTC)
That seems to me to be the case because the module was written so as to most closely emulate the current taxobox (which requires each field to be filled in, precisely because Wikidata hadn't existed). I suspect it is not the case that the module or template cannot or won't be able to get the inherited item data. --Izno (talk) 15:37, 26 May 2013 (UTC)
@Eran: Correct, until this bug is fixed. However, Module:Taxobox, which is a reference implementation of the data model proposed by the WD:Taxonomy task force, shows that it works. Thus, we should push the devs to fix this bug instead of relying on an extremely redundant and unmaintainable data model. -- Felix Reimann (talk) 15:22, 17 September 2013 (UTC)
Bad idea, for three reasons.
1. Firstly, when the taxonomy needs to be updated, we are talking about updating it locally with bots versus updating modules on several wikis. The latter choice is worse, as script updating has always been lacking, especially on smaller wikis, no matter whether the scripts are in JavaScript in the form of gadgets or in Lua in the form of modules. At least this is a smaller issue when it comes to gadgets, as the solution of fetching a gadget from another wiki via the API has been used (with importScriptURI, for example). No similar mechanism exists for Lua modules, although there are plans to make a central repository for them, which are at least 2 years from becoming a reality.
2. Secondly, I feel like this is a move that solves one problem by creating another, and thereby doesn't really solve anything. Also, the problem that is created by doing this would need to be solved by the Wikipedians, not the Wikidatians.
3. Thirdly, I am going to be a bit bold and say that this proposal is against one of the fundamental points behind Wikidata. Wikidata has always been about centralization. This, clearly, is decentralization. --Snaevar (talk) 21:02, 21 May 2013 (UTC)
It seems to me that the first is entirely irrelevant to this question (and as you note, common Scribunto modules/templates are coming).

The second doesn't feel to me like a problem. From my point of view, it makes lots of sense to put less information directly into each item precisely because we can eliminate the problem entirely. A module, at a common location or otherwise, that queries for the values of the inherited properties seems much more efficient than running a bot to change every property every time biologists change the tree of life. If a species gets moved, then it gets moved, and we only need to update here one or two properties on each species article. If we maintain a full hierarchy either here or there, then it is worse because we have to update properties on hundreds of pages either way. Someone feels that crazy maintenance burden, and it's easier if it's here and each item does not touch the entire hierarchy.

As for the third, I disagree. (And I now suspect there are some language issues here?) If I put in "parent taxon" on an item, I've actually centralized the information about that taxon to the actual taxon, rather than each of the items. If anything, this actually centralizes the data on the items where each property is most relevant. --Izno (talk) 15:37, 26 May 2013 (UTC)
I also think that the problem discussed in this RfC is a different one. Wikidata will be the central database for all taxoboxes (if we do it right). The question discussed here is which data model to use. -- Felix Reimann (talk) 15:22, 17 September 2013 (UTC)
Oppose. All data about a specific subject should be directly available from its item. That is a simple and useful model. If one has to traverse several items in a tree structure to find the class of a species, then the model is no longer usable for sister projects like Wikipedia. /Esquilo (talk) 13:40, 27 June 2013 (UTC)
I disagree that a central model is useful. If you look at how many taxa we are talking about and how often their hierarchy changes, a taxon-central model will lead to a lot of inconsistency. -- Felix Reimann (talk) 15:22, 17 September 2013 (UTC)

Comments

Inheritance has some great advantages. Let's say that scientists suddenly decide that ladybirds are fungi. Then you do not have to change every item related to ladybirds. -- Lavallen (block) 09:05, 19 May 2013 (UTC)

Also, with only a little information on each taxon level, we could build a full taxobox by following "up-links". --Ricordi samoa 09:15, 19 May 2013 (UTC)

For instance, botanists changed the classification of flowering plants from the APG system (1998) to the APG II system (2003) and then to the APG III system (2009) in just 11 years. Besides, the Cronquist system (1981) of flowering plant classification is still widely used. If we want to add all these classifications, I think it would be too redundant to add them to every single species, since that means hundreds of thousands of plant species. --Stevenliuyi (talk) 11:44, 19 May 2013 (UTC)

Ideally, all that is needed is the parent taxon and the taxon rank properties. Everything else can be deduced from them. Therefore, everything else is redundant, IMO. Everything that is redundant and can be deduced from other properties should be generated at "runtime" instead of being stored in the item, whenever it's technically possible, IMO. - Soulkeeper (talk) 13:00, 21 May 2013 (UTC)
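
As a rough illustration of the deduction described above, here is a minimal Python sketch that stores only a rank and a parent link per taxon and derives a higher rank (such as the phylum) by walking upward. The `TAXA` dictionary and the function name `ancestor_with_rank` are assumptions made for this example; they do not model real Wikidata items, property IDs, or any existing API.

```python
# Hypothetical in-memory stand-in for items that carry only the two
# "fundamental" statements: taxon rank and parent taxon.
TAXA = {
    "Coccinella septempunctata": {"rank": "species", "parent": "Coccinella"},
    "Coccinella": {"rank": "genus", "parent": "Coccinellidae"},
    "Coccinellidae": {"rank": "family", "parent": "Coleoptera"},
    "Coleoptera": {"rank": "order", "parent": "Insecta"},
    "Insecta": {"rank": "class", "parent": "Arthropoda"},
    "Arthropoda": {"rank": "phylum", "parent": "Animalia"},
    "Animalia": {"rank": "kingdom", "parent": None},
}


def ancestor_with_rank(taxa, name, rank):
    """Follow parent links upward from `name` and return the first
    ancestor whose rank equals `rank`, or None if there is none."""
    seen = {name}  # guards against accidental cycles in the data
    current = taxa[name]["parent"]
    while current is not None and current not in seen:
        if taxa[current]["rank"] == rank:
            return current
        seen.add(current)
        current = taxa[current]["parent"]
    return None


phylum = ancestor_with_rank(TAXA, "Coccinella septempunctata", "phylum")
```

With this layout, moving a genus to another family means changing one parent link; every species below it picks up the new family, order, and so on the next time the chain is walked.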

It's not that simple. Example: You can treat the genus Nolina as belonging to the family Nolinaceae as many authors prefer to do and not as part of the family Asparagaceae according to APGIII. Or you can treat it as tribus Nolineae within the subfamily Nolinoideae. But not all authors accept a tribus Nolineae. Taxon rank for Nolina is always genus. What's the parent taxon? --Succu (talk) 14:58, 21 May 2013 (UTC)

Two different taxa, according to two different sources, it seems. Isn't that the sort of thing we're meant to sort out using qualifiers? - Soulkeeper (talk) 19:09, 21 May 2013 (UTC)

Nothing differs. Taxon rank (genus) is steady. --Succu (talk) 21:38, 21 May 2013 (UTC)

I've some experience managing redundancy concerning systematics (Taxobot, see my statement here, "In dewiki there's some experience concerning this task. "). An outline for a possible solution with Wikidata:

there must be a clear separation between fundamental and derived (redundant) properties
the only fundamental properties concerning systematics are
- parent_taxon
- rank
- scientific_name
- flag if this is a (less important) intermediate taxon (then it is skipped in the upper part of the taxobox)
derived properties may be
- higher_taxon[n] (to be used in taxoboxes, n=2-Levels shown in taxobox)
- phylum, order, ...
a bot updates derived properties periodically
Compared with Taxobot there's still one big disadvantage: there's no page showing comprehensive systematics (like this for mammals). Because of this, a bot should also generate a comparable summary for verification purposes. Also, errors must be detected (cyclic parents, invalid sequence of ranks, e.g. family as parent of order).

If there's support I could break down this outline. --Cactus26 (talk) 16:42, 19 May 2013 (UTC)
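
The two error checks named in the outline (cyclic parents, and an invalid sequence of ranks such as a family as parent of an order) could be sketched roughly as follows. This is only an illustration in Python, not Taxobot's actual logic: the simplified `RANK_ORDER` list, the `find_problems` function, and the sample data are all assumptions made for this example.

```python
# Simplified, assumed ordering of major ranks from highest to lowest.
RANK_ORDER = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]

INVALID_SEQUENCE = "parent rank is not higher than child rank"
CYCLE = "parent chain contains a cycle"


def find_problems(taxa):
    """Return a list of (taxon name, problem) pairs for a dict that maps
    each taxon name to {"rank": ..., "parent": ...}."""
    level = {rank: i for i, rank in enumerate(RANK_ORDER)}
    problems = []
    for name, item in taxa.items():
        parent = item["parent"]
        # Check 1: the parent must have a strictly higher rank.
        if parent in taxa:
            child_level = level.get(item["rank"])
            parent_level = level.get(taxa[parent]["rank"])
            if (child_level is not None and parent_level is not None
                    and parent_level >= child_level):
                problems.append((name, INVALID_SEQUENCE))
        # Check 2: walking upward must never revisit a taxon.
        seen = {name}
        current = parent
        while current in taxa:
            if current in seen:
                problems.append((name, CYCLE))
                break
            seen.add(current)
            current = taxa[current]["parent"]
    return problems


# Deliberately broken sample: a family set as parent of an order,
# which here also closes a loop between the two.
bad = {
    "Carnivora": {"rank": "order", "parent": "Felidae"},
    "Felidae": {"rank": "family", "parent": "Carnivora"},
}
report = find_problems(bad)
```

A periodic bot run could publish such a report next to the generated summary page, so that broken parent chains are noticed before derived properties are refreshed.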

Support. Do you think that the flag for intermediate taxa is important? Perhaps the taxobox of each Wikipedia should decide which taxon ranks to show (for example, en-wp displays the parvorder, while de-wp skips it and shows the next higher suborder directly). The hierarchy could be displayed by adapting Magnus' geneawiki. But of course, additional verification is needed. FelixReimann (talk) 10:51, 21 May 2013 (UTC)

I don't think a flag for intermediate taxa is needed. The taxobox can get the two or three closest parent taxa, regardless of rank, after that it could get only the taxa that have a certain subset of ranks. - Soulkeeper (talk) 13:02, 21 May 2013 (UTC)
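
The selection rule suggested above (take the nearest few parents regardless of rank, then only parents with certain ranks) might look roughly like this in Python. The `MAJOR_RANKS` set, the `taxobox_chain` function, and the simplified beetle data are assumptions chosen for illustration; which ranks count as major would be up to each Wikipedia's taxobox.

```python
# Assumed set of ranks a taxobox always wants to show.
MAJOR_RANKS = {"kingdom", "phylum", "class", "order", "family", "genus"}

# Simplified illustrative data with several intermediate ranks.
BEETLE_TAXA = {
    "Coccinella": {"rank": "genus", "parent": "Coccinellinae"},
    "Coccinellinae": {"rank": "subfamily", "parent": "Coccinellidae"},
    "Coccinellidae": {"rank": "family", "parent": "Coccinelloidea"},
    "Coccinelloidea": {"rank": "superfamily", "parent": "Cucujiformia"},
    "Cucujiformia": {"rank": "infraorder", "parent": "Polyphaga"},
    "Polyphaga": {"rank": "suborder", "parent": "Coleoptera"},
    "Coleoptera": {"rank": "order", "parent": "Insecta"},
    "Insecta": {"rank": "class", "parent": None},
}


def taxobox_chain(taxa, name, nearest=2):
    """List the ancestors of `name` to show in a taxobox: the `nearest`
    closest parents whatever their rank, then only major-rank parents."""
    chain = []
    seen = {name}  # guards against cycles in the parent links
    current = taxa[name]["parent"]
    depth = 0
    while current is not None and current not in seen:
        depth += 1
        if depth <= nearest or taxa[current]["rank"] in MAJOR_RANKS:
            chain.append(current)
        seen.add(current)
        current = taxa[current]["parent"]
    return chain


rows = taxobox_chain(BEETLE_TAXA, "Coccinella", nearest=2)
```

Note that a purely rank-based filter like this cannot express the per-taxon distinction raised for Passeri versus Galbulae (both suborders), which is the case the proposed flag is meant to cover.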

The less-important-intermediate flag is not essential. It is an optimization and can be added later if it is needed. It was also introduced subsequently for Taxobot when beetles were adapted, because there are many intermediate taxa (suborder, infraorder, superfamily) and we should avoid consuming taxobox rows with such little-known taxa. This flag should not be tied to rank, e.g. suborder en:Passeri is important, Galbulae (see en:Piciformes) is not. Btw: Galbulae is a good example of why this flag will hardly be dispensable with Wikidata: if just one language has an article for this taxon, it must be connected in the parent chain. And there is one: ca:Galbulae. --Cactus26 (talk) 17:24, 21 May 2013 (UTC)

With Module:Taxobox we have an example which proves that the data model proposed by the WD:Taxonomy task force works. All taxa which have real references (i.e. not just copied from Wikipedia taxoboxes) use this data model. However, due to this bug it cannot be used in Wikipedias yet. I propose to deprecate all the properties P74 (P74), P75 (P75), ... and concentrate now on the working data model. If you vote for the bug, its importance is increased. PS: Also the data model defined by Help:Sources heavily relies on the possibility to access other items. -- Felix Reimann (talk) 15:30, 17 September 2013 (UTC)