# Wikidata talk:WikiProject Chemistry/Archive/2018


## HMDB identifiers

Hi all, 10 days ago the Human Metabolome Database (Q5937262) seems to have adopted a new identifier scheme, one with more zeros, I guess anticipating a major data load. The regular expression of the Human Metabolome Database ID (P2057) is still fine, but they decided to give all compounds new identifiers, with extra zeros. For example, D-fructose was HMDB00660 and is now HMDB0000660, with the original identifier now a secondary identifier. How should we update the Wikidata entries? Keep the original identifier as well as the new identifier, maybe even with a new property (personally, I'd rather not)? Use ranks to indicate which is the primary identifier (I'm aware of an ongoing related discussion about secondary identifiers)? Only the new identifier? That will break compatibility with a lot of databases. --Egon Willighagen (talk) 13:35, 26 August 2017 (UTC)

@Egon Willighagen: Correct all existing values of Human Metabolome Database ID (P2057), and perhaps it is good to consider whether the current display of that kind of identifier is the best: if we use only the relevant part of the identifier, i.e. the number without the zeros (HMDB0000660 -> 660), we would not have this problem. We can always add the letters and the zeros when we create the links. Snipre (talk) 22:44, 31 August 2017 (UTC)
@Snipre: I'm personally not fond of these shortened identifiers, as *all* tools need to start adding such prefixes... but since this seems to be a Wikidata habit, I'm fine. I can easily add the new values (SPARQL query followed by QuickStatements), but not sure how to automate removing the old values... work for a bot? @Andrawaag:, or can you help me bootstrap a modern bot? --Egon Willighagen (talk) 07:47, 2 September 2017 (UTC)
@Egon Willighagen: The main problem is that an identifier shouldn't be modified, so if the different databases are not respecting this rule, we have to find a way to avoid modifying all values each time the identifier pattern is changed. The question is simple: are we sure that HMDB won't change their identifier structure again in the future? As they already did it once, we can't exclude that possibility, so we have to find a solution which limits the risk of having to update the whole set of values. And this can be done by storing in WD only the part of the identifier which is stable. So unless you can provide the guarantee that HMDB won't change their identifier pattern any more, the best is to switch to a more secure data format. Snipre (talk) 04:22, 3 September 2017 (UTC)
@Snipre: Sorry, you lost me here... I think I already said I would not object to your suggestion... are you now suggesting yet another approach? Other than that, I am not a member of the HMDB developer community... I am not the person to ask if they will change in the future... I did ask about the next practical steps, and your opinion on that is not clear to me: how shall we update the Wikidata content... do you have a bot to make your suggested changes? --Egon Willighagen (talk) 07:16, 3 September 2017 (UTC)
@Egon Willighagen: My comment was not personal, it was just an answer to your remark "this seems to be a Wikidata habit": this is not a habit, but a real strategy to use only the core part of the identifier. And my reflection about HMDB is just a possible way to join your position: if you have some connections with HMDB and you can get the guarantee that they took enough margin with their new identifier pattern, then we can consider keeping the new pattern as it is. Snipre (talk) 19:12, 3 September 2017 (UTC)
@Snipre:@Egon Willighagen: So if I'm not mistaken, the suggestion now is to change all HMDB identifiers in the following way: Remove the letters HMDB and all zeros after these letters, until a number which is not zero is being used. These new numbers will then be stored as "unique" HMDB identifiers, but can be visualised on the website with the correct link (either to old/new numbering system from HMDB). Is it then also possible to query these identifiers, and return the results with the old/new numbering system (or both)? DeniseSlenter (talk) 10:56, 2 January 2018 (UTC)
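The normalisation discussed above (keep only the numeric core, rebuild either numbering scheme on demand) can be sketched in a few lines. This is only an illustration: the function names `hmdb_core` and `hmdb_format` are hypothetical, and the digit widths (5 for the old scheme, 7 for the new one) are inferred from the D-fructose example in this thread, not from HMDB documentation.

```python
import re


def hmdb_core(identifier: str) -> int:
    """Return the numeric core of an HMDB identifier, e.g. 'HMDB0000660' -> 660."""
    match = re.fullmatch(r"HMDB(\d+)", identifier)
    if match is None:
        raise ValueError(f"not an HMDB identifier: {identifier!r}")
    return int(match.group(1))


def hmdb_format(core: int, digits: int = 7) -> str:
    """Rebuild a full identifier, zero-padded to the requested width.

    Assumption: digits=5 corresponds to the old scheme and digits=7 to the
    new one, as in HMDB00660 -> HMDB0000660.
    """
    return f"HMDB{core:0{digits}d}"


core = hmdb_core("HMDB00660")
print(core, hmdb_format(core, digits=5), hmdb_format(core, digits=7))
```

Both the old and the new identifier reduce to the same stored value, so a lookup by either form would only need to normalise its input first.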

## (Molar?) mass of compounds

I'm trying to correct some information in pl.wiki (sometimes in WD too) by comparing data from infoboxes and from Wikidata. It works for CAS number, PubChem, DrugBank and Commons category, but not for mass of chemical compounds. In pl.wiki molar mass comes from PubChem, as it does in WD. In the first case we copied it from 'Molecular Weight' (so it's 'average mass'), but in WD it's 'Monoisotopic mass'... Why do we use monoisotopic mass instead of average mass (or why do we use only monoisotopic)? As far as I can see in PubChem/ChEBI/ChemSpider/..., average mass is present everywhere and in many databases it's the main data (sometimes the only one). Wostr (talk) 00:34, 22 December 2017 (UTC)

I'm not aware of any limitation of Wikidata to present monoisotopic mass, you can add (hopefully with a source) any mass (P2067) value that you like, the qualifier criterion used (P1013) would be appropriate to distinguish between different methods of determining that mass. ArthurPSmith (talk) 14:36, 22 December 2017 (UTC)
Yeah, and I'll do that manually for every PubChem compound... I simply don't understand why monoisotopic mass (which has limited uses) has been added instead of the most common molar mass (average). Wostr (talk) 18:45, 25 December 2017 (UTC) Maybe there is someone who is able to import this from PubChem? Wostr (talk) 19:06, 25 December 2017 (UTC)
Monoisotopic mass is the more stable mass: average mass changes from time to time when IUPAC recalculates the abundance of the different isotopes. This is a more stable parameter allowing a correct comparison between data from different sources. Snipre (talk) 20:22, 25 December 2017 (UTC)
Yep, that's all true and I do not think that only one mass should be present in WD items. Monoisotopic mass is also important in some fields, but in others, average mass is used. criterion used (P1013) is correct here (sometimes meaning in other language may be slightly different, so I'm not sure)? determination method (P459)? of (P642)? We have monoisotopic mass (Q3297559) already, so only 'average mass' item is missing. Wostr (talk) 21:55, 25 December 2017 (UTC)
I’d prefer a proper property. « mass » seems incorrect anyway, as its unit should at first sight be an absolute mass, and not a mass/volume or mass/mol. I’d also prefer proper items for variants of the substance, one item for the monoisotopic variant for example, a subclass of the main item, instance of « monoisotopic variant of a substance ». Each of those items could have statements built from new properties like « volumetric mass density » and « molar mass ». Seems better than messing with qualifiers and multiplying the statements in one item, which could become messy pretty soon imho … author talk page 21:02, 3 January 2018 (UTC)
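To make the difference between the two values in this section concrete, here is a small worked example for ethanol (C2H6O). It is a sketch only: the atomic masses are rounded values typed in for illustration, and the `mass` helper is hypothetical, not something taken from PubChem or Wikidata.

```python
# Approximate atomic masses in unified atomic mass units (u).
# Values are rounded and included only for illustration.
MONOISOTOPIC = {"H": 1.00782503, "C": 12.0, "O": 15.99491462}  # most abundant isotope only
AVERAGE = {"H": 1.008, "C": 12.011, "O": 15.999}  # abundance-weighted atomic weights


def mass(composition: dict, table: dict) -> float:
    """Sum the atomic masses of a composition given as {element symbol: count}."""
    return sum(table[element] * count for element, count in composition.items())


ethanol = {"C": 2, "H": 6, "O": 1}
print("monoisotopic:", round(mass(ethanol, MONOISOTOPIC), 4))
print("average:     ", round(mass(ethanol, AVERAGE), 3))
```

The two results differ in the second decimal place, which is why a comparison between an infobox value and a Wikidata value fails when one side stores the average mass and the other the monoisotopic mass.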

## Documenting how to model chemical concepts in Wikidata

Notified participants of WikiProject Chemistry. Over the past months several discussions have taken place around the question of how chemical concepts are modelled in Wikidata. I have started Wikidata:WikiProject_Chemistry/Proposal:Models to document the various practices and, in one case (compound classes), where we seem to want to end up. The page is a proposal but, whatever models we decide on, it aims at documenting what model we use. That will allow us to further structure the chemical data, and clean up inconsistencies due to wrongly tied together Wikipedia pages. Looking forward to your input! PS. What if we wrote up this effort as a formal application for, say, Journal of Cheminformatics (Q6294930)? --Egon Willighagen (talk) 13:37, 31 December 2017 (UTC)

Don't know where to put this (here or in the discussion of Proposal:Models). The first thing we should do is to decide whether to use instance of (P31) or subclass of (P279) for chemical compound items. The second: should chemical compound (Q11173) always be present in every item, or only the lowest chemical classes? As I tried to understand the difference between instance of (P31)/subclass of (P279) in regard to chemical compounds, I think the difference is quite... philosophical (as both properties can be IMHO correct at the same time) and only one property should be used (subclass of (P279); instance of (P31) would be used for family of isomeric compounds (Q15711994) and similar items). And as SPARQL allows queries with P279*, I think we shouldn't keep chemical compound (Q11173) in every item. Wostr (talk) 21:36, 31 December 2017 (UTC)
@Wostr: « as both properties can be IMHO correct at the same time » Then you just don’t get the difference :p Or the definitions are not very good and several things are cluttered into the same item, which is not very good. Do you have an example where you think both are correct? I made a diagram to explain the differences at the elemental level; the same levels should be found on the compound side. See en:Type/token distinction for philosophically sound ground for scientific definitions (that are used in modern scientific definitions, actually) and en:Metaclass (semantic web). The point is that we should be precise about which definition is used. The correct property will naturally follow. If there is confusion, it’s because we’re building a cathedral on sand. It might collapse at any time. author talk page 14:08, 2 January 2018 (UTC)
@TomT0m:, maybe I just don't get the difference (it's more than likely ;). I was writing about compounds – you can't (of course you can, but IMHO it would not be the proper approach) split the definition of chemical compound into 'molecule of chemical compound' and 'set of molecules of chemical compounds'. The chemical compound is both at the same time and the right part of this definition is used depending on the situation. There are such terms as '(molecular) entity' and '(chemical) species', but their use is rather limited (BTW the latter does not even have a translation to Polish ;) and maybe in some other languages too). However, if you want to tell me that 'molecule of chemical compound' is a class (because it's not about a particular molecule, but about the type of molecules), then should we use only instance of (P31) (as all chemical groups/families/etc. look like metaclasses to me)? Wostr (talk) 15:55, 2 January 2018 (UTC)
« The chemical compound is both at the same time » seems like a nonsensical statement to me. In the real world there are concrete substances of the chemical compound - the content of the bottle in the lab - and the abstract property associated with the content of all the bottles labelled as such (molecular formula for example). There is no « THE chemical compound » except as a language shortcut. If you get one item for each notion, there is no ambiguity to be resolved by looking at the context. Just use the right item.
More: there are definitions, maybe competing, for this notion: https://goldbook.iupac.org/html/C/C01039.html for the IUPAC about a substance that can be of some use. From this definition, I deduce that a substance is by definition a class of real-world lab or factory bottle content, hence a class. And that « substance » is the metaclass for all substances, as a bottle is definitely not « the » substance by itself. It would be weird if both « water » and « the water in my bottle » were instances of substances. Last, to take the water example again, it’s not shocking to have an item « water » and an item « water molecule ». The relationship « water has part water molecule » is pretty straightforward. author talk page 18:45, 2 January 2018 (UTC)
Might I suggest, @TomT0m, Wostr: that you relocate this discussion to Wikidata talk:WikiProject Chemistry/Proposal:Models and provide specific arguments for or against the proposals @Egon Willighagen: has given us there, or add your own modeling suggestions. I think we can make real progress and that page is a good place to focus discussion. ArthurPSmith (talk) 20:38, 2 January 2018 (UTC)
@TomT0m:, I cannot disagree with you and come up with some reasonable counterargument. Wostr (talk) 08:49, 3 January 2018 (UTC)

## Use of has part (P527) to describe the atomic composition of chemicals

Currently we use has part (P527) to indicate the elemental composition of a chemical but has part (P527) has a constraint which creates the inverse relation:

• ethanol has part hydrogen
• hydrogen part of ethanol

But considering that we have more than 100'000 chemicals containing hydrogen, I think we should find a solution that avoids having 100'000 statements in the hydrogen item. Two solutions:

• delete the constraint which requires the inverse relation
• create a new property "is composed of" used mainly for chemicals and without any inverse relation. Snipre (talk) 23:40, 17 October 2017 (UTC)
• The second solution is IMHO better; the existing relation and constraint in has part (P527)/part of (P361) is useful with other items. But I think there will be voices that there should be 100 000 statements in the hydrogen item... Wostr (talk) 00:15, 18 October 2017 (UTC) PS okay... I just noticed recent changes in carbon (Q623)... Wostr (talk) 00:19, 18 October 2017 (UTC)
• Why does there need to be either has part or part of for each element? It's redundant information found in the chemical formula. The only reason I can see for including it is for a time-memory tradeoff when querying. And that's a question relevant for all of Wikidata: should derived data be included and how? Devon Fyson (talk) 01:41, 18 October 2017 (UTC)
@Devon Fyson: The format of the chemical formula is the problem: this is not a structured format and you need to parse the formula string in order to find the information you want. And I don't think this is just a matter of time but a problem of a two-step query: you first need to extract the formula and then you have to parse the string correctly. The problem is again the unstructured nature of the formula, which is difficult to handle: you have to be able to differentiate the B for boron from the B for beryllium (Be), you have to deal with structural formulas compared to molecular formulas (compare C2H6O and CH3OCH3),... Can we really perform simple queries like extracting all chemicals containing 2 carbons and 1 nitrogen from a single query based on the chemical formula values? If you can provide the solution for that example, perhaps we can take account of your solution. Snipre (talk) 02:22, 18 October 2017 (UTC)
And what's about chemicals containing isotopes like for dideuterium (Q6419441): can I extract the isotope from the chemical formula ? Again this is depending on the quality of the chemical formula. Snipre (talk) 02:27, 18 October 2017 (UTC)
@Snipre: Parsing a chemical formula to get a list of each element and some of the functional groups and their quantities is a very doable problem and shouldn't be overly difficult. It likely already exists in source code. But once that parser is found/made, it could either be used to fill in all the fields for this proposal and save a huge amount of tedious manual labour, or used directly in a query. Either way building a query, such as for your example, would be just as easy for the user. It would just affect either the speed of the query or amount of space used by the database (time-space trade-off). As for isotopes and functional groups, they should be included in the formula when possible so there's less guessing of the isomer. The formula in dideuterium (Q6419441) should really be 2H2 or D2, not H2. But now I'm thinking why not work on a framework for describing the entire structure? For example SMILES or InChI (I'm not sure if legally those can be used or not). And when that happens (I'm sure it will eventually), not only will this proposal be obsolete, but it will provide a much greater wealth of information. Devon Fyson (talk) 03:43, 19 October 2017 (UTC)
@Devon Fyson: I never said that parsing is not possible, I just want to see an example to be convinced. We will have to provide an easy way for contributors and saying that's doable is not sufficient I am afraid. Without a solution some people will start to propose properties and to continue to add the current "Has part" statements. And for InChI and Smiles, I let you read the corresponding WP articles on WP:en to discover the licence of these systems. Snipre (talk) 21:45, 19 October 2017 (UTC)
• I would support a new property. I'm not sure whether it should be "is composed of", but that sounds like a good starting point for discussion. Consider whether whole functional groups could also be included. --99of9 (talk) 03:48, 18 October 2017 (UTC)
• Delete the constraint. This property is classically used in chemical ontologies, no reason to make a difference here. Easy to build a query to compute the inverse now we have the query engine. This will eventually (maybe …) be possible to use it in lua or templates on WikiPages if needed. author talk page 17:22, 4 January 2018 (UTC)
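The request made earlier in this thread for a concrete parsing example ("all chemicals containing 2 carbons and 1 nitrogen") can be sketched as follows. This is a minimal illustration, not a proposal for how Wikidata should do it: `parse_formula` is a hypothetical helper, it only handles flat formulas (no parentheses, charges, hydrates or isotope labels), and the small `formulas` dictionary is sample data typed in by hand rather than values fetched from chemical formula (P274).

```python
import re
from collections import Counter

# One uppercase letter, an optional lowercase letter, then an optional count.
# Matching case-sensitively is what separates B (boron) from Be (beryllium).
TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def parse_formula(formula: str) -> Counter:
    """Count atoms in a flat formula such as 'C2H6O' or 'CH3OCH3'."""
    counts = Counter()
    position = 0
    for match in TOKEN.finditer(formula):
        if match.start() != position:
            # Something between two tokens could not be read (e.g. a parenthesis).
            raise ValueError(f"cannot parse {formula!r} at position {position}")
        symbol, number = match.groups()
        counts[symbol] += int(number) if number else 1
        position = match.end()
    if position != len(formula):
        raise ValueError(f"cannot parse {formula!r} at position {position}")
    return counts


# Hand-written sample data, for illustration only.
formulas = {
    "ethanol": "C2H6O",
    "dimethyl ether": "CH3OCH3",
    "acetonitrile": "C2H3N",
    "ethylamine": "C2H7N",
    "methylamine": "CH5N",
}

hits = [
    name
    for name, formula in formulas.items()
    if parse_formula(formula)["C"] == 2 and parse_formula(formula)["N"] == 1
]
print(hits)
```

Note that the structural formula CH3OCH3 and the molecular formula C2H6O give the same element counts here, but the sketch says nothing about isotopes, which would need an agreed notation in the formula string first.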

#### New property for composition

If we assume a new property for both elemental and functional composition, then we face a drawback of the current WD data structure. The first thing is the need of 2 properties, one general for the composition and one for the composition element: We need a composition description property and a composition element property.

Case of methanol

• composition description: element
• composition element: carbon
• composition element: oxygen
• composition element: hydrogen
• composition description: functional group
• composition element: hydroxyl
• composition element: methyl

The problem is we won't be able to add the number of each element or each functional group. Snipre (talk) 08:00, 18 October 2017 (UTC)

@99of9, Wostr: Snipre (talk) 10:05, 18 October 2017 (UTC)
@Snipre:, so if we need 2 properties for this, maybe there shouldn't be composition description/element properties, as there is no option to add quantity (P1114), but we can add is composed of (element) and something like has substructure/functional group (building blocks?). In this case:
• is composed of:
• has substructure/functional group
• hydroxyl
• methyl
substructure property may be useful in cases like benzimidazole (Q415190) to indicate imidazole (Q328692) and benzene (Q2270) as its substructures. Wostr (talk) 11:11, 18 October 2017 (UTC)
Can’t the first one just be deduced from the second one? I mean, if you just use « has part » and « hydroxyl has part hydrogen » and « methanol has part hydroxyl », the list of elements can just be deduced from a query « (hydrogen has part [ has part ?element ] ; ?element instance of chemical element ». Nothing simpler. author talk page 10:04, 19 October 2017 (UTC)
This assumes that « hydroxyl has part hydrogen » is a statement of hydroxyl. author talk page 10:19, 19 October 2017 (UTC)
@TomT0m: Everything is possible, but the question is: can you give an example of your proposition, something like a SPARQL query or anything else ? The current proposition is not smart, I agree, but it is based on a functional system property/value with a query system with a good support. Snipre (talk) 21:45, 19 October 2017 (UTC)
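As a rough illustration of the deduction described above (elements obtained by following « has part » through functional groups), here is an in-memory sketch. It is not a SPARQL query and does not touch Wikidata: the `HAS_PART` and `ELEMENTS` tables are hand-written stand-ins for statements that would have to exist on the items, and `elements_of` is a hypothetical helper.

```python
# Hand-written stand-ins for « has part » statements (assumed, not fetched).
HAS_PART = {
    "methanol": {"hydroxyl", "methyl"},
    "hydroxyl": {"oxygen", "hydrogen"},
    "methyl": {"carbon", "hydrogen"},
}

# Stand-in for « instance of chemical element ».
ELEMENTS = {"carbon", "hydrogen", "oxygen"}


def elements_of(item: str) -> set:
    """Follow « has part » transitively and keep only the chemical elements."""
    seen = set()
    stack = [item]
    while stack:
        current = stack.pop()
        for part in HAS_PART.get(current, ()):
            if part not in seen:
                seen.add(part)
                stack.append(part)
    return seen & ELEMENTS


print(sorted(elements_of("methanol")))
```

The traversal recovers which elements are present, but not how many atoms of each, so the counting problem raised earlier in this section remains open with this approach.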
Comment It seems to me that capturing the full conceptual content of chemical structures is beyond the scope or capabilities of Wikidata. How would you describe ring structures? What about crystal structures of bulk materials? How would you describe a protein composed of tens of thousands of individual atoms? There are standard ways to do these things and we could point to data files, perhaps hosted in Commons, that provide these details, but I don't think Wikidata can or should be expected to be able to fully describe all chemical structures. ArthurPSmith (talk) 13:06, 19 October 2017 (UTC)
@ArthurPSmith: If you have some doubt please have a look at this [1]: enzymes are studied based on structure decomposition using ~400 functional groups. Snipre (talk) 21:45, 19 October 2017 (UTC)

Comment As reported above, the elemental composition information is already found in the chemical formula. I would rather see a chemical formula data type in Wikidata, to ensure it is machine readable. There are enough standards for that. Or maybe even a regular expression is enough? Since people are eager to do this, I won't object. Please do make sure that all elements are listed, as I now frequently run into has_part with only a subset of elements of the compound. --Egon Willighagen (talk) 08:59, 22 October 2017 (UTC)
@Egon Willighagen: we have a property chemical formula (P274) that should be using a standard notation. I'm not sure it's been followed consistently though. ArthurPSmith (talk) 19:40, 23 October 2017 (UTC)
@Egon Willighagen, ArthurPSmith: We can't prevent people from adding formulas like CH3CH2OH, so it is better to create a new property referring directly to the method used to code the formula, like "Hill chemical formula" for formulas following the Hill system. Snipre (talk) 22:45, 16 December 2017 (UTC)
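For reference, the Hill system mentioned above orders a formula as follows: if carbon is present, carbon comes first, hydrogen second, and all other elements follow alphabetically; if carbon is absent, every element (hydrogen included) is listed alphabetically. A minimal sketch, assuming the element counts are already known; `hill_formula` is a hypothetical helper name:

```python
def hill_formula(counts: dict) -> str:
    """Write element counts, e.g. {'C': 2, 'H': 6, 'O': 1}, in Hill order."""
    if "C" in counts:
        others = sorted(symbol for symbol in counts if symbol not in ("C", "H"))
        order = ["C"] + (["H"] if "H" in counts else []) + others
    else:
        order = sorted(counts)
    # A count of 1 is left implicit, as in ordinary formula notation.
    return "".join(
        f"{symbol}{counts[symbol] if counts[symbol] > 1 else ''}" for symbol in order
    )


print(hill_formula({"O": 1, "H": 6, "C": 2}))
print(hill_formula({"S": 1, "O": 4, "H": 2}))
```

A property restricted to this form would store CH3CH2OH and CH3OCH3 both as C2H6O, which makes values comparable across sources but drops the structural hint.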

## GenX (Q29388239)

The item is about a chemical, while the two linked Wikipedia articles are about a process. --Leyo 13:30, 15 January 2018 (UTC)

@Leyo: Create a second item for the process Snipre (talk) 22:49, 15 January 2018 (UTC)
Egon Willighagen who created the item may like to do that. --Leyo 10:57, 16 January 2018 (UTC)
Nice catch! I don't remember it was like that; I will look into it, and create a second item for the process. --Egon Willighagen (talk) 15:21, 17 January 2018 (UTC)
Fixed, for which I reverted a few changes by @DePiep: and created GenX (Q47455781). --Egon Willighagen (talk) 16:41, 17 January 2018 (UTC)

## Plural or singular labels

Should labels of items like carboxylic acid (Q134856), amine (Q167198) and others that describe groups, classes, families etc. of compounds be plural or singular? At first I used plural labels in Polish, then I was convinced by someone that it would be better to use singular. For some time, however, I have been using plural again ;) for Polish labels, because it makes more sense (singular does not in my language; writing in Polish that 'amine' is a 'group of...' is quite funny).

Should there be some guidelines in this case or maybe it's language-specific matter? Now, most of the English labels are singular (like in en.wiki articles), but eg. many Russian labels are plural. I do not know English well enough to say that it's better to write 'amine is a class/group/family' (BTW the use of these three names could also be regulated somehow) or 'amines are a class...'. Wostr (talk) 22:56, 1 January 2018 (UTC)

@Wostr: Help:Label#Labels_in_English Snipre (talk) 19:52, 3 January 2018 (UTC)
@Snipre:, okay... but I can't find the answer on that page (or maybe you want to indicate that every language has to establish its own guidelines? — if so, it's still not clear to me whether to use plural or singular in English?). Wostr (talk) 20:03, 3 January 2018 (UTC)
@Wostr: Sorry, I didn't read the page to find the info. But this should be solved on that page. By ontology building rules, labels have to be singular if the singular form can be used (some concepts are always plural). So as we can use the concept amine in the singular form, the singular form should be used. Better to open a new section there, and if no opposition is raised then we can add this new rule to the rules list. Snipre (talk) 20:36, 3 January 2018 (UTC)
• note: ChEBI uses plural for compound classes (like in [2]). Wostr (talk) 21:43, 4 January 2018 (UTC)
• I think the wikidata (English) label policy comes from English wikipedia: en:Wikipedia:Naming conventions (plurals). That said we should probably have our own independent statement of it. ArthurPSmith (talk) 21:55, 4 January 2018 (UTC)
• Otherwise, there will always be 1=2 (or 2=1).WP is atavism and rudiment, living by its own rules (only human-readable (Q16716513)), which will lead to its collapse (few people need the right thing, because there is a better alternative). --Fractaler (talk) 18:15, 6 January 2018 (UTC)
• There are two main types of exceptions to this rule: Articles on groups or classes of specific things – am I the only person who thinks that en.wiki articles about compound classes and groups are a violation of this? ;) But, in fact, there are some more specific guidelines – en:Wikipedia:Naming_conventions_(chemistry)#Groups_of_compounds from 2008 (however I couldn't find any related discussion in WikiProjects Chemistry/Chemicals). Wostr (talk) 20:20, 6 January 2018 (UTC)
• note 2: Glossary of Class Names of Organic Compounds and Reactive Intermediates Based on Structure (IUPAC Recommendations 1995) uses plural names (probably to make clearer distinction between classes and compound names, e.g. pyridine vs pyridines). Wostr (talk) 19:45, 6 January 2018 (UTC)
• note 3: it seems that other wikis like de.wiki or it.wiki use plural names (en:Wikipedia_talk:WikiProject_Chemistry/Archive_26#Plural_names_for_classes_of_compounds), but this should be verified. In pl.wiki there was a short discussion, but it seems quite obvious that class names in Polish should be written in plural. Wostr (talk) 20:20, 6 January 2018 (UTC)
• note 4: Glossary of Class Names of Polymers Based on Chemical Structure and Molecular Architecture (IUPAC Recommendations 2009) uses singular class names for polymers to enable an individual polymer within a class to be referred to by using the indefinite article, “a”. For example, poly(3-octylthiophene) is a polythiophene and polythiophene itself is also a polythiophene. That may be related to the use of instance of (P31). Wostr (talk) 20:20, 6 January 2018 (UTC)
We create a knowledge base, a common terminological coordinate system. With its help, we can say now only, for example: dicarboxylic acid (Q422050) (dicarboxylic acid) is carboxylic acid (Q134856), tricarboxylic acid (Q2823314) (tricarboxylic acid) is carboxylic acid (Q134856) (carboxylic acid). And can not say that, for example, "dicarboxylic acid (Q422050) and tricarboxylic acid (Q2823314)" is carboxylic acid (Q134856). Because "group of carboxylic acids"/"carboxylic acids" we have not. --Fractaler (talk) 13:02, 8 January 2018 (UTC)
We do not need 'carboxylic acid' and 'carboxylic acids' items. One of them is redundant as both items describe the same thing – a structural class of chemical compounds. I think there is only a linguistic problem with the proper grammatical number and with is a (as a synonym of instance of); in this case the use of instance of (P31) is IMHO incorrect, as the correct ones are: tricarboxylic acids < subclass of > carboxylic acids and dicarboxylic acids < subclass of > carboxylic acids. Wostr (talk) 14:24, 8 January 2018 (UTC)
Ok, what is, for example, "dicarboxylic acid (Q422050) and tricarboxylic acid (Q2823314)" by your version? --Fractaler (talk) 10:05, 9 January 2018 (UTC)
Classes of chemical compounds, of course. Wostr (talk) 14:22, 9 January 2018 (UTC)
"Dicarboxylic and tricarboxylic acids" is "classes" or "class"? "dicarboxylic and tetracarboxylic acids" is "classes" or "class"? "dicarboxylic, tetracarboxylic and heptacarboxylic acids" is "classes" or "class"? Groups or group? --Fractaler (talk) 14:43, 9 January 2018 (UTC)
There is nothing like Dicarboxylic and tricarboxylic acids right now, so discussion about this makes no sense. Dicarboxylic acids is a class and a subclass of Carboxylic acids class. Same with Tricarboxylic acids and so on. Dicarboxylic acid = Dicarboxylic acids as a chemical class (and speaking 1≠2 in that case would be rhetorical figure only, as there is a linguistic problem which version to use, not substantive or methodological problem and both dicarboxylic acid and dicarboxylic acids are describing the same thing: class of chemical compounds). Wostr (talk) 19:18, 9 January 2018 (UTC)
There is nothing like Dicarboxylic and tricarboxylic acids right now where? --Fractaler (talk) 20:12, 9 January 2018 (UTC)
As a WD item. (And as a separate and well-established chemical class). Wostr (talk) 20:35, 9 January 2018 (UTC)
For a long time in ru-Wikipedia I asked the WP-editors: for whom do you create Wikipedia? They did not even want to think about it. And now there was a powerful competitor. Do you think that we are doing WD-items for Wikidata (Q2013)? We make knowledge base (Q593744), a common hierarchy of terms, common frame of reference (Q184876). So, it is not only in the Wikidata, but also outside it (in the Internet conversation, articles, etc.). Example of use: 1) "Alice gave Bob the malic acid (DL-malic acid (Q190143))". So Alice gave Bob (one) dicarboxylic acid" (dicarboxylic acid (Q422050), 1 acid). 2) "Alice gave Bob citric acid and succinic (succinic acid (Q213050)) acids". So Alice gave Bob carboxylic acids " (several acids, >1). Can you duplicate this message (when 2≠1, 1≠2) so that the meaning does not change if you have carboxylic acid = carboxylic acids (2=1, 1=2)? --Fractaler (talk) 09:30, 10 January 2018 (UTC)
I think that your problem is not of any connection with the title problem and I really don't understand what you're trying to achieve. Wostr (talk) 13:28, 10 January 2018 (UTC)
The task is very simple. You are trying to convey your thoughts to your interlocutor (he is a foreigner and does not understand your language). But you have a universal means, a single system of terms - Wikidata. Try to apply it. For example, replacing the words in the sentences (above) with the Wikidata items so that the foreigner correctly understands you. --Fractaler (talk) 14:28, 10 January 2018 (UTC)
English is not my first language and I also have problems in using it, but I'm trying to use it as best as I can. However, I won't replace English or any other language with some pseudo-language based on P's and Q's, sorry. It seems to me that I would understand Russian better (as my first language is Polish and for a short time I studied Russian) than this kind of discussion. As for the previous comment: I don't think that we will ever have duplicated items like 'carboxylic acid' and 'carboxylic acids' (both describing a chemical class) just for the sake of some linguistic problems that are not part of chemical classification. That would be a problem for Wiktionary and maybe in the future Wiktionaries will be integrated into/with Wikidata – that would allow the required grammatical number to be chosen automatically. What this topic is about is to decide (A) whether there should be some consistency in naming chemical classes between languages and (B) if so, should the names be plural or singular. Wostr (talk) 15:28, 10 January 2018 (UTC)
WD-language (only Q*, because P* can be replaced by Q*) is not pseudo-language. It is a common terminological reference system (ie, not only a relative, but an absolute path to the term), the next stage (after the mathematical language, semantic network (Q1045785) and etc.) of the evolution of the universal language. Mathematics to anyone who understands it, can explain why 2≠1, 1≠2. WD can show anyone that "carboxylic acid" is not "carboxylic acids", is not synonymous, is not " linguistic problem", because it is object of group (Q36809769) and group (Q16887380). object of group (Q36809769) is group (Q16887380)? And if so convenient, we can continue here). Fractaler (talk) 08:34, 11 January 2018 (UTC)
The advantage of using the singular form is the possibility of automatic text creation using the labels (#label instance is a #label class). Then, as WD will be linked to Wiktionary with the creation of new properties for the plural and the feminine, the label should be the masculine singular form. Snipre (talk) 12:36, 11 January 2018 (UTC)
I don't think I have any choice in using grammatical gender... And still there is no agreement that we should use instance of (P31) to chemical compounds at all. Wostr (talk) 15:31, 11 January 2018 (UTC)
Do we apply the grammar of the language in the grammar of the Wikidata? On what grounds? singular (Q110786)/plural (Q146786) is grammatical number (Q104083) (grammatical category of nouns, pronouns, and adjective and verb agreement that expresses count distinctions (such as "one", "two", or "three or more")). Does anyone else think that between 2 (3, 4, etc.) and 1 there is no difference? That between singular (Q110786)/plural (Q146786) there is no difference? If there are still doubts, then apply the grammatical categories in the sentences (Wikidata has sentence (Q41796)?). Yes, singular (Q110786)/plural (Q146786) refer to the sentence, to the full, absolute path (full object name (Q38667285), breadcrumb (Q846205))), and not to the Wikidata's "item/superitem" or "subitem/item", short object name (Q38667440). Just use this cool tool to check for a number. --Fractaler (talk) 14:05, 11 January 2018 (UTC)
@Snipre, Fractaler: Okay, let's reverse this problem. In WD we will have chemical classification of compounds based on structure. I am quite sure of this, as it is the basic classification in chemistry. So, we will have items that are equivalent to chemical classes/groups/or whatever you like to call them. Let's take an example here: there is a WD class that describes every compound having a pyridine ring in its structure, so it's something similar to commons:Category:Pyridines. Should this specific item be named (in English) pyridine or pyridines? (in Russian) пиридин or пиридины? This item will certainly include pyridine (Q210385) (but at this moment we cannot be sure of the exact relation, instance of (P31) or subclass of (P279)). What I can say from the Polish language perspective is that having it (in Polish) in the singular does not make sense, because a group/class of objects has to be plural, and with singular labels I would have to come up with some weird descriptions matching the singular labels (like kind/type of compound, which maybe sounds okay in English but not so well in Polish). Wostr (talk) 15:31, 11 January 2018 (UTC)
@Wostr: Who said that classification based on substructures was a good idea? When you say "we will have chemical classification of compounds based on structure", I think you have already concluded the discussion before it starts. Just go deeper into your example: if a chemical compound has a dozen functional groups, with your classification we will have a dozen instance of statements. Is that the correct way to do things? And then take a big protein with hundreds of functional groups and other substructures. Do you really think that using a classification can help for complex molecules? For me, we shouldn't use functional groups as classification criteria; as we are doing with elements, we should use a new property like "functional group" or "substructure".
Then for your example of pyridine, just write the relation between the compound pyridine (compound) and the pyridine (class). Can't you write in Polish pyridine (compound) is a pyridine (class) ? So if you can write that sentence with pyridine (class) in the singular form, why do you have so many problem to write it in the label ?
If you still have a problem, then use another, simpler concept: a dog is an animal. You can easily write animal in the singular form, and I think this is the case in Polish too. So can you still say after that "singular does not make sense"? This is why I want to use "instance of" at the lowest level of the classification, because with that rule you can always create sentences like A is a B, A is a C,... This allows an easy check of the classification links. I can always say carbon dioxide is a gas, is a chemical compound, is a chemical substance,... if I define carbon dioxide as an instance of something. If I define carbon dioxide as a subclass, how can I test the relations: carbon dioxide is a subclass of chemical compound, carbon dioxide is a subclass of chemical substance? This is not as clear as instance of. Snipre (talk) 22:48, 15 January 2018 (UTC)
@Snipre: chemical classification based on structure is used in every chemical database that I know of, so using it seems obvious to me; in other words, not using it would be a great loss. As for proteins: chemical classification based on structure is used for something defined as 'small compounds'; for proteins etc. other classifications are used. Using has part (P527) or another property is of no use here and would be incorrect: acridine (Q342713) does not have pyridine (Q210385) as has part (P527); what it has as a part is a pyridine ring without some hydrogen atoms etc.
I can write 'pyridine (compound) is a pyridine (class)', but: (1) the description won't match the label in the singular: pyridine (class) is a class of objects, so it seems natural that it should be plural and the description should have 'class of compounds etc.'; (2) is a is a synonym (one of many) for instance of, and it's much better IMHO to write 'pyridine (compound) is an instance of pyridines (class)' than 'pyridine (compound) is an instance of pyridine (class)'; (3) what you wrote is correct only for compounds, not for compound classes: 'dibenzopyridine (class) is a subclass of pyridine (class)'? (quite nonsensical to me). Wostr (talk) 23:02, 15 January 2018 (UTC)
@Wostr: Can you please point out where I wrote that we won't use structures as descriptors for chemicals? I never said we shouldn't use structure; I just say that structure should be described using properties other than instance of/subclass of. What is the big advantage of that? We won't mix structure classification with other classifications based on use or regulation. Just remember that ethanol is not only an instance of chemical compound or alcohol, but also a solvent, a drug, a fuel, ... Your structure classification based on instance/subclass will be mixed with dozens of other classifications.
"chemical classification based on structure is used in every chemical database" What is the point of doing what is already done in other databases? Will WD just be a mirror of ChEBI? You really show a poor innovative spirit if your argument is mainly "the others are doing it like that, so just do it like that". Following that spirit, WP and WD should never have been created, as reference encyclopedias already existed in the past. And finally, you show the same lack of ambition when you say that proteins should be classified in a different way than small molecules. Why do we have to do that? Shouldn't we be a little more open? Try new things? If you really want to copy ChEBI, then better extract the ChEBI classification directly into your wiki and avoid all the import and maintenance work related to keeping the ChEBI classification up to date in WD. It is nonsense to copy ChEBI into WD and then WD into WP:pl if you can do the import directly from ChEBI to WP:pl.
And for the label problem, your problem is the way you formulate the label: class of objects. Why can't you use as description pyridine (class) = any compound with a pyridine structure, or a compound with a specific substructure having 6 carbons ...? No need of plural for that. Snipre (talk) 22:45, 16 January 2018 (UTC)
@Snipre: starting from the last mentioned thing: yes, I can. But the version with 'any compound...' looks as if someone tried to adjust the description to a poorly chosen label, whereas the plural corresponds with e.g. IUPAC definitions; ChEBI also has no problem with using the plural together with is a ;) Secondly, this distinction (singular for compounds, plural for classes) naturally helps in choosing the right item.
Hmm, and what is the problem with mixing many classifications in subclass of (P279)/instance of (P31)? In most cases of chemical compounds there should be max a few classes, not dozens. Do I understand correctly that your point is to add P31:'chemical compound' everywhere and use some other property like 'chemical class' to add the structural classification? If so, could you propose such a property? It would be much easier for me to have some formal basis for adding classes (right now I use structural class of chemical compounds (Q47154513) and subclass of (P279)/instance of (P31), but I'm almost sure that I will have to modify this in the future; however, thanks to structural class of chemical compounds (Q47154513) it will be possible to easily obtain all items that are in fact classes, not compounds, and modify them).
And about my ambitions and innovative spirit (I assume that this is not argumentum ad personam): (1) the fundamental principle in Wikipedia is no original research, thus I'm not here for inventing anything, only for... repurposing what has been already invented to something what will be best suited to the WD needs. I have no illusions about that the few wikidatians would be able to invent unified and correct chemical classification for all chemical compounds including macromolecules – something that has not been created so far, even by rich chemical corporations and scientific institutions, and what has been described as impossible by many authors (because of too big differences between different forms of what chemists call 'chemical compound').
In my opinion, le mieux est l'ennemi du bien – we do not have any chance for the best, we don't have good. What we could have is something I would call reasonably good – and that means existent classifications like in ChEBI. What we have now in WD is nothing and chasing the best will leave us with nothing. Wostr (talk) 23:19, 16 January 2018 (UTC)
@Wostr: Sorry for the delay of my answer.
* In most cases of chemical compounds there should be max a few classes, not dozens. Wrong, if you have a chemical with 15 different functional groups, then using a classification based on structure you will have 15 instance of. Again you assume that the classification will be used only on small molecules which is not the case. Just think again that we have proteins and other big natural compounds so we have to consider them and not just saying that is another kind of classification: we need something which can cover all items under chemical compounds. So if you don't want to have protein in chemical compounds subclasses, please provide the classification tree integrating chemical compound and protein with the definition allowing to differentiate protein from chemical compounds. We shouldn't create different classifications but one classification.
And I won't create any property if we haven't agreed on the need for it. I don't want to start a process if nobody is convinced of its utility. The goal of the discussion is mainly to avoid any useless action by defining a priori what is necessary.
My point is first to start with chemical compound everywhere and then to start the creation of a classification which is not based on the usual functional group structure. For me, a hydroxyl group on a big molecule doesn't allow us to say that the molecule is an alcohol.
Why do you need to add instance of structural class of chemical compounds (Q47154513) ? You can retrieve all these items by a query looking for all subclass of chemical compounds including subclass of subclass. See
SELECT ?compound WHERE {
?compound (wdt:P279/wdt:P279) wd:Q11173
}
We are using a database, so instead of creating unnecessary structure, use database properties; in that case we can use queries instead of a useless classification.
So if you don't want to create something new, why do you want to import in WD something which is available and maintained out of WD ? ChEBI is one possible classification and not THE classification. If you want to follow the WP rules correctly so please do it completely: according to NPOV (neutral point of view) we can't apply an unique point of view from an unique reference like ChEBI. And finally I am wondering if WP:pl follow your rule of no original research when creating categories in WP. And as classification in WD is similar to category in WP, I think we can apply the same rules. Your argument is very poor because you criticize the unclear definition of chemical compound and you are using a less clear concept of structural class of chemical compounds (Q47154513). Are you really coherent ? I don't think so. Snipre (talk) 14:48, 22 January 2018 (UTC)
PS: I have nothing against you personally; I am just trying to find something which can convince me of your proposal.
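
As a side note on the query posted above: the path `wdt:P279/wdt:P279` matches items exactly two subclass steps below chemical compound (Q11173), whereas SPARQL 1.1 also offers `wdt:P279+` for "one or more steps". The minimal Python sketch below only builds the query text for either variant; the helper name `subclass_query` and its `transitive` flag are made up for illustration and are not part of any Wikidata tool, and reading "including subclass of subclass" as "any depth" is an assumption.

```python
def subclass_query(root_qid: str, transitive: bool = True) -> str:
    """Return SPARQL text selecting subclasses of the item `root_qid`.

    transitive=True  -> wdt:P279+ (one or more subclass-of steps)
    transitive=False -> wdt:P279/wdt:P279 (exactly two steps, as posted in the thread)
    """
    # Basic shape check so arbitrary text is not spliced into the query.
    if not (root_qid.startswith("Q") and root_qid[1:].isdigit()):
        raise ValueError(f"not a Wikidata item id: {root_qid!r}")
    path = "wdt:P279+" if transitive else "wdt:P279/wdt:P279"
    return (
        "SELECT ?compound WHERE {\n"
        f"  ?compound {path} wd:{root_qid} .\n"
        "}"
    )


if __name__ == "__main__":
    # Q11173 is the item id used in the query above (chemical compound).
    print(subclass_query("Q11173"))
```

The resulting text can be pasted into the query service; whether the transitive form returns a manageable result set for this item is not something the sketch checks.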
@Snipre: that is simply not true — e.g. in monoethanolamine (Q410387) there shouldn't be two separate classes (alcohols and amines), but only one (hydroxyamines). The whole concept of chemical classification based on structure is to create more specified subclasses if there are enough compounds sharing specific structure.
From your SPARQL I get a bunch of unrelated items being minerals etc., but there are in fact many classifications of chemical compounds that should be separated, e.g. structural class of chemical compounds (Q47154513) is about structural classification (maybe there should be 'structural' in the label), there is also functional classification already present in WD (like 'acids', 'bases', 'oxidants' etc.), there is classification based on use of compound (pigments, bla bla). Adding instance of (P31) with specific item about class (like structural class of chemical compounds (Q47154513)) is IMHO the easiest way to retrieve only items being classes in specific classification.
And the structural classification can be used on macromolecules too, but in different way (as it is done already in science): e.g. by indicating amino acids building blocks (not every functional group in every amino acid) or other macrostructural feature (not every funct. group that feature is composed of). But at some point transition between classification of small compounds and macromolecules, even the most smooth transition, is some kind of boundary between very accurate structural indication for small compounds and something like estimation for macromolecules).
And unclear definition of concepts like structural class of chemical compounds (Q47154513) is IMHO an advantage here (until we establish solid model for classification), because classification based on such items will be correct either we model chemical compounds as molecular entities or we choose to model these item as chemical substances. Wostr (talk) 16:48, 22 January 2018 (UTC)
Now for pyridine (Q210385): properties (mass (P2067), etc). It is mass (P2067) of what? 1 (pyridine, пиридин), >1 (pyridines, пиридины), compound, compounds, substance, substances?--Fractaler (talk) 17:13, 15 January 2018 (UTC)
@Fractaler: why is pyridine (Q210385) a problem here? The mass of pyridine is the mass of one molecule (if in daltons [u]) or the mass of a mole of molecules (if in grams per mole [g/mol]). But I don't understand why you are asking me this. The problem is not how to name pyridine (Q210385) (it will always be pyridine/пиридин because it's about the molecule/compound). The problem here is how to name pyridines (Q47317020), the item that describes a class of compounds = all compounds that have a pyridine ring in their structure (so this class includes e.g. bromazepam (Q422435), 2,4,6-trimethylpyridine (Q409155), 2,6-pyridinedicarboxylic acid (Q417164), 2,6-lutidine (Q209284), 4-methylpyridine (Q2189778) and many, many others). It's similar to Wikipedia categories: ru:Категория:Пиридины is for every compound from the pyridine class = compound having a pyridine ring in the structure. Wostr (talk) 19:38, 15 January 2018 (UTC)
pyridines (Q47317020) (pyridines, "class of chemical compounds with pyridine ring"): what does "compounds with pyridine ring" mean here? --Fractaler (talk) 09:10, 16 January 2018 (UTC)
@Fractaler: that compounds have pyridine ring (i.e. pyridine core without one or more hydrogen atoms) as part of their structure. Wostr (talk) 16:02, 16 January 2018 (UTC)
@Wostr: Is the structure of a compound the same as the structure of a molecule? Can a compound have a ring? Or does a molecule of pyrimidine have a pyridine ring?--Fractaler (talk) 18:35, 16 January 2018 (UTC)
@Fractaler: sorry, but I have an impression that either you're not a chemist, or we have too big language barrier here. Structure of compound is the same as structure of molecule (because in the most popular definition of compound, it is substance composed of one kind of molecules, and the terms 'structure of compound' and 'structure of molecule' are used interchangeably). And yes, compound/molecule can have a ring, e.g. toluene has benzene ring and methyl group. But no, pyrimidine does not have pyridine ring – pyrimidine ring has two heteroatoms in its ring, pyridine only one heteroatom in its ring. Wostr (talk) 18:48, 16 January 2018 (UTC)
@Wostr: substance composed of one kind of molecules. Agree. So, a substance consists of molecules. A molecule consists of atoms. So we have two levels, and of course objects on these levels cannot have the same structures (they are not fractals). --Fractaler (talk) 19:07, 16 January 2018 (UTC)
@Fractaler: I really don't know what are you getting at? IUPAC uses 'compound' in both meanings, as most chemists do. So e.g. 'alkynes' are a subclass of 'acetylenes' (both are chemical classes) — (1) 'alkynes' (molecules in which there is one C≡C) is a subclass of 'acetylenes' (molecules in which there is one or more C≡C); (2) 'alkynes' (substances composed of molecules in which there is one C≡C) is a subclass of 'acetylenes' (substances composed of molecules in which there is one or more C≡C). The exact meaning depends on the chosen definition and classification tree, but on this level remain the same. So, if we choose that we classify all 'chemical compounds' as 'molecules' (cf. discussion about definition of chemical compound), then classification will be based on this. If we choose otherwise ('chemical compounds' are 'substances composed of one kind of molecules), then classification will be the same, with the same definitions, and the same connections. Wostr (talk) 19:17, 16 January 2018 (UTC)
@Wostr: Is it IUPAC that makes a knowledge base, or we? IUPAC is just for rules, for notability, for the life of an item in the WD space, to make it legitimate. Also as a link to WD, a very cool source and so on for notability. --Fractaler (talk) 19:29, 16 January 2018 (UTC)
@Fractaler: at this point I think that further discussion makes really no sense, sorry. Wostr (talk) 19:31, 16 January 2018 (UTC)
@Fractaler: Wikidata's idea is to be a secondary source. The definitions of other authorities matter a great deal. ChristianKl❫ 14:32, 17 January 2018 (UTC)
Of course. I mean: no Wikidata:Notability (IUPAC, other cool sources), no life in WD for item. Fractaler (talk) 14:39, 17 January 2018 (UTC)
Ok, but idea that structure of molecule = structure of compound is wrong. We can ask other editors here about this. --Fractaler (talk) 19:44, 16 January 2018 (UTC)
@Wostr: IUPAC chooses not to use the term chemical compound and doesn't have any definition for it. It does have a concept of inclusion compounds that seems to me like it includes multiple molecules. ChristianKl❫ 14:32, 17 January 2018 (UTC)
After reading a bit more it seems that neither ChEBI nor IUPAC have a concept of a "chemical compound". Why should we have one? Would it make sense to rename the item into "pure substance"?ChristianKl❫ 15:30, 17 January 2018 (UTC)
/conflict/ @ChristianKl: yep, and that's why when IUPAC definition contains 'compound' (and IUPAC uses compound very often, yet without defining it), it's the matter of context which version (substance vs molecule) should be used – so in WD it is the matter of defining 'chemical compound' or the chosen model of chemistry top-level items the problem, i.e. aromatic compound (Q19834818) will always be a valid chemical class for e.g. benzene (Q2270) or pyridine (Q210385) – but both items can be modelled either as a substance or as a molecule (the discussion about how should we treat items about chemical compounds – as items about molecules or items about substances composed of such molecules – is still ongoing: Wikidata talk:WikiProject Chemistry/Proposal:Models). Wostr (talk) 15:38, 17 January 2018 (UTC)
Not rename, as 'pure substance' is not a synonym of 'chemical compound' (pure substance includes also chemical elements), but 'chemical compound' can be just ignored in modelling chemical items (but it is still notable concept and should have its item in WD) and other terms may be used. Wostr (talk) 15:40, 17 January 2018 (UTC)
If it's a notable concept, why doesn't ChEBI and IUPAC define it? It seems to me like they don't have a concept in their database with that name because the term has no clear meaning and is used with different meanings and they prefer to have terms with clear meanings.
A "inclusion compound" for example is not a single molecule or even multiple molecules of the same type but a complex. ChristianKl❫ 18:17, 17 January 2018 (UTC)
@ChristianKl: It is one of most basic concepts in chemistry, but... it has more than one definition (some widely used, some used only in narrow fields). And I don't advocate for using 'chemical compound' concept as a base for classification – but using 'compound' as a part of many terms and definitions is unavoidable.
And you are of course right that IUPAC does not want to define it – if clear definition is required, IUPAC uses e.g. 'molecular entity', 'chemical substance', 'chemical species' etc. But when this is not required (i.e. when something can be related to both 'molecular entity' and 'chemical substance'), IUPAC uses 'chemical compound' a lot. This is the case of the whole chemical nomenclature, where chemical names are not limited to molecular entities but are valid also for chemical substances composed of such entities.
You are right too about the fact that the most widely used definition of a compound ('chemical substance composed of molecules') is not strictly used by IUPAC. And there are many more similar examples: salts and hydrates are not molecules either (but are usually included in something called 'chemical compounds'). Frustrated Lewis pairs are not compounds either, etc. etc.
But I think that has nothing to do with the title problem ;) You can write about this in Wikidata talk:WikiProject Chemistry/Proposal:Models and I would be grateful if you did so, because there are not many users in this WikiProject (that are active in the discussions) and I think it would be easier to come to conclusions with more users.
And, frankly speaking, I do not know why this discussion in this topic came to this point. Fractaler is asking question, which are IMHO not related to the title problem and I just don't know what is wrong or what are his proposals. I decided to not participate in discussions with him, as I see that he's indefinitely blocked on his home project for disrupting actions and I really don't have time to solve his enigmatic and philosophical questions. Wostr (talk) 23:44, 17 January 2018 (UTC)
@Wostr:I see that he's indefinitely blocked on his home project for disrupting actions: You see? for disrupting actions? Proof, please. Otherwise I will consider it as the distribution of unverified information (real chemists, like other real scientists, fake information do not spread). And how can you discuss something here, if you use the terms (plural label, singular label), which have not yet been defined? Mankind has long passed the stage of such philosophical discussions. --Fractaler (talk) 09:30, 18 January 2018 (UTC)
@Fractaler: sorry, I really don't have time to answer questions/comments that (1) are not related to the problem or I can't find any relation to the problem – maybe there is some relation, but I wrote a few times, that I don't understand how your questions are related to anything in this topic; (2) are full of Q's and P's (like your comment on January 11th about grammatical numbers) what makes them really hard to understand; (3) includes only enigmatic questions without any proposals or even indications of what is wrong. Since I'm a volunteer, like all of us here, I also have the right not to answer to your questions and comments — the right I intend to use until your questions will be substantive and clearly related to the problem. As for the proof you want: your block log on ru.wiki is not a secret. Wostr (talk) 18:48, 18 January 2018 (UTC)
Excuse me, yes, I have such a drawback (to lead a person to an idea). Ok, my idea was (I probably needed to say this at the beginning): pyridine (Q210385) (pyridine, molecule (Q11369) (molecule), молекула пиридина) is component (Q1310239) (component, компонент) of pyridines (set of all pyridine molecules, пиридины, группа всех пиридиновых молекул).
block log said "disrupting actions" and I see disrupting actions (in my opinion, maybe this is the result of my poor knowledge of English) it sounds differently (has a different meaning). Fractaler (talk) 07:45, 19 January 2018 (UTC)
@Fractaler: so your idea is something like:
1. existing item pyridine (Q210385) would describe one molecule (molecule (Q11369)) of pyridine
2. there will be also another item (not existing now) that would describe pyridine/pyridines as a portion of matter (chemical substance (Q79529) → set of molecules of pyridine (Q210385))
Do I get this right? Wostr (talk) 19:55, 19 January 2018 (UTC)
Yes! (again sorry for using that bad method "step by step"). --Fractaler (talk) 20:04, 19 January 2018 (UTC)
Okay, so there is also the third thing. The 1st is 'item about one molecule' (this is obviously singular, like pyridine (Q210385)), the 2nd is 'item about set of molecules of one type' (the grammatical number here is not so obvious: if it was about countable number of molecule, then it would certainly be plural, like two pyridines, five pyridines etc., but as it's a set of uncountable and not specified number of molecules, the grammatical number may be language-specific, e.g. in Polish it would be singular).
The above two things can be discussed here: Wikidata talk:WikiProject Chemistry/Proposal:Models (I think there was a comment similar to your idea). But it's not clear yet whether we should use only the 1st (molecule items), only the 2nd (substance items) or maybe both.
The 3rd thing is: pyridines (Q47317020). It's not the 1st nor the 2nd thing. This is for grouping different kind of molecules (or substances if we choose not to use molecule items at all) into classes on the basis of their structure – so it's some abstract class, not something what can be achieved in reality. Let's take an example based on your idea:
But all three items are pyridines (Q47317020) = their molecules have pyridine ring in structure (or rather all three items belongs to chemical class named pyridines (Q47317020)). So it's here the problem: should this 3rd abstract class be пиридин or пиридины? Wostr (talk) 00:14, 20 January 2018 (UTC)
@Wostr: about grammatical number may be language-specific - it seems that while it is necessary to point at the place of the label: one molecule X, a small amount of (small) molecules X, many molecules (large group) X, all molecules X (the whole group) X.
about based on your idea - this is not my idea, it is the idea of/from chemical elements. There we had the same problem (the problem was due to homonymy (Q21701659)) - an element or elements, an atom or atoms, an isotope or isotopes, etc. As soon as the words are indicated (the totality of all atoms of a certain kind, the atom from this set, the molecule of such atoms, the totality of such molecules), the problem disappears. So, α-picoline (Q2216745), 4-ethylpyridine (Q27257452), triprolidine (Q417654),..., X = 1) molecule of X, 2) all molecules of X, 3) something other (because homonymy (Q21701659) now).
their molecules have pyridine ring in structure, so, how about "molecule with pyridine ring in structure"/"molecules with pyridine ring in structure"?
So, to continue, is it better to move into the project? Or to continue here, and make specific proposals there? Fractaler (talk) 09:37, 20 January 2018 (UTC)

## potassium ferrocyanide

We have potassium ferrocyanide (Q422017) and potassium ferrocyanide (Q27279279). In the 1st we have all sitelinks and Wikipedia-imported properties about anhydrous form and PubChem-imported properties about trihydrate, the 2nd is PubChem-created item for anhydrous form. Which is better: move all properties, sitelinks, labels etc. about anhydrous form to potassium ferrocyanide (Q27279279) or move PubChem-imported data between these two items? (we have also potassium ferrocyanide (Q27110378), but this has to be merged when the above problem is fixed). Wostr (talk) 20:27, 27 January 2018 (UTC)

Usually the correct way is to respect the label: if the label is different from the data, this often means that the data import was not correct as people don't check what is the real compound defined by the item. Snipre (talk) 08:46, 29 January 2018 (UTC)
Okay, I'll fix this that way, thanks. Wostr (talk) 18:57, 30 January 2018 (UTC)
potassium ferrocyanide (Q422017), potassium ferrocyanide (Q27279279) and potassium ferrocyanide (Q27110378) merged; data about trihydrate moved to new item potassium ferrocyanide trihydrate (Q47520593). It appears that the 1st step was that some bot added wrong PubChem CID (but correct CAS#), then another bot changed data (including CAS#) based on PubChem CID.
However, in potassium ferrocyanide (Q422017) there are now doubled ChemSpider IDs, InChI and SMILES; that is because in databases like PubChem/ChemSpider there are different entries for the same compounds, represented in different ways (like [3] and [4] in this case). I don't know whether I should delete one ID or maybe add some exceptions to the unique value constraint? Wostr (talk) 19:29, 30 January 2018 (UTC)

## Help I made a mess!

Hi. I'm not sure if this is the best place to ask, but it has been brought to my attention that a big edit I did on a batch of items for human genes has gone wrong. I was trying to copy the en aliases to cy using QuickStatements, but some aliases have become fragmented, so '9-cis,12-cis,15-cis-octadecatrienoic' has become '9-cis', '12-cis' and '15-cis', for example. At the very least I need help to get these removed so that I can start from scratch. Or if someone knows how to transfer the aliases programmatically, that would be even better. I can provide a list of Q's for the affected items. Please can someone help? Best Jason.nlw (talk) 16:34, 30 January 2018 (UTC)

If you can create a list of the bad aliases along with the list of Q's then a bot could remove those aliases. Or if you think it's ok a bot could remove all the cy-language aliases for those items. About how many items were affected? ArthurPSmith (talk) 16:57, 30 January 2018 (UTC)
It affects up to 888 items and around 9000 aliases (genes have a lot!). I couldn't easily prepare a list of affected aliases, so I think it would be best to remove all cy aliases from those 888 items and I will start again. None of these had aliases before my edit yesterday. Here is a list of the items. Thanks for your help! Jason.nlw (talk) 17:23, 30 January 2018 (UTC)
Hmm, I've been trying to fix these, but so far the bot approaches I've taken don't work. Something special about removing aliases? Neither QuickStatements nor WikidataIntegrator seems to want to do that. I will look at it again, but maybe you should post a request on Wikidata:Bot requests to get it done sooner... ArthurPSmith (talk) 19:57, 31 January 2018 (UTC)
Thanks ArthurPSmith, i will post a bot request as suggested. Thanks again for your help. Jason.nlw (talk) 09:54, 1 February 2018 (UTC)
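
For a fresh attempt at copying the aliases, one option is to generate the batch so that each alias is a single quoted value on its own line and is never split on commas. The sketch below is only an illustration: it assumes the tab-separated QuickStatements V1 line format (`item`, `A` plus language code, quoted string), the thread does not say what actually caused the fragmentation, and the function name `alias_commands` and the item id used in the example are placeholders.

```python
def alias_commands(qid: str, lang: str, aliases: list[str]) -> list[str]:
    """Build one tab-separated alias command per alias, keeping commas intact."""
    lines = []
    for alias in aliases:
        text = alias.strip()
        if not text:
            continue  # skip empty entries rather than emit an empty alias
        if '"' in text:
            # How embedded double quotes should be escaped is not covered here.
            raise ValueError(f"alias contains a double quote: {text!r}")
        lines.append(f'{qid}\tA{lang}\t"{text}"')
    return lines


if __name__ == "__main__":
    # Placeholder item id; the alias is the example from the thread.
    for line in alias_commands("Q4115189", "cy", ["9-cis,12-cis,15-cis-octadecatrienoic"]):
        print(line)
```

Reviewing a handful of generated lines before submitting the whole batch would show quickly whether comma-containing aliases survive as one value.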

## chemical formula (P274)

It is now set with single value constraint (Q19474404), but clearly there is more than one way to write chemical formula (I noticed that when I tried to add inorganic formula [cation-anion]). Shouldn't this be removed? And the Hill formulas should be tagged with criterion used (P1013) = Hill system (Q900739) (and maybe other formulas too, but with different items like 'inorganic formula' [I don't know at this moment whether this kind of formulas have any official name])? So it would be mandatory qualifier constraint (Q21510856) and one-of constraint (Q21510859) (Hill system (Q900739) and others). Wostr (talk) 21:00, 30 January 2018 (UTC)

@Wostr: We already discussed the problem a little (see Wikidata_talk:WikiProject_Chemistry#New_property_for_composition). First step: define the different kinds of chemical formulas in order to see whether new properties are required or not. Then, if no new property is needed, we should replace the constraint of a unique chemical formula by a constraint requiring the qualifier criterion used (P1013) with the different possibilities. We can perhaps do a bot request to add criterion used (P1013) = Hill system (Q900739) to all chemical formula (P274) statements having PubChem as reference. Snipre (talk) 08:18, 31 January 2018 (UTC)
I'll check in a few days if there are any official names for different formula writing systems. Wostr (talk) 15:37, 31 January 2018 (UTC)
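
For reference, the Hill system ordering discussed here (carbon first, then hydrogen, then the remaining elements alphabetically; purely alphabetical when there is no carbon) can be sketched as follows. This is an illustrative sketch only: the function name is hypothetical, and it assumes simple formulas written with ASCII digits and without brackets or charges, whereas P274 values use subscript characters.

```python
import re


def hill_formula(formula):
    """Reorder a simple formula (no brackets, no charges) according to the Hill system."""
    counts = {}
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0) + (int(number) if number else 1)
    if "C" in counts:
        # Carbon first, hydrogen second, everything else alphabetically.
        rest = sorted(s for s in counts if s not in ("C", "H"))
        order = ["C"] + (["H"] if "H" in counts else []) + rest
    else:
        # No carbon: all elements alphabetically, hydrogen included.
        order = sorted(counts)
    return "".join(s + (str(counts[s]) if counts[s] > 1 else "") for s in order)


print(hill_formula("NaCl"))      # cation-anion order differs from the Hill order
print(hill_formula("CH3CH2OH"))
```

The first call illustrates why one item can legitimately carry two formula strings: the cation-anion form NaCl and the Hill form ClNa describe the same compound.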

## Auxiliary matching for the COSING Cosmetic Dataset

Hi Alchemists,

After getting a property for COSING created, I've imported the COSING IDs for matching in Mix N'Match. COSING is a huge cosmetic dataset used EU wide (and beyond). It has a lot of interesting information about stuff we put daily on our bodies, and that is essential to parse cosmetic ingredient lists for Open Beauty Facts.

Some automatic matches have been made by M&M, but it would increase reliability of matches if we could use the following columns as sanity checks, and perhaps to automate matching.

INCI name, INN name, Ph. Eur. Name, CAS No, EC No

They are available as open data on the EU Open Data Portal (https://data.europa.eu/euodp/data/dataset/cosmetic-ingredient-database-ingredients-and-fragrance-inventory/resource/33aa4726-d05c-4756-ad91-6c6297de9771) , and in the target page on the EU Commission website (http://ec.europa.eu/growth/tools-databases/cosing/index.cfm?fuseaction=search.details_v2&id=74153)

A step further would be to import the additional columns as statements (e.g. HUMECTANT, MASKING, SKIN CONDITIONING, SOLVENT properties, SCCS opinions…)

Poke @Snipre @Magnus Manske -   Notified participants of WikiProject Chemistry Teolemon (talk) 19:55, 1 January 2018 (UTC)

@Teolemon: Optimal importation process is the following:
1) extract WD data in the format Q number/CAS number/EC number in a table A
2) extract EU data in the format INCI name/INN name/Ph. Eur. Name/CAS number/EC number in table B
3) match lines in tables A and B having both same CAS number/EC number and create a new table C with the format Q number/CAS number/EC number/INCI name/INN name/Ph. Eur. Name/function (eg: HUMECTANT, MASKING, SKIN CONDITIONING,SOLVENT properties, SCCS opinions…)
4) match lines in tables A and B having CAS number or EC number but not both and create a new table D with the format Q number/CAS number/EC number/INCI name/INN name/Ph. Eur. Name/function (eg: HUMECTANT, MASKING, SKIN CONDITIONING,SOLVENT properties, SCCS opinions…)
5) Table D has to be checked in order to verify whether the WD items are correctly defined, and missing or wrong data have to be added/corrected. Once completed/corrected, lines of Table D can be added to Table C.
6) Table C can be imported after a bot request
7) Table C has to be saved somewhere and used as reference for periodic data checks in WD to identify vandalism or wrong data handling. Snipre (talk) 23:58, 2 January 2018 (UTC)
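
Steps 3 and 4 of the process above can be sketched as follows. This is a rough illustration, not a finished import script: the function name, the dictionary keys and the Q-number placeholders are hypothetical, and the sample rows are for demonstration only.

```python
def match_tables(table_a, table_b):
    """Split WD/EU row pairs into full matches (table C) and partial matches (table D)."""
    table_c, table_d = [], []
    for wd in table_a:
        for eu in table_b:
            same_cas = bool(wd["cas"]) and wd["cas"] == eu["cas"]
            same_ec = bool(wd["ec"]) and wd["ec"] == eu["ec"]
            # Keep the WD values next to the EU values so table D can be reviewed by hand.
            row = {"qid": wd["qid"], "wd_cas": wd["cas"], "wd_ec": wd["ec"], **eu}
            if same_cas and same_ec:
                table_c.append(row)   # step 3: both identifiers agree
            elif same_cas or same_ec:
                table_d.append(row)   # step 4: only one identifier agrees
    return table_c, table_d


# Table A: WD extract (placeholder Q numbers). Table B: EU extract.
table_a = [
    {"qid": "Q_A", "cas": "56-81-5", "ec": "200-289-5"},
    {"qid": "Q_B", "cas": "7732-18-5", "ec": None},
]
table_b = [
    {"inci": "GLYCERIN", "cas": "56-81-5", "ec": "200-289-5"},
    {"inci": "AQUA", "cas": "7732-18-5", "ec": "231-791-2"},
]
table_c, table_d = match_tables(table_a, table_b)
print([row["qid"] for row in table_c], [row["qid"] for row in table_d])
```

Rows in table D are only candidates: as step 5 says, they still need a manual check before being moved to table C.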
Thanks a lot. Daunting task, but at least the plan is clear :) Teolemon (talk) 12:52, 23 January 2018 (UTC)
The Cosing databases currently comprise 25937 ingredients (with the INCI name) and substances (with the chemical name) that are components of perfumes or subject to restrictions in the annexes of the European regulation. The INCI names are decided and defined by the INC, and Cosing reports only those recognized in the EU, just as the IECIC database reports those recognized in China. The Cosing database is periodically updated; over 10000 new ingredients were added a year ago. The Cosing number alone does not clearly identify an ingredient as long as the INCI name is missing as an identifier in Wikidata. I proposed to include the INCI name as a property.--Rodolfo Baraldini (talk) 12:37, 3 March 2018 (UTC)

## peramivir hydrates

Are these two (peramivir trihydrate (Q47495829) and peramivir tetrahydrate (Q27158395)) duplicates? Some ext-ids indicate so, but e.g. PubChem pages are for trihydrate and tetrahydrate (but maybe it's a mistake – names in PubChem indicate trihydrate)? Wostr (talk) 21:28, 23 February 2018 (UTC)

Never look at the names in PubChem: they are not under the control of the PubChem team, and nobody checks whether the name matches the formula.
PubChem has 2 different structures (see InChIKey), so 2 items are necessary. So the correct way is to delete all redundant IDs and to rename peramivir tetrahydrate (Q27158395) to peramivir tetrahydrate. Snipre (talk) 22:57, 23 February 2018 (UTC)
Okay, thanks for the changes. Wostr (talk) 00:12, 27 February 2018 (UTC)

## New Wikidata aware <chem/> tags

In wikitext, <chem /> tags represent chemical sum formulae. For instance, <chem>H2O</chem> is rendered as

${\displaystyle {\ce {H2O}}}$

represents Q283. However the rendering mechanism has some issues. Most fundamentally it is based on mhchem version 2, which is not optimal and can not be updated to mhchem v3. Maybe @mhchem: can add a link to the details about the incompatibilities.

In Wikidata there is already a property P274 which expresses the sum formula as UTF chars. This information can currently be displayed using either the parser function invoke or the lua module wikidata.

My goal is to improve the situation by adding a better version of chem tags and using information from Wikidata in Wikipedia. I would like to find a page where I can interact with the community™ to brainstorm how this could be done best. Is this the correct place? From a technical perspective, I see two main questions:

1. What grammar should be used to encode chemical structures?
2. Where should the data be stored (inside the tag or in wikidata)?

--Physikerwelt (talk) 12:01, 21 February 2018 (UTC)

@Physikerwelt: This is probably the right place to discuss this. What do you see as the problem(s) with chemical formula (P274)? We also have general formula (P1673) which works similarly, and chemical structure (P117) which links to an image. ArthurPSmith (talk) 18:26, 21 February 2018 (UTC)
Physikerwelt, what do you mean by grammar in relation to chemical formulae? Wostr (talk) 19:33, 21 February 2018 (UTC)
I am not entirely sure if there are problems with the properties mentioned above. I clearly see the svg image as a disadvantage since it's hard to change. For instance adding another element to the example linked in chemical structure (P117) [5] would require the user to download the svg change it and upload a new svg and link that. With mhchem one can express more than just sum formulae such as "chemical equations" [6] ${\displaystyle {\ce {CO2 + C -> 2 CO}}}$  or even more complex structures
${\displaystyle {\ce {Zn^{2}+<=>[+2OH^{-}][+2H^{+}]{\underset {amphoteres\ Hydroxid}{Zn(OH)2\downarrow }}<=>[+2OH^{-}][+2H^{+}]{\underset {Hydroxozikat}{[Zn(OH)4]^{2-}}}}}}$

Are there any SPARQL queries using chemical properties? They would probably give a better intuition of what is already good and what could be improved. --Physikerwelt (talk) 16:36, 26 February 2018 (UTC)
And how is it possible to obtain such a structural formula in mhchem? Wostr (talk) 00:01, 10 March 2018 (UTC)
@Physikerwelt: A good thing would be to discuss a new « chemical formula » datatype with the dev team : @Lea Lacroix (WMDE): author talk page 18:41, 21 March 2018 (UTC)

## Beilstein numbers

I wonder why we have Reaxys registry number (P1579) named in a way that suggests it contains numbers from the Beilstein database? As far as I know there are no such numbers right now, and Elsevier's Reaxys uses only 'Reaxys Registry Number'. What's more, this property is used in items that are not in Beilstein but in other databases available through Reaxys; also, some of these numbers come from sources where it is indicated that the number is a Reaxys Registry Number, not a Beilstein one. I think we should rename this property accordingly. Wostr (talk) 00:12, 27 February 2018 (UTC)

@Wostr: Do you know if the Reaxys database uses the same numbers as the Beilstein database ? I mean, if compound x had Beilstein number YYY in the Beilstein database, does Reaxys reuse that number YYY as Reaxys number ? Snipre (talk) 14:18, 13 March 2018 (UTC)
AFAIK (from the time I had access to Reaxys, which was about half a year ago) there is no such thing as Beilstein database or Beilstein numbers right now – there is only Reaxys (with all the information from Beilstein, Gmelin etc. databases included in it and Reaxys numbers). But to be sure, I will ask a person who has this access. Wostr (talk) 14:45, 13 March 2018 (UTC) PS Also, sometimes there are both ids (Reaxys and Beilstein) in ChEBI and both are the same number, but nevertheless I asked about it. Wostr (talk) 14:51, 13 March 2018 (UTC)
Yep, the Beilstein RN is now Reaxys RN (or Reaxys ID) and the numbers are the same. Also, some examples: in ChEBI [7], [8]; a source [9]; funnily, we have some Beilstein RN imported from sources where it is described as Reaxys RN, see e.g. sulfuric acid (Q4118). I think we should relabel this property and Beilstein RN should be an alias. That's, however, not the case for Gmelin numbers (Gmelin number (P1578)), as these are different from Reaxys/Beilstein numbers. Wostr (talk) 16:59, 13 March 2018 (UTC)
Yes, Reaxys is a superset of Beilstein now, using the same numbers. Most of these Beilstein numbers I imported from ChEBI. So maybe we should relabel Beilstein to Reaxys. Sebotic (talk) 13:33, 5 April 2018 (UTC)

## EC Inventory

The EC Inventory is a database that contains 106,211 unique substances/entries. Has it been (partially/fully) imported? EC number (P232) is currently used in 20,339 items. --Leyo 12:08, 9 April 2018 (UTC)

@Leyo: No, and I prefer to avoid any large data import before a good curation of the existing items:
- we still have 1122 items sharing the same CAS number and 196 items with 2 different CAS numbers (see report)
- 82 items sharing the same EC number (see "Single_value"_violations report)
- 88 items sharing the same InChIKey and 396 items having 2 different InChIKey (see [10])
Just adding large amount of data in the current situation will create more mess.
If you really want to work with the above source, you can extract the EC number and the CAS number from WD items having exactly one value for each of these two properties and check whether both values are the same in the EC Inventory database, then create a list of conflicts and we will curate that list. Snipre (talk) 13:45, 9 April 2018 (UTC)
Items with CAS number issues or having EC numbers already shall not be changed.
Unfortunately, I am not really skilled in doing tasks like the one you proposed efficiently. --Leyo 14:20, 9 April 2018 (UTC)
@Leyo: So you can see what is the future need for WD: datasets comparison and analysis of possible matching: if we have 4 datasets and for one entry, 3 datasets have the same data, can we conclude that the entry is the same for all datasets ? And can we do the same if only 2 datasets have the same data ?
But before doing that kind of job we have to clean our reference dataset, WD, and be sure that we don't have 2 items for the same chemical or one item mixing data about 2 chemicals. Snipre (talk) 14:52, 9 April 2018 (UTC)
Just to be clear: I was not suggesting to create any new items, but to import the EC number to existing items lacking an EC number, based on the CAS number in an item. Items with CAS number issues are to be skipped. I don't think that such an import would cause many issues. If so, I will fix them manually. --Leyo 15:00, 9 April 2018 (UTC)
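
The import described in this comment can be sketched as follows. It is only an illustration under stated assumptions: the function name, the tuple layout and the placeholder Q numbers are hypothetical, and "CAS number issues" is taken to mean an item with more or fewer than one CAS number, or a CAS number shared by several items.

```python
from collections import Counter


def propose_ec_additions(items, ec_inventory):
    """items: (qid, cas_numbers, ec_numbers) tuples; ec_inventory: dict CAS -> EC."""
    cas_usage = Counter(cas for _, cas_list, _ in items for cas in cas_list)
    proposals, skipped = [], []
    for qid, cas_list, ec_list in items:
        if ec_list:
            continue  # item already has an EC number: leave it unchanged
        if len(cas_list) != 1 or cas_usage[cas_list[0]] > 1:
            skipped.append(qid)  # CAS number issue: needs curation first
            continue
        ec = ec_inventory.get(cas_list[0])
        if ec:
            proposals.append((qid, ec))
    return proposals, skipped


# Placeholder Q numbers; Q_C and Q_D share a CAS number and are therefore skipped.
items = [
    ("Q_A", ["64-17-5"], []),
    ("Q_B", ["67-64-1"], ["200-662-2"]),
    ("Q_C", ["56-81-5"], []),
    ("Q_D", ["56-81-5"], []),
]
ec_inventory = {"64-17-5": "200-578-6", "67-64-1": "200-662-2", "56-81-5": "200-289-5"}
proposals, skipped = propose_ec_additions(items, ec_inventory)
print(proposals, skipped)
```

The skipped list doubles as a worklist for the curation that has to happen before such items can receive data.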
@Leyo: This is not only a question of new items, this is a question of adding the data to the right place. You have in any case to make a choice in the data import process:
• use the CAS numbers in WD as matching parameter and then add the corresponding EC number from the EC inventory database
• use the EC numbers in WD as matching parameter and then add the corresponding CAS number from the EC inventory database
In each case you need to curate the existing items having some constraint violations before being able to run that import process. If you have 2 items with the same CAS number, do you want to add the EC number to both items without checking whether the CAS number is correctly used ?
If you try to use the name or the chemical formula to match the WD items with the EC Inventory database, in the best case you will find no correlation, in the worst case you will add the data to the wrong item (typical example: an item with the English label describing an isomer but the item data describing the isomer mixture).
If you want to be convincing about the relevance of your proposition, perhaps you can describe the process you will use to add the data ? Just to explain my position: one year ago, more than 1000 constraint violations were reported for CAS numbers. With the help of several contributors, we were able to reduce that number to less than 600. I don't want to see that number growing again just because someone wants to add data without taking care of the consequences. I am direct because I spent a lot of time curating data and I am tired of trying to improve WD while others just play with data without any care.
I prefer few data with low errors than a lot of data with a lot of errors. Snipre (talk) 19:58, 9 April 2018 (UTC)
Most of your questions have already been answered. Didn't I express myself clearly? --Leyo 12:46, 10 April 2018 (UTC)
@Leyo: Sorry, I missed the "Items with CAS number issues are to be skipped". I would propose to do the inverse: use the EC number as matching parameter and add the CAS number. The CAS number is not a reliable parameter, especially not in WD. Snipre (talk) 11:16, 13 April 2018 (UTC)
Well, I intend adding EC numbers. There are currently 72,137 items with a CAS number, but only 20,336 with a EC number. I wonder how many items contain the latter, but not the former. --Leyo 12:14, 13 April 2018 (UTC)
The problem is that CAS numbers are not reliable, mainly because we don't have an official open source for CAS numbers. Snipre (talk) 13:47, 13 April 2018 (UTC)
By the way can you extract the ECHA InfoCard ID from ECHA database and add it to the corresponding EC number ? Snipre (talk) 11:19, 13 April 2018 (UTC)
A while ago, ECHA InfoCard ID (P2566) was added to items based on the CAS number by a bot. --Leyo 12:14, 13 April 2018 (UTC)

Is tetrakis(triphenylphosphine)lead (Q27284745) supposed to be a Pd or a Pb compound? It links CID 91667687, which is erroneous in that sense, i.e. a mishmash. --Leyo 16:30, 20 April 2018 (UTC) PS. It is potentially a duplicate of tetrakis(triphenylphosphine)palladium(0) (Q2366402).

It seems that's an erroneous entry imported from external database and I think there are two options: (1) if the lead compound exists, we should update tetrakis(triphenylphosphine)lead (Q27284745) respectively (by removing some properties/moving to tetrakis(triphenylphosphine)palladium(0) (Q2366402) etc.), (2) we could merge these two items and deprecate some erroneous data (also with Wikidata reason for deprecation (Q27949697) and proper value; we have applies to other chemical entity (Q51734763), so I think we should also have something like erroneous entry in external source or more specific reasons, because this is not an isolated case and we had and will have issues like this). As I can't find anything about this lead complex I'd choose the second option. Wostr (talk) 17:09, 20 April 2018 (UTC)
@Leyo, Wostr: Better report the error to PubChem team and see what is the answer. Email of PubChem: info@ncbi.nlm.nih.gov. Please indicate the CID of the palladium complex too. Snipre (talk) 22:26, 20 April 2018 (UTC)
I did so. --Leyo 07:52, 25 April 2018 (UTC)

## GHS data after creation of Property:P4952

As I see that Snipre is making some progress in relation to this property, I have to ask about the proper value in safety classification and labelling (P4952), because the proposition that we should use e.g. safety classification and labelling (P4952) = Regulation (EC) No. 1272/2008 (Q2005334) may cause some problems. I'm placing this in subsection, because I'm planning to compile a list of needed changes and needed new items, which I place in the next subsections to discuss.

### 1. Value in Property:P4952

If we use safety classification and labelling (P4952) = Regulation (EC) No. 1272/2008 (Q2005334) in items, it can have some implications in the future, because very few people understand which H-phrases one should choose from the source and place in WD. As an example for further discussion, the GHS classification and labelling for 2,2,4-trimethylpentane (Q209130) taken from Sigma-Aldrich SDS for European Union, relatively up-to-date (2017) [11]:

• classes and categories (classification): Flammable liquids (Category 2); Aspiration hazard (Category 1); Skin irritation (Category 2); Specific target organ toxicity - single exposure (Category 3); Acute aquatic toxicity (Category 1); Chronic aquatic toxicity (Category 1)
• H-phrases (classification): H225, H304, H315, H336, H400, H410
• H-phrases (labelling): H225, H304, H315, H336, H410
• EUH-phrases (labelling): none
• P-phrases (labelling): P210, P261, P273, P301 + P310, P331, P501
• GHS pictograms (labelling): 02, 07, 08, 09
• signal word (labelling): Danger

So, the options I see are:

1. use safety classification and labelling (P4952) = Regulation (EC) No. 1272/2008 (Q2005334)
• it will have to be clearly indicated that P728 (P728) is only used for: H-phrases (labelling).
• in this option it will not be possible to add both classification and labelling data in one item (so the TomT0m's method for classification using subclass of (P279) would have to be adopted).
2. use safety classification and labelling (P4952) = GHS labelling (Q50490754) (and if we agree to add GHS classification using P4792, also safety classification and labelling (P4952) = GHS classification (Q50490688))
3. use safety classification and labelling (P4952) = Qxxx (Qxxx created as a subclass of e.g. Regulation (EC) No. 1272/2008 (Q2005334) and GHS labelling (Q50490754): GHS labelling according to CLP Regulation)
• there will be no need for qualifiers, but we would need a few new items for each document (USA, EU, Japan, etc., etc.)
• if we agree to add GHS classification using P4792, we would have two items for each country, e.g. Qxxx: GHS labelling according to CLP Regulation and Qyyy: GHS classification according to CLP Regulation.

But maybe there is some other way which I don't see? Or maybe some problems may be eliminated in a way I'm not familiar with? Wostr (talk) 19:04, 14 March 2018 (UTC)

@Wostr: Do we need to make the distinction ? You never find all labelling data (signal word, GHS pictograms, H-phrases, P-phrases, EUH-phrases) under classification, so if you have only H-phrases without other data, this means that the editor took the information from the wrong section. Then if the editor mixed H-phrases from the classification section and other labelling data from the labelling section, this is not our fault: if someone doesn't understand the difference between both sections, then we can't teach everyone about everything. I prefer to specify the rules of use in the property page (meaning that P4952 used with Regulation (EC) No. 1272/2008 (Q2005334) implies only labelling data from the labelling section) and that's it. Snipre (talk) 14:23, 21 March 2018 (UTC)
@Snipre: the problem is that I've corrected dozens of GHS data in Wikipedia, because someone added wrong H-phrases (because they didn't know there is a difference etc.), so that's why I am a bit oversensitive on this. And we don't have to make the distinction by safety classification and labelling (P4952) = GHS labelling (Q50490754); we can agree that P728 (P728) should be used for labelling H-phrases and add some complex constraints (that would catch situations where there is a probability that classification H-phrases have been added; if it's possible of course to make such constraints, e.g. if there is Hxxx and Hyyy then...). That may however be kind of confusing if we agree in the future that classification (classes, categories) should be added by safety classification and labelling (P4952) too – then it should be noted somewhere that H-phrases in safety classification and labelling (P4952) are for labelling and H-phrases for classification have to be taken from GHS categories items by some query. Maybe Wikidata usage instructions (P2559) can be of some use here. Wostr (talk) 17:57, 21 March 2018 (UTC)
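
The kind of "if there is Hxxx and Hyyy then..." check mentioned above could look like the sketch below. It is only an assumption-laden illustration: the single rule (H400 should not appear on a label next to H410) is read off the 2,2,4-trimethylpentane example in this thread and is not a complete rule set, and the names are hypothetical.

```python
# (phrase, made redundant on the label by): assumption derived from the example
# in this thread, where H400 appears under classification but not under labelling.
LABEL_REDUNDANT_PAIRS = [("H400", "H410")]


def suspicious_label_phrases(h_phrases):
    """Return H-phrases suggesting that classification data was entered as labelling."""
    present = set(h_phrases)
    return [
        redundant
        for redundant, implied_by in LABEL_REDUNDANT_PAIRS
        if redundant in present and implied_by in present
    ]


classification = ["H225", "H304", "H315", "H336", "H400", "H410"]
labelling = ["H225", "H304", "H315", "H336", "H410"]
print(suspicious_label_phrases(classification))
print(suspicious_label_phrases(labelling))
```

A check like this can only flag probable mistakes; it cannot prove that a statement was taken from the correct section of the source.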

### 2. NFPA 704

Do we agree to file a bot request for merging existing NFPA 704 data into new property? And, of course, adding constraint to NFPA 704 properties that from now these properties should be used as qualifiers only?

The proposed model (identical to the one in the property's discussion):

Wostr (talk) 19:09, 14 March 2018 (UTC)

• As there is no answer to my bot request (migration of NFPA 704 from the old model to the new one), I'll try to do most of these edits myself using QuickStatements (and the rest manually). This will take some time and will result in a situation in which, for a few days, some part of the NFPA 704 data will be present in WD in the old model (every NFPA 704 property separated) and some in the new model (every NFPA 704 property as a qualifier of safety classification and labelling (P4952)). Wikipedias using NFPA 704 data were notified about a week ago about the change. If anyone has any comments about this, please let me know. Wostr (talk) 09:56, 27 April 2018 (UTC)
• Most of the NFPA 704 data has been changed to the new model. The completed batch included P143-sourced NFPA 704 data only (most of the NFPA 704 data we have): ~150 items with full NFPA 704 labelling (4 properties) and ~1040 items with 3 properties (without NFPA 704 Special/Other). There are over 100 items in which NFPA 704 is incomplete/unsourced/sourced in a way that was not easy to convert using QuickStatements etc.; these I'll try to edit manually (after the update of the constraint violations pages). Wostr (talk) 00:32, 5 May 2018 (UTC)

### Agreement to distinguish between system and document

Do we agree to use legal documents or standard documents instead of classification systems for safety classification and labelling (P4952) ?

For example:

Globally Harmonized System of Classification and Labelling of Chemicals (Q899146) is a system but can have different applications depending on the country. For the EU, the US and China at least, some differences can appear due to different regulatory application texts. And we can't rely on the source to determine the right application text. For example, an international company has to issue an MSDS for each country where its chemical is sold, according to the local regulatory text. So for one product sold by one company, we can have at least 4 MSDS with slight differences (one for the US, one for the EU, one for China and one following the UN documentation). I don't know about other countries and I hope contributors can help me to define which text is relevant for each country.

Then if we agree for that solution for Globally Harmonized System of Classification and Labelling of Chemicals (Q899146), do we agree to use the same distinction for other safety classification system like NFPA 704 (Q208273) ? NFPA 704 (Q208273) is for the system and we have to create a new item for the document which describe the NFPA 704 system ? Snipre (talk) 14:48, 21 March 2018 (UTC)

That solution would solve two problems; normally we should use the system item in safety classification and labelling (P4952) with some qualifier to distinguish between different jurisdictions. I don't know, though, whether we should e.g. for the EU GHS distinguish between different ATPs? With NFPA 704 the problem is that the document is NFPA 704 (it's an NFPA standard and 704 is a code for this standard), which introduces a system (AFAIK usually called NFPA 704 too) to determine which categories should be used in the NFPA 704 hazard diamond. So in the case of NFPA 704 I think we already have the document item.
The problem is with GHS, because I really don't know how the GHS for the US and other countries is placed in legal acts – if it's a single document we can use just one item for a specific country, or maybe there was more than one document at different times. Fortunately, in Russian Wikipedia there is no GHS in their infoboxes, so there won't be mass uploads of their unsourced data – but nevertheless I'll try to determine how it is done in Russia (AFAIK GHS in Russia will be mandatory from 2020? 2021?). Wostr (talk) 18:13, 21 March 2018 (UTC)
@Wostr: I don't like to mix different types of items as value for safety classification and labelling (P4952):
No mixing of concepts, that's the rule to avoid bad inference later. Snipre (talk) 20:33, 21 March 2018 (UTC)
Okay, I know what you mean. We should establish some constraint in this property, because we will have a 'NFPA 704' item (about the system), a 'NFPA 704: Standard System for the Identification of the Hazards of Materials for Emergency Response' item (about the standard) and a few 'NFPA 704: Standard System for the Identification of the Hazards of Materials for Emergency Response (version xxxx)' items about editions of this standard. It won't be clear to people which item they should use. And, if I understand this correctly, only the edition items will be correct? However, this will be somewhat inconsistent with using Regulation (EC) No. 1272/2008 (Q2005334) – there were several amendments to this regulation (most of them called ATPs) which introduced some changes to the EU GHS. There are situations where GHS data according to the CLP Regulation after X ATP is different from GHS data (for the same substance) after X+1 ATP. So, should we make items for different ATPs and use them in safety classification and labelling (P4952)? Wostr (talk) 23:06, 21 March 2018 (UTC)
@Wostr: You clearly described the problem and no we won't use the versions because there is no way to define which version was used to define the classification/labelling of a compound. Only the fundamental document is mentioned in the SDS, not the version. If I list the versions, this is just to have an idea about the up-date of the fundamental document: if you have no up-date since 10-20 years, perhaps a new fundamental document is used. Snipre (talk) 11:03, 22 March 2018 (UTC)
• This and this may be of some help. BTW I think that – when we agree on all issues regarding this property – we could establish the full instruction here and just transclude relevant sections of this instruction to all properties discussions (rather than write instructions one by one). Wostr (talk) 14:16, 22 March 2018 (UTC)

### GHS statements

I've created items for GHS pictograms, H and P statements (see here). I will add items for EUH/AUH statements and for obsolete H/P statements next week. Also, I'll try to convert old GHS data to the new model so that P728 (P728) and P940 (P940) can be deleted. Wostr (talk) 19:47, 17 April 2018 (UTC)

@Wostr: Are you aware of the table of harmonised entries in Annex VI to CLP? I wonder if the content may be imported to Wikidata. --Leyo 09:11, 31 May 2018 (UTC)
@Leyo:, yes, I use the CLI database on pl.wiki for adding GHS labelling in infoboxes. There are two problems though: (1) [12] The replication, in whole or in substantial part, of the ECHA databases is prohibited; I don't know if this has any legal value, I'm not a lawyer (2) harmonised labelling does not include P statements, so in such situations we should add no value and probably also add some kind of comment that this is harmonised labelling or, in some cases, that this is minimal labelling info. In Wikipedias (like in the case of pl.wiki) it is possible to add most of the labelling elements from CLI and P statements from another source (GESTIS/SDSs; if the labelling from CLI matches the labelling from the other source), but I think that is not the case for Wikidata.
I think we could have harmonised labelling added in WD (with P statements always added as no value; we should however determine first how to distinguish harmonised labelling from companies' labelling etc.), but doing this manually would be a nightmare. But: even if this database cannot be reproduced now, I think I heard that there are some changes coming to the EU Database Directive, so maybe in a few years it would be possible to incorporate CLI database into WD in an automatic or semiautomatic way. Wostr (talk) 12:08, 31 May 2018 (UTC)
The content of the database corresponds to the information available in Table 3 to Annex VI of CLP Regulation and therefore in the public domain.
I would recommend to rely on harmonised labelling, i.e. to skip P phrases, at least for now. --Leyo 13:30, 31 May 2018 (UTC) PS. If you understand German, the guideline de:Wikipedia:Richtlinien Chemie/GHS-Kennzeichnung may be of interest for you.
Yep, it seems logical, but there were some issues, even discussed in WD project chat, where a database containing public domain data could not be extracted in whole or in a substantial part. Don't know the details, I think it has something to do with the Database Directive and rights to the database (collection of information) not the data itself, but I'd be more cautious in this case — I think importing CLI database would require at least discussion in project chat or in other place here on WD (to make sure or at least be more certain that we can use CLI; there are some discussions now that specific data should be removed from WD because either the database license is not compatible with CC0 or someone imported data with violations of terms of use of the database). And thanks for the link; we have something similar on pl.wiki [13], but I'm curious how it is done in de.wiki. Wostr (talk) 19:43, 31 May 2018 (UTC)
@Wostr: You didn't understand Leyo's remark: the labelling of the chemicals present in the ECHA database is defined in a legal annex of the European law. A legal document can't be copyrighted or even have restrictions. So if you take Table 3 of Annex VI from the CLP regulation (the legal document), then you can do what you want. The problem is that this document is a PDF and only the ECHA database, which reuses that table, offers an electronic document. So if you use as reference not ECHA but the annex of the CLP law, then you can reuse all the data. The tricky thing is to be sure that the ECHA dataset corresponds 100% to the legal document, or to find a way to extract the labelling data from the PDF of the legal text. See that link to the legal document. Snipre (talk) 20:07, 31 May 2018 (UTC)
@Snipre:, yeah, of course CLP Reg. is in public domain and theoretically we could import CLI database and add the CLP Reg. as a source. But this would be a bit phoney. The table can be extracted from the HTML view of the CLP Reg. [14] to e.g. Excel sheet – it's how I extracted all the H and P statements. Wostr (talk) 20:16, 31 May 2018 (UTC)

## Assign CAS RN to INN

I wonder whether it is possible to make use of Wikidata to assign CAS numbers to (Latin) INNs of pharmaceuticals, for example from this list[15]. Or alternatively without Wikidata. ;) --77.59.124.95 22:30, 4 June 2018 (UTC)

## Wikidata:Requests for deletions#Q27882203

Additional opinions are welcomed. --Leyo 21:59, 26 April 2018 (UTC)

The RfD is now open for almost two months. More opinions are welcomed. --Leyo 12:05, 8 June 2018 (UTC)

## possible resolution of element/substance issue - a new property?

So I was thinking some more about this - maybe it is ok for Wikidata items to represent two different things where there's no ambiguity? Say you want the melting point for all the "elements" (as substances); that may be hard to do right now where sometimes the melting point property is on the element item and sometimes on an allotrope item. But if we had a property "as a substance" to link elements to their substance forms, then for the unambiguous cases that property could link the item just to itself, while where there's ambiguity the property would link to each allotrope. That is, for each element you would query for P-substance/P2101 rather than just for P2101, so for manganese (Q731) it would return the P2101 value for itself, while for sulfur (Q682) it would return the P2101 for each allotrope we have a Wikidata item for. ArthurPSmith (talk) 15:50, 8 June 2018 (UTC)
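
The lookup proposed above (query P-substance/P2101 instead of P2101 directly) can be illustrated with a small in-memory sketch. Everything here is hypothetical: "as a substance" is the proposed property, not an existing one, the allotrope identifiers are placeholders, and the sulfur values are dummy numbers used only to show the shape of the result.

```python
def melting_points_as_substance(element, as_substance, melting_point):
    """Follow the proposed 'as a substance' links, then read the melting point (P2101)."""
    results = {}
    for substance in as_substance.get(element, []):
        if substance in melting_point:
            results[substance] = melting_point[substance]
    return results


# Unambiguous case: manganese (Q731) links to itself.
# Ambiguous case: sulfur (Q682) links to each allotrope item (placeholder ids).
as_substance = {
    "Q731": ["Q731"],
    "Q682": ["Q_allotrope_1", "Q_allotrope_2"],
}
# Values in degrees Celsius; the allotrope values are dummy placeholders.
melting_point = {"Q731": 1246.0, "Q_allotrope_1": 100.0, "Q_allotrope_2": 110.0}

print(melting_points_as_substance("Q731", as_substance, melting_point))
print(melting_points_as_substance("Q682", as_substance, melting_point))
```

With this shape, a consumer asks the same question for every element and gets one value for manganese and one value per allotrope for sulfur, without needing to know which case applies.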

@ArthurPSmith: OK for the property, but if we manage our items correctly we should be able to do the same without a new property: just use the combination of instance of simple substance with the has part property to group each allotrope with the corresponding chemical element. If we clearly separate the items, we can retrieve all possible associations using a correct SPARQL query. The problem is that people don't know how a database works and want to have everything in one item, like they have in one WP article. Snipre (talk) 19:31, 10 June 2018 (UTC)
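```sparql
# List each simple substance together with the element(s) it has as parts.
SELECT ?substance ?substanceLabel ?elementLabel
WHERE
{
  ?substance wdt:P31 wd:Q2512777.   # instance of (P31) the simple-substance class referred to above
  ?substance wdt:P527 ?element.     # has part (P527): the element linked to the substance
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
```
Snipre (talk) 21:58, 10 June 2018 (UTC)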

### Before creating properties, we need to agree on the model

Do we agree to create different items to treat chemical element (species of atoms having the same atomic number) and the corresponding substance (part of matter composed of the atoms having the same atomic number linked together) ? If yes, then we can think about a possible way to link both concepts. Snipre (talk) 23:41, 13 June 2018 (UTC)
• I say: no. (From/at enwiki: the element business works great wrt an unseparated pair; I must say I'm waiting for the WD orthography to convince me otherwise.) But that's just me; maybe I have missed some reading. - DePiep (talk) 23:55, 13 June 2018 (UTC)
But hey, isn't the WD model OK to say: "ELEMENT = FORM A and/or FORM B", as in "Bonnie & Clyde"? - DePiep (talk) 23:59, 13 June 2018 (UTC)
@DePiep: No, "Bonnie & Clyde" is not a mixing of concepts: the "Bonnie & Clyde" item is about a duo, a group of persons, and the properties describing a person (birth date, sex, ...) are not used in the "Bonnie & Clyde" item. In the chemical element/substance problem, we can't mix both concepts if we want to treat correctly the different kinds of atoms of the same chemical element: the chlorine atom in the NaCl molecule is not concerned by the boiling point of the dichlorine molecule. The definition of a chemical element is "all atoms having the same number of protons", without any information about the way the atoms are bonded or whether the atoms are linked to atoms of other chemical elements. So considering that the properties of dichlorine can be applied to the chlorine atom in the NaCl molecule is IMHO an oversimplification.
Then your way of presenting data in WP:en for allotropes is completely non-neutral: for oxygen, no information about the physical properties of ozone appears in the infobox. Why this choice? Then in the carbon article, the physical properties in the infobox show data for diamond and graphite, so following the reasoning of WP:en, should we merge the items of diamond, graphite and carbon into one item? I am not criticizing the way WP:en manages its articles, but there is no logic in it. WD has to have logic, as one of its purposes is to be machine readable: we need a uniform way to model the data, not a different way for each particular case.
Finally, WD is not constrained by any choice made by the different WPs: why should we take account of WP:en and not of WP:de or WP:zh? And if I follow the "politics" in WP:en correctly, I think that WP:en is not interested in using WD (see the last RfC about WD), so I don't think that WD has to take care of WP:en. Snipre (talk) 21:41, 14 June 2018 (UTC)
• Yes, I think it would be much cleaner to separate the simple substances from the elements that make them up. --99of9 (talk) 00:46, 14 June 2018 (UTC)
• No, not for all elements. Only for the ones that have different allotropes, i.e. carbon, sulfur, oxygen, phosphorus, etc. --Leyo 21:25, 15 June 2018 (UTC)

## Nicotine

Could anybody please help to curate Nicotine (nicotine (Q12144), (−)-nicotine (Q28086552), (+)-nicotine (Q27119762)), where a lot of statements have to be moved from nicotine (Q12144) (racemic) to (−)-nicotine (Q28086552) (naturally occurring isomer)? Should all the interwikis in this case also be moved to (−)-nicotine (Q28086552)?--Mabschaaf (talk) 07:24, 17 June 2018 (UTC)

@Mabschaaf: We need to curate the items; the interwikis are not the responsibility of WD but of the different WPs. But if the WP articles are clearly focused on (−)-nicotine (Q28086552), we can move them. Snipre (talk) 19:36, 22 July 2018 (UTC)

## Selenium disulfide

I'd like to ask for help in properly separating selenium disulfide (Q419375) and selenium disulfide (Q56249646). The first item describes a mixture that is used in medicine and cosmetics (a mixture of various selenium sulfides), the second a specific compound. The problem is that in most databases three concepts are mixed into one: the compound, the mixture, and the group of compounds with a selenium to sulfur ratio of 1:2. I'm not sure where I should put the identifiers and whether selenium disulfide (Q56249646) is needed (maybe there are conditions in which the SeS2 molecule exists, but in the solid state Se and S form cyclic polysulfides). Wostr (talk) 20:31, 25 August 2018 (UTC)

## analog or derivative of (P5000)

I've accidentally found this property which was created not so long ago, apparently without participation of anyone from this wikiproject... and without even pinging this project. Wostr (talk) 21:50, 21 July 2018 (UTC)

Either delete it or redefine its use before this property is used in too many items. First, the label of the property is not clear: analog and derivative don't have the same meaning. We should define whether this property should be used to link items having similar physical/chemical characteristics or similar structural characteristics. Then we need to define the rule allowing the use of that property.
My personal opinion is that this property is not required for now and should be blanked or deleted. Snipre (talk) 19:31, 22 July 2018 (UTC)
My opinion is similar: this is an ambiguous property that may have some use in medicine, but it's unclear from a chemistry POV (also, imho even the concept of 'derivative' is not unambiguous enough to be used in WD, but that's another story). I'll post a notice on the property's discussion page about this topic. Wostr (talk) 21:08, 22 July 2018 (UTC)

## Glucose

@DeSl: Following your comment above I want to share my opinion about the case of glucose. Your comment was

...glucose (Q37525) the group of compounds named glucose; D-glucose (closed ring structure, complete stereochemistry) (Q23905964) the closed ring structure of D-glucose;anhydrous dextrose (open form) (Q21036645) the open form of D-glucose...

This way of classifying is not objective: using common names instead of objective parameters just leads to the mess we find in other databases. WD classification should be more objective, and in terms of objectivity, chemical structure is the best criterion.

So D-glucose (closed ring structure, complete stereochemistry) (Q23905964) should not be an instance of glucose (Q37525) but an instance of glucopyranose (Q23905960). A global classification should have the following scheme:

Relations between glucofuranose, glucopyranose (Q23905960) and glucose (Q37525) should be managed using dedicated properties like stereoisomer of (P3364); perhaps we should have "tautomer of" (but I am not sure that the relation between open/closed ring is part of tautomerism). Snipre (talk) 20:45, 23 August 2018 (UTC)

Just a note - I definitely prefer subclass all the way down; each one of these are abstract entities and the "instance of" distinction here I think is too subtle to either be understood by regular users or perhaps to be actually ontologically correct. For example, suppose I want to have an item for levoglucose (Q3266724) dissolved in water, vs. crystal, what is the ontological relation? ArthurPSmith (talk) 12:05, 24 August 2018 (UTC)
@ArthurPSmith: As always, before starting to create relations, we need to define concepts: what is levoglucose (Q3266724) dissolved in water? Not an instance or a subclass of chemical compound, but an instance of mixture or of chemical substance. And if this is a mixture, then levoglucose (Q3266724) and water are parts of the mixture. Snipre (talk) 19:36, 27 August 2018 (UTC)
Ok, I guess that can be a self-consistent viewpoint on this. Other narrower types (for example specific isotopic arrangements or molecular states) can be linked via other relations than the instance/subclass I guess, so maybe this is all ok. ArthurPSmith (talk) 14:41, 28 August 2018 (UTC)

## ChEBI secondary IDs (Property:P683)

There are over 700 single value constraint violations for this property, all or most of it caused by secondary IDs (entries in ChEBI database have one primary ID and may have several secondary IDs). I asked in Project chat what can be done in this situation (deprecate secondary IDs, mark primary ID as preferred or add exception to constraint (P2303)). However, Lucas Werkmeister pointed out that there is single best value constraint (Q52060874). I think we should replace single value constraint (Q19474404) with single best value constraint (Q52060874) and primary IDs in ChEBI should be marked as preferred. Do you have comments or ideas (how to do it differently)? Wostr (talk) 21:28, 13 June 2018 (UTC)
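
The difference between the two constraints can be sketched with a few lines of Python. This is a simplified reading of how ranks interact with the checks, not the actual constraint-checking code, and the identifier strings are invented placeholders rather than real ChEBI IDs.

```python
# Simplified model: each statement is a dict with a value and a rank
# ("preferred", "normal" or "deprecated"). Identifiers are placeholders.

def best_values(statements):
    """Preferred-rank values if any exist, otherwise normal-rank values."""
    preferred = [s["value"] for s in statements if s["rank"] == "preferred"]
    if preferred:
        return preferred
    return [s["value"] for s in statements if s["rank"] == "normal"]


def violates_single_value(statements):
    """Flag items with more than one non-deprecated value."""
    return len([s for s in statements if s["rank"] != "deprecated"]) > 1


def violates_single_best_value(statements):
    """Flag items with more than one best-rank value."""
    return len(best_values(statements)) > 1


# Primary and secondary ID both at normal rank: both checks complain.
before = [
    {"value": "PRIMARY-ID", "rank": "normal"},
    {"value": "SECONDARY-ID", "rank": "normal"},
]
# Primary ID marked as preferred: only the plain single-value check complains.
after = [
    {"value": "PRIMARY-ID", "rank": "preferred"},
    {"value": "SECONDARY-ID", "rank": "normal"},
]

print(violates_single_value(before), violates_single_best_value(before))
print(violates_single_value(after), violates_single_best_value(after))
```

Under these assumptions, switching the constraint only helps for items where the primary ID has actually been given preferred rank; items with two normal-rank values would still be reported.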

I would prefer to delete secondary identifiers, because this is not the role of WD to keep track of identifiers evolution in other databases. But if nobody has a similar position, then the minimal action is to change the constraint. Snipre (talk) 23:29, 13 June 2018 (UTC)
I would also prefer the deletion of secondary identifiers. We use WD's chemical IDs for creating mappings files between metabolites (with BridgeDb), and secondary IDs are creating several problems (but that is a long story). ChEBI does have its own API, where one could check their IDs for being primary/secondary. So I agree that WD doesn't have to accommodate for this. It will also send a clear message, that we don't want sec. IDs (because people will forget about adding the rank). DeniseSl (talk) 07:17, 14 June 2018 (UTC)
• So do I understand correctly that there is no reason for keeping secondary IDs? Wostr (talk) 19:20, 24 June 2018 (UTC) If so, I'll update the property's discussion page that sec IDs should be deleted. Wostr (talk) 19:22, 24 June 2018 (UTC)
• Yes, I think that's the consensus. I have asked Magnus to disable Mix'n'Match for now, which allowed people to include secondary identifiers, and will see if we can set it up again with only primary identifiers. --Egon Willighagen (talk) 14:56, 13 November 2018 (UTC)

## Inverse statements for group of isomers

We have consensus that stereoisomers should have instance of (P31) = group of stereoisomers, e.g. , but there are items in which users tried to add inverse statements, i.e. . I proposed above to use disjoint union of (P2738) for that and I changed several part of (P361) claims, but it's not correct as it turned out, because values in disjoint union of (P2738) should be classes and we treat chemical compounds as instances. There is a third option: use of (P642) like in cymenes (Q2672403), but it also causes some problems, because it's not valid in every language, as of (P642) is quite hard to define and statements like in cymenes (Q2672403) may be interpreted in different ways.

But maybe we don't really need to indicate that 2-pentanol (Q210479) is either (R)-2-pentanol (Q24953060) or (S)-2-pentanol (Q20680358) as a statement in 2-pentanol (Q210479) and and are sufficient? Wostr (talk) 19:18, 20 November 2018 (UTC)

## 'Is a' = 'chemical compound'

Could we redefine one of the points from the main page of this wikiproject? Change "Add for each pure chemical substance (i.e. not mixtures or solutions) the property instance of (P31) with the value chemical compound (Q11173)" to something like "For every pure chemical substance add the property instance of (P31) with the value being chemical compound (Q11173) or one of its subclasses"? I see more and more items having replaced by more specific classes, and as of now we already have over 500 classes of chemical compounds linked to chemical compound (Q11173) directly or indirectly, and over 200 groups of chemical compounds (including family of isomeric compounds (Q15711994)) linked to chemical compound (Q11173) as well. Wostr (talk) 19:18, 20 November 2018 (UTC)

I'm ok with this, but see my note above - we maybe want to be a little careful about what we consider a "class" and what we consider a "metaclass" here, and treat chemical compound (Q11173) and its subclasses consistently (either as metaclasses in which case P31 is appropriate in many places, or as regular classes in which case P31 should be rather rare here). ArthurPSmith (talk) 21:03, 20 November 2018 (UTC)
I'd say that chemical compound (Q11173) and all its subclasses are classes, e.g.
Every class here, except for chemical compound (Q11173), have (metaclass). This metaclass and some other similar metaclasses like group of chemical compounds (Q56256086) or family of isomeric compounds (Q15711994) can be easily used to differentiate classes in the classification tree (whether specific class is a 'real' class used in chemical classification, or it's just 'group of isomers' = 'compound without defined isomerism' etc.). I think these metaclasses could be also used in queries in a situation we don't have any instance of (P31) in chemical compounds but subclass of (P279) all along (i.e. only chemical compounds would have no metaclass, so queries like this: P279* Q11173 minus those having P31/P279* Q17339814 would give chemical compounds only).
But right now we have a situation in which every chemical compound is an instance of a class, not a class. It won't be easy to change that with over 150k chemical compounds and probably most of them not manually curated. Wostr (talk) 22:23, 20 November 2018 (UTC)
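The query idea mentioned above (everything under P279* Q11173, minus whatever has P31/P279* Q17339814) can be tried out on a toy graph. This is only a sketch under assumptions: apart from Q11173 and Q17339814, every name in the triples is a made-up placeholder, and the traversal is plain Python rather than SPARQL.

```python
# Toy triples (subject, property, object). Only Q11173 and Q17339814 come
# from the discussion; all other names are invented placeholders.
ROOT = "Q11173"
METACLASS = "Q17339814"

TRIPLES = {
    ("class-A", "P279", ROOT),
    ("class-A", "P31", METACLASS),
    ("compound-X", "P279", "class-A"),
    ("isomer-group-Y", "P279", ROOT),
    ("isomer-group-Y", "P31", "metaclass-isomer-group"),
    ("metaclass-isomer-group", "P279", METACLASS),
    ("compound-Z", "P279", "isomer-group-Y"),
}


def objects(subject, prop):
    """All objects of the given property on the given subject."""
    return {o for s, p, o in TRIPLES if s == subject and p == prop}


def superclasses(node):
    """Nodes reachable over zero or more P279 steps (like P279*)."""
    seen, stack = set(), [node]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(objects(current, "P279"))
    return seen


def has_metaclass(node):
    """True if some P31 value is the metaclass or one of its subclasses."""
    return any(METACLASS in superclasses(o) for o in objects(node, "P31"))


def plain_compounds():
    """Everything under ROOT via P279*, minus items that carry a metaclass."""
    nodes = {s for s, _, _ in TRIPLES} | {o for _, _, o in TRIPLES}
    under_root = {n for n in nodes if ROOT in superclasses(n)}
    # The zero-length path also matches ROOT itself; drop it explicitly.
    return {n for n in under_root if not has_metaclass(n)} - {ROOT}


print(sorted(plain_compounds()))
```

In this toy graph only the two placeholder compounds survive the filter, because every class and every isomer group carries a metaclass through P31.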
Yes, it would be a major change. But the situation right now really does not make logical sense. I'm not sure how we should try to move forward on it though... ArthurPSmith (talk) 15:13, 21 November 2018 (UTC)
@ArthurPSmith, Wostr: I am not in favor of any new recommendation before a clear definition of the classification of chemicals is provided. Just have a look at ethanol (Q153) to see that several ways of classifying exist: classification according to use, classification according to functional group, classification according to properties, ... And once we choose the classification, we need to clean up the class tree in order to have something coherent. When people add both alcohol and alkanol, I suspect that they don't understand the difference. Snipre (talk) 19:21, 4 December 2018 (UTC)
@Wostr: And by the way, with your automatic deletion of instance of chemical compound, you completely distort my monitoring of data improvement about chemicals (see Wikidata:WikiProject_Chemistry/Tools#Statistics). Can we expect consensus before changing the rules, especially when the rules are written on the first page of a wikiproject? Next time, please get the consensus, then remove the rule from the first page, and only then start to modify the items. Snipre (talk) 19:39, 4 December 2018 (UTC)
/edit conflict/ @Snipre: Surely, I can revert any additions of subclasses of 'chemical compound' and changes of 'chemical compound' to its subclasses done by me or by the others, but I think it may be counter-productive. People do it more and more often (I did several such edits in a past few days, because I didn't see any opposition here) and I really have no argument to justify reverting/changing their edits (there was already one topic on this discussion page about me changing back to 'is a' = 'chemical compound' and I really don't want to explain something to others which I don't think is right). The problem with most of the statements 'is a' in ethanol is that people apparently don't know about 'has role' property: , but I think that and , polar protic solvent (Q27949287) should be probably moved to some sort of 'hazard classification' property.
Also, there will never be a point in time at which we will have a cleaned-up class tree, with over 150k compounds and only a few of us cleaning this. In other words, we cannot wait with the classification of chemicals until we have all of our chemical compound classes in place; we have to start as soon as we can and slowly move forward, so that maybe in several years we will have some results.
And about me distorting anything: any classes I added are subclasses of 'chemical compound', so it's a simple change in a query; this + no opposition here + your comment in the past that you're reserved about changing 'is a' 'chemical compound' to its subclasses, because we cannot ensure these classes will always be subclasses of 'chemical compound' == I did several changes of 'is a' 'chemical compound' to its subclasses in the past few days. I can refrain from such edits, it's no big deal, but I'm not the only one, and once in a while I see someone making changes like this, or I see items without 'chemical compound' but with one of its subclasses. Wostr (talk) 19:54, 4 December 2018 (UTC)
@Snipre: BTW yesterday I added to every item in which I changed it in a past few days, so my edits should not be a problem here, but we should concentrate on a solution or some kind of a roadmap, what we should do first to get closer to a solution. Wostr (talk) 12:20, 5 December 2018 (UTC)

## alpha-Fenchene

I'm a bit confused and I can't figure out what is wrong with 2 items about alpha-fenchene:

But PubChem and ChemSpider give different data (InChI, SMILES; it seems that in one of the databases the data is about the second stereoisomer, but the name is for the first?). I have been checking this for over 20 minutes and right now I really don't know which ID and which InChI should be added to these two items. I'd be grateful if someone would look at it with a fresh eye. Wostr (talk) 18:27, 4 December 2018 (UTC)

I trust Chemical Abstracts most in situations like this. (PubChem contains many errors and is not well curated. ChemSpider is based to a large extent on PubChem data, but it is actively curated, from what I can tell.) Here's what I can discern from Chemical Abstracts via SciFinder in regard to these three compounds. The absolute stereochemistry is specified for 471-84-1 and 116724-26-6 in the systematic names. The optical rotation is specified for 7378-37-2 as (+), but is not specified for 116724-26-6, so I'm assuming it is (-). The SMILES column is derived from pasting the chemical name into MarvinSketch and then using its "Copy As Smiles" function. If the ChemSpider and PubChem pages aren't consistent with this data, it might be best to just not link to them. Edgar181 (talk) 21:21, 4 December 2018 (UTC)

| CAS number | Systematic name | Common name/optical rotation | SMILES (based on systematic name) |
|---|---|---|---|
| 471-84-1 | 7,7-Dimethyl-2-methylenebicyclo[2.2.1]heptane | α-Fenchene | `CC1(C)C2CCC1C(=C)C2` |
| 116724-26-6 | (1R,4S)-7,7-Dimethyl-2-methylenebicyclo[2.2.1]heptane | (-)-α-Fenchene (assumed rotation) | `CC1(C)[C@H]2CC[C@@H]1C(=C)C2` |
| 7378-37-2 | (1S,4R)-7,7-Dimethyl-2-methylenebicyclo[2.2.1]heptane | (+)-α-Fenchene | `CC1(C)[C@@H]2CC[C@H]1C(=C)C2` |

Thank you Edgar181, I will check these two items against your data from SciFinder and assign PubChem/ChemSpider accordingly. Wostr (talk) 12:24, 5 December 2018 (UTC)
 I think that this discussion is resolved and can be archived. If you disagree, don't hesitate to replace this template with your comment. Wostr (talk) 12:47, 5 December 2018 (UTC)

## Chemical substance

Could someone look at these changes in items about chemical substance (chemical substance (Q79529), chemical substance (Q21652022)): [16], [17]. I don't think the changes were correct, but I can't get any clear answer to my questions from the author of these changes, so I'm asking here for an opinion. Wostr (talk) 00:04, 15 November 2018 (UTC)

@Wostr: I saw strange things but I didn't have the time to look into details. Snipre (talk) 12:49, 15 November 2018 (UTC)
@Infovarius: You did a lot of changes in chemical substance (Q79529) and you created chemical substance (Q21652022), but there is no coherent system between these different items and substance (Q27166344). Please explain the relations between the 3 mentioned items.
Without answer from your part I will revert your changes because the previous relations were clear. Snipre (talk) 12:59, 15 November 2018 (UTC)
Looks like the changes were to reconcile the ruwiki pages "Вещество" and "Вещество (химия)" but it seems to me the better (less disruptive) course would have been to switch the two ruwiki links, as in almost every other language Q79529 refers to chemical substances, not substances in general. ArthurPSmith (talk) 14:51, 15 November 2018 (UTC)
@ArthurPSmith: Thank you for the answer. But can you explain to me why the already existing item substance (Q27166344) could not make that distinction?
As I understand Russian
"Вещество" = substance (Q27166344)
"Вещество (химия)" = chemical substance (Q79529)
And no new item is necessary.
@Infovarius: Please could you provide your feedback ? Thank you Snipre (talk) 13:47, 19 November 2018 (UTC)
@Snipre: Sure, substance (Q27166344) seems to be the same concept as chemical substance (Q21652022) was originally. Note that the latter was created first, so "already existing item" isn't really a correct description. I think it makes sense to revert the recent changes as you suggest, swap the Russian sitelinks, and then merge substance (Q27166344) and chemical substance (Q21652022). ArthurPSmith (talk) 15:08, 19 November 2018 (UTC)
Thanks, I will wait one or two days before reverting anything in order to let Infovarius the time to give his explanation. Snipre (talk) 20:14, 19 November 2018 (UTC)

As I've said here, most of the sitelinks should be on chemical substance (Q79529), since expressions such as "substância/substância química" (pt), "sostanza/sostanza chimica" (it), "sustancia/sustancia química" (es) are treated as synonyms. Rafael Kenneth (talk) 03:49, 20 November 2018 (UTC)

@Rafael Kenneth, ArthurPSmith, Infovarius, Wostr: Modifications done. Snipre (talk) 06:48, 13 December 2018 (UTC)
 I think that this discussion is resolved and can be archived. If you disagree, don't hesitate to replace this template with your comment. Snipre (talk) 06:49, 13 December 2018 (UTC)

## Isotopically modified compounds

It was mentioned in the discussion about compounds without defined stereochemistry above that isotopically modified compounds should be instances or subclasses of isotopic compound (Q22332141). But what should be the relation to the compound with natural isotopic composition? Today's example:

But there is no relation between:

I could add but I can add neither nor , because L-selenomethionine (Q27096144) is not a class but an instance of a class, so it can't have any instances or subclasses.

How it should be linked? Wostr (talk) 19:18, 20 November 2018 (UTC)

'L-selenomethionine (Q27096144) is not a class but an instance of a class' - this is the sort of problem I've been alluding to all along here: EVERY chemical compound, substance, molecule, etc. is actually an abstract concept unless we are talking about a specific physical manifestation (such as Hope Diamond (Q640037)). As such they can always be "subclassed" in the sense of finding some way to subdivide real physical manifestations by various criteria (ultimately perhaps, specific location at a specific time). So, no, actually, I would say L-selenomethionine (Q27096144) is indeed a class, and most of the relationships between chemicals should be P279, not P31. ArthurPSmith (talk) 21:01, 20 November 2018 (UTC)
• After some thought, it seems to me that the relation between isotopically modified compound and compound having natural isotopic composition could be done using a dedicated property, similar to is a hydrated form of (P4770). Would having or make statements from the superclass be inherited by a subclass? If so, such relation would not be correct. I looked how ChEBI does this and it seems they have this unorganized: some compounds are superclasses of isotopically modified compounds and in some cases there are no relation between them (like 11C-choline and choline). Wostr (talk) 19:29, 14 December 2018 (UTC)

## Chemical compounds with unspecified stereochemistry

While trying to curate some chemical compound items (either by resolving e.g. CAS number constraint violations or disambiguating between compound/ion/class/functional group) I'm finding many entries about compounds that have unspecified geometry, like 5-(bromomethyl)-1,2,3,4,7,7-hexachlorobicyclo[2.2.1]hept-2-ene (Q27155747) and bromocyclen (Q27281057). Something like 'compounds with unspecified stereochemistry' is only a theoretical concept and cannot exist, so in fact it means 'one of X stereoisomers'.

I see at least three options here:

1. treat 'compounds with unspecified stereochemistry' as a 'group of stereoisomers' (family of isomeric compounds (Q15711994) or a subclass of it)
2. merge two items and set deprecated rank for ids that refer to compound with unspecified stereochemistry (with a new Wikidata reason for deprecation (Q27949697))
3. create new item like 'compound with unspecified stereochemistry' and use it with instance of (P31) – it's a way I don't like very much by the way

I've started with option 1 for several cases like this, but I'm not sure if this is the right way. Wostr (talk) 21:40, 22 July 2018 (UTC)

Well, it has been quite clear for some time that 'compounds with unspecified stereochemistry' should be treated as a 'group of stereoisomers'. The same goes for cis/trans compounds, which can be undefined, and an item can be created as a 'group of cis/trans compounds'. The question is whether we want to create items for all possible combinations of unspecified chiral atoms.
For racemic mixtures, the case is different: a racemic mixture has a defined composition and so can have properties like density, boiling point, ... A racemic mixture is not a subclass of chemical compound but an instance of chemical substance. Snipre (talk) 15:14, 23 July 2018 (UTC)
2-pentanol (Q210479) is IMHO a group of stereoisomers, not a racemic mixture. Snipre (talk) 15:59, 23 July 2018 (UTC)

### Rules definition

Perhaps we should formulate the rules and put them somewhere in the project pages in order to formalize those rules. Snipre (talk) 15:59, 23 July 2018 (UTC)

1) All chemical compounds (or pure chemical substances) with completely defined isomerism (cis/trans isomers, enantiomers, structural isomers) can have a dedicated item.

2) All isomers can be grouped with the help of a completely undefined compound using the relation "instance of". The undefined compound has to be classified as "subclass of" "chemical compound".

Ex.1:
Ex.2:

3) Partially defined isomers should not have a dedicated item unless there are some identifiers referring to those mixtures.

4) An atropisomer can have an item only if the different compounds can be isolated.

5) A racemic mixture has a fixed composition and cannot be considered as a group of stereoisomers but as a mixture with defined properties. This kind of mixture should be defined as an instance of racemic mixture (Q467717).

Ex.1:

6) Partially or completely isotopically defined compounds should be defined as a subclass or instance of isotopic compound (Q22332141).

Okay, it seems logical to me. There is stereoisomer of (P3364) (which I've found recently) that may be helpful. Also I'm finding different methods of linking 'group of isomers' to isomers: e.g. by has part (P527) – this seems quite wrong to me, I added disjoint union of (P2738) to DL-N-carbamoylaspartic acid (Q2823324) as an example (I think that may be better solution, but do we need something like that at all?). Wostr (talk) 12:43, 29 July 2018 (UTC)
These definitions look good. --Egon Willighagen (talk) 13:58, 18 August 2018 (UTC)
Probably I've been adding several of these "has part" and "part of" relationships, wasn't aware of the properties mentioned above. When there is consensus on which method to use to link it all together, I'll track down my changes and upgrade them to the updated rules. DeSl (talk) 08:50, 23 August 2018 (UTC)
• I've noticed e.g. maltose (2 ring structure, not stereospecific) (Q56229989) and many others like this – don't know how to fit these items into the classification above. Should we have different items depending on open/ring structure of carbohydrates etc.? @DeSl:, as an author of these items, what's your opinion? Wostr (talk) 15:46, 22 August 2018 (UTC)
Hi Wostr, thank you for including me in this discussion. Recently, I've been doing a lot of manual curation of chemical compounds which are in WikiPathways and are mapped to two Wikidata IDs (because they were annotated with a tertiary identifier that is used in Wikidata for two separate compounds; this goes wrong a lot for different stereospecific forms of compounds with a similar name). These "double mappings" are easy to spot now that the "single identifier" constraint is displayed next to the ID, with a linkout to the other Wikidata IDs it has been used for (so thanks to whoever made that possible, it makes my life a lot easier!). But it is still hard to see these very subtle differences in chemical structure from the title of the compound... I usually click on the isomeric SMILES for the IDs I want to compare, and then switch between pages to see where the difference is. But for compounds that are very different in terms of structure (open/closed ring structure, for example), I now put that information in the title, so the difference is also apparent to other users. And when they type in a name like "glucose", they will clearly see that we have three different forms, just by the name (glucose (Q37525) the group of compounds named glucose; D-glucose (closed ring structure, complete stereochemistry) (Q23905964) the closed ring structure of D-glucose; anhydrous dextrose (open form) (Q21036645) the open form of D-glucose)... So that was the (very short) explanation of why I add these names... Now moving on to: "do we want different items on these open/closed ring structures" @Wostr:... Several databases have identifiers for these "different" compounds (even though they are probably tautomers in the case of small carbohydrates, hard to measure in reality etc.). Since the database we use to draw biological pathways (WikiPathways) depends on identifier mapping support, we need to be able to map to these identifiers.
Sometimes it is unknown whether the closed or open ring structure is measured; sometimes the stereochemistry is undefined; and sometimes we really do know which steps are followed to go from glucose-1-phosphate to fructose-6-phosphate (check out https://www.wikipathways.org/index.php/Pathway:WP534 for a detailed drawing of this; several open and closed forms of compounds, and therefore IDs, were needed). So I would like to see support for this, and I personally like to see the difference between these compounds in the name. It will help users of WikiPathways annotate metabolites with more chemical correctness (or at least make them aware of the differences between the compounds). But any thoughts on the matter are appreciated, of course! DeSl (talk) 09:01, 23 August 2018 (UTC)
Okay, thanks for your input, DeSl. I really don't have an opinion on whether we should differentiate between open chain/closed ring, so I'll take your word that this is needed in some areas. I have a question though: wouldn't it be better to move 'closed ring structure' and similar descriptions from the label to the description? I once tried to disambiguate items using different forms of labels (singular/plural in my case), but I was eventually convinced that labels may be identical and that the description in Wikidata is meant for disambiguation of items. Wostr (talk) 15:37, 23 August 2018 (UTC)
@DeSl: So why don't you use the scientific name as the label to differentiate clearly between open/closed ring (like L-glucose or α-D-Glucopyranose)? The nomenclature is quite clear. By using the nomenclature, we will definitely set ourselves apart from other databases and we will offer a good way to identify the compounds clearly. Snipre (talk) 19:34, 23 August 2018 (UTC)
@Wostr: What is really your concern ? The way of naming the items or the justification of the item creation ? According to 1), if the compounds are fully defined they can own their items, close ring or open ring. Snipre (talk) 19:43, 23 August 2018 (UTC)
• The statement above (see 1) is contradicting the one below (see 2?) in my opinion.... how can I link a stereo defined compound to its "superclass/parent compound", if this cannot be a dedicated item? And what about all the compounds in other databases, where stereochemistry is not/ill defined? Sometimes a not-stereospecific compound makes sense (since it was measured with MS for example)... DeSl (talk) 08:42, 23 August 2018 (UTC) @Snipre: @EgonW:
@DeSl: Sorry, but I moved your comment to avoid mixing the proposed rules with the comments. Can you use the numbering to indicate where you find the contradiction?
I don't see the contradiction. The above rules say that if a compound has 2 chiral centers, I can create 5 items: one item for the compound with both centers undefined and 4 items, one for each compound with both centers defined, but no item for a compound with only one defined chiral center and one undefined center. Where is the contradiction?
Exceptions are partially defined compounds that have some external identifiers like a CAS number, EC number, etc.: the existence of external identifiers is a kind of structural need justifying the creation of items. Snipre (talk) 19:25, 23 August 2018 (UTC)
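The counting in the rule above (5 items for 2 chiral centers) can be checked with a small sketch. It is only an illustration under simplifying assumptions: each center is written as `R`, `S` or `?` (undefined), symmetry effects such as meso forms are ignored, and the exception for partially defined compounds with external identifiers is not modelled. The function name `allowed_stereo_items` is made up for this example.

```python
from itertools import product


def allowed_stereo_items(n_centers):
    """Split all stereo configurations into allowed and excluded ones.

    A configuration is allowed when every center is undefined ('?')
    or every center is defined ('R' or 'S'). Partially defined
    configurations are excluded.
    """
    allowed, excluded = [], []
    for config in product("RS?", repeat=n_centers):
        undefined = config.count("?")
        label = "".join(config)
        if undefined == 0 or undefined == n_centers:
            allowed.append(label)
        else:
            excluded.append(label)
    return allowed, excluded


allowed, excluded = allowed_stereo_items(2)
print(len(allowed), allowed)    # fully undefined + fully defined
print(len(excluded), excluded)  # partially defined, no item by default
```

For n centers this gives 2^n fully defined configurations plus one fully undefined one, so the number of partially defined configurations left without an item grows quickly with n.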
• We certainly need a list of clear rules, guidance, and exemplars on a subpage. I suggest starting with the set you have thought through, and then if anyone finds problems or cases that are not covered, we can discuss on the talkpage. --99of9 (talk) 00:26, 28 August 2018 (UTC)

### New proposal

I've prepared a proposal with 10 points, based on the above, and examples to each point. Wostr (talk) 19:59, 11 December 2018 (UTC)

• @Wostr: Thanks for putting that together. It looks very sensible; the only disagreement I have is point 5: I think we should avoid instance of (P31) for any relations between chemical compounds and use subclass of (P279) uniformly. Having instance of (P31) at the "lowest" level is likely to lead to (A) many errors from people who don't quite understand the stereochemistry implications, and (B) confusion regarding other aspects that may also be more precisely specified. What is an "instance" of a particular chemical compound? I think it must mean a particular group of molecules (in somebody's lab, for instance), not an abstraction - and that meaning of instance should be consistent whether or not the compound has been fully defined. ArthurPSmith (talk) 16:44, 12 December 2018 (UTC)
One problem with this is ontological, but I also see some specific practical problems: one is described in the Isotopically modified compounds topic below. I've also noticed that we already have some instances of minerals in WD (i.e. real instances, some chunks of rock in a museum, with catalogue numbers etc.), so I wonder when we will have similar items about samples of chemical compounds that will have to be instances of (or rather have P31 set to) a specific compound, and that relation won't be possible, because our chemical compounds are already instances, not classes. Wostr (talk) 17:30, 12 December 2018 (UTC)
• Excellent points. I'd be comfortable with either your 1. or 2. as a solution; the taxon hierarchy is its own little world and I think I've seen people discourage other areas from following that model, so maybe 1. is preferable... ArthurPSmith (talk) 20:12, 12 December 2018 (UTC)
• @Wostr: I think you haven't understood the role of chemical compound (Q11173) so far. This is not a statistical role: chemical compound (Q11173) is the top of the future chemical compound classification, and serves as such until a global classification is provided. Currently, if you rely on any other item to link an instance of a chemical compound with the chemical compound concept described by item chemical compound (Q11173), there is a high risk that someone changes the intermediate classes and you will lose the link between the instance and the top class. The current situation is not the final situation but only a temporary solution until a coherent and accepted classification is developed.
• @Snipre: Yes, I know that adopting 'is a = chemical compound' was probably the best option at that time to group all the chemical compound-related items, but in this case 'chemical compound' works as a metaclass to me, because of its relation: instance of (P31). In other words, I don't think that it is correct to maintain 'is a = chemical compound' and use 'subclass of = Qxxx (class of compounds)' with 'xxx (class of compounds)' being a subclass of 'chemical compound' at the same time. Also, I don't think that compounds should be classified using instance of (P31). Ergo: if we want a classification with 'chemical compound' in it, we cannot use 'is a = chemical compound'; instead we could use some sort of metaclass (let's say 'type of a chemical compound'), if we need such a metaclass for queries etc.
About the risk of changing the statement and breaking classification tree: this risk was, is and will be present in WD. But I think we may have some tools to regularly check this: in pl.wiki I have daily updated lists of compounds having an article in pl.wiki created using {{Wikidata list}}. I see changes to some properties that are imported to pl.wiki (like InChI, CAS number, Commons, etc.), I also see additions or removals of P31/P279. I think it is possible to create such lists for WD and watch every change regarding P31/P279. Wostr (talk) 21:22, 17 December 2018 (UTC)
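The kind of monitoring described above can be sketched as a comparison of two snapshots of P31/P279 statements. This is a minimal illustration only: how the snapshots are obtained (a daily list, a query, a dump) is left open, the function name `diff_class_statements` is made up, and `Q_A`, `Q_B` and `Q_CLASS` are placeholder IDs rather than real items.

```python
def diff_class_statements(old, new):
    """Report added and removed (property, value) pairs per item.

    Both arguments map an item ID to a set of (property, value)
    tuples, restricted here to P31 and P279 statements.
    """
    changes = {}
    for item in old.keys() | new.keys():
        before = old.get(item, set())
        after = new.get(item, set())
        added, removed = after - before, before - after
        if added or removed:
            changes[item] = {"added": added, "removed": removed}
    return changes


# Placeholder snapshots: Q_A switches from P31 to P279, Q_B is unchanged.
yesterday = {
    "Q_A": {("P31", "Q11173")},
    "Q_B": {("P279", "Q_CLASS")},
}
today = {
    "Q_A": {("P279", "Q11173")},
    "Q_B": {("P279", "Q_CLASS")},
}

changes = diff_class_statements(yesterday, today)
for item, change in sorted(changes.items()):
    print(item, change)
```

Only items whose classification statements changed are reported, which keeps such a watch list short enough to review by hand.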
• Then I don't understand your concept of metaclass: a metaclass is used to classify classes and must not be used on instances of other classes. There is no need to create a metaclass for statistical work if the correct links are created. For example, using SPARQL I can extract all instances of chemical compound, including the instances of subclasses of chemical compound, if a link exists through all subclasses between the instance and the top class. If the link is broken because someone decides to modify the classification at the intermediate level without taking care to keep a relation with the top class, then you lose some instances.
• A metaclass is sometimes needed, depending on what you want to get: how would you query all the 'classes of chemical compounds' excluding 'groups of stereoisomers' without proper metaclasses? All have 'chemical compound' as a super-class, and that distinction is needed e.g. to check whether stereoisomers are properly linked to each other etc. What's more, metaclasses may not be needed if every chemical compound has 'is a = chemical compound', but I think this should be changed to P279 (as chemical compounds are not instances) – then you wouldn't be able to distinguish a 'class of chemical compounds' from a 'chemical compound' without a proper metaclass. Wostr (talk) 21:22, 17 December 2018 (UTC)
• Finally, your option 2, based on a classification similar to parent taxon (P171), is not possible for chemicals because there is no concept of taxonomic rank (Q427626) for chemicals. Snipre (talk) 20:38, 17 December 2018 (UTC)
• It is possible, just as it is possible to add P31/P279 to chemical compounds; it doesn't have to be a systematic category like in biology, it may be just a 'belongs to the class' property with values being the lowest classes in the classification tree. That's why I wrote about a 'similar' system; taxonomic rank (Q427626) is essential in parent taxon (P171) only from a biological point of view, and such a property would also work without a taxon rank (P105) statement in the same item. Wostr (talk) 21:22, 17 December 2018 (UTC)
• @ArthurPSmith: What is an "instance" of a particular chemical compound? Wrong question. The correct one is: what is a chemical compound. If the definition of the concept is clear then the classification is clear. But as people never read the definition of items, and still use their own definitions, then the problems arise.
So we can start from the beginning: chemical compound is a subclass of chemical substance (we could spend a lot of time discussing the characteristics of chemical substance as a particular subclass, so I prefer to go directly to the core of the discussion). If you can provide a definition of chemical substance, then you will be able to deduce the definition of chemical compound, because the same definition will apply as for chemical substance, plus some additional characteristics.
Inheritance is a class property that people forget, but it is the only way to keep global coherence in the classification. So if we want to start a discussion about classes, we need to have the whole picture of the classification in front of us and not try to modify one concept without taking care of the influence on the lower levels or of the contradictions with the upper levels. Snipre (talk) 19:56, 17 December 2018 (UTC)
I like to start from chemical substance because for this concept we can rely on an external and recognized reference: the IUPAC Gold Book. A chemical substance is "Matter of constant composition best characterized by the entities (molecules, formula units, atoms) it is composed of. Physical properties such as density, refractive index, electric conductivity, melting point etc. characterize the chemical substance."
The question then is: do we agree to use that definition of the concept? Snipre (talk) 20:04, 17 December 2018 (UTC)
@Snipre: I'm fine with 'chemical substance' as a basic class and its definition; I don't think we can argue with IUPAC definition, as it's the most authoritative source in this matter. What about its subclasses? Mixture, pure substance, simple substance, chemical compound – how would you propose to model this? There are definitions for some of them, but for the rest we can rely only on various definitions from chemical literature or indirect evidence in IUPAC publications (e.g. for chemical compound there is no IUPAC definition, but I know one IUPAC paper with definition like this 'A compound is a single chemical substance'). So, what would be the next step after adopting 'chemical substance' as a basic class and with this definition? Wostr (talk) 20:48, 17 December 2018 (UTC)
@Wostr: "I don't think we can argue with IUPAC definition, as it's the most authoritative source in this matter". I don't think your opinion is correct from an ontology point of view: IUPAC never developed an ontology, i.e. a coherent classification, so it is a bad idea to use definitions which are contradictory even if they come from authoritative sources. We should first list the definitions of the different concepts and see if we can link them together. The choice of a definition, if several definitions exist for one concept, or the decision whether a concept can be included in the classification, is not based on the sources but on the coherence of all chosen concepts together.
We once started this job at Wikidata_talk:WikiProject_Chemistry/Proposal:Models#Definition but never continued that work. Ontology is like a puzzle: you need to put all the pieces on the table, and you can only match pieces with the correct shape. If you mix two puzzles or two classifications, you won't be able to join the pieces. Snipre (talk) 19:57, 19 December 2018 (UTC)
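The point made earlier in this thread, that instances are lost when an intermediate subclass link is removed, can be illustrated on a toy class tree. The sketch below mimics what a SPARQL property path such as `wdt:P31/wdt:P279*` does, but on plain Python dictionaries; the function name `instances_of` is made up, and the class and item names are simplified examples, not a copy of the actual Wikidata statements.

```python
def instances_of(top_class, subclass_of, instance_of):
    """Return items whose class reaches top_class via subclass links.

    subclass_of maps a class to the set of its direct parent classes
    (P279); instance_of maps an item to the set of its classes (P31).
    """
    # Collect top_class and all of its transitive subclasses.
    reachable = {top_class}
    frontier = [top_class]
    while frontier:
        current = frontier.pop()
        for child, parents in subclass_of.items():
            if current in parents and child not in reachable:
                reachable.add(child)
                frontier.append(child)
    return {item for item, classes in instance_of.items()
            if classes & reachable}


instance_of = {"ethanol": {"alcohol"}, "water": {"chemical compound"}}

# Complete chain: alcohol -> organic compound -> chemical compound.
intact = {
    "alcohol": {"organic compound"},
    "organic compound": {"chemical compound"},
}
# Same tree after someone removes the link to the top class.
broken = {
    "alcohol": {"organic compound"},
    "organic compound": set(),
}

found_intact = instances_of("chemical compound", intact, instance_of)
found_broken = instances_of("chemical compound", broken, instance_of)
print(sorted(found_intact))
print(sorted(found_broken))
```

With the intact tree both items are found; after the intermediate link is removed, only the item attached directly to the top class remains, which is the risk both sides of the discussion describe.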

## What are the best modelled items for your areas of interest?

Hi all

Over the past few months myself and others have been thinking about the best way to help people model subjects consistently on Wikidata and provide new contributors with a simple way to understand how to model content on different subjects. Our first solution is to provide some best practice examples of items for different subjects which we are calling Model items. E.g the item for William Shakespeare (Q692) is a good example to follow for creating items about playwright (Q214917). These model items are linked to from the item for the subject to make them easier to find and we have tried to make simple to understand instructions.

We would like subject matter experts to contribute their best examples of well modelled items. We are asking all the Wikiprojects to share with us the kinds of subjects you most commonly add information about and the best examples you have of this kind of item. We would like to have at least 5 model items for each subject to show the diversity of the subject e.g just having William Shakespeare (Q692) as a model item for playwright (Q214917), while helpful may not provide a good example for people trying to model modern poets from Asia.

You can add model items yourself by using the instructions at Wikidata:Model items. It may be helpful to have a discussion here to collate information first.

Thanks

John Cummings (talk) 15:36, 17 December 2018 (UTC)

@John Cummings: IMHO model items are not the best way to provide a generic model: there are too many particular cases to be able to provide a clear model using only one or two cases. And if we include the fact that WD should be understandable by machines, or at least translatable into a programming language, then a show item is not the correct way to explain the model. A more generic system has to be proposed. Snipre (talk) 19:35, 17 December 2018 (UTC)