Wikidata talk:WikiProject Chemistry/Archive/2016

The OECD eChemPortal is a valuable database of information on chemical substances. I recommend including a link to that database in the items of chemicals. In the case of pseudocumene (Q376994) for example, the link is --Leyo 15:55, 4 February 2016 (UTC)

Leyo No, because this is not a database but a database of web links. It can be a tool for finding data in other databases, but it is better to link directly to the original databases where the data are than to point to a database which points to other databases. Snipre (talk) 18:09, 4 February 2016 (UTC)
OK, let's call it metadatabase. It's more than just a number of blind links to (possible) entries in other databases. The only problem for Wikidata is that there is no ID other than the CAS number. --Leyo 23:37, 4 February 2016 (UTC)
From WD point of view, we should have only one parameter defining a unique entry in the database. From what I see, this is not the case for this database. Snipre (talk) 16:01, 9 February 2016 (UTC)
The CAS number is “only one parameter”. Is the problem that it is not a parameter specific to the eChemPortal? --Leyo 23:21, 11 February 2016 (UTC)
Leyo We already have the CAS number as a property, so there is no need to create anything: people can search with this information using the search tool of the database.
But if you want to create a link using the CAS number, we already have Magnus' tool, which connects a CAS number to all databases that use the CAS number as a search parameter: see here for the case of methanol. Snipre (talk) 08:22, 15 February 2016 (UTC)
As far as I see, the tool is not linked in items of chemicals. However, my point is, that the eChemPortal should be accessible directly from there. --Leyo 00:43, 17 February 2016 (UTC) P.S. There are several dead links in Magnus' tool.
Leyo Seems that eChemPortal changed the way to link to their data. How do you want to create a link to eChemPortal ? Snipre (talk) 19:01, 17 February 2016 (UTC)
I am not sure how exactly it should be done. That's why I am asking. ;-) --Leyo 02:16, 18 February 2016 (UTC)
We should be using the InChI / InChIKey as main unique identifier for most compounds (e.g. all organic compounds). --Egon Willighagen (talk) 18:30, 14 February 2016 (UTC)
Egon Willighagen InChI is not really human-friendly for comparison purposes. InChIKey is better but still complex. But to be honest, we should have a tool which creates a drawing of the chemical and the corresponding SMILES, InChI and InChIKey. These four elements have to be created at the same time and shouldn't have different origins. In that respect PubChem is a good tool because it creates these four elements together. Snipre (talk) 08:43, 15 February 2016 (UTC)
Snipre Yes, PubChem CID works for me. They have a bot that can help, see User:ProteinBoxBot and pinging User:Andrawaag. For the Wikidata:WikiProject_Medicine/Zika project I am using QuickStatements (see this source code:; both options use Bioclipse): 1. take a SMILES, generate InChI and InChIKey, lookup PubChem CID, and create a QuickStatement (linked to the paper in which the compound was mentioned); 2. take a ChEMBL ID, look up SMILES, InChI, and InChIKey in ChEMBL, and create a QuickStatement (with their permission to copy the data for these Zika-related compounds). In both cases, I visualize the 2D structures in Bioclipse (with the ui.view() command), to make sure things look OK. I have not had time for this, but need to learn how to write (and mostly use) bots, and then talk to the PubChem people, to autopopulate items with PubChem CIDs with additional CCZero/PD data from PubChem (as earmarked by them). Egon Willighagen (talk) 10:39, 15 February 2016 (UTC)
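The first workflow described above (SMILES, InChI, InChIKey and PubChem CID turned into QuickStatements) could be sketched roughly as follows. This is not the Bioclipse script mentioned in the comment, only a minimal Python illustration. It assumes the identifiers have already been computed elsewhere, that the tab-separated QuickStatements version 1 syntax is used, and that P233, P234, P235 and P662 are the properties for SMILES, InChI, InChIKey and PubChem CID. The function name and the example item IDs are hypothetical.

```python
def quickstatements_for(smiles, inchi, inchikey, pubchem_cid, qid=None, stated_in=None):
    """Build QuickStatements (v1, tab-separated) lines for one compound.

    If qid is None, a CREATE line is emitted and LAST is used as the subject.
    stated_in is an optional item ID used as an S248 ("stated in") source.
    Values are passed through unchanged; nothing is validated here.
    """
    subject = qid if qid else "LAST"
    lines = [] if qid else ["CREATE"]
    claims = [
        ("P233", smiles),            # SMILES (assumed property)
        ("P234", inchi),             # InChI (assumed property)
        ("P235", inchikey),          # InChIKey (assumed property)
        ("P662", str(pubchem_cid)),  # PubChem CID (assumed property)
    ]
    for prop, value in claims:
        fields = [subject, prop, '"%s"' % value]
        if stated_in:
            fields += ["S248", stated_in]
        lines.append("\t".join(fields))
    return lines


# Illustrative call; Q4115189 is the Wikidata sandbox item, used as a safe target.
for line in quickstatements_for(
    "CO",
    "InChI=1S/CH4O/c1-2/h2H,1H3",
    "OKKJLVBELUTLKV-UHFFFAOYSA-N",
    887,
    qid="Q4115189",
):
    print(line)
```

The 2D structures should still be checked visually before the lines are submitted, as described above; the sketch does not do that.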

Open Beauty Facts

Hey all Saehrimnir
Jasper Deng
Egon Willighagen
Denise Slenter
Daniel Mietchen
Emily Temple-Wood
Pablo Busatto (Almondega)
Antony Williams (EPA)
Devon Fyson
Samuel Clark
Tris T7
  Notified participants of WikiProject Chemistry,

The volunteers behind Open Food Facts are attacking Cosmetics :-)

Just like what we did for food, we're going to create a worldwide open data base of all cosmetic products, with ingredients, allergens, categories, brands, reference photos, using mobile phones. The effort has started at

We'll get a full list of ingredients in your favorite lipstick, shampoo or creams. We're starting to realize how much chemistry is actually involved.

  • We're going to improve the cosmetics articles, and hopefully manage to better link them with the underlying molecules (parabens, quaterniums…) (hierarchy)

Let me know if you have any ideas, either for Open Beauty Facts or on how to improve the cosmetic situation on Wikidata.

--Teolemon (talk) 13:50, 9 February 2016 (UTC)


There are two ways to define elements:

  1. Types of atoms with the same atomic number (aka. the set of all atoms for some atomic number)
  2. Types of substances with only one type of atoms (in definition 1)

It seems that we have an interwiki conflict here, because en:Chemical elements uses 1. as its main definition and, for example, fr:élément chimique uses the second. I'm afraid I failed to gain a consensus just for this pair of languages to align the definitions, so I guess we won't avoid splitting the items, which will give work to WD:XLINK. For each element … The good news is that it will clarify some classification issues (assuming we solve the same problem for chemical substance, molecular entity and all):

⟨ Hydrogen (en) ⟩ subclass of (P279)   ⟨ pure chemical substance ⟩
⟨ Hydrogène (fr) ⟩ subclass of (P279)   ⟨ atom ⟩
⟨ Hydrogène (fr) ⟩ part of (P361)   ⟨ Hydrogen (en) ⟩
Hydrogen (en) = hydrogène élémentaire/hydrogène pur (fr)
Hydrogène (fr) = Hydrogen atom (en)
⟨ Hydrogène (fr) ⟩ instance of (P31)   ⟨ élément chimique (fr) ⟩
élément chimique (fr) = "type of atom occurring in some element" (en)
⟨ USS Akron (Q1456109)     ⟩ has part (P527)   ⟨ Helium (en) ⟩

Does that seem correct ? @Emw, Snipre: of course.

  • I am very much in favor of splitting things. For several reasons: one "element" can have two or more substances. Oxygen has at least molecular oxygen and ozone, carbon has several, excluding all the pure-carbon molecular structures (buckyballs, graphenes). Is there consensus now? Egon Willighagen (talk) 06:39, 13 April 2016 (UTC)

Compounds with several CAS numbers

What should we do with DL-tartaric acid (Q194322), for example?

CAS Comment
133-37-9 DL
87-69-4 L-(+)
147-71-7 D-(–)
147-73-9 meso
526-83-0 D-(–) ?

--Kopiersperre (talk) 10:14, 10 March 2016 (UTC)

We have to separate mixtures of isomers from the individual isomers. So we will have at least 3 items:
- one for the mixture (DL)
- one for the D form
- one for the L form
Snipre (talk) 14:04, 10 March 2016 (UTC)
There should be separate items for dextrorotatory isomers, laevorotatory isomers, and racemic mixtures. I have been creating separate items for D- and L-isomers in certain cases. James Hare (NIOSH) (talk) 15:16, 10 March 2016 (UTC)
@James Hare (NIOSH): Can you provide the Q number of the items you created ? We should transfer the data from DL-tartaric acid (Q194322). In this case we have to create another item for the meso form which is a component too. For the second CAS number for D form we should check if these 2 numbers are correct and in that case check which one is the current valid one. Snipre (talk) 08:07, 11 March 2016 (UTC)
@Kopiersperre: L-tartaric acid (Q23034944), D-tartaric acid (Q23034947) and (S)-tartaric acid (Q23034950). Snipre (talk) 13:23, 11 March 2016 (UTC)
There is still a problem: there is one CAS number for the racemic mixture which is different from that of the generic tartaric acid. Snipre (talk) 13:31, 11 March 2016 (UTC)
Very much supporting the split up. The more precise we are, the better we do. This particular case is important to research I do in various projects. Egon Willighagen (talk) 06:41, 13 April 2016 (UTC)
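One part of checking whether the CAS numbers listed in this thread are "correct" can be automated: the last digit of a CAS Registry Number is a check digit, computed from the other digits weighted 1, 2, 3, … from right to left, modulo 10. A minimal Python sketch (the function name is made up for illustration) is given below. Note that a valid check digit only shows that the number is well formed, not that it is assigned to the intended isomer or is the currently valid number.

```python
import re


def is_valid_cas(cas):
    """Return True if cas looks like a CAS number and its check digit matches."""
    match = re.fullmatch(r"(\d{2,7})-(\d{2})-(\d)", cas)
    if not match:
        return False
    digits = match.group(1) + match.group(2)
    # Weight the digits 1, 2, 3, ... starting from the rightmost one.
    total = sum(int(d) * weight for weight, d in enumerate(reversed(digits), start=1))
    return total % 10 == int(match.group(3))


# The five numbers from the table in this section.
for cas in ["133-37-9", "87-69-4", "147-71-7", "147-73-9", "526-83-0"]:
    print(cas, is_valid_cas(cas))
```

All five numbers from the table pass this check, so the question about the two numbers for the D form cannot be settled by the check digit alone.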

Silicic acids

silicic acids (Q16524585) and orthosilicic acid (Q422843) should be merged. Could you please help me investigate the right CAS numbers (see also silica gel (Q308976) and metasilicic acid (Q3604536))?--Kopiersperre (talk) 17:34, 10 March 2016 (UTC)

@Kopiersperre: Please look at the German articles for both items silicic acids (Q16524585) and orthosilicic acid (Q422843): one is about the family of silicic acids and the other one is about a specific form of silicic acid. The German links will prevent any merge, so perhaps you should analyze them first. But for me these items shouldn't be merged, just relabeled. Snipre (talk) 14:23, 12 March 2016 (UTC)
Rename orthosilicic acid (Q422843), create disilicic acid (Q23038943), metasilicic acid (Q23038949) and pyrosilicic acid (Q23038952). Snipre (talk) 14:43, 12 March 2016 (UTC)
Thanks for the solution. Created trisilic acid (Q23038984).--Kopiersperre (talk) 15:11, 12 March 2016 (UTC)
Trisilic or trisilicic acid ? --Chris.urs-o (talk) 12:27, 17 June 2016 (UTC)

Items for hypothetical compounds?

What do you think on User talk:Marsupium#Ammonia / ammonium hydroxide? Should a separate item be created for the hypothetical ammonium hydroxide described by AAT record 300266781? How are similar cases handled such as elements not existing under common conditions? Cheers and thanks for pinging! --Marsupium (talk) 14:21, 17 April 2016 (UTC)

@Marsupium: Create a different item, because ammonium hydroxide is only one type of molecule in an ammonia solution. Ammonia solution is a chemical substance, meaning a mixture of different types of molecules, and one class of these molecules is ammonium hydroxide. So you can connect ammonia solution with ammonium hydroxide using the property "part of". Snipre (talk) 19:59, 8 June 2016 (UTC)
OK, thanks! But the problem is that ammonium hydroxide seems not to exist actually by itself outside ammonia solution. Which instance of (P31) shall <ammonium hydroxide> get? --Marsupium (talk) 18:48, 13 June 2016 (UTC)
@Marsupium: I don't know but hypothetical compound is not correct: this compound exists but only in small quantity and in certain conditions. Snipre (talk) 15:15, 14 June 2016 (UTC)
OK, thank you! I thought about that. If there is no obligation to point that out, I'll simply create the item. --Marsupium (talk) 18:12, 14 June 2016 (UTC)

Import parts of UniProtKB

Hi, what is necessary for an import of 550.000 reviewed items with their properties "accession number, protein name, gene name, organism, GO - molecular and biological function, keywords, length, mass and sequence"? We already have their permission to import. Here's the archived discussion from the project chat and here's the section at Portal:Gene Wiki, thanks, --Ghilt (talk) 07:15, 8 June 2016 (UTC)

@Ghilt: You need
  • a spreadsheet with all the data or at least an API to extract data from UniProtKB
  • the list of all items about protein with the corresponding UniProtKB identifier
  • a matching table between the wikidata properties and the corresponding UniProtKB parameters
  • an agreement from contributors working in the field of biology to import all the mentioned data.
  • and finally a bot operator ready to do the job. Don't forget to ask him to add after each statement import the reference using as example help:Sources, section databases.
The goal of Wikidata is not to import all data from all databases. You should aim mainly for data which can be useful for Wikipedia. The best approach is first to analyze infoboxes from different WPs like en:WP, de:WP and fr:WP to see what kind of data is used in the articles. Then you can start to extract all the corresponding data from UniProtKB. Snipre (talk) 19:55, 8 June 2016 (UTC)
Hi Snipre, thank you very much for the reply. The 555,000 items are not the full database, only their reviewed items. The data is used for writing protein articles on wikipedia. The matching table and the agreement shouldn't be a problem. But the API might, as it was difficult to get answers to my questions in either section (gene wiki on en.wp, wd Partnerships_and_data_imports and wd project chat) and i can't code sufficiently. Is there anybody who can help with that? --Ghilt (talk) 08:47, 10 June 2016 (UTC)
@Ghilt: Wikidata: Bot request. Snipre (talk) 11:29, 11 June 2016 (UTC)
Thanks again, i'll try that --Ghilt (talk) 21:58, 11 June 2016 (UTC)
@Ghilt: I am a little surprised to discover this proposal here and that you did not find the 377,000 UniprotKB items (SwissProt curated items, SPARQL query) we (project molbio/Gene Wiki team) already imported. We have all code in place and could do a full Swissprot import anytime required, but we prefer to do it species-wise, so we can link genes and proteins as described in the data model the Wikiproject Molecular Biology agreed on. Please see our papers on this [1] [2]. Sebotic (talk) 08:51, 16 June 2016 (UTC)
@Sebotic: Thanks for the reply. I had checked two typical protein items for molecular weight and length and didn't find the info, which is why i started at the project chat, followed by Portal:Gene_Wiki at en.wp, Partnerships and data imports, on this page and at Portal:Biology. And I finally found you! As i didn't intend to reinvent the wheel, your reply is a great help! This way, i don't need to import the 551,000. Should i discuss the creation of the properties "GO - molecular and biological function, keywords, length, mass and sequence" and the subsequent imports here or there? Cheers, --Ghilt (talk) 17:58, 16 June 2016 (UTC)
@Ghilt: The Wikidata protein items already have the full Gene Ontology annotations, which are maintained by our bot, directly from the original source QuickGo, so no need to add anything. Regarding length, mass and sequence: Length could be determined from sequence, so no need to add that, but there is a general agreement in WD project Molbio, not to add protein or nucleic acid sequences at this point, but let the users go to the original source if they need sequence info. This decision makes sense, as the current character limit for most WD text field properties is 400. Regarding mass: Several months ago, mass has been proposed as a property in the domain of chemistry, but it has been declined, because the mass of a molecule can be calculated from its chemical formula. Best, Sebotic (talk) 18:31, 22 June 2016 (UTC)
By the way, here is the German version of the template infobox protein, cheers, --Ghilt (talk) 08:08, 17 June 2016 (UTC)
If sequences aren't feasible, how about importing the length? And I would really like to have the mass for writing protein articles without having to calculate each one or to go look at Uniprot. Cheers, --Ghilt (talk) 18:43, 22 June 2016 (UTC)
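The argument that a mass "can be calculated from its chemical formula" can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: the function name is hypothetical, the table holds rounded standard atomic weights for a handful of elements, and only flat formulas such as C4H6O6 (no parentheses, charges or isotopes) are handled. For a protein, the same idea would have to be applied to the sequence rather than to a simple formula.

```python
import re

# Rounded standard atomic weights (g/mol) for a few common elements only.
ATOMIC_MASS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999, "S": 32.06}


def molar_mass(formula):
    """Sum the atomic masses of a flat formula like 'C4H6O6'."""
    total = 0.0
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        # A missing count means one atom of that element.
        total += ATOMIC_MASS[symbol] * (int(count) if count else 1)
    return total


print(round(molar_mass("H2O"), 3))     # water
print(round(molar_mass("C4H6O6"), 3))  # tartaric acid
```

An element missing from the table raises a KeyError, which is deliberate: a silent zero would give a wrong mass.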

Moving this discussion to Project Molecular biology, cheers --Ghilt (talk) 20:42, 20 June 2016 (UTC)

BTW, i'll be in Esino Lario, who else? --Ghilt (talk) 15:11, 23 June 2016 (UTC)
Not possible for me. But if you have good experience there please feel free to report here your comments. Snipre (talk) 15:17, 23 June 2016 (UTC)
It actually was a great experience, the people of Esino Lario were incredibly welcoming. There were 'We welcome Wikipedians' signs on every fourth house and there were even drive-by hollers of 'I love Wikipedia'. The local bakery renamed its cookies to 'Wikipedia's cookies'. The talks were ok, they're accessible on YouTube, but more important was meeting some of the Wikipedians I only knew through writing and putting a face and a character to their names. Cheers, --Ghilt (talk) 18:07, 29 June 2016 (UTC)
Thanks for comment. It is always a good thing when we have positive feedback: this can help us to take part to the events in the future. Snipre (talk) 07:11, 30 June 2016 (UTC)

Philadelphia ACS meeting

Hello! There will be a Wikipedia Edit-a-thon at the national ACS meeting in Philadelphia next month. Will anyone from this group be there, to show ignorant chemists such as myself how to contribute to chemistry on Wikidata? Would anyone be able to give a short talk on what Wikidata is and how it will (hopefully) be used within Wikipedia? Walkerma (talk) 22:52, 15 July 2016 (UTC)

@Walkerma: Sorry, I live in Europe and have no plans for holidays in the coming weeks. I can only suggest that you start by reading some help pages on the general structure of WD; once you have more detailed questions, I will try to answer them. My reading suggestions:
Snipre (talk) 08:06, 20 July 2016 (UTC)
Thanks - I'll try to work through these. If I get anywhere, I may try to contribute a couple of slides on it to the Edit-a-thon, just to explain the concept to the chemists who show up.. Walkerma (talk) 02:59, 21 July 2016 (UTC)

GHS hazard statements

We already have items with H phrases and with P phrases. In my opinion unsourced hazard statements should get deleted.--Kopiersperre (talk) 14:40, 5 August 2016 (UTC)

I would like to import the first big chunk of P728 (P728) and P940 (P940). I think the only viable way is creating one item for every possible phrase or phrase combination (see the list).--Kopiersperre (talk) 14:42, 5 August 2016 (UTC)

The statements are strings, not items. --Izno (talk) 15:39, 5 August 2016 (UTC)
I know, but this should be changed.--Kopiersperre (talk) 08:31, 6 August 2016 (UTC)
  • BTW what will be the source of the statements you want to import? ∼Wostr (talk) 17:27, 6 August 2016 (UTC)
@Kopiersperre, Izno: Please don't use Wikipedia to import data in WD: we already have enough complaint about the quality of these data to look for other sources. For GHS data please use the data from the ECHA available here as excel sheet. You can use the CAS number and the EINECS number to identify the item before the importation. Thanks Snipre (talk) 09:56, 8 August 2016 (UTC)
As Snipre above. Neither Wikipedias nor unofficial sources/databases/MSDSs should be used for GHS properties. Only the ECHA database and the harmonised classification (not the notified classification, as it varies greatly depending on the producer) should be included in WD. I think we should also use applies to jurisdiction (P1001) = European Union (Q458), because classification in other parts of the world can be different from the European classification and labelling included in CLP and the ATPs (e.g. U.S. OSHA may have its own official c&l for certain substances). ∼Wostr (talk) 13:13, 8 August 2016 (UTC)
I think we should perhaps change the data structure. My concern is about different sets of H phrases. For example, if source A says that compound C should be labelled with H202 and H400 and source B says that the labelling for C is H201 and H401, how can we later retrieve the correct set of H phrases according to only one source?
Instead of having separate P728 (P728) statements and having to filter them in order to get one unique labelling according to one source, we should create a new property Safety classification and group all H phrases as qualifiers.
Safety classification: GHS hazard statement (Q28360)
  P728 (P728): H201
  P728 (P728): H401
  Stated in: Source B
Safety classification: GHS hazard statement (Q28360)
  P728 (P728): H202
  P728 (P728): H400
  Stated in: Source A
Snipre (talk) 13:44, 8 August 2016 (UTC)
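To make the proposal above concrete, here is a minimal Python sketch of the grouped structure: one classification record per source, carrying its whole set of H phrases, so that the labelling according to a single source can be retrieved without filtering individual statements. The class and function names are hypothetical, and "Safety classification" is the proposed property, not an existing one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SafetyClassification:
    """One classification statement with its H phrases grouped as qualifiers."""
    system: str        # e.g. "Q28360" (GHS hazard statement)
    h_phrases: tuple   # all H phrases given by this one source
    stated_in: str     # the source of this classification


def phrases_by_source(classifications, source):
    """Return the complete set of H phrases according to a single source."""
    for classification in classifications:
        if classification.stated_in == source:
            return set(classification.h_phrases)
    return set()


compound_c = [
    SafetyClassification("Q28360", ("H201", "H401"), "Source B"),
    SafetyClassification("Q28360", ("H202", "H400"), "Source A"),
]
print(sorted(phrases_by_source(compound_c, "Source A")))
```

With separate, ungrouped statements, the same question would require checking the reference of every single H phrase.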

Table of valid phrases

Could you please help me fill out these tables? Some phrases (*) were altered by later ATPs.--Kopiersperre (talk) 10:36, 8 August 2016 (UTC)

@Kopiersperre: Please have a look at [3], page 34. Can you find a tool to extract data from pdf ? Snipre (talk) 11:17, 8 August 2016 (UTC)
@Kopiersperre: I found a better way: go to [4], select all H phrases, select one language and choose the button "Download selected phrases as PLS", and you get an Excel sheet with all the phrases. Repeat the same with the other languages, then copy and paste the content of the different sheets into one document and you have your list. Snipre (talk) 11:32, 8 August 2016 (UTC)
@Snipre: Very good solution.--Kopiersperre (talk) 13:45, 8 August 2016 (UTC)
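The copy-and-paste step of that procedure could also be done programmatically. The sketch below assumes each downloaded sheet has already been read into a list of (phrase code, text) rows per language. The function name is hypothetical and the phrase texts are placeholders, not the official wording.

```python
def merge_phrase_sheets(sheets):
    """Merge per-language sheets into one table keyed by phrase code.

    sheets maps a language code to a list of (phrase_code, text) rows.
    The result maps each phrase code to a {language: text} dictionary.
    """
    merged = {}
    for language, rows in sheets.items():
        for code, text in rows:
            merged.setdefault(code, {})[language] = text
    return merged


sheets = {
    "en": [("H201", "<English text of H201>"), ("H202", "<English text of H202>")],
    "de": [("H201", "<German text of H201>")],
}
table = merge_phrase_sheets(sheets)
for code in sorted(table):
    print(code, table[code])
```

Phrases missing in one language simply lack that key, which makes gaps in the table easy to spot.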

Approximate values of dipole moment

CRC Handbook of Chemistry and Physics (95th edition) (Q20887890) contains a table of dipole moments. Some of them are given with good precision, but some are marked with "≈" ("Values measured in the gas phase that are questionable because of undetermined error sources are indicated as approximate") or enclosed in brackets ("Values obtained by liquid phase measurements, which sometimes have large errors because of association effects"). How can I add this information to WD? In propyl formate (Q421045) I tried to use sourcing circumstances (P1480) with circa (Q5727902) for [1,89] D, but I don't think it's a good option – this is not an approximate value, but just one with an undetermined uncertainty. sourcing circumstances (P1480) would be better suited to values marked with "≈", but I'm not sure if this is a correct use of this property. ∼Wostr (talk) 23:10, 17 August 2016 (UTC)

@Wostr: The best is to use the original references and not the Handbook for these values in order to define which is the error. Snipre (talk) 09:47, 18 August 2016 (UTC)
@Snipre: I checked the first value marked with "≈" in the original source and it is marked with "Q" = "Questionable value" (there is a serious question about the best value to select or where there is insufficient information on which to base meaningful estimate of accuracy (...) They may be regarded as giving a rough estimate of the magnitude of the moment but are not of sufficient accuracy for quantitative use). The tables in CRC are based on 68 sources, mainly published before 2000, so for some compounds there may be better measurements of DM in the literature, but for some there may not be any other value. ∼Wostr (talk) 13:07, 18 August 2016 (UTC)


I would like to add many printing and coating pigments (example Pigment Yellow 138 (Q26705718)). Am I right that there is no generic property for color?--Kopiersperre (talk) 16:18, 26 August 2016 (UTC)

Kopiersperre There is color (P462) for general color descriptions. But if you want to describe the color in more detail, there is sRGB color hex triplet (P465). Snipre (talk) 16:35, 26 August 2016 (UTC)
Colour Index International constitution ID (P2027). --Teolemon (talk) 15:10, 27 August 2016 (UTC)

DSSTOX substance identifier

Please also have a look at the proposed property for the EPA CompTox Dashboard identifier. User:ChemConnector has uploaded some 700 thousand InChIKey<>DTXSID mappings as CCZero to Figshare, and I want to include that information in Wikidata. For this, I will want to use a bot task, and will soon write up a task proposal. The goal would then be to add mappings for Wikidata entries with matching InChIKeys, but I can also imagine creating new compound entries for InChIKeys not found in Wikidata yet. Comments on that second part are most welcome. --Egon Willighagen (talk) 08:53, 28 August 2016 (UTC)
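A rough Python sketch of the matching step described above, assuming the Figshare file and the current Wikidata InChIKeys have each been loaded into a dictionary beforehand. The function name is hypothetical, and the DTXSID values and the second InChIKey are placeholders rather than real data. It separates mappings that can be added to existing items from InChIKeys that have no item yet, which is the part still open for comment.

```python
def plan_mappings(wikidata_by_inchikey, dtxsid_by_inchikey):
    """Split the mapping file into additions to existing items and unmatched keys."""
    to_add, unmatched = [], []
    for inchikey, dtxsid in sorted(dtxsid_by_inchikey.items()):
        qid = wikidata_by_inchikey.get(inchikey)
        if qid:
            to_add.append((qid, dtxsid))          # existing item: add the identifier
        else:
            unmatched.append((inchikey, dtxsid))  # no item yet: candidate for creation
    return to_add, unmatched


wikidata = {"XLYOFNOQVPJJNP-UHFFFAOYSA-N": "Q283"}  # water
mappings = {
    "XLYOFNOQVPJJNP-UHFFFAOYSA-N": "DTXSID_PLACEHOLDER_1",  # placeholder ID
    "AAAAAAAAAAAAAA-UHFFFAOYSA-N": "DTXSID_PLACEHOLDER_2",  # made-up key and ID
}
to_add, unmatched = plan_mappings(wikidata, mappings)
print(to_add)
print(unmatched)
```

Keeping the unmatched list separate means new items are only created after the discussion requested above.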

Import of ChEBI

Hello everyone, I will start importing all actual chemical compounds represented in ChEBI. Furthermore, I would like to import and maintain the full ChEBI ontology structure. This would enable a unique representation of chemical compounds in Wikidata and would greatly improve the quality of chemical compounds in Wikidata. I have done that successfully with the Gene Ontology, which has a similar size and complexity, and have therefore shown that this is feasible.

For long term maintenance: The source code for this will be AGPLv3, available on our bitbucket repo [5] so in worst case, somebody else could take over and run the bot. Nevertheless, I would like to know your opinion on this. Best, Sebotic (talk) 20:43, 22 June 2016 (UTC)

@Sebotic: Not in favor of importing an external ontology in WD. Why do we have to maintain in WD an ontology defined and modified in another website ? The goal of WD is not to integrate everything from other databases but to link databases.
Same reasoning for the importation of all chemicals from ChEBI. I don't see the interest in just being a mirror of another website. It is better to work at the interface of the existing databases than just to copy-paste data from one. I propose that, instead of importing data from one database, you match data from different databases like ChEBI, ChemIDplus, ChemSpider, PubChem, ChEMBL or GESTIS and import the data which are the same in all of them. ChEBI is just one database among several others, so I don't understand why Wikidata should be the mirror of this database and not of the others. Snipre (talk) 11:38, 23 June 2016 (UTC)
@Snipre: Sorry for the delayed reply! The reason why I think ChEBI would be valuable is that it is the best chemical ontology currently available. It brings a ton of classification which could form the basis of further work by the WD community. The only thing which maybe should not be imported is tautomers, as they have the same InChI (key). In general, I would want to import data from several sources, but certainly not as a separate item per source but as a unified item with all the identifiers on it (CAS, InChI key, InChI, canonical SMILES, isomeric SMILES, CID, SID, ChEMBL, SureChEMBL, IUPHAR/GtoP, Drugbank, etc). The common id should be the InChI key, not perfect, but the best which is out there. Certainly, an important part is proper referencing, which is fairly easy as soon as the data sources have been determined. If we succeed, we would end up with the highest-quality open corpus of chemical compounds, with the most data/ids per compound to be found anywhere, which I think is great. Sebotic (talk) 01:13, 28 June 2016 (UTC)
@Sebotic: No problem about the delay. As for the data, I am sure you have good expertise. My only concern is to have a control process which works before the importation of data. I am really tired of correcting statements and merging duplicates each time a large chemical data import is done because people didn't do a correct job of data matching before importation. My recommendations are the following:
  • Before creating any new item check if another item already shares an identifier with your data set. And don't use label or page title of Wikipedia article as matching criteria.
  • Import data in one item only if you can match at least two identifiers between your data set and the data already present in the item.
  • If during the data import you detect the existence of an existing value for the property you want to import, compare the existing value with the value you want to import and if there is a difference don't import your data but create a conflict report in order to analyze the item later
For the question of the ontology, even if ChEBI is a good reference, we first have to check if the ChEBI ontology can match the overall Wikidata ontology. Wikidata can't be the sum of different ontologies if we want to have a unique way to query and to display data independently of the knowledge domains. For example, what happens if the ChEBI ontology allows items to have both instance of and subclass of in one item, but Wikidata does not?
I know that the ontology of Wikidata is very unclear but we need to be careful to keep a homogeneous system. Snipre (talk) 09:46, 28 June 2016 (UTC)
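The three matching recommendations listed in this comment could be turned into a pre-import check along the following lines. This is only a sketch of one possible reading of them: the function name, the dictionary layout and the outcome labels ("create", "review", "conflict", "import") are assumptions made for illustration, not an existing bot interface.

```python
# Keys treated as identifiers for matching (an assumption of this sketch).
ID_PROPS = ("cas", "inchikey", "pubchem_cid")


def plan_import(record, items):
    """Decide what to do with one incoming record.

    record: dict of property name -> value for the incoming compound.
    items:  dict of item ID -> dict of property name -> existing value.
    Returns (action, item_id, property_names).
    """
    candidates = []
    for qid, item in items.items():
        shared = sum(1 for p in ID_PROPS if p in record and item.get(p) == record[p])
        if shared:
            candidates.append((shared, qid))
    if not candidates:
        return ("create", None, [])   # rule 1: no item shares any identifier
    shared, qid = max(candidates)
    if shared < 2:
        return ("review", qid, [])    # rule 2: one identifier is not enough
    item = items[qid]
    conflicts = sorted(p for p, v in record.items() if p in item and item[p] != v)
    if conflicts:
        return ("conflict", qid, conflicts)  # rule 3: report, do not import
    return ("import", qid, sorted(p for p in record if p not in item))


items = {"Q283": {"cas": "7732-18-5", "inchikey": "XLYOFNOQVPJJNP-UHFFFAOYSA-N"}}
record = {"cas": "7732-18-5", "inchikey": "XLYOFNOQVPJJNP-UHFFFAOYSA-N", "chebi": "15377"}
print(plan_import(record, items))
```

Labels and Wikipedia page titles are deliberately not used anywhere in the matching, in line with the first recommendation.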
@Sebotic: I guess you have also seen the Mix'n'Match already? I love to see ChEBI fully in Wikidata. Now, ChEBI has a lot of ionic species (which becomes very clear when you run the Mix'n'Match in Game mode :) Do you also plan to include these? Also, will you include the links between the compounds, as the ChEBI ontology defined, particularly for these ions? --Egon Willighagen (talk) 08:40, 28 August 2016 (UTC)
@Egon Willighagen, Snipre: Well, after some more consideration and taking into account the concerns raised by Snipre, I think that importing all of ChEBI might not be too useful at this point. E.g. the ions and enantiomers do not all have enough chemical identifiers to be really useful in Wikidata. Moreover, ChEBI has many edges which currently don't exist in Wikidata, so they would all need to be proposed and approved (subclass of and has role already exist, so most of the core graph could be imported). What I will definitely do is make sure that all 'primary' (organic) compounds make it into Wikidata. That said, I would have the bot code ready to do a full import; the only things missing are edges (WD properties) and a general consensus that the full import should be done. Sebotic (talk) 18:05, 29 August 2016 (UTC)
@Snipre, Sebotic: These two aspects of chemical compounds, along with pureness (compounds vs substance), are important. How about we start ironing out how Wikidata should model these things? Are ions notable enough (probably, given that other databases support them?)? Should compounds with unspecified stereochemistry be instances or subclasses? And, quite related, how will we model compound classes and other "things" that are more than one distinct (isomeric) chemical structure on Wikipedia? It seems to me, we have critical mass. It seems to me that @ChemConnector, Walkerma, Pigsonthewing: (the first two have long been very active in the Wikipedia Chemistry team, and Andy has been at the Royal Society of Chemistry (Q905549)) will likely join in these discussions too, and then we have critical mass. This defines a group of experienced chemists who think Wikidata should be used in science. I'd say, let's do it! Let's define the framework and do that final clean up. Within not too long, we can beat several popular scientific databases in quality. (And then we submit a paper to the Journal of Cheminformatics (Q6294930) with our results, along the lines of Wikipedia Chemical Structure Explorer: substructure and similarity searching of molecules from Wikipedia (Q21957425). This will undoubtedly attract more scholarly chemists!) --Egon Willighagen (talk) 05:22, 30 August 2016 (UTC)
@Egon Willighagen:
* All ions are notable
* compounds with unspecified stereochemistry are defined as subclasses of chemical compounds and compounds with specified stereochemistry are defined as instance of items describing compounds with unspecified stereochemistry (see relations between L-lactide (Q24757824), D-lactide (Q24757832), (R,S)-lactide (Q24757839), and lactid (Q421313))
* Next problems to solve:
  1. how isotopic compound (Q22332141) should be structured compared to chemical compound ? Is heavy water (Q155890) an instance of water (Q283) ?
  2. how should we treat tautomers ? Two items or an item ? Which criterion can be used to define if a tautomer can have 2 items or not ?
  3. what is the granularity of the structure for chemical compounds: can we consider ethanol as an instance or as a class ?
Snipre (talk) 10:10, 30 August 2016 (UTC)
@Snipre: Cool, thanks for the details! The first of the next problems is indeed interesting, because ontologically seen, an instance of an instance is not typically done. ChEBI actually models even water (Q283) as a class. That's not that unreasonable, as a water molecule instance is something in your mouth right now, and the 'chemical compound' water is just the concept of it. Tautomers is another hard one. Personally, I like to have all chemical graphs as separate entities, actually like ChEBI does. However, if you say chemical compounds have a 1-to-1 relation to the Standard InChI, then we have a problem. Worse, the Standard InChI does not consider everything a tautomer that a biologist/chemist would (it's an incomplete model). So, the current answer following from the compound<>InChI link is: both two and a single item. The third problem to solve is related to the first. But this is the discussion we indeed need to have. What is the central concept of a chemical compound? That has major implications for the identifiers side of this. To me, the more explicit we are, the better we serve the scientific community. --Egon Willighagen (talk) 12:08, 30 August 2016 (UTC)

Importing COSING

[Discussion with Magnus: Matching CoSing numbers using multiple identifiers]

The CoSing number has recently been created for chemical compounds. It is the EU canonical identifier for chemistry and cosmetics, and as a result there are 25 000 identifiers, as well as identifiers to all the other chemistry systems, and interesting info for properties and labels.

I had first thought truncating the file for import using Mix N'Match, but I wondered if someone is skilled to maximize the utility of the file.



COSING Ref No | INCI name | INN name | Ph. Eur. Name | CAS No | EINECS/ELINCS No | Chem/IUPAC Name / Description | Restriction | Function | Update Date
38946 | ZEA MAYS STARCH | starch | maydis amylum | 9005-25-8 | 232-679-6 | Zea Mays Starch is a high-polymeric carbohydrate material usually derived from the peeled seeds of the Corn, Zea mays L., Gramineae | - | ABRASIVE, ABSORBENT, ANTICAKING, SKIN PROTECTING, VISCOSITY CONTROLLING | 15/10/2010

Property talk:P3073#Importing the identifiers

@Teolemon : I don't think Mix N'Match tool is necessary: the dataset contains CAS number and EINECS number so you can use those identifiers to identify the item for adding the CoSing number in WD. The best would be to check when possible if the CAS and EINECS numbers in the item are identical to the ones present in the dataset from CoSing database. Snipre (talk) 11:33, 29 August 2016 (UTC)
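The check suggested above could be sketched roughly as follows. This is only an illustration, assuming the CoSing rows and the Wikidata items have already been loaded into memory; the `WdItem` type, the status strings and the item ID `Q_starch` are hypothetical names, not part of any existing tool.

```python
from typing import NamedTuple, Optional


class WdItem(NamedTuple):
    qid: str                # placeholder ID, not a real Wikidata QID
    cas: Optional[str]      # CAS number stored on the item, if any
    einecs: Optional[str]   # EINECS number stored on the item, if any


def match_cosing_row(cas, einecs, items):
    """Classify one CoSing row against the candidate items.

    Returns (qid, status). qid is None unless exactly one item
    carries the CAS number of the row.
    """
    hits = [it for it in items if it.cas == cas]
    if not hits:
        return (None, "no_item_with_cas")
    if len(hits) > 1:
        return (None, "cas_not_unique")
    item = hits[0]
    if item.einecs is None or einecs is None:
        # Only one identifier can be compared.
        return (item.qid, "cas_only")
    if item.einecs == einecs:
        return (item.qid, "cas_and_einecs_agree")
    return (item.qid, "einecs_conflict")


# Example with the CoSing row quoted above.
items = [WdItem("Q_starch", "9005-25-8", "232-679-6")]
print(match_cosing_row("9005-25-8", "232-679-6", items))
```

Only rows where both identifiers agree would be safe to import automatically; the other statuses would go to a list for manual review.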

qualifier to indicate a conformer

Is there any way to indicate the conformer? Dipole moments are sometimes measured for a specific conformer (gauche, trans etc.), but I do not think there should be different items for every conformer as they are the same molecule. ∼Wostr (talk) 21:56, 30 August 2016 (UTC)

From what I know, no. Snipre (talk) 22:10, 30 August 2016 (UTC)

Problem with mixture and solution

I have a problem with items describing mixtures, especially aqueous solutions of salts or other soluble substances. First, these items can't be classified as chemical compounds, but can we classify them as chemical substances or as mixtures? My problem with items describing a solution, like barium hydroxide solution (Q809681), is the large number of different possible solutions which can be represented by this item. If I take the IUPAC definition of chemical substance, I read "Physical properties such as density, refractive index, electric conductivity, melting point etc. characterize the chemical substance". As I understand the definition, barium hydroxide solution (Q809681) can't be classified as a chemical substance because I can't assign one density or refractive index to the item barium hydroxide solution (Q809681): a density is valid only for one solution, for example water 70%/barium hydroxide 30%, but not for the solution water 99%/barium hydroxide 1%.

So this already solves a problem: barium hydroxide solution (Q809681) is not an instance of mixture or chemical substance but a subclass of mixture/subclass of chemical substance, as barium hydroxide solution (Q809681) represents an infinity of solutions having different compositions from 0.0001 to 99.9999%.

Then the next question: can we put as a constraint that identifiers used to identify a pure substance can't be used to identify aqueous solutions of the same substance? Even if this is allowed in general by external rules outside of Wikidata? An example is the CAS number, which is used for pure substances and their aqueous solutions. But this creates a mess in our constraint reports, so I would like to formalize the restriction of CAS numbers to pure substances only and exclude the use of the same CAS number for aqueous solutions. Comments? Snipre (talk) 21:09, 13 September 2016 (UTC)

I am not sure if we should use 'chemical substance' in opposition to mixture (or maybe we shouldn't use it at all). That's a very imprecise term and its definition may depend on language, author/source etc. (e.g. in Polish chemical literature from the 60s–70s chemical substances are divided into pure substances /compounds, elements/ and mixtures). Even the IUPAC definition is not as precise as it should be: the solution (mixture) of two substances with specified composition would be a mixture and a substance at the same time (both conditions are fulfilled: constant composition, characteristic physical properties). And we also have legal definitions: substance is a [chemical] mixture (pure substance + necessary additives + technological impurities) and mixture is a mixture/solution of two or more substances [EU CLP definitions].
You're right with barium hydroxide solution (Q809681): it should be classified as 'subclass of' mixture (but IMHO better 'subclass of' saturated solution -> solution -> mixture).
And yes, we should limit the use of CAS number to 'pure chemical substances' only. I think that the lack of distinction between solutions and compounds in the CAS Registry is not intentional but a result of practical reasons only, so there is no substantive reasoning behind it. ∼Wostr (talk) 20:00, 16 September 2016 (UTC)

Annotation in which species chemical compounds are found

I am adding this to shed some light on what I am up to with Wikidata. At the moment I am close to the first steps of developing a bot based on the User:ProteinBoxBot code base and made a first request. This bot can help import a lot of data, but also help add missing information. For example, Christopher Southan just reported a list of about 700 hundred PubChem CID (P662) for entries with SMILES. Pulling in this information is easy. For now, I will focus on the biology side of things, and plan to annotate compounds and the species they are found in, e.g. using knowledge in the WikiPathways database (see Wikidata:Requests_for_permissions/Bot/UreomiczBot 1). User:Wostr pointed out on my Discussion page that found in taxon (P703) can be used directly on the Wikidata entry being added/edited, instead of instance of (P31). There are quite a few species-specific metabolite databases where this can be sourced from. I stress how important it is to have this kind of information, because academic researchers now often face the problem that they have measured compounds from human samples of unknown chemical identity (in any typical untargeted metabolomics experiment). More info can be found in this report of a recent student project on Figshare and the H2020 project proposal Enabling Open Science: Wikidata for Research (Wiki4R) (Q26707522). --Egon Willighagen (talk) 08:49, 28 August 2016 (UTC)

@Egon Willighagen: Before doing any advertising to use WD in scientific research we have to implement a control system which allows us to sell WD as a reference database. To be able to reach that objective we should perform one step: for each "instance of: chemical compound", a unique value for InChI (P234) with the corresponding InChIKey (P235) has to be provided.
But currently we have
In one word, we have to be able to propose in WD one fixed list of chemicals clearly identified with a coherent set of identifiers (mainly InChI, InChIKey and chemical structure) from the same source or generated from the same system. We are far away from that situation now, so for me trying to sell WD as a tool for scientific research is just a bad idea and a way to lose any trust for the future. Snipre (talk) 12:25, 29 August 2016 (UTC)
@Snipre:, I am not claiming Wikidata is perfect yet. There are indeed a number of problems, but I would like to see your results that show that Wikidata is doing worse than scientific databases. Many of the latter have a certain scope, and only a few use InChI as a basis. The above issues need a lot of attention, and the bot I am developing can help. E.g. it is trivial to add InChIs and InChIKeys for chemical compounds with a SMILES. Finding inconsistencies too. The fact that the number of "instance of: chemical compound" is currently higher than the number of InChIs does not worry me at all: many compound classes are annotated as "instance of: chemical compound" rather than "subclass of: chemical compound", and compound classes simply do not have an InChI. Furthermore, there are chemical substances annotated as compound, etc, etc. Yes, there is plenty to clean up, but that's why Wikidata should be at the center of science, as it is an open database where all scholars can contribute, without having to worry about being able to reuse their own contributions later. I would love to sit down with you and a few other Wikidata Chemists and iron out some ideas! What about a (virtual) meet up soon? --Egon Willighagen (talk) 13:07, 29 August 2016 (UTC)
@Egon Willighagen: The problem with the items defined as chemical compound without an InChI is that they are not completely identified. One quarter of our database is not fully defined and this is why I prefer to slow down the use of WD by external users. We can always discuss next steps, but it would be great to put different options on paper first in order to already have an idea about the possible work to perform before starting the discussion. My proposition is developed there. The talk page can be used to add other ideas and we will update the page once an agreement is found. Snipre (talk) 09:50, 30 August 2016 (UTC)
@Snipre: Great! Let's continue talking there then! Mind you, there are some people who want to solve this problem, including me and Sebotic. And I know for a fact ChemConnector has that interest too. These are scientists, not users, but developers and data providers. Over the next few days, I will run some scripts to quantify the current quality. There will be a lot of manual work to be done. Also, I think you overestimate with 'a quarter'... not everything now qualified as compound really is a compound that should have an InChI. More in that talk page asap! --Egon Willighagen (talk) 10:25, 30 August 2016 (UTC)
@Snipre: BTW, sn-glycerol 3-phosphate(2-) (Q26711901) is a new compound which, according to searching on PubChem CID and InChIKey, was not yet in Wikidata. Adding missing information (or correcting info, if needed) can be automated. Feedback on that new compound page is appreciated. --Egon Willighagen (talk) 12:26, 30 August 2016 (UTC)
@Egon Willighagen: Seems OK, but it would be perfect if you could follow the recommendation of Help:Sources#Databases and add at least the "retrieved date". Title is a good thing too but less important. Snipre (talk) 22:15, 30 August 2016 (UTC)
@Snipre: Agreed about the 'retrieved date' but setting that requires a URL in the calendarModel property of the data, which causes the abuseFilter to overreact, so I cannot set that right now. See e.g. this log message. --Egon Willighagen (talk) 11:47, 31 August 2016 (UTC)
@Egon Willighagen: Please report the problem to the dev team or the abuse filter admin. Seems to be a programming problem. Snipre (talk) 19:13, 31 August 2016 (UTC)
@Snipre, Egon Willighagen: This is the current count of core identifiers of chemical items in Wikidata. By chemical items, I mean items either instance of or subclass of chem compound, or just having a cas, cid, inchi(key), smiles but lacking an instance or subclass categorization. Snipre, I see your concerns, but I have invested quite some time into the chem compound space in WD now and I am confident that there will be substantial improvements in the coming weeks. Sebotic (talk) 17:46, 29 August 2016 (UTC)
chem items 24809 subclass of or instance of chemical compound, or having cas, cid, inchi(key), smiles.
article 16391 links to
mass 637
chemSpider 11462
pubchem_cid 16906
unii 11826
mesh_id 927
kegg_id 4065
mesh_code 3
chebi 4464
drugbank 2682
chembl 5461
iuphar 1033
cas 19692
csmiles 15635
inchi 14470
inchi_key 14943
chemical_formula 19475
atc_code 1709
ismiles 16

Here is a list of 240 items with conflicts on the structure level, where the InChI key does not match for some of the identifiers on an item. This is usually due to items with incomplete stereochemistry being added, or just the wrong stereochemistry or the wrong compound in the first place. I think I can clean up a good share of those by just getting the majority vote of identifiers for an InChI key and then using this key to populate the item. This is certainly not error free. Otherwise, these 240 could be fixed by hand; what is required is just to delete any PubChem CID, InChI key, ChEMBL, ChEBI, UNII, or ChemSpider which is incorrect. After that, one valid identifier on an item is sufficient to let the item be populated by my bots. Btw: these 240 are the result of a consistency check for 4000 items, so approximately 1/4th of all compounds with a PubChem CID in Wikidata. Sebotic (talk) 09:20, 20 September 2016 (UTC)
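The majority vote mentioned above could look roughly like the sketch below. It is only an illustration and not the bot's actual code: the function name, the source labels and the strict-majority threshold are assumptions. The two InChIKeys in the example (water and ethanol) are used purely as sample data.

```python
from collections import Counter


def majority_inchikey(keys_by_source):
    """Return the InChIKey backed by a strict majority of sources, else None.

    keys_by_source maps a source label (e.g. "pubchem") to the InChIKey
    resolved from the identifier of that source on the item.
    """
    counts = Counter(keys_by_source.values())
    if not counts:
        return None
    key, votes = counts.most_common(1)[0]
    # Require more than half of the sources to agree; ties give None.
    if 2 * votes > len(keys_by_source):
        return key
    return None


resolved = {
    "pubchem": "XLYOFNOQVPJJNP-UHFFFAOYSA-N",
    "chebi": "XLYOFNOQVPJJNP-UHFFFAOYSA-N",
    "unii": "LFQSCWFLJHTTHZ-UHFFFAOYSA-N",
}
print(majority_inchikey(resolved))
```

Items for which the function returns None have no clear winner and would stay on the list for manual curation.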

@Sebotic: Thanks for your work. But I can't work now on that curation, at least not before 2 weeks. Snipre (talk) 19:25, 22 September 2016 (UTC)
@Snipre: I will try to get as many of them resolved through other automatic means, so that finally, we end up with a number which can be handled more easily. Manual curation, in my experience, is a very time consuming process, so I think it should be the last resort. Let's see how quickly we can resolve it. The biggest challenge for some of these will be to choose the one with the most appropriate ('correct') stereochemistry. Sebotic (talk) 20:12, 22 September 2016 (UTC)

UPDATE: that is now the full list of 1,279 items in the space of chem items with PubChem ID which need inspection and curation, out of 17,709 (7.2%)
UPDATE: Now 1,284 compounds; a few new items appeared and a few fixed ones were removed.
UPDATE: Managed to bring it down to 1,030, updated list accordingly. What I see frequently are InChI keys where most major resources agree, but PubChem has a different one (different connectivity) and the one agreed on by the other resources can be found in the PubChem data provider supplied descriptions, that's not ideal... Sebotic (talk) 08:57, 30 September 2016 (UTC)
Corrected, confusion between digermane (Q5275604) and digermanium (Q27183266)
Corrected, confusion between Sodium percarbonate (Q420070) and sodium hydroperoxy(oxo)methanolate (Q27216890)
Not corrected, two ways to represent the molecule: with ionic bond or covalent bond between calcium atom and nitrogen atom

Bot importations

@ProteinBoxBot, SoCalChemBot, TaxonBot:, @Doc Taxon, Sebotic, Andrawaag:. Please announce your import campaigns for chemicals and proteins on this page and indicate when each campaign ends. This will help with the data curation and prevent any bot reimportation of bad data after a manual correction.

Then we have to think about the future: we can't just let the bots operate in the same manner in the future, after data curation. Even if a database is providing some data, we can't just erase what will be present in WD after a manual data curation. So next bot actions should avoid any data deletion and focus on data comparison with report generation indicating conflicts.

Then a remark for those bots adding molar mass as mass to chemicals. This is not a very good solution because this data can mix monoisotopic mass and average molecular mass. The best would be to provide only the number of the different atoms and to let the people calculate the molecular mass according to their own choice.

Thank you. Snipre (talk) 08:12, 7 October 2016 (UTC)

+1.--Kopiersperre (talk) 11:07, 8 October 2016 (UTC)
+1 --Ghilt (talk) 16:24, 9 October 2016 (UTC)
+1 --Mabschaaf (talk) 16:38, 9 October 2016 (UTC)
+1. A remark concerning TaxonBot: This task of adding ECHA InfoCard ID (P2566) based on CAS Registry Number (P231) was done based on pre-curated data. I am currently working on the remaining issues. --Leyo 17:40, 9 October 2016 (UTC)
@Snipre: Ok, so if that runs as planned, ChEBI compound imports should be done by Wednesday, after that, there will be an UNII compound import run of another 5 days, maybe I find a way to do imports faster. Regarding the curation: All of the newly imported/created items are in good shape, centered around an InChI key. The items which need human intervention are listed above (~1,030). These need to be centered around one InChI key too. In addition to those items, there are about another ~1000 where there are still some wrong IDs on them (CAS, UNII, ChEBI, ChEMBL), but SMILES, InChI (key), PubChem CID and ChemSpider are ok. I can remove these with a bot automatically.
For curation after a bot run: In order to make sure that a bot does not overwrite good curation, the curated values need to have good references according to the Wikidata ref guidelines for databases, otherwise, any curation work is futile, as this is essential for a bot to recognize human curation. But this only works for statements; labels, description, aliases do not have refs. In addition, it is not realistic to only do one time imports, because the original data sources evolve, improve, and expand.
So these need to be kept in sync. What happens when data is imported to Wikidata once and not synced afterwards has been demonstrated by the import of chemical compound data from various infoboxes of various Wikipedias, which was basically done once with no continuous syncs afterwards; this is one major contributor to why there are still a ton of issues in the chem space of Wikidata.
So if we can agree on the mandatory requirement for good refs, I will modify my bots in a way to always keep the manually curated parts with good refs. Otherwise, there is no way to find out who made a good contribution. A list of user names is not a good way, because this will exclude anyone not on that list. Furthermore, keeping everything which has been contributed by users is also not a good way, because I have seen many wrong contributions because the users had either no idea of chemistry or were playing one of these curation games and got it wrong there. Ideas or suggestions?
Regarding the mass: What I import is the monoisotopic mass as stated in PubChem, I can add a qualifier to make that more explicit, but I cannot see how a data user should be able to calculate an average mass if the user does not know the isotopic distribution. But I can certainly add average mass as well (Or any other). Sebotic (talk) 07:58, 10 October 2016 (UTC)
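To make the difference between the two kinds of mass concrete, here is a small sketch that computes both from atom counts. It assumes standard (natural-abundance) atomic weights for the average mass, which is exactly the assumption questioned above; the tables cover only four elements and the function name is made up for this example.

```python
# Abridged standard atomic weights (natural isotopic abundance assumed).
AVERAGE = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999}

# Mass of the most abundant isotope of each element (1H, 12C, 14N, 16O).
MONOISOTOPIC = {"H": 1.00782503, "C": 12.0, "N": 14.00307401, "O": 15.99491462}


def mass(atom_counts, table):
    """Sum the per-atom masses from the chosen table over all atoms."""
    return sum(table[element] * n for element, n in atom_counts.items())


ethanol = {"C": 2, "H": 6, "O": 1}
print("monoisotopic:", round(mass(ethanol, MONOISOTOPIC), 4))
print("average:     ", round(mass(ethanol, AVERAGE), 3))
```

The two values differ in the second decimal place already for ethanol, so a statement using the mass property needs to say which one it holds.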
@Sebotic: Thanks for your answer. My concern is currently about the duplication of items about the same chemical: the bots add data once in one item then in the second item. This can be solved by merging but the problem is to be sure that both items are about the same chemical. Then we arrive at the second problem: the confusion between mixtures of stereoisomers and pure stereoisomers. This is a real problem especially for the CAS number. I have a huge problem curating CAS numbers because too few databases provide this identifier.
Concerning curation, currently I delete and merge, with no new additions. The problem is that sometimes the original databases are wrong (typical case of confusion between mixture of stereoisomers and pure stereoisomers) and I can't replace the wrong data with correct data (typical case for CAS number). In that case a reference doesn't help. And this is why the sync is not a good idea: after data import and curation WD is no longer a compilation of what is given by other databases, but is a database in its own right and should be considered as such. Future imports are no longer possible; only comparison and conflict reports should be generated. No more massive bot actions, only manual correction based on bot comparison. I agree with you about the references as key element to judge if a value should be kept, but in the future bots will only play the role of data comparison and no longer data import or correction (to a large extent at least).
Bot work is not a problem but you have to agree that their actions will change after the first import: sync is not the goal, the goal is to provide a coherent set of data about one topic. If we agree on that, then we can go to the next step which is the definition of a system where bots provide some reports and contributors use them to curate and correct. So please don't work alone in your corner but try to work with this project when dealing with chemicals. The case of the mass is a good example of the lack of discussion: even if you are working based on good reasoning, nobody knows which kind of mass you imported and no rules are defined for future data imports or additions. So the risk is very high that without guidelines, after some weeks people mix different data using the same property. Snipre (talk) 08:53, 10 October 2016 (UTC)
@Snipre: I agree that orienting around CAS is not a good idea, but this ID is so widely used that we should add it if we can. Therefore, I strongly advocate for using InChI keys, these are tied to the structure and uniquely identify a compound. So my basic premise is: The structure comes from scientific literature or chem/pharma companies, at Wikidata, we do not have the means to make a comment on the structure of a compound, it can be good or bad or incomplete. This is why I think that for some compounds, we will need to live with 2 or more versions of a compound, because alongside the real, true structure, incomplete or wrong versions of the structure exist. The connectivity can serve as the common basis (in most cases there is no disagreement on that) but the isomerism might differ. These different isomers can be connected to each other using Wikidata properties, and can be detected by using SPARQL queries. And over time, hopefully, many of those will resolve, but certainly, we will not reach a point where each and every compound has a high quality structure. So if I add 2 compounds with the same connectivity but different isomer info, how do you know that one is better than the other, or they are just 2 different isomers? I see no problem in having parallel versions. We can also have a compound without any stereoisomeric info as a minimum requirement and one or more defined stereoisomers, ideally of good quality.
Regarding bot imports: I completely agree that Wikidata is an independent database and should not be the aggregator of other databases. On the contrary, we should make use of our flexibility and community curation. That said, senior figures in PubChem have told me personally that they are interested in tapping into/using the community curation done in Wikidata. I also think we need to find mechanisms to not touch the community curated things, but still import the improvements made in the original sources. Moreover, we definitely need the new compounds added from those resources, because these are usually compounds of high interest (e.g. in Drugbank, UNII) with high medical/biological relevance and of public interest and also with biologic activity. As I said above, I think proper references are one way of doing that.
Regarding import efforts and import of special data: I agree that I should have discussed the import of 'mass' beforehand. I will also announce any import campaigns beforehand.
For error detection, I think we should use SPARQL queries. I also have a bot which can continuously check if SMILES and InChI (key) are consistent on an item and file a report if not. Sebotic (talk) 21:14, 10 October 2016 (UTC)
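One way to find items that share connectivity but differ in isomer information is to group them by the first block of the standard InChIKey, which is the hash of the connectivity layer. The sketch below does this on an in-memory mapping rather than via SPARQL; the item IDs and InChIKeys are made-up placeholders with the right shape, not real data.

```python
from collections import defaultdict


def group_by_connectivity(inchikey_by_item):
    """Group item IDs whose InChIKeys share the same first (connectivity) block.

    Only groups with more than one item are returned, since those are the
    candidates for "same skeleton, different stereo/isotope layers".
    """
    groups = defaultdict(list)
    for item_id, inchikey in inchikey_by_item.items():
        first_block = inchikey.split("-")[0]
        groups[first_block].append(item_id)
    return {block: sorted(ids) for block, ids in groups.items() if len(ids) > 1}


# Placeholder keys: same first block for the first two items, different second block.
sample = {
    "Q_isomer_1": "AAAAAAAAAAAAAA-BBBBBBBBSA-N",
    "Q_isomer_2": "AAAAAAAAAAAAAA-CCCCCCCCSA-N",
    "Q_unrelated": "DDDDDDDDDDDDDD-UHFFFAOYSA-N",
}
print(group_by_connectivity(sample))
```

Each returned group is a set of items that could be linked to each other as stereoisomers, or checked for duplicates.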
@Sebotic: I don't have any problem with CAS number, my problem is when you import the CAS number from PubChem and I delete it in WD because someone made a mistake by importing the wrong CAS number in PubChem. I can't provide the correct CAS number because I don't have access to SciFinder and most of the time Google can't provide a good answer. The main problem is the data curation in PubChem: they should do the same as us and analyze their CAS numbers to check if they are consistent with their structures.
Just have a look at this compound in PubChem as an example: someone put the wrong chemical formula as title for this entry in the PubChem database. I can't change that wrong data so please don't reimport it with your bot. This is my only concern.
About stereoisomers I think we should focus only on two kinds of compounds: the compounds which are completely defined and those which are not defined at all. The latter have the role of grouping all possible completely defined stereoisomers. We don't need to create items for all possible stereoisomers in a systematic way, but when we have confusion about mixture and pure forms we should split the data in order to avoid the confusion in the future.
For your other remarks I think we agree on the main principle. The only difference I have compared to your approach concerns importing: I don't think we need to import data in WD to curate them afterwards. Once we have a quite stable set of chemicals in WD we can work by comparison using bots and then create conflict reports. And only after a manual check can we import the data from other databases. If we agree on that we will avoid a lot of discussion later.
For now I am working with the constraint violation reports: every day I can see the results of the curation. I don't need to use SPARQL for the moment. But if you want to create the queries, just do it. Snipre (talk) 22:28, 10 October 2016 (UTC)
@Snipre: Regarding CAS numbers: I agree that PubChem does not do the best possible job here. But the CAS numbers I import are actually from UNII and ChEBI, not directly from PubChem. Still these could be incorrect. I have access to SciFinder, Reaxys, etc, but I am very sure that I am not allowed to do a systematic import of CAS or Beilstein numbers to WD. What could be a way to go is to ask ACS directly if they would be willing to contribute an InChI key to CAS number mapping file.
On the importation: In principle, I agree that it's a good idea to detect and log conflicts instead of overwriting. Two questions here: Is that feasible for thousands of items? And where would I post such a list of conflicts, so it can be processed really in a fast manner? Text which then needs to be copied and pasted around by a curator is not a good way. I also agree on the stereochemistry part, either fully defined stereochemistry or no stereochemistry. But I think we need some flexibility here, because for very many important compounds, the only stereochemistry which exists is partial. From what I have seen, this is very common for naturally occurring, larger molecules. But for cases where several stereoisomers exist, take only the fully defined ones. Sebotic (talk) 23:40, 10 October 2016 (UTC)
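The "compare and report instead of overwrite" behaviour discussed in this thread could be sketched as below. This is an assumed design, not the behaviour of any existing bot: the `Statement` type, the action names and the idea of flagging referenced values as curated are all hypothetical.

```python
from typing import NamedTuple, Optional


class Statement(NamedTuple):
    value: str
    referenced: bool  # True if the statement carries a proper reference


def plan_update(prop, current: Optional[Statement], incoming: str):
    """Decide what a bot may do with one incoming value; never overwrite.

    Returns (action, prop, detail) where action is one of
    "add", "agree", "conflict:curated", "conflict:unreferenced".
    """
    if current is None:
        return ("add", prop, incoming)
    if current.value == incoming:
        return ("agree", prop, incoming)
    # Values differ: leave the item alone and log the disagreement.
    kind = "curated" if current.referenced else "unreferenced"
    return ("conflict:" + kind, prop, f"{current.value} != {incoming}")


# The existing value is referenced, so the mismatch is only reported.
print(plan_update("P231", Statement("67-56-1", True), "67-56-2"))
```

The conflict entries could be written to a machine-readable report page, so a curator can resolve each one without copying text around.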

ZVG number (P679)

I suggest importing the remaining ZVG number (P679) based on CAS Registry Number (P231) and/or EC number (P232). This task is unlikely to create significant issues. Of the 8745 ZVG numbers available in an Excel list from there, we currently have slightly less than half. --Leyo 15:19, 10 October 2016 (UTC)

@Leyo: I'm reluctant to import entries like 900063. The remaining ZVG entries seem not relevant to me, but when anyone imports data, we should do so, too.--Kopiersperre (talk) 16:14, 10 October 2016 (UTC)
I did not ask for the creation of new items based on this list. IMHO it is sufficient to add those with a match in either CAS or EC number (or both). --Leyo 17:46, 10 October 2016 (UTC)
@Leyo: Sorry for getting you wrong. The import was done by this Mix-n-Match catalog and can be resumed at any time. But I think, there is not much to do.--Kopiersperre (talk) 21:46, 10 October 2016 (UTC)
Before any importation can we once do a data comparison ? For example: take the list of CAS numbers from Gestis, match the items with their CAS number and then compare the EINECS number from Gestis with the EINECS number from WD. If both CAS number and EINECS number match then we can think about data importation if and only if the CAS number is used only once. My concern about Gestis is the fact that Gestis can use several times the same CAS number/EINECS number like for hydrogen chloride and hydrochloric acid solution (two ZVG numbers but one CAS number and one EINECS number).
But before any importation we have to solve all violations of the constraints for CAS numbers and EINECS numbers. Snipre (talk) 21:08, 10 October 2016 (UTC)
There are only very few constraint violations for the latter. Unclear cases should be skipped, and if possible, listed for manual review. --Leyo 23:37, 10 October 2016 (UTC)
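The comparison proposed in this thread could be sketched as follows: a ZVG number is selected for import only if its CAS number occurs once in the Gestis list and the EINECS number agrees with the one in WD; everything else goes to a review list. The data layout and the ZVG/item IDs are placeholders invented for the example; the CAS and EC numbers are those of hydrogen chloride and methanol.

```python
from collections import Counter


def select_zvg_imports(gestis_rows, wd_by_cas):
    """Split Gestis rows into safe imports and cases for manual review.

    gestis_rows: list of (zvg, cas, einecs) tuples from the Gestis list.
    wd_by_cas:   dict mapping CAS number -> (item_id, einecs) from WD.
    """
    cas_usage = Counter(cas for _, cas, _ in gestis_rows)
    to_import, to_review = [], []
    for zvg, cas, einecs in gestis_rows:
        if cas not in wd_by_cas:
            continue  # no matching item; creating new items is out of scope
        item_id, wd_einecs = wd_by_cas[cas]
        if cas_usage[cas] == 1 and einecs == wd_einecs:
            to_import.append((item_id, zvg))
        else:
            to_review.append((item_id, zvg))
    return to_import, to_review


rows = [
    ("ZVG_gas", "7647-01-0", "231-595-7"),       # hydrogen chloride
    ("ZVG_solution", "7647-01-0", "231-595-7"),  # hydrochloric acid, same CAS
    ("ZVG_methanol", "67-56-1", "200-659-6"),
]
wd = {"7647-01-0": ("Q_hcl", "231-595-7"), "67-56-1": ("Q_methanol", "200-659-6")}
print(select_zvg_imports(rows, wd))
```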

Creating items for Cosmetic properties

In the COSING EU database about cosmetics, the Chemical components have one or several cosmetic properties. We might want to create those before importing COSING data --Teolemon (talk) 15:57, 16 October 2016 (UTC)

en:definition:Removes materials from various body surfaces or aids mechanical tooth cleaning or improves gloss

en:definition:Takes up water- and/or oil-soluble dissolved or finely dispersed substances

en:definition:Allows free flow of solid particles and thus avoids agglomeration of powdered cosmetics into lumps or hard masses

en:definition:Prevents corrosion of the packaging

en:definition:Helps control dandruff

en:definition:Suppresses foam during manufacturing or reduces the tendency of finished products to generate foam

en:definition:Helps control the growth of micro-organisms on the skin

en:definition:Inhibits reactions promoted by oxygen, thus avoiding oxidation and rancidity

en:definition:Reduces perspiration

en:definition:Helps protect against plaque

en:definition:Helps control sebum production

en:definition:Reduces static electricity by neutralising electrical charge on a surface

en:definition:Contracts the skin

en:definition:Provides cohesion in cosmetics

en:definition:Lightens the shade of hair or skin

en:definition:Stabilises the pH of cosmetics

en:definition:Reduces bulk density of cosmetics

en:definition:Reacts and forms complexes with metal ions which could affect the stability and/or appearance of cosmetics

en:definition:Helps to keep the body surface clean

en:definition:Colours cosmetics and/or imparts colour to the skin and/or its appendages. All colours listed are substances on the positive list of colorants (Annex IV of the Cosmetics Directive)

en:definition:Renders cosmetics unpalatable. Mostly added to cosmetics containing ethyl alcohol

en:definition:Reduces or masks unpleasant body odours

en:definition:Removes unwanted body hair

en:definition:Reduces or eliminates hair intertwining due to hair surface alteration or damage and, thus, helps combing

en:definition:Softens and smooths the skin

en:definition:Promotes the formation of intimate mixtures of non-miscible liquids by altering the interfacial tension

en:definition:Helps the process of emulsification and improves emulsion stability and shelf-life

en:definition:Produces, upon application, a continuous film on skin, hair or nails

en:definition:Gives flavour to the cosmetic product

en:definition:Improves the quality of the foam produced by a system by increasing one or more of the following properties: volume, texture and/or stability

en:definition:Traps numerous small bubbles of air or other gas within a small volume of liquid by modifying the surface tension of the liquid

en:definition:Gives the consistency of a gel (a semi-solid preparation with some elasticity) to a liquid preparation

en:definition:Leaves the hair easy to comb, supple, soft and shiny and/or imparts volume, lightness, gloss, etc.

en:definition:Colours hair

en:definition:Permits physical control of hair style

en:definition:Modifies the chemical structure of the hair, allowing it to be set in the style required

en:definition:Holds and retains moisture

en:definition:Enhances the solubility of substance which is only slightly soluble in water

en:definition:Helps eliminate the dead cells of the stratum corneum

en:definition:Reduces or inhibits the basic odour or taste of the product

en:definition:Increases the water content of the skin and helps keep it soft and smooth

en:definition:Improves the cosmetic characteristics of the nail

en:definition:NOT REPORTED

en:definition:Reduces transparency or translucency of cosmetics

en:definition:Provides cosmetic effects to the oral cavity, e.g. cleansing, deodorising, protecting

en:definition:Changes the chemical nature of another substance by adding oxygen or removing hydrogen

en:definition:Imparts a nacreous appearance to cosmetics

en:definition:Used for perfume and aromatic raw materials (Section II)

en:definition:Softens and makes supple another substance that otherwise could not be easily deformed, spread or worked out

en:definition:Inhibits primarily the development of micro-organisms in cosmetics. All preservatives listed are substances on the positive list of preservatives (Annex VI of the Cosmetics Directive)

en:definition:Generates pressure in an aerosol pack, expelling contents when the valve is opened. Some liquefied propellants can act as solvents

en:definition:Changes the chemical nature of another substance by adding hydrogen or removing oxygen

en:definition:Replenishes the lipids of the hair or of the top layers of the skin

en:definition:Imparts a pleasant freshness to the skin

en:definition:Maintains the skin in good condition

en:definition:Helps to avoid harmful effects to the skin from external factors

en:definition:Seeks to achieve an even skin surface by decreasing roughness or irregularities

en:definition:Dissolves other substances

en:definition:Helps lightening discomfort of the skin or of the scalp

en:definition:Improves ingredients or formulation stability and shelf-life

en:definition:Lowers the surface tension of cosmetics as well as aids the even distribution of the product when used

en:definition:Darkens the skin with or without exposure to UV

en:definition:Produces a feeling of well-being on skin and hair

en:definition:Protects the cosmetic product from the effects of UV-light

en:definition:Filters certain UV rays in order to protect the skin or the hair from harmful effects of these rays. All UV filters listed are substances on the positive list of UV filters (Annex VII of the Cosmetics Directive)

en:definition:Increases or decreases the viscosity of cosmetics

The best is to use the property use (P366) for that list. Snipre (talk) 20:01, 16 October 2016 (UTC)
I actually have a multilingual taxonomy with many more languages - --Teolemon (talk) 21:07, 16 October 2016 (UTC)

Importing Pigment CICN numbers (Colour Index)

  • The Colour Index International constitution ID (P2027) was created a while ago.
  • Many CICN numbers are present in labels or infobox ("CI 12345", "C.I. 12345", "Colour Index 12345").
  • So far, as I was looking for Mix N'Match import candidates, I've found short lists of pigments that are 20-100 values long.
  • However, the labels and external databases seem to have most of them.

Is it possible to source them from an external db or to REGEX them out from labels ?
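To make the regex option concrete, here is a minimal sketch in Python of pulling CI numbers out of label strings. It covers only the three spellings listed above ("CI 12345", "C.I. 12345", "Colour Index 12345"), plus the "Color" spelling. The five-digit length is an assumption taken from those examples, and the names CI_PATTERN and extract_ci_numbers are made up for illustration. Real labels would need manual review before any import.

```python
import re

# Hypothetical pattern: matches "CI", "C.I." or "Colour/Color Index"
# followed by exactly five digits (assumption based on the examples above).
CI_PATTERN = re.compile(
    r"\b(?:C\.?\s?I\.?|Colou?r\s+Index)\s*(\d{5})(?!\d)"
)


def extract_ci_numbers(label):
    """Return all five-digit CI number candidates found in a label string."""
    return CI_PATTERN.findall(label)


if __name__ == "__main__":
    for text in ("Pigment (CI 12345)", "C.I. 12345", "Colour Index 12345"):
        print(text, "->", extract_ci_numbers(text))
```

The pattern is deliberately case-sensitive and refuses longer digit runs, to limit false positives. Candidates found this way would still have to be checked against a reference source.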

It would be tremendously useful for Open Beauty Facts, that way we could decipher what's in your shampoo: List of ingredients of your favorite shampoo

--Teolemon (talk) 08:46, 17 October 2016 (UTC)

Some data can be found here but CI number is a non-free system. Snipre (talk) 10:00, 17 October 2016 (UTC)
My understanding is that having the identifiers on an item to link to their system is not an issue ? If they claim it's proprietary, this is very disturbing, since it is used on all the cosmetics you use daily, as if it was a standard… --Teolemon (talk) 12:11, 17 October 2016 (UTC)
List ready at the bottom of the page. What I think is the CAS number and the CICN. I'm not quite sure how to add statements based on another statement. (talk) 12:53, 17 October 2016 (UTC)
The problem is not importing the data, the problem is accessing the data. I don't think there is a free database with all CI numbers. We can use the numbers but we can't import the whole database into WD. Snipre (talk) 14:02, 17 October 2016 (UTC)

Creating Wikidata Items for GHS hazard statements

Currently the GHS hazard statements are stored as strings in items. I feel that creating items for each GHS hazard statement could be interesting, especially since the H302 will translate not only to an English sentence, but to sentences in many languages. Here's what I have done for Open Beauty Facts

--Teolemon (talk) 15:41, 16 October 2016 (UTC)

We can create the new properties under Wikidata:Property_proposal/Natural_science#Chemistry. Snipre (talk) 20:04, 16 October 2016 (UTC)
@Teolemon: That's what I proposed here. I think we need no property, we should just change P728 (P728) and P940 (P940) from string to property.--Kopiersperre (talk) 20:08, 16 October 2016 (UTC)
@Kopiersperre: We can't change the datatype of a property: we have to create a new one. Snipre (talk) 09:22, 24 October 2016 (UTC)
Since the properties are used in very few items, we may remove them all and then change the datatype. --Leyo 17:56, 24 October 2016 (UTC)
My source is : --Teolemon (talk) 21:31, 16 October 2016 (UTC)
Well, this is about the system previous to GHS (CLP in the EU), i.e. Dangerous Substances Directive (Q899329). --Leyo 17:56, 24 October 2016 (UTC)

Some resources

To invite you to share your tools, I have created a section for chemistry-related SPARQL queries at Wikidata:WikiProject_Chemistry/Tools#SPARQL_queries. To avoid re-importing wrong data from external databases, please report all errors at Wikidata:WikiProject_Chemistry/References#Report_of_errors_in_reference_databases. I hope we will find a way to contact the administrators of the different databases and inform them about problems in their datasets. Snipre (talk) 11:46, 19 October 2016 (UTC)

IECIC id (cosmetics in China)

Please review --Teolemon (talk) 07:18, 20 October 2016 (UTC)

Qualifier for reactions

Jasper Deng
Egon Willighagen
Denise Slenter
Daniel Mietchen
Emily Temple-Wood
Pablo Busatto (Almondega)
Antony Williams (EPA)
Devon Fyson
Samuel Clark
Tris T7
  Notified participants of WikiProject Chemistry. Has anyone figured out how to document reactions? Or is there a place where this is already being discussed? Over on Meta-Wiki I discovered there has been a proposal for a Wikichem (see discussion), which I think is a great idea but could use more feedback. And I foresee its implementation depending on how much Wikidata can support. Devon Fyson (talk) 05:25, 11 November 2016 (UTC)

@Devon Fyson: With WD, we don't need to create a new structure Wikichem. And when you see the activity in this project, which is the most similar to a Wikichem, I think you can easily deduce that a Wikichem will have very few contributors.
About reactions, there is no rule or model yet, but if you want you can start a section under Wikidata:WikiProject_Chemistry/Tools and put a draft reaction model there. Snipre (talk) 20:06, 15 November 2016 (UTC)