Wikidata talk:WikiProject Chemistry

Old discussions are archived in Archive 2013, Archive 2014, Archive 2015, Archive 2016, Archive 2017, Archive 2018, Archive 2019, Archive 2020, Archive 2021.

Issues related to chemistry-related properties or items

While cleaning up items as part of the work on metaclasses, I also described several issues on separate subpages of this WikiProject. Details are in the sections below. These are suggestions on how to solve a given problem, so discussion is highly recommended and any comments and remarks are most welcome. I know that there are still many other issues (like the issue with tautomers) that need to be addressed. Wostr (talk) 14:25, 11 August 2023 (UTC)

Issues with InChI and InChIKey

So far I have described only the problem with InChI (P234) and the 1500-character limit, which I moved and expanded from the property discussion page. For many months now I have been dealing with this by adding:

InChI
  some value
    reason for deprecated rank: exact value exceeds the maximum number of characters

(The value of this statement is set to "some value"; the rank is set to deprecated, with a specific reason for deprecation.)

and I propose to make this a (temporary) 'good practice' for this issue. The best option would be to increase the maximum number of characters, but that is probably not possible to the extent this property would need (at least 3–4 times the current limit). The other solution would be to split long InChIs and add them in fragments. However, this raises some problems: how to do it (separate statements with series ordinal (P1545), or in the form of qualifiers?) and whether it would be re-usable for users at all. Wostr (talk) 14:25, 11 August 2023 (UTC)
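A minimal sketch of the fragment-splitting idea, only to illustrate what a re-user would have to do. It assumes that each fragment is stored with a series ordinal and that the limit is 1500 characters; the function names are hypothetical and nothing here reflects an agreed data model.

```python
MAX_LEN = 1500  # character limit mentioned in the discussion (assumption: applies per fragment)


def split_inchi(inchi: str, max_len: int = MAX_LEN) -> list[tuple[int, str]]:
    """Split a long InChI into (series ordinal, fragment) pairs, ordinals starting at 1."""
    return [
        (ordinal, inchi[start:start + max_len])
        for ordinal, start in enumerate(range(0, len(inchi), max_len), start=1)
    ]


def join_inchi(fragments: list[tuple[int, str]]) -> str:
    """Rebuild the full InChI by ordering the fragments by their series ordinal."""
    return "".join(text for _, text in sorted(fragments))
```

The round trip works only if every fragment is retrieved and sorted by ordinal, which is exactly the extra burden on re-users raised above.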

Labels exceed the character limit

Problem similar to above. There is currently a limit of 250 characters for labels and aliases. In many cases, the chemical entities described in WD have only systematic names that are well over 250 characters long. In these cases, most often the items: (1) have no name, (2) have the name in the form of InChIKey, (3) have the name in the form of a different identifier (e.g. CID, UNII), (4) have the wrong name.

There seem to be two solutions here:

  1. don't set any name at all and leave labels and aliases blank
  2. use InChIKey as a temporary name.

Of the two options, I would suggest using the latter. InChIKey is a short, non-proprietary identifier that uniquely identifies a chemical structure. In addition, it would then be possible to automatically check the number of such cases by comparing label and InChIKey (P235).

Automatic and semi-automatic changing of labels is also a problem here. In some cases, databases contain short names that turn out to be incorrect and misleading (e.g. a name that is correct, but for a structure with a different spatial configuration). Therefore, along with the proposal to use InChIKey in such situations, I suggest that labels should be changed from InChIKey to other names only manually. Wostr (talk) 14:25, 11 August 2023 (UTC)

ChemSpider issue with 'group of stereoisomers'

In ChemSpider there are entries for both 'undefined' and 'unknown' stereocenters generated using non-standard InChI. In WD we do not distinguish between such entries, as we mainly describe chemical entities based on standard InChI. From our point of view, both identifiers in ChemSpider are valid and refer to the same item.

I propose to solve this by adding a proper qualifier and marking one ID as preferred:

The ID that uses the '?' symbol for the stereocenter (just as in standard InChI), which usually (always?) is the lower ID, would be marked as 'preferred'. The ID that uses the 'u' symbol for the stereocenter (which is present in ChemSpider to deduplicate structures) would keep the 'normal' rank. Wostr (talk) 14:25, 11 August 2023 (UTC)

Multiple entries in databases about ionic entities

In some databases (PubChem, ChemSpider) chemical entities that exhibit predominantly ionic character may have more than one entry. This is because it is difficult to show the ionic character of a bond in a structural formula, so the SMILES or InChI generation methods can in principle show either the ionic character or the covalent character of the bond. This leads to a situation where one chemical entity is described in databases in two ways, which results in duplication of (1) identifiers and (2) structure-related properties (SMILES, InChI).

I consider it a mistake to describe such representations of chemical structures in separate items, and I believe that only one item should exist in WD in such situations, but with two sets of identifiers.

In the case of duplicated identifiers, I suggest adding appropriate qualifiers and marking one of the identifiers as 'preferred'.

Why the 'normal' and 'preferred' ranks instead of a 'deprecated' rank for one of the identifiers? The entries in the database are not incorrect per se; one simply describes the chemical structure more accurately than the other.

In the case of duplicated structure-related properties (SMILES, InChI, InChIKey), I suggest deprecating one of the statements:

canonical SMILES
  [O-]S(=O)(=O)[O-].[Co+2]
  O=S1(=O)O[Co]O1
    reason for deprecated rank: covalent structure for a chemical entity with a predominantly ionic character

(The first value shows a correct representation of an ionic compound, so there is no need to set its rank to preferred. The second value shows a covalent representation of a predominantly ionic compound, so it is set to deprecated with a proper reason stated.)

In this case, one of the representations of the chemical structure is generally much less correct than the other, hence the 'deprecated' rank with the appropriate qualifier. Wostr (talk) 14:25, 11 August 2023 (UTC)

In some cases, which structure (ionic vs. covalent) is more correct isn't totally clear, because the structure differs between states of the substance. E.g.: in the CoSO4 example, the anhydrous compound might actually be a coordination polymer with sulfate ligands. NaCl is ionic in condensed phases, but molecular as a gas. Also, sometimes there are multiple solid-state polymorphs, e.g. sulfur trioxide. In principle, we could create a separate item for each structure, but that would quickly become unwieldy, and most sources don't specify precisely which polymorph they're talking about. I'm not sure of a great solution in general.
Separately, there are also some other kinds of chemical relationship that result in multiple entries. For example, in Talk:Q1792796#Bad data from Pubchem there seem to be two different entries with the correct atoms and linkages; if I'm reading them correctly, they're two legitimate resonance contributors to the same structure. 73.223.72.200 22:56, 11 August 2023 (UTC)
While I can't agree that e.g. NaCl forms covalent molecules in the gas phase, I agree that the presented approach would at best be true for normal conditions. Maybe the better approach would be to set all such statements to 'normal' rank and add proper qualifiers (like entry in a database describing the character of a chemical entity as covalent (Q121136454)) both to external IDs and to structure-related properties. I'm also not sure about creating multiple items for each structure. We are doing this for e.g. carbohydrates and some tautomeric forms, but apart from carbohydrates, other tautomeric forms cause a lot of problems in WD. It would be similar if we wanted to duplicate items for each form of representing a chemical structure, especially since, unlike tautomers, it is not even possible to isolate structures of this type; their existence is only the result of problems with generating a structure by certain tools or software. copper(II) acetylacetonate (Q1792796) is an example where the same chemical compound has three different entries in PubChem (I think all three IDs are correct here and the differences are due to the imperfections of the tools responsible for generating chemical structures). Wostr (talk) 13:45, 12 August 2023 (UTC)
If we want to go further with the above problems, we have to treat zwitterions and tautomers (ketone/enol, imine/enamine, lactam/lactim...) in the same discussion. Most duplicated entries in WD have these two major origins. The correct way would be to choose the more stable form at ambient conditions (a covalent bond for NaCl in the gas phase is not the common state of the molecule, so it can't be used to represent this salt), but there are not sufficient references to confirm which state is the more stable. I would therefore prefer to fix one form, in the case of zwitterions and tautomers, as the reference form for InChI/InChIKey/SMILES representations.
Then, whatever the chosen solution, we have to fix the constraint violations: there is no sense in keeping that kind of tool if there is no consistency between constraint rules and practice. If we decide that InChIKey is a single-value property (see InChIKey (P235)), then we have to respect that choice and delete multiple values, even those with deprecated status.
As a general trend, I prefer to avoid multiple values with qualifiers like those proposed by Wostr above. This is just a mess, especially when retrieving data from an external system like WP. Reality can't be modeled in every detail, and I prefer to simplify the data structure and work more on adding valuable information than on maintaining a complex structure of possible representations. Snipre (talk) 12:09, 14 August 2023 (UTC)
Starting from the end: leaving some statements in the items with 'deprecated' rank does not pose any problem for retrieving the data. That is why ranks exist in WD: the statement is 'deprecated', so (1) there is no risk that a new item will be added somewhere in WD based on such a statement, (2) the deprecated statement still allows the item to be found by any user, and (3) any re-user knows (or should know, based on the general WD model) that only 'preferred'/'normal' rank statements should be used.
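As a rough illustration of point (3), a re-user could select values by rank along these lines. This is a sketch, not a reference implementation: the dictionary layout (rank, mainsnak, snaktype, datavalue) is assumed to mirror the JSON form of Wikidata statements, and the function name is hypothetical.

```python
def best_values(statements: list[dict]) -> list:
    """Return the values a re-user should consume for one property.

    Deprecated statements and 'some value'/'no value' snaks are skipped;
    if any statement is preferred, only preferred ones are returned.
    """
    usable = [
        s for s in statements
        if s.get("rank", "normal") != "deprecated"
        and s["mainsnak"].get("snaktype", "value") == "value"
    ]
    preferred = [s for s in usable if s.get("rank") == "preferred"]
    chosen = preferred or usable
    return [s["mainsnak"]["datavalue"]["value"] for s in chosen]
```

Applied to the canonical SMILES example above, only the ionic representation would be returned, while the deprecated covalent one stays in the item and remains searchable.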
I agree that the problem is broader, but the best solution in one area may not be the best in another. Even if we agree that some structural representations should not be placed in items in WD (InChI, SMILES), we cannot do the same with identifiers. At this stage, it seems to me inevitable that some items will have more than one external identifier, because not every database treats data in the same way. On the other hand, in accordance with the general rules of WD, for each identifier in e.g. PubChem, ChemSpider or ChEBI you can create an item, and this item will be notable. So it seems impossible to have a 'single value' constraint in these situations; the only solution would be to go in the direction of 'one is preferred' (and establish which one we should mark as 'preferred'). In other databases the relation between entries is also not 1:1. A 1:1 relationship between WD items and other databases is a nice idea, but not feasible.
What's more, data consistency does not require using 'single value' for identifiers. There are many options among the simple constraints, and we can always add some complex constraints and maintain data consistency that way. Wostr (talk) 12:59, 14 August 2023 (UTC)
"Even if we agree that some structural representations should not be placed in items in WD (InChI, SMILES), we cannot do the same with identifiers"
Why? We have no obligation towards external databases to include all their entries. Instead of trying to merge all entries of external databases, we can, if we want, define what we as the WD community accept as possible items/values. If we decide that zwitterions are not valid for the creation of an item/statement, then we can exclude all external identifiers representing zwitterions. WD is not the phone book of external databases, trying to connect everything. Having an external identifier is for me not sufficient if we have a clear policy. Accepting everything is the perfect example of having no internal policy.
I just read an article on tautomerism, and there are 86 possible cases of tautomers. As described by the article, most databases are not consistent in their treatment of tautomers (see here). So accepting the existence of external identifiers as the only rule for item/statement creation will just import the mess of all databases into WD.
Then I am still waiting to see the effect of your data maintenance on a simple indicator: the number of constraint violations on the following pages. See
Wikidata:Database reports/Constraint violations/P662
Wikidata:Database reports/Constraint violations/P235
Wikidata:Database reports/Constraint violations/P231
The quality of WD is not its capacity to integrate all identifiers of most databases; the quality of WD lies in having a set of data that is well organized and follows a policy understandable to most external people. That's my opinion. But this is the value of a database in general. Snipre (talk) 14:47, 14 August 2023 (UTC)
The exception from the general notability rules (p. 2) would probably require a full-project discussion. Without it, anyone can add an item about a zwitterion or about a tautomer and frankly, we can't do anything about it, as such items are notable enough to be included in WD.
InChI V2, which you've mentioned, is likely to have better recognition of tautomeric structures, however, still in non-standard InChIs.
Leaving a certain part of external chemical databases outside WD will only result in one thing: constantly importing more items about these records from external databases. I have not seen any attempts to control this procedure so far, and with over 1.2M items already, manual control over it is impossible. The only solution I see is a proper policy on the cases in which we should have 'one-to-many' linking to external databases and on how to qualify these external IDs so that this is also understandable in an automatic way.
I've seen many times that 'deprecated' or duplicated IDs have been removed from items. This does not lead to anything good, because these external IDs will reappear at some point, and it will be months or years before someone discovers them among over a million other items. That's why I say, and will always say, that removing something like this is short-sighted and will only lead to more work. I have never seen any other database remove links to other databases, even duplicated ones. It works like redirects in WP: it's better to have more, even 'deprecated' ones, because it allows you to find these items and prevents them from being imported in the future. Wostr (talk) 16:41, 14 August 2023 (UTC)
Speaking as someone who grappled with these problems for many years before my retirement (from Syngenta), some issues can be "solved" by using StdInChI and StdInChIKey rather than InChI and InChIKey. The former pair render all possible tautomers of a compound into an identical string, and in my opinion this is correct when discussing chemical substances: tautomerism is a property of samples that also depends on temperature, solvent, etc. I can't see any situation where there would be multiple articles in Wikipedia to cover multiple possible tautomers; they would always be merged into one article. Likewise with zwitterions: we don't have separate articles for glycine as H2NCH2COOH and H3N+CH2COO-, despite the latter certainly being the reality for all glycine samples in aqueous solution. Going a step further, polymorphism is also a property of a sample, not a substance, although in rare cases (e.g. ice, sulfur, phosphorus) Wikipedia has multiple articles to cover these. That isn't the general case: I authored ROY (Q27281324), which has at least 13 polymorphs, and I don't think it would be helpful to have a Wikidata item for each. Michael D. Turnbull (talk) 15:09, 14 August 2023 (UTC)
But our items are not about substances only. Many, or even most of the statements refer to the molecular entities. And therefore, fortunately or unfortunately, our items are a combination of both definitions, substance and molecular entity. InChI/InChIKey has its limitations as even this identifier sometimes fails to properly describe a structure (there are situations in which the same substance has different InChIs as a result of incorrect recognition of tautomeric structures in the InChI software).
The reference to Wikipedia, on the other hand, I think is fundamentally incorrect. By definition, Wikidata is intended to describe the world in much more detail than encyclopedic articles. Chemistry is no exception here. Since Wikipedia only lists isotopes, should we do the same in Wikidata and remove all items about them? Because Wikipedia describes racemic mixtures, and there are no separate articles on individual stereoisomers, should we remove items about stereoisomers? Wikipedia is not an indicator of how Wikidata is supposed to work, much less is Wikidata merely an information base for Wikipedia. Wostr (talk) 16:41, 14 August 2023 (UTC)

Items with IUPAC Gold Book ID?

IUPAC Gold Book ID (P4732) is (to me) an important property: it links to a very important chemistry compendium (glossary/dictionary), Compendium of Chemical Terminology (Q902163). I don't remember the details, which is why I start this discussion, but I seem to remember a discussion on how to model this: should P4732 be used on items that represent a compendium entry (as in something like "described by encyclopedia page"). Indeed, compendium entries started to appear as images at https://commons.wikimedia.org/wiki/Category:Files_provided_by_IUPAC (I need to talk with Martin about this).

So basically, I see two options. The first is that IUPAC Gold Book ID (P4732) is linked to the Wikidata item about the concept, as the property description says (and then the images from Wikimedia Commons would be linked to the term directly, e.g. as for glass transition (Q825643)). The second option is that each compendium entry has a separate Wikidata item that is linked to the concept via "main subject" and/or "described by" (and then the images are linked to the item about the Compendium entry). These are alternative ways to model this. I would like to capture the preferred model as an EntitySchema. What do you think? Egon Willighagen (talk) 05:57, 10 October 2023 (UTC)

My impression is that the current usage of IUPAC IDs as identifiers suffices, very much like IEV identifiers. Sorry if I'm missing your point. Fgnievinski (talk) 07:00, 10 October 2023 (UTC)
I agree. It should stay as an external identifier to be used in Template:GoldBookRef (Q7204578). Regards Matthias M. (talk) 11:40, 10 October 2023 (UTC)
To be honest, converting the text into images is bad for accessibility. I think commons:Category:Files provided by IUPAC should not be used in Wikipedia. Matthias M. (talk) 11:59, 10 October 2023 (UTC)
There was a discussion somewhere (I think on this discussion page) about this problem, as there were (and sometimes still are) entries about Gold Book definitions that were imported based on DOI. The result of this discussion was to delete such entries and use IUPAC Gold Book ID (P4732) only in items about the concepts.
c:Category:Files provided by IUPAC, however, seems very problematic due to the accessibility issues mentioned above. These files should be included in Commons, but I'm not sure that they should be used anywhere in Wikimedia projects, and I'd be against using these files in WD. Wostr (talk) 18:11, 10 October 2023 (UTC)
In my opinion we should continue to use the property as proposed, on the concepts themselves. These are the key hubs for linking the data across domains. I would not favour making a mass of items about the Gold Book entries themselves. Since we have the images, I have no problem with the ID also being used as structured data on those images in Commons. --99of9 (talk) 23:07, 10 October 2023 (UTC)
Thank you, everyone, for the great feedback. I will pass this on. --Egon Willighagen (talk) 09:48, 15 October 2023 (UTC)

@Egon Willighagen:, @Matthias M.: @Wostr:- Sorry I missed this discussion at the time, but I'd like to follow up because we are currently working on this, and in fact some images were recently uploaded (I believe in the wrong format). I'm mostly focused on Wikipedia for now, but it could affect Wikidata too. Concerns were raised above about the use of images in Wikipedia because of accessibility, and I'd like to try to address this if we can. Unfortunately the use of images was due to a copyright concern forced upon us by the Wikipedia Community and IUPAC. WP editors deleted dozens of IUPAC definitions, stating they were plagiarised from the IUPAC website (which, in effect, they were - with the full blessing of IUPAC!). The IUPAC chemists then persuaded the IUPAC leadership to clarify the license for sharing, but IUPAC naturally didn't want their definitions altered, so they insisted on a CC-ND license. That was of course unacceptable to WP, which led to an impasse. The solution we came up with, acceptable to both sides, was to have an image containing the Gold Book definition released under a CC-BY-SA, linked directly to the Gold Book. I think the assumption was that anyone visually impaired would click through to the Gold Book site which I presume is fully accessible. The only concern raised by WP chemists was that there should be a link to the image file, not just to the Gold Book entry.

I worked with the Gold Book web guru to compare different formats for the image (please see these on my sandbox page, w:User:Walkerma/sandbox), and we agreed to use the format "Image with caption link". Is this the best that is possible, given our constraints? If not, can you suggest a viable option that improves accessibility, and maybe add it to my sandbox page? We hope to add a lot more definitions over the coming years, and we want to get the format right. Your input is most welcome. Thanks, Walkerma (talk) 19:03, 24 February 2024 (UTC)

You fundamentally misunderstood how Wikipedia works. We write articles citing sources, with the IUPAC Gold Book being one of the most valuable ones because it is an authoritative source and has stable DOI links. We don't copy the sources into inadequate image formats and paste them into related articles. This only creates additional work to clean up using mass deletion requests, which then also need to be discussed. Matthias M. (talk) 19:17, 24 February 2024 (UTC)
I'm not sure how this problem is related to Wikidata. I see no use for such images in Wikidata; in fact I see no use for them even in Wikipedia, but I'm not an editor of en.wiki, so I won't question this, although I would be strongly opposed to it on my home pl.wiki. My experience is that any quotes should be avoided if possible. So far, we have easily managed to write articles in pl.wiki in such a way that it was not necessary to quote IUPAC (or our Polish translations of their publications) in the articles. Writing the definitions in the article appropriately, based on the IUPAC definition, was completely sufficient, and IUPAC definitions are always accessible through references. Wostr (talk) 19:38, 24 February 2024 (UTC)
I disagree. I do agree that Wikidata should work this way, but certainly en:Wikipedia and even pl:Wikipedia are full of short quotations of this type, such as in w:First_Amendment_to_the_United_States_Constitution. Right from the early days of the English Chemistry WikiProject, there were chemists actively adding IUPAC definitions into articles, and in fact there was a task force devoted to it. I also don't understand why it must be "inadequate" if designed properly, and that's what I was really asking for help with. But maybe I should just limit my query to Wikipedia, because I know the discussion is of lesser interest to the Wikidata community. Walkerma (talk)
That probably results from the different approaches to copyright law in the US and in Poland, the fact that fair use is not allowed in pl.wiki, and the view that even legal use of the right to quote can sometimes cause problems with further use of the content. Of course, legal acts are not subject to copyright, so your examples are not adequate here, but we have had many discussions in pl.wiki concluding that quotes should be used only when necessary and to a very limited extent, and from my point of view this is not the case here at all. IUPAC definitions are published under an incompatible license, and the images that already have a compatible license will not help at all: they have accessibility problems and are inconvenient to use (in order for the text to be comparable in size to the text of the article, they must occupy more than half of the width in the new Vector skin). I would say that since IUPAC defends its definitions so much, you should do what every other wiki in the world does: simply create article definitions based on IUPAC definitions, and not try to include them in some strange way as a quote. Wostr (talk) 21:34, 24 February 2024 (UTC)
Thanks - that's helpful. The dilemma is that if we create our own definitions, or allow free editing of definitions derived from IUPAC definitions, we end up with original research (at best) and completely incorrect definitions (at worst). I came across a citation last month where someone had edited the text to say the opposite of the citation, and that is not uncommon. But if we use these official definitions verbatim in pure text, we run afoul of plagiarism rules. I know images can be made accessible, and I can ask for the images to be made smaller and less bright, if you think that would be preferable, to look more like the First Amendment article I cited. Cheers, Walkerma (talk) 22:27, 24 February 2024 (UTC)
I don't think I follow, because creating article text (so also definitions) based on sources is the core foundation of Wikipedia and it's a long way from OR. In pl.wiki we have FlaggedRevs so it is easier to catch any vandalism, but as far as I remember, there were boxes with IUPAC definitions in the articles in en.wiki that could be vandalised as easily as any other text. But leaving that aside, if anything is to come out of this discussion here, I have a few points:
1. I am using the new Vector skin, in my case e.g. in en:aerogel this image takes up about 60% of the width – maybe it should be placed in the middle? or in a separate section? maybe there should be a template that would expand this image after clicking on "official IUPAC definition [show]"?
2. accessibility: I do not think that alt text in the form of "IUPAC definition for aerogel" is sufficient, because it in no way replaces the image, which is the purpose of alt text. En.wiki is one of the few Wikipedias that have an Accessibility WikiProject; maybe they can advise on how best to handle such images?
3. in many projects, not linking to the image page is only allowed for photos in the public domain. Similarly, it seems to me that external links should not appear here either. Wouldn't it be better to leave a link to the image page (where the proper license information is) instead of a link to the IUPAC website, and put "Official IUPAC definition" as the caption with a footnote (with DOI and other info)? Wostr (talk) 00:37, 25 February 2024 (UTC)
You actually put File:IUPAC definition for aerogel.png under a Creative Commons license under which anybody can edit that definition. Then you are not an authoritative source anymore. I don't know which problem you are trying to solve. The IUPAC Gold Book is already very accessible and widely cited, even without the image spam in English Wikipedia. Matthias M. (talk) 10:41, 26 February 2024 (UTC)

I have now filed Wikidata:Requests for deletions#Bulk deletion request: Entry in IUPAC's Gold_Book. Regards Matthias M. (talk) 13:52, 3 March 2024 (UTC)

Can someone reimport chemical formula (P274)?

  Notified participants of WikiProject Chemistry. See Wikidata:Project_chat. Midleading (talk) 09:26, 19 October 2023 (UTC)

I think I found the chat you referred to. Reimporting PubChem is for me not on the table right now. It sounds very complicated, and requires evaluating the history of each item to see whether edits have been made since the import. What I can help with is creating curation lists. -- Egon Willighagen (talk) 09:48, 19 October 2023 (UTC)
I really don't see a problem here. Formulae imported from PubChem are correct, but the notation may not be the one you are looking for. There are many ways to write a chemical formula; Hill notation is the best to use in databases, but may not be preferred in other uses. So the problem here is the lack of the other formulae you are looking for. Wostr (talk) 00:00, 20 October 2023 (UTC)
PubChem is just an aggregator (like WD). If you notice a valuable annotation in PubChem, it often comes from data sources we already have or want to import directly, instead of from PubChem. That said, data sources we cannot import directly are patents, scrapings from the literature, and SID submissions from industry (if they have no other source). Concentrating on these cases would be valuable. --SCIdude (talk) 07:31, 21 October 2023 (UTC)
Hi,
I just started the deletion of 22,959 chemical formulas that were violating the constraints. (See https://quickstatements.toolforge.org/#/batch/215303, https://quickstatements.toolforge.org/#/batch/215304, https://quickstatements.toolforge.org/#/batch/215305, https://quickstatements.toolforge.org/#/batch/215306, and https://quickstatements.toolforge.org/#/batch/215307)
At the same time, I am also trying to complete the masses/formulas of compounds where they are not present and can be easily calculated.
The next step is to check whether the formula matches the SMILES, but this will take more time. AdrianoRutz (talk) 14:33, 26 October 2023 (UTC)
Many of these formulae seemed valid, but were imported with minor errors (like not using subscripts). It'd be good to check in which cases a formula can be reimported and which items will be left without a formula. Wostr (talk) 18:19, 26 October 2023 (UTC)
I will do it when all the correctly formatted ones have finished re-importing.
Or I could post the list here so it can be fixed before re-import. Otherwise I will simply re-import the violating ones, which probably represent only a few percent of the original amount before this operation. AdrianoRutz (talk) 10:05, 27 October 2023 (UTC)
@AdrianoRutz: this doesn't seem like the right way to approach constraint violations. Also, I don't really see on what basis you gathered these 22,959 formulas/items. You removed many formulas that seem perfectly valid, including very simple ones like B(OH)₃ in sassolite (Q424769). E.g. in lanthanite-(La) (Q3826951) I restored the formula and as far as I can see there is no constraint violation. In some items a simple fix was needed, e.g. in walpurgite (Q1531254) it was probably only necessary to replace * with ·. 2001:7D0:81DB:1480:45C:F54A:B289:F778 09:52, 27 October 2023 (UTC)
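The kind of simple fix mentioned in this thread (using · instead of * and subscript digits) could be scripted instead of deleting statements. A minimal sketch, assuming these two repairs are the only ones needed; the function name is hypothetical, the property's actual format regex is not reproduced here, and cases such as charges would need separate handling.

```python
import re

# Map ASCII digits to Unicode subscript digits.
SUBSCRIPTS = str.maketrans("0123456789", "₀₁₂₃₄₅₆₇₈₉")


def normalize_formula(raw: str) -> str:
    """Replace '*' with a middle dot and subscript digits that follow an atom or bracket.

    Leading coefficients (e.g. the 5 in '·5H2O') are left as normal digits.
    """
    formula = raw.replace("*", "·")
    return re.sub(
        r"(?<=[A-Za-z)\]])[0-9]+",
        lambda m: m.group().translate(SUBSCRIPTS),
        formula,
    )
```

Running the normalized value through the property's own format constraint before re-adding it would show which statements still need manual attention.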
Hi, I took the regexp of the property (https://www.wikidata.org/wiki/Property:P274#P2302). I am also re-importing several thousand of them, correctly formatted, in parallel.
Some might still remain without a formula because of the inability to generate it, but this should remain minor. I have downloaded all formulas locally as a safety measure. Happy to re-import all the ones that are still missing after curation if needed. AdrianoRutz (talk) 10:03, 27 October 2023 (UTC)
It does not look like you used (only) this particular regex. See above for an example that doesn't yield a constraint violation using the very same regex; many others in your batches probably don't either. If this is a reimport, why was it necessary to remove the statements in the first place? Also, reimport from where? PubChem? The examples that I checked (minerals) have no PubChem links and are not expected to have any. 2001:7D0:81DB:1480:45C:F54A:B289:F778 10:25, 27 October 2023 (UTC)
I will re-import the exact same formulas I deleted, without relying on external sources. I have them locally, so do not worry: no item that previously had a formula will be left behind.
I agree some edits could have been avoided but the end result will be cleaner.
For the rest, I am really happy to see that you care about those cases so much that you seem to forget the rest. AdrianoRutz (talk) 10:43, 27 October 2023 (UTC)
Well, these are not only a few cases. So far e.g. 4038 mineral species[1] in particular lack a chemical formula due to your recent edits. 2001:7D0:81DB:1480:45C:F54A:B289:F778 11:04, 27 October 2023 (UTC)
Well, this is 4038 out of 6049. To me, this indicates that the chemical formulas of mineral species do not conform to the current constraints.
In order to make clear that I never had any intention of deleting formulas without re-importing them properly, I prioritized their re-import: https://quickstatements.toolforge.org/#/batch/215387
For the rest, I will still wait for my other imports to finish before re-importing them.
I am happy my edits at least drew attention to things that had been left untouched for years. AdrianoRutz (talk) 11:42, 27 October 2023 (UTC)
Alright, thanks for restoring. Just in case, I mention that so far you didn't restore references, e.g. here. 2001:7D0:81DB:1480:F4F9:6B85:4F2E:EFED 08:25, 28 October 2023 (UTC)

@AdrianoRutz: I assume your (re)imports are done now, but lost references are still a significant issue. I made some more queries and found 1354 mineral species items where reference(s) got lost in your batches. In case you don't have the list, the items are the following:

Additionally, I found 13 mineral species items where P274 statements haven't been restored yet: Q3129310 Q973557 Q115520207 Q123167546 Q123169008 Q123155195 Q123163486 Q123168695 Q3782486 Q284146 Q123152967 Q123170689 Q401047. Please further process these 1354+13 items too.

Note that I checked only mineral species. It is very likely that similar problems concern other items in your batches too. So it might still be better if you undid everything and then redid it in a less messy way. 2001:7D0:81DB:1480:194B:C6D3:7F6:3878 08:26, 29 October 2023 (UTC)

Thank you for your continued concern and feedback. Seeing someone this motivated to improve the content of Wikidata is cool.
1. I did not only check the mineral species, but also other instances, such as mineral varieties, which I have already fixed. In case I missed some, happy to hear your feedback.
2. No, the editing was not over; your eagerness to have these things fixed outpaces my capacity to edit. Still, I again prioritized your concerns: https://quickstatements.toolforge.org/#/batch/215583
3. There were 1360, not 1354. Additionally, in the meantime, the number of formulas went up by more than 100,000 and the number of incorrectly formatted ones went down by 20,000.
4. For our next interaction I would appreciate less judgemental sentences and possibly a bit more positivity.
Best, AdrianoRutz (talk) 12:53, 29 October 2023 (UTC)
P.S.:
I checked the 13 additional items you mentioned manually, and thank you for pointing them out. These slipped under my radar as they contained 2 different chemical formulas, which was unexpected. I reverted the edits and hope someone more knowledgeable than me in the field will curate them. AdrianoRutz (talk) 12:59, 29 October 2023 (UTC)
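
For illustration, the kind of simple repair mentioned in this thread (replacing * with ·, using subscript digits) could be tried before removing a statement. The Python sketch below is only that, a sketch: `FORMULA_PATTERN` is a deliberately simplified stand-in and NOT the actual format constraint of chemical formula (P274), and `normalise_formula` and `is_valid` are hypothetical helpers. Charges, isotopes and variable compositions are not handled, so any output would still need manual review.

```python
import re

# Map ASCII digits to Unicode subscript digits.
SUBSCRIPTS = str.maketrans("0123456789", "₀₁₂₃₄₅₆₇₈₉")

# ASSUMPTION: simplified illustrative pattern, not the real P274 regex.
# Accepts element symbols and closing brackets followed by subscript counts,
# opening brackets, and "·" optionally followed by an ASCII coefficient.
FORMULA_PATTERN = re.compile(
    r"(?:[A-Z][a-z]?[₀-₉]*|[(\[]|[)\]][₀-₉]*|·[0-9]*)+"
)


def normalise_formula(formula: str) -> str:
    """Apply two simple repairs: '*' -> '·' and ASCII counts -> subscripts."""
    formula = formula.replace("*", "·")
    # Only digits directly after a letter or closing bracket are counts;
    # a coefficient after "·" (e.g. the 2 in "·2H2O") is left untouched.
    return re.sub(
        r"(?<=[A-Za-z)\]])[0-9]+",
        lambda m: m.group().translate(SUBSCRIPTS),
        formula,
    )


def is_valid(formula: str) -> bool:
    """Check a formula against the simplified illustrative pattern."""
    return FORMULA_PATTERN.fullmatch(formula) is not None
```

With such a check, statements that already pass (like B(OH)₃) would be left alone, and only those that still fail after normalisation would be listed for manual curation.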

How to handle incorrect PubChem entry

Q123257271 "6-bromo-2-mercaptotryptamine" matches the PubChem record titled with this chemical name. But this structure is mercaptomethyl, not mercapto, and SciFinder has no such structure. Instead, SciFinder has the actual mercapto structure under this name (CASNo 808113-54-4) and there is no PubChem entry matching that CASNo. The refs cited at en:BrMT are also actually for the mercapto structure. How should Wikidata handle this? Should this Wikidata item be a clone of the presumably incorrect PubChem record, and a new Wikidata item be created for the correct structure? It would be confusing to have two different items with the same chemical name. Or should this Wikidata item itself be updated (and the PubChem link omitted)? DMacks (talk) 05:56, 7 November 2023 (UTC)

Hi. Basically, both Wikidata and PubChem are based on the idea that each record has a unique InChI/-Key. When there is a mismatch between name and InChIKey, I normally resolve it like this: if there is a Wikipedia sitelink, follow what that says (because moving sitelinks is harder, and because Wikidata started out as a database linking Wikipedias); if there is not, I tend to follow the InChI/-Key (and the matching SMILES) and update/fix the name. The matching of the PubChem CID follows the InChI/-Key. When I pass the name through OPSIN (Q26481302), it indeed confirms that the name and structure do not match. I suggest updating the name. @Marbletan: maybe you can shed further light on this Wikidata item? --Egon Willighagen (talk) 06:14, 7 November 2023 (UTC)
I created this Item based on the content in the English Wikipedia article. I did not catch that the article conflated two different chemical compounds. I'm sorry for bringing the confusion from there to here. I think there should be two different Items to describe the two different chemical compounds, but I don't have a personal preference for which way it goes. PubChem's incorrect chemical name shouldn't be used. We can also use the "different from" property (P1889) to mitigate the potential for future confusion. Marbletan (talk) 13:32, 7 November 2023 (UTC)
I updated the name to match the structure. Is there a tool/bot for creating new Wikidata entries for newly created chemical pages? DMacks (talk) 01:04, 8 November 2023 (UTC)
I created Q123370393 manually. I'm not aware of a tool to automate it. Marbletan (talk) 13:24, 8 November 2023 (UTC)
Thanks. DMacks (talk) 14:09, 8 November 2023 (UTC)

Is model for and Modeled by

Please consider supporting Wikidata:Property_proposal/model_for. Thanks. Fgnievinski (talk) 02:57, 12 November 2023 (UTC)

Wikidata items for chemicals where Wikipedia may have more information

@AdrianoRutz and I worked out a federated SPARQL query with DBpedia to find Wikidata items that have an English Wikipedia article with a ChemBox, but no SMILES (can/iso/cx): https://w.wiki/8iUp It currently still lists over 500 Wikidata items that could have more information. I have started adding missing information, but despite my https://github.com/egonw/ons-wikidata/blob/main/Wikidata/createWDitemsFromSMILES.groovy script, it is still manual work for each item. So, feel free to help out. Egon Willighagen (talk) 08:34, 4 January 2024 (UTC)

Simple substances

We have e.g. 3 items: hydrogen (Q556) as the chemical element (whatever it would be), dihydrogen (Q3027893) as one of the simple substances consisting of it, and hydrogen molecule (Q19822725) as one type of molecule for them. In order to model physical properties (temperatures, elastic moduli and so on) consistently, I want to propose using them only on simple substance items, not on chemical elements (which have e.g. atomic number (P1086) and electron configuration (P8000)) or molecules (which have ionization energy (P2260) and electric dipole moment (P2201)). As one of the steps, I propose to change the constraints on melting point (P2101)/boiling point (P2102) from

to

. Is it ok? Infovarius (talk) 13:57, 9 January 2024 (UTC)

Having had a quick look at the subclasses of `substance` (see https://w.wiki/8njM), this does not look like a good idea. AdrianoRutz (talk) 14:08, 9 January 2024 (UTC)
Somewhere in the archives of this discussion page there are entries about this problem. There was a time when we had more items describing simple substances, but many of them were merged into items describing chemical elements. It is true that we should have these concepts described in different items; however, this would require a thorough discussion on how to do it, which statements should be put in which items, etc. Wostr (talk) 15:21, 9 January 2024 (UTC)
Well, dihydrogen is the hydrogen molecule. Trihydrogen also exists as a rarity, and there are ions: there is a dihydrogen cation and a trihydrogen cation. "hydrogen" could be the more general item, with included items, and a distinction between atomic hydrogen and the other ionic or molecular forms. Graeme Bartlett (talk) 07:58, 17 February 2024 (UTC)

Physical properties

A related problem is that I can't build a table of physical properties for all simple substances, as they are mixed with elements:

SELECT distinct * {
  {
    SELECT * {
      ?item wdt:P279|wdt:P31 wd:Q2512777 .
    } LIMIT 1000
  }
  OPTIONAL {?item wdt:P2101 ?melt.}
  OPTIONAL {?item wdt:P2102 ?gas.}
  OPTIONAL {{?item wdt:P1086 ?number.} UNION {?item wdt:P527/wdt:P1086 ?number.}}
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "ru" .
    ?item rdfs:label ?label . ?item schema:description ?description
  }
}

I got fewer than 50 results. --Infovarius (talk) 14:03, 9 January 2024 (UTC)

I am not sure I understand what you are trying to achieve, but if the issue is that simple substances are mixed with their corresponding elements, I would rather work on clearly distinguishing the two and modelling them correctly than on modifying the current constraint. AdrianoRutz (talk) 14:16, 9 January 2024 (UTC)
I just want to see a comparison of all melting temperatures of "chemical elements" (colloquially speaking). As I interpret it, this should be for simple substances. For example, when I want to know the phase transition temperatures of xenon (Q1106), where should I look? Yes, this problem is a consequence of the post above. --Infovarius (talk) 15:48, 10 January 2024 (UTC)
There is the concept of allotropes that allows linking the chemical element to the simple compounds consisting of it. I agree we should not have physchem properties on elements. Egon Willighagen (talk) 13:05, 13 January 2024 (UTC)

Chemical Reactions

  Notified participants of WikiProject Chemistry

Dear all, I was trying to see if some people had started linking chemicals to chemical reactions... So I started by looking at https://w.wiki/8uGb, which gives some results, but there seems to be quite a mess between P31 and P279 (as often), so it is hard to "easily" retrieve "true" chemical reactions. Further, it looks like many of them are not used at all (https://w.wiki/8uGj).

Is there someone experienced who could guide me?

Someone who could help clean things up?

Are there any good examples to follow for mapping such ideas, like the chemical reaction(s) leading to the dimerization of some dimeric alkaloids? AdrianoRutz (talk) 10:27, 20 January 2024 (UTC)

It would indeed be great to better document chemical reactions here! Encoding them could be a first step. I have created a page for SMIRKS (Q124450357) and proposed the corresponding property: https://www.wikidata.org/wiki/Wikidata:Property_proposal/Natural_science#SMIRKS. Could this be a way to better detail the chemical reactions listed in your previous queries? GrndStt (talk) 13:14, 7 February 2024 (UTC)

Saturation (chemistry)

I was pointed last night to saturation (chemistry) (Q1766412), which mixes up disambiguation, subproperties, etc. It's not an easy fix, and I will not get to cleaning this up in the next three weeks. I think it will take a bit of time, as there is some modelling involved and plenty of lookups of what the sitelinks refer to. For example, the English WP sitelink redirects (should it be removed?), the French WP lists a number of subproperties, it seems. Etc. --Egon Willighagen (talk) 14:54, 16 February 2024 (UTC)

I see at least 3 or 4 different concepts here (including disambiguation) and any attempt to solve this will probably be met with complaints from Wikipedia users that we are messing with sitelinks. Nevertheless, I'll try to clean this up. Wostr (talk) 17:37, 21 February 2024 (UTC)
I tried to clean this up a bit, all different concepts are now listed in different from (P1889), but it would be good to check these items after my edits. Wostr (talk) 18:39, 21 February 2024 (UTC)

Modelling: refine "group of stereoisomers"?

  Notified participants of WikiProject Chemistry

Shall we split group of stereoisomers (Q59199015): set of several stereoisomers into two items, namely "Group of fully undefined stereoisomers" and "Group of partially undefined stereoisomers"?

An example would be 20-Hydroxy-6,10,23-trimethyl-4-azahexacyclo[12.11.0.02,11.04,9.015,24.018,23]pentacosan-17-one (Q105173494): group of stereoisomers with the chemical formula C₂₇H₄₃NO₂ and (1R,2S,6R,9S,11S,14S,15S,18S,20S,23R,24S)-20-hydroxy-6,10,23-trimethyl-4-azahexacyclo[12.11.0.02,11.04,9.015,24.018,23]pentacosan-17-one (Q105173489): group of stereoisomers with the chemical formula C₂₇H₄₃NO₂ AdrianoRutz (talk) 08:31, 21 February 2024 (UTC)

I don't have an opinion right now. Some time ago we had e.g. a pair of enantiomers metaclass (a submetaclass of group of stereoisomers) – which I created with the thought that it would help organise these classes better, but it did not work at all. I also thought about a qualifier to subclass of (P279) group of stereoisomers (Q59199015) with a number of defined/undefined stereocentres. However, I did not even propose that, because of a number of issues: (1) we still have thousands of items incorrectly classified, because we lack a tool to properly analyse them (their structures), at least I don't have any knowledge of such a tool, (2) even existing tools (that are not used here in WD) have potential problems with properly analysing some structures ([2], [3]) as chiral/achiral, (3) in the past I tried to add some group of stereoisomers metaclasses based on InChI, but it could be done only for those entities that have partially defined stereochemistry (i.e. have a /b or /t layer with at least one ?) and proved to be much more complicated than I anticipated, with a bunch of errors, which I had to correct manually. That's why I can't support or oppose this right now – while we still don't have a proper solution to the existing problem of how to automatically distinguish structures that are isomerically defined from those which have at least one stereocentre undefined. Wostr (talk) 17:34, 21 February 2024 (UTC)
If we do split this, I would call the latter item "Group of partially defined stereoisomers". --99of9 (talk) 22:44, 21 February 2024 (UTC)
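
The InChI-based check described above (a /b or /t layer with at least one ?) can be sketched in Python as follows. This is a minimal illustration, not the procedure that was actually used: `stereo_layers` and `has_partially_undefined_stereo` are hypothetical helper names, only the first /b and /t layers are inspected, and, as noted in the thread, the test can only catch partially defined stereochemistry, since a structure with no stereo layer at all is not flagged.

```python
def stereo_layers(inchi: str) -> dict[str, str]:
    """Return the first /b (double bond) and /t (tetrahedral) layers of an InChI."""
    layers: dict[str, str] = {}
    # Skip the version prefix ("InChI=1S") and the formula layer.
    for layer in inchi.split("/")[2:]:
        if layer[:1] in ("b", "t"):
            # Keep only the first occurrence of each stereo layer.
            layers.setdefault(layer[0], layer[1:])
    return layers


def has_partially_undefined_stereo(inchi: str) -> bool:
    """True if a /b or /t layer contains at least one '?' (undefined centre or bond)."""
    return any("?" in value for value in stereo_layers(inchi).values())
```

Items for which this returns True would be candidates for a "partially defined" class, while deciding between "fully defined" and "fully undefined" would still need a structure-based tool.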

How to use P2874 identifier?

Please check out this thread. Thank you! Horcrux (talk) 09:36, 13 March 2024 (UTC)

The original discussion suggests two ways: one is as an external identifier for an assay and the other as a reference. How do you want to use it? Egon Willighagen (talk) 18:59, 13 March 2024 (UTC)

Modelling of canonical/isomeric SMILES on isotopes/isotopocules

  Notified participants of WikiProject Chemistry

Currently, canonical SMILES (P233) and isomeric SMILES (P2017) are mapped inconsistently on isotope (Q25276) and Isotopocule (Q115801582).

For examples of the different variants that currently co-exist, see protium (Q12830437), deuterium (Q102296), tritium (Q54389), carbon-14 (Q840660) or CTK8F2337 (Q82300470). Should we coordinate on a single mapping?

I am happy to do some cleanup once we have agreed on the best way to do it. AdrianoRutz (talk) 14:22, 22 April 2024 (UTC)
