About this board

I'm quite busy in my real life, so I may not respond swiftly to comments on this page.

Archived discussion (from before March 10, 2018) is available here.

Egon Willighagen (talkcontribs)

Hi, thanks for catching this. The SMILES matches the InChI/InChIKey and PubChem CID. These also seem wrong.

Wostr (talkcontribs)

I didn't notice that :(

Egon Willighagen (talkcontribs)

I need to run more consistency checks, it seems... so much to do, so little time :( (Overall, I think it's in pretty good shape :)

SCIdude (talkcontribs)

Please correct this. How many did you get wrong in this batch?

Wostr (talkcontribs)

Yes, you're right that this item describes a stereochemically defined compound. I did not know that there may be InChIs with "?" in sublayers /b or /t that are not an indication of an undefined stereocenter. The problem is the 3-iminopyrazol-1-yl group. The InChI from PubChem indicates that the double-bond stereochemistry of H-N=C< is undefined. However, it's hard to reproduce this in any software available to me, and even the PubChem structure redrawn in ChemDraw gives a different InChI.

I'll check all the InChIs in these 4 batches for possible 'false positives' in the /b sublayer. I can't tell you right now how big this problem is, but I'll contact you as soon as I know. Then I'll correct all incorrectly changed statements, but I can't tell you right now whether these kinds of errors are occasional, so that I can correct them manually, or whether I'll have to use semi-automatic tools.

SCIdude (talkcontribs)

It may be a problem with InChIs from ChEBI. If so, we should replace them in bulk if that solves the problem.

Wostr (talkcontribs)

After a quick check I see that there may be about 20–30 items in these batches that have to be checked. I'll do that manually (however, I don't know when; probably tomorrow or the day after, as I have a really busy week at work).

Wostr (talkcontribs)

There are 30 items that I'll be reviewing, all are listed here. It seems that most of the problems result from double bonds on nitrogen atoms or double bonds in rings. I'm not sure why the InChI in these items shows e.g. the oxime group HO-N=C< as a group that should have defined stereochemistry. However, there are situations like Heme O (Q620211): the InChI from PubChem shows an undefined configuration for many double bonds of the porphyrin, while the InChI from ChemSpider shows all those double bonds as stereochemically defined.

I'll try to check whether these InChIs from PubChem are correct for these items. Maybe we should have more than one InChI in such situations (even with a deprecated rank).

SCIdude (talkcontribs)

This problem also showed up with my current ChEBI InChIKey fixes, resulting in different keys. I agree multiple InChIs and keys are unavoidable. But when using different ranks, we should have a consistent way to assign them. For example, do we prefer not to specify oxime bonds? What about diazo -N=N- bonds? PubChem usually leaves them unspecified (I agree with this). As to porphyrin bonds, is the (E) configuration geometrically possible? If not, the bond does not need to be specified. This has to be defined on some project page. Could you please do this?

SCIdude (talkcontribs)

BTW I think I found out why the ChEBI InChIs may have a problem. Take the SMILES "C1C[C@@H]2CC[C@H]1C2", which is norbornane with redundant stereo information (the centers are potentially stereogenic, but not in this case). When this is input in PubChem, the stereo specs are removed automatically; ChEBI does not remove them. As a result the InChIs become different. So effectively it's a ChEBI software problem.

Wostr (talkcontribs)

As I thought that ~30 incorrect items is a very low number given the scale of QS batches, I checked the whole batch in Excel rather than trying to query it using SPARQL from WD.

I found 774 potential InChIs that may have ? in the /b sublayer and may not be a group of stereoisomers. I've manually checked all the items (unfortunately, most have only one source – the DSSTOX database – because they were created by GZWDer imports) and found:

  • about 58.5% are correct (mostly undefined configuration of C=C bonds or substituted diazo bonds)
  • about 23.8% have to be checked more carefully, however I think that most are correct (about 95% of these are undefined configurations of double bonds in rings with eight or more members – I checked that it is possible to have an eight-membered ring with at least one E double bond, so probably these items are correctly described as 'group of stereoisomers')
  • about 17.7% are probably incorrect (about 85% of these are unsubstituted imino groups that are treated in many databases as stereochemically undefined, however at least 3 different InChIs can be assigned for such situations; the rest are unsubstituted diazo groups, heterocyclic compounds or some weird borderline cases + some items in which InChIs from different sources differ).

I'll post on the Wikidata:WikiProject Chemistry discussion page in a few days about this problem. Most incorrectly added 'groups of stereoisomers' for compounds with an unsubstituted imino group can be reverted using QS, so it won't be a problem to do it technically, but we have to do it in a uniform way for all cases.

The problem you've mentioned about norbornane and redundant stereo descriptors may cause additional problems in the future. I added 'group of stereoisomers' to items that have ? in InChI sublayers /b or /t. However, there are also many InChIs for groups of stereoisomers that lack these sublayers: if a compound has 2 stereocentres and 1 is undefined, there is a /t sublayer with a ? descriptor for one stereocentre; if both stereocentres are undefined, there is no /t sublayer. So we still have thousands of 'groups of stereoisomers' (with all stereocentres undefined) that are classified as 'chemical compounds'. I asked Egon Willighagen about the script he wrote in 2019 – items like norbornane may be false positives that we should try to exclude.
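
The rule described above (flag an item when ? appears in the /b or /t sublayer) can be sketched in a few lines of Python. This is only an illustration, not the procedure actually used for the QS batches: the function names and the three bucket labels are hypothetical, and the check works on the InChI string alone, so it cannot recognise the false positives discussed in this thread (e.g. unsubstituted imino groups).

```python
def stereo_sublayers(inchi):
    """Return the bodies of all /b and /t sublayers found in an InChI string."""
    found = {"b": [], "t": []}
    # Segment 0 is the version prefix and segment 1 the formula, so skip both.
    for segment in inchi.split("/")[2:]:
        prefix = segment[:1]
        if prefix in found:
            found[prefix].append(segment[1:])
    return found


def classify_stereo(inchi):
    """Sort an InChI into one of three rough buckets (hypothetical labels)."""
    layers = stereo_sublayers(inchi)
    bodies = layers["b"] + layers["t"]
    if not bodies:
        # Either there are no stereo elements at all, or every one of them is
        # undefined: the string alone cannot tell these two cases apart.
        return "no stereo sublayer"
    if any("?" in body for body in bodies):
        return "flagged"
    return "fully defined"


# (E)-but-2-ene: the /b sublayer carries a '+' and no '?'.
print(classify_stereo("InChI=1S/C4H8/c1-3-4-2/h3-4H,1-2H3/b4-3+"))
```

The "no stereo sublayer" bucket is the one that hides the fully undefined groups of stereoisomers mentioned above, so a string test like this one would have to be combined with a structure-based check.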

Reply to "Wrong group statement"
Charles Matthews (talkcontribs)

Let me explain about the "inorganic" definition that was used there. It comes from https://www.ncbi.nlm.nih.gov/mesh/68007287, i.e. from the MeSH descriptor ID (P486) system. The definition may not be standard in its treatment of carbon compounds, but MeSH is a major system for searching the medical literature. I have used it to add several thousand links to the item: see for example https://www.wikidata.org/w/index.php?title=Special:WhatLinksHere/Q73543107&limit=2000, where links to Q73543107 persist. I suppose those links will be updated soon.

The information in the corresponding main subject (P921) statements has an exact scope as defined by the MeSH page, and so with the given description. Because this information has value, I would like to undo the merge you did, and update Wikidata:Do not merge. I think where Help:Merge#Check to be sure talks about "subtle differences", that advice applies here.

Wostr (talkcontribs)

I see no difference between these two items, and in fact there is none. Inorganic compounds do not have only one definition, and this class is usually defined as "any compound that is not organic" = "any compound that does not contain carbon, with the exception of carbonates, ..., ...". Keeping these two items separate is a mistake, because this is the same concept.

You may expand the definition in Q190065, because both definitions can be used interchangeably.

Charles Matthews (talkcontribs)

I think we are not understanding each other.

It does seem, from the English Wikipedia article "inorganic compound", that chemists are not very interested in the inorganic/organic boundary. They probably aren't concerned with it, in a practical way.

MeSH, Medical Subject Headings, is a "comprehensive controlled vocabulary for the purpose of indexing journal articles and books in the life sciences". The vocabulary is controlled by scope notes that define how it is properly used. In other words the index terms from the vocabulary carry the value of fairly precise definitions.

Here there are indeed multiple concepts. I work with a bot that imports these terms from the PubMed repository, where they are attributed by the MEDLINE indexing system. That indexation is carried out by human experts.

I certainly wish that in giving main subjects to biomedical papers, referencing PubMed, I'm able to attribute exact meanings.


Wostr (talkcontribs)

I'm not interested at all what is written in any language version of Wikipedia. Sitelinks to other Wikimedia projects are only an addition to Wikidata and do not define the Wikidata content.

The definition of inorganic compound in chemistry may be written as follows: "any chemical compound that is not organic, i.e. does not contain carbon, with the exception of some compound classes traditionally defined as inorganic, e.g. carbon monoxide, carbon dioxide, carbon disulfide, carbon diselenide, carbides, hydrogen cyanide, carbonic acid, cyanic acid, isocyanic acid, fulminic acid and their salts, like carbonates, hydrogen carbonates, cyanides, cyanates". This definition covers both the English Wikipedia definition and the MeSH definition.

Charles Matthews (talkcontribs)

OK, thank you for the description, which I shall add to the item. We shall have to agree to disagree on other matters.

Wostr (talkcontribs)

Maybe there is some difference between these two concepts that I just can't see, but it must be something other than the definition from https://www.ncbi.nlm.nih.gov/mesh/68007287. The MeSH definition is one that is widely used for inorganic compounds.

BTW I'm not sure about items describing compounds of a specific element. I've just noticed that all entries in MeSH like this one: https://meshb.nlm.nih.gov/#/record/ui?ui=D017610 are for inorganic compounds only. However, in WD we don't have items for inorganic compounds of a specific element yet. So Q12548019 has MeSH id (inorganic) matched to 'calcium compound' (both organic and inorganic), Q74819737 has description about inorganic compounds, but is matched to category for both organic and inorganic compounds.

Is it intentional (and narrow match (Q39893967) as a qualifier should be used) or maybe I should create items for 'inorganic compound of ...' and move MeSH id to such items?

Charles Matthews (talkcontribs)

For a very accurate analysis of the MeSH identifiers for compounds, one can look at the MeSH tree codes (P672). E.g. https://www.ncbi.nlm.nih.gov/mesh/68058085 for iron compounds has two, D01.490 and D02.691.550, where D01 means inorganic and D02 organic. https://www.ncbi.nlm.nih.gov/mesh/68017612 for gold compounds has only D01.379, because the scope is inorganic only. So, yes, it would be possible to make these distinctions, and check them with queries.
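
As a rough sketch of such a check, assuming the tree codes of an item have already been fetched as plain strings: the function names are hypothetical, and only the two branches named above (D01, D02) are mapped, every other branch is ignored.

```python
# Top-level branches named in the discussion; other branches are ignored.
BRANCH_SCOPE = {"D01": "inorganic", "D02": "organic"}


def mesh_scopes(tree_codes):
    """Map MeSH tree codes to the scopes implied by their top-level branch."""
    scopes = set()
    for code in tree_codes:
        branch = code.split(".")[0]
        if branch in BRANCH_SCOPE:
            scopes.add(BRANCH_SCOPE[branch])
    return scopes


def is_inorganic_only(tree_codes):
    """True when every recognised branch of the descriptor is D01."""
    return mesh_scopes(tree_codes) == {"inorganic"}


# The two examples quoted above.
print(sorted(mesh_scopes(["D01.490", "D02.691.550"])))  # iron compounds
print(is_inorganic_only(["D01.379"]))                   # gold compounds
```

A descriptor that comes out as inorganic-only while being matched to an item covering both organic and inorganic compounds would be a candidate for the narrow-match treatment asked about earlier.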

At present I'm working to complete the P486 dataset, and this is not my major concern. There are some thousands of statements still to add.

Jim Hokins (talkcontribs)
Wostr (talkcontribs)
Jim Hokins (talkcontribs)
Wostr (talkcontribs)

No, you're not right. Check the IUPAC Recommendations (https://www.degruyter.com/view/j/pac.2008.80.issue-2/pac200880020277/pac200880020277.xml). Any unlabelled atom is assumed to be a carbon atom with the maximum number of hydrogen atoms attached. Only terminal carbon atoms are labelled in the preferred style; not labelling any carbon atoms, as in File:Diethyl ether chemical structure.svg, is acceptable. And this File:Ethyl functional group.svg is an ethyl group, which may be drawn like this File:Ethyl group.png, but this style is not preferred.

Jim Hokins (talkcontribs)

Ok. But Wikipedia readers may not be aware of this recommendation. Wikipedia readers need a complete formula. This is my humble opinion. I will not insist on my opinion. Regards, Jim Hokins.

Wostr (talkcontribs)

This is not about some unknown recommendations. This is the way chemical structures are drawn in chemistry worldwide. It is well understood, or at least it should be, by anyone from secondary school upwards.

Local template problems with your changes

Def2010 (talkcontribs)

This change, and similar ones, broke a template in ruwiki. If you need the URL property for some reason, maybe you should duplicate the value under the new property rather than remove the old one? Def2010 (talk) 10:17, 6 February 2020 (UTC)

Wostr (talkcontribs)

You should change the template then, check the constraints (and then probably modify all the statements using this incorrect property or allow both properties in your infobox). reference URL (P854) is for sources only and cannot be used as a qualifier, the proper qualifier is URL (P2699). I don't need this anywhere, I'm just correcting constraint violations.
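
For anyone wanting to see which statements are affected, here is a minimal sketch of the check. It assumes the usual entity JSON layout (claims keyed by property, each statement with an optional qualifiers dictionary keyed by property ID); the function name and the sample data are hypothetical.

```python
REFERENCE_URL = "P854"  # intended for references only, not for qualifiers


def statements_with_reference_url_qualifier(entity):
    """Yield (property id, statement id) for statements using P854 as a qualifier."""
    for pid, statements in entity.get("claims", {}).items():
        for statement in statements:
            if REFERENCE_URL in statement.get("qualifiers", {}):
                yield pid, statement.get("id")


# Hand-made fragment; the ids and values are placeholders, not real data.
example = {
    "claims": {
        "P1": [
            {"id": "Q0$a", "qualifiers": {"P854": [{"snaktype": "value"}]}},
            {"id": "Q0$b", "qualifiers": {"P2699": [{"snaktype": "value"}]}},
            {"id": "Q0$c"},
        ]
    }
}

print(list(statements_with_reference_url_qualifier(example)))
```

A template that has to keep working during the transition could run the same kind of test and accept either P854 or P2699 as the qualifier.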

Def2010 (talkcontribs)

The template is protected, and because of its complexity there are not many volunteers who have time to adapt it to Wikidata changes. Another one, for NFPA 704, had not been updated for some years despite my repeated requests; only recently did someone fix it. But the template has an important feature: it just works as long as the property is consistent, so I will have to duplicate the values to ensure it continues to do so. Def2010 (talk) 10:57, 6 February 2020 (UTC)

Wostr (talkcontribs)

What I'm saying is that sooner or later these incorrect qualifiers will be fixed, with a prior notification or without it (as it is just correcting incorrect statements – not changing the whole model – I don't think it is an obligation to inform anyone outside WD). I can refrain from correcting these particular issues, but other users may want to fix these statements either manually or (semi-)automatically in the future.

Nadzik (talkcontribs)

Thanks for pointing that out, something must have slipped past me; I probably meant to write "is a subclass of", or something similar. Regards! ~~~~

RE - 'added CA prop 65 relation info'

Gtsulab (talkcontribs)

Thanks for your input. I was imitating what I saw with the imported NIOSH data (instances of occupational carcinogens). I can change the relationships from instances to 'subject has role'. On a partially related note, is this also the preferred method of relating drugs to drug classes? Is there a good database for drug-drug class relations that can be added to WD?

Wostr (talkcontribs)

As I wrote – that is only what I think would be best. There are no guidelines in chemistry or medicine regarding this. Everyone adds new statements based on their own judgement, so sometimes related statements are added using both 'instance of' and more precise properties in the same item.

Gtsulab (talkcontribs)

To be consistent with the NIOSH data which was added by a Wikimedian in residence at the time, I'm going to finish adding the CA Prop65 data in this manner. Once the chemistry or medicine communities arrive at a consensus about how to handle this, I'll write a script to make the necessary changes.

Piastu (talkcontribs)

Hmm… I think it's OK – it was about the occupation performed, not a specific force from our own backyard – in this case for Q11698741 and occupation (P106).

Piastu (talkcontribs)

OK, as you see fit – when adding it I was guided by the spread in the English aliases – cop, fed, constable or bobby.

Scs (talkcontribs)

In this edit you wrote, "en.wiki does not want to use WD data". Once upon a time, I thought that the whole point of Wikidata was to be used by the 'pedias, and while it took a while for that linking to get off the ground, at one point it seemed like it was actually picking up. So I'm more than a little dismayed to hear that there are strong wishes not to use WD data at all, after all. Do you have any links to discussions or decisions formalizing this attitude?

(I'm not arguing with you or doubting you, though; as a matter of fact I just came across w:Wikipedia:Short description, which explains that "Initially short descriptions were drawn from the Description field in Wikidata entries, but because of concerns about including information directly from another project, the Wikimedia Foundation (WMF) made provision for these to be overwritten by short descriptions generated within Wikipedia", followed by the direction that "Eventually, all articles should have a short description template", i.e., according to that page there should be no descriptions drawn from Wikidata at all. Foo.)

Wostr (talkcontribs)

I can't give you any links because, as I'm not a regular en.wiki user, I don't know where such discussions were held and I'm not familiar with the en.wiki discussion/archive page structure. However, I remember for example an RFC in en.wiki some two years ago(?) about the use of WD in en.wiki infoboxes, and the most popular option was to not use WD at all (the summary of this RFC was something like: WD could be used in en.wiki if there is some assurance that the data meets en.wiki rules /verifiability etc./). I think there were notifications in Project chat about several discussions in en.wiki about the use of WD data, and I never heard that en.wiki users are eager to use WD. Frankly, I can't blame them: I have many reservations about the use of WD data in my home wiki (pl.wiki), because policies about verifiability and quality of data are much less restrictive here than in pl.wiki. However, I try to use WD data where I have a lot of confidence that articles will not lose quality or verifiability, and also to use the data from articles (infoboxes) to verify the WD data or add sources to WD statements.

About 'Short description': this is how it should be done in every project. The Wikidata 'description' wasn't meant to be a description for Wikipedia articles, but the WMF thought otherwise. This WMF mistake resulted in more people now opposing WD data altogether, because they see that a WD description can be vandalised by an IP user without Wikipedia users knowing, and sometimes such a description remains unreverted for days, weeks or even months.

Scs (talkcontribs)

All fair points. Thanks.

Powerek38 (talkcontribs)

Hi! Well, this may be a somewhat academic discussion: whether property constraint (P2302) or rather subject item of this property (P1629) matters more in a property's documentation. Personally, I take the position that fitting within the property's current constraints is more important (that is also how Harvest Templates is set up), and in that sense I consider my edits correct. I rather disagree with you that constraints are set so that errors do not show up. In my opinion that is looking at the problem backwards: error messages are only a technical effect of much more serious decisions about the design of Wikidata as a database, which find their expression in the constraints being set one way and not another. I also disagree that there were no such imports before: some two years ago I myself used the same tool to import ranks, including those of police officers, into the same property. In short: sure, this is a topic for further discussion in a wider group, but as of today I do not consider these statements "evidently incorrect". However, I fully agree that it is worth extending P1629 here, so that there is no mess within a single P page. Regards!