About this board

I'm quite busy in my real life, so I may not respond swiftly to comments on this page.

Archived discussion (from before March 10, 2018) is available here.

ואליום (talkcontribs)

Dear Wostr,

Now I understand why I think your work is destructive, and why you probably think the same of my recent changes. Sorry for that!

The German articles, like those in most of the smaller languages, always deal with both family of isomeric compounds (Q15711994) (isomer family) and structural class of chemical compounds (Q47154513) (structural class).

How should your system work? Let's take an example: ciprofloxacin (Q256602) has many substructures, which in turn have substructures of their own:

Should only the most specific or all valid substructures be tagged?

Using a specific property "has substructure" (not P31, not P279)

  • could save us from concepts incompatible between languages
  • would avoid the need to always have two same-named items for the isomer family and the structural class

(I had to change my user name because somebody said it would be offensive)

Wostr (talkcontribs)

1. Wikidata is much more detailed than any Wikipedia. Moreover, Wikidata is an independent project that does not mirror Wikipedia's structure of categories or articles. That is why Wikipedia sitelinks are only an addition to Wikidata items: the sitelinks do not define the scope of a Wikidata item, the statements do.

2. In Wikidata we cannot have two concepts mixed into one item, so we cannot merge items on the basis that 'in Wikipedia one article covers both concepts'. This is not Wikipedia, it's Wikidata.

3. The current system of chemical classification in Wikidata is flawed, but it will be corrected in the future once it has been properly discussed in WikiProject Chemistry. Right now only the lowest class should be included in each item using P31. I don't know of any 'concepts incompatible between languages' right now; chemistry-related terms are usually language-independent. However, problems arise when users try to impose Wikipedia's article structure upon Wikidata items: there are situations where 1 article in Wikipedia covers dozens or even hundreds of Wikidata items.

4. I'm not sure how this 'substructure' classification would work. Wikidata is not a project where we create original ways of classification; we try to adopt solutions that are present in reliable sources. The Wikidata classification largely reflects the ChEBI classification.

5. As I wrote, several items having the same name is not a problem. The description of an item is used for disambiguation, so the same name can be present in many items.

ואליום (talkcontribs)

> there are situations where 1 article in Wikipedia covers dozens or even hundreds of Wikidata items

I'm fully on your side that Wikidata shall not exist for Wikipedia's purposes; it is a mere ontology project.

--

How would you perfectly classify ciprofloxacin (Q256602) (after all discussions have finished)?

It would be great if you could draw an SVG diagram showing all P31 and P279 statements regarding the structural classification of an example compound.

--

PS: Are you mainly a chemist or a programmer?

Reply to "Language differences"
Сексмонстр (talkcontribs)

Good evening,

the German article of is about Q110811139 and not 2-Oxazolines like Q413567.

Shall we switch the meaning of the Wikidata items or their sitelinks?

Wostr (talkcontribs)

Usually, the sitelinks should be moved between items, because items are connected to other items and changing the scope or definition of an item would make some statements false. However, you should first check which item is correct: oxazoline (Q110811139) or oxazoline (Q72498300).

Reply to "Oxazolines"

Call for participation in a task-based online experiment

Kholoudsaa (talkcontribs)

Dear Wostr,

I hope you are doing well.

I am Kholoud, a researcher at King's College London, and I work on a project as part of my PhD research, in which I have developed a personalised recommender system that suggests Wikidata items for the editors based on their past edits. I am collaborating on this project with Elena Simperl and Miaojing Shi.

I am inviting you to a task-based study that will ask you to provide your judgments about the relevance of the items suggested by our system based on your previous edits.

Participation is completely voluntary, and your cooperation will enable us to evaluate how accurately the recommender system suggests relevant items to you. We will analyse the results in anonymised form and publish them at a research venue.

The study will start in late January 2022 or early February 2022, and it should take no more than 30 minutes.

If you agree to participate in this study, please either contact me at [] or use this form https://docs.google.com/forms/d/e/1FAIpQLSees9WzFXR0Vl3mHLkZCaByeFHRrBy51kBca53euq9nt3XWog/viewform?usp=sf_link

I will contact you with the link to start the study.

For more information about the study, please read this post: https://www.wikidata.org/wiki/User:Kholoudsaa

In case you have further questions or require more information, don't hesitate to contact me through my mentioned email.

Thank you for considering taking part in this research.

Regards

Reply to "Call for participation in a task-based online experiment"
Gremista.32 (talkcontribs)

I did something wrong by putting a space

I don't know if you have any recommendations

Reply to "Oxidation numbers"
SCIdude (talkcontribs)

FYI I have added items with 13C and 14C to the aldehydo-hexoses:

aldehydo-hexose (Q105024342) ↑
├──aldehydo-(¹³C₆)hexose (Q82881663)
│   -aldehydo-L-(¹³C₆)idose (Q82842547)
│   -aldehydo-D-(¹³C₆)glucose (Q105108360)
├──aldehydo-galactose (Q100602655) ↑
│   -aldehydo-D-galactose (Q27102217)
│   -aldehydo-L-galactose (Q27117209)
├──aldehydo-allose (Q100604517) ↑
│   -aldehydo-D-allose (Q423216)
│   -aldehydo-L-allose (Q27117249)
├──aldehydo-gulose (Q101095964) ↑
│   -aldehydo-D-gulose (Q423227)
│   -aldehydo-L-gulose (Q27117231)
├──aldehydo-altrose (Q106941265) ↑
│   -aldehydo-D-altrose (Q423207)
│   -aldehydo-L-altrose (Q72437509)
├──aldehydo-glucose (Q106941538) ↑
│   -aldehydo-L-glucose (Q3266724)
│   -aldehydo-D-glucose (Q21036645)
│   -aldehydo-D-(6-¹³C)glucose (Q82694157)
│   -aldehydo-D-(1,6-¹³C₂)glucose (Q82694158)
│   -aldehydo-L-(1-¹⁴C)glucose (Q82877697)
│   -aldehydo-D-(6-¹⁴C)glucose (Q83061294)
│   =aldehydo-D-(¹³C₆)glucose (Q105108360)
├──aldehydo-idose (Q106947809) ↑
│   -aldehydo-D-idose (Q423179)
│   -aldehydo-L-idose (Q27277756)
│   =aldehydo-L-(¹³C₆)idose (Q82842547)
├──aldehydo-mannose (Q106964021) ↑
│   -aldehydo-D-mannose (Q27117223)
│   -aldehydo-L-mannose (Q27117227)
│   -aldehydo-D-(1,2-¹³C₂)mannose (Q82694075)
└──aldehydo-talose (Q107080868) ↑
    -aldehydo-D-talose (Q423195)
    -aldehydo-L-talose (Q27158868)
    -aldehydo-D-(2-¹³C)talose (Q82694379)

Wostr (talkcontribs)

Thanks for the info. However, with no support for Wikidata:Property proposal/isotopically modified form of, there should be some way to link e.g. aldehydo-D-(2-¹³C)talose (Q82694379) with aldehydo-D-talose (Q423195). With no dedicated property, there is probably only one way to do it: aldehydo-D-(2-¹³C)talose (Q82694379) subclass of (P279) aldehydo-D-talose (Q423195), but that's not possible right now with chemical compounds modelled as instances... I'll try to finish my proposal to switch 'instance of' to 'subclass of' for chemical compounds and post it on the WikiProject Chemistry discussion page (sorry, I haven't had time to review your proposal there; I will probably have some days off at the end of the week).

Reply to "isotopic compounds"

Call for participation in the interview study with Wikidata editors

Kholoudsaa (talkcontribs)

Dear Wostr,

I hope you are doing well.

I am Kholoud, a researcher at King’s College London, and I work on a project as part of my PhD research that develops a personalized recommendation system to suggest Wikidata items for the editors based on their interests and preferences. I am collaborating on this project with Elena Simperl and Miaojing Shi.

I would love to talk with you to learn how you currently choose the items you work on in Wikidata and to understand the factors that might influence that decision. Your cooperation will give us valuable insights into building a recommender system that can help improve your editing experience.

Participation is completely voluntary. You have the option to withdraw at any time. Your data will be processed under the terms of UK data protection law (including the UK General Data Protection Regulation (UK GDPR) and the Data Protection Act 2018). The information and data that you provide will remain confidential; it will only be stored on the password-protected computer of the researchers. We will use the anonymized results to provide insights into how editors select items for editing, and we will publish the results of the study at a research venue. If you decide to take part, we will ask you to sign a consent form, and you will be given a copy of this consent form to keep.

If you’re interested in participating and have 15-20 minutes to chat (I promise to keep the time!), please either contact me at [] or [] or use this form https://docs.google.com/forms/d/e/1FAIpQLSdmmFHaiB20nK14wrQJgfrA18PtmdagyeRib3xGtvzkdn3Lgw/viewform?usp=sf_link with your choice of the times that work for you.

I’ll follow up with you to figure out what method is the best way for us to connect.

Please contact me if you have any questions or require more information about this project.

Thank you for considering taking part in this research.

Regards

Reply to "Call for participation in the interview study with Wikidata editors"
Mabschaaf (talkcontribs)

Hi Wostr, may I ask the reason for this edit? According to SciFinder, CAS 6912-67-0 represents exactly the molecule with no defined stereochemistry at either of the two stereocentres. Greetings --Mabschaaf (talk) 17:42, 3 July 2021 (UTC)

Wostr (talkcontribs)

I'm not sure right now, but:

  • sourced external IDs should very rarely be deleted, even if they are wrong. The better approach in WD is to deprecate such a statement, because deleting a wrong ID is only a temporary fix: sooner or later it will be added again by some bot owner. So I usually deprecate CAS numbers instead of deleting them, as there have been cases where a deleted (wrong) external ID reappeared after a mass import of data.
  • as I have not had access to SciFinder for several years, in situations like this I have to use secondary sources (if a CAS Common Chemistry entry is not available). I think that, based on the ChemIDplus entry and maybe some others (which matched CAS: 6912-67-0 with InChI=1S/C5H9NO3/c7-3-1-4(5(8)9)6-2-3/h3-4,6-7H,1-2H2,(H,8,9)/t3?,4-/m0/s1), I chose to deprecate the CAS number in (4RS)-4-hydroxy-DL-proline (Q411237) and leave it with normal rank in (4RS)-4-hydroxy-L-proline (Q27102938).

If the SciFinder entry says otherwise, it should be changed: the statement in (4RS)-4-hydroxy-L-proline (Q27102938) deprecated and the one in (4RS)-4-hydroxy-DL-proline (Q411237) set back to normal rank.
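The rank change described above can be sketched in a few lines of Python. This is only an illustrative in-memory model, not the Wikidata API: the `Statement` class and the `set_rank` and `usable` helpers are hypothetical names, and P231 is assumed to be the CAS Registry Number property. The point it illustrates is that a deprecated statement stays in the item (so a later import sees the ID is already there) while consumers that skip deprecated statements no longer pick it up.

```python
from dataclasses import dataclass
from typing import List

# The three statement ranks used on Wikidata.
RANKS = ("preferred", "normal", "deprecated")


@dataclass
class Statement:
    """Hypothetical stand-in for one Wikidata statement."""
    item: str
    prop: str
    value: str
    rank: str = "normal"


def set_rank(statement: Statement, rank: str) -> None:
    """Change the rank in place instead of deleting the statement."""
    if rank not in RANKS:
        raise ValueError(f"unknown rank: {rank}")
    statement.rank = rank


def usable(statements: List[Statement]) -> List[Statement]:
    """Return the statements a consumer would use, skipping deprecated ones."""
    return [s for s in statements if s.rank != "deprecated"]


# State before the SciFinder check: CAS number deprecated in the DL item.
cas_dl = Statement("Q411237", "P231", "6912-67-0", rank="deprecated")
cas_l = Statement("Q27102938", "P231", "6912-67-0", rank="normal")

# After the check: swap the ranks, keeping both statements in place.
set_rank(cas_l, "deprecated")
set_rank(cas_dl, "normal")

print([s.item for s in usable([cas_dl, cas_l])])  # ['Q411237']
```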

Mabschaaf (talkcontribs)

I did as described. Thanks for your help.--Mabschaaf (talk) 11:08, 4 July 2021 (UTC)

Reply to "(4RS)-4-hydroxy-DL-proline (Q411237)"
Mike Peel (talkcontribs)

Re - again an issue with the Commons link belonging with the Wikipedia sitelinks. Perhaps you can split the Commons category if it also covers a different topic?

Wostr (talkcontribs)

I'm just trying to figure out what the best names for the categories would be and searching for any precedent, because the best names would be Category:L(−)-Ephedrine and Category:D(+)-Ephedrine. The lack of a Commons category in (−)-ephedrine (Q219626) will be resolved soon.

Egon Willighagen (talkcontribs)

Hi, thanks for catching this. The SMILES matches the InChI/InChIKey and PubChem CID. These also seem wrong.

Wostr (talkcontribs)

I didn't notice that :(

Egon Willighagen (talkcontribs)

I need to run more consistency checks, it seems... so much to do, so little time :( (Overall, I think it's in pretty good shape :)

SCIdude (talkcontribs)

Please correct this. How many did you do wrong in this batch?

Wostr (talkcontribs)

Yes, you're right that this item describes a stereochemically defined compound. I did not know that there may be InChIs with "?" in sublayers /b or /t that do not indicate an undefined stereocentre. The problem is the 3-iminopyrazol-1-yl group. The InChI from PubChem indicates that the double-bond stereochemistry of H-N=C< is undefined. However, it's hard to reproduce this in any software available to me, and even the PubChem structure redrawn in ChemDraw gives a different InChI.

I'll check all the InChIs in these 4 batches for possible 'false positives' in the /b sublayer. I can't tell you right now how big this problem is, but I'll contact you as soon as I know. Then I'll correct all incorrectly changed statements, but I can't tell you right now whether these kinds of errors are occasional and I'll correct them manually, or whether I'll have to use semi-automatic tools.

SCIdude (talkcontribs)

It may be a problem with InChIs from ChEBI. If so, we should replace them in bulk if that solves the problem.

Wostr (talkcontribs)

After a quick check I see that there may be about 20–30 items in these batches that have to be checked. I'll do that manually (however, I don't know when: probably tomorrow or the day after, as I have a really busy week at work).

Wostr (talkcontribs)

There are 30 items that I'll be reviewing, all are listed here. It seems that most of the problems result from double bonds on nitrogen atoms or double bonds in rings. I'm not sure why the InChI in these items shows e.g. the oxime group HO-N=C< as a group that should have defined stereochemistry. However, there are situations like in Heme O (Q620211): the InChI from PubChem shows an undefined configuration for many double bonds of the porphyrin, while the InChI from ChemSpider shows all those double bonds as stereochemically defined.

I'll try to check whether these InChIs from PubChem are correct for these items. Maybe we should have more than one InChI in such situations (even with a deprecated rank).

SCIdude (talkcontribs)

This problem also showed up with my current ChEBI InChIKey fixes, resulting in different keys. I agree multiple InChIs and keys are unavoidable. But when using different ranks, we should have a consistent way to assign them. For example, do we prefer not to specify oxime bonds? What about diazo -N=N- bonds? PubChem usually leaves them unspecified (I agree with this). As to porphyrin bonds, is the (E) configuration geometrically possible? If not, the bond does not need to be specified. This has to be defined on some project page. Could you please do this?

SCIdude (talkcontribs)

BTW I think I found out why the ChEBI InChIs may have a problem. Take the SMILES "C1C[C@@H]2CC[C@H]1C2", which is norbornane with redundant stereo information (the centres are potentially stereogenic, but not in this case). PubChem automatically removes the stereo specifications on input; ChEBI does not. As a result the InChIs differ. So effectively it's a ChEBI software problem.

Wostr (talkcontribs)

As I thought that ~30 incorrect items is a very low number given the scale of QS batches, I checked the whole batch in Excel rather than trying to query it using SPARQL from WD.

I found 774 potential InChIs that may have ? in the /b sublayer and may not be a group of stereoisomers. I've manually checked all the items (unfortunately, most have only one source, the DSSTOX database, because they were created by GZWDer imports) and found:

  • about 58.5% are correct (mostly undefined configuration of C=C bonds or substituted diazo bonds)
  • about 23.8% have to be checked more carefully, however I think that most are correct (about 95% of these are undefined configurations of double bonds in rings with eight or more members; I checked that it is possible to have an eight-membered ring with at least one E double bond, so these items are probably correctly described as 'group of stereoisomers')
  • about 17.7% are probably incorrect (about 85% of these are unsubstituted imino groups that are treated in many databases as stereochemically undefined, however at least 3 different InChIs can be assigned in such situations; the rest are unsubstituted diazo groups, heterocyclic compounds or some weird borderline cases, plus some items in which InChIs from different sources differ).
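As a rough cross-check, the three percentages above can be converted back into approximate item counts out of the 774 checked InChIs. The sketch below is only arithmetic on the figures quoted in this post; since the percentages are themselves approximate, the counts are estimates and the labels are shorthand, not official categories.

```python
# Approximate item counts implied by the quoted percentages of 774 InChIs.
TOTAL = 774

shares_percent = {
    "correct": 58.5,
    "to check more carefully": 23.8,
    "probably incorrect": 17.7,
}

# Round each share to the nearest whole item.
counts = {label: round(TOTAL * pct / 100) for label, pct in shares_percent.items()}

for label, count in counts.items():
    print(f"{label}: about {count} items")

# The rounded counts add up to the full set again.
print("sum:", sum(counts.values()))
```

Under these assumptions the split is about 453, 184 and 137 items, which adds up to 774.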

I'll post on the Wikidata:WikiProject Chemistry discussion page in a few days about this problem. Most incorrectly added 'groups of stereoisomers' for compounds with an unsubstituted imino group can be reverted using QS, so technically it won't be a problem, but we have to do it in a uniform way for all cases.

The problem you've mentioned about norbornane and redundant stereo descriptors may cause additional problems in the future. I added 'group of stereoisomers' to items that have ? in InChI sublayers /b or /t. However, there are also many InChIs for groups of stereoisomers that lack these sublayers: if a compound has 2 stereocentres and 1 is undefined, there is a /t sublayer with a ? descriptor for one stereocentre; if both stereocentres are undefined, there is no /t sublayer. So we still have thousands of 'groups of stereoisomers' (with all stereocentres undefined) that are classified as 'chemical compounds'. I asked Egon Willighagen about the script he wrote in 2019; items like norbornane may be false positives that we should try to exclude.
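The check described above (a ? in the /b or /t sublayer) and its blind spot can be sketched as follows. This is a minimal illustration, not the script actually used for the batches: the function names are hypothetical, the parser keeps only the first occurrence of each layer prefix (so repeated layers in isotopic or fixed-H sections are ignored), and the second test string is simply the 4-hydroxyproline InChI quoted earlier on this page with its stereo layers dropped, used only to show the case where no /t sublayer exists.

```python
from typing import Dict


def inchi_layers(inchi: str) -> Dict[str, str]:
    """Split an InChI into its one-letter-prefixed layers, e.g. {'t': '3?,4-'}."""
    parts = inchi.split("/")
    layers: Dict[str, str] = {}
    # parts[0] is the version prefix and parts[1] the formula; skip both.
    for part in parts[2:]:
        if part:
            # Keep the first occurrence only (simplification, see lead-in).
            layers.setdefault(part[0], part[1:])
    return layers


def has_undefined_marker(inchi: str) -> bool:
    """True if the /b or /t sublayer contains a '?' descriptor."""
    layers = inchi_layers(inchi)
    return any("?" in layers.get(key, "") for key in ("b", "t"))


# InChI quoted earlier on this page: one stereocentre undefined, one defined.
PARTLY_DEFINED = (
    "InChI=1S/C5H9NO3/c7-3-1-4(5(8)9)6-2-3/"
    "h3-4,6-7H,1-2H2,(H,8,9)/t3?,4-/m0/s1"
)
# Same connectivity with the stereo layers dropped (illustration only).
NO_STEREO_LAYERS = (
    "InChI=1S/C5H9NO3/c7-3-1-4(5(8)9)6-2-3/h3-4,6-7H,1-2H2,(H,8,9)"
)

print(has_undefined_marker(PARTLY_DEFINED))    # True
print(has_undefined_marker(NO_STEREO_LAYERS))  # False
```

The second result is the limitation mentioned above: a string with no /t sublayer at all passes this check unflagged, so fully undefined groups of stereoisomers need a different test. The reverse problem, a ? that does not correspond to a real stereocentre, also cannot be detected from the string alone.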

Reply to "Wrong group statement"