Previous discussion was archived at User talk:SCIdude/Archive 1 on 2019-11-02.

Charles Matthews (talkcontribs)

So you object to my removals of a broad "molecular biology" statement as main subject of an item about an article.

Generally, as you know, I concentrate on MeSH starred terms, and I work now to give the exact MeSH term, not something broader. It seems that quite broad terms are not really considered to be so bad, but I think the main subject statements should be referenced. And where there is a reference, the statements should be made accurate according to that reference. You and at least one other editor add comparable statements from other such good sources.

There are also main subject statements coming from the author-supplied keywords. Those index terms are strings, and need to be matched to Wikidata items. I add those, with reference and two qualifiers so that everyone can see what is happening.

Finally, there are inferred statements. There are many added by the Source MD tool for item creation, and that tool adds many broad statements, and many bad statements also. I have a page for tracking seriously bad statements and a high proportion of the errors on that page seem to come from that tool.

From the early days one can see many broad statements such as "organic chemistry" and "catalysis" that might be true, if there is an enzyme involved, but really aren't helpful. There are a large number of statements "inferred from title" that are unhelpful in various ways, because automated text-mining of titles is not a great technique.

So, if you want me not to delete a main subject like "molecular biology" which is an inference of a certain kind, you should explain how the inference is made, and why it is helpful. Because much of what happens seems to me not to be helpful. Charles Matthews (talk) 12:42, 21 November 2022 (UTC)

SCIdude (talkcontribs)

You are perfectly right. I apologize. I'm abusing the tag to mark papers for my curation work at Reactome. But please give me some time to fix my tools before you continue removing. After fixing my scripts I will remove them myself so there may be no need for you to do it then. I will give you notice here when I'm done. Thanks.

SCIdude (talkcontribs)

I have now removed all ~3,000 statements of the form main subject (P921):molecular biology (Q7202) on articles that I made. In almost all cases there is a more specific topic on the same item, either a subclass of biological process, or of protein. So I consider the issue closed.
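As a rough illustration of the check described above (the broad value is only dropped where a more specific topic stays on the same item), here is a minimal Python sketch. The item keys, the topic identifiers and the function name are hypothetical placeholders, not real Wikidata items and not the scripts actually used; only main subject (P921) and molecular biology (Q7202) come from this thread.

```python
MOLECULAR_BIOLOGY = "Q7202"  # broad main subject (P921) value discussed in this thread


def redundant_broad_subjects(main_subjects, specific_topics, broad=MOLECULAR_BIOLOGY):
    """Return items where the broad subject could be dropped because a
    more specific topic is still present on the same item."""
    removable = []
    for item, subjects in main_subjects.items():
        others = subjects - {broad}
        if broad in subjects and others & specific_topics:
            removable.append(item)
    return sorted(removable)


# Hypothetical toy data: keys and topic ids are placeholders, not real items.
main_subjects = {
    "article-A": {"Q7202", "topic-process-1"},  # broad + specific: broad is redundant
    "article-B": {"Q7202"},                     # only the broad value: needs a closer look
    "article-C": {"topic-protein-1"},           # no broad value at all
}
specific_topics = {"topic-process-1", "topic-protein-1"}

print(redundant_broad_subjects(main_subjects, specific_topics))
```

Items like "article-B", where the broad value is the only main subject, are deliberately not reported, since nothing more specific would remain there.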

Reply to "Broader terms in P921"
Wostr (talkcontribs)

In 2020 you created Q100721660 and moved some statements to it from Q419167. Now both items have been merged. This popped up on my Watchlist, however, I don't have time right now to check these edits, but maybe you'd like to know that this merge happened.

SCIdude (talkcontribs)

Well, he can do that, as I have given no ref for the mixture fact. And I don't remember where I got it from. His other merges (the last two weeks) are good. Thanks for noting.

Reply to "basic fuchsine (Q419167) / (Q100721660)"

Naturschutzgebiet Jenaer Forst (Q61685796) and Jenaer Forst (Q32063053)

Leutha (talkcontribs)

I started a page on Jenaer Forst on en Wikipedia and then linked it to (Q32063053), but then I found Jenaer Forst (Q61685796) on de Wikipedia. When I tried to merge them, I could not, as they are already linked. (Q61685796) seems to relate specifically to the Naturschutzgebiet and has "Naturschutzgebiet Jenaer Forst". However, the German article also has specific information about areas outside the NSG: "The core area of the barracks complex was renatured after its last use as an initial reception facility for asylum seekers, but most of it lies outside the NSG." (translated from German). This relates to information I want to develop on the page on en Wikipedia. Therefore, I am asking:

1) Should (Q61685796) be renamed "Naturschutzgebiet", or is there a better solution?

2) Would you be able to resolve this matter?

3) If you feel the status quo should be maintained, could you provide a rationale that would be helpful as regards future entries concerning ''Forst''.

This would be very helpful. User:Leutha

SCIdude (talkcontribs)

This is not a "status quo"; the two are obviously different concepts. You seem to be a Wikipedia writer, and you are used to merging concepts in order to get a readable article. Unfortunately, the de-wiki article merges both concepts. This is a common problem in Wikipedia and can only be resolved there. Other than that, I don't see exactly what problem you have. Can you please elaborate?

SCIdude (talkcontribs)

Just a hint as to the solution: why do you think these two exist? Lüneburg Heath Nature Park (Q1508609) and Lüneburg Heath (Q311124)

SCIdude (talkcontribs)

It is just the notion that every Wikipedia article has to be "readable". I think it is quite possible that the general solution lies in the definition of that term. Where is it defined anyway and who is behind it?

Leutha (talkcontribs)

Thanks for your response. By "status quo (Q201610)" I meant as things stand at the moment. I apologise for a mistake in my suggestion, as I meant to put "Naturschutzgebiet Jenaer Forst (Q61685796)". Thanks for the example you offer of Lüneburg Heath Nature Park (Q1508609) and Lüneburg Heath (Q311124), which have distinct names and bear a similar - but not identical - relationship to each other. Having distinct labels when items appear in other languages will make it much clearer which Wikidata link is the best one to make. I hope that clarifies my suggestion.

I'm afraid my knowledge of German is not sufficient to do any major work on the de-wiki; however, I will make sure the same problem does not occur on the en wiki.

As for the term "readable", wikidata has: human-readable (Q16716513) and machine-readableness (Q36822946), but I don't understand why these should be seen as opposites rather than complementary, particularly in light of the statement on the wikidata main page: "Wikidata is a free and open knowledge base that can be read and edited by both humans and machines." I think your question is long and deep.

SCIdude (talkcontribs)

Yes, that is right, different concepts should be given different names. However, this is not always possible or necessary, as the main difference of Wikidata items is always found in the instance-of and subclass-of statements. All the WD statements define the concept, not the names (labels).

SCIdude (talkcontribs)

That said, I don't think anyone would object if you change the label. It's not as relevant as the statements, or the connected sitelinks.

Reply to "Naturschutzgebiet Jenaer Forst (Q61685796) and Jenaer Forst (Q32063053)"
M2k~dewiki (talkcontribs)
SCIdude (talkcontribs)

Hello @Codc. de:Naturstoff is currently linked to en:biomolecule, and de:Naturprodukt is linked to en:natural product. The hierarchy in English is natural product --> comprises natural material and biomolecule, see also enwp. Is that not correct?

SCIdude (talkcontribs)
SCIdude (talkcontribs)

Brockhaus Biomolekül:

Natural substances (Naturstoffe): in the broader sense, all substances that occur in nature; in the narrower sense, organic compounds that can be isolated from animals, plants and microorganisms.

Reply to "Naturstoff vs. Naturprodukt"
Wostr (talkcontribs)

FYI: we have an additional level in the classification of aldohexoses between aldehydo-hexose (Q105024342) and compounds like aldehydo-D-mannose (Q27117223) or aldehydo-L-mannose (Q27117227): aldehydo-mannose (Q106964021) (group of two stereoisomers, L and D). This level was introduced for every aldohexose, even if there is no ChEBI equivalent, as an effort to standardise and clean up items about carbohydrates (more in: User:Wostr/Carbohydrates; for now I have only managed to clean up the aldohexoses).

So I moved subclass of (P279) aldehydo-hexose (Q105024342) that was added by your bot from items like aldehydo-D-mannose (Q27117223) to items like aldehydo-mannose (Q106964021).
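To make the effect of this move concrete, here is a small Python sketch of inserting an intermediate class into a subclass hierarchy. It uses the item IDs quoted above, but the dictionary model and the function names are hypothetical, and it assumes (which is not stated here) that the D and L items link to the new level via subclass of (P279).

```python
def insert_intermediate(subclass_of, children, intermediate, parent):
    """Move the `parent` link from each child to `intermediate`,
    and make `intermediate` a subclass of `parent`."""
    for child in children:
        parents = subclass_of.setdefault(child, set())
        parents.discard(parent)       # drop the direct link to the broad class
        parents.add(intermediate)     # assumption: child links via P279
    subclass_of.setdefault(intermediate, set()).add(parent)
    return subclass_of


def ancestors(subclass_of, item):
    """All classes reachable from `item` by following subclass links."""
    seen, stack = set(), [item]
    while stack:
        for parent in subclass_of.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


ALDEHYDO_HEXOSE = "Q105024342"
ALDEHYDO_MANNOSE = "Q106964021"
D_MANNOSE, L_MANNOSE = "Q27117223", "Q27117227"

# Before: both stereoisomers point directly at aldehydo-hexose.
hierarchy = {D_MANNOSE: {ALDEHYDO_HEXOSE}, L_MANNOSE: {ALDEHYDO_HEXOSE}}
insert_intermediate(hierarchy, [D_MANNOSE, L_MANNOSE], ALDEHYDO_MANNOSE, ALDEHYDO_HEXOSE)

print(hierarchy[D_MANNOSE])                              # direct parent is now the new level
print(ALDEHYDO_HEXOSE in ancestors(hierarchy, D_MANNOSE))  # still reachable transitively
```

The point of the sketch is that nothing is lost by the move: the stereoisomers remain aldehydo-hexoses through the intermediate item.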

SCIdude (talkcontribs)

Thanks, this was a bit experimental, and I'll be switching from SMILES to InChI for detection of core structures next. It's still possible to miss such connections. The goal, of course, is to have classes that can be checked, and substances added, (semi-)automatically.

Reply to "aldohexose (open form)"
Wostr (talkcontribs)

There is also one problem regarding the classification of cyclic compounds that has to be addressed. We have classes like tricyclic compound (Q3539074) and there are two ways such classes are defined in sources:

  1. n-cyclic compound = every compound has exactly n rings (three for tricyclic), no more, no less, in the whole structure
  2. n-cyclic compound = every compound has no fewer than n rings (three for tricyclic), but may have more

Selecting any of the options has serious consequences for the entire classification and may result in our classification being inconsistent with classifications from other sources.

The first option seems more logical and consistent, as every compound is classified according to the number of rings in the structure. However, classes like phenothiazine (Q16023748) or dibenzazepine (Q33416403) then cannot be subclasses of tricyclic compound (Q3539074) but only of polycyclic compound (Q426145) (as there is no certainty that every compound belonging to phenothiazine (Q16023748) or dibenzazepine (Q33416403) has exactly three rings). It is also not consistent with ChEBI, e.g. pentacyclic LSM-20934 is classified under organic tricyclic compound. On the other hand, choosing the second option leaves us with a weird classification tree: tetracyclic compound (Q7706284) (four or more rings) should be a subclass of tricyclic compound (Q3539074) (three or more rings).

I have no good solution to this. I'd personally choose the first option, even if it means a lot of inconsistencies between databases and the need for carefully checking that each class and chemical compound is assigned to the appropriate n-cyclic compounds class.
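The consequences of the two definitions can be checked with a toy model. The following Python sketch treats a class as a predicate on the ring count and tests the subclass relation by brute force over a small range of ring counts; the function names and the range are illustrative assumptions, not anything taken from ChEBI or Wikidata.

```python
def exactly(n):
    """Option 1: a compound belongs to the class iff it has exactly n rings."""
    return lambda rings: rings == n


def at_least(n):
    """Option 2: a compound belongs to the class iff it has n or more rings."""
    return lambda rings: rings >= n


def is_subclass(sub, sup, ring_counts=range(0, 21)):
    """`sub` is a subclass of `sup` if every member of `sub` is also in `sup`."""
    return all(sup(r) for r in ring_counts if sub(r))


# Option 1: tetracyclic and tricyclic are disjoint, so no subclass relation.
print(is_subclass(exactly(4), exactly(3)))      # False

# Option 2: every compound with >= 4 rings also has >= 3 rings.
print(is_subclass(at_least(4), at_least(3)))    # True

# A pentacyclic compound: excluded from "tricyclic" under option 1,
# included under option 2.
print(exactly(3)(5), at_least(3)(5))            # False True
```

Under option 2 the classes nest (tetracyclic inside tricyclic), while under option 1 a class whose members may carry extra rings cannot be placed under any single n-cyclic class.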

SCIdude (talkcontribs)

The classification of LSM-20934 looks like an error; note all the LSMs under are two-star entries. What remains is the problem of derivatives adding a bridge to the core structure. I don't think this happens often, and such a compound is no longer a derivative (in my book). So, I agree with you that option 1 is the most natural, but only if it applies to the core, not the whole structure, e.g. is still a naphthalene.

SCIdude (talkcontribs)
Reply to "n-cyclic compounds"

Call for participation in the interview study with Wikidata editors

Kholoudsaa (talkcontribs)

Dear SCIdude,

I hope you are doing well.

I am Kholoud, a researcher at King’s College London, and as part of my PhD research I am working on a project that develops a personalized recommendation system to suggest Wikidata items to editors based on their interests and preferences. I am collaborating on this project with Elena Simperl and Miaojing Shi.

I would love to talk with you to know about your current ways to choose the items you work on in Wikidata and understand the factors that might influence such a decision. Your cooperation will give us valuable insights into building a recommender system that can help improve your editing experience.  

Participation is completely voluntary. You have the option to withdraw at any time. Your data will be processed under the terms of UK data protection law (including the UK General Data Protection Regulation (UK GDPR) and the Data Protection Act 2018). The information and data that you provide will remain confidential; it will only be stored on the password-protected computer of the researchers. We will use the results anonymized (?) to provide insights into editors' practices in selecting items to edit, and will publish the results of the study at a research venue. If you decide to take part, we will ask you to sign a consent form, and you will be given a copy of this consent form to keep.

If you’re interested in participating and have 15-20 minutes to chat (I promise to keep the time!), please either contact me at kholoudsaa@gmail.com or use this form https://docs.google.com/forms/d/e/1FAIpQLSdmmFHaiB20nK14wrQJgfrA18PtmdagyeRib3xGtvzkdn3Lgw/viewform?usp=sf_link with your choice of the times that work for you.

I’ll follow up with you to figure out what method is the best way for us to connect.

Please contact me using the email mentioned above if you have any questions or require more information about this project.

Thank you for considering taking part in this research.



Reply to "Call for participation in the interview study with Wikidata editors"
Bamyers99 (talkcontribs)

I have just added a note at the top of the EntitySchema directory indicating that it is programmatically generated. I incorporated some of your changes into the Configuration. I moved the molecular biology schemas to their own category. I added a See also link to the WikiProject Main classes and their canonical database. I didn't add the See also link to the WikiProject ShEx page since it is duplication of the data in the directory.

SCIdude (talkcontribs)

This is great work!

Reply to "EntitySchema directory updating"

Please stay away from the Merge tool in the near future

Maxim Masiutin (talkcontribs)

Your advice to "Please stay away from the Merge tool in the near future" is inappropriate. Please refrain from giving such advice here.

Reply to "Please stay away from the Merge tool in the near future"
Wostr (talkcontribs)

I'm not sure about 2-phenylcyclopropan-1-amine (Q100423358). Right now the classification of DL-tranylcypromine (Q420885), (2S)-2-phenyl-1-cyclopropanamine (Q27163528), (1R,2R)-2-phenylcyclopropan-1-amine (Q27280143) and 2-phenylcyclopropan-1-amine (Q100423358) is quite messy. Before your edits, DL-tranylcypromine (Q420885) seemed to be about a group of stereoisomers (both stereocenters undefined; probably with some identifiers for stereochemically defined compounds); now I'm not sure how to change instance of (P31)/subclass of (P279) in the rest of the items.

Check 2-phenylcyclopropan-1-amine (Q100423358), (1R,2R)-2-phenylcyclopropan-1-amine (Q27280143) and (2S)-2-phenyl-1-cyclopropanamine (Q27163528) to make sure that I get your idea right. But I'm still unsure about DL-tranylcypromine (Q420885) — is this about a racemate, about a group of stereoisomers or about a specific (stereochemically defined) chemical compound?

SCIdude (talkcontribs)

@Wostr The product is the trans-racemate, i.e. (R,S) and (S,R), and I actually wanted to add P31 for this, but suddenly remembered someone emphasized not to mix group and racemate, so I stopped. Maybe the name should be changed to (RS*,SR*) to be more exact?

Wostr (talkcontribs)

Okay, now it makes more sense, I'll handle this. We need two new items for both stereoisomers to properly model this situation, and we need to move/delete a few properties that are not 100% true for a racemic mixture. I'll write again after doing this.

Wostr (talkcontribs)

I think all is done right now. 2-phenylcyclopropan-1-amine (Q100423358) and (2S)-2-phenylcyclopropan-1-amine (Q27163528) as group of stereoisomers, tranylcypromine (Q420885) as a racemate, trans-(−)-tranylcypromine (Q100429558), trans-(+)-tranylcypromine (Q100429273), (1S,2S)-2-phenylcyclopropan-1-amine (Q100430420) and (1R,2R)-2-phenylcyclopropan-1-amine (Q27280143) as specific stereoisomers.

If you come across similar situations with racemates in the future, feel free to point me to such items.

SCIdude (talkcontribs)


Reply to "tranylcypromine"