Photocyte (talk) 05:43, 18 April 2020 (UTC) Saehrimnir
Jasper Deng
Egon Willighagen
Denise Slenter
Daniel Mietchen
Emily Temple-Wood
Pablo Busatto (Almondega)
Antony Williams (EPA)
Devon Fyson
Samuel Clark
Tris T7
Robert Giessmann
Cord Wiljes
Adriano Rutz
Jonathan Bisson
Charles Tapley Hoyt
Peter Murray-Rust
Notified participants of WikiProject Chemistry

Hello gang (not sure if I am doing this right by placing it on my own talk page; let me know if this is unusual). I wanted to ask some questions about the integration of PubChem and Wikidata. As you may know, PubChem is (I'm pretty sure) the most comprehensive database of molecular structures. Many PubChem entries are linked to a Wikidata item (e.g. https://www.wikidata.org/wiki/Q418878 - scroll down to see the PubChem CID property). But novel PubChem entries, as far as I can tell, do not get propagated to Wikidata within a reasonable amount of time (e.g. https://pubchem.ncbi.nlm.nih.gov/compound/139291741 ; if I search Wikidata for that PubChem ID, "139291741", nothing comes up).

So, questions:

  • 1) Is there any effort to make it so *every* PubChem compound gets replicated in Wikidata, even if it doesn't have a Wikipedia page yet? If I made a bot that tried to do that, would that cause any issues?
  • 2) Actually, the https://www.wikidata.org/wiki/Q418878 example I gave above links out to PubChem, but if you click through the link, you'll notice that the entry has been marked non-live. It should instead link to the live version of the compound: https://pubchem.ncbi.nlm.nih.gov/compound/135445694 . Is there any effort to update these PubChem<->Wikidata links?
  • 3) On the subject of updating the PubChem<->Wikidata links: the Wikipedia page linked to the Wikidata item also has a PubChem linkout, but it is maintained separately in the Chembox template (see here: https://en.wikipedia.org/wiki/Coelenterazine). Does anyone know if there is a bot that updates those Chembox PubChem linkouts from the Wikidata entry, or vice versa?
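For question 1, one way to check whether a given CID is already on Wikidata is to ask the query service for items that carry that value in the PubChem CID property (P662). This is only a minimal sketch, assuming the public SPARQL endpoint at https://query.wikidata.org/sparql; the function names and the User-Agent string are made up for illustration.

```python
import json
import urllib.parse
import urllib.request

# Public Wikidata Query Service endpoint (assumed reachable from where this runs).
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def build_cid_query(cid: str) -> str:
    """Return a SPARQL query for items whose PubChem CID (P662) equals `cid`."""
    if not (cid.isascii() and cid.isdigit()):
        raise ValueError(f"PubChem CIDs are numeric, got {cid!r}")
    return f'SELECT ?item WHERE {{ ?item wdt:P662 "{cid}" . }}'


def items_for_cid(cid: str) -> list[str]:
    """Run the query and return the matching item URIs (needs network access)."""
    params = urllib.parse.urlencode(
        {"query": build_cid_query(cid), "format": "json"}
    )
    request = urllib.request.Request(
        f"{WDQS_ENDPOINT}?{params}",
        # Illustrative User-Agent; a real bot should identify itself properly.
        headers={"User-Agent": "cid-lookup-sketch/0.1 (illustrative example)"},
    )
    with urllib.request.urlopen(request) as response:
        data = json.load(response)
    return [row["item"]["value"] for row in data["results"]["bindings"]]
```

Calling `items_for_cid("139291741")` and getting an empty list back would correspond to the "nothing comes up" search result described above.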

Hi, to answer your first question: no, there is no such effort. There are two reasons why this will not happen soon either. First, PubChem is too large. Second, we do not have enough CC0 data in chemistry to demonstrate which chemicals are notable and which are not. To answer the second question, I don't think there is one now, but this is something we could work on. The normal approach here is to create a report that indicates which non-live PubChem CID records are linked to, so that WikiProject Chemistry can look at them. If I am not mistaken, this is how it all started, but this work has not been continued. I would say all the questions are relevant, but keep in mind that there is no general 1-to-1 relation, and there are so many corner cases that the term corner case is not really even appropriate. I would say the WikiProject has limited bandwidth, but if you are interested, you're most welcome to continue talking. --Egon Willighagen (talk) 07:12, 18 April 2020 (UTC)
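The report idea can be sketched as a small filter over (item, CID) pairs. How to decide whether a CID is live is deliberately left open: `is_live` is a hypothetical check supplied by the caller, not a PubChem API call, and `"CID_OLD"` in the usage lines is a placeholder for a retired identifier rather than a real CID.

```python
from typing import Callable, Iterable


def non_live_report(
    links: Iterable[tuple[str, str]],
    is_live: Callable[[str], bool],
) -> list[tuple[str, str]]:
    """Return the (item, cid) pairs whose CID fails the liveness check, sorted."""
    return sorted((item, cid) for item, cid in links if not is_live(cid))


# Placeholder data: only the live CID from question 2 is treated as live.
live_cids = {"135445694"}
example_links = [("Q418878", "CID_OLD"), ("Q418878", "135445694")]
report = non_live_report(example_links, lambda cid: cid in live_cids)
```

Keeping the liveness check as a parameter means the same report code works whether the status comes from a downloaded dump, a manual list, or some other source.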

You really do NOT want to simply load all of PubChem into Wikidata, as you will inherit so many errors and will spend years cleaning them up if you are not careful. Our efforts at the CompTox Chemicals Dashboard (https://comptox.epa.gov/dashboard) cover 882,000 chemicals (with five full-time curators); we are curating data every day, are 20 years into the effort, and are still finding errors. I can point you to a number of articles highlighting the issues of blindly using public domain chemistry data.... --Antony Williams 23:52, 26 July 2020 (UTC)
Hello, a good place to discuss this kind of topic is Wikidata:WikiProject Chemistry. Concerning your questions: no, there is no intention to import all of PubChem into WD. WD is not a mirror of PubChem but the intersection of several databases and of contributors' work. PubChem also has some duplicates (just have a look at Wikidata:Database_reports/Constraint_violations/P662#"Single_value"_violations), and the policy on what counts as a chemical compound differs between PubChem and WD. So there is a need to filter what should be imported into WD.
WD also lacks maintenance bots that check databases to keep the information in WD up to date. The main reason is that mass imports were done in WD without curation of the data. So keeping data up to date is worthless until the original set has been completely curated: nothing guarantees that a PubChem CID was added to the correct Q item. Most data were imported from Wikipedias with a lot of errors and bad definitions of the chemical structure (typical example: the structure of the salt form instead of the acid form for an acid). So performing the task you mention is good, but more structural work is needed first.
Finally, concerning data displayed in Wikipedia articles, this is a problem on the Wikipedia side: it is possible to update the data in Chembox instantaneously, but for that Wikipedia's editors have to accept the use of a Lua template and the display of data from WD. Currently there is strong opposition to that system for different reasons (you can find some discussion about that here). Best regards. Snipre (talk) 13:34, 19 April 2020 (UTC)
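As a rough illustration of the "Single value" violations on P662 mentioned above, the check amounts to grouping CIDs by item and keeping the items that carry more than one. This is a sketch over in-memory pairs, not the code behind the actual constraint report; the function name and the item IDs in the usage lines are placeholders.

```python
from collections import defaultdict
from typing import Iterable


def items_with_multiple_cids(
    links: Iterable[tuple[str, str]],
) -> dict[str, list[str]]:
    """Map each item to its sorted CIDs, keeping only items with more than one."""
    cids_by_item: dict[str, set[str]] = defaultdict(set)
    for item, cid in links:
        cids_by_item[item].add(cid)  # a set ignores repeated identical pairs
    return {
        item: sorted(cids)
        for item, cids in cids_by_item.items()
        if len(cids) > 1
    }


# Placeholder items and CIDs, for illustration only.
violations = items_with_multiple_cids(
    [("Q_A", "10"), ("Q_A", "11"), ("Q_B", "20"), ("Q_B", "20")]
)
```

Each flagged item still needs a human decision, since only a curator can tell which CID matches the structure the item is meant to describe.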