Wikidata:Property proposal/chemical formula

‎molecular formula

   Under discussion
Description: description of a chemical compound giving element symbols and counts
Represents: molecular formula (Q188009)
Data type: Item
Domain: type of chemical entity (Q113145171), group of stereoisomers (Q59199015)
Allowed values: molecular formula (Q188009)
Example 1: 2-hydroxy-5-octanoylbenzoic acid (Q209407) → C₁₅H₂₀O₄ (Q129998552)
Example 2: abscisic acid (Q332211) → C₁₅H₂₀O₄ (Q129998552)
Example 3: Santonic acid (Q7420590) → C₁₅H₂₀O₄ (Q129998552)
Example 4: silver bicarbonate (Q27260276) → CHAgO₃ (Q130044611)
Expected completeness: always incomplete (Q21873886)
See also: chemical formula (P274)
Wikidata project: WikiProject Chemistry (Q8487234)

  Notified participants of WikiProject Chemistry

Motivation

This proposal addresses the need for improved data structure and maintenance within Wikidata’s chemical compound data. Currently, Wikidata:WikiProject Chemistry manages approximately 1 million chemical items, many of them linked to chemical formula (P274) and mass (P2067). The main issues are:

Redundancy in Data: With about 300,000 unique chemical formula strings in use, redundancy is a significant problem. Some strings are associated with over 1,000 items, which complicates data management (see https://w.wiki/B2ax).

Efficiency and Maintenance: Transitioning from string-based formulas to item-based ones will simplify maintenance, reduce redundancy, and optimize query performance, especially for SPARQL queries involving formulas or masses.

Data Optimization: Moving mass (P2067) statements to the newly created formula items will reduce the number of triples and make data management more efficient. Additionally, this change will facilitate the use of different units for masses and allow for better structured data.

Improved Modeling: Switching to item-based formulas could eliminate the need for overly complex has part(s) (P527) statements on chemicals, allowing cleaner, more precise data models (e.g., identifying all chemical formulas containing more than five oxygen atoms).
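
To make the idea concrete, here is a minimal sketch in Python of what item-based formulas would buy us. It is only an illustration: the sample data reuses the four example compounds from this proposal, the function names are hypothetical, and the parser assumes flat formulas in plain ASCII (no brackets, hydrates, or charges).

```python
import re
from collections import defaultdict

# Sample data: compound QID -> formula string, taken from the proposal's examples.
COMPOUNDS = {
    "Q209407": "C15H20O4",
    "Q332211": "C15H20O4",
    "Q7420590": "C15H20O4",
    "Q27260276": "CHAgO3",
}

# One element symbol (capital letter, optional lowercase letter) plus an optional count.
TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def parse_formula(formula):
    """Parse a flat formula such as 'C15H20O4' into {element: count}."""
    counts = defaultdict(int)
    for symbol, number in TOKEN.findall(formula):
        counts[symbol] += int(number) if number else 1
    return dict(counts)


def group_by_formula(compounds):
    """Map each distinct formula (one would-be formula item) to its compounds."""
    groups = defaultdict(list)
    for qid, formula in compounds.items():
        groups[formula].append(qid)
    return dict(groups)


def formulas_with_more_than(groups, element, threshold):
    """Return the formulas whose count of `element` exceeds `threshold`."""
    return sorted(f for f in groups if parse_formula(f).get(element, 0) > threshold)
```

With this shape, the atom counts are stored (or computed) once per distinct formula rather than once per compound, and a question such as "more than five oxygen atoms" becomes a filter over formula items.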

This change is expected to bring numerous benefits, including reduced redundancy, improved query efficiency, and better data maintenance. The potential downside of increased label editing can be managed, and the overall gain for Wikidata’s chemical data justifies this proposal. If approved, I am prepared to create the necessary items and migrate existing data.

Any further input to refine this proposal is more than welcome!

P.S.: I have no strong opinion on whether the current chemical formula (P274) should be deleted or reused on the new items as "Chemical Formula String".  – The preceding unsigned comment was added by AdrianoRutz (talk • contribs) at 15:00, August 28, 2024 (UTC).

discussion

  •   Support sounds great! Egon Willighagen (talk) 15:25, 28 August 2024 (UTC)[reply]
      Comment Last night on the boat between Finland and Sweden I thought of another aspect where this would help model the chemistry in Wikidata better. If chemical formulas are items (and thanks to GZWDer for showing that various Wikipedias decided this was useful too), then they can also subclass each other. We can have an isotope-agnostic chemical formula (the common case) and subclasses for chemical formulas with isotopes. As such, it does much more than something purely technical (e.g. just about scalability): it actually improves how we talk about the chemistry. Egon Willighagen (talk) 07:07, 29 August 2024 (UTC)[reply]
  • Some comments:
  1. I will oppose "Additionally, this change will facilitate the use of different units for masses and allow for better structured data." - For consistency and machine-readability we should stick to one unit. I instead propose Wikidata:Property proposal/formula weight.
  2. Many wikis have pages like C15H20O4 (Q1250089). Some wikis treat them as disambiguation pages, some as set indices; we need to discuss how to handle such existing items. GZWDer (talk) 21:10, 28 August 2024 (UTC)[reply]
  • I looked at the page sitelinked on the English Wikipedia, and it actually looks exactly like a page about a chemical formula. To be honest, this sounds like an argument in favor of this proposal, and C15H20O4 (Q1250089) should be of type chemical formula (Q83147). The same goes for the French WP page. Neither says it is a disambiguation page; both are far more like a category of things with the same property. Just like this proposal, no? Egon Willighagen (talk) 06:58, 29 August 2024 (UTC)[reply]
I was only partially able to follow your reasoning here. In your proposal, you refer to this property should it be created, so would you support it? I believe the discussion about mass (P2067) (and units) or other properties is an interesting one that this proposal would make easier to hold and implement. What I mentioned about these, and what is currently on the example item, are just ideas; if this new property also allows these things to improve, even better! AdrianoRutz (talk) 08:51, 30 August 2024 (UTC)[reply]
  •   Weak oppose I cannot question arguments raised here about efficiency, but I don't see this as a proper way forward. This proposal completely fails to take into account the fact that for a given chemical entity there may be many – equally correct – chemical formulae (simple example in Q27260276#P274). Moving chemical formulae to another item will not help at all with the most important purpose for which WD exists – using this data. I would see the new property as being created only to assist with specific activities – but not to replace existing properties – and with appropriate disclaimers in the name and constraints that it is a strictly technical property only. Wostr (talk) 22:21, 28 August 2024 (UTC)[reply]
    I think this proposal has no problems with alternative formula notations, e.g. like CHAgO₃ (Q130044611). Or? Egon Willighagen (talk) 06:51, 29 August 2024 (UTC)[reply]
    CHAgO₃ and AgHCO₃ are not the same chemical formula. The same goes for XeF4O and XeOF4, which would require two different items for the same compound. In fact, for some compounds several new items would need to be created. For some chemical species we would have formulae with different numbers of atoms of each element: C30H40F2N8O9, C15H17FN4O3·1.5H2O and C30H34F2N8O6·3H2O are all correct formulae for the same compound, but I don't see a way for this to be reflected correctly by the current proposal. Everything looks fine if you consider only simple organic compounds and their formulae in Hill notation, but it's not that simple, especially if we consider inorganic compounds that are not molecules. Wostr (talk) 12:34, 29 August 2024 (UTC)[reply]
    Thank you for this important point! I removed the single value constraint, thus allowing for what you mention. AdrianoRutz (talk) 08:47, 30 August 2024 (UTC)[reply]
    Good point about non-molecular substances. I think the chemical concept we are trying to capture is that of isomerism: chemical entities are isomers when they have the same molecular formula (Q188009) or (non-structural) formula unit (Q1437643), enabling one molecule/ion/unit of the first chemical entity to be rearranged into one molecule/ion/unit of the second chemical entity by moving atoms/bonds around.
    • For example, the ionic compounds with structural formulas [CrCl(H₂O)₅]Cl₂•H₂O and [Cr(H₂O)₆]Cl₃ are (hydration) isomers, which we can recognise by assigning them the same formula H₁₂Cl₃CrO₆. This shows that all species in the crystal lattice of a compound should be combined together into a single entity when determining the formula. In the example you give above, the correct formula would be C₃₀H₄₀F₂N₈O₉, derived from combining together 2C₁₅H₁₇FN₄O₃•3H₂O, the smallest formula unit with integer multiples of all species.
    • Likewise, the molecular substance CO(NH₂)₂ and ionic compound NH₄OCN are considered isomers, which we can recognise by assigning them the same formula CH₄N₂O. This is the molecular formula of urea and the formula unit of ammonium cyanate, showing how molecular and non-molecular substances can be isomeric.
    • For ions, fulminate(1−) (Q27110286) (with structural formula CNO-) and cyanate anion (Q55503523) (with structural formula OCN-) are isomers, which we can recognise by assigning them the same formula CNO-.
    • Clathrates are similar to coordination compounds. E.g. methane clathrate (Q389036) has structural formula 4CH₄•23H₂O, yielding the formula C₄H₆₂O₂₃. Likewise, the endohedral fullerene CH₄@C₆₀ should have formula C₆₁H₄.
    • Compounds should not usually map to multiple formulas: if C links to two different formulas, one the same as A (from reference 1) and one the same as B (from reference 2), this implies C is isomeric with A, and C is isomeric with B, but A is not isomeric with B. This only makes sense if 1 and 2 disagree as to what the correct formula of C ought to be.
    • When references disagree, we may need to support multiple formulas. Historically, w:en:copper monosulfide was thought to have structure [Cu2+][S2-], corresponding to the formula CuS. It has now been assigned the structure [Cu+]₃[S2-][S₂-], which would correspond to Cu₃S₃. However, PubChem still has the old formula. We might want to update Wikidata to the new formula while also keeping the PubChem-referenced formula (with a note that it's not the correct formula).
    • Non-stoichiometric compounds, alloys, and mixtures of indeterminate composition are more complicated to support. E.g. pyrrhotite (Q421944) has formula Fe1-xS (x = 0 to 0.125). Rather than trying to support formula units with atom counts that are algebraic expressions (e.g. 1 - x), I think it would be easier if we could list the formulas of the endpoints: Fe₇S₈ and FeS. Similarly, superconducting yttrium barium copper oxide (Q414015) has formula YBa2Cu3O7−x (x = 0 to 0.65), with endpoint formulas YBa2Cu3O6.35 (i.e. Y20Ba40Cu60O127) and YBa2Cu3O7. I think it's hard to come up with a perfect solution though. InChI (P234) has similar issues for non-stoichiometric compounds: https://doi.org/10.1186/s13321-015-0068-4#Sec45.
    Preimage (talk) 17:47, 31 August 2024 (UTC)[reply]
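
    The bookkeeping in the examples above (combining all species of a formula unit, then scaling to the smallest all-integer unit) can be sketched in Python as follows. This is only an illustration under my own assumptions: the function names are hypothetical, and species are given directly as atom counts rather than parsed from structural formulas.

```python
from collections import Counter
from fractions import Fraction
from functools import reduce
from math import gcd, lcm  # math.lcm needs Python 3.9+


def combine(*parts):
    """Sum (multiplier, {element: count}) pairs into one set of atom counts."""
    total = Counter()
    for multiplier, counts in parts:
        for element, n in counts.items():
            total[element] += multiplier * n
    return dict(total)


def to_integer_formula(counts):
    """Scale possibly fractional atom counts to the smallest all-integer unit."""
    # Going through str() turns a float like 6.35 into the exact fraction 127/20.
    exact = {el: Fraction(str(n)) for el, n in counts.items()}
    scale = reduce(lcm, (f.denominator for f in exact.values()), 1)
    scaled = {el: int(f * scale) for el, f in exact.items()}
    divisor = reduce(gcd, scaled.values())
    return {el: n // divisor for el, n in scaled.items()}


WATER = {"H": 2, "O": 1}

# 2 C15H17FN4O3 + 3 H2O gives C30H40F2N8O9
hydrate = combine((2, {"C": 15, "H": 17, "F": 1, "N": 4, "O": 3}), (3, WATER))

# 4 CH4 + 23 H2O gives C4H62O23
clathrate = combine((4, {"C": 1, "H": 4}), (23, WATER))

# Endpoint YBa2Cu3O6.35 scales to Y20Ba40Cu60O127
ybco = to_integer_formula({"Y": 1, "Ba": 2, "Cu": 3, "O": 6.35})
```

    Starting instead from the 1.5-hydrate (multiplier `Fraction(3, 2)` for water) and passing the result through `to_integer_formula` gives the same C30H40F2N8O9, which is the sense in which the differently written formulas collapse to one formula unit.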
  •   Support I also see more benefits than downsides. Support. Wostr, I am not sure I understand how this would be a problem, even for entities which could be described using different MF sequences of atoms like Q27260276#P274. Indeed, the has part(s) (P527) and quantity (P1114) statements of the MF entity (see C₁₅H₂₀O₄ (Q129998552)) would make it possible to efficiently retrieve such compounds represented in different MF notation systems. What exactly would be the drawback in this particular case? GrndStt (talk) 06:22, 29 August 2024 (UTC)[reply]
  •   Support, conditional on change of representation to molecular formula (Q188009). As noted in w:en:chemical formula#Types, chemical formula (Q83147) has four separate meanings: empirical formula (e.g. formaldehyde and glucose both have empirical formula CH₂O), molecular formula (e.g. urea and ammonium cyanate both have molecular formula CH₄N₂O in Hill notation, indicating they are isomers), structural formula (a graphical representation of the structure, not so relevant here), and condensed (or semi-structural) formula (e.g. urea has condensed formula CO(NH₂)₂ whereas ammonium cyanate has condensed formula [NH₄][OCN]). Molecular formulas "indicate the simple numbers of each type of atom in a molecule, with no information on structure", which is what we need for mass calculations. They also avoid the issue raised by Wostr regarding non-uniqueness of chemical formulas (e.g. NH₄NO₃ and H₄N₂O₃ are both valid formulas for ammonium nitrate), as each chemical should have a single canonical molecular formula in Hill notation (with the exception of rare cases where there is disagreement regarding structure, e.g. w:en:copper monosulfide). One last potential issue: molecular formulas are often defined as not including isotopes, e.g. PubChem lists both deuterated chloroform and chloroform as having molecular formula CHCl₃. Egon Willighagen's suggestion to have a subclass of [molecular] formulas with isotopic information would resolve this issue though, I think. Preimage (talk) 12:22, 29 August 2024 (UTC)[reply]
    Just revised the naming to change to molecular formula (Q188009), as suggested. 👍🏼 AdrianoRutz (talk) 07:16, 24 September 2024 (UTC)[reply]
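
    For reference, the Hill ordering mentioned above is simple to state in code: carbon first, hydrogen second, then all other elements alphabetically; if there is no carbon, every element (hydrogen included) is sorted alphabetically. A minimal sketch in Python, assuming atom counts are already available as a dictionary (the function name is hypothetical and charges and isotopes are ignored):

```python
def hill_formula(counts):
    """Render {element: count} as a formula string in Hill order."""
    elements = sorted(counts)
    if "C" in counts:
        # Carbon-containing: C, then H (if present), then the rest alphabetically.
        rest = [el for el in elements if el not in ("C", "H")]
        ordered = ["C"] + (["H"] if "H" in counts else []) + rest
    else:
        # No carbon: strictly alphabetical, hydrogen included.
        ordered = elements
    # A count of 1 is left implicit, as in 'CHAgO3'.
    return "".join(el + (str(counts[el]) if counts[el] != 1 else "") for el in ordered)


# Urea and ammonium cyanate have the same atom counts, hence the same string.
urea = hill_formula({"C": 1, "O": 1, "N": 2, "H": 4})
ammonium_nitrate = hill_formula({"N": 2, "H": 4, "O": 3})
silver_bicarbonate = hill_formula({"Ag": 1, "H": 1, "C": 1, "O": 3})
```

    Because the output depends only on the atom counts, any two notations for the same composition (NH₄NO₃ vs. H₄N₂O₃, AgHCO₃ vs. CHAgO₃) end up on the same canonical string.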
  •   Oppose A chemical formula is an abstract entity and not one that has a mass.
It's worth noting that Unicode can't capture all chemical formulas, and Mathematical expression could express more. ChristianKl 16:29, 25 September 2024 (UTC)[reply]
You're wrong about that. Each chemical formula has a defined number of atoms of a defined number of elements. Although each element has multiple isotopes, every element with stable isotopes has a standard atomic weight, which is the value found in a typical sample. So the molecular weight of a particular chemical formula very much can be expressed. David Newton (talk) 09:58, 27 September 2024 (UTC)[reply]
Currently, in Wikidata a chemical formula is a notation. Notations don't have inherent mass. The NCI description of a chemical formula is "representation of a substance using symbols for its constituent elements". It's not the object that it's describing. While the object that a formula describes can have mass, the formula itself doesn't. It's a Document in NCI's ontology. In PROCO it's a quality, and also not something that has mass. Instances of material entity (Q53617407) have mass, and molecular formula (Q188009) isn't one. ChristianKl 12:47, 9 October 2024 (UTC)[reply]
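
Independently of whether the resulting value is stored on the compound or on a formula item, the weight calculation described by David Newton can be sketched in Python as below. The table holds only a handful of elements with rounded standard atomic weights, and the function name is hypothetical; a real implementation would need the full table and a policy for elements without stable isotopes.

```python
# Rounded standard atomic weights (g/mol) for a few elements only.
ATOMIC_WEIGHT = {
    "H": 1.008,
    "C": 12.011,
    "N": 14.007,
    "O": 15.999,
    "Ag": 107.8682,
}


def formula_weight(counts):
    """Sum atomic weight times atom count over {element: count}.

    Raises KeyError for elements missing from the small table above.
    """
    return sum(ATOMIC_WEIGHT[element] * n for element, n in counts.items())


# C15H20O4, shared by the first three example compounds in the proposal.
c15h20o4 = formula_weight({"C": 15, "H": 20, "O": 4})

# CHAgO3, silver bicarbonate.
chago3 = formula_weight({"C": 1, "H": 1, "Ag": 1, "O": 3})
```

The input is nothing but element symbols and counts, which is why the value is the same for every compound sharing a molecular formula.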