Wikidata:Property proposal/plural forms

plural forms

Originally proposed at Wikidata:Property proposal/Natural science

   Not done
Description: stores the string used by GNU Gettext (Q937302) and compatible tools to describe simply how many plural forms a language has and what ranges of numbers each covers
Represents: grammatical number (Q104083)
Data type: String
Domain: item, language (Q34770)/dialect (Q33384)/language variety (Q3329375)
Allowed values: nplurals=[number here]; plural=[string with particular format described at https://gnu.org/software/gettext/manual/html_node/Plural-forms.html#FOOT5]
Example 1: Arabic (Q13955) → nplurals=6; plural=(n==0 ? 0 : n==1 ? 1 : n==2 ? 2 : n%100>=3 && n%100<=10 ? 3 : n%100>=11 ? 4 : 5);[1]
Example 2: Montenegrin (Q8821) → nplurals=3; plural=n%10==1 && n%100!=11 ? 0 : n%10>=2 && n%10<=4 && (n%100<10
Example 3: English (Q1860) → nplurals=2; plural=(n != 1);
Example 4: Indonesian (Q9240) → nplurals=1; plural=0;
Source:
Planned use: add languages listed in the source tables
Number of IDs in source: ~150
Expected completeness: always incomplete (Q21873886)
Wikidata project: WikiProject Linguistics (Q10857957)
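
As a rough illustration of the format under Allowed values, the sketch below splits a value into its nplurals count and its plural expression, then evaluates the expression for a few numbers. It is written in Python and leans on gettext.c2py, an undocumented helper in CPython's gettext module, so treat that call as an assumption that may not hold on other Python implementations. parse_plural_forms is a hypothetical name used only for this example.

```python
import gettext
import re

# Matches values shaped like "nplurals=2; plural=(n != 1);"
_PLURAL_FORMS = re.compile(
    r"^\s*nplurals\s*=\s*(\d+)\s*;\s*plural\s*=\s*(.+?)\s*;?\s*$"
)


def parse_plural_forms(value):
    """Hypothetical helper: return (nplurals, function mapping n to a form index)."""
    match = _PLURAL_FORMS.match(value)
    if match is None:
        raise ValueError(f"not a Plural-Forms value: {value!r}")
    nplurals = int(match.group(1))
    # Assumption: gettext.c2py (undocumented, CPython) converts the C-style
    # expression into a Python function of the integer n.
    index_of = gettext.c2py(match.group(2))
    return nplurals, index_of


english = "nplurals=2; plural=(n != 1);"
arabic = (
    "nplurals=6; plural=(n==0 ? 0 : n==1 ? 1 : n==2 ? 2 : "
    "n%100>=3 && n%100<=10 ? 3 : n%100>=11 ? 4 : 5);"
)

for label, value in (("English", english), ("Arabic", arabic)):
    nplurals, index_of = parse_plural_forms(value)
    print(label, nplurals, [index_of(n) for n in (0, 1, 2, 5, 11, 100)])
```

A check like this could also serve as a crude format validator for entered values, since malformed strings raise ValueError. It only handles integers, which is all the gettext expression is defined over.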

Motivation

These strings are important for software localisation, but resources online are scattered; this seems to be a good fit for Wikidata's mission of being a central repository of such information. Arlo Barnes (talk) 21:35, 10 June 2022 (UTC)

Discussion

  • Comment @Arlo Barnes: I like the idea a lot, but the proposal needs to be totally reworked (so technically Oppose as it is). First, the string datatype feels very bad; it's obscure and prone to mistakes. Also, it is too dependent on one system (GNU), where we should have a more general and neutral solution (for instance, this solution seems to ignore decimal numbers: "1.5" is followed by a singular in French but by a plural in English; it also ignores gender and other grammatical agreement). Finally, it seems more to be something for Wikifunctions than for Wikidata. That said, on the Wikidata side, I see that we don't have a property "has grammatical number", like we do for has grammatical gender (P5109) (and so many others; or am I missing something?). PS: as a Breton speaker, this whole system feels very funny :D as we don't agree on number after a number (but we do agree on gender after a number, and low numbers also cause mutation). Cheers, VIGNERON (talk) 16:58, 11 June 2022 (UTC)
  • The solutions presented in y'all's comments (using Wikifunctions once it becomes ready, using a multiplex of statements) are certainly more elegant, but I still think having the unparsed string stored has utility, because it means someone can look up the language and copy and paste the whole slug into their localisation software (or better, the software can look it up by itself). Perhaps as a qualifier to a more semantically-specified statement? Arlo Barnes (talk) 17:17, 11 June 2022 (UTC)
  • There are multiple common formats for this information. For example, CLDR uses the XML-based LDML format. Would it be possible to model plural rules with enough structure that either gettext or LDML could be generated from it, using an ordinary SPARQL query, no need for Wikifunctions? I would like to be able to state, for example, that Vietnamese (Q9199) has one form according to some sources but two forms according to others. But if a format only used by some sources only tells part of the story, are we responsible for translating the format used by other sources into gettext format? At a glance, I'm not sure that the LDML format can be converted losslessly into gettext format in every case, though maybe it won't matter for any of the 150 initial occurrences proposed above. CLDR is also considering additional attributes to be applied at a higher level than the condition. Minh Nguyễn 💬 23:36, 11 June 2022 (UTC)[reply]
I like your line of thinking here where a SPARQL query could yield a variety of formats, but surely translating from the gettext string to a series of statements is equivalent effort to converting from other formats into gettext where possible -- the human entering the data still has to be able to read two formats, the source and whatever we're storing it as. The advantage of using an existing format is that those can sometimes be the same and so modulo a 'stated in' reference it can just be entered verbatim. I'm ambivalent as to which system might be of best advantage in such a situation, although if there are incompatibilities then the most expressive one would be preferable of course. If nothing suits, then I guess a Wikidata-internal system might well do to try to maximize flexibility. This would be equivalent to informally specifying a new format in RDF, if I'm not mistaken. Arlo Barnes (talk) 01:10, 12 June 2022 (UTC)
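
To make the idea of storing structured conditions and generating formats from them a bit more concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: Range, Form, to_gettext and to_cldr_like are made-up names, the model (an ordered list of forms, each a conjunction of inclusive ranges on n or on n modulo some number, with the last form as the fallback) is far simpler than real data would need, and the second renderer is only loosely modelled on CLDR rule text and has not been validated against the LDML specification.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Range:
    """n (optionally taken modulo `mod`) lies in low..high, inclusive."""
    low: int
    high: int
    mod: Optional[int] = None


@dataclass
class Form:
    category: str         # label only, e.g. "one" or "few"
    all_of: List[Range]   # conjunction; only the last form may leave this empty


def to_gettext(forms):
    """Render the hypothetical model as a gettext-style value."""
    def test(r):
        operand = "n" if r.mod is None else f"n%{r.mod}"
        if r.low == r.high:
            return f"{operand}=={r.low}"
        return f"{operand}>={r.low} && {operand}<={r.high}"

    expr = str(len(forms) - 1)  # the last form is the fallback index
    for index in range(len(forms) - 2, -1, -1):
        cond = " && ".join(test(r) for r in forms[index].all_of)
        expr = f"{cond} ? {index} : {expr}"
    return f"nplurals={len(forms)}; plural=({expr});"


def to_cldr_like(forms):
    """Render the same model as CLDR-style rule text keyed by category."""
    def test(r):
        operand = "n" if r.mod is None else f"n % {r.mod}"
        value = str(r.low) if r.low == r.high else f"{r.low}..{r.high}"
        return f"{operand} = {value}"

    return {f.category: " and ".join(test(r) for r in f.all_of) for f in forms}


ARABIC = [
    Form("zero", [Range(0, 0)]),
    Form("one", [Range(1, 1)]),
    Form("two", [Range(2, 2)]),
    Form("few", [Range(3, 10, mod=100)]),
    Form("many", [Range(11, 99, mod=100)]),
    Form("other", []),
]

print(to_gettext(ARABIC))
print(to_cldr_like(ARABIC))
```

The generated gettext string is equivalent to the Arabic example for integers but not identical to it character for character (the "many" test gains an explicit upper bound), which is one small instance of the round-tripping concern raised above.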

Wifey: [for the property name] Something verbose but precise would work I think. Like "GNU Gettext formatter string for plural forms"... The more clear it is from the label alone what you are supposed to put in it, the better (since realistically people are going to see it in autocomplete before they see the documentation). I would also suggest changing "eventually complete (Q21873974)" to "always incomplete (Q21873886)". There's no complete list of every language to date, so unless there's a very finite set of them which can have this property it's unlikely to have a complete set (further complicating this is the possibility of dialectal variations in plural form).

References

  1. http://wiki.arabeyes.org/Plural_Forms