Wikidata:Property proposal/Vocalized form

Vocalized form

Originally proposed at Wikidata:Property proposal/Lexemes

Not done

Represents	niqqud (Q1777790)
Data type	string, unless there's a more specific data type for an alternative lexical form
Domain	lexeme
Example	For the Hebrew lexeme "כתב" in the sense of "writing system" it will be כְּתָב (Lexeme:L415). For the lexeme כתב in the sense of "reporter", it will be כַּתָּב (Lexeme:L416).
Source	Arabic diacritic (Q775724), niqqud (Q1777790)
Planned use	Add the property to Lexeme:L415 and Lexeme:L416, and any other Hebrew word. See example.
See also	There is a comparable property for Q items: vocalized name (P4239), but it probably shouldn't be reused. That property is only for names, and this proposed property is for all words and forms. Quite likely, there should be a similar property for Arabic, and perhaps the same property can be used for both Hebrew and Arabic, but I only know Hebrew well, so this will need input from somebody who is familiar with Arabic grammar and lexicography.

Motivation

Hebrew has two standard spelling systems: vocalized and unvocalized. (There are several other variations, which will probably need their own properties, but these two should be the start.) In casual writing, the unvocalized system is used almost always, but in some contexts the vocalized spelling is used. In particular, Hebrew monolingual dictionaries indicate both forms. (You can read more about it in the Wikipedia article Niqqud, and in the Encyclopedia of Hebrew Language and Linguistics article Vocalization of Modern Hebrew.) The vocalized spelling is necessary for learning the pronunciation and the grammar. It will be needed for every form, and not only for the basic lemma. When we have automatic generation of declined forms, every declined form will have to have both a vocalized and an unvocalized spelling (we'll have to decide which will be the default... dictionaries and grammars usually go for the vocalized spelling in declension tables, but we'll have to discuss what is best for Wikidata). Amir E. Aharoni (talk) 14:46, 23 May 2018 (UTC)

Discussion

Question Isn't this better done using the built-in "forms" feature of lexemes? The "grammatical feature" could then indicate whether it was vocalized or not. ArthurPSmith (talk) 19:33, 23 May 2018 (UTC)

I don't think so. It's not a different grammatical form, but a different representation of the same grammatical form.

I can imagine, for example, that if using alternate spelling systems is supported, then each spelling system can have a label, and each form can have several representations. However, vocalized/nonvocalized Hebrew is not like a language such as Russian, French, or German, which had different spelling standards over time, and it's not a regional variation like European and Brazilian Portuguese. For Hebrew these are different representations of the same word in the same language in the same place and time.

Perhaps I should mention that Hebrew does have variation in spelling standard over time, but it's not as notable as it is in Russian with its 1918 reform, for example. I doubt that there will be great demand to include both early-20th-century Hebrew spelling and current Hebrew spelling (sipur as ספור and סיפור). Including vocalized forms, however, is essential, because that is the full pronunciation, and all dictionaries have both forms. --Amir E. Aharoni (talk) 20:00, 23 May 2018 (UTC)

@Amire80: Ok - lexemes also allow multiple representations of the same form, but I think I see your point that this is different. So the proposal is to have two lexemes for every Hebrew word, one for vocalized and one for unvocalized spelling, and link them through this property? Maybe the label should avoid the word "form" to prevent confusion here? ArthurPSmith (talk) 19:14, 24 May 2018 (UTC)

Possibly, not sure. I am trying to reach out to lexicographers with relevant knowledge and ask for their opinion.

Making them equal and neutral rather than preferring one is probably a good idea. Other interfaces that produce dictionaries for actual people's consumption can decide what they prefer to show as the primary form (among the current common dictionaries, Rav-Millim is an example of a dictionary that shows the unvocalized form as primary, and Even-Shoshan is an example of one that shows the vocalized). (And yes, there's the question of whether Lexical Wikidata is the dictionary itself that people consume, or just an infrastructure from which other dictionaries will be derived.) --Amir E. Aharoni (talk) 06:06, 29 May 2018 (UTC)

Comment @Amire80, ArthurPSmith: I can't say whether this property is needed or not. Not sure if it helps, but I had a somewhat similar case today: where and how to add "amāre" to Lexeme:L1643. The problem is obviously very different, but the possible resolutions are similar: use a lemma, create forms, or create a property? What is certain is that a form is not limited to a « grammatical » form. On Lexeme:L114 Tpt used forms for dialectal variation, and on Lexeme:L95 I've used the lemma to indicate a variation (in the long run we should probably choose one method and stick to it, but meanwhile it demonstrates the possibilities). To me, creating a property seems a bit like the too-easy, lazy way. Cdlt, VIGNERON (talk) 16:27, 29 May 2018 (UTC)
- If I have understood correctly, each form of a Hebrew word could have a vocalized spelling. If so, I would go the simplest way. It seems to me that, if there is no reason to consider the vocalized version a different form from the unvocalized one (i.e. if there is no statement we would make on the vocalized version and not on the unvocalized one), the simplest way is to just do what VIGNERON did on Lexeme:L95, i.e. add two representations for each form (and maybe two lemmas for each lexeme), one for the vocalized version and one for the unvocalized one, with the relevant language codes. That also seems to be what Ontolex tends to do (see the second example). Tpt (talk) 16:46, 29 May 2018 (UTC)

Weak support - Can we solve this problem by calling the English property name "vocalized spelling" and making the datatype Monolingual String? Then each form of a Hebrew word can have its own vocalized spelling listed on the same Lexeme, without requiring extra Lexemes for the sake of recording niqqud. Deryck Chan (talk) 09:55, 13 June 2018 (UTC)

Weak oppose If I understand correctly, this is about an alternative spelling convention used to distinguish words that are otherwise spelled the same, but pronounced differently. That's what "spelling variants" of lemmas and form representations are for. Since כְּתָב and כַּתָּב are already modeled as separate lexemes (as they should be if they are pronounced differently or have different morphology), this should work fine, perhaps using he-x-Q21283070 as the variant code. The lexemes are homographs, as they have the same lemma in he (כתב), but they have different lemmas in the vocalized variant (and would have different pronunciations as well). Is there any use case that would not be covered by this? -- Duesentrieb (talk) 13:18, 15 July 2018 (UTC)
- Oppose per above. We already do this for some Arabic entries (e.g. شُبَاط (L8661)) and it seems to work. - Nikki (talk) 09:57, 19 September 2018 (UTC)
Comment see also now Wikidata:Property proposal/word with diacritical signs, which I think is a better approach, but I'm thinking both can be handled just via the representation system in lexemes... ArthurPSmith (talk) 14:21, 16 July 2018 (UTC)
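
For illustration, the representation-based modelling suggested above by Tpt and Duesentrieb (one lemma per spelling variant, keyed by language code) might look roughly like the following Python sketch. The dictionary layout is a hypothetical simplification, not the actual Wikibase Lexeme JSON format, and the helper names `strip_points` and `homographs` are invented for this example. The variant code he-x-Q21283070 is the one floated in the discussion, not an agreed standard. Stripping the points is only a rough consistency check: it does not in general yield the standard unvocalized spelling (compare ספור and סיפור above).

```python
import unicodedata

# Variant code suggested in the discussion for the vocalized spelling (an assumption).
VOCALIZED = "he-x-Q21283070"

# Hypothetical, simplified lexeme records: lemma text keyed by language/variant code.
lexemes = {
    "L415": {"he": "כתב", VOCALIZED: "כְּתָב"},  # sense: "writing system"
    "L416": {"he": "כתב", VOCALIZED: "כַּתָּב"},  # sense: "reporter"
}


def strip_points(text):
    """Remove combining marks (Unicode category Mn), which covers niqqud and dagesh."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def homographs(records, code="he"):
    """Group lexeme IDs by their lemma under `code`; keep lemmas shared by several lexemes."""
    groups = {}
    for lexeme_id, lemmas in records.items():
        groups.setdefault(lemmas[code], []).append(lexeme_id)
    return {lemma: ids for lemma, ids in groups.items() if len(ids) > 1}


if __name__ == "__main__":
    # The two lexemes share their unvocalized lemma but not their vocalized one.
    print(homographs(lexemes))
    print(homographs(lexemes, VOCALIZED))
    for lexeme_id, lemmas in lexemes.items():
        print(lexeme_id, strip_points(lemmas[VOCALIZED]) == lemmas["he"])
```

In this layout no extra property and no extra lexeme is needed: the vocalized spelling is simply a second lemma (or form representation) on the same lexeme.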

Not done No consensus. --Micru (talk) 09:46, 22 December 2018 (UTC)