Wikidata talk:Lexicographical data/Archive/2023/04

This page is an archive. Please do not modify it. Use the current page, even to continue an old discussion.


Should we unify some languages?

We currently do not have a clear rule on whether to unify some languages. Two different dialects or varieties may differ in some ways but may still be kept in one lexeme.

There are some groups of languages for discussion. I tend toward unifying all of them if possible, but as I do not speak these languages, we need to discuss the merits:

  • Serbo-Croatian <= Serbian, Croatian, Bosnian, Montenegrin
  • Malay <= Indonesian
  • Persian <= Tajik
  • Armenian <= Western Armenian

(For Norwegian Bokmål/Nynorsk, the languages are well developed with many active users, and attempting to unify them would require much work. I would still like to hear the arguments against unification; I searched previous discussions but did not find why not.)

(For Chinese, I have an idea to introduce "Unified Chinese lexeme". But this is a somewhat large project so this will be discussed in a separate section.)

--GZWDer (talk) 16:06, 31 January 2023 (UTC)

Regarding Norwegian, I'd like to point out that Bokmål and Nynorsk are not much more similar to each other than either are to Swedish or Danish (10 years ago, I created this short comparison). Furthermore, when the predecessors of Bokmål and Nynorsk came into existence in the latter half of the 1800s, Bokmål was more or less a variant of Danish, while Swedish was probably closer to Danish than Nynorsk (or Landsmål, as it was known at the time) was. --Njardarlogar (talk) 17:54, 31 January 2023 (UTC)
For merging Malay-Indonesian, and Persian-Tajik, I think the case is very clear. These are identical for practical purposes; as I understand it the standard forms of Malay and Indonesian are both based on the same dialect of Malay, and most speakers of Tajik Persian live in Afghanistan and do not use Cyrillic anyway. It is just a matter of achieving consensus with the users who created the Indonesian and Tajik lexemes. عُثمان (talk) 18:23, 10 February 2023 (UTC)
With respect to the first two pairs, as much as I would support those, you may be inviting @Ivi104: to restart the Yugoslav Wars (Q242352) and @Bennylin: to restart the Indonesia–Malaysia confrontation (Q1361929) with such proposals. I tend to agree with عُثمان regarding Persian and Tajik, however (@Farorud: who created most of the Tajik lexemes). Mahir256 (talk) 18:38, 10 February 2023 (UTC)
Also @Denny: for Croatian. GZWDer (talk) 22:35, 10 February 2023 (UTC)
That's not a fight I want to be involved in. Unifying Croatian and Serbian may reduce the interest in using this data in the Serbian and Croatian Wiktionaries. So unless we unify the Wiktionaries (and maybe even the Wikipedias) -- which would be a very different discussion -- I would suggest putting this discussion on hold. --Denny (talk) 22:54, 10 February 2023 (UTC)
Wikipedia and Wiktionary projects are usually written in one standard language (e.g. Croatian Wikipedia is written in standard Croatian, and Croatian Wiktionary contains definitions in standard Croatian of words from all languages), but Lexemes are not only about standard languages. Usually, keeping lexemes in specific lects not only results in redundancy but also encourages a bias towards standard languages. About reusing the data in Wiktionaries, these are my thoughts:
  • For Lemmas and Forms, a specific language code (e.g. hr) is preferred for the standard language, so any lexemes or forms not found in hr can be ignored by users. (We still reserve the macrolanguage code for lexemes not considered part of a standard language.)
  • Senses can already have glosses in Serbian, Croatian and Serbo-Croatian. For lexemes or senses that are only used in Croatian, we can specify this using variety of lexeme, form or sense (P7481), and any lexemes, forms and senses whose variety of lexeme, form or sense (P7481) is not Croatian can be ignored if Croatian data is requested.
GZWDer (talk) 23:57, 10 February 2023 (UTC)
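(Illustrative note.) The filtering rule proposed above could be sketched roughly as follows. This is only a minimal sketch over a simplified, hypothetical record layout, not the actual Wikibase JSON: the `variety` key stands in for variety of lexeme, form or sense (P7481), the item IDs and spellings are placeholders, and `sh` is assumed here to be the macrolanguage code.

```python
# Hypothetical, simplified records; real lexeme data is richer than this.
CROATIAN = "Q-croatian"  # placeholder, not a real item ID
SERBIAN = "Q-serbian"    # placeholder, not a real item ID

lexeme = {
    "forms": [
        # A form with a standard-Croatian spelling and no variety restriction.
        {"representations": {"hr": "form-1-hr", "sh": "form-1-sh"}, "variety": []},
        # A form only recorded under the assumed macrolanguage code.
        {"representations": {"sh": "form-2-sh"}, "variety": [SERBIAN]},
    ],
    "senses": [
        {"gloss": "shared sense", "variety": []},
        {"gloss": "sense used only in Serbian", "variety": [SERBIAN]},
    ],
}


def applies_to(entity, variety):
    """An entity applies if it is unrestricted or explicitly lists the variety."""
    tags = entity.get("variety", [])
    return not tags or variety in tags


def view_for(lexeme, code, variety):
    """Return the forms (as spellings) and senses relevant to one standard language."""
    forms = [
        form["representations"][code]
        for form in lexeme["forms"]
        if code in form["representations"] and applies_to(form, variety)
    ]
    senses = [sense for sense in lexeme["senses"] if applies_to(sense, variety)]
    return forms, senses


croatian_forms, croatian_senses = view_for(lexeme, "hr", CROATIAN)
```

A Croatian Wiktionary consumer would then see only the `hr` spelling and the unrestricted sense, while the full lexeme stays shared.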
(Sorry for my bad English) I think unifying lexemes in Farsi and Tajik is possible. But the question is what will be written in the language line (Tajik or Persian). In addition, some words in Tajik have several forms; for example, the word "روشن" (light) can be written both as "равшан" (ravshan) and as "рӯшан" (rushan), and "اروپا" (Europe) as "Аврупо" ("Avrupo") and "Урупо" ("Urupo"), etc. Some words have even more forms, such as the word "آمریکا" (America): "Америка, Амрико, Омрико" ("America, Amriko, Omriko"). Which form will be written in the lemma string of the lexeme for Tajik, or will a new lexeme be created for each form? Or is there a property for "alternative forms"? -- Farorud (talk) 09:50, 11 February 2023 (UTC)
For this case, I have drafted Wikidata:Lexicographical data/Alternative forms. This means that unless one form is overwhelmingly more common in use (or considered "standard"), روشن, равшан/ravshan and рӯшан/rushan can be considered three forms of the same lexeme. GZWDer (talk) 22:05, 11 February 2023 (UTC)
The exact same pattern happens in Persian words borrowed into Punjabi. Since و is a semi vowel there are often multiple forms. For Persian پلاو there is پلاو ਪਲਾਵ (palāv) or پلاؤ ਪਲਾਉ (palau). I just add both and link them with the alternative form property. The actual common pronunciation is plā but Punjabi dictionaries prefer to use the Persian spellings over pronunciation spellings as the main lemma, so I follow this in selecting the lemma for the lexemes. عُثمان (talk) 00:45, 12 February 2023 (UTC)
I think the decision whether to unify two languages should be made by speakers of the two languages in question. It is vital that we have native and fluent speakers of a language editing lexemes if we want good quality data for that language and it doesn't matter how much we optimise the data model if the way we do it makes those speakers unwilling to get involved. I also think redundancy should be the least of our concerns right now. We have very little (if any) data for the vast majority of languages. There are only 82 languages with more than 100 lexemes, for example. - Nikki (talk) 13:44, 12 February 2023 (UTC)
@Darafsh, can you join this discussion, as the user who created many of the lexemes in Farsi? What do you think? -- Farorud (talk) 15:09, 12 February 2023 (UTC)
I agree with عثمان for Persian and Tajik. Despite the fact that both languages are similar, neither is a dialect of the other. As Farorud said, there are many words that are pronounced differently, and for some concepts the word itself differs: خفاش (bat) in Persian and Болдаст in Tajik is just one example. So I don't think unifying the Persian and Tajik lexemes is a good idea. Darafsh Kaviyani (Talk) 12:48, 14 February 2023 (UTC)

A counterpoint to your point, GZWDer. Breton (Q12107) is one single language (some people may disagree, but the majority agree), but it has a *very high* degree of variation: it is dialectal (4 main dialects) and it also has several orthographies (at least 3 major ones in the last century). So for a single lexeme, you may end up with a dozen forms (sometimes homographic but not always), and each form may also have up to 10 different pronunciations (so we are at around 120 cases). Add to that mutations (the first letter of the word changing depending on the context and on what comes before), up to 10 grammatical numbers (singular, plural, collective, singulative, dual, plural of dual, plural of plural, plural of collective, plural of singulative, etc.) and a load of other "traps". It's quite complex and, if you don't know the language, very confusing. I would love to have a clear way to model that, but I'm curious what this way could be, especially coming from outside, from someone who doesn't know about all this complexity. Look at piv/piou/piw (L2127) or lagad (L114) for some examples (not simple but far from being the most complicated ones). Cheers, VIGNERON (talk) 17:50, 16 April 2023 (UTC)

Two potential merges: “indefinite” and “simple future”

While going through the Wikidata:Wikidata Lexeme Forms templates, I noticed two pairs of items with identical labels:

Any thoughts or opinions? Lucas Werkmeister (talk) 14:31, 2 April 2023 (UTC)

@Lucas Werkmeister: For simple future (Q1475560) and simple future (Q96323395), I think there is no problem to have them merged. -- Bodhisattwa (talk) 15:37, 2 April 2023 (UTC)
I don't mind merging indefinite (Q53997857) and indefinite number (Q53998049). --Infovarius (talk) 15:58, 3 April 2023 (UTC)
I don't know if those can be merged. DL2204, what do you think? Theklan (talk) 17:33, 2 April 2023 (UTC)
These two don't describe the same thing. Q53997857 describes indefinite noun and adjective forms in Swedish and other languages (and the wikipedia article about Swedish indefinite noun forms is linked to that item), while Q53998049 describes indefinite noun forms in Basque; in Basque, that "indefinite" (termbank entry) is a grammatical number, along with singular and plural. In Swedish, noun forms can be singular and indefinite (example). There, the indefinite form seems the one to be expected when using indefinite articles (a.k.a. determiners). This is evidence for not taking both as exactly the same phenomenon. Actually, I don't see the subclass relation so evident. I'd rename Q53998049 to "indefinite number". DL2204 (talk) 16:13, 3 April 2023 (UTC)
This is tricky... For simple future (Q1475560) and simple future (Q96323395), I would say there is maybe a slight distinction: indicative simple future versus simple future. I'm not sure if there is a simple future outside of the indicative, but anyway, it's maybe better to keep them apart (just like past imperfect (Q12547192) and imperfect (Q108524486)). @Mahir256, Tanay barisha: is that what you meant (or at least close enough), and if so, should we improve the items to better reflect that? Cheers, VIGNERON (talk) 17:15, 4 April 2023 (UTC)
@DL2204, VIGNERON: If they shouldn’t be merged, then I’d appreciate a relabeling of the items to make the labels unambiguous, yeah :) Lucas Werkmeister (talk) 11:52, 8 April 2023 (UTC)
OK, I agree; I have updated labels in EU, EN, DE for Q53998049. DL2204 (talk) 07:55, 13 April 2023 (UTC)
@Lucas Werkmeister: I'm not 100% sure, I'd like more comments on that; in particular from @Mahir256, Tanay barisha:. Cheers, VIGNERON (talk) 17:27, 16 April 2023 (UTC)

Usage examples on senses

Currently, usage example (P5831) is advised to be used only as a main value on lexemes, and not on senses and forms. I would like to propose allowing usage examples on senses directly. Note that this proposal is not to move existing statements to senses, just to allow adding new ones to senses (the location that works best may differ for different languages and types of lexemes).

As I understand it, the primary reasons for only allowing usage examples on the lexeme level are a) for consistency across lexemes, and b) because an example can demonstrate usage for both a sense and form it should not be on one or the other. While these reasons make sense, I would contend that I have not seen any evidence that the first of them has proven to be helpful with regards to usage examples. I can say with certainty that for the languages I have been contributing lexemes in, I have not found this property to be useful at all on the lexeme level, whereas I can think of clear use cases if they were on senses. I think the second reason is something which probably varies by language. The outline below lists reasons I would find usage examples on senses useful, and they concern Punjabi and Hindustani which are two languages for which demonstrating usage of senses seems essential while demonstrating usage of forms does not.

  • Practical concerns: Both Punjabi and Hindustani are written with two writing systems, and every usage example is expected to be provided in each of them. If we want to add more than one usage example to a lexeme, there is no way to group together the same usage example in a different script, making this information more difficult to parse. On verbs in both of these languages, we can expect very high counts of both senses and forms. See for example, the 50 senses (so far) on उठना/اُٹھنا (L1071943). I am in the early stages of fleshing out the senses on lexemes like this one, so I will note that this is by no means an outlier and there are many verbs in both languages which can be expected to have a similar or greater number of senses. The form count on Hindustani verbs can reach ~150 and on Punjabi verbs can be in the hundreds. These types of lexemes are exactly those which could most benefit from usage examples for clarity, and adding 100 statements to the top of a lexeme does not seem tenable as opposed to two per sense (for each script).
  • Forms can often only be determined in the context of the sense: If we take the phrase کھڑے کرے in both Hindustani and Punjabi, this could be a perfect participle form or a subjunctive form. Which of these it is can only be understood from the meaning of the sentence it occurs in, as some senses of this phrase correspond to perfective uses while others do not.
  • It is necessary to demonstrate that senses exist:
    • Counterintuitive senses: The "primary" sense of the lexeme containing करना/کرنا (L579999-S21) is commonly understood as "to do"; however, S21 here describes the use of this verb to mean "to cook." Most dictionaries of Hindustani do list this as a sense, and tend to accompany it with some simple examples that can clear up any confusion about how this sense was identified. For example, نور اللغات (Q116742594) lists روٹی کرنا "bread *doing" (bread cooking), a familiar phrase which clarifies the role of this verb in cooking-related compounds.
    • Mitigating against deficient sources: This concerns Hindustani much more than it does Punjabi. For historical and political reasons, there has never been a dictionary published for this language which can be said to include all of the most common words in use. In order to deal with the general spurious nature of most sources in and about this language, we are forced to rely on usage examples for demonstrating the existence of lexemes and senses. For example, the verb کچوٹنا (to be vexed) is easy to find examples of in Urdu writing, and yet an entry for it has never been published in any Urdu dictionary. The verb رچنا is one for which the primary senses are concerned with marriage, yet most Hindi dictionaries to date have incorrectly described this verb as having to do with dyeing/coloring. Most dictionaries of the language also contain a large number of words which do not exist and have never been used. In order to account for the poor quality of sources in this language, the dictionary Dictionary of Hindi Verbs (Q116477566) relies exclusively on usage examples to demonstrate the existence of senses and is much more insightful for it. It would be productive to extend this method to lexemes on Wikidata, especially in light of the fact that the sense counts in Dictionary of Hindi Verbs (Q116477566) can be understood to be a minimum. A couple of specific cases are considered below.
      • The verb کرنا is listed in نور اللغات (Q116742594) as having 18 senses, all of which are corroborated with usage examples. The dictionary Hindi Shabdsagar lists 13 senses for this verb; however, 6 of these must be discarded due to either being glossed with intransitive verbs (this verb has no intransitive senses) or being glossed with words which have no relevance to the actual use of this verb. Dictionary of Hindi Verbs (Q116477566) lists 90 senses for this verb, all demonstrated well with at least one usage example, and as suggested above, these can be understood to be the minimum senses for the lexeme. This kind of massive discrepancy in how verbs are described in dictionaries of Hindustani is liable to cause confusion unless the lexemes for these verbs are similarly supported by usage examples on senses.
      • The verb چِلچِلانا is listed in Dictionary of Hindi Verbs (Q116477566) as having a single sense demonstrated with a single usage example. By comparison, Urdu Lughat, a more recent aggregate dictionary, offers four different senses for it with cited usage examples of each to demonstrate that they exist. The dictionary Hindi Shabdsagar lists a single sense which is erroneous (stating that this verb has to do with the sun shining, when it specifically has to do with the sun's heat), and additionally offers an entry for a transitive version of this verb which does not exist (and glosses it with an intransitive verb). In order to model Hindustani verbs accurately, dictionary senses without clear usage examples such as those in Hindi Shabdsagar must be discarded across the board.
    • Some senses require showing the lexeme in multiple contexts to demonstrate their existence. For the verb चाहना/چاہنا (L579700), this example uses it twice: وہ بہت کچھ چاہ لیتے ہیں۔ لیکن اگر صرف چاہنے سے ہی سب کچھ مل جاتا، تو پھر کویی ویکتی کام نہ کرتا۔ The verb's primary sense is "to desire," which is normally an "incompletable" sense. However, there is an additional sense "to be fond of" which allows expressing both an unterminated fondness, and a perfective fondness that has occurred. An example like this demonstrates the flexibility of this sense. The forms used are not really relevant as they take on rather predictable inflections based on syntactic context.

(With respect to Punjabi, the sources available are very high quality and comprehensive by contrast--however, they only scratch the surface when it comes to documenting the usage of lexemes. For example, no dictionary to date has provided usage examples employing the passive voice or future tense in Punjabi, and as such usage examples on senses would be a useful way to demonstrate these in a way existing sources have not.) عُثمان (talk) 19:50, 24 April 2023 (UTC)

I agree with this. Usage examples should definitely be per sense, just like in a dictionary. We should create a new property for this. Mtanti (talk) 08:15, 26 April 2023 (UTC)
I disagree. Ideally an example should always link to a specific sense and to a specific form (this should probably be added as a mandatory constraint), and I see no reason not to make the link from the lexeme level. Forms are important (I would say maybe even more important than the sense), and they are not always "predictable": not for people who don't know the language, not for machines, etc. Think about extreme cases like "Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo": even if you know English quite well, it's anything but easy. Lastly, for Breton, there are at least 100 different cases of inflection for nouns; it's totally unpredictable, and I had to double-check before adding them to skouarn (L627965). For me, the last example is a perfect demonstration of why it should be on the lexeme level. Cheers, VIGNERON (talk) 17:05, 26 April 2023 (UTC)
@VIGNERON To be clear, I am not suggesting that what I am saying about forms is true of all languages nor do I think the location of usage examples has to change for Breton if they are better suited on the lexeme level for that language.
All I am saying is that, for the reasons given above and in the languages mentioned above, I do not think usage examples are usable at the top of lexemes. The fact that dictionaries of these languages do place usage examples with the senses while others might not is I think a particularly compelling reason to do this, as it allows for better conformity with the sources being referenced. We do already have a number of properties which are placed on different entities depending on the language; for example on lexemes like vergessen (L412870) we might expect transitivity (P9295) to be on senses, while on lexemes like ਲੌਟਣ/لَوٹݨ (L1096159) we might expect to find this property only once at the top of the lexeme. عُثمان (talk) 20:59, 26 April 2023 (UTC)
We can also have examples for each form. Mtanti (talk) 19:58, 27 April 2023 (UTC)
In general, having the same thing in different places really doesn't seem like a good idea to me (is it really the same thing then?).
This applies especially here, if you duplicate the example from the lexeme level to both the sense and form levels. I prefer one statement on the lexeme level with 2 qualifiers to two duplicated statements on the sense and form levels. In the end, it's equivalent and involves the same amount of data, but the first option is far easier to maintain (if you have a correction or a clarification to make).
Cheers, VIGNERON (talk) 08:13, 29 April 2023 (UTC)
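(Illustrative note.) The maintenance argument above can be made concrete with a small sketch. The layout, IDs and field names below are hypothetical and only stand in for a usage example (P5831) statement with a form qualifier and a sense qualifier; they are not the real data model.

```python
# Hypothetical IDs and field names, for illustration only.
OLD, NEW = "example sentence with a typo", "example sentence"

# Option 1: one statement on the lexeme, qualified with the form and the sense.
lexeme_level = [{"text": OLD, "form": "L0-F2", "sense": "L0-S1"}]

# Option 2: the same sentence stored once on the form and once on the sense.
split_level = {
    "L0-F2": [{"text": OLD}],
    "L0-S1": [{"text": OLD}],
}


def fix_text(statements, old, new):
    """Replace old with new in a list of statements; return the number of edits."""
    edits = 0
    for statement in statements:
        if statement["text"] == old:
            statement["text"] = new
            edits += 1
    return edits


edits_option_1 = fix_text(lexeme_level, OLD, NEW)
edits_option_2 = sum(fix_text(s, OLD, NEW) for s in split_level.values())
print(edits_option_1, edits_option_2)  # prints "1 2"
```

The amount of information is the same in both options; the difference is only in how many places a correction has to touch.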
@VIGNERON I am going to set up खड़ा कर देना/کھڑا کر دینا (L1097048) to have usage examples on the senses to better illustrate what I mean, because I think with examples like this it is clearly not possible for examples to be useful on the lexeme. If it turns out to be a problem we can revert it and discuss but I really think this is something which is language-specific in terms of how it can be useful. عُثمان (talk) 17:10, 29 April 2023 (UTC)

Concerning Tajik and Persian

This subject was touched on in the broader conversation above a few months ago: Wikidata_talk:Lexicographical_data#Should_we_unify_some_languages?. However, that was quite a broad question about something that can only really be evaluated on a case by case basis.

I would like to ask @Farorud and @Darafsh, who have contributed most of the existing Tajik and Persian lexemes respectively, to reconsider the case for merging them. Concerning these lexemes in particular, I think a merger would be mutually beneficial, and that we have reached a point where maintaining them separately seems particularly counterproductive, considering there are only a few users creating them and no efforts are being made to coordinate them.

Out of the 13021 Persian lexemes currently on Wikidata, there are only five adverbs, and lexemes for a number of the most common grammatical words and verbs have yet to be created. While the number of Tajik lexemes is much smaller at 731, this includes a number of words which we may consider the core vocabulary of the language. If the lexeme for همیشه همیشه/ҳамеша (L589853) were merged for example, a large number of Persian or Tajik phrases could be constructed using lexemes which already exist instead of having to create new ones. Then that lexeme could be linked to lexemes in the many other languages which have borrowed this word. The largest portion of loaned vocabulary in Punjabi comes from Persian, and I do not intend to create new lexemes for words like همیشه همیشه/ҳамеша (L589853) when linking them in etymologies and would prefer to just use the ones which already exist. I would feel silly creating a duplicate of this lexeme knowing that if, for example, a sense were added to one of them, no effort would be made to check to see if that sense is also on the other. Besides that particular example, there are a number of lexemes for compounds added as Tajik or Persian for which combines lexemes (P5238) statements could be added right away without creating any new lexemes.

As it stands, most of the Tajik lexemes have no references linked to them. Practically all of them could become referenced in short order using Dehkhoda ID (P11328).

I do not think the reasons that have been brought up thus far not to do this are very compelling. To go over them, as I understand them:

  • There are pronunciation differences. This is true between dialects of any language and is not typically used as a reason to treat them separately. Words like basil (L20345) are not pronounced similarly at all across English dialects and this has not been treated as a concern at all. We already have a property available for clarifying which variety a statement about pronunciation applies to: pronunciation variety (P5237).
  • There are vocabulary differences. Болдаст is not actually the Tajik word for bat, this word has never been used anywhere besides Tajik Wikipedia. Even if it was, it is a compound constructed of Persian words which are so common that a Punjabi speaker would understand what it means. Words like даст if anything are examples of how few differences there actually are in the core vocabulary. Even besides all that, if a certain word is specific to a region or dialect, we do have variety of lexeme, form or sense (P7481) and location of sense usage (P6084). It is typical of most languages that there are some concepts for which the common word differs by dialect or region.
  • The writing system is different. We have the ability to place multiple representations on a form on Wikidata and this is already being done on a large number of lexemes. This tool can produce conversions between the two writing systems based on a parallel corpus: https://github.com/stibiumghost/tajik-to-persian-transliteration . The number of Tajik lexemes which already have Arabic script representations on them anyway is substantial. I would be happy to help with transcription between the two writing systems and discuss any particular tricky cases if they come up, as I do not think there will be any that have no solution.
  • The present tense conjugation differs between Iranian and Tajik Persian. This is as far as I can tell the most substantive difference between the two varieties, but I do not think it would be exceptionally challenging to model using variety of lexeme, form or sense (P7481). This is also a situation which is common between varieties of languages which are not treated separately. See, for example, L707490-F49.

Finally, I do agree with @Farorud that a different Wikidata item should be used for the language if these are to be merged. If Persian (Q9168) is used, the lexeme creation page will automatically set the representation to fa and not allow a tg input which is not desirable. The languages which already use multiple representations use language items which do not have one code or the other linked to them. The simplest option seems to be Southwestern Iranian (Q390424). This term also covers Persian varieties used in parts of Afghanistan and Pakistan (for which duplicating these lexemes again does not seem like a great idea). There might be a better option I have not considered however. عُثمان (talk) 21:45, 26 April 2023 (UTC)
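(Illustrative note.) A merged form with both writing systems might look roughly like the sketch below. It assumes a simplified version of the lexeme form layout in which `representations` is keyed by language code (`fa` and `tg`); the form ID is a placeholder, and the spellings are the ones quoted earlier in this discussion.

```python
# Simplified sketch of one form carrying both scripts; the ID is a placeholder.
form = {
    "id": "L0-F1",
    "representations": {
        "fa": {"language": "fa", "value": "روشن"},
        "tg": {"language": "tg", "value": "равшан"},
    },
}


def representation(form, code):
    """Return the spelling for one language code, or None if it is missing."""
    entry = form["representations"].get(code)
    return entry["value"] if entry else None


def missing_codes(form, wanted=("fa", "tg")):
    """List the wanted language codes that still lack a representation."""
    return [code for code in wanted if code not in form["representations"]]
```

A helper like `missing_codes` could be used to find merged forms that still need a transcription into the other script.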

Thanks for bringing this topic up. I need more time to consult with experts and think about this merger. Darafsh Kaviyani (Talk) 12:27, 27 April 2023 (UTC)
@Darafsh Thank you for considering. Let me know if there is anything I can do to help; I am thinking of creating some example combined lexemes if that would help. عُثمان (talk) 16:40, 27 April 2023 (UTC)
@عُثمان I think it would be useful to have some examples. Darafsh Kaviyani (Talk) 17:06, 27 April 2023 (UTC)
@Darafsh Here are a couple of examples I have tried:
  • الکی/аллакай (L1097039) This is a word which is apparently common in Tajik Persian but not Iranian Persian. I referred to this source for the Arabic script spelling: https://archive.org/details/2-1-2008/ I added statements to the form and sense indicating that it is primarily used in Tajik.
  • تبیت/тибит (L1097040) This is a noun which is pronounced differently in Tajik and Iranian Persian. I have shown this pronunciation difference in the statements on form 1. As I understand it, the Arabic element of both varieties is largely the same, but for some words the final tah marbouta has been realized as "at" in Iranian Persian and "a" in Tajik Persian. This difference is one I would consider insignificant enough that only one form is needed still to represent both pronunciations.
I think the verb forms can actually remain exactly the same. I was wrong about the difference in the present tense - the difference is in how compound tenses are used in the context of a sentence; the basic forms and verb endings are the same between the two varieties.
For nouns, Tajik treats را ро as a case suffix, whereas in modern Iranian Persian it is typically written separately. I have represented that on the second example here. Differences like this are very common between Hindi and Urdu and they are already represented in this way; Hindi होकर would be written as ہو کر in Urdu. (See होकर/ہو کر (L579662-F7)) عُثمان (talk) 19:43, 27 April 2023 (UTC)
I am thinking New Persian (Q56356571) might be more appropriate actually. عُثمان (talk) 16:39, 27 April 2023 (UTC)