Wikidata:Property proposal/Lexemes


This page is for the proposal of new properties.

Before proposing a property

  1. Check whether the property already exists.
  2. Check whether the property has already been proposed.
  3. Check whether you can give it a label and definition similar to an existing Wikipedia infobox parameter, or whether it can be matched to an infobox to or from which data can be transferred automatically.
  4. Select the right datatype for the property.
  5. Read Wikidata:Creating a property proposal for the guidelines to follow when proposing a new property.
  6. Start writing the documentation based on the preload form below by editing the two templates at the top of the page to add proposal details.

Creating the property

  1. Once consensus is reached, set status=ready on the template to attract the attention of a property creator.
  2. A property creator or an administrator can create the property one week after the proposal was created.
  3. See property creation policy.

On this page, old discussions are archived. An overview of all archives can be found at this page's archive index. The current archive is located at 2024/03.

Wikibase lexeme

character in this lexeme

   Under discussion
Description: character(s) this lexeme consists of
Represents: cuneiform sign (Q23017336)
Data type: Item
Domain: lexeme, form
Example 1: ga/𒂷 (L726974) → 𒂷 (Q87555355)
Example 2: dumu/𒌉 (L643788) → 𒌉 (Q87556519)
Example 3: dingir/𒀭 (L724542) → 𒀭 (Q87555087)
Planned use: Linking lexemes to character representations
See also: Han character in this lexeme (P5425), which links Han characters in Japanese and Chinese lexemes to Unicode; https://www.wikidata.org/wiki/Wikidata:Property_proposal/Cuneiform_character_in_this_lexeme for the previous discussion on the property "cuneiform character in this lexeme"

Motivation

Currently, Wikidata lacks a property for linking lexeme representations to the QIDs of characters in a given script. The examples above show how to link cuneiform lexemes to the character items that represent their Unicode code points, but the property could be used to link any lexeme to the relevant characters of the script it is written in. The property Han character in this lexeme (P5425) already allows Han characters in Chinese, Japanese, and Vietnamese lexemes to be linked to their respective items in Wikidata. This proposal aims either to generalize Han character in this lexeme (P5425) as "character in this lexeme" or to make the new property a superproperty of Han character in this lexeme (P5425).

See also the discussion at https://www.wikidata.org/wiki/Wikidata:Property_proposal/Cuneiform_character_in_this_lexeme, which led to this property proposal being created instead.

  Support seems fine to me, either as a superproperty or replacement. ArthurPSmith (talk) 20:06, 4 May 2023 (UTC)
  Support this is an excellent idea, since the more general use of 'character' will allow a number of languages with the same issue to progress in morpho-graphemic annotation, including Sumerian, Akkadian, Hittite, Hurrian, Ugaritic, Elamite, and Old Persian, just to name a few. Admndrsn (talk) 9:11, 8 May 2023 (EST)
  Comment Would this also mean that it could be used to say rød (L2310) character in this lexeme Ø (Q28827) and D (Q9884)? Finn Årup Nielsen (fnielsen) (talk) 18:45, 8 May 2023 (UTC)
Yes, you could also use it for the Latin alphabet, as in the example you proposed, though I expect people to use it more for scripts like Chinese or cuneiform, where individual characters often carry their own meaning.
But why not? You could query all lexemes with Ø (Q28827) if that is of interest. Situxx (talk) 13:35, 9 May 2023 (UTC)
You can already do that: see this query, for example. - Nikki (talk) 20:03, 24 June 2023 (UTC)
@Fnielsen: would you like to give your opinion? Regards, ZI Jony (Talk) 06:14, 24 January 2024 (UTC)
  Support will help and advance digital cuneiform studies Enki75 (talk) 11 May 2023
  Strong oppose Having a generic property like this is a really bad idea. Linking characters in lexemes to the corresponding items can easily be done automatically, so there should be a really good reason to add links manually instead. For Han character in this lexeme (P5425), that is because items for Han characters have useful lexicographical data on them which would otherwise end up duplicated as lexemes. My opposition to the previous proposal was because items for Cuneiform characters do not have useful lexicographical data on them, and that is still the case, looking at the items in the examples.
If this is added, people will surely start mass-adding it to every lexeme eventually. We are already having problems with the query service because of the amount of data, and adding millions more statements linking every character in a lexeme would only cause more problems for us. - Nikki (talk) 20:03, 24 June 2023 (UTC)
Here's a simple script I just made to list the characters in a lexeme automatically: User:Nikki/LexemeLinkCharacters.js. - Nikki (talk) 21:27, 24 June 2023 (UTC)
Hi!
Thanks for your comment.
In the previous proposal, you gave the following properties as examples of the lexicographical data you find missing in the cuneiform examples:
I quote: "(e.g. stroke count (P5205), grade of kanji (P5277), radical (P5280), ideographic description sequences (P5753))"
All of this information can be added; we just lack properties for some of it as well, so for now you only see the information that can be expressed with the properties we have:
- stroke count: currently proposed at https://www.wikidata.org/wiki/Wikidata:Property_proposal/Gottstein_code
- radicals: currently represented using "has parts" relations (at least for Unicode signs which allow this); see https://www.wikidata.org/wiki/Q87555001 for an example
- depicts relations describe what the character depicts (which is often different from the sense of the lexemes using the character)
- dictionary references to sign lists
You can find an example in this web application, which also illustrates the main use case I have in mind: https://situx.github.io/paleordia/c/?q=Q87554995&qLabel=%F0%92%80%80
The script you posted is certainly useful for linking from a lexeme to its characters, but my use case is actually the opposite.
For example, I would like to know which lexemes contain a given cuneiform sign. Unless I have missed a better solution, the SPARQL query to achieve this would need a set of languages written in the cuneiform script and would have to check the lemmas (and maybe also the forms) of all of these languages with regex matching.
Such a query is also on the homepage; it runs only over Sumerian but is already quite slow (https://situx.github.io/paleordia/c/?q=Q87554995&qLabel=%F0%92%80%80).
If we had a property like the one proposed here, I think it would be easier to query for the lexemes that contain a cuneiform sign, or any other sign in other languages for that matter.
Finally, there is the issue of paleographic sign variants:
There might be certain lexemes which are only written with sign variants of a specific shape.
You can look at the sign AN here: https://situx.github.io/paleordia/c/?q=Q87555087&qLabel=%F0%92%80%AD
It looks different depending on the time period.
We currently cannot express that either, but we have the information and will gradually add it to Wikidata as a prototype of a digital paleography. Situxx (talk) 11:38, 26 June 2023 (UTC)
@Nikki: has your opinion changed based on the response? Regards, ZI Jony (Talk) 06:14, 24 January 2024 (UTC)
  Oppose the current (far too general) property name, but would be Neutral, now that phonetic value (P12436) is being used, if it were restricted to cuneiform symbols. Mahir256 (talk) 14:15, 6 March 2024 (UTC)
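
For readers weighing the two positions above (deriving characters automatically versus storing one statement per character), the following is a minimal Python sketch of the automatic approach. It is not the script linked in the discussion; the function names are hypothetical, and the sample data only reuses the cuneiform lemmas from the proposal's three examples. It lists the code points of a lemma and builds the reverse index (character to lexemes) that the proposer's use case asks for.

```python
from collections import defaultdict


def code_points(text: str) -> list[str]:
    """Return the Unicode code points of `text` in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]


def invert(lemmas: dict[str, str]) -> dict[str, list[str]]:
    """Map each character to the IDs of the lexemes whose lemma contains it."""
    index: dict[str, list[str]] = defaultdict(list)
    for lexeme_id, lemma in lemmas.items():
        # dict.fromkeys de-duplicates repeated characters but keeps their order
        for ch in dict.fromkeys(lemma):
            index[ch].append(lexeme_id)
    return dict(index)


# Cuneiform lemmas copied from the proposal's examples (illustrative only).
SAMPLE_LEMMAS = {"L726974": "𒂷", "L643788": "𒌉", "L724542": "𒀭"}

if __name__ == "__main__":
    print(code_points("rød"))
    print(invert(SAMPLE_LEMMAS))
```

A local index like this needs a dump or API export of the lemmas first; whether that is preferable to stored statements or to a regex-based SPARQL query is exactly the trade-off debated above.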

Aragonario ID (6th version)

   Under discussion
Description: identifier for an Aragonese or Spanish lexeme in the Aragonese-Spanish online dictionary (version since January 2023)
Data type: External identifier
Domain: Aragonese and Spanish lexemes
Allowed values: [1-9][0-9]{6}
Example 1: augua (L8226) → 1074499
Example 2: sangonera (L307650) → 1108110
Example 3: abanderato (L647971) → 1070015
Source: https://aragonario.aragon.es/
Planned use: add to existing Aragonese and Spanish lexemes
Number of IDs in source: between 75,137 (45,112 + 30,025) and 82,145 (1114581 - 1032437)
Expected completeness: eventually complete (Q21873974)
Formatter URL: https://aragonario.aragon.es/words/$1/
See also: Aragonario ID (5th version) (P11071)
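
As a rough illustration of how the allowed-values pattern and formatter URL above fit together, here is a small Python sketch. The helper names are hypothetical. The bounds check assumes that 1032437 and 1114581 are the lowest and highest IDs observed, in which case the stated upper bound of 82,145 corresponds to the inclusive size of that range (difference plus one).

```python
import re

# "Allowed values" and "Formatter URL" as given in the proposal.
ID_PATTERN = re.compile(r"[1-9][0-9]{6}")
FORMATTER_URL = "https://aragonario.aragon.es/words/$1/"


def is_valid_id(value: str) -> bool:
    """True if the whole value is a 7-digit number without a leading zero."""
    return ID_PATTERN.fullmatch(value) is not None


def entry_url(value: str) -> str:
    """Substitute a validated ID into the formatter URL."""
    if not is_valid_id(value):
        raise ValueError(f"not a valid Aragonario ID: {value!r}")
    return FORMATTER_URL.replace("$1", value)


# Bounds quoted under "Number of IDs in source".
LOWER_BOUND = 45_112 + 30_025            # sum of the two counts given
UPPER_BOUND = 1_114_581 - 1_032_437 + 1  # inclusive size of the ID range (assumption)

if __name__ == "__main__":
    for example in ("1074499", "1108110", "1070015"):
        print(entry_url(example))
    print(LOWER_BOUND, UPPER_BOUND)
```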

Motivation

It appears that this month, a new version of the Aragonario was launched with several thousand more entries compared to the original version, leading to the invalidation of all IDs from the previous version. This proposal covers IDs from the new version, in line with there being separate properties for new and former schemes.

(Those former IDs are not all lost, however: the proposal for the property covering the previous version links to a spreadsheet, which I compiled a few months ago, with a complete list of those IDs, and they should continue to be added for posterity. I have now begun compiling a list of the newer IDs, and I hope this will make it easier to reconcile information between the two versions, which I intend to do myself.) Mahir256 (talk) 17:57, 20 January 2023 (UTC)

Discussion

Kamus Besar Bahasa Indonesia Daring entry

   Under discussion
Description: identifier for an entry in the online version of Kamus Besar Bahasa Indonesia
Represents: Great Dictionary of the Indonesian Language (Q4200623)
Data type: External identifier
Domain: lexeme
Allowed values: [a-z0-9\.,'\-_\(\); ]+
Example 1: cagar budaya (L739124) → cagar budaya and cagar_budaya
Example 2: Yth. (L1119265) → Yth.
Example 3: Al-Qur'an (L1119263) → Al-Qur'an
Example 4: S-1 (L1119266) → S-1
Example 5: umbi-umbian (L700147) → umbi-umbian
Example 6: patah tongkat berjeremang (L1119283) → patah tongkat berjeremang (patah sayap bertongkat paruh; patah tongkat bertelekan)
Example 7: pucuk dicinta, ulam tiba (L1119282) → pucuk dicinta, ulam tiba (hendak ulam pucuk menjulai)
Number of IDs in source: 119,345
Expected completeness: always incomplete (Q21873886)
Formatter URL: https://kbbi.kemdikbud.go.id/entri/$1
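
The allowed-values pattern can be checked against the listed examples with a short Python sketch such as the one below. The helper names are hypothetical. One observation from running it: as written, the pattern only admits lower-case letters, so examples such as "Yth." and "Al-Qur'an" match only when the check is case-insensitive (or if the pattern were extended with A-Z). How the format constraint would actually be evaluated on Wikidata is not assumed here. Percent-encoding the value before inserting it into the formatter URL is likewise an assumption of this sketch.

```python
import re
from urllib.parse import quote

# "Allowed values" and "Formatter URL" as given in the proposal.
ALLOWED = r"[a-z0-9\.,'\-_\(\); ]+"
FORMATTER_URL = "https://kbbi.kemdikbud.go.id/entri/$1"


def matches(value: str, ignore_case: bool = False) -> bool:
    """Check the whole value against the proposed pattern."""
    flags = re.IGNORECASE if ignore_case else 0
    return re.fullmatch(ALLOWED, value, flags) is not None


def entry_url(value: str) -> str:
    """Build an entry URL, percent-encoding spaces and punctuation."""
    return FORMATTER_URL.replace("$1", quote(value, safe=""))


if __name__ == "__main__":
    for example in ("cagar budaya", "cagar_budaya", "Yth.", "Al-Qur'an", "S-1"):
        print(example, matches(example), matches(example, ignore_case=True))
    print(entry_url("cagar budaya"))
```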

Motivation

KBBI is the dictionary most widely used by Indonesians. This property will help contributors at WikiProject Indonesia cite their source and will give others a useful link to authoritative information when they are creating or editing Indonesian lexemes. It would also complement existing properties such as Oxford English Dictionary entry ID (pre-July 2023) (P5275), Collins Online English Dictionary entry (P11230), Lëtzebuerger Online Dictionnaire ID (P9397), and Cambridge Dictionary entry (British English) (P11422). Labdajiwa (talk) 06:00, 24 May 2023 (UTC)

Discussion

  • Support Seems ok to me. ArthurPSmith (talk) 20:49, 1 June 2023 (UTC)
  • Comment: Given that this would potentially be just as useful for the lexemes currently modeled as Malay (Q9237), I would like to see it clarified whether a) these languages should be merged or b) these languages will be maintained separately. That is, should identifiers from this dictionary be placed on equivalent/identical Malay and Indonesian lexemes, and is there a plan to ensure that information sourced from this dictionary is used to update lexemes in both varieties where applicable? -عُثمان (talk) 17:15, 6 June 2023 (UTC)
    Well, there was a discussion about this years ago, and the participants did not seem to agree to merge Indonesian with Malay. I'd say this identifier will be used mainly for Indonesian. As for whether this dictionary can be used for Malay lexemes: plenty of its entries are marked as Malay (e.g. sama ada). Perhaps it can also be used for lexemes in regional languages of Indonesia, many of which do not have a reliable online dictionary of their own. And FYI, as of June 2023, Indonesian has 19,864 lexemes, while Malay has 2,729 lexemes. Labdajiwa (talk) 14:31, 7 June 2023 (UTC)
    @Labdajiwa: Are you sure the 'Mal' annotation doesn't simply mean 'used primarily in Malaysia' (i.e. Q15065, to be distinguished from the more general Q9237, to which at least a plurality of the entries in this dictionary would apply), in much the same way that there are Malay words used, say, mainly in Brunei or mainly in Singapore? You should clarify your 'perhaps' statement: can this or can this not be used for such regional languages? And lexeme count is not relevant when lexemes lack meanings; all Malay lexemes have at least one, while this cannot be said for the Indonesian ones. Mahir256 (talk) 04:39, 31 July 2023 (UTC)
    This Malay-Indonesian debate seems to stem from confusion. AFAIK, linguistically, in English, the official and standardized forms of the language in each country are called "Malaysian Malay" and "Indonesian". Both languages are descended from (or perhaps more correctly "grouped under") "Malay". Malay, translated into Indonesian, is Melayu. In Indonesia, Melayu is commonly used to refer to "a language that is used in Malaysia". Malaysian Malay could be translated into Indonesian as Melayu Malaysia, but nobody uses that term in Indonesia. The merge proposal sounds like a proposal to merge Malaysian Malay and Indonesian, which I think is not possible because "A language is a dialect with an army and navy". Hddty (talk) 02:22, 12 August 2023 (UTC)
    @عُثمان, Mahir256, Hddty: would you like to give your opinion? @Labdajiwa: could you please respond to the comments above? Regards, ZI Jony (Talk) 06:18, 24 January 2024 (UTC)
    @ZI Jony I have nothing to add here that has not already been said. If we want identical lexemes representing words in the same dialect of the same language under a different name, we can already find this kind of non-information by browsing the Bahasa Indonesia category on Malay Wiktionary or vice versa. There we can find entries for words either with identical definitions or with "translated" definitions restating the headword. It just seems like a missed opportunity to actually use linked data for what it is suited for. Treating pluricentric languages such as Hindi/Urdu and Tajik/Persian together has already been demonstrated as possible, and plenty of linguistic literature does not make any effort to treat Malay/Indonesian as separate languages. عُثمان (talk) 07:38, 24 January 2024 (UTC)

MobiTUKI Swahili-English Dictionary entry

   Under discussion
Description: entry for a Swahili word in the online edition of the TUKI Swahili-English dictionary
Data type: External identifier
Example 1: godoro/ڠٗدٗورٗ (L1157217) → godoro_godoro
Example 2: ote/أٗوتٖ (L1230685) → -o-ote_ote
Example 3: habedari/هَبٖدَارِ (L1226698) → habedari!_habedari!

Motivation

MobiTUKI Swahili-English Dictionary (Q122264347) is a comprehensive and useful source worth linking to lexemes in Swahili (Q7838). The URL format would be https://swahili-dictionary.com/swahili-english/$1 and a regular expression for validating the identifier value is [A-PR-WYZa-pr-wyz\-\!]+\_[A-PR-WYZa-pr-wyz\-\!]+ .
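
The regular expression and URL format above can be exercised with a small Python sketch like the following; the helper names are hypothetical. The sketch also lists which ASCII letters the character class leaves out, which turn out to be q and x, consistent with standard Swahili orthography. The value is inserted into the URL unencoded; whether characters such as "!" need percent-encoding is left open here.

```python
import re
import string

# Validation pattern and URL format as given in the motivation.
PATTERN = re.compile(r"[A-PR-WYZa-pr-wyz\-\!]+\_[A-PR-WYZa-pr-wyz\-\!]+")
URL_FORMAT = "https://swahili-dictionary.com/swahili-english/$1"

# The character class on its own, used only to inspect which letters it admits.
CHAR_CLASS = re.compile(r"[A-PR-WYZa-pr-wyz\-\!]")


def is_valid(value: str) -> bool:
    """True if the whole value has the form <part>_<part>."""
    return PATTERN.fullmatch(value) is not None


def excluded_letters() -> list[str]:
    """Lower-case ASCII letters that the character class does not accept."""
    return [c for c in string.ascii_lowercase if not CHAR_CLASS.fullmatch(c)]


def entry_url(value: str) -> str:
    """Substitute a validated identifier into the URL format."""
    if not is_valid(value):
        raise ValueError(f"not a valid identifier: {value!r}")
    return URL_FORMAT.replace("$1", value)


if __name__ == "__main__":
    for example in ("godoro_godoro", "-o-ote_ote", "habedari!_habedari!"):
        print(example, is_valid(example))
    print(excluded_letters())
```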

Discussion

  Notified participants of WikiProject Africa Regards, ZI Jony (Talk) 13:44, 8 March 2024 (UTC)

Lisaan Masry Egyptian Arabic Dictionary ID

   Ready
Description: entry for a lexeme in the online Lisaan Masry Egyptian Arabic Dictionary
Data type: External identifier
Domain: lexemes in Egyptian Arabic (Q29919) (and apparently Modern Standard Arabic (Q56467) and English (Q1860) as well)
Example 1: سِتّ (L2546) → 5592
Example 2: رفيّع (L706535) → 8934
Example 3: نضف (L709452) → 166
Example 4: peanut butter (L1259881) → 7022
Example 5: fall in love (L34422) → 18427
Example 6: ichneumon (L322227) → 12578
Number of IDs in source: 21,959 (at least)
Formatter URL: https://www.lisaanmasry.org/online/word.php?ui=&id=$1

Motivation

Lisaan Masry Egyptian Arabic Dictionary (Q124630462) seems useful to link to Egyptian Arabic lexemes. It features definitions, grammatical details, a list of important word forms, usage examples, and pronunciation audio. There are also IDs for reverse entries, which can be linked to English lexemes. The URL format is https://www.lisaanmasry.org/online/word.php?ui=&id=$1 --عُثمان (talk) 16:40, 21 February 2024 (UTC)

Discussion

Wikibase form

Wikibase sense

DWDS sense ID

   Ready
Description: URL slug of a sense in DWDS
Represents: DWDS-Wörterbuch (Q108696977)
Data type: External identifier
Domain: German lexeme sense
Allowed values: [A-Za-z0-9ÄÖÜäöüß-]+#d-[\d-]+
Example 1: Hand (L25807-S5) → Hand#d-1-3
Example 2: Hand (L25807-S1) → Hand#d-1-6
Example 3: Hand (L25807-S2) → Hand#d-1-1
Expected completeness: always incomplete (Q21873886)
Formatter URL: https://www.dwds.de/wb/$1
Single-value constraint: yes
Distinct-values constraint: yes

Motivation

IDs for senses are particularly useful for words with many senses, to keep track of missing ones. I also plan to propose Duden sense IDs so that we can match senses from different dictionaries. The ID can be acquired by clicking the links in the Bedeutungsübersicht section for each lexeme. Shisma (talk) 09:08, 28 February 2024 (UTC)

Discussion

  Notified participants of WikiProject Germany Regards, ZI Jony (Talk) 13:32, 8 March 2024 (UTC)

  Support Bigbossfarin (talk) 17:30, 8 March 2024 (UTC)

Other