Wikidata:Property proposal/Lexemes
See also
- Wikidata:Property proposal/Pending – properties which have been approved but which are on hold waiting for the appropriate datatype to be made available
- Wikidata:Properties for deletion – proposals for the deletion of properties
- Wikidata:External identifiers – statements to add when creating properties for external IDs
- Wikidata:Lexicographical data – information and discussion about lexicographic data on Wikidata
This page is for the proposal of new properties.
Before proposing a property
- Check whether the property already exists.
- Check whether the property has already been proposed.
- Check whether you can give it a label and definition similar to an existing Wikipedia infobox parameter, or whether it can be matched to an infobox to or from which data can be transferred automatically.
- Select the right datatype for the property.
- Read Wikidata:Creating a property proposal for the guidelines you should follow when proposing a new property.
- Start writing the documentation based on the preload form below by editing the two templates at the top of the page to add proposal details.
Creating the property
- Once consensus is reached, change status=ready on the template, to attract the attention of a property creator.
- Creation can be done 1 week after the creation of the proposal, by a property creator or an administrator.
- See property creation policy.
On this page, old discussions are archived. An overview of all archives can be found at this page's archive index. The current archive is located at 2024/03.
Wikibase lexeme
character in this lexeme
Description | character(s) this lexeme consists of |
---|---|
Represents | cuneiform sign (Q23017336) |
Data type | Item |
Domain | lexeme, form |
Example 1 | ga/𒂷 (L726974) → 𒂷 (Q87555355) |
Example 2 | dumu/𒌉 (L643788) → 𒌉 (Q87556519) |
Example 3 | dingir/𒀭 (L724542) → 𒀭 (Q87555087) |
Planned use | Linking lexemes to character representations |
See also | Han character in this lexeme (P5425) which links Han Chinese characters in Japanese and Chinese lexemes to Unicode, https://www.wikidata.org/wiki/Wikidata:Property_proposal/Cuneiform_character_in_this_lexeme for the previous discussion on the property "cuneiform character in this lexeme" |
Motivation
Currently, Wikidata lacks a property for linking lexeme representations to the QIDs of characters of a given script. The examples above show how to link cuneiform lexemes to the character QIDs that represent their Unicode code points, but the property could be used to link any lexeme to the relevant parts of the script it uses. The property Han character in this lexeme (P5425) already allows linking Han characters in Chinese, Japanese, and Vietnamese lexemes to their respective representations in Wikidata. This proposal aims either to generalize Han character in this lexeme (P5425) as "character in this lexeme" or to make the new property a superproperty of Han character in this lexeme (P5425).
See also the discussion of https://www.wikidata.org/wiki/Wikidata:Property_proposal/Cuneiform_character_in_this_lexeme which led to the creation of this property proposal instead.
- Support seems fine to me, either as a superproperty or replacement. ArthurPSmith (talk) 20:06, 4 May 2023 (UTC)
- Support this is an excellent idea, since the more general use of 'character' will allow for a number of languages with the same issue to progress in morpho-graphemic annotation, including: Sumerian, Akkadian, Hittite, Hurrian, Ugaritic, Elamite, Old Persian, just to name a few. Admndrsn (talk) 9:11, 8 May 2023 (EST)
- Comment Would this also mean that it could be used to say rød (L2310) character in this lexeme Ø (Q28827) and D (Q9884)? - Finn Årup Nielsen (fnielsen) (talk) 18:45, 8 May 2023 (UTC)
- Yes, you could also use it for the Latin alphabet, as in the example you proposed, although I see people using this more for languages like Chinese or scripts like cuneiform, where individual characters often carry their own meaning.
- But why not? You could query all lexemes with Ø (Q28827) if that is interesting to do. Situxx (talk) 13:35, 9 May 2023 (UTC)
- You already can do that: See this query for example. - Nikki (talk) 20:03, 24 June 2023 (UTC)
- @Fnielsen:, would you like to give your opinion? Regards, ZI Jony (Talk) 06:14, 24 January 2024 (UTC)
- Support will help and advance digital cuneiform studies Enki75 (talk) 11 May 2023
- Strong oppose Having a generic property like this is a really bad idea. Linking characters in lexemes to the corresponding items can easily be done automatically, so there should be a really good reason to add links manually instead. For Han character in this lexeme (P5425), that is because items for Han characters have useful lexicographical data on them which would otherwise end up duplicated as lexemes. My opposition to the previous proposal was because items for Cuneiform characters do not have useful lexicographical data on them, and that is still the case, looking at the items in the examples.
- If this is added, people will surely start mass-adding it for every lexeme eventually. We are already having problems with the query service because of the amount of data, and adding millions more statements linking every character in a lexeme would only cause more problems for us. - Nikki (talk) 20:03, 24 June 2023 (UTC)
- Here's a simple script I just made to list the characters in a lexeme automatically: User:Nikki/LexemeLinkCharacters.js. - Nikki (talk) 21:27, 24 June 2023 (UTC)
- Hi!
- Thanks for your comment.
- In the previous proposal, you wrote about the following properties as examples which would constitute the lexicographical data you are missing in the cuneiform examples:
- I quote: "(e.g. stroke count (P5205), grade of kanji (P5277), radical (P5280), ideographic description sequences (P5753))"
- All of this information can be added, we are just lacking properties for that as well, hence you only see information which can be added right now with the properties we have, which are the following:
- - stroke count: Is currently proposed here: https://www.wikidata.org/wiki/Wikidata:Property_proposal/Gottstein_code
- - radicals are currently represented using "has parts" relations (at least for Unicode signs which allow so) (see here for an example https://www.wikidata.org/wiki/Q87555001)
- - depicts relations describe what the character depicts (which is often different from the sense of the Lexemes using the character)
- - dictionary references to signlists
- You can find an example in this web application which would also illustrate the main use case I have in mind: https://situx.github.io/paleordia/c/?q=Q87554995&qLabel=%F0%92%80%80
- The script you posted is certainly useful to link from a Lexeme to its characters, but my use case is actually the opposite.
- I would for example like to know which Lexemes contain a cuneiform sign. Unless I have missed a better solution, the SPARQL query to achieve this would need a set of languages written in the cuneiform script and check the lemmas (maybe also forms) of all of these languages with regex matching.
- Such a query is also used on the homepage; it runs only over Sumerian but is already quite slow (https://situx.github.io/paleordia/c/?q=Q87554995&qLabel=%F0%92%80%80).
- If we had a property like the one proposed here I think it would be easier to query for the lexemes which fit a cuneiform sign or whatever other sign in other languages for that matter.
- Finally, there is the issue of paleographic sign variants:
- There might be certain Lexemes which are only written with sign variants of a specific shape.
- You can look at the sign AN here https://situx.github.io/paleordia/c/?q=Q87555087&qLabel=%F0%92%80%AD
- which looks different depending on the time period.
- We currently cannot express that either, but we have the information and will gradually add it to Wikidata as a prototype of a digital paleography. Situxx (talk) 11:38, 26 June 2023 (UTC)
- @Nikki:, has your opinion changed based on the response? Regards, ZI Jony (Talk) 06:14, 24 January 2024 (UTC)
- Oppose the current (far too general) property name, but would be Neutral, now that phonetic value (P12436) is being used, if it were restricted to cuneiform symbols. Mahir256 (talk) 14:15, 6 March 2024 (UTC)
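The two directions argued over in this discussion, listing the characters of a lexeme and finding the lexemes that contain a given sign, can both be derived from the lemma text itself. The following is a minimal Python sketch of that idea, not the logic of User:Nikki/LexemeLinkCharacters.js. The function names are hypothetical, and recognizing cuneiform by code point range (the three Unicode cuneiform blocks, U+12000 to U+1254F) is an assumption made for the example.

```python
# Hypothetical helpers; the names and the block range are assumptions of this sketch.
# The range covers the blocks Cuneiform, Cuneiform Numbers and Punctuation,
# and Early Dynastic Cuneiform.
CUNEIFORM_RANGE = range(0x12000, 0x12550)


def cuneiform_characters(lemma: str) -> list[str]:
    """Return the distinct cuneiform characters of a lemma, in order of appearance."""
    seen: list[str] = []
    for char in lemma:
        if ord(char) in CUNEIFORM_RANGE and char not in seen:
            seen.append(char)
    return seen


def lemmas_containing(sign: str, lemmas: list[str]) -> list[str]:
    """Reverse lookup: which of the given lemmas contain the sign?"""
    return [lemma for lemma in lemmas if sign in lemma]


# Code points taken from the qLabel parameters of the links in the discussion.
SIGN_AN = "\U0001202D"      # the sign AN
SIGN_U12000 = "\U00012000"  # the sign with code point U+12000

print(cuneiform_characters(SIGN_AN + SIGN_U12000 + SIGN_AN))
print(lemmas_containing(SIGN_AN, [SIGN_AN, SIGN_U12000]))
```

The reverse lookup here scans a list held in memory. Whether an equivalent scan is practical on the query service, or whether stored statements are needed instead, is the point on which the participants disagree.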
Aragonario ID (6th version)
Description | identifier for an Aragonese or Spanish lexeme in the Aragonese-Spanish online dictionary (version since January 2023) |
---|---|
Data type | External identifier |
Domain | Aragonese and Spanish lexemes |
Allowed values | [1-9][0-9]{6} |
Example 1 | augua (L8226) → 1074499 |
Example 2 | sangonera (L307650) → 1108110 |
Example 3 | abanderato (L647971) → 1070015 |
Source | https://aragonario.aragon.es/ |
Planned use | add to existing Aragonese and Spanish lexemes |
Number of IDs in source | between 75,137 (45,112 + 30,025) and 82,145 (1114581 - 1032437) |
Expected completeness | eventually complete (Q21873974) |
Formatter URL | https://aragonario.aragon.es/words/$1/ |
See also | Aragonario ID (5th version) (P11071) |
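As a rough illustration of how the rows above fit together, the following Python sketch checks a value against the allowed-values pattern, expands the formatter URL, and recomputes the two bounds given for the number of IDs. The function name `aragonario_url` is hypothetical, and reading the upper bound as an inclusive count of the ID range is an assumption.

```python
import re

ALLOWED = re.compile(r"[1-9][0-9]{6}")                    # "Allowed values" row
FORMATTER_URL = "https://aragonario.aragon.es/words/$1/"  # "Formatter URL" row


def aragonario_url(identifier: str) -> str:
    """Validate an ID against the allowed-values pattern and expand the formatter URL."""
    if ALLOWED.fullmatch(identifier) is None:
        raise ValueError(f"not a valid ID: {identifier!r}")
    return FORMATTER_URL.replace("$1", identifier)


print(aragonario_url("1074499"))

# Bounds from the "Number of IDs in source" row.
lower_bound = 45_112 + 30_025
# Assumption: the figure 82,145 counts both ends of the ID range.
upper_bound = 1_114_581 - 1_032_437 + 1
print(lower_bound, upper_bound)
```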
Motivation
It appears that this month, a new version of the Aragonario was launched with several thousand more entries compared to the original version, leading to the invalidation of all IDs from the previous version. This proposal covers IDs from the new version, in line with there being separate properties for new and former schemes.
(Those former IDs are not all lost, however, as the proposal for the property covering the previous version has a link to a spreadsheet with a complete list of all of those IDs I compiled a few months ago--they should continue to be added for posterity. I have now begun compiling a list of the newer IDs, and it is hoped that reconciling information between the two versions—which I intend to do myself—will be made easier as a result.) Mahir256 (talk) 17:57, 20 January 2023 (UTC)
Discussion
- @Aradgl, Uesca: and @Nikki, عُثمان, Bovlb: from the previous proposal. Mahir256 (talk) 17:57, 20 January 2023 (UTC)
- I'm afraid, as expected, the aragonario has changed its routes and the Aragonario ID no longer works.
- It is not useful to use a web ID that can change. If someone wants to inquire about that lexeme, they can do so by searching for the lexeme itself in the Aragonario or in another source of information (paper dictionaries, for example).
- I do not have any kind of control over the Aragonario, nor can we demand anything of it.
- @Mahir256 Uesca (talk) 18:15, 20 January 2023 (UTC)
- @Uesca: There is still merit to retaining the old identifiers; many of them are still accessible through the Internet Archive, and its function of serving as an identifier has not really diminished (see, in addition to the 'former scheme' properties, ones like ISOCAT ID (P2263), Google+ ID (P2847), and other properties for discontinued websites). As for the issue of changes in IDs, these too can be reflected in the data; if their ability to change made them not useful, then properties for social media accounts--whose IDs can frequently be changed by their users--would also not be useful. Mahir256 (talk) 18:23, 20 January 2023 (UTC)
- Do they have any policy about identifiers? Do we have any contacts on their team that can advise us? I'd love to be able to map this stuff, but it's not very satisfactory to support a property for an identifier that can be invalidated on a whim. Bovlb (talk) 18:57, 20 January 2023 (UTC)
- @Bovlb: I sent an email to the address posted on the 'Contacto' page of that site asking about the stability of their identifiers. Mahir256 (talk) 20:13, 20 January 2023 (UTC)
- @Uesca, Bovlb: After resending the message once, I eventually got a reply. Mahir256 (talk) 17:33, 3 February 2023 (UTC)
- Hmm. Thanks for following up.
- When they say "right now we are creating permanent and stable links", does that mean that the links they currently create are permanent and stable, or does it mean that they're currently designing yet another version of identifiers, this time to be permanent and stable? Bovlb (talk) 18:38, 3 February 2023 (UTC)
- @Bovlb: This is what they had to say about that. Mahir256 (talk) 14:10, 8 February 2023 (UTC)
- @Mahir256 Hmm. From that response, it doesn't sound like we should proceed with this property at this time. Bovlb (talk) 15:47, 8 February 2023 (UTC)
- @Bovlb: I would agree, but it appears @Uesca: has begun adding these new IDs (accidentally?) using the existing property intended for the old IDs (e.g. fuyita (L1016834) has an ID which would not have worked prior to January 2023); if they are to be shifted, it would need to be to this proposed new property (lest they be removed completely or shifted to described at URL (P973)). Mahir256 (talk) 18:41, 12 February 2023 (UTC)
- @Uesca, Bovlb, Mahir256: what's your current opinion on the proposed property? Regards, ZI Jony (Talk) 13:18, 24 January 2024 (UTC)
- If they've now decided to have stable identifiers, then I have no objection. Bovlb (talk) 16:09, 24 January 2024 (UTC)
- Support Per above, at the very least these can be archived. --عُثمان (talk) 20:21, 20 January 2023 (UTC)
Kamus Besar Bahasa Indonesia Daring entry
Description | identifier for an entry in the online version of Kamus Besar Bahasa Indonesia |
---|---|
Represents | Great Dictionary of the Indonesian Language (Q4200623) |
Data type | External identifier |
Domain | lexeme |
Allowed values | [a-z0-9\.,'\-_\(\); ]+ |
Example 1 | cagar budaya (L739124) → cagar budaya and cagar_budaya |
Example 2 | Yth. (L1119265) → Yth. |
Example 3 | Al-Qur'an (L1119263) → Al-Qur'an |
Example 4 | S-1 (L1119266) → S-1 |
Example 5 | umbi-umbian (L700147) → umbi-umbian |
Example 6 | patah tongkat berjeremang (L1119283) → patah tongkat berjeremang (patah sayap bertongkat paruh; patah tongkat bertelekan) |
Example 7 | pucuk dicinta, ulam tiba (L1119282) → pucuk dicinta, ulam tiba (hendak ulam pucuk menjulai) |
Number of IDs in source | 119,345 |
Expected completeness | always incomplete (Q21873886) |
Formatter URL | https://kbbi.kemdikbud.go.id/entri/$1 |
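The allowed-values pattern above contains only lowercase letters, while several of the listed examples (Yth., Al-Qur'an, S-1) contain capitals. The Python sketch below makes that visible by testing the examples with and without case-insensitive matching. The helper names are hypothetical, and percent-encoding the value before inserting it into the formatter URL is an assumption of this sketch, not something the proposal specifies.

```python
import re
from urllib.parse import quote

PATTERN = r"[a-z0-9\.,'\-_\(\); ]+"                      # "Allowed values" row, verbatim
FORMATTER_URL = "https://kbbi.kemdikbud.go.id/entri/$1"  # "Formatter URL" row


def is_allowed(identifier: str, ignore_case: bool = False) -> bool:
    """Check a value against the pattern, optionally ignoring case."""
    flags = re.IGNORECASE if ignore_case else 0
    return re.fullmatch(PATTERN, identifier, flags) is not None


def kbbi_url(identifier: str) -> str:
    """Expand the formatter URL; percent-encoding is an assumption of this sketch."""
    return FORMATTER_URL.replace("$1", quote(identifier))


for example in ["cagar budaya", "cagar_budaya", "umbi-umbian", "Yth.", "Al-Qur'an", "S-1"]:
    print(example, is_allowed(example), is_allowed(example, ignore_case=True))

print(kbbi_url("cagar budaya"))
```

If the capitalized examples are meant to be valid values, the pattern would need either an uppercase range or case-insensitive matching.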
Motivation
KBBI is the dictionary most widely used by Indonesians. This property will help folks at WikiProject Indonesia describe their sources and provide a useful link to authoritative information about Indonesian lexemes when creating or editing them. It would also complement existing properties such as Oxford English Dictionary entry ID (pre-July 2023) (P5275), Collins Online English Dictionary entry (P11230), Lëtzebuerger Online Dictionnaire ID (P9397), and Cambridge Dictionary entry (British English) (P11422). Labdajiwa (talk) 06:00, 24 May 2023 (UTC)
Discussion
- Support Seems ok to me. ArthurPSmith (talk) 20:49, 1 June 2023 (UTC)
- Comment: Given that this would potentially be as useful for the lexemes currently modeled as Malay (Q9237), I would like to see it clarified how either a) these languages should be merged or b) these languages will be maintained separately, that is, should identifiers from this dictionary be placed on equivalent/identical Malay and Indonesian lexemes, and is there a plan to ensure that information sourced from this dictionary is used to update lexemes in both varieties where applicable? -عُثمان (talk) 17:15, 6 June 2023 (UTC)
- Well, there was a discussion about this years ago, and folks over there don't seem to agree to merging Indonesian with Malay. I'd say this identifier will be used mainly for Indonesian. As for whether this dictionary can be used for Malay lexemes: plenty of its entries are marked as Malay (e.g. sama ada). Perhaps this dictionary can also be used for lexemes of regional languages in Indonesia, many of which don't have a reliable online dictionary of their own. And FYI, as of June 2023, Indonesian has 19,864 lexemes, while Malay has 2,729 lexemes. Labdajiwa (talk) 14:31, 7 June 2023 (UTC)
- @Labdajiwa: Are you sure the 'Mal' annotation doesn't simply mean 'used primarily in Malaysia' (i.e. Q15065–to be distinguished from the more general Q9237 to which at least a plurality of the entries in this dictionary would apply), in much the same way that there are Malay words used, say, mainly in Brunei or mainly in Singapore? You should clarify your 'perhaps' statement: can this or can this not be used for such regional languages? And lexeme count is not relevant when lexemes lack meanings; all Malay lexemes have at least one, while this cannot be said for the Indonesian ones. Mahir256 (talk) 04:39, 31 July 2023 (UTC)
- This Malay-Indonesian debate seems to stem from confusion. AFAIK, linguistically, in English, the official and standardized forms of the language in each country are called "Malaysian Malay" and "Indonesian". Both languages are descended from (or "grouped in" could be another correct term) "Malay". Malay, translated to Indonesian, is Melayu. Melayu is commonly used in Indonesia to refer to "a language that is used in Malaysia". Malaysian Malay could be translated to Indonesian as Melayu Malaysia, but nobody uses that term in Indonesia. The merging proposal sounds like a proposal to merge Malaysian Malay and Indonesian, which I think is not possible because "A language is a dialect with an army and navy". Hddty (talk) 02:22, 12 August 2023 (UTC)
- @عُثمان, Mahir256, Hddty:, would you like to give your opinion? @Labdajiwa:, could you please respond to the comments above? Regards, ZI Jony (Talk) 06:18, 24 January 2024 (UTC)
- @ZI Jony I have nothing to add here that has not already been said. If we want identical lexemes representing words in the same dialect of the same language under a different name, we can already find this kind of non-information by having a browse through the Bahasa Indonesia category on Malay Wiktionary or vice versa. There we can find entries for words either with identical definitions or with "translated" definitions restating the headword. It just seems like a missed opportunity to actually utilize linked data for what it is suited for. Treating pluricentric languages such as Hindi/Urdu and Tajik/Persian has already been demonstrated as possible, and plenty of linguistic literature does not make any effort to treat Malay/Indonesian as separate languages. عُثمان (talk) 07:38, 24 January 2024 (UTC)
MobiTUKI Swahili-English Dictionary entry
Description | entry for a Swahili word in the online edition of the TUKI Swahili-English dictionary |
---|---|
Data type | External identifier |
Example 1 | godoro/ڠٗدٗورٗ (L1157217) → godoro_godoro
Example 2 | ote/أٗوتٖ (L1230685) → -o-ote_ote
Example 3 | habedari/هَبٖدَارِ (L1226698) → habedari!_habedari!
Motivation
MobiTUKI Swahili-English Dictionary (Q122264347) is a comprehensive and useful source worth linking to lexemes in Swahili (Q7838). The URL format would be https://swahili-dictionary.com/swahili-english/$1 and the regular expression for validating identifier values would be: [A-PR-WYZa-pr-wyz\-\!]+\_[A-PR-WYZa-pr-wyz\-\!]+
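To show how the proposed URL format and validation pattern would behave on the three examples, here is a small Python sketch. The function names are hypothetical, and the sketch makes no claim about what the two underscore-separated parts of an identifier mean on the dictionary site.

```python
import re

# Pattern and URL format as given in the motivation.
PATTERN = r"[A-PR-WYZa-pr-wyz\-\!]+\_[A-PR-WYZa-pr-wyz\-\!]+"
URL_FORMAT = "https://swahili-dictionary.com/swahili-english/$1"


def split_identifier(identifier: str) -> tuple[str, str]:
    """Validate an identifier and return the parts before and after the underscore."""
    if re.fullmatch(PATTERN, identifier) is None:
        raise ValueError(f"not a valid identifier: {identifier!r}")
    left, right = identifier.split("_", 1)
    return left, right


def mobituki_url(identifier: str) -> str:
    """Expand the URL format for a validated identifier."""
    split_identifier(identifier)  # raises ValueError on invalid input
    return URL_FORMAT.replace("$1", identifier)


for example in ["godoro_godoro", "-o-ote_ote", "habedari!_habedari!"]:
    print(split_identifier(example), mobituki_url(example))
```

Note that the character classes leave out Q/q and X/x, so a value such as quiz_quiz is rejected.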
Discussion
Notified participants of WikiProject Africa Regards, ZI Jony (Talk) 13:44, 8 March 2024 (UTC)
Lisaan Masry Egyptian Arabic Dictionary ID
Description | entry for a lexeme in the online Lisaan Masry Egyptian Arabic Dictionary |
---|---|
Data type | External identifier |
Domain | lexemes in Egyptian Arabic (Q29919) (and apparently Modern Standard Arabic (Q56467) and English (Q1860) as well) |
Example 1 | سِتّ (L2546) → 5592
Example 2 | رفيّع (L706535) → 8934
Example 3 | نضف (L709452) → 166
Example 4 | peanut butter (L1259881) → 7022
Example 5 | fall in love (L34422) → 18427
Example 6 | ichneumon (L322227) → 12578
Number of IDs in source | 21,959 (at least) |
Formatter URL | https://www.lisaanmasry.org/online/word.php?ui=&id=$1 |
Motivation
Lisaan Masry Egyptian Arabic Dictionary (Q124630462) seems useful to link to Egyptian Arabic lexemes. It features definitions, grammatical details, a list of important word forms, usage examples, and audio pronunciations. There are also IDs for reverse entries, which can be linked to English lexemes. The URL format is https://www.lisaanmasry.org/online/word.php?ui=&id=$1 --عُثمان (talk) 16:40, 21 February 2024 (UTC)
Discussion
- Support Mahir256 (talk) 20:42, 7 March 2024 (UTC)
Wikibase form
Wikibase sense
DWDS sense ID
Description | url slug of a sense in DWDS |
---|---|
Represents | DWDS-Wörterbuch (Q108696977) |
Data type | External identifier |
Domain | German lexeme sense |
Allowed values | [A-Za-z0-9ÄÖÜäöüß-]+#d-[\d-]+ |
Example 1 | Hand (L25807-S5)→Hand#d-1-3 |
Example 2 | Hand (L25807-S1)→Hand#d-1-6 |
Example 3 | Hand (L25807-S2)→Hand#d-1-1 |
Expected completeness | always incomplete (Q21873886) |
Formatter URL | https://www.dwds.de/wb/$1 |
Single-value constraint | yes |
Distinct-values constraint | yes |
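The identifier in this proposal combines a headword with a URL fragment. The Python sketch below, with the hypothetical function name `dwds_sense_url`, validates a value against the allowed-values pattern, expands the formatter URL, and then splits the result to show which part ends up in the path and which in the fragment.

```python
import re
from urllib.parse import urlsplit

PATTERN = r"[A-Za-z0-9ÄÖÜäöüß-]+#d-[\d-]+"  # "Allowed values" row
FORMATTER_URL = "https://www.dwds.de/wb/$1"  # "Formatter URL" row


def dwds_sense_url(identifier: str) -> str:
    """Validate a sense ID and expand the formatter URL."""
    if re.fullmatch(PATTERN, identifier) is None:
        raise ValueError(f"not a valid sense ID: {identifier!r}")
    return FORMATTER_URL.replace("$1", identifier)


url = dwds_sense_url("Hand#d-1-3")
parts = urlsplit(url)
print(url)
print(parts.path, parts.fragment)
```

Because everything after `#` is a fragment, the three example values all resolve to the same page path and differ only in the anchor the browser jumps to.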
Motivation
IDs for senses are particularly useful for words with many senses, to keep track of missing ones. I also plan to propose Duden sense IDs so that we can match senses from different dictionaries. The ID can be acquired by clicking the links in the section Bedeutungsübersicht for each lexeme. –Shisma (talk) 09:08, 28 February 2024 (UTC)
Discussion
Notified participants of WikiProject Germany Regards, ZI Jony (Talk) 13:32, 8 March 2024 (UTC)
- Support Bigbossfarin (talk) 17:30, 8 March 2024 (UTC)