Wikidata talk:Lexicographical data

Lexicographical data
Place used to discuss any and all aspects of lexicographical data: the project itself, policy and proposals, individual lexicographical items, technical issues, etc.
On this page, old discussions are archived. An overview of all archives can be found at this page's archive index. The current archive is located at 2022/12.

How to model nouns that can be inflected by gender


I'm talking about French (Q150) here, but I'm pretty sure it can be applied to (at least some) other languages.

We have nouns that can be inflected simply by gender, like occupations (computer scientist (Q82594): informaticien (L620286) / informaticienne (L620287)), animals (dog (Q144): chien (L241) / chienne (L29225)), etc.

At the moment, on Wikidata, the current model is applied by Metamorforme42 on several French lexemes (for instance informaticien (L620286) / informaticienne (L620287)). It is tedious (impossible?) to navigate from one gender to another (as hyperonym (P6593) can be used for disambiguation based on criteria other than gender). This model also leads to the creation of questionable items like computer scientist (Q113547263) and computer scientist (Q113547227), with many issues:

  • an obvious confusion between grammatical gender and other genders;
  • if we start creating items as a combination of their properties, it will quickly become unsustainable (soon, an item for Male / British / Computer scientist / born in the 20th century?).

I wonder if it would be better to merge lexemes like serveur (L17430) / serveuse (L673611). It's the choice made by several dictionaries (example for informaticien (L620286) / informaticienne (L620287) in Le Robert, TLFi). One advantage is that it is straightforward to get the feminine forms of a noun. It also avoids duplicating similar glosses.

Note that:

  • in French, the masculine form is also the "general" (there is no neutral grammatical gender);
  • merging lexemes would not always be possible: for instance, for horse (Q726), cheval (L19113) / jument (L25951) should obviously stay separate.

What do you think?

Ping @Denny: as, if I remember correctly, he talked about his experience at Google about that on the Telegram channel a few months ago.

Cheers, — Envlh (talk) 21:25, 17 August 2022 (UTC)

I like the idea and would suggest to merge them, but @Nikki: had a number of good arguments against it. --Denny (talk) 21:28, 17 August 2022 (UTC)
Hi @Envlh,
this model is adapted from Wikidata_talk:Lexicographical_data/Archive/2021/12#male and female variants of lexemes (original usecase with Lexemes in German).
Also, I seriously doubt there would be a notable Lexeme for Male / British / Computer scientist / born in the 20th century, so there is a limit to the creation of this kind of item. These two specific items were created to replace an inappropriate item for this sense, and because there are some scientific articles related to computer scientist (Q113547227) for example (structural need).
I agree about the navigation issue: it has been painful to add specified by sense (P6719) qualifiers on senses because of the separation of Lexemes.
About the confusion between grammatical gender and other genders, could you please elaborate as it is far from obvious to me?
Metamorforme42 (talk) 21:53, 17 August 2022 (UTC)
We have semantic gender (P10339) for specifying that a sense is specific to a particular gender, so there's no need to create items like computer scientist (Q113547263) and computer scientist (Q113547227) even with separate lexemes. (And on items, like for scientific articles, we'd normally use sex or gender (P21) as a qualifier, so I don't agree that there's a structural need for them). - Nikki (talk) 22:56, 17 August 2022 (UTC)
We have had a lot of talks about this subject. See for instance Wikidata:Making_sense#Gendered_professions (by Denny) or this discussion I launched: Wikidata_talk:Lexicographical_data/Archive/2019/05#Lexemes_and_gender_of_noun (Lehrerin (L34168) was already an example on the beta test before the lexemes were created). Several people have a lot of arguments against merging, but merging also has its perks. My main question is: if we merge them, how would we indicate which forms refer to which set of senses (and the other way round)? Imagine merging meunière (L25640)/meunier (L306576) (where I put a lot of data specific to each lexeme).
I hear the navigation issue. Maybe we can revive this old proposal Wikidata:Property proposal/noun for other gender?
For the « One advantage is that it is straightforward to get the feminine forms of a noun. »: will it be? Many nouns have several feminine forms: autrice/auteure, docteure/doctoresse, and so on (and I'm not even talking about cases like "une docteur" or variants like "doctoresse" or "meusnière").
Right now, lexemes are pretty empty, but we need to think about a general model for when all forms and senses are there. Maybe there is a model where we can have only one lexeme, but I don't see it now and I'm leaning towards separate lexemes.
PS: the TLFi has two URLs: and (with same content but still two URLs) and their form tools also separate from
PPS: Q113547263 and Q113547227 should most likely be deleted (corrected: merged) (and this has nothing to do with Lexemes, it breaks Wikidata general rules and customs).
Cheers, VIGNERON (talk) 08:48, 19 August 2022 (UTC)
I redirected Q113547263 and Q113547227 to Q82594. Feel free to delete them if required. The creation of these two items was retrospectively a bad idea; thanks for having pointed this out, I will pay more attention to notability before creating items from now. — Metamorforme42 (talk) 15:09, 19 August 2022 (UTC)
@metamorforme42: thanks, and sorry I meant merge not delete. All is good now. Cheers, VIGNERON (talk) 07:38, 20 August 2022 (UTC)

Thanks to Metamorforme42 for merging the items. For some explanations on genders, you can read Gender on Wikipedia or see the list of allowed values for sex or gender (P21) (to sum up, gender is not binary, and you can't just have male opposed to female).

For the auteure/autrice issue, I think this is a more general issue of marking forms that fit together within a lexeme, not limited to the discussion here. Sometimes, we can do it easily, for instance marking some forms with orthographic corrections of French in 1990 (Q486561) on chariot/charriot (L25948). But sometimes, it's not the case, like balayer (L689016), which can be written with an i or a y on several forms, with nothing to group similar forms together (and I don't think we should create a new lexeme to distinguish these forms).

About the URLs of TLFi, you realize that this is just a search engine? Its URLs don't contain IDs, but only what you input in the search engine. You have informaticien and informaticienne, but also informaticiens and informaticiennes (I don't think you want new lexemes for these plural forms?), and even garbage like informàtïcién, that all return, as you noted, the same unique entry.

It was repeated several times that there were lots of arguments in previous discussions against merging lexemes like informaticien (L620286) / informaticienne (L620287), but I was unable to find any. It is sometimes stated that the masculine and the feminine are two different concepts (we just proved the contrary with the merge) or that one is derived from the other. I disagree with these. A computer scientist is still a computer scientist, regardless of their gender: male, female, or anything else. In my opinion, gender is just a grammatical trait that can be used to inflect a noun (of course, not always, there is for instance no feminine for cheval (L19113)), like the number. And you don't create a new lexeme when a word has a plural form. As already stated, at least some sources (Le Robert, TLFi, and let's add another one: Larousse) seem to agree with that, as they have a single entry and a single definition for such lexemes.

For informaticien, this would give:

I don't see the point of having several senses for different genders. If we really want them, maybe we can create other senses, and use a combination of grammatical gender (P5185) and semantic gender (P10339) to specify that they are limited to some genders?

For meunier, this would give:

The advantages of this model:

  • you can easily find feminine variants of a word;
  • you don't duplicate senses for each existing gender.

With the models currently used in informaticien (L620286) and meunier (L306576) (if you look closely, you can see that they are not exactly the same), you duplicate senses (and everything that comes with them like properties, examples, etc.), and I've no idea on how to find their feminine variants. Maybe the revival of Wikidata:Property proposal/noun for other gender proposed by VIGNERON is a good idea.

Cheers, — Envlh (talk) 10:25, 20 August 2022 (UTC)

  1. With this model, how are we supposed to express that F3 and F4 cannot be used with meunier@fr-S1 to refer to male miller (Q694116)? We never say « une meunière » to describe a male miller (Q694116) (« Johann Georg Hiedler (Q385804) est une meunière. » is wrong) or « des meunières » to describe a group containing male miller (Q694116).
  2. Even if I agree that « informaticien » and « informaticienne » are primarily referring to the occupation, I think there is a weakly defined concept combining this specific occupation and a certain kind of gender (but in a binary way, and not well defined), and we can find this concept expressed with a single word in some languages like French, German… In my opinion, this word is a sense and not only a form because it corresponds to a different concept; for example:
  3. Also, it would probably be relevant to ask some experienced contributors from frwikt to explain to us why they have separate entries for informaticienne and informaticien.
Metamorforme42 (talk) 14:20, 20 August 2022 (UTC)
Short answer for 1: it is implicit, like it is implicit that F2 is not valid for your example (Johann Georg Hiedler (Q385804) est des meuniers is wrong). Cheers, — Envlh (talk) 16:17, 20 August 2022 (UTC)
@Envlh: that could maybe work; the implicit inference could be missed by some re-users, but so are many subtleties... (and we could still document it to ease re-use).
What about the other statements referring only to a specific set of senses or forms? For instance described by source (P1343) (or any identifiers) on meunière (L25640), where I specified the senses (to see when and how the word evolved, to see that "miller's wife" was originally the only sense and that at some point it disappeared and came back again... it allows a query like this: - I only added the basic data as a test but I would love to add dozens more sources to have better results). Could we use some qualifier? (semantic gender (P10339) and grammatical gender (P5185) again?) Even merged, I feel like keeping gendered senses would be simpler, no?
Cheers, VIGNERON (talk) 18:40, 20 August 2022 (UTC)

Hello. What you describe for French happens in exactly the same way in Spanish. On the one hand, it seems logical that the different gender inflections are in a single lexeme and are introduced as forms. On the other hand, if it were done that way, there would be a lot of information that I don't know how to add to Wikidata. I show some examples.

Lexeme gato (L34279) has two genders (masculine: gato; feminine: gata). However, not all senses have two genders, some are only masculine or only feminine.

In that specific lexeme as it is created right now, both genders are in a single lexeme, and the gender is indicated at the sense level, but I am not sure if this way is right.

In general, at this moment most noun lexemes in Spanish are created with a lexeme for each gender. For example, lexemes potro (L620468) and potra (L620476).

Advantages of this second way:


  • duplicate senses

--Hameryko (talk) 19:55, 22 August 2022 (UTC)

  Comment @Hameryko: for me it's obvious that 3 "gato" should be in 3 different Lexemes - they are homonyms (at least different etymology)...  – The preceding unsigned comment was added by Infovarius (talk • contribs).
@Hameryko: I agree with Infovarius, this is obviously a different case and different lexemes; gato (L34279) should be split.
@Envlh: could you try to simulate a merge of meunière (L25640) and meunier (L306576) in sandbox 2 (L1234)? I'd like to see how it would work exactly in such complex (but not unusual) cases (with a lot of qualifiers I guess).
Cheers, VIGNERON (talk) 11:38, 2 September 2022 (UTC)
@VIGNERON: done, with last edit being Special:Diff/1720989563. Please tell me if something is missing. Cheers, — Envlh (talk) 19:09, 4 September 2022 (UTC)
Hello @VIGNERON: did you have the time to review this model? Cheers, — Envlh (talk) 15:34, 9 October 2022 (UTC)
To your point about translations: translations are linked from sense to sense, so you can still do this with shared-gender lexemes. If a lexeme has a completely different etymology I model it separately, or if it is a completely different part of speech. I have been adding lexemes in Punjabi with some of them having both grammatical genders and I can show some examples if it would be helpful. Modeling grammatical gender on separate lexemes is less viable for this language, because about a third of nouns have both masculine and feminine inflections, and the reason for this differs for each and often depends on the underlying meaning of the noun.
I will note that I do not use the grammatical gender property on senses as this is not really a semantic feature. Instead, I use multiple statements on the lexeme itself, and subject sense qualifiers. This way, it is easy to see that some senses work with both grammatical genders, while others only work with one. Semantic gender I put on senses if necessary on animate nouns.
  • Inanimate noun ਲਿੰਗ/لِنگ (L684192) has 3 senses meaning sexuality/gender (abstract concept), penis, or arm. If you are using it to mean penis or arm it has to be masculine, but you can use it as masculine or feminine for sexuality/gender. The masculine and feminine forms are identical in singular direct case (ਲਿੰਗ), but the surrounding sentence would have to change inflections for the gender. Both the Punjabi labels for grammatically feminine (ਇਸਤਰੀ ਲਿੰਗ) and masculine (ਪੁਲਿੰਗ) are derived from the masculine form of this word. Since the word for "feminine" here is derived from the masculine form of a word that allows a feminine form, we cannot assume anything about the relationship between semantics and grammar.
  • The first part of the word for grammatically feminine, ਇਸਤਰੀ, is a homograph that is a good example of an identical lexeme that should be separate. ਇਸਤਰੀ/اِستری (L700203) means clothes iron and is derived from Portuguese, but ਇਸਤਰੀ/اِستری (L700209) means woman or wife and is derived from Sanskrit.
  • Animate noun ਡੱਡੂ/ڈڈّو/ڈڈو (L678986) has an unspecified gender sense for a frog that must be grammatically masculine. There are separate senses for male and female frog which are connected to masculine and feminine respectively, but the more common sense tied to the feminine forms is for tadpoles. (In common speech, I do not think people are thinking about the semantic gender of the tadpoles.) Then there is a common sense for a term of affection for small children, since people think of frogs as being cute. There is no semantic gender tied to this sense because even though this is usually used for boys you can use it for either gender. Since grammatical gender is already being used to indicate size and age in other senses, maybe somebody is calling their really big daughter the masculine form (in Punjabi culture, people love fat babies).
  • Proper noun ਰਫ਼ੀ/ਰਫੀ/رفیع (L691272) is an unambiguously male first name, or typically a surname inherited from someone's father as far as semantics are concerned. It has a feminine inflection though, as in Punjabi there are senses for rudely talking about female in-law relatives by inflecting the names of their husbands. This sense has been included on the lexeme for the male name because it does not mean anything without that context. ਰਫ਼ੀਆ/ਰਫੀਆ/رافعہ (L691273) is the actual female form of this name, which has a separate lexeme because these names were each derived individually from Arabic as male and female names. The actual female name is not part of the same lexical unit as the female inflection of the male name.
  • Inanimate noun ਅੰਬ/انب (L677644) means mango, and has both feminine and masculine forms. The feminine forms are used for particularly small mangoes, or particularly small mango plants, or particularly young mango plants. Anything food or agriculture related tends to be much more semantically rich in Punjabi, which means either complex gender/sense relationships like this, or simpler ones where all inflections for gender and number are eliminated as with ਸਾਗ/ساگ (L697738).
Middle river exports (talk) 01:04, 5 September 2022 (UTC)
@Infovarius:, @VIGNERON: The matter is, in the case of the lexeme gato (L34279), all the senses that I mention have the same etymology (although depending on the sense, the forms of one gender or another can be used) and in that case, I think that it would be more appropriate for it to be united in the same lexeme.
@Middle river exports: I like how the issue of genders is resolved in Punjabi. I hadn't thought of that option of using the property subject sense (P6072) to indicate which sense corresponds to each gender, but it seems quite useful. I think that this solution can be applied in the same way to Spanish for those lexemes that are of the same type and have the same etymology, but that change gender depending on the sense. – The preceding unsigned comment was added by Hameryko (talk • contribs).
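
For readers trying to picture the model described above (several grammatical gender (P5185) statements on the lexeme, each restricted by subject sense (P6072) qualifiers), here is a minimal sketch. It uses a simplified in-memory structure rather than real Wikibase JSON; the sense keys are invented labels, not real sense IDs, and the rule that an unqualified statement applies to every sense is an assumption of this sketch.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# One entry per grammatical gender (P5185) statement on the lexeme:
# (gender label, senses listed in its subject sense (P6072) qualifiers).
GenderStatement = Tuple[str, List[str]]


def genders_by_sense(statements: Iterable[GenderStatement],
                     all_senses: List[str]) -> Dict[str, Set[str]]:
    """Return, for each sense, the grammatical genders it can be used with."""
    allowed: Dict[str, Set[str]] = defaultdict(set)
    for gender, senses in statements:
        # Assumption: a statement without qualifiers applies to all senses.
        targets = senses if senses else all_senses
        for sense in targets:
            allowed[sense].add(gender)
    return {sense: allowed[sense] for sense in all_senses}


# Mirrors the description of L684192 above; sense keys are invented labels.
SENSES = ["gender", "penis", "arm"]
STATEMENTS = [
    ("masculine", ["gender", "penis", "arm"]),
    ("feminine", ["gender"]),
]

if __name__ == "__main__":
    for sense, genders in genders_by_sense(STATEMENTS, SENSES).items():
        print(sense, sorted(genders))
```

Keeping the gender information on the lexeme and pointing at senses, rather than the other way round, means a sense shared by both genders is stored only once.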

Coming back to the original question. I'm wondering if the two models are really equivalent. For instance, since grammar has changed a lot over the years, I'm interested to know if we could use detailed recording (like I started on meunière (L25640), only a beginning obviously) to measure how "sexist" words are. For instance, when did a word shift from one meaning (like "miller's wife") to another ("female miller"), and when did the masculine shift from "generic masculine" (miller regardless of gender) to "true masculine" (male miller)? This is just an example; I'm not sure if either of the proposed models could answer this question, and there are thousands more questions. I feel like I don't have enough data to make an enlightened and meaningful review right now (which is a choice for the status quo; it's suboptimal, I know, but still better than making a wrong choice). We really need more people to pitch in. Cdlt, VIGNERON (talk) 16:51, 9 October 2022 (UTC)

Hello everyone, even though I am late to the discussion and have not thought through all of it, some of the questions touched on reminded me of problems discussed in The World Atlas of Language Structures Online. So here are the links to two of its chapters, each with many further references, in case they might be useful in the discussion:

Best, --Marsupium (talk) 09:29, 20 October 2022 (UTC)

Getting audio files from a lexeme


We are working on a project that would allow getting pronunciation audio files for a word in a given language from a lexeme. The word, language, and lexeme values would be provided by the user. For instance, the user wants to be able to grab the audio file for "tomato" in language "en-au" from Lexeme:L7993. By looking at this JSON response, we are assuming that we need to loop through the forms, check if the form has an exact match for the word "tomato" in the list of representations, and if so, we would check if the form has an audio file that matches the language qualifier "en-au". Would this logic make sense? HMonroy (WMF) (talk) 19:09, 13 September 2022 (UTC)

@HMonroy (WMF): yes absolutely, audio files are stored on forms so this is the right and only way to do it.
By the way, could you tell us a bit more about this project? It sounds very interesting and it could help make adding the pronunciation more attractive.
Cheers, VIGNERON (talk) 10:20, 9 October 2022 (UTC)
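
The loop described above could be sketched roughly as follows. This is only an illustration, not the project's actual code. It assumes the lexeme entity has already been fetched as a dict, that the audio lives in pronunciation audio (P443) statements on the form, and that the variety is given as an item in a qualifier. Which qualifier property is used (P407 here) is an assumption, so it is a parameter, and mapping a code like "en-au" to the right item ID is left to the caller. The sample data, its IDs and its file name are invented.

```python
from typing import Iterator

PRONUNCIATION_AUDIO = "P443"  # pronunciation audio
VARIETY_QUALIFIER = "P407"    # assumption: qualifier carrying the language/variety item


def find_audio_files(lexeme: dict, word: str, variety_item: str,
                     qualifier: str = VARIETY_QUALIFIER) -> Iterator[str]:
    """Yield audio file names from forms spelled `word` and qualified with `variety_item`."""
    for form in lexeme.get("forms", []):
        spellings = {rep["value"] for rep in form.get("representations", {}).values()}
        if word not in spellings:
            continue
        # Empty claims may be serialized as [] instead of {}, hence the `or {}`.
        claims = form.get("claims") or {}
        for statement in claims.get(PRONUNCIATION_AUDIO, []):
            snak = statement.get("mainsnak", {})
            if snak.get("snaktype") != "value":
                continue
            items = {
                q["datavalue"]["value"]["id"]
                for q in (statement.get("qualifiers") or {}).get(qualifier, [])
                if q.get("snaktype") == "value"
            }
            if variety_item in items:
                yield snak["datavalue"]["value"]


# Invented sample in the general shape of lexeme JSON; "Q0" is a placeholder item.
SAMPLE_LEXEME = {
    "forms": [
        {
            "id": "L0-F1",
            "representations": {"en": {"language": "en", "value": "tomato"}},
            "claims": {
                "P443": [{
                    "mainsnak": {"snaktype": "value",
                                 "datavalue": {"value": "Example-tomato.ogg", "type": "string"}},
                    "qualifiers": {"P407": [{
                        "snaktype": "value",
                        "datavalue": {"value": {"id": "Q0"}, "type": "wikibase-entityid"},
                    }]},
                }]
            },
        },
        {
            "id": "L0-F2",
            "representations": {"en": {"language": "en", "value": "tomatoes"}},
            "claims": [],
        },
    ]
}

if __name__ == "__main__":
    print(list(find_audio_files(SAMPLE_LEXEME, "tomato", "Q0")))
```

Matching on the exact representation string keeps the lookup simple, but a real tool would probably also want to decide what to do about case differences and forms without any qualifier.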

Obsolete spelling

How to say that свекла (L160967) is an old (and now incorrect) spelling/pronunciation of свёкла (L161458)? Some property at Lexeme level? Or at each Sense? Or try to merge them and create some variant of lemmas at each form? Infovarius (talk) 20:14, 28 September 2022 (UTC)

How about adding an end time (P582) statement to the lexeme for the old spelling, and start time (P580) to the one with the new spelling? I don't see a more specific property suitable for lexemes only. However, the constraints for end time (P582) should then be changed to allow lexeme type entities (it looks like an unintended omission, since start time (P580) has already been extended in that respect). Both properties can be used either as main values or as qualifiers (the latter could be used with IPA or audio file values on lexeme forms in case just one or a few of the forms have changed pronunciation, or on the item for this sense (P5137) statement if the meaning has changed over time).
Besides the changed properties, the two (or more) lexemes should have identical property/value sets. Add mutual synonym (P5973) links between the variant lexemes to find all spelling variants, past and present.
If the change has been gradual rather than instant, use an appropriate precision on the time value (such as decade or century), with some overlap between old and new. SM5POR (talk) 09:29, 30 September 2022 (UTC)
+1 but I would still merge them ; if the spelling is the only variation, then it's the same lexeme. Cheers, VIGNERON (talk) 15:56, 10 October 2022 (UTC)
If the spelling change is in the root, and therefore appears also in each form, wouldn't merging them result in a pretty complex lexeme with 2 × 13 forms, of which 13 forms will have an end time (P582) statement and the other 13 forms a start time (P580) statement? Imagine a language engine trying to render a piece of text using 19th-century orthography; is it supposed to filter individual forms (and senses) based on the temporal properties? And if you enumerate all the spelling variants of the "same" word since medieval times, can you be sure they are all the "same" lexeme, also when an old word via spelling changes diverges into multiple modern words? God, good, goods? SM5POR (talk) 15:16, 11 October 2022 (UTC)
synonym (P5973) is intended for senses and I would not call these lexemes "synonyms" - I would say they are different "forms" of the same lexeme. I would prefer to link them at Lexeme level, but said to be the same as (P460)/permanent duplicated item (P2959) is for items... --Infovarius (talk) 07:56, 12 October 2022 (UTC)
Good point; I agree and retract my suggestion to use synonym (P5973) in these cases. Still, orthographic changes happen all the time, sometimes to individual words, but at other times to mere phonemes used throughout a written language. In Swedish, a lot of words changed spelling from "e" to "ä" (with no change in pronunciation) in the early 20th century, but later some of those words changed back to "e". A change may apply to all forms of a word, or just some of them. If we are to describe all those transitions within the same lexeme, we may need some qualifiers (temporal, dialectal etc) on the base lemma as well as on each form, but at other times separate lexemes may be the easiest solution (such as when a noun has changed grammatical gender, or when case declensions have been eliminated, leading to a different set of forms). SM5POR (talk) 09:21, 14 October 2022 (UTC)
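
To make the "filter individual forms by temporal properties" question above more concrete, here is a minimal sketch of what such a filter might look like if the merged-lexeme approach were chosen. It uses a simplified `Form` record instead of real Wikibase JSON, treats start time (P580) and end time (P582) as plain years, and the spellings and years in the sample are invented placeholders, not real data.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Form:
    representation: str
    start_year: Optional[int] = None  # from start time (P580), if present
    end_year: Optional[int] = None    # from end time (P582), if present


def forms_valid_in(forms: List[Form], year: int) -> List[Form]:
    """Keep the forms whose validity interval contains `year` (open ends allowed)."""
    return [
        f for f in forms
        if (f.start_year is None or f.start_year <= year)
        and (f.end_year is None or year <= f.end_year)
    ]


# Invented example: an old and a new spelling with a deliberate overlap,
# as suggested above for gradual changes.
FORMS = [
    Form("old-spelling", end_year=1910),
    Form("new-spelling", start_year=1900),
]

if __name__ == "__main__":
    for year in (1850, 1905, 1950):
        print(year, [f.representation for f in forms_valid_in(FORMS, year)])
```

During the overlap both spellings are returned, so a renderer would still need some rule for picking one; the sketch does not try to answer that.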

Attested in this sense

I'd like to cite examples from published media (typically newspapers) when a lexeme has been used in a particular sense, but the property I have tried, attested in (P5323), turns out not to be allowed on senses, only on lexemes and forms (plus Wikibase items), and this seems to be by intent, so I'm not challenging that.

Looking into related properties and their constraints, I find no other property that would suit my purpose better, but instead found some qualifiers that I think will resolve the issues I have, most notably subject sense (P6072).

I'm therefore currently using the sense citation format shown in the example below, but before I continue using this format, I'd like to invite your comments and potential objections. Am I missing some relevant qualifier? Which ones are unnecessary, fully redundant, or even inappropriate?

The sample case shown here pertains to the Swedish word ombudsman (L239133) (which is the etymological origin of the corresponding English word ombudsman (L299316)), and besides this one for the original and broader sense of the Swedish word (meaning "proxy"), I have added a similar citation for the well-known public office established after the political events of 1809:

Ok? SM5POR (talk) 17:51, 7 October 2022 (UTC)

@SM5POR: yes, I think this is the right way to do it (at least, it's how I do it and we did it for described by source (P1343) on lexemes like devezh (L627477)). Cheers, VIGNERON (talk) 10:13, 9 October 2022 (UTC)

Phonetic glue

Prompted by @Asaf (WMF) wondering how to model lexeme forms governed by rules about phonetic context (like a (L2767)) there is now a page at Wikidata:Lexicographical_data/Sandhi_rules for collecting examples of where this occurs (per a suggestion from @Nikki). If anybody is interested, it would be helpful to add any bullet points here to take a look at & work out how to indicate this information in a more sophisticated way than linking to entities representing letters. عُثمان (talk) 13:28, 19 October 2022 (UTC)

Where do we conduct discussions (open questions, answers and suggestions)? The corresponding Sandhi_rules Talk page hasn't been created yet; should it be? After precedes word-initial (P6712) had been created there was a discussion at Property_talk:P6712 concerning the choice of data type for this property (three years ago). I'd like to continue that discussion (in short, I'd suggest linking to a phoneme or a sequence of phonemes rather than to a letter), but I'm not sure Property_talk:P6712 is the best place for such a discussion, especially if there are other properties to be considered as well in this subject context and we would like to take a broader approach on the issue.
"How to pronounce ghoti (Q1359881) or translate it into languages other than English"... :-) SM5POR (talk) 05:24, 20 October 2022 (UTC)
There is a talk page on the linked page now, feel free to add to it there. At some point I might ping people who have edited it or expressed interest if a lot of time passes since I know talk pages discussions can be kind of hard to keep track of on here. عُثمان (talk) 22:44, 20 October 2022 (UTC)

New Lexeme creation page will be live on Wikidata on November 2nd

Hi everyone,

Among our development goals this year is to make the lexicographical data part of Wikidata easier to understand for people not familiar with lexicography. This included reworking the Lexeme creation page to improve the editing experience of users. We plan to replace the Special:NewLexeme page with the new one on November 2nd!

As you may recall, we made a number of tweaks to the old page and asked you to test it and give feedback (see the previous announcement). We addressed the issues the community raised, and we would like to thank everyone who participated in the testing and provided feedback.

While the new Special:NewLexeme is already scheduled to be deployed, we would still like to hear what you think. If you have any questions or suggestions please let us know on this talk page.

Cheers, -Mohammed Sadat (WMDE) (talk) 09:23, 21 October 2022 (UTC)

It's still worse than the current one for me. My biggest issue with the current page is how tedious it is to use, and the new one has managed to make it more tedious by making everything harder to enter.
More issues:
  • The fields aren't marked as required in the HTML any more.
  • The required marker is not following the style used by MediaWiki elsewhere.
  • The required marker is tiny, has a weirdly large gap before it, has no tooltip and does nothing if you click on it.
  • Pressing enter after entering a language name or lexical category now tries to submit the form instead of selecting the top entry.
  • Page up and page down no longer work in the dropdowns for the language, lexical category or spelling variant.
  • The spelling variant field still incorrectly links to Help:Monolingual text languages which is a completely unrelated page.
  • The spelling variant field no longer automatically selects the right language if you enter a language code (e.g. try "es" - on the current page you get Spanish, in the new one it gives you Esperanto)... even though it tells you enter the language code.
  • There seems to no longer be any way to open the list of spelling variant languages... which means it's now almost impossible for people to work out how to enter unsupported languages if they don't already know what to do.
  • If you tab out of the language or lexical category field without selecting an item, there's no indication that the field is incomplete until you try to submit it and get an error.
  • Moving the terms of use/license info between the input fields and the submit button means that in most browsers, you now have to tab three times to get from the lexical category field to the submit button.
  • If you use a country variant like de-at ( the example lexeme is showing the fallback language name when it shouldn't.
- Nikki (talk) 21:01, 22 October 2022 (UTC)
On, the placeholder text suggests a language code (mis-x-Q26790) that can't be used there.
On and, if you try to create the lexeme, it only shows an error about Q1 not existing for the language or lexical category after getting an API error - I would expect it to verify the input before trying to create the lexeme and then show an error message next to the corresponding field. - Nikki (talk) 16:31, 1 November 2022 (UTC)
Thank you! I've created the following tickets and added them to one of the upcoming sprints: phab:T322681, phab:T322683, phab:T322684, phab:T322685, phab:T322686, phab:T322687
For the language code in the spelling variant: What would be your preferred way to address it? Lydia Pintscher (WMDE) (talk) 19:56, 8 November 2022 (UTC)
@Lydia Pintscher (WMDE): BUG!! Language code for created lexeme is wrong: --Infovarius (talk) 13:05, 15 November 2022 (UTC)
@Infovarius: Looks like a misconfigured common.js? Mahir256 (talk) 07:44, 27 November 2022 (UTC)
Oh. Oh! My bad, sorry. How did it remain unnoticed by me for so long... User:Mahir256, you are a detective! --Infovarius (talk) 14:23, 28 November 2022 (UTC)

How to search for lexemes?

I think Wikidata:Lexicographical data should describe how you can search for lexemes ... because that is a very important aspect. What I currently gathered:

  • The search autocompletion on this wiki does not yield Lexemes at all.
  • The search on this wiki by default does not search the Lexeme namespace but that can be configured.
  • There are some third-party websites to search for lexemes listed on Wikidata:Tools/Lexicographical data.

Is there some search-as-you-type search for lexemes? If so I think it should be linked from Wikidata:Lexicographical data.

--Push-f (talk) 08:20, 5 November 2022 (UTC)

The only way I know of is by prefixing a regular Wikidata search with L:; no autocomplete yielded. I do not know of a way to search that has autocomplete for lexemes. عُثمان (talk) 17:37, 6 November 2022 (UTC)
Practically speaking, I find the Ordia tool linked on the third-party site list to be the easiest way to get an overview of the lexemes in a given language. عُثمان (talk) 17:38, 6 November 2022 (UTC)
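
For anyone wanting to build a search-as-you-type box themselves: as far as I know, the wbsearchentities API module accepts type=lexeme, which might be a usable starting point. Below is a minimal sketch using only the Python standard library; the helper names are invented for this example, and the response handling assumes the usual "search" list with "id" and "label" keys.

```python
import json
from typing import List, Tuple
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API = "https://www.wikidata.org/w/api.php"


def lexeme_search_url(text: str, language: str = "en", limit: int = 10) -> str:
    """Build a wbsearchentities URL restricted to lexemes."""
    params = {
        "action": "wbsearchentities",
        "search": text,
        "language": language,
        "type": "lexeme",
        "limit": limit,
        "format": "json",
    }
    return f"{API}?{urlencode(params)}"


def search_lexemes(text: str, language: str = "en") -> List[Tuple[str, str]]:
    """Return (lexeme ID, label) pairs for a prefix search; needs network access."""
    # The User-Agent string is a placeholder; use one identifying your own tool.
    request = Request(lexeme_search_url(text, language),
                      headers={"User-Agent": "lexeme-search-sketch/0.1 (example)"})
    with urlopen(request) as response:
        data = json.load(response)
    return [(hit["id"], hit.get("label", "")) for hit in data.get("search", [])]


if __name__ == "__main__":
    print(lexeme_search_url("tomato"))
```

Calling `search_lexemes` on each keystroke (with some debouncing) would give a rough autocomplete, though a tool like Ordia remains the easier option for browsing.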

Useful P5830 here?

I tried to represent the exact forms used in a proverb with combines lexemes (P5238), so I used subject form (P5830), like here: był w Pacanowie, wie jak kozy kują (L733871). What do you think, is it correct? Gower (talk) 11:52, 13 November 2022 (UTC)

@Gower: The appropriate qualifier to use for forms on combines lexemes (P5238) is object form (P5548); I have also added series ordinal (P1545) qualifiers to indicate the order in which the words appear. Mahir256 (talk) 15:21, 13 November 2022 (UTC)
Thank you for the answer. I knew about object form (P5548), but it seems inappropriate in this case, because the value is not derived from some form but is the form (variant) itself. What do you think? Gower (talk) 16:24, 13 November 2022 (UTC)
@Gower: While object form (P5548) was originally created as a qualifier for "derived from", its applicability has since been widened to clarify the objects of statements more generally (such as the values of "combines" statements). Similarly, subject form (P5830), originally tied to "usage example", has been widened in applicability to clarify statement subjects (such as lexemes with certain described by source (P1343) and has quality (P1552) statements). Think of them as analogues to subject named as (P1810) and object named as (P1932) but with lexeme forms and senses as values rather than strings. (Perhaps this generality has not been exported to languages other than English in property labeling?) Mahir256 (talk) 16:35, 13 November 2022 (UTC)

Derivation property?

Do we have a property to mark or annotate derived lexemes on a lexeme record? If not, should we create one? Gower (talk) 16:02, 16 November 2022 (UTC)

@Gower: We do have derived from lexeme (P5191), possibly qualified with object form (P5548) and object sense (P5980) if necessary. Mahir256 (talk) 18:07, 16 November 2022 (UTC)
@Gower, Mahir256: and also mode of derivation (P5886) (among many other qualifiers). Cheers, VIGNERON (talk) 09:34, 26 November 2022 (UTC)

Spelling variants

We have, e.g., Phoenician language (Q36734). At Help:Wikimedia_language_codes/lists/all there is (I think) a language code for Punic: xpu. Why doesn't it work when I enter xpu as the spelling variant, e.g. at Lexeme:L738123? This kind of problem recurs with other languages. What should I do? Type "mis" in the "spelling variant" field? Gower (talk) 08:52, 18 November 2022 (UTC)

Even if a language code is available elsewhere it has to be specifically enabled for lexemes as far as I know. These can be requested on Wikimedia Phabricator but the backlog has been accumulating for around a year and it is not clear when or if new codes will be added.
We do have the option of distinguishing mis codes by suffixing with -x-QID. I have updated your linked lexeme to use mis-x-Q36734 with the QID for Phoenician.
This does not work for glosses however. I have been using the "gloss quote" property with the language as a qualifier for lack of a better option, but it is currently not clear how glosses in missing languages are supposed to be distinguished. عُثمان (talk) 20:01, 24 November 2022 (UTC)
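As an illustration of the `mis-x-QID` convention mentioned above, here is a small sketch that derives the spelling-variant code from an item ID. The function name `mis_variant_code` is hypothetical, and the check only validates the shape of the ID, not whether the item exists or is a language:

```python
import re


def mis_variant_code(qid: str) -> str:
    """Return a spelling-variant code of the form mis-x-QID."""
    # Accept only well-formed item IDs such as "Q36734"
    if not re.fullmatch(r"Q[1-9][0-9]*", qid):
        raise ValueError(f"not a valid item ID: {qid!r}")
    return f"mis-x-{qid}"


print(mis_variant_code("Q36734"))  # prints: mis-x-Q36734
```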

Links between Wiktionaries and lexemes

Do we have a property to link lexemes with Wiktionary entries? If not, shouldn't we create one? Gower (talk) 12:53, 18 November 2022 (UTC)

We have an automatic tool - User:Nikki/LexemeInterwikiLinks.js. --Infovarius (talk) 19:29, 19 November 2022 (UTC)
@Infovarius thanks, but why isn't it a default option for everyone, like "Wikidata item" in the sidebar? LexemeInterwikiLinks.js works well on Wikidata, but doesn't work on local Wiktionaries, where we have interwiki links only to the Wiktionaries in other languages… Gower (talk) 10:14, 20 November 2022 (UTC)
I think it is just a matter of implementing it on the given Wiktionary. I am in the very beginning stages of doing this on pnbwiktionary. On this entry the senses and inflection table are pulled from a lexeme (ਮਾਅਨਾ/معنیٰ/معنی (L729524)) using a module based on one implemented by @Mahir256 for Bengali Wiktionary. Sidebar links are probably possible as well in this case because pnbwiktionary is for the most part monolingual. (There are two Polish entries and a Chinese one but that's it for non-Punjabi words.) For wiktionaries like enwiktionary, an entry-to-lexeme sidebar could get quite long as entries there are all shared based on one string. I am quite interested in implementing some more human-readable renderings of lexicographic data for speakers/readers of the language. pnbwiktionary has not had active editors since 2014, but likely more for lack of resources for the language than lack of interest. Lexemes could be used to automate the creation of entries to a certain extent. عُثمان (talk) 19:49, 24 November 2022 (UTC)
@Gower: we don't need a property; just as interwiki links are not stored on the Wiktionaries (the Cognate extension creates them based on the string), we don't need to store them on Wikidata, and several gadgets already exist on both the Wikidata and the Wiktionary sides (on the Wiktionary side, I'm using fr:wikt:Utilisateur:VIGNERON/LienLex.js). Cheers, VIGNERON (talk) 09:08, 26 November 2022 (UTC)

How to model templatic pattern morphemes of Semitic languages?

Arabic languages or varieties have morphemes that are templatic patterns: a skeleton, often of vowels and sometimes of prefixes, infixes and suffixes, into which the radicals (often consonants) of a root are inserted. (This applies to other Semitic languages as well, but is maybe less codified than in Arabic; I myself am mostly familiar with the situation in Arabic.)

My question is: How should we model these templatic patterns? Especially: Should they get entities in the item namespace or the lexeme namespace? As they are not lexemes by themselves, I've started to create entities for some in the item namespace, a list can be found at User:Marsupium/Arabic morphological patterns.

But some qualities mostly expressed by lexemes in the lexeme namespace apply to these templatic patterns as well. For example, I'd like to express Fi3iiL (Q115287997)derived from lexeme (P5191)Fa3iiL (Q115287998). But that would violate derived from lexeme (P5191)'s allowed-entity-types constraint (Q52004125). How to solve this? Should these patterns be moved to the lexeme namespace, or should the property's constraint be widened? Thanks in advance for any comments, --Marsupium (talk) 18:23, 20 November 2022 (UTC)

While the item namespace is designed to describe objects in a language-independent fashion, with the labels in multiple languages primarily intended to aid the human reader/editor of these items, entries in the lexeme namespace are specific to each language. In order not to spend a huge amount of database memory on the language of work or name (P407) qualifier, entities specific to each language should preferably go into a language-specific part of Wikidata. It makes little sense to "translate" the "Fi3iiL" pattern label (or any of its "aliases") from Arabic into English, German, or any other of the 400 languages supported.
The lexeme entry format, listing senses and forms as sub-entries, does however not suit every possible language-specific construct such as a morphological or word order pattern. To the extent they are language-independent (spanning multiple languages) they might fit into the main item space, but I'm not convinced the triplet format is an optimal design element by which to define linguistic patterns. I'm therefore thinking in terms of adding yet another section to Wikidata.
While I first thought of it as a "grammeme" namespace (G-items), this may be too limited a purpose to also serve your morphological patterns. Before deciding how to implement them, I would suggest trying out a few different structures using custom entries in Wikimedia Commons databases (say, one file per language described). Keeping all the entries for a language in a single file also takes care of the notability requirement and helps avoid creating thousands of structural items.
Here are a number of language constructs or rule sets that could use a language-specific entry format that is not necessarily a lexeme (there may be partial overlap between some of these):
Rather than suggest a specific entry format for each one of these constructs, I'd prefer a generic format that can easily be adapted for several different purposes. SM5POR (talk) 07:44, 23 November 2022 (UTC)
Here is an example of how it could be done:
  1. Define a record format that will be sufficient for your immediate need, but also leave room for future expansion. I suggest the following columns/fields:
    • Language (ISO code "arz" or Q-item Egyptian Arabic (Q29919) as you prefer)
    • Type (Q-item, morphological pattern (Q6913446) in your case)
    • Object (Q-item, adjective (Q34698) from your example)
    • Pattern (probably a string like "Fi3iiL" or whatever you find convenient to process)
    • Options (if you need to express some limitations; think of it as a qualifier to the main Pattern statement)
    • Source (Q-item, such as a printed or online grammar)
    • Key (a string stating page number, entry keyword or other precise reference to said source)
  2. Create a file with a number of records according to this format.
  3. Upload the file to Wikimedia Commons as Data:Sandbox/(username)/ or similar.
  4. Add a statement Egyptian Arabic (Q29919)Sandbox-Tabular data (P4045)(filename as above) to Wikidata, linking to any documentation page you may have written about your experiment in the Reference section.
While it may seem redundant to specify the language in every record when it also appears in the filename, doing it this way allows you to reorganize your records in a single file or multiple files as you find convenient without having to rewrite the records according to different formats.
I would advise against creating any items specifically to support your experimental data records only, such as Egyptian Arabic grammar (Q114419189), when they don't serve a more general need. Add more fields to your custom table entries instead if necessary. Since items are not meant to be repurposed once they have been found redundant, your gradual development work might risk wasting a lot of items, but you can delete and reuse the records of your table files without any such problem.
When you have developed some code to use your tables and the format has matured enough to be used by multiple editors independently of each other, and also for some of the other pattern rules I suggested, it may be time to write a property proposal to replace the experimental Sandbox-Tabular data (P4045) property, but I believe we are far from there yet. SM5POR (talk) 05:57, 25 November 2022 (UTC)
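The record format from steps 1 and 2 could be sketched as follows. This is only an illustration: the top-level layout (`license`, `description`, `schema.fields`, `data`) is an assumption modelled on the JSON layout of Commons tabular data pages, the lowercase field names and the helper `build_table` are made up here, and the unknown Options, Source and Key values are simply left empty rather than guessed:

```python
import json

# Column names taken from the list above (lowercased for use as field names)
FIELDS = ["language", "type", "object", "pattern", "options", "source", "key"]


def build_table(records):
    """Wrap a list of record dicts in a tabular-data style JSON document."""
    return {
        "license": "CC0-1.0",
        "description": {"en": "Experimental morphological pattern records"},
        "schema": {
            "fields": [
                {"name": name, "type": "string", "title": {"en": name.capitalize()}}
                for name in FIELDS
            ]
        },
        # One row per record; fields missing from a record become None (JSON null)
        "data": [[rec.get(name) for name in FIELDS] for rec in records],
    }


records = [
    {
        "language": "arz",     # Egyptian Arabic (Q29919), as an ISO code
        "type": "Q6913446",    # morphological pattern
        "object": "Q34698",    # adjective
        "pattern": "Fi3iiL",
        # options, source and key intentionally left unset in this sketch
    }
]

table = build_table(records)
print(json.dumps(table, indent=2, ensure_ascii=False))
```

Keeping the language in every row, as suggested above, means rows from several files can later be concatenated into one `data` list without any conversion.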

Listing lexemes the topic (QID) consists of

I didn't find any way to specify in a Q-item the lexemes it consists of. Technically it is impossible, because Wikidata doesn't have a data type for a monolingual array of lexemes. So I'd like to open a discussion about such a possibility. It would open the way for automatic translation, choosing the appropriate forms of the words the topic consists of, making plural forms, etc.
One possible solution is a property like combines lexemes (P5238), but for Q-items, that would list the lexemes of a topic. Because every language has its own lexemes for a topic, the simple lexeme data type is unsuitable. It would become an unordered mess of lexemes in different languages with order and language qualifiers. That would be too complex even for machine processing.
Probably a proposal for a new data type would solve the problem. What do you think? D6194c-1cc (talk) 07:57, 25 November 2022 (UTC)

I'm not sure I understand exactly what you are looking for, but the most general property relating lexemes (or rather their senses) to items is item for this sense (P5137). Since there aren't (yet) specific items corresponding to every possible sense of a lexeme in any language, many senses either don't link to any item at all, or they link to the item of a related word. You can make a query in SPARQL to find statements using item for this sense (P5137) by either specifying the sense or the item, or by leaving both unspecified (I just did the latter, and found 136,445 statements).
As an example, the English noun water (L3302-S1) links to items liquid water (Q29053744) and water (Q283). Making a query "?sense wdt:P5137 wd:Q283" yields a list of 73 senses representing words for "water" in various languages. Given the current state of things, this is hardly enough to generate a useful pocket dictionary automatically, let alone translate a full sentence from one language to another, but it may form a rudimentary basis for future development work.
The part of your question that I don't understand is "lexemes the topic consists of". By using SPARQL queries you can certainly limit your matches to only the language you are interested in, say German or French. Translating a phrase such as "the water is cold" however also involves identifying word classes (parts of speech), grammatical forms etc and assembling the translated words in proper order. Part of that problem may be approached using the linguistic pattern data type we discuss in the previous question above (about morphological patterns in Semitic languages), but it's a complex problem that will certainly not be solved by merely adding another data type. Fortunately, we can use existing data types to simulate new ones; there is no need to make a formal proposal merely to experiment with this.
The property combines lexemes (P5238) is hardly of much use here as it merely maps between one lexeme and a list of constituent lexemes; it's not meant to be used with items. SM5POR (talk) 11:01, 25 November 2022 (UTC)
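The query mentioned above, optionally narrowed to one language, could be generated like this. It is a sketch under assumptions: the function names are made up, the prefixes `wdt:`, `wd:`, `ontolex:`, `dct:` and `wikibase:` are assumed to be predefined by the query service, and the code only builds the query text and URL without sending them:

```python
from urllib.parse import urlencode

# Assumption: the public Wikidata Query Service endpoint
SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"


def senses_for_item_query(item_qid, language_qid=None):
    """Build a SPARQL query listing senses linked to an item via P5137."""
    lines = [
        "SELECT ?lexeme ?lemma ?sense WHERE {",
        f"  ?sense wdt:P5137 wd:{item_qid} .",   # item for this sense
        "  ?lexeme ontolex:sense ?sense ;",      # the lexeme owning the sense
        "          wikibase:lemma ?lemma .",
    ]
    if language_qid is not None:
        # Restrict to lexemes of a single language
        lines.append(f"  ?lexeme dct:language wd:{language_qid} .")
    lines.append("}")
    return "\n".join(lines)


def query_url(sparql):
    """Return a GET URL for the query, asking for JSON results."""
    return f"{SPARQL_ENDPOINT}?{urlencode({'query': sparql, 'format': 'json'})}"


# Senses for water (Q283), limited to French (Q150)
query = senses_for_item_query("Q283", language_qid="Q150")
print(query)
print(query_url(query))
```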
Wikipedia doesn't support the extension that supports SPARQL, so it isn't a solution for me. Also, item for this sense (P5137) can't be used for phrases that consist of multiple words. Let me explain the idea by example: scholarly article (Q13442814) = scholarly (L13568)+article (L5515) (en). When you know the lexemes, you can find their abbreviations, translate them into another language (yes, it's a very complicated task), and also find the plural form of noun (Q1084) lexemes (articles (L5515-F2)). Specifying lexemes in items could help to automate many tasks. D6194c-1cc (talk) 11:32, 25 November 2022 (UTC)
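To make the idea concrete, here is a toy sketch of what an ordered per-language lexeme list on an item would enable. Everything in it is a stand-in: the dictionaries `ITEM_LEXEMES` and `FORMS` are hypothetical local tables rather than real Wikidata lookups, the base spellings are assumed, and pluralizing only the last word is an assumption that holds for this English phrase but not for languages with adjective agreement:

```python
# Hypothetical stand-in for "the item lists its lexemes, in order, per language"
ITEM_LEXEMES = {
    ("Q13442814", "en"): ["L13568", "L5515"],  # scholarly + article
}

# Hypothetical local form table; only "articles" (L5515-F2) comes from the
# discussion, the base spellings are assumed for illustration
FORMS = {
    "L13568": {"base": "scholarly"},
    "L5515": {"base": "article", "plural": "articles"},
}


def render(item, lang, plural=False):
    """Render an item's phrase from its lexeme list, optionally in the plural."""
    lexemes = ITEM_LEXEMES[(item, lang)]
    words = []
    for position, lexeme in enumerate(lexemes):
        forms = FORMS[lexeme]
        # Assumption: the head noun is the last lexeme in the list
        is_head = position == len(lexemes) - 1
        if plural and is_head:
            words.append(forms.get("plural", forms["base"]))
        else:
            words.append(forms["base"])
    return " ".join(words)


print(render("Q13442814", "en"))               # prints: scholarly article
print(render("Q13442814", "en", plural=True))  # prints: scholarly articles
```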
Allow me to turn that approach around: Items don't specify any lexemes, so starting out with an item and going nowhere isn't a solution for me. But when I know the lexemes in the source language, I can look up their corresponding items, query Wikidata for matching lexemes in the target language and output the resulting words. It won't be pretty, but a human reader may be able to understand it anyway. Adding SPARQL support to Wikipedia could help automate this task.
Of course, I don't claim that adding SPARQL support to Wikipedia is done overnight, or that the result will not suffer from any performance problems, but let's consider the alternatives:
If the only phrase you will ever want to translate is "scholarly article", sure, that could essentially be done in less than a minute. But if the prerequisite is that we first have to find a technical format for "specifying lexemes in items", and then actually add those lexemes for, say, a million items and 400 languages, I'm not convinced we will finish that task before Wikipedia actually has SPARQL support.
We have two problems here: One is the apparent lack of a strategy for integrating Wikidata with other Wikimedia projects, and the other is the technical difficulty of automatically translating full sentences or multi-word phrases between different languages. I certainly hope that Wikipedia too will one day be able to take full advantage of all the effort that is put into Wikidata, but if Wikidata editors, in order to accommodate Wikipedia, have to spend most of their time worrying about item labels or designing robots to add inverse statements of those already added, I'm not sure Wikidata will ever reach its full potential.
And I use the phrase "apparent lack of strategy" to describe the impression I get, not to discredit the Wikimedia Foundation. There may well be things happening behind the scenes that I'm simply not aware of, but if so, why do we still have to add all those inverse properties? SM5POR (talk) 06:44, 26 November 2022 (UTC)
You can't get a lexeme by its name, because different lexemes have different meanings. For example, мир (L100000) has a plural form in Russian and мир (L99999) doesn't. Also, in mw.wikibase you can only fetch items directly; no queries are supported. So the only way to get the lexemes of an item is to specify the lexemes directly in the item. D6194c-1cc (talk) 08:14, 26 November 2022 (UTC)
  1. Items don't list lexemes. To get lexemes from items, someone has to specify them first. You propose doing that.
  2. Wikipedia doesn't do queries. To do queries in Wikipedia, someone has to implement it first. I propose doing that.
Why do you depend on an item to translate a phrase? You want to translate "scholarly article". There is an item scholarly article (Q13442814) labelled like that, which means you can translate it after specifying the lexemes "scholarly" and "article" in that item, right?
I want to translate "the water is cold". There is no item labelled like that, and I will not create it because it isn't notable. Instead I split the phrase into its constituent words "the", "water", "is" and "cold", search the lexemes for the corresponding forms and senses (this is a manual task, or one that requires AI) and do the translation.
Likewise, you want to split the item label "scholarly article" into its constituent words "scholarly" and "article", search the lexemes for the corresponding forms and senses (this is a manual task, or one that requires AI) and specify the resulting lexemes with the item.
We are essentially doing the same thing. The differences are:
  1. You require a preexisting item for the phrase you want to translate. I don't, I use the phrase directly.
  2. You specify the lexemes for all the existing items in advance. I specify the lexemes in the phrase I want to translate, only when I need them.
I'm right now quite puzzled as to your idea here; do you really expect Wikidata editors to begin listing hundreds of lexemes with every item, or do you hope it will be done automatically using AI? Because either way, that work is going to take time, and I don't see how you can possibly reduce the time needed to translate a single phrase by first preparing, say, a million items for future translation (there are around 100 million items in Wikidata, but most of them are unlikely to ever be looked up for translation).
As to the impossibility of doing database queries from Wikipedia, that's not some natural law, but an effect of what functionality has or has not been implemented yet. Wikidata has existed for merely ten years, and Wikipedia will hopefully continue to evolve with it for many years to come. It's a bad idea to redesign Wikidata on the assumption that Wikipedia will forever be stuck with its present capabilities only, especially if the Wikidata redesign will cost more than removing the Wikipedia limitations.
Maybe Wikipedia isn't currently the optimal translation tool? -- "If only this Christmas tree were built like a kitchen stove, we could boil eggs on it, and we wouldn't need a stove for that." -- SM5POR (talk) 13:03, 26 November 2022 (UTC)
Again, you cannot find a lexeme by its name. Different lexemes have the same name but different meanings. It's a task for AI (which determines context) rather than for simple automation. As for scholarly article (Q13442814) (in Russian: "научная статья"), the plural form in Russian would be "научные статьи" (this item was just an example). When you know the lexemes, it's not so hard to implement a language engine that would change the forms of words in a phrase. My task is to get phrases from Wikidata and use them as titles in modules. But I need to translate them and change their form. For this task I have made a workaround, but for the plural form I have to make exceptions for every language. D6194c-1cc (talk) 13:46, 26 November 2022 (UTC)
Identifying the proper lexemes (or rather their senses, which is what we want to do) certainly isn't trivial; we are in agreement there. The difference between our respective approaches is when we perform that first phase of the translation process, and the cost of doing it that way. Of course merely generating the translation will be faster (and can be done automatically) than parsing the original phrase plus generating the translation, but someone still has to do the parsing, and it will take at least the same effort.
How many modules (approximately) will you need titles for (hundreds, thousands, or millions), and how many times will you need to translate each individual item label (once, twice, or a hundred times)? Can those items you need translated be listed in advance so that we can make a better estimate of the work that has to be done, or do they pop up in a totally unpredictable fashion? SM5POR (talk) 15:07, 26 November 2022 (UTC)

Lexeme to merge


Our rules for separating/merging lexemes are not ultra-clear, so I want to check here.

Is it OK to merge 24-7 (L44134) and 24/7 (L44135)? For me "24-7" and "24/7" are just variants of the same lexeme (same language, same lexical category, same meaning; there are two identifiers in the Merriam-Webster, but they are said to be variants of the same lexeme).

@SixTwoEight, Rachmat04, UWashPrincipalCataloger: who edited these two entities.

Cheers, VIGNERON (talk) 12:10, 26 November 2022 (UTC)

I'm not sure about English, but at least in Swedish "24/7" has the additional (or I would rather say original) sense of "July 24", while "24-7" does not. The separator (slash or dash) is thus significant in some languages. In Wikidata, I see that "24/7" appears in multiple instances as a proper name or title of a work, and while proper names maybe typically aren't listed with multiple senses, some cases such as Georgia (L254165) exist. There may also be lesser known uses of either in scientific contexts (ESO 24-7 (Q80020742) etc). I suggest keeping them apart also in English, both to play safe and for consistency with other languages. SM5POR (talk) 14:10, 26 November 2022 (UTC)
Very true, I didn't think about that. Indeed, in most languages 24/7 could be a date, but I think that is a separate lexeme, not a sense of 24/7 (L44135) (and I'm not sure dates are lexemes, but this is yet another subject). Plus, even merged, there would be separate forms, so there would be little to no confusion. Finally, let's go back to sources: at least three dictionaries Merriam-Webster, Collins and say they are variants (and list other variants); do other sources say otherwise? Cheers, VIGNERON (talk) 18:43, 26 November 2022 (UTC)
The date and the always-open expression have to be different lexemes because they belong to different lexical categories or differ on some other significant property affecting the selection of forms. That happens to regular words as well, more often in languages with multiple grammatical genders, such as German or Swedish, than in English.
"24/7" (or "24-7") is a rare case of a numerical expression having become an adverb through abbreviation (of "24 hours a day, seven days per week"), so the observation that this adverb has only a single sense cannot be generalized into the choice of typographic delimiter in a compound lexeme being independent of its sense (somewhat like "a non-rectangular flag is always red, white and blue, regardless of what nation it represents" when there is only one such nation). Until you have made a thorough search of all the world's literature, you cannot claim with certainty that there isn't a single case of a written expression with two senses differing only in the delimiter, and you can never be sure one won't appear in the future.
For this reason, also lexemes with only a single sense come equipped with an Lxxxxx-S1 sub-entry for all the sense-related statements; you don't move those statements up to the base lexeme entry just because there is currently no risk of confusion with a second sense. If and when a second sense appears, you simply add an Lxxxxx-S2 entry without having to rebuild the structure of the entire lexeme.
Whether reputable sources consider the expressions "variants" or not is of minor importance to me; the criterion should be what is practical to do in Wikidata. I consider the choice of delimiter similar to that of letter case; HERTZ, Hertz and hertz are different renderings of the same name/word depending on written context, but only the unit of frequency is ever written "hertz", while the ones with an uppercase 'H' can refer to either the unit, the family, or the car rental firm. Do they make up one, two, or three lexemes? Other examples: "3", "3rd", "iii" and "III"; are they ordinal or cardinal numerals? The letter "Å", the Swedish noun "å" (meaning small river), the locality "Å" and the angstrom unit abbreviated "Å"? Tom & Jerry vs Tom and Jerry? The white house vs the White House? 2,022 vs 2022? SM5POR (talk) 00:21, 27 November 2022 (UTC)
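The Lxxxxx-S1 / Lxxxxx-S2 sub-entry scheme described above is regular enough to be parsed mechanically. Here is a small sketch, with the names `LexemeRef` and `parse_lexeme_id` invented for this example; it treats `-S<n>` as a sense and `-F<n>` as a form, as in water (L3302-S1) and articles (L5515-F2) mentioned elsewhere on this page:

```python
import re
from typing import NamedTuple, Optional


class LexemeRef(NamedTuple):
    lexeme: str             # base lexeme ID, e.g. "L3302"
    kind: Optional[str]     # "sense", "form", or None for the bare lexeme
    sub_id: Optional[str]   # full sub-entry ID, e.g. "L3302-S1"


# A lexeme ID, optionally followed by -S<n> (sense) or -F<n> (form)
_ID_PATTERN = re.compile(r"(L[1-9][0-9]*)(?:-([SF])([1-9][0-9]*))?")


def parse_lexeme_id(text: str) -> LexemeRef:
    """Split a lexeme, sense or form ID into its parts."""
    match = _ID_PATTERN.fullmatch(text)
    if match is None:
        raise ValueError(f"not a lexeme, sense or form ID: {text!r}")
    lexeme, letter, number = match.groups()
    if letter is None:
        return LexemeRef(lexeme, None, None)
    kind = "sense" if letter == "S" else "form"
    return LexemeRef(lexeme, kind, f"{lexeme}-{letter}{number}")


print(parse_lexeme_id("L3302-S1"))
print(parse_lexeme_id("L5515-F2"))
```

Because sense statements always hang off an explicit `-S<n>` entry, code written against this shape keeps working when a second sense is added later.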
@VIGNERON Merging these is a good idea, and I would even go as far as to suggest making the headword/lemma “twenty-four seven,” with both 24/7 and 24-7 as alternative forms.
While the adverb is not as commonly written out in full like this, I think it is important to keep in mind that lexicographical data represents spoken as well as written language. The only way we can say this expression is “twenty-four seven” and not “two four seven” or “twenty four over seven” and so on. That we can format the numbers in different ways does not really change what they are supposed to represent, and without the full words on the same lexeme it is not really indicated that this consists of the words twenty-four and seven, not two, four, and seven. I do not think we need to be concerned that 24/7 looks like a date because this is incidental. It could also be a fraction, or a house number, or type of airplane, etc. but none of this is really related to the context of what 24/7 represents here.
I would also add that as far as consistency with other languages is concerned, it would be easier to make connections between languages from a single “twenty-four seven” lexeme to others. Many languages have adjectives or adverbs with similar or equivalent meaning that are not typically abbreviated, like Punjabi اٹھپہرا which is derived from اٹھ eight and پہر which is a unit of time equivalent to three hours. عُثمان (talk) 22:05, 27 November 2022 (UTC)
I would also add that “twenty-four seven” exists as two parts of speech and should have an additional adjective lexeme. (As in, “twenty-four seven business,” “twenty-four seven service,” etc.)
The adverb and possibly also the adjective have an additional figurative sense. As in, “he talks 24/7,” where just like with “all the time” and “constantly,” this is probably an emphatic exaggeration and the subject does actually stop talking to sleep. عُثمان (talk) 22:21, 27 November 2022 (UTC)