Wikidata talk:Lexicographical data/Archive/2022/06

This page is an archive. Please do not modify it. Use the current page, even to continue an old discussion.


Childhood language

I see several language register items for childhood language. They seem like duplicates, but maybe they are not.

Babbling is the sounds babies make before they can speak. I think the French Wikipedia page is on the wrong item. - Nikki (talk) 16:51, 19 June 2022 (UTC)
@TomT0m: +1 with Nikki. I moved the interwiki link and changed the French label on the first item to make it clearer. So for childhood language, the third one is probably the best one. Cheers, VIGNERON (talk) 17:14, 19 June 2022 (UTC)

You can now reuse Wikidata Lexemes on all wikis

Hello all,

In 2018, the Wikidata development team enabled Lexemes, Forms and Senses on Wikidata, allowing everyone across the Wikimedia projects to gather structured data about words and languages. Lexicographical data has been growing thanks to the efforts of the community, who have added, up to this point, more than 661K Lexemes in 846 different languages on Wikidata (13 languages have more than 1500 Lexemes; see statistics here).

In order to make this data usable and useful, one missing feature was the ability to access and use the data from Wikidata’s Lexicographical Data on the other Wikimedia projects via Lua modules. This feature has been requested for a long time by editors, and after a test phase on a few Wiktionaries, we are happy to announce that Lua access to Wikidata Lexeme will be enabled on June 21st on all Wikimedia projects.

Practically, Lua access means that we created some new Lua functions that will allow you to integrate Lexemes, Forms and Senses from Wikidata on any page of any Wikimedia wiki. Among the many possibilities that this feature offers, you will be able to create, for example, conjugation or declension tables, stubs of Wiktionary entries, tools displaying the meaning of a word on Wikisource, and many other things, depending on what your project needs. Until someone on your project writes a Lua module that makes use of these new functions and then uses this module on a page, nothing changes for your project.

In order to use it, people with experience with Lua modules and templates can look at the documentation listing the available functions. You can also have a look at a simple example showing the singular and plural forms of an English noun: the template, the module, the result.

Following the deployment of this feature, we are confident that several editors will start creating their own modules for Wikisource or Wiktionary - we invite you to share your experiments on this talk page, so other people can discover what you have been doing and get inspired.

If you’re involved on a Wiktionary or Wikisource, feel free to share this announcement around and to try the feature with your community.

If you have any questions or feedback, or if you want to discuss with other editors, feel free to use this talk page or the related Telegram group. To report a technical issue, feel free to use this Phabricator ticket.

Thanks for your attention and have fun with Lexemes! Lea Lacroix (WMDE) (talk) 12:19, 20 June 2022 (UTC)

I'm delighted to learn that the French Wiktionary will finally have access to Lexemes. There's a lot of investigation and experimentation to be done to find out what the best integration could be, but I might have some use cases for this feature for Lorrain (Q671198).
Do you have any examples of what the Wiktionaries in the test phase managed to do with that? Poslovitch (talk) 16:03, 20 June 2022 (UTC)
@Poslovitch: I'm happy to hear some interest and I'd love to follow what you manage to do with Lorrain :)
As far as I know, only @Mahir256: implemented modules on Bengali Wiktionary. I found this link, but I guess the best explanation of what it does will come from Mahir :) We're also preparing a livestream in the next few weeks to show use cases and create a template live; I'll keep you updated here. Lea Lacroix (WMDE) (talk) 16:33, 20 June 2022 (UTC)
@Lea Lacroix (WMDE) Thank you for your answer. Keep me posted about the livestream. The main issue I can find, though, is that unlike Wikipedias, we have no "implicit item" for Wiktionaries: infoboxes on Wikipedias do not need us to specify a Wikidata item in them. For Wiktionaries, we'll have to do that everywhere, i.e. in templates for wikt:fr:ba#Lorrain, I'll have to put {{MyTemplate|L678668}}... This is definitely going to hinder reuse of Lexemes on Wiktionaries (leaving aside the... grudges that may persist between both projects). So... Any workarounds for this? Poslovitch (talk) 07:58, 21 June 2022 (UTC)
@Poslovitch: Nothing I can think of, unfortunately. Since there is no 1:1 connection between a Lexeme and a Wiktionary page, you will have to enter the Lexeme ID as a parameter. Lea Lacroix (WMDE) (talk) 12:40, 21 June 2022 (UTC)
Reposted to the Vietnamese Wiktionary with some local context. Minh Nguyễn 💬 01:04, 22 June 2022 (UTC)
Update: I also created Wikidata:Wiktionary#Lua_access_to_Lexemes with a bit of documentation about the feature. Improvements and more examples are welcome on this page. Lea Lacroix (WMDE) (talk) 09:30, 22 June 2022 (UTC)
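
As a rough illustration of the kind of data such a module works with, here is a minimal Python sketch of the "singular and plural forms of an English noun" idea from the announcement. It is not the Lua API and not the example module linked above. The record is hand-written in the general shape of Wikibase lexeme JSON, the ID `L0` and the helper names are hypothetical, and the item IDs used for "singular" and "plural" are assumptions to check against Wikidata. As noted in the exchange above, the lexeme has to be passed in explicitly, since a page is not implicitly tied to a Lexeme.

```python
# Hypothetical, hand-written record in the general shape of Wikibase lexeme JSON.
SAMPLE_LEXEME = {
    "id": "L0",  # placeholder ID, not a real lexeme
    "lemmas": {"en": {"language": "en", "value": "apple"}},
    "forms": [
        {
            "id": "L0-F1",
            "representations": {"en": {"language": "en", "value": "apple"}},
            "grammaticalFeatures": ["Q110786"],
        },
        {
            "id": "L0-F2",
            "representations": {"en": {"language": "en", "value": "apples"}},
            "grammaticalFeatures": ["Q146786"],
        },
    ],
}

SINGULAR = "Q110786"  # assumed item ID for "singular"
PLURAL = "Q146786"    # assumed item ID for "plural"


def representations_with_feature(lexeme, feature, language):
    """Return the representations in `language` of all forms tagged with `feature`."""
    found = []
    for form in lexeme.get("forms", []):
        if feature in form.get("grammaticalFeatures", []):
            representation = form.get("representations", {}).get(language)
            if representation is not None:
                found.append(representation["value"])
    return found


def number_table(lexeme, language):
    """Build a tiny singular/plural table for one explicitly supplied lexeme."""
    return {
        "singular": representations_with_feature(lexeme, SINGULAR, language),
        "plural": representations_with_feature(lexeme, PLURAL, language),
    }


if __name__ == "__main__":
    print(number_table(SAMPLE_LEXEME, "en"))
```

A real module would fetch the entity through the Lua functions described in the documentation instead of using a literal record.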

USPS abbreviation (Q30619513) as a grammatical feature

A few lexemes use this item as a grammatical feature. I'm not sure this is what forms are for. – Loominade (talk) 10:58, 16 June 2022 (UTC)

I'd think the abbreviation should have its own lexeme --Loominade (talk) 12:35, 16 June 2022 (UTC)

@mxn: --Loominade (talk) 12:37, 16 June 2022 (UTC)

I'm not sure that it should be a "grammatical feature", but this is a written form used for the word in English. ArthurPSmith (talk) 13:27, 16 June 2022 (UTC)
It is certainly not a "grammatical feature" in any common sense of this term. It is rather an orthographic variant of certain forms. I would instead create a corresponding property for United States Postal Service abbreviation (Q30619513), and put the abbreviation as part of a statement. If the abbreviation is the same for both singular and plural forms I would create a corresponding statement on the lexeme itself, otherwise one statement per form. AGutman-WMF (talk) 14:13, 17 June 2022 (UTC)
Ah, that's a reasonable solution. What do you think about abbreviation (Q102786) which is used as a grammatical feature similarly in some cases? For example for mademoiselle (L11884). ArthurPSmith (talk) 15:46, 17 June 2022 (UTC)
@ArthurPSmith I think here too it would be better to have a statement with the abbreviation rather than using abbreviation (Q102786) as a grammatical feature. As far as I understand the data model, the "forms" of the lexeme should really only be grammatically inflected forms (also in accordance with Lexical Masks), and not any orthographic (or other) variation of it. AGutman-WMF (talk) 10:14, 20 June 2022 (UTC)
@AGutman-WMF: Restricting forms to inflection is currently impossible in some languages anyways... Minh Nguyễn 💬 08:28, 21 June 2022 (UTC)
Here is an example: Pkw (L678783) – Loominade (talk) 10:28, 21 June 2022 (UTC)
@Loominade, mxn, ArthurPSmith, AGutman-WMF: Shouldn't we use variety of lexeme, form or sense (P7481) for this? And for me, an abbreviation absolutely shouldn't have its own separate lexeme; it has no meaning in itself, it's just a variation of the form (just like "center" and "centre" or "organisation" and "organization"). Cdlt, VIGNERON (talk) 11:05, 19 June 2022 (UTC)
@VIGNERON: My observation is that lexeme databases have abbreviations as lexemes. Why not? --Loominade (talk) 08:40, 20 June 2022 (UTC)
@VIGNERON As far as I can see, using variety of lexeme, form or sense (P7481) would still entail having a separate form for the abbreviations, which is, IMHO, not desirable. As I wrote above, my understanding is that forms should only reflect grammatically inflected forms, and not other types of variation. So I would still advocate linking the abbreviation (using a statement) to the form it abbreviates.
As for the question of whether an abbreviation should have its own lexeme entry: I think it would be good here to distinguish two kinds of abbreviations. Most abbreviations are only orthographic shorthand notations of the full lexeme. In these cases I wouldn't list them separately, since they have no independent spoken existence. However, sometimes an abbreviation becomes a word of its own, and in that case it can be listed as a separate lexeme. (This is very common in Hebrew, e.g. תנ"ך (L68688)). AGutman-WMF (talk) 10:32, 20 June 2022 (UTC)
@AGutman-WMF: « forms should only reflect grammatically inflected forms »? Forms are used for any kind of form, grammatical or otherwise (orthographic, dialectal, etc.). Abbreviations are a bit specific and maybe deserve separate lexemes, but forms will always include variants that are not purely grammatical (i.e. if a tool wants to select a form, it cannot look solely at the grammatical features). Cdlt, VIGNERON (talk) 19:10, 23 June 2022 (UTC)
@VIGNERON What you suggest runs, as far as I understand, counter to the lexicographical data model. The forms are clearly defined (and distinguished) in terms of their grammatical features, which are therefore just listed after the form. If you consider Lexical Masks, which are intended to be used as validators of lexical data, you'll see that they define for each part of speech a fixed number of forms, all distinguished by purely grammatical features.
An abbreviation which has no independent spoken form, but is merely read out as the full word it abbreviates, is generally speaking not a lexeme on its own, and not even a form of that lexeme. It is simply an orthographic representation of a form (or the entire lexeme), and it should be linked to that, IMHO, either through a statement, or possibly as a "variant spelling" (e.g. using the code en-x-Q102786). Just as you wouldn't consider the orthography 10 to be a distinct form of the lexeme ten (L338), there is no reason to consider min a form of minute (L2500).
As for dialectal variation, here the situation is different, because a dialectal form is primarily a spoken form, so it is not merely a spelling variant. It may in fact exhibit different inflection patterns or grammatical features (e.g. Swiss German nouns sometimes have a different gender than the corresponding High German noun). So for these I think the ideal solution would be to represent them as different lexemes with a language code corresponding to the relevant dialect (e.g. gsw-x-Q248682 for Zurich German (Q248682)). If the differences are only in pronunciation/spelling, then it may be simpler to list the dialectal forms as spelling variants of the standard form (with the appropriate language code), but I think this would need discussion for each dialect.
(This discussion goes much farther than the original scope... Maybe we should start a distinct thread to discuss this. I will also discuss these questions in my upcoming presentation in the Wikidata Quality Days.) Cheers, AGutman-WMF (talk) 07:28, 24 June 2022 (UTC)
@Loominade @VIGNERON: I don't have a strong opinion on how to model these abbreviations, but the USPS abbreviations should be modeled consistently somehow so that they're easy to query for. Maybe a monolingual text statement on a form? There was some previous discussion about a better approach in Wikidata talk:Lexicographical data/Archive/2020/10#Abbreviations. Minh Nguyễn 💬 08:27, 21 June 2022 (UTC)
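
The "variant spelling" codes mentioned in this thread (en-x-Q102786, gsw-x-Q248682) follow a simple pattern: a base language code, the private-use marker `-x-`, and an item ID. The following Python sketch composes and splits such codes. It is only an illustration: the function names are hypothetical, and the pattern assumes a plain two- or three-letter base code, which is narrower than what Wikidata may actually accept.

```python
import re

# Assumption: base code is 2-3 lowercase letters, followed by "-x-" and an item ID.
_VARIANT_CODE = re.compile(r"^(?P<base>[a-z]{2,3})-x-(?P<qid>Q[1-9][0-9]*)$")


def build_variant_code(base, qid):
    """Compose a code such as 'en-x-Q102786' from a base language and an item ID."""
    code = f"{base}-x-{qid}"
    if _VARIANT_CODE.match(code) is None:
        raise ValueError(f"cannot build a variant code from {base!r} and {qid!r}")
    return code


def parse_variant_code(code):
    """Split a code into (base, qid); qid is None when there is no '-x-Q...' part."""
    match = _VARIANT_CODE.match(code)
    if match is None:
        return (code, None)
    return (match["base"], match["qid"])


if __name__ == "__main__":
    print(build_variant_code("en", "Q102786"))
    print(parse_variant_code("gsw-x-Q248682"))
    print(parse_variant_code("en"))
```

A consumer that wants only the standard spelling could keep representations whose parsed item ID is `None` and treat the rest as variants.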

Work in progress : wiktionary interwikis

Hi all, I started a gadget that adds links to Wiktionary pages: User:TomT0m/LexToWiktionary.js

It’s still a work in progress, but I think it is already useful, so I am announcing it here to gather input. It adds a « wiktionary » button that you can click to get a list of Wiktionary pages with relevant titles (the lemmas of the lexeme). At first it only shows a few Wiktionaries (those corresponding to your user language, your Babel languages and the language of the lexeme), but there is a link to load all Wiktionaries.

Things to do:

  • Better view:
    • in some skins (Minerva) the popup does not work well at the moment; the button is hidden in the « plus » menu.
    • in Vector 2022 the button is hidden when the window is too narrow.
  • Add a feature to query all forms and not only the main lexemes (asked by @VIGNERON).

Maybe add a search bar for finding a specific language when the list is long?

Please tell me if anything blocks you from using it. author  TomT0m / talk page 17:22, 21 June 2022 (UTC)

Hi TomT0m, thanks for this gadget. Could you explain how to install it and then how to use it, with an example if possible? For now, I have added a line to my common.js but I have not noticed anything. Also, I am not sure I have understood what it is for, so an example would probably help me understand what the gadget is useful for. Thanks, Pamputt (talk) 21:52, 21 June 2022 (UTC)
Hi @Pamputt, which skin are you using? The gadget adds a button that should be clearly visible, roughly where the usual interwiki links are, for each skin. In Minerva it is tucked away in a menu.
I have the same line in my common.js, so in principle it should work.
I have not done much on the documentation for now because this is just a test phase and it should in principle be fairly self-explanatory. There is a button, and you click on it if it works. author  TomT0m / talk page 05:58, 22 June 2022 (UTC)
As for the functionality, it lets you go easily from chat (L511) to en:wikt:Chat or fr:wikt:Chat and so on. author  TomT0m / talk page 06:00, 22 June 2022 (UTC)
Ah OK, it works now; « importScript » was not doing the job in common.js. Pamputt (talk) 11:18, 22 June 2022 (UTC)
That's odd, it works for me … but while reading up on importScript I came across this discussion: it would be better to use the other form, for reasons of dynamic loading. Sometimes the script could be loaded before the page is … and then it would not work. I will modify the line that loads my script to take this into account. author  TomT0m / talk page 15:11, 22 June 2022 (UTC)
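
To make the idea behind the gadget concrete, here is a small Python sketch of the mapping it performs: from the lemmas of a lexeme to candidate Wiktionary page addresses. This is not the gadget's code (the gadget is a JavaScript user script), and the function names are hypothetical. It assumes that page titles equal the lemma text and that a Wiktionary lives at `https://<language code>.wiktionary.org/wiki/<title>`; it does not check whether the page exists.

```python
from urllib.parse import quote


def wiktionary_url(wiki_lang, title):
    """Build the address of `title` on the Wiktionary with language code `wiki_lang`."""
    # MediaWiki uses underscores for spaces in page addresses.
    encoded = quote(title.replace(" ", "_"), safe="")
    return f"https://{wiki_lang}.wiktionary.org/wiki/{encoded}"


def candidate_links(lemmas, wiki_langs):
    """Return {wiki language: [addresses]} for every lemma of one lexeme.

    `lemmas` maps a language code to the lemma text, as in lexeme JSON
    once the nested "value" fields have been extracted.
    """
    titles = sorted(set(lemmas.values()))
    return {lang: [wiktionary_url(lang, title) for title in titles] for lang in wiki_langs}


if __name__ == "__main__":
    # Lemma of chat (L511), as mentioned in the discussion above.
    for lang, urls in candidate_links({"fr": "chat"}, ["en", "fr"]).items():
        print(lang, urls)
```

Deduplicating titles before building the links keeps the list short when several spelling variants share the same lemma text.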

Forms in Vietnamese

Lexemes in Vietnamese use forms very differently than described in Wikidata:Lexicographical data/Documentation#Data Model. Vietnamese has no grammatical feature in need of multiple forms per lexeme, but a typical lexeme has at least two forms, one for Vietnamese alphabet (Q622712) and the other for chữ Nôm (Q875344). Many lexemes have multiple forms because of different orthographic styles in Vietnamese alphabet (Q622712) or because multiple Nôm character (Q15100640) correspond to a single Vietnamese alphabet (Q622712) word. pronunciation audio (P443) statements are always duplicated among all the forms. Some examples:

In other languages, such as English, these purely orthographic variants would be modeled as representations of a single form. However, Wikibase only allows one representation per locale code per form. This is impossible in Vietnamese, because there's a many-to-many relationship between Vietnamese alphabet (Q622712) words and Nôm character (Q15100640), both by design and because chữ Nôm (Q875344) was never standardized. A single author may use 吧 and 𡝕 interchangeably for (L619034). Moreover, a Han character in this lexeme (P5425) statement only makes sense in the context of a particular representation; it makes no sense when paired with a different character or an alphabetic word.

Hopefully this atypical usage of forms won't cause too many problems for software consuming lexemes. Should we mention the possibility of non-grammatical forms in Wikidata:Lexicographical data/Documentation#Data, to raise awareness among editors and developers?

 – Minh Nguyễn 💬 08:21, 21 June 2022 (UTC)

Thanks for bringing this up!
In principle, as far as I understand, spelling variants should be part of the base lexeme in this case. As you mentioned, there is a problem that you cannot currently add multiple variants with the same code. I would suggest advocating for removing this restriction (or alleviating it, see my suggestion on the Phabricator bug) rather than working around it. AGutman-WMF (talk) 09:26, 21 June 2022 (UTC)
It isn’t ideal, but the workaround is at least entrenched enough with alternative form (P8530) that maybe it deserves a mention in the documentation. Until the issue is resolved, we shouldn’t simply exclude entire languages from Wikidata lexemes over this issue. I’m sure Wiktionary will find a use for lexemes despite this suboptimal modeling. Fortunately, forms aren’t being used for any other purpose in Vietnamese, so there’s not as much ambiguity. Minh Nguyễn 💬 16:27, 21 June 2022 (UTC)
As far as I understand, the alternative form (P8530) property only works well when you have pairs of alternative spellings (and indeed, you don't use it for the Vietnamese lexemes, as far as I can see). In fact, if there is only a single alternative form, one could simply use the language code vi-x-Q59342809 using alternative form (Q59342809), or another Q-qualifier, instead of adding non-inflectional forms. I think this suits the data model better. AGutman-WMF (talk) 09:26, 22 June 2022 (UTC)
(By way of an update, I've restructured the Vietnamese lexemes so that they don't have multiple forms. Instead, each chữ Nôm form is in a separate lexeme linked by translation (P5972).) – Minh Nguyễn 💬 20:20, 28 June 2022 (UTC)
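
The restriction discussed in this thread (one representation per locale code per form) can be modelled with a dictionary keyed by language code. The Python sketch below is only an illustration of that constraint, not Wikibase's actual validation code. The helper name is hypothetical, and the code `vi-x-Q875344` is an assumed example built from the chữ Nôm item ID; the thread does not say which codes the Vietnamese lexemes really use.

```python
def add_representation(form, code, value):
    """Add a representation to `form`, refusing a second one for the same code."""
    representations = form.setdefault("representations", {})
    if code in representations:
        raise ValueError(f"form already has a representation for {code!r}")
    representations[code] = {"language": code, "value": value}
    return form


if __name__ == "__main__":
    form = {"id": "L0-F1"}  # placeholder form ID, not a real form

    # First Nôm spelling from the discussion, under an assumed private-use code.
    add_representation(form, "vi-x-Q875344", "吧")

    # A second spelling under the same code is rejected, which is the problem described.
    try:
        add_representation(form, "vi-x-Q875344", "𡝕")
    except ValueError as error:
        print("rejected:", error)

    # With a different qualifier (here the 'alternative form' item) it fits,
    # but this only accommodates one extra variant per qualifier.
    add_representation(form, "vi-x-Q59342809", "𡝕")
    print(sorted(form["representations"]))
```

Because every additional variant needs its own distinct code, this approach does not scale to the many-to-many relationship described above.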

nominal locution (Q29888377) for translations

German has a lot of words that are compound (Q245423). Often but not always, the English equivalent is simply multiple words. In an English-only dictionary, entries like (L678907) would look funny. When provided as a translation for Freibier (L678904), they would make sense. Do you think (L678907) is a valid lexeme? If not, what makes black hole (L2890) a lexeme? Are both nominal phrases in the first place? Maybe @AGutman-WMF: – Loominade (talk) 09:34, 23 June 2022 (UTC)

Not sure on (L678907) - possibly as it represents a concept (a certain category of freedom) beyond the meaning represented simply by the two words in combination. ArthurPSmith (talk) 13:49, 23 June 2022 (UTC)
In general I think it makes sense to represent in a dictionary (and thus in Wikidata) idioms, which are multi-word expressions whose meaning cannot be compositionally deduced from their sub-components. Some linguists use the term lexeme in a stricter sense (as a single word), but I think most definitions would accept non-compositional multi-word expressions as lexemes. This also applies conversely: while a compound is (orthographically) a single word, if its meaning is compositionally derived from its components, there is no need to list it in a dictionary, unless it has some special pragmatic flavor to it, in which case it is not strictly compositional. So insofar as (L678907) and (L678907) simply mean beer which is free (of cost), I don't think they merit representation in Wikidata, but as @ArthurPSmith mentioned, these expressions do have a special pragmatic usage of representing a category of freedom (gratis), so with that sense they could be retained, but the sense should be added to their definitions, IMHO. AGutman-WMF (talk) 15:41, 23 June 2022 (UTC)
And by definition, do you mean gloss or sense properties? – Loominade (talk) 07:59, 24 June 2022 (UTC)
Following your advice, I redirected the English free beer lexeme to the German word. – Loominade (talk) 09:52, 29 June 2022 (UTC)