Wikidata talk:Lexicographical data/Archive/2021/02

This page is an archive. Please do not modify it. Use the current page, even to continue an old discussion.

I like https://www.wikidata.org/wiki/Lexeme:L99. Which do you think are our featured lexemes? Should we make a list, vote, and indicate the result on the lexemes somehow?--So9q (talk) 11:23, 1 February 2021 (UTC)

@So9q: I love the suggestion. We should make some clear workflow for nominating, evaluating, displaying etc. potential featured lexemes. Meanwhile, we can still start. I will work on a Lexeme in Breton to share with you too.
About Luftballon (L99) specifically, I would say it's a good lexeme, but it could perhaps be improved a bit, for instance by adding references for instance of (P31) and being more specific for described by source (P1343) (a specific edition of the Duden rather than the work as a whole would be better, and with the page number better still). An IPA transcription (P898) on each form would be welcome. Shouldn't there be a third sense for balloon (Q183951)? (Unclear to me, probably not, to be checked.) Last, didn't we say there should be no images in Lexemes? (Not sure about the last one... did we?)
Cheers, VIGNERON (talk) 08:17, 4 February 2021 (UTC)
I like your improvement suggestions 😃. Images are encouraged since I accepted @fnielsen's arguments in the previous discussion. (His main argument was that lexemes are localized, and the image on the QID might not be relevant or the best one in the context of the culture of the lexeme's language.)--So9q (talk) 08:41, 4 February 2021 (UTC)

Moby Part of Speech List

@Nikki, So9q, VIGNERON: I found a resource about English words, but I am not sure how reliable the list is. The list may be downloaded at: http://www.gutenberg.org/files/3203/files.zip --GZWDer (talk) 14:42, 7 February 2021 (UTC)

Here is the list in txt (3MB). Looks good to me. As long as it's a limited-scope bot job with some examples to judge the quality, it sounds like a good idea to me :). Please consider creating a bot request after cleaning up from your former bot jobs.--So9q (talk) 14:13, 9 February 2021 (UTC)
I am not sure how reliable the list is.--GZWDer (talk) 16:25, 9 February 2021 (UTC)
Is there any description of what this list is? I recognize some French words and there are many mistakes (for example: "bon march" should be "bon marché"; "la carte" is only "la" + "carte" and should not be created as one lexeme; "la king" has no meaning; etc.). So we should not do anything with this list. Pamputt (talk) 15:47, 14 February 2021 (UTC)

Ambiguous massness

How to describe sense-specific noun massness (countable vs. uncountable)? Wiktionary lists massness either on lexeme level (e.g. information) or on sense level (e.g. water). I have been adding massness on lexeme level via instance of (P31) pointing to mass noun (Q489168) or count noun (Q1520033). Placing instance of (P31) on senses doesn't seem right though. I can think of several solutions:

  1. Split lexemes along the massness axis, the same way it's done with gender. This would line up nicely with the fact that mass nouns don't have plural forms. Ambiguous massness is, however, much more common than ambiguous gender, so this would result in a lot of new lexemes.
  2. Create property "massness" that could be either on lexeme or sense level.
  3. Create multiple instance of (P31) properties on lexeme level and create new qualifier (or find existing one) that restricts them to only some senses.

So far the existing practice (in Danish and English lexemes) seems to be to add two instance of (P31) properties on lexeme level without any qualifiers, effectively marking the noun as having ambiguous massness, but that leaves out useful information about senses. — Robert Važan (talk) 23:22, 21 January 2021 (UTC)

Hmm, I'm always uneasy with concepts like massness, collectiveness, and countability (which are more or less the same in some languages but not always... that would be too easy; not sure for English, and it is definitely a mess in Breton). I'm not sure we have enough data to split lexemes along massness (though that's an elegant proposal), and I'm also not sure about a new property (we would need a general property like grammatical gender (P5185); "massness" seems too specific), which leaves us with option 3. @Robert Važan: for more advice (<- uncountable noun), you should ping people (<- again!) who are already working on this on Lexemes. Cheers, VIGNERON (talk) 08:36, 22 January 2021 (UTC)
@VIGNERON: It's indeed a good idea to ping Fnielsen who added massness to Danish and some English lexemes. I just realized that #1 and #3 lose information about dominant/minority massness, which is expressed in Wiktionary headword as "usually countable" or "usually uncountable". Only #2 captures it. I haven't heard of anyone calling massness a gender. Maybe noun class, which has (unused) property noun class (P7165). BTW, I don't see what's wrong with specialized properties. They can be constrained and easily queried. — Robert Važan (talk) 01:06, 23 January 2021 (UTC)
I've wondered about this. For some words/senses it is clear, but for many there are gray areas around countability/mass meaning, which I think partly reflects our world, where continuum and discrete perspectives are often complementary (everything is made of atoms or related countable constituents!). Maybe it will become clearer as we continue to work on lexemes here. Though I'm sure linguists have thought about this a lot too... en:Mass noun claims a precise definition for the mass/count distinction, but then seems to confuse several different aspects of the relation. Anyway, I'm also in favor of option 3, with an "applies to sense" qualifier that I think would be generally useful. ArthurPSmith (talk) 15:43, 22 January 2021 (UTC)

I am not sure how we should best handle this, but I will just point to issues in Danish. The word for beer in Danish (øl (L39743)) is interesting and a rare example. It has two related senses: a mass noun with neuter gender for the liquid, and a countable noun with common gender for a countable unit of beer (a bottle or can). This seems to be somewhat similar to English: Do you want a beer? Do you want more beer? Here one would need to tie the grammatical gender and the massness to a specific sense and form. I am sometimes unsure whether a noun is a mass noun, a countable noun, or a singulare tantum. For instance, a word such as kærlighed (L45418) (love) is usually a mass noun but may sometimes appear as a singulare tantum, or (unofficially) as a countable noun, as in English: more love, a love, "*two loves". The countable noun more often seems to denote a person who is loved. — Finn Årup Nielsen (fnielsen) (talk) 21:49, 15 February 2021 (UTC)

@Fnielsen: A bit off-topic, but I find it strange that you list singulare tantum as an alternative to mass/count nouns. Isn't singulare tantum merely a paradigm defect, an absence of plural, regardless of what causes it? Massness may cause singulare tantum, but singulare tantum can have other causes and, at least in Slovak, mass nouns are sometimes used in plural to strengthen their meaning, so massness is orthogonal to singulare tantum, not an alternative to it. — Robert Važan (talk) 02:10, 16 February 2021 (UTC)

FYI, I have updated Slovak nouns with ambiguous massness to have both classes, the way it's done in Danish and English. People seem to be in favor of the "in sense" qualifier and I like that idea too. I am not going to propose the qualifier yet, because my work on senses is blocked by current API limitations. I will get back to this later. — Robert Važan (talk) 02:15, 16 February 2021 (UTC)
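
For illustration, option 3 with a sense-restricting qualifier could be sketched roughly as follows. This is plain Python data, not the real Wikibase JSON; the `applies_to_sense` field stands in for the qualifier that has not been proposed yet, and the sense IDs S1/S2 are made up for the example. Only P31, Q489168, Q1520033 and L39743 come from the discussion above.

```python
from dataclasses import dataclass, field
from typing import List, Set

MASS_NOUN = "Q489168"    # mass noun, as cited in the thread
COUNT_NOUN = "Q1520033"  # count noun, as cited in the thread


@dataclass
class Statement:
    prop: str    # property ID, e.g. "P31" (instance of)
    value: str   # item ID
    # Hypothetical qualifier: senses this statement is restricted to.
    # An empty list means the statement applies to the whole lexeme.
    applies_to_sense: List[str] = field(default_factory=list)


@dataclass
class Lexeme:
    lexeme_id: str
    senses: List[str]
    statements: List[Statement]


def massness_of_sense(lexeme: Lexeme, sense_id: str) -> Set[str]:
    """Return the massness classes (P31 values) that hold for one sense."""
    result = set()
    for st in lexeme.statements:
        if st.prop != "P31" or st.value not in (MASS_NOUN, COUNT_NOUN):
            continue
        if not st.applies_to_sense or sense_id in st.applies_to_sense:
            result.add(st.value)
    return result


# Illustrative only: S1 = the liquid, S2 = a bottle or can (made-up sense IDs).
beer = Lexeme(
    lexeme_id="L39743",
    senses=["S1", "S2"],
    statements=[
        Statement("P31", MASS_NOUN, applies_to_sense=["S1"]),
        Statement("P31", COUNT_NOUN, applies_to_sense=["S2"]),
    ],
)
```

With this shape, a lexeme carrying two unqualified instance of (P31) statements (the current Danish/English/Slovak practice) simply reports both classes for every sense, so existing data would stay valid.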

Which language (code) in non-standard cases?

When I am creating a symbol or other multilingual lexeme, or a lexeme in a language that is not supported yet, I try to enter "mis" or "mul" in Special:NewLexeme, but in vain. Nevertheless, after creation I can change the code to "mis" and the language to multiple languages (Q20923490) or language without a specific language code (Q22283016). This is counterintuitive and inconvenient. --Infovarius (talk) 12:59, 15 February 2021 (UTC)

Tools idea: Lexeme Mix'n'Match

Such a tool would be very similar to Mix'n'Match, but for lexemes instead of items. The system may contain multiple catalogues (i.e. databases, word lists and dictionaries), each of which contains some "words". A word usually corresponds to one lexeme in Wikidata, but the mapping may be one-to-many (e.g. some dictionaries consider a noun and a verb to be the same word, and there are two English lexemes for the noun "school") or many-to-one (e.g. color/colour, and some word lists contain many inflected forms). A word may also have a part of speech (not mandatory, as some dictionaries do not provide one), forms and senses (including translations).

This is much better than directly using a bot to mass-import lexemes. It also makes it easier to discuss the dataset before any bot action.--GZWDer (talk) 18:55, 10 February 2021 (UTC)

@GZWDer: Are you familiar with Ordia? The Text-to-lexemes component does something like what you suggest, but I'm sure the app could be improved if you have specific ideas for how to do it better. ArthurPSmith (talk) 16:27, 15 February 2021 (UTC)
This assumes the original data is short enough to be handled at once. What I proposed is a place where you can upload the full Oxford English Dictionary and match its entries one by one (there are obviously many lexemes that do not exist yet and should be created). Some dictionaries do not provide a part of speech, so it needs to be added manually when new lexemes are created, instead of importing them fully automatically.--GZWDer (talk) 09:54, 16 February 2021 (UTC)
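
As a rough illustration of the matching model described in this thread, here is a minimal sketch of catalogue entries linked to lexemes, covering the one-to-many and many-to-one cases. All names (`CatalogueWord`, `link`, the catalogue labels) and the lexeme IDs are hypothetical placeholders, not an existing tool's API or real Wikidata IDs.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CatalogueWord:
    catalogue: str                          # which dictionary or word list
    lemma: str
    part_of_speech: Optional[str] = None    # optional: not every source has it
    lexeme_ids: List[str] = field(default_factory=list)  # matched lexemes


def link(word: CatalogueWord, lexeme_id: str) -> None:
    """Record a match; one word may be linked to several lexemes."""
    if lexeme_id not in word.lexeme_ids:
        word.lexeme_ids.append(lexeme_id)


def unmatched(words: List[CatalogueWord]) -> List[CatalogueWord]:
    """Entries still waiting for a match (or for a new lexeme)."""
    return [w for w in words if not w.lexeme_ids]


def words_for(words: List[CatalogueWord], lexeme_id: str) -> List[CatalogueWord]:
    """All catalogue entries matched to one lexeme (many-to-one view)."""
    return [w for w in words if lexeme_id in w.lexeme_ids]


# Placeholder lexeme IDs, made up for this sketch.
school = CatalogueWord("dictionary A", "school")
color = CatalogueWord("word list B", "color")
colour = CatalogueWord("word list B", "colour")
balloon = CatalogueWord("word list B", "balloon")
words = [school, color, colour, balloon]

link(school, "L-school-1")   # one-to-many: one entry, two lexemes
link(school, "L-school-2")
link(color, "L-colour")      # many-to-one: two spellings, one lexeme
link(colour, "L-colour")
```

Keeping the match as a list on the catalogue entry means entries can be reviewed one by one, and anything left in `unmatched(words)` is a candidate for manual lexeme creation rather than automatic import.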

Inclusive lexemes (inclusive senses/subclasses) and properties that help?

On the Abstract Wikipedia/Wikifunctions Telegram channel, I brought up the following for discussion, but moving it here instead for wider discussion:

It would be nice to see a distance on the graph somehow between "staff" and "colleague", regardless of "guide". "staff" could be considered a plural synonymous form of "colleague" or even just "teacher". But synonym is not the right relationship, I think. How does one capture information where "teacher" and "employee", etc. are considered to be included in "staff"? So that then in Wikidata it's useful for Wikifunctions or queries that provide data to Wikifunctions later?

I see a few existing properties that might help expose those relationships, but I'm not sure what the best approach is to capture the idea of "this word is typically considered to be inclusive of this other word". I think hyperonym (P6593) is sometimes used by linguists for this. I'm also not sure whether doing this modeling in the Lexeme namespace is even the right way forward, as opposed to using Entities and Subclass relations. Anyway, I noticed we also have pertainym of (P8471) and specified by sense (P6719).

I.e., there should be a way to walk the graph and see some relationship with a distance between "colleague" L6516-S1 and "teacher" L5219-S1 and "staff" L5361-S2 --Thadguidry (talk) 17:47, 13 February 2021 (UTC)

Could you improve your example above with rationale why this is worth having? Please also improve this sentence, it's hard to understand what you mean: It would be nice to see a distance on the graph somehow between "staff" and "colleague", regardless of "guide".--So9q (talk) 07:38, 14 February 2021 (UTC)
@Thadguidry: Sense relationships I think are mainly handled through item relationships, via the "item for this sense" property. Is there a need for something more direct within lexemes? I'm not convinced. ArthurPSmith (talk) 16:30, 15 February 2021 (UTC)
@ArthurPSmith: I'm wondering how, later on, I would put together some Wikifunctions that, for Abstract Wikipedia, would get "close enough" to the interpreted meaning. Imagine a sentence on Wikipedia that says, in some language other than English, "teachers and school employees were not part of the group". You could rephrase it as "school staff were not part of the group". Or simply "staff were not part of the group", if the target language is vastly condensed. You can then imagine that synonyms, hypernyms, etc. can play a very important part in providing interchange as necessary. That was my thought process and use case. I'm actively thinking about all the kinds of Wikifunctions and programs that I might want to work on in 1-2 years' time and what will be needed. I can build a system that holds semantic distances, but hops on the graph can also provide this as input to machine learning. Finding those hops means capturing these relationships, and that is where I need opinions, in particular around hypernyms in this use case. --Thadguidry (talk) 17:18, 15 February 2021 (UTC)
Yes, I can see that's an interesting use case so it would be worth figuring out how to do it... ArthurPSmith (talk) 18:03, 15 February 2021 (UTC)
Hyperonyms could in principle be handled at least partly through subclass of: « L1 => meaning item A => subclass of B <= L2 », with L1 and L2 in the same language, so L2 is a hyperonym of L1. In practice this may not be easily handled by a wikifunction (or handled at all: if we think in terms of the arbitrary data access API in wikis, for example, the « <= » arrow cannot currently be crossed), and it is of course very sensitive to data quality and to whether or not we share an understanding of « item for this sense ». author TomT0m / talk page 19:08, 16 February 2021 (UTC)
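
The chain « L1 => item A => subclass of B <= L2 » can be sketched as a small graph walk that also yields a hop count, which is the kind of distance asked for earlier in the thread. This is a toy model over in-memory dictionaries, using the L1/L2/A/B placeholders from the comment above (plus an extra L3/C); it does not query Wikidata, and the reverse index built inside the function is exactly the « <= » step that is hard to do with arbitrary data access.

```python
from collections import deque
from typing import Dict, List, Tuple

# Toy data, not real Wikidata IDs.
item_for_sense: Dict[str, str] = {"L1-S1": "A", "L2-S1": "B", "L3-S1": "C"}
subclass_of: Dict[str, List[str]] = {"A": ["B"], "B": ["C"]}
sense_language: Dict[str, str] = {"L1-S1": "en", "L2-S1": "en", "L3-S1": "en"}


def hyperonyms(sense_id: str) -> List[Tuple[str, int]]:
    """Return (sense, hops) pairs for same-language senses whose item is a
    direct or indirect superclass of this sense's item."""
    start = item_for_sense[sense_id]
    language = sense_language[sense_id]

    # Reverse index item -> senses: the "<=" arrow in the comment above.
    senses_for_item: Dict[str, List[str]] = {}
    for sense, item in item_for_sense.items():
        senses_for_item.setdefault(item, []).append(sense)

    found: List[Tuple[str, int]] = []
    seen = {start}
    queue = deque([(start, 0)])
    while queue:  # breadth-first walk up the subclass-of chain
        item, hops = queue.popleft()
        for parent in subclass_of.get(item, []):
            if parent in seen:
                continue
            seen.add(parent)
            for sense in senses_for_item.get(parent, []):
                if sense_language[sense] == language:
                    found.append((sense, hops + 1))
            queue.append((parent, hops + 1))
    return found
```

As noted in the comment, the result is only as good as the subclass-of data and the « item for this sense » links it relies on.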

Can’t use Lexemes on Wiktionaries

Hi, I tried to use the lexemes on the French Wiktionary using Lua code, but all the functions defined here are unknown. Is this normal? Lepticed7 (talk) 18:48, 16 February 2021 (UTC)

@Lepticed7: Mahir256 (talk) 19:36, 16 February 2021 (UTC)
Salut @Lepticed7:, thanks for your interest in using Lexemes on Wiktionary :) If everything goes well, we will be able to make it happen later this year. In the meantime, if you have a bit of time, you can help us by telling us more about your use case: for which purpose do you want to use Lexemes, what part of the data do you want to display, what Lua function would be useful to you? (French works too.) Thanks in advance! Lea Lacroix (WMDE) (talk) 08:29, 17 February 2021 (UTC)
@Lea Lacroix (WMDE): Hello. Actually, I don't have a very precise use case for the moment. It was mostly to tinker, the same way I did with Wikidata recently. The idea, having discussed it with others, was to try to generate conjugation or inflection tables from Lexemes. With a bit of digging, we should be able to find other applications, such as handling alternative spellings or anagrams. And above all, to do it gently, since some people over here bristle as soon as the word Lexeme is mentioned :). I would be delighted to discuss applications on Wiktionary in more depth if that discussion ever gets started. See you, Lepticed7 (talk) 08:34, 17 February 2021 (UTC)
Absolutely: whatever happens, it will be done gently and led by the contributors themselves. We do not want to force the use of Lexemes on Wiktionary in any way, but making the option possible would already be a big step.
So, taking conjugation tables as an example: we would need a function that fetches the Forms of a Lexeme, possibly also retrieving their grammatical features? Lea Lacroix (WMDE) (talk) 09:40, 17 February 2021 (UTC)
That's exactly it. I see two solutions: either the API you provide lets me retrieve, for a given lexeme, the form matching given grammatical features, or there is a function that returns all the forms without any filtering, and on my side I sort them and put them in the right cells of the tables.
Also worth exploring: if Lingua Libre recordings are linked to forms, this could let us manage the displayed recordings a bit better, because for now it is a rather raw list on our side.
With a bit of thought, I think we can come up with many interesting applications on the Wiktionaries. :) Lepticed7 (talk) 09:49, 17 February 2021 (UTC)
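
The second of the two solutions in this exchange (fetch every form, then filter locally by grammatical features) could look roughly like this. The sketch is in Python rather than Lua, purely to show the logic; `FORMS`, `forms_matching` and `conjugation_table` are hypothetical names, the data is a simplified stand-in for a lexeme's forms, and readable feature labels are used where Wikidata would store item IDs.

```python
from typing import Dict, List, Set, Tuple

# Simplified stand-in for the forms of one lexeme (French "chanter").
FORMS: List[Dict] = [
    {"representation": "chante",
     "features": frozenset({"present", "first person", "singular"})},
    {"representation": "chantes",
     "features": frozenset({"present", "second person", "singular"})},
    {"representation": "chantons",
     "features": frozenset({"present", "first person", "plural"})},
]


def forms_matching(forms: List[Dict], wanted: Set[str]) -> List[str]:
    """Representations of every form carrying all the wanted features."""
    return [f["representation"] for f in forms if wanted <= f["features"]]


def conjugation_table(
    forms: List[Dict], tense: str, persons: List[str], numbers: List[str]
) -> Dict[Tuple[str, str], List[str]]:
    """Fill one table cell per (person, number); a cell may stay empty."""
    table = {}
    for person in persons:
        for number in numbers:
            table[(person, number)] = forms_matching(
                forms, {tense, person, number}
            )
    return table
```

Each cell holds a list because a lexeme may have zero or several forms for the same feature set (for example alternative spellings), so the template code has to decide how to display those cases.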

Where to describe grammatical gender?

Hi,

Looking at this French lexeme, a question came to my mind. Where do we describe grammatical gender? In the statements of the lexeme, or of the forms? For a few reasons, I think it's better to put it in the forms' statements: that is also where grammatical number goes, and in French, for example, we have some nouns whose gender changes depending on the number, like amour. For these reasons, I think grammatical gender should always be described in the forms' statements.

Lepticed7 (talk) 09:18, 17 February 2021 (UTC)

Many nouns have invariable gender, so it should be a lexeme property for them. --Infovarius (talk) 20:43, 17 February 2021 (UTC)
@Lepticed7: interesting case, but it's not just "many": 99% of nouns have one unique, clear gender. The cases where gender is attached to the forms are exceptions. In French there are only 3 well-known words (orgue (L471), amour (L1021), délice (L15976)), and it also depends on the sense and other particularities. These exceptions can be solved with qualifiers like subject form (P5830) (as I did on amour (L1021)) and maybe subject sense (P6072) (I'm not sure L:L1021#F3 is correct; après-midi (L25740) is also a good example needing qualifiers). I don't think we should model lexemes based on exceptions. Cheers, VIGNERON (talk) 13:18, 22 February 2021 (UTC)

Search using traditional Chinese characters?

When using the search, it appears that it does not "understand" traditional Chinese. E.g. searching for 哈爾濱北站 yields no result, whereas 哈尔滨北站 does. --Zenwort (talk) 13:29, 24 February 2021 (UTC)

Hi @Zenwort:,
This is not related to Lexemes (which this talk page is about), and it is not related to traditional or simplified Chinese either. The reason is much simpler: we had no data about 哈爾濱北站, while we had some about 哈尔滨北站. I've added the label 哈爾濱北站, and it gives results now. Cheers, VIGNERON (talk) 08:55, 25 February 2021 (UTC)