Wikidata talk:Lexicographical data/Archive/2018/10

This page is an archive. Please do not modify it. Use the current page, even to continue an old discussion.

Onomatopoeia in etymology

I need it for kszyk (L24788). How would you model onomatopoeia (Q170239):

  1. with instance of (P31) set to onomatopoeia (Q170239)
  2. with derived from (P5191) set to "no value" and mode of derivation (P5886) set to onomatopoeia (Q170239) or some other value (new item?)
  3. with some other way

KaMan (talk) 08:11, 25 September 2018 (UTC)

Option 2 seems to make the most sense, but maybe someone else has better ideas. —Rua (mew) 10:27, 25 September 2018 (UTC)
An onomatopoeia is... an onomatopoeia. It's a word class, kinda. Other words can derive from onomatopoeia, and are not onomatopoeia themselves (French verbs like ahaner and ululer fall into that category, as do English to meow, to tick and to click). Circeus (talk) 04:45, 1 October 2018 (UTC)
But the intermediate "direct" onomatopoeia may not always be known, only the verb, noun etc that is derived from it. Especially with old/reconstructed languages. —Rua (mew) 10:41, 2 October 2018 (UTC)

Has/does not have

With the new inflection class (P5911) property, I added qualifiers to giella (L20900) and ealli (L25088) to indicate that they do and do not have consonant gradation in their inflection, respectively. Is this a good way to indicate this, or is there an alternative? —Rua (mew) 10:58, 26 September 2018 (UTC)

There is nothing wrong with creating a separate subconjugation "even inflection without gradation". If you want to write it with does not have part (P3113), I think even inflection (Q56633409) ought to refer to consonant gradation (Q731363) directly. I've added a has quality (P1552) statement to it, as it seems more fitting than "has part" (I suspect this means a "does not have quality" property is warranted lol). Circeus (talk) 18:18, 2 October 2018 (UTC)

Over 10,000 in 1 language (French)!

According to Wikidata:Lexicographical data/Statistics there are 10221 in fr; all other languages are still under 10,000. The total is getting close to 30,000. ArthurPSmith (talk) 18:13, 4 October 2018 (UTC)

  • I think there are two languages: English with an entity for every 1-5 letter word! Congrats Arthur! --- Jura 18:48, 4 October 2018 (UTC)
    Actually, for me English is about twice as big as French, because every English lexeme contains a complete set of forms, while I can hardly find a French lexeme with a complete set of forms. Describing English and other languages usually goes "deep", while French goes "flat". For example, the average size of the last 10 lexemes created in English is 855.1, while for the last 10 French lexemes it is 449.6. For Polish the same number is 17332.2. KaMan (talk) 07:29, 5 October 2018 (UTC)
    • It's hard to say. The good thing about the Polish ones is that they generally have statements about the entities (e.g. gender) which others lack. For other languages, some are missing because the relevant properties haven't been created yet. I'm not sure if we are quite ready for forms yet. I think the bug report about the automatic creation of F1 is still open and its layout is probably only going to happen next year.
      An interesting find from the approach for English is that lookup for short lemmas (and forms) starts getting saturated. --- Jura 07:42, 5 October 2018 (UTC)

Q2428747 or Q1084

Which items should be used for nouns? On L:L189, both are used. Pamputt (talk) 18:53, 4 October 2018 (UTC)

I remember it was stated recently in some threads that if the lexical category is not set to proper noun (Q147276), then there is a silent assumption that the lexeme is a common noun (Q2428747). Oh, I found it. It was stated here KaMan (talk) 07:37, 5 October 2018 (UTC)

Regional words

Do we have a property that would work for marking where regional words/senses are used? For example, wikt:en:car hire says Australia, New Zealand and Britain. wikt:en:wisht says Cornwall and Devon.

The closest I can find is indigenous to (P2341) but that sounds really weird to me. There's used by (P1535) and valid in place (P3005) too but they don't seem appropriate either. Maybe we need a lexeme-specific property?

- Nikki (talk) 15:28, 27 September 2018 (UTC)

Use "Language Australian English (Q44679), New Zealand English (Q44661), British English (Q7979)" in lieu of "Language English (Q1860)" for wikt:en:car hire; and create Wikidata items for "English in Cornwall" (or Anglo-Cornish (Q3299229) ? see this) and "English in Devon" for wikt:en:wisht. Visite fortuitement prolongée (talk) 15:30, 29 September 2018 (UTC), 15:42, 29 September 2018 (UTC), 15:45, 29 September 2018 (UTC), 15:54, 29 September 2018 (UTC)
I think that is a really bad idea and I'm strongly opposed to it. They are English words found in English dictionaries, the language should be English. It's not realistic to expect people to create and maintain numerous identical lexemes just because a sense is not used by everyone. It would be a nightmare when a word has lots of senses (e.g. wikt:en:pot, most of those senses would belong on the regional lexemes too) or when a word is used in a lot of regions (e.g. wikt:en:pants, which would need at least 11 lexemes for the first sense if every region has its own). A very similar issue is that some senses are specific to certain subjects (e.g. sports, nautical terms) and we need a way to mark those. Not everyone uses those senses either, but that doesn't mean they should be separate lexemes. - Nikki (talk) 16:15, 30 September 2018 (UTC)
I'm very much with Visite fortuitement prolongée here. If they are regional dialect words, they have to be marked as such.
The argument about words with multiple senses is a misunderstanding and highly misleading: these can be marked as regional in the exact same way on the relevant sense (duh!) when senses become accessible in mid-October.
Now, I'm not saying a "dialect" subproperty wouldn't be potentially useful, but as long as one doesn't exist and items for the dialects do, "language"+dialect is a perfectly reasonable property combination to use. Circeus (talk) 18:09, 2 October 2018 (UTC)
Nobody said they shouldn't be marked. My argument is that treating dialects as independent languages and therefore requiring independent lexemes is an awful way to do it. I am strongly in favour of statements on senses instead.
We can't mark individual senses on a lexeme without using a property. The only language codes senses have are the ones which mark the language that the gloss (the text describing the meaning) is written in. For example, the German word "blau" could have an en-gb gloss which uses "colour" and an en-us gloss which uses "color". - Nikki (talk) 21:14, 2 October 2018 (UTC)
Since nobody has made any suggestions for which property to use, I've proposed a new one: Wikidata:Property_proposal/location_of_sense_usage. - Nikki (talk) 12:01, 6 October 2018 (UTC)

L:L21070 should not exist

Hello, what do you think about L:L21070? To me, it should not exist since International Phonetic Alphabet (Q21204) is not a language. If we want to use analogy, it is closer to a script than a language. Any opinion? Pamputt (talk) 18:42, 4 October 2018 (UTC)

Well it was created in French, then changed to IPA if you look at the history. I agree IPA is not a language. ArthurPSmith (talk) 18:48, 4 October 2018 (UTC)
It's not a French word either. Or I would say it's as much a French word as it is a Spanish one. Pamputt (talk) 18:55, 4 October 2018 (UTC)
Well, the lexical category is correct for French. Wikidata_talk:Lexicographical_data#Count_number_of_vowels explains why it's needed there.
For entities for characters, there is a discussion at Wikidata:Requests for permissions/Bot/GZWDer (flood) 3. --- Jura 18:59, 4 October 2018 (UTC)
I strongly disagree that /o/ is a French word, this is a IPA symbol. Otherwise, what about [o], \o\, ...? What would be the meaning of this word (we can continue in French if you want)? Just because you need this lexeme for storing some information does not mean it is a French word. Pamputt (talk) 19:05, 4 October 2018 (UTC)
We can continue the discussion in the thread above. As for the entity for IPA, I currently don't see the need. --- Jura 19:07, 4 October 2018 (UTC)
I have no specific opinion about the previous discussion. I am just saying that the way it is done now is not the right way. Pamputt (talk) 19:12, 4 October 2018 (UTC)
It's hard to say. Maybe Infovarious eventually details the scheme they have in mind. I don't want to be the person who doesn't contribute anything to his attempt and just fills the forum with idle comments. --- Jura 19:16, 4 October 2018 (UTC)

Wasn't there a debate not too long ago about whether phonemes could be created as lexemes? I thought the consensus was clearly against it? Circeus (talk) 22:35, 4 October 2018 (UTC)

Yes, @Circeus, Pamputt: See, Wikidata_talk:Lexicographical_data/Archive/2018/09#Is_phoneme_a_lexeme? KaMan (talk) 06:15, 5 October 2018 (UTC)
Looks more like a general discussion about the nature of phonemes. It seems to have ended with an unanswered question about how to store IPA. --- Jura 06:39, 5 October 2018 (UTC)
From my POV it ended with consensus that phonemes should not be stored in lexeme namespace. And I see the same from this thread above. KaMan (talk) 06:44, 5 October 2018 (UTC)
I think we all agree that phonemes and letters aren't words, but that doesn't really help us building a structured database. Maybe you have a constructive suggestion for the point raised below? --- Jura 06:50, 5 October 2018 (UTC)
Store phoneme as Q-item. These are not lexemes. See close-mid back rounded vowel (Q862579) for example. Pamputt (talk) 09:43, 5 October 2018 (UTC)
How would the information be included? What would be the advantages over the approach favored by active contributors? How does it compare when they create entities for these? What would you suggest to them? Active contributors need to make editorial choices that store the information in an optimal way given the features of various entities.
For L21070, it seems to me that the current lexical category is sub-optimal if not wrong. If this is coded as IPA, it should probably be defined as a letter in that alphabet. --- Jura 08:35, 6 October 2018 (UTC)
Maybe those active contributors could explain to other active contributors why they need to add not-lexeme content into lexeme namespace? What can't be achieved without doing so? --Lexicolover (talk) 12:39, 6 October 2018 (UTC)

Pronunciation respelling for English

Some people like IPA to express pronunciation, but this doesn't seem to work for everyone. The result is that we have plenty of pronunciation files. Some dictionaries also attempt to express pronunciation in regular English-language graphemes. See w:Pronunciation respelling for English. We could obviously attempt to store them in several formats for every word, as seems to be done for some Slavic languages, but a better solution might be to find a structured way to map the sounds to regular graphemes. How could this be done for English in a structured way with Wikidata? From your experience with creating entities such as Q- and L- ones, how would you go about it? --- Jura 06:39, 5 October 2018 (UTC)

Notice for tool developers

I’ve just documented the API which the Wikidata Lexeme Forms tool uses to search for potential duplicates of a lexeme you’re about to create, over at User:Lucas Werkmeister/Wikidata Lexeme Forms#Duplicates. You’re welcome to use the same API, either in your own tools that create lexemes (also to prevent creating duplicates), or as a stricter version of lexeme search (wbsearchentities with type=lexeme) which only returns exact matches on the lemma and language code. --Lucas Werkmeister (talk) 13:52, 5 October 2018 (UTC)

  • Does that mean we can finally use tools to create entities? --- Jura 08:36, 6 October 2018 (UTC)
@Jura1: I don’t think that was really disallowed, as far as I understand… in the release announcement, Lea Lacroix (WMDE) asked us to refrain from any mass imports, which I understood to be more about bots – a tool that lets editors create individual lexemes, one by one, is okay as far as I understand. And she also suggested that we wait a bit before building tools or scripts, but, well, I’m willing to deal with the risk of API changes breaking my tool :)
This API to find potential duplicates could also be used for a bot that automatically creates a bunch of lexemes, that’s true, and I hope no one will build that for now – but I think there’s still potential for some more tools like Wikidata Lexeme Forms, where editors create individual lexemes at a reasonable pace, and hopefully this API can help with that. --Lucas Werkmeister (talk) 08:18, 7 October 2018 (UTC)
Building tools on top of the API is possible; just be aware that the system is not completely stable yet, and some things may change and require tool developers to rewrite their code later. Lea Lacroix (WMDE) (talk) 08:39, 9 October 2018 (UTC)
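
For tool authors, a minimal sketch of the stricter lookup described in this thread might look like the following. It only uses the standard wbsearchentities module with type=lexeme, as mentioned above; it is not the duplicates API of the Wikidata Lexeme Forms tool, which is documented on the user page linked above. The helper names are hypothetical, and the filter assumes that each search hit carries the lemma in its "label" field.

```python
from urllib.parse import urlencode

API_URL = "https://www.wikidata.org/w/api.php"


def build_lexeme_search_url(lemma, language_code):
    """Build a wbsearchentities request URL restricted to lexemes."""
    params = {
        "action": "wbsearchentities",
        "search": lemma,
        "language": language_code,
        "type": "lexeme",
        "format": "json",
    }
    return API_URL + "?" + urlencode(params)


def exact_lemma_matches(response, lemma):
    """Return the IDs of hits whose label equals the lemma exactly.

    Assumption: every hit in response["search"] exposes the lemma as "label".
    Prefix matches returned by the search are dropped on the client side.
    """
    hits = response.get("search", [])
    return [hit["id"] for hit in hits if hit.get("label") == lemma]
```

A tool would fetch the URL, decode the JSON and pass it to exact_lemma_matches before offering to create a new lexeme; an empty list suggests, but does not prove, that no duplicate exists.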

A few lexemes disappeared

@Lea Lacroix (WMDE): I noticed a strange thing. Lexemes from L20540 to L20543 just disappeared. They were neither merged nor deleted. They certainly existed, because they were created by me and I document every Polish lexeme I create on this page. As far as I remember, they were created on 13 September (after the datacenter switch) and they disappeared yesterday/today (after the datacenter switch back). Could it be that this is related to ? What should I do next with this case? KaMan (talk) 08:10, 11 October 2018 (UTC)

That's indeed very weird. Can you create a Phab ticket? I'll make sure that it's looked at soon. Lea Lacroix (WMDE) (talk) 08:24, 11 October 2018 (UTC)
@Lea Lacroix (WMDE): Ok, KaMan (talk) 08:39, 11 October 2018 (UTC)
Latest status: the data reappeared. Can you check again and see if you find other problems or missing items or Lexemes? Lea Lacroix (WMDE) (talk) 15:42, 11 October 2018 (UTC)
@Lea Lacroix (WMDE): I don't see problems now. Thanks to all involved, it was a really fast response, as I observed on Phabricator. KaMan (talk) 16:00, 11 October 2018 (UTC)
Thanks for reporting this issue :) Lea Lacroix (WMDE) (talk) 16:41, 11 October 2018 (UTC)
Looks good. Thanks for fixing it. Seems like lexemes helped the global WMF project ensure stability! Q56604437 doesn't exist, but it might never have. Q56604439 was available on the query server even when it had disappeared here. --- Jura 17:57, 11 October 2018 (UTC)

Queries to improve Lexemes quality


Since the SPARQL Query service works with lexemes now, here are some queries:

Feel free to add more !

Cdlt, VIGNERON (talk) 12:28, 18 October 2018 (UTC)

Aren't there sometimes images of the lexeme itself? I mean something like a painting of the word, for example a calligraphic image… Or does this deserve a specific property? author  TomT0m / talk page 12:31, 18 October 2018 (UTC)
@TomT0m: ah yes, I didn't think about that, but I see that Portez ce vieux whisky au juge blond qui fume (L25181) and Voyez le brick géant que j'examine près du wharf (L25367) are similar to what you are talking about. That said, I'm not sure whether these kinds of images are really relevant, and in most cases image (P18) should be deleted or moved to the senses section. Cdlt, VIGNERON (talk) 12:39, 18 October 2018 (UTC)
@VIGNERON: Interesting edge case you get here with a pangram ! The interesting thing with this is that actually nobody cares about the sense of « Portez ce vieux whisky au juge blond qui fume », it highlights a bad use of item for this sense (P5137) because not only it’s used on a non sense but also the item is not about a sense at all, the wikidata item is about the lexeme. Maybe we should have statements like instead. author  TomT0m / talk page 12:54, 18 October 2018 (UTC)
… nobody cares about the sense… Maybe that's because pangrams, palindromes, polyptotons and other puns (and alliterations too, btw) are not lexemes… --Shlomo (talk) 08:07, 19 October 2018 (UTC)

@VIGNERON: Feel free to add them to "Maintenance and repairing" of Wikidata:Lexicographical data/Ideas of queries. There is already one for images. KaMan (talk) 12:34, 18 October 2018 (UTC)

Thanks KaMan, I'll work on this page. Cdlt, VIGNERON (talk) 12:39, 18 October 2018 (UTC)
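
As an illustration of the kind of maintenance query discussed in this thread (it is not one of the queries posted above), a sketch that lists lexemes carrying a given statement, such as image (P18), could be built like this. The prefixes ontolex:, wikibase: and wdt: are the ones predefined by the query service; the function names and the LIMIT default are assumptions made for the example.

```python
from urllib.parse import urlencode

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"


def lexemes_with_property_query(property_id, limit=100):
    """Return a SPARQL query listing lexemes that have a direct claim for property_id."""
    if not (property_id.startswith("P") and property_id[1:].isdigit()):
        raise ValueError(f"not a property ID: {property_id!r}")
    return (
        "SELECT ?lexeme ?lemma ?value WHERE {\n"
        "  ?lexeme a ontolex:LexicalEntry ;\n"
        "          wikibase:lemma ?lemma ;\n"
        f"          wdt:{property_id} ?value .\n"
        "}\n"
        f"LIMIT {int(limit)}"
    )


def query_url(query):
    """Build a GET URL for the query service that asks for JSON results."""
    return SPARQL_ENDPOINT + "?" + urlencode({"query": query, "format": "json"})
```

Calling lexemes_with_property_query("P18") gives a query that can be pasted into the query service to review lexemes whose image statement may belong on a sense instead.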

What to put into sense gloss

I am a little unsure if I understand glosses correctly. For example in this edit I thought of "general meaning" and it was improved to "human female": (@ArthurPSmith:) Is the gloss not very similar to the linked item? How can we keep a text-field consistent in style? I also worry about biases in the gloss field. "human female" is maybe just one of the definitions for a woman. Will this hold up for more nuanced or controversial lexemes? --Tobias1984 (talk) 18:56, 18 October 2018 (UTC)

For other definitions, you would add additional senses, no? I have been generally adopting the approach of treating the senses as (shortened) versions of standard item descriptions - i.e. a brief disambiguation of the meaning, to be supplemented by appropriate statements that may clarify. ArthurPSmith (talk) 19:07, 18 October 2018 (UTC)
@ArthurPSmith: Hmm, but if we put only shortened versions of item descriptions there, then how will we be able to produce dictionary entries from Wikidata? And how will we be able to share definitions of lexemes with external users? I prefer to put standard, dictionary-like definitions in glosses, as in the Wiktionaries. KaMan (talk) 06:46, 19 October 2018 (UTC)

Serious definitions should be added via statements and we need an appropriate monolingual-text property for it. The glosses are not suitable because they can't have multiple values and they can't bear references. @ArthurPSmith: Even for the same sense, we need multiple values for definitions. Different sources can provide different (broader, narrower, more general, more specific) definitions of the same sense.--Shlomo (talk) 09:00, 19 October 2018 (UTC)

Glosses are glosses; I wouldn't put definitions in a gloss, that's not what it is meant for. Like @ArthurPSmith:, I only put the minimum necessary to discriminate between several senses of a lexeme. I think that's the best long-term method (especially if you take multilingualism into account). Cdlt, VIGNERON (talk) 10:13, 19 October 2018 (UTC)

Senses are now part of Lexicographical Data

Hello all,

We now have Senses :)

Senses will allow you to describe, for each Lexeme, the different meanings of the word, using multilingual glosses: very short phrases giving an idea of the meaning. In addition, each of these Senses can have statements to indicate synonyms, antonyms, refers-to-concept and more. By connecting Senses to other Senses and to Items, you will be able to describe precisely the meaning of words with structured and linked data. But the most important thing that Senses will be able to do is collect translations of words between languages.

Feel free to try editing Senses, and let us know if you have questions or find bugs.

Note: there are still issues with sorting the IDs of Senses, Forms and sorting the glosses, that will be solved later this week. Thanks for your understanding.

Cheers, Lea Lacroix (WMDE) (talk) 10:15, 18 October 2018 (UTC)

Note: Senses will appear on Lexemes during the next few minutes, in the order of L-IDs. We just passed administratif (L19000) :) Lea Lacroix (WMDE) (talk) 10:30, 18 October 2018 (UTC)
I've tried senses on tour (L2330), tour (L2331), tour (L2332). I just added 1 or 2 obvious and trivial senses to these lexemes right now, I'll add more later (these 3 words are *very* polysemic in French, the first might have 10 or 20 senses in the end).
More globally, where are we for structure and properties specific to senses? (I must admit I don't care much about senses so I didn't follow it closely)
Cheers, VIGNERON (talk) 10:40, 18 October 2018 (UTC)
The most interesting one to me seems to be item for this sense (P5137), although the naming is terrible :) Ideally senses could consist of only one statement with this property; see the pictures in « tour(fr) » for example, they are redundant with the one on the « tour » item. author  TomT0m / talk page 10:52, 18 October 2018 (UTC)
@TomT0m: ideally yes, but for at least 95 % of the senses there will be no items, and even when there is an item, it may not be granular enough or may be too granular (cf. tour (L2332), where there is only one item lathe (Q187833) for the two senses). So I would prefer to keep the image (P18) as a claim of the senses. Cdlt, VIGNERON (talk) 11:15, 18 October 2018 (UTC)
This is something we actually need to think about. Senses could become second-class items that we could describe with standard properties such as « subclass of » exactly like an item, with the only difference that they are not proper items. There is a clear overlap, so this might lead to duplication of information, inconsistencies in the representation of the same kind of information, and so on. I think the fact that there is no item right now should not stop us from creating one. author  TomT0m / talk page 11:24, 18 October 2018 (UTC)
@TomT0m: yes, we need to think about it, I don't fear to create new items, but I fear the reaction of the community if I flood them with billions of items ;) VIGNERON (talk) 12:36, 18 October 2018 (UTC)
I made some tests in the sandbox, seems to work. What are the Senses you're trying to link to each other @KaMan, Nikki:? Lea Lacroix (WMDE) (talk) 14:42, 18 October 2018 (UTC)
@Lea Lacroix (WMDE): If I load that lexeme and click "add value" for the existing "translation" property, I get the message KaMan mentioned. If I try to edit the existing one, I get "Handling of "wikibase-entityid" values is not yet supported." - Nikki (talk) 14:47, 18 October 2018 (UTC)
@Nikki: It looks like I can't reproduce it :( Can you create a ticket with screenshots? A few translation statements have been created here and here, and I filed this ticket because the display is not great. Lea Lacroix (WMDE) (talk) 14:55, 18 October 2018 (UTC)
@Lea Lacroix (WMDE): I searched Phabricator and it looks like it's the same problem as phab:T195402 (the error message is slightly different because it's a different datatype). Removing the local storage entry "MediaWikiModuleStore:wikidatawiki" (which was suggested in the comments there) has fixed it for me. I don't know if there's already another ticket for fixing it properly. - Nikki (talk) 15:11, 18 October 2018 (UTC)
translation (P5972) works fine for me, see janvier (L1183) or Genver (L8146). I'm just wondering if this property is really useful and relevant. Can't we easily get the same information with a query for senses that share the same item for this sense (P5137)? So why do we need this property? Cdlt, VIGNERON (talk) 15:41, 18 October 2018 (UTC)
@KaMan, Nikki: Per phab:T195402: If you're experiencing this issue, try deleting your cookies (logging out and back in might be enough). Let me know if the problem still appears. Lea Lacroix (WMDE) (talk) 16:06, 18 October 2018 (UTC)
@Lea Lacroix (WMDE): As I said above, deleting "MediaWikiModuleStore:wikidatawiki" from local storage worked. The problem came back though and I had to delete it again to fix it. - Nikki (talk) 17:47, 18 October 2018 (UTC)
And again... :( - Nikki (talk) 18:24, 18 October 2018 (UTC)
@Lea Lacroix (WMDE): fortunately I was operating in incognito mode of the browser, so instead of deleting cookies or deleting something in local storage I just restarted my session in incognito mode. It helped. KaMan (talk) 06:39, 19 October 2018 (UTC)
Congrats to the team :) one big and important step for the project. author  TomT0m / talk page 10:52, 18 October 2018 (UTC)
@Lea Lacroix (WMDE): Thank you for your hard work on delivering senses. I have just added some translations to leden (L1202), and what I find strange there is that I can only see words in my preferred languages (which means English only in my case); for others I only get the ID. I think I should see all lemmas. The second issue is that one does not see the language of the translated word, which might become confusing. And I don't think a language qualifier should have to be added manually. --Lexicolover (talk) 22:32, 18 October 2018 (UTC)
@Lexicolover: I don't think it's based on preferred languages - I have en, es, and fr, and I still only see the English translation in your translations list for this lexeme sense. ArthurPSmith (talk) 14:20, 19 October 2018 (UTC)
@ArthurPSmith: You are right. Now I can see the Finnish translation, so it definitely isn't caused by preferred languages. I guess maybe something with "wikibase cache"? Either way, it seems to me that translations are not good enough to be used yet. --Lexicolover (talk) 19:54, 19 October 2018 (UTC)

A reminder that Template:Lexicographical properties exists: all new sense-related properties are listed there. KaMan (talk) 08:17, 20 October 2018 (UTC)
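
The alternative raised in this thread, deriving translations from senses that share the same item for this sense (P5137) value instead of storing translation (P5972) statements, can be sketched as follows. This is only an illustration of the idea: the function name, the sense IDs, the language codes and the item ID in the usage note are hypothetical, and the input is assumed to have been fetched already.

```python
from collections import defaultdict


def translations_by_shared_item(senses):
    """Map each sense ID to the senses in other languages linked to the same item.

    senses: iterable of (sense_id, language_code, item_id) tuples.
    Senses whose item is not shared with another language get no entry.
    """
    by_item = defaultdict(list)
    for sense_id, language, item_id in senses:
        by_item[item_id].append((sense_id, language))

    translations = defaultdict(set)
    for linked in by_item.values():
        for sense_id, language in linked:
            for other_id, other_language in linked:
                if other_language != language:
                    translations[sense_id].add(other_id)
    return {sense_id: sorted(targets) for sense_id, targets in translations.items()}
```

For example, three senses in different languages that all point to one hypothetical item "Q108" would each list the other two as translations, while a sense whose item has no other-language sense would not appear in the result. The trade-off noted above still applies: this only works where an item of the right granularity exists.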

Indicate that a word exists in a dictionary

How should we indicate that a word exists in a dictionary? I have been using described by source (P1343) (see, e.g., mandag (L10723)), but I vaguely recall seeing another approach. I suppose those dictionaries that have a proper identifier could have a dedicated property. In the Danish case I have left the deep link in the reference, which might not necessarily be the best way. — Finn Årup Nielsen (fnielsen) (talk) 14:37, 19 October 2018 (UTC)

I use described by source (P1343) a lot (and I'm not the only one); I don't remember another way (except maybe described at URL (P973), but I don't think it's a good idea here). Cdlt, VIGNERON (talk) 15:04, 19 October 2018 (UTC)
I use nine dedicated properties for most important Polish online dictionaries and described by source (P1343) for all others. KaMan (talk) 15:21, 19 October 2018 (UTC)

If you indicate for every word of a dictionary that it is part of the dictionary, you basically duplicate the nomenclature of the dictionary. When the dictionary is normative, a particular choice has guided the presence or absence of words in the nomenclature. Don't you think there is a risk of legal infringement here? Noé (talk) 07:48, 20 October 2018 (UTC)

@Noé: see m:Wikilegal/Lexicographical Data « The organization of words in alphabetical order typically will not be creative enough to be copyrighted, barring some very unusual choice or arrangement by the dictionary author. » so in most cases, I don't see any problem (even less if you work with PD or free dictionaries). Would you have an example of a dictionary where the choice is unusual? Cdlt, VIGNERON (talk) 09:06, 20 October 2018 (UTC)
My comment was not about organization but about nomenclature: the selection of words included in and words left out of the dictionary. About organization, some dictionaries are arranged by roots, with derivatives gathered under the root, while others have one entry for each derivative word; the same goes for colloquial expressions, sometimes under a root, sometimes each under its own entry. Some dictionaries have an entry for feminine agent nouns (like menuisière), while others put them under the masculine entry. -- Noé (talk) 09:36, 20 October 2018 (UTC)
I don't know, for me both organization and nomenclature are (except unusual choice) based on an objective method not on subjective creativity. Anyway, as a Wikisourcerer, I will work on PD dictionaries (there is already plenty of them to keep me busy for years), so no problem here. Cheers, VIGNERON (talk) 09:53, 20 October 2018 (UTC)
Of course, PD dictionaries are ok. For nomenclature, I think it is a separate issue not analysed in Wikilegal/Lexicographical Data. I think it should be evaluated to avoid any legal infringement. -- Noé (talk) 09:59, 20 October 2018 (UTC)

Senses for forms

As far as I can tell the senses can currently only be added for the base form of a lexeme. In the example of repülőtér (L31634) this is airport (Q1248784). But L31634-F19 actually means "at the airport". This concept will probably not have an item any time soon, but it would be nice to add the gloss to the form. Is it planned to also have senses for forms? --15:12, 21 October 2018 (UTC)

Generally, forms don't need separate data about sense. The meaning "at the airport" results from the fact that "repülőtéren" is a superessive form of a word which means "airport".--Shlomo (talk) 18:11, 21 October 2018 (UTC)
@Shlomo: What about the locative cases that also have abstract meanings? Delative case (L31634-F31) can mean "over the airport" or "about the airport" when e.g. one is "talking about the airport". Maybe that kind of information about how the case can be used should be part of an item that describes the delative case in Hungarian? --Tobias1984 (talk) 19:00, 21 October 2018 (UTC)
That seems more like something Wikipedia should handle. Wikidata doesn't seem equipped to handle a description of all the intricacies of grammar in every language. —Rua (mew) 19:45, 21 October 2018 (UTC)

Order of Forms (and Senses) is fixed

Hello all,

Just to let you know that the bug that caused Form IDs to be sorted alphabetically rather than numerically (i.e. F10 appeared before F2) is now fixed. Sorting also works correctly for Senses. You may have to purge the page to see the correct order. If you notice anything weird, let me know. Lea Lacroix (WMDE) (talk) 10:08, 22 October 2018 (UTC)
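
For anyone sorting Form or Sense IDs in their own scripts, the difference between the two orderings can be reproduced with a small sketch. It does not reflect how the fix was implemented in the software; the function name and the ID pattern are assumptions based on IDs such as L31634-F19 seen on this page.

```python
import re

# Matches sub-entity IDs such as "L31634-F19" or "L2330-S2".
SUB_ENTITY_ID = re.compile(r"L(\d+)-([FS])(\d+)")


def numeric_sort_key(sub_entity_id):
    """Sort key that compares the numeric parts as integers, not as text."""
    match = SUB_ENTITY_ID.fullmatch(sub_entity_id)
    if match is None:
        raise ValueError(f"unexpected ID format: {sub_entity_id!r}")
    lexeme_number, kind, number = match.groups()
    return (int(lexeme_number), kind, int(number))


form_ids = ["L31634-F10", "L31634-F2", "L31634-F1"]
alphabetical = sorted(form_ids)                      # text order: F10 sorts before F2
numerical = sorted(form_ids, key=numeric_sort_key)   # integer order: F1, F2, F10
```

Plain string sorting compares character by character, so "F10" precedes "F2"; converting the trailing number to an integer restores the expected order.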

Feature request: link labels/aliases from items back to the lexemes

Now that we have senses and that some are linked to an item using item for this sense (P5137), would it be possible to link the labels/aliases from the linked item back to the lexeme? I assume that it would be a matter of checking automatically which label matches any form in a given language of the linked lexeme item, and enable a link to that lexeme item. Thoughts about this?--Micru (talk) 13:49, 19 October 2018 (UTC)

@Micru: Firstly: not lexeme but sense. Secondly: do you want to have in each item links to 6000+ lexemes (according to number of languages)? at least, not mentioning synonyms... --Infovarius (talk) 15:56, 19 October 2018 (UTC)
@Infovarius: Why not? The label text is already on the items, we wouldn't be adding more than what there is, only converting the plain text into links as a way to access more quickly the lexicographical data.--Micru (talk) 09:05, 20 October 2018 (UTC)
@Micru: I'm not sure I understand: do you want an inverse property for item for this sense (P5137), or is it something else? And why? As Infovarius said, this is a one-to-many relationship (and I would say potentially 100,000+, as several lexemes, even within one language, can refer to the same concept), and in such cases we usually don't create an inverse property (same thing for author (P50), for instance). Cdlt, VIGNERON (talk) 09:11, 20 October 2018 (UTC)
@VIGNERON: No, I do not want an inverse property. I want the labels/aliases of items to be converted into links to the appropriate form/lexeme. This is a 1-to-1 relationship. As an example, take algeps (L16117), which has a sense linked to gypsum (Q82658). What I want is for the text "algeps" on Q82658 to be transformed into a link to L16117, because a) there is a sense connected to that item, and b) there is at least one form on that lexeme that matches that specific string on the item in that language. There should be only one lexeme for each label/alias that has a sense linked to the item; if there are more, it is a duplicate.--Micru (talk) 14:20, 22 October 2018 (UTC)
@Micru: oh, I see now. To be sure we are on the same page, what you want is: for a given Qitem, if there is a Litem linking back to it with item for this sense (P5137), then add a link to this Litem on the same-language label of the Qitem (and indeed this is a 1-to-1 relation). This sounds like something that could be done with a JavaScript gadget. @Lea Lacroix (WMDE): do you know someone who could look into that? Cdlt, VIGNERON (talk) 14:26, 22 October 2018 (UTC)
@VIGNERON: Yes! That's exactly what I had in mind. I hope it is possible with a gadget. Looking forward to hearing what Léa says.--Micru (talk) 15:42, 22 October 2018 (UTC)
Hello, and thanks for the feature request. This was not part of the initial plan, so we need to evaluate the feature more precisely, both its feasibility and how important it is for the community. I can't promise anything for now. I agree that this could be done in JavaScript on top of the interface, so it doesn't have to be done by the development team: a community developer with JavaScript skills could also do it :) Lea Lacroix (WMDE) (talk) 06:58, 23 October 2018 (UTC)
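The matching rule described in this thread (a sense points at the item, and a form of the same lexeme equals the label in the same language) can be sketched on in-memory data. This is only a sketch of the logic, not a gadget: the `Lexeme` class and `linkable_labels` function are hypothetical, and the language code "ca" for algeps (L16117) is an assumption made for the example.

```python
from dataclasses import dataclass, field

@dataclass
class Lexeme:
    """Simplified, hypothetical stand-in for a lexeme with its forms and senses."""
    lexeme_id: str
    language: str
    forms: set[str]
    sense_items: set[str] = field(default_factory=set)  # targets of P5137

def linkable_labels(item_id, labels, lexemes):
    """Map each (language, label) of an item to the single lexeme it could link to.

    A label is linked only if exactly one lexeme in that language has a sense
    pointing at the item and a form equal to the label text.
    """
    links = {}
    for lang, texts in labels.items():
        for text in texts:
            matches = [
                lx.lexeme_id
                for lx in lexemes
                if lx.language == lang
                and item_id in lx.sense_items
                and text in lx.forms
            ]
            if len(matches) == 1:  # more than one would suggest a duplicate
                links[(lang, text)] = matches[0]
    return links

# Sample data; the "ca" language code is an assumption for illustration.
labels = {"ca": ["algeps"], "en": ["gypsum"]}
lexemes = [Lexeme("L16117", "ca", {"algeps"}, {"Q82658"})]

links = linkable_labels("Q82658", labels, lexemes)
print(links)
```

The English label stays plain text here because no lexeme in the sample data has a matching form and sense.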

Showcase Lexemes with Senses

Hello all,

Thanks for your work on Senses and the experiments you have already made. Thank you also for your feedback, which I'm currently collecting; over the next weeks I will analyze how we can improve the interface based on your needs.

I was wondering if you already noticed some Lexemes that are well filled and organized, including Senses, and could be used as examples to show the structure of Lexemes?

Cheers, Lea Lacroix (WMDE) (talk) 09:52, 22 October 2018 (UTC)

@Lea Lacroix (WMDE): What about Polska (L9751) KaMan (talk) 10:25, 22 October 2018 (UTC)

I came to request the same thing. Specifically it would be great to fill out the example words given in the table at Wikidata:Lexicographical_data/Documentation#Data_Model, so that it can act as a temporary list of Showcase_lexemes similar to Wikidata:Showcase_items. Currently "book" is the only Arabic, English, or German sample lexeme with any sense info. Thanks! Quiddity (talk) 16:00, 23 October 2018 (UTC)

What form should senses translate?

When a lemma has multiple forms, the question arises: which of them should be translated into a sense? A verb for example could be translated as "to walk" but also "I walk", "walks", "walked" or many other possibilities. English verbs are lemmatised with the infinitive, but each language has its own customs regarding this. If a verb is lemmatised with something other than the infinitive, should senses reflect this other form's meaning (translate the lemma form exactly), or should senses be written in the lemma form of the language the sense is written in (translate lemmas with lemmas)? Both approaches have downsides: translating the lemma form exactly can be tricky and unintuitive (Arabic lemmatises verbs in the past tense for example), and even unnecessarily complex if the language uses a lemma form that does not translate easily into English. On the other hand, translating lemmas with lemmas could mean that the lemma form and the sense don't match exactly in meaning, like an Arabic verb that's lemmatised in the past tense, being translated in an English sense with the infinitive. —Rua (mew) 10:44, 22 October 2018 (UTC)

@Rua: I'm not sure what you mean by "translate" here - you're not referring to the new translation (P5972) property, are you? If you are referring to the wording that should go into the "gloss" in a sense, I don't think that's hard to do for nouns, adjectives, or adverbs - just use the simplest case: nominative, positive, etc. For verbs it is trickier - in English I think I've generally been using the infinitive (minus the word "to") but sometimes the present participle ("-ing" form) reads better. I guess we should try to be consistent? ArthurPSmith (talk) 17:52, 22 October 2018 (UTC)
I don't know about nouns, probably you are right (but I can imagine a situation where there is no nominative case in a language). But as for verbs: some languages have no infinitive and some have a different "simple" form - like "I walk" in Greek or Latin. So I understand Rua's question. I suppose that we can still link lemmata with different initial forms to each other, and then it will be possible to derive this form by comparing the main lemma label with the list of forms inside. --Infovarius (talk) 20:12, 22 October 2018 (UTC)
I think you are somehow overthinking it. We build Wikidata on lexemes, not lemmas, which means that a sense applies to all possible forms the word can have, and it is only a matter of convention that one of those forms is used as the lemma and to describe the sense of the word. Trying to fit translations to the forms of lemmas is simply impossible because Wikidata is not bilingual (or should not be), so it is not a matter of English × another language only. Some languages can use as their dictionary form some kind of concept that does not exist in another language or does not make sense there. And there are common cases where such an approach would not even come to mind - like fitting the forms of a plurale tantum noun in one language to a common noun in another. So my vote is for the lemma - lemma approach; just keep it simple. --Lexicolover (talk) 20:55, 22 October 2018 (UTC)
+1, "to walk", "I walk", "walks", "walked" are the same lexeme (no matter which main lemma is indicated; likewise, for a Qitem the main label doesn't always make sense depending on the context). At least in English; if in another language they are multiple lexemes, just put multiple values. Then, when someone wants to use a specific form and the main lemma is not appropriate, it's possible to query the corresponding form (or the closest one if no exact corresponding form exists). I see no example where it's necessary to explicitly store data about translated lemmata in Wikidata. Cdlt, VIGNERON (talk) 17:50, 24 October 2018 (UTC)

Standards about use of 'synonym' property?

Some senses may have a large number of synonyms. Given a sense with 6 synonyms, if we are to give a complete representation in the data, would we then list the other 6 synonyms on each of those 6 synonymous senses? This would mean listing NxN synonyms for each synonymous word sense denoting some concept, where that concept is denoted by N word senses. This seems to add an unnecessary level of redundancy. Would it perhaps be better to choose one word sense from the set of synonymous senses to be a hub that includes all synonyms, and just have the other senses only link to the hub synonym? Have there been any standards established for how to enter synonymy relations? Liamjamesperritt (talk) 01:15, 23 October 2018 (UTC)

The ideal solution would be to have a separate Wikidata object that represents the meaning of a word, and when multiple senses link to this object, they are considered synonyms of each other. I don't know if that can be done in Wikidata as it stands now, though. It would also solve the problem of translations, which have the same NxN problem because they are in essence cross-language synonyms. —Rua (mew) 10:40, 23 October 2018 (UTC)
That would make a lot of sense. I suppose for nouns, we can use Wikidata items to act as this "meaning object" that we can link noun senses to, but there is no equivalent object for verbs, adjectives, adverbs, etc. Should we hold off from entering synonyms and translations for verbs, etc. until there exists such an object? Liamjamesperritt (talk) 20:42, 23 October 2018 (UTC)
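To put rough numbers on the three options discussed above, here is a small counting sketch. The function names are made up for this example, and the "shared meaning object" model is the hypothetical one Rua describes, not an existing Wikidata feature. Strictly, full pairwise linking needs N×(N-1) statements rather than N×N, since a sense does not list itself.

```python
def pairwise_statements(n: int) -> int:
    """Every sense lists every other sense as a synonym."""
    return n * (n - 1)

def hub_statements(n: int) -> int:
    """One hub sense lists the other n-1 senses; each of those links back to the hub."""
    return 2 * (n - 1)

def shared_meaning_statements(n: int) -> int:
    """Every sense links once to a single shared 'meaning' object (hypothetical model)."""
    return n

# A sense with 6 synonyms means 7 synonymous senses in total.
n = 7
counts = {
    "pairwise": pairwise_statements(n),
    "hub": hub_statements(n),
    "shared": shared_meaning_statements(n),
}
print(counts)
```

The pairwise count grows quadratically while the other two grow linearly, which is the redundancy concern raised in the opening comment.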

Hypothesis about senses


I was thinking about how senses will be structured and what they will contain, and here are some ideas. They may be right or wrong, I don't know; on the contrary, I'd love to get others' points of view:

  • most lexemes will have one or a few senses (it depends where we put the threshold of granularity and obviously some exceptional lexemes will have a lot of senses)
  • senses will contain little data
    • in most cases I guess there will only be item for this sense (P5137) (as most data are already stored in the corresponding items, in general there is no need to duplicate them) and at most 2-3 properties
  • most senses won't even have item for this sense (P5137),
    • temporarily because the corresponding items don't exist yet
    • permanently because an item can't be created (not for nouns, but I'm thinking of other classes like verbs or adjectives: how to create an item in these cases?)
  • several item for this sense (P5137) will point to the same items
    • obviously across languages, as most items have a lexeme in each language (which is the label and/or the alias)
    • even inside one language (gwenn (L30900) and gwenn (L30901) are the noun and the adjective for the color white (Q23444), there will never be two items for such close concepts)
    • but will it be closer to a one-to-many or to a many-to-many relationship?

What do you think of these ideas? Do you have any other general idea on how senses will work?

Cdlt, VIGNERON (talk) 09:43, 20 October 2018 (UTC)

  • As they are hypotheses, you have to wait a year to test them. How would you see the thesaurus function working? --- Jura 10:11, 20 October 2018 (UTC)
  • permanently because an item can't be created (not for nouns, but I'm thinking of other classes like verbs or adjectives: how to create an item in these cases?)
    For verbs
    see eating (Q213449), which is an interesting mixed-up item: in French it is supposed to denote the food, while in English it's supposed to denote the act of eating. In the English sense, it's an exact match for « manger » (fr). Ontologically, verbs can denote some kind of action (Q4026292), so any subclass item of it is a candidate for a verb sense, no problem at all.
    For adjectives
    hints could probably be found with a property like has quality (P1552) or in articles like fr:Beau, beauty being the quality of something beautiful. Ontologically it's tempting to associate this kind of adjective with the class of all objects that have this quality (a blue object « has quality » being blue, for example, which could be linked to the fictional lexeme « blueity »). It's so tempting that most of the time we could very well treat the quality itself as the definition of the class and use the same item for the class of objects with this quality. Maybe another property for the senses of adjectives would be a good idea: « denotes the quality of elements of this class » (which would link a sense of « childish » to « child (Q7569) », for example). As for colors, I would not be surprised if we actually ended up with different items, one for « blue light » and one for « blue object » (an object emitting blue light directly or under white light).
    It also should not be too hard to find items for adverbs like « donc(fr) » that express logical consequence. author  TomT0m / talk page
  • several item for this sense (P5137) will point to the same items: sure, there are routinely many synonyms in a language … author  TomT0m / talk page 16:53, 20 October 2018 (UTC)
For verbs: The subclasses from this AAT facet are still waiting to be (matched/)created from Mix'n'match or elsewhere; and they are needed to properly describe narrative works, literary works and visual artworks. For adjectives: From here needed to describe materials etc. Progress on both in item namespace isn't that far yet, but I hope that will change. --Marsupium (talk) 11:09, 21 October 2018 (UTC)

Regarding the number of senses per lexeme, there is previous work on this stating a correlation between word frequency and number of senses in a language.

Regarding the small number of statements per sense, I am not convinced. I hope for example sentences per sense, frequency counts, first usage with this sense, register, etc.

Regarding items for adjectives, verbs, etc. - a good question, and I have no idea what the answer will be. I am hoping that we will figure this out over time together. I am sure we will discover interesting patterns in how to represent the senses, and that we might a few times restructure and revise our approach. --Denny (talk) 17:01, 22 October 2018 (UTC)

@Vigneron: to answer your question about a thesaurus function: I think this would work with statements on senses. So in one way or the other, you need to factor this into your hypotheses. --- Jura 21:37, 23 October 2018 (UTC)
@Jura1: do you mean a query like French words about countries ? (crude query, needs improvements but the data are not here yet) VIGNERON (talk) 08:59, 25 October 2018 (UTC)
Just noticed that w:Thesaurus is very different from fr:Thésaurus. More like the first one. BTW, lexemes for country names should all be there (in French), adjectives and demonyms still need work. --- Jura 09:22, 25 October 2018 (UTC)

Lexeme data modeling - do we need a wikiproject, or somewhere here?

From reading and participating in some of the property proposal discussions for lexemes I think we need a better place to discuss how to model relationships between lexemes (and their components). The current Wikidata:Lexicographical data/Documentation page is focused on the underlying wikibase modeling. That's fine, but it means it's probably not the best place to describe the community consensus on things like compound words, contractions, etymology, relationships between similar words, etc. How do we best use the various properties and the fields provided in the interface? Notability might also be something we want to set down better for lexemes. We've discussed a lot of these things on this page, but most of those discussions have been bumped to archives where they're hard to find. So I guess I'm thinking we should have either:

  1. A page here under Wikidata:Lexicographical data for "best practices" or something like that, or
  2. Create a new Wikidata:WikiProject Lexemes that would focus on the property and best practices side of lexemes

Any strong opinions on this? ArthurPSmith (talk) 19:03, 24 October 2018 (UTC)

Definitely. As someone not much involved in this, I spent some time yesterday trying to find such a page before deciding that it obviously doesn't exist. --Marsupium (talk) 20:04, 24 October 2018 (UTC)
  • This page seems fairly useful for that purpose. Having more pages isn't really going to simplify this. Maybe feature requests should be made on Wikidata:Contact the development team going forward. When discussed here, they don't seem to feed into the development cycles anyways. --- Jura 03:03, 25 October 2018 (UTC)
@ArthurPSmith: No strong opinion from my side :) This page used to centralize the feedback, both technical and structure-related, at the very beginning of the project, but whenever you feel it's the right time to create a proper WikiProject, feel free to do so.
@Jura1: Feature requests can also go to Wikidata:Contact the development team. Both pages are watched and the pieces of feedback are taken into account. You can see that from all the bugs that have been fixed and feature requests implemented over the past months. Now, this is not an immediate process. The requests are analyzed, prioritized, and integrated into a bigger picture. This can take some time. Lea Lacroix (WMDE) (talk) 14:55, 25 October 2018 (UTC)

Entering senses

I couldn't find a discussion or a Phabricator bug about it, so it should be noted. I have to enter something like "L2222-S1" in sense-valued properties (like translation (P5972)) now. When will it be possible to have human-writable input (like "monday" and then choose from the list)? --Infovarius (talk) 14:56, 25 October 2018 (UTC)

Yes I noticed this also, there's no autocomplete for Senses that works right now. ArthurPSmith (talk) 15:02, 25 October 2018 (UTC)
It is restricting the growth because I can't see which values can be added and I have to find them somehow else. P.S. Funny that when I click edit I cannot see "L-S" notation and cannot copy it. --Infovarius (talk) 15:15, 25 October 2018 (UTC)

New elements compared to Wiktionary / print dictionaries

I think it would be good to attempt to list things that can be done with Lexemes that aren't possible with data from Wiktionaries or a print dictionary. Obviously, over the years, thanks to templates and categories, many Wiktionaries have grown fairly structured.

Maybe Wikidata:Lexicographical data/Statistics/indirect translations is a good example, though experienced Wikidata users might suggest that this was already possible with items.

Wikidata:Lexicographical_data/Statistics/translations still needs some work. --- Jura 11:59, 25 October 2018 (UTC)

Just a comment, those are neat list pages, thanks for generating them. ArthurPSmith (talk) 20:55, 25 October 2018 (UTC)
  • Maybe the storage of quotes attesting specific lexemes or forms are the most useful ones. --- Jura 10:55, 26 October 2018 (UTC)

Format and inflection

Please see Property talk:P5911. --- Jura 11:40, 26 October 2018 (UTC)

usage example (P5831) at lemma or sense level

If I add usage example (P5831) to Lexeme, at which level should I add it? I would associate usage example (P5831) to the sense (see, e.g. minute (L2500)) and not to the Lemma (see, e.g. mandag (L10723)). Any thoughts? Mfilot (talk) 09:07, 28 October 2018 (UTC)

@Mfilot: usage example (P5831) was designed to be used at lexeme level with two qualifiers for forms and senses, but second qualifier is delayed due to discussion about it - see property proposal. At least all 390 usages in Polish lexemes are at lexeme level, not sense level. KaMan (talk) 09:35, 28 October 2018 (UTC)
@KaMan: Thank you for pointing me to the discussion; I wasn't aware of it. I'll move the usage example (P5831) from senses to lexeme for minute (L2500). Mfilot (talk) 10:48, 28 October 2018 (UTC)
Is there a particular reason for putting them at the lexeme level? English Wiktionary puts them at the sense level. —Rua (mew) 12:31, 28 October 2018 (UTC)
From the query service point of view it is not important whether it is at lexeme or sense level - the query is just written a little differently. From a lexical point of view one can argue that assigning an example to a form is as important as assigning it to a sense. There can also be cases where a lexeme is intentionally used twice with different senses in the same usage example - with the example at sense level it would be hard to assign the second sense. Polish Wiktionary puts usage examples at the lexeme level with a pointer to the sense - like Wikidata. @Lucas Werkmeister: as the author of the original proposition. KaMan (talk) 14:02, 28 October 2018 (UTC)
I should like to point out that there are two movements regarding usage example (P5831), image (P18) and item for this sense (P5137). On one hand usage example (P5831) is set on lexeme level but on the other hand image (P18) and item for this sense (P5137) are set on sense level (see, e.g. tour (L2330) mentioned earlier in the discussion). In my opinion the three properties should be handled equally. Currently my preferred option would be to set those on sense level. Mfilot (talk) 16:09, 28 October 2018 (UTC)
These three should not be handled equally because image (P18) and item for this sense (P5137) are related only to senses while usage example (P5831) is also as importantly related to forms. KaMan (talk) 16:12, 28 October 2018 (UTC)
I don't see how that changes anything, though. A sense-level property can still have a form as the value. The purpose of usage examples is to illustrate the sense, not to illustrate the form, so that makes them fundamentally different. —Rua (mew) 17:54, 28 October 2018 (UTC)
Why? You may very well need an example to illustrate the use of a particular form. In fact, you may need an example to illustrate other properties as well (pronunciation, gender, [in]animality, [in]transitivity etc. etc.), so that using examples as qualifiers of corresponding statements makes sense too.--Shlomo (talk) 22:51, 28 October 2018 (UTC)
@KaMan: I would say, put the example at the sense level, if its objective is to illustrate sense, and put it at the form level, if it should demonstrate use of a particular form. As for now, I can't find a good reason to put an example-statement at the lexeme level, but if there would be some, why not?--Shlomo (talk) 22:51, 28 October 2018 (UTC)

QuickStatements v2 now supports lexicographical data

As a birthday present from Magnus Manske and me, QuickStatements v2 now supports editing lexicographical data on the statement level: lexeme, form and sense IDs can be specified as the subject of statements to be added or removed, or as the values of statements, qualifiers or references. For example, the following code produced this diff:


Editing lexeme lemmas, languages, or lexical categories, form representations or grammatical features, or sense glosses is not supported yet; neither is creating new lexemes, forms and senses.

Go forth and edit! (at a reasonable pace (and preferably with sources)) --Lucas Werkmeister (talk) 21:53, 28 October 2018 (UTC)

Query Lexemes in the Query Service

Hello all,

Graph of Lexemes derived from L2087

I’m very happy to announce that another important feature for Lexicographical Data has been deployed: the ability to query Lexemes in the Query Service.

Here are a few examples:

The queries are based on the RDF mapping that you can find here. Feel free to help improving the documentation, so people can understand how to build queries out of Lexemes.

Thank you very much to Tpt who’s been doing a huge part of the work by mapping Lexemes in RDF, and Smalyshev (WMF) who made the RDF dumps available and integrated in the Query Service.

Feel free to play with it, bring some of these ideas of queries to life, and let us know if you find any issue or bug. These can be stored as subtasks of this one on Phabricator. If you have questions, you can also ping Stas onwiki or on IRC.

Cheers, Lea Lacroix (WMDE) (talk) 08:06, 16 October 2018 (UTC)

Many thanks to all involved. That's great news and a lot of testing to do :) KaMan (talk) 08:33, 16 October 2018 (UTC)
  • Good work. BTW, there seems to be a licensing incompatibility with some of the schemes referenced in the triples. Can they be replaced with "wikibase:". Makes writing queries easier too. --- Jura 10:21, 16 October 2018 (UTC)
    • @Jura1: Could you explain a bit more about licensing? Smalyshev (WMF) (talk) 17:55, 16 October 2018 (UTC)
    • WMDE wanted lexemes to be CC0. If you are adding a primary mapping for key features of them to a scheme that isn't, somehow they fail that objective. Supposedly, you could still add it as a secondary mapping. It might also limit re-use of the software outside WikiMedia. --- Jura 13:15, 17 October 2018 (UTC)
      • @Jura1, Smalyshev (WMF): Are you talking about ontolex? The file at states a licence (dc:rights) value of CC-Zero, so it's fine. ArthurPSmith (talk) 15:17, 17 October 2018 (UTC)
        • That seems to be contradicted by statements elsewhere. Using the standard wikibase: seems preferable. --- Jura 15:24, 17 October 2018 (UTC)
          • (citation needed). Using a common standard makes federated querying easier and so is preferable to using custom URI's if the meaning is the same. ArthurPSmith (talk) 16:10, 17 October 2018 (UTC)
            • Supposedly you read Lea's announcement about not using others'. Obviously, it would have been easier to do this over at Wiktionary, especially as I have come to the conclusion that the French one is actually fairly complete. --- Jura 16:18, 17 October 2018 (UTC)
              • You said this added a "primary mapping for key features of [lexemes] to a scheme that isn't [CC0]". What scheme is not CC0 in the new mapping? As I just linked, ontolex is definitely CC0. ArthurPSmith (talk) 17:21, 17 October 2018 (UTC)
    • @Smalyshev (WMF): forgot to ping you. Can we go ahead and change this. If we do it now, it's still fairly easy to update things. --- Jura 10:49, 18 October 2018 (UTC)
      • @Jura1: It's no easier to do it now than at any other moment in the future; however, I am not sure why we should do it. Technically it is possible, sure, but why? ontolex: is a standard ontology, which is used in the structured data world and would make it easier to integrate with non-wikibase resources. For querying, there's zero difference between them - it's just a class name. I'm still not sure what the license problem is - it seems to be CC0. It certainly can be changed, technically, but I'd like to understand the argument why. Smalyshev (WMF) (talk) 05:32, 19 October 2018 (UTC)
        • @Smalyshev (WMF): have a look at [1]. It's not really my role to check this, so you might want to ask the relevant staff. In any case, it doesn't seem to meet the intent of WMDE lexeme namespace proposal (maybe @Lydia Pintscher (WMDE): wants to clarify). For you, the effort may be the same, but for users it's better to fix this now than later. Even if a detailed study may find it to be compatible, I don't think there is much to be lost by making the prudent choice now. --- Jura 21:07, 23 October 2018 (UTC)
          • I've spent some time looking into this and for all I can tell we're fine based on among others --Lydia Pintscher (WMDE) (talk) 03:54, 26 October 2018 (UTC)
            • @Lydia Pintscher (WMDE): thanks for looking into this. The odd thing is that it's contradicted by the more detailed documents (the one I linked above and the full description of the framework). So the statement there might apply to the file, but not to the framework. Accordingly, users of Wikibase or Wikidata might run into problems assuming that it may be. Is there any downside in switching the prefixes? --- Jura 11:05, 26 October 2018 (UTC)
  • This is great! I created a page using the Wikidata list template to automatically do a few stats: Wikidata:Lexicographical data/Statistics/AutoGenerated. ArthurPSmith (talk) 14:15, 16 October 2018 (UTC)
    • @ArthurPSmith: Why did some numbers go down here and here? I don't see any recent activity that would explain it. Older versions seem to exhibit similar behaviour. --Njardarlogar (talk) 13:31, 27 October 2018 (UTC)
      • @Njardarlogar: and up and down again more recently. That's odd. I'm guessing possibly one of the servers responding to queries is missing some data? @Smalyshev (WMF): do you have any ideas why things might not be consistent from one day to the next? ArthurPSmith (talk) 14:50, 29 October 2018 (UTC)
  • @Lea Lacroix (WMDE): Hmm - to Jura's point just above, when I query using wikibase:Lexeme or wikibase:Form I get nothing, but using the ontolex types I find everything. It looks like the export doesn't quite match what is stated in mw:Extension:WikibaseLexeme/RDF mapping? ArthurPSmith (talk) 14:45, 16 October 2018 (UTC)
    • @ArthurPSmith: this is intentional, for performance reasons we only keep one class. The dump and RDF export have both. Smalyshev (WMF) (talk) 17:55, 16 October 2018 (UTC)
      • Ah, ok I guess that's fine. I'm running into an issue with query timeouts in lexemes, not sure why it should be happening since we don't really have many yet - do you want to hear about it?... ArthurPSmith (talk) 17:58, 16 October 2018 (UTC)
      • And whatever the performance problem was seems to have resolved - or maybe I just changed the query enough to get it to work now, but it's quite fast. I've added a number of specific examples to the Wikidata:Lexicographical data/Ideas of queries page.
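As an illustration of the point about ontolex types in this thread, here is a sketch that only builds a query string counting lexemes per language; it does not contact the Query Service. The function name `lexeme_count_query` is hypothetical, and the use of `ontolex:LexicalEntry` with `dct:language` reflects my reading of the RDF mapping, so treat the predicates as assumptions to check against mw:Extension:WikibaseLexeme/RDF mapping.

```python
def lexeme_count_query(limit: int = 10) -> str:
    """Build a SPARQL query string counting lexemes per language.

    Hypothetical helper; the class and predicate names are assumptions
    based on the RDF mapping and should be verified before use.
    """
    return f"""
PREFIX ontolex: <http://www.w3.org/ns/lemon/ontolex#>
PREFIX dct: <http://purl.org/dc/terms/>

SELECT ?language (COUNT(?lexeme) AS ?count) WHERE {{
  ?lexeme a ontolex:LexicalEntry ;
          dct:language ?language .
}}
GROUP BY ?language
ORDER BY DESC(?count)
LIMIT {limit}
""".strip()

query = lexeme_count_query(5)
print(query)
```

The resulting text could be pasted into the Query Service interface; per the replies above, matching on `wikibase:Lexeme` instead would return nothing there because only one class is kept.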

Possible split on lexical category. Merging?

common noun (Q498187) and common noun (Q2428747) seem to be about the same thing, as far as I understand. Can someone confirm that they are the same? The Russian Wikipedia has two concepts [2] [3] though, so they cannot be merged directly. I think that Апеллятив should maybe have its own Wikidata item and the rest of the Wikipedia language links should be merged? I note that the item with the highest ID number has the most linked lexemes. I am not sure whether the merge bot works on lexemes. — Finn Årup Nielsen (fnielsen) (talk) 21:20, 22 October 2018 (UTC)

I've seen KRbot fix some merge issues with lexemes before - specifically, various versions of "present participle" were merged together (see the archives of this page), and there were a lot of old lexemes that needed fixing. But that is probably new functionality and may be still in development. ArthurPSmith (talk) 18:16, 23 October 2018 (UTC)
I am not referring to the technical issue of merging, but rather the issue of whether they are the same, and if they are, then why are there two Russian articles? We would need some users with an understanding of Russian. — Finn Årup Nielsen (fnielsen) (talk) 08:26, 30 October 2018 (UTC)

Dj Pava Hm Music

Lexeme:L34111 @Lea Lacroix (WMDE): What should we do with items accidentally created in lexeme namespace? (create lexeme is next to create item link in sidebar). Just request deletion or something else? KaMan (talk) 11:43, 30 October 2018 (UTC)

We currently have no tool or process to transform a Lexeme into an Item. So I guess request deletion is the safe way to go. Ideally, also write to the user and indicate the correct link. Lea Lacroix (WMDE) (talk) 12:30, 30 October 2018 (UTC)

Danish missing genitive

Danish (Q9035) is said to have no genitive case (Q146233), see, e.g., [4] (in Danish (Q9035)). Nevertheless, in Danish (Q9035) an -s is added to the end of a word in a way similar to English (Q1860): "en uges varsel" -> "one week's notice" (no apostrophe in Danish (Q9035); the basic form is "uge"). My question is whether forms such as "uges" should be added to Wikidata. And if yes, what grammatical feature can we associate with these words? I was originally made aware of this issue by @Rua: [5]. — Finn Årup Nielsen (fnielsen) (talk) 17:57, 19 October 2018 (UTC)

  • If it's not a canonical genitive, but an actual genitive, I'd still use the item. Obviously, if Danish has some word for the form or case, use that instead. If not, you could always make a descriptive item, e.g. "Danish s-form". It's likely that we end up with forms that can be attested, grouped by feature, but that aren't described by every print dictionary. --- Jura 18:10, 19 October 2018 (UTC)
  • I think there is a certain difference between school grammar, which I think is heavily simplified and tries to map Latin grammar onto other languages (obviously citation needed, just my opinion), and the way a language uses certain constructs. School grammar teaches that German has 4 cases. But we have for example "in" (inessive case (Q282031)) and "heraus", which I would argue can be a word that exists only in elative case (Q394253). Perhaps cases should be viewed as more than just forms of nouns? (Disclaimer: I am not a language scientist). --Tobias1984 (talk) 18:55, 19 October 2018 (UTC)
  • As I mentioned, it's no different from the English possessive "'s", which can attach to any word that appears at the end of a noun phrase, not just to nouns. This makes it a clitic, rather than a case. Otherwise, we'd have to start adding genitive forms to everything, from prepositions to even verb forms! —Rua (mew) 19:49, 19 October 2018 (UTC)
  • I would definitely add "uges" as a form but I'm not sure about the feature. What is clear to me is that (almost) all attested lemmata are admissible and should be documented, whether they're considered correct or not. Cdlt, VIGNERON (talk) 08:54, 20 October 2018 (UTC)
    So would you include meds ("with's") as a genitive form of the preposition med? —Rua (mew) 10:17, 20 October 2018 (UTC)
    • That is an interesting question, but I think it's more about what to do when words aren't used in their main lexical category. --- Jura 10:22, 20 October 2018 (UTC)
    @Rua: I don't speak Danish, so if one day I find meds in a Danish book, yes, it would be useful to also find meds in Lexemes. I'm not sure how to structure it (it depends largely on the references) but I'm sure I want to find it. Cdlt, VIGNERON (talk) 10:40, 20 October 2018 (UTC)
    English Wiktionary decided to treat words with clitics as sum-of-parts and thus not includable. The Latin suffix -que was given the same treatment. —Rua (mew) 15:15, 20 October 2018 (UTC)
  • So where does that leave us? Should I erase the s-form words from the lexemes that I have already entered? Should I leave them as they are, just not enter new s-forms? Should we add s-forms but not call them "genitive"? My initial thought for adding the s-form was to ease computational lookup of the form, e.g., would "hus" be hus (L1111)-F1 or hu (L31704) in an s-form? As far as I can see Danish dictionaries do not include the s-form. — Finn Årup Nielsen (fnielsen) (talk) 17:30, 22 October 2018 (UTC)
    • I think they should be removed, for the reasons I outlined above. Any word can have this -s, so a word parser just has to be aware of this possibility. It can be compared to the aforementioned Latin -que and -ve, which could also be theoretically present on any word. Finnish -kin, -kaan, -ko, -han, -pa and others are also good examples of such clitics. The Finnish example is especially illustrative of the mess we get into if we start including them as forms: Finnish nouns not only have 15 cases, but also 6 possessive forms for each of those cases, giving 105 forms per noun. Now consider that all of these forms could appear with a clitic, or even multiple clitics combined, and you get a combinatorial explosion. This is not relevant for Danish of course, but I do think we should be consistent in our treatment of clitics across languages. —Rua (mew) 10:36, 23 October 2018 (UTC)
    • I'd keep them for Danish nouns. I have no opinion on features that may resemble them in some framework for fi-lexemes. --- Jura 20:52, 23 October 2018 (UTC)
      • But why only for nouns? —Rua (mew) 10:44, 24 October 2018 (UTC)
        • I guess that the genitive/s-form is used most often on nouns. In noun phrases, the last word would most often be a noun. nominalized adjective (Q4683152) exists in Danish (example with gammel (L31494): "de ældres helbred", "the old's (the old people's) health"/"health of old people"), and the -s also occurs on a verb used as a noun (vælge (L32432): "de valgtes roller", "the elected's roles"/"the roles of the elected (people)"). I have a hard time coming up with examples for preposition (Q4833830) (such as meds, with's given above). There are a few examples with pronoun (Q36224): Den Danske Ordbog (Q1186741) notes the -s as genitive for "det" [6]; otherwise I have not run into dictionaries listing the genitive in Danish (I suppose that might also be because it is entirely predictable). — Finn Årup Nielsen (fnielsen) (talk) 12:50, 24 October 2018 (UTC)
  • @Lucas Werkmeister: I see that the "svenskt substantiv" forms in lexeme-forms ([7] [8]) include Swedish genitive. I imagine that the same question arises for Swedish as for Danish?
    • @Fnielsen: Whether Swedish has any cases in the first place is a matter of some dispute as well. I think sv:Kasus#Svenska explains it well. The conclusion is that a two-case system is the most traditional view, and that Svenska Akademiens grammatik supports the genitive as a case. --Vesihiisi (talk) 13:05, 24 October 2018 (UTC)
      • The situation for the genitive in Swedish and Norwegian is exactly the same as it is for Danish and English. It is a clitic that can attach to any word. See w:Swedish grammar#Genitive and w:Norwegian language#Genitive_of_nouns. The Swedish example given in the article is particularly illustrative. If the genitive were a true case, then in Konungen av Danmarks bröstkarameller the word "bröstkarameller" would be modified only by "Danmark", so the phrase would mean "the king of the cough drops of Denmark". But it's modified by the entire phrase "Konungen av Danmark", which means it's a clitic. —Rua (mew) 15:26, 24 October 2018 (UTC)
        • The statement that the -s can attach to "any" word is a wild exaggeration. Cases like "konungen av Danmark", where this makes any difference, are really rare, and less than a century ago the only form considered correct was "konungens av Danmark bröstkarameller", with the -s on the noun in question. Furthermore, even when the -s moves to Danmarks, it is still on a noun, not on any kind of word. Sentences where you try to attach an -s to the end of a phrase such as "hästen jag rider på" (the horse I ride on) are frowned upon by the vast majority of native speakers (above the age of five). Thus, "pås" is not a Swedish word. The -s simply does not attach to a preposition (you can try, but it does not stick), only to nouns. I'd say Rua is wrong. But why not give Rua a chance to rule how Lexemes in Wikidata should work! That would make the whole project fail, and so the classic Wiktionary will be victorious. I'm not at all against this development. Go ahead! --LA2 (talk) 21:25, 30 October 2018 (UTC)

@LA2, Rua, Vesihiisi, Jura1, VIGNERON, Tobias1984: If we do not add the s-form then there is an issue when referring to a form. broderskab (L34118) has "De er udstyret med fornuft og samvittighed, og de bør handle mod hverandre i en broderskabets ånd." as a usage example (P5831) and could use demonstrates form (P5830) but what should it refer to? L34118-F1 (broderskab)? — Finn Årup Nielsen (fnielsen) (talk) 08:24, 30 October 2018 (UTC)

  • invalid ID (L34118-F2) I would say. A word with a clitic isn't really any different from a word followed by another word. The famous Latin senatus populusque Romanus demonstrates the form populus. Even the Romans themselves saw it that way, because they abbreviated it SPQR and not SPR. —Rua (mew) 11:30, 30 October 2018 (UTC)
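
To make the "a word parser just has to be aware of this possibility" idea from the thread concrete, here is a minimal Python sketch of a lookup that first tries the token as a stored form and then, if the token ends in -s, also tries the token without the clitic. The index contents and the function name lookup are assumptions for illustration only; the IDs are the ones mentioned above (no form ID was given for hu, so the bare lexeme ID stands in for it), and nothing here describes an existing Wikidata tool.

```python
# Hypothetical in-memory index: surface form -> IDs of matching forms.
# IDs are taken from the discussion; "L31704" is a lexeme ID used as a
# stand-in because no form ID for "hu" was mentioned.
FORM_INDEX = {
    "hus": ["L1111-F1"],
    "hu": ["L31704"],
    "broderskab": ["L34118-F1"],
}


def lookup(token, index=FORM_INDEX):
    """Return a list of (form_id, has_s_clitic) candidates for a token.

    The token is first looked up as-is. If it ends in "s", the part
    before the "s" is looked up as well and flagged as carrying the
    clitic. Disambiguation is left to the caller.
    """
    token = token.lower()
    candidates = [(form_id, False) for form_id in index.get(token, [])]
    if len(token) > 1 and token.endswith("s"):
        stem = token[:-1]
        candidates += [(form_id, True) for form_id in index.get(stem, [])]
    return candidates


if __name__ == "__main__":
    # "hus" is ambiguous: the plain form of hus, or hu plus the clitic.
    print(lookup("hus"))
    # "broderskabs" is not stored, but is found via the clitic rule.
    print(lookup("broderskabs"))
```

With this approach the s-forms do not need to be stored as separate forms, at the cost of returning more than one candidate for a token like "hus".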

A game with German articles

Hello all,

I just wanted to let you know about a game that I developed in my volunteer capacity for the Wikidata birthday. DerDieDas uses lexicographical data to present German nouns and lets you guess their grammatical gender. It's an idea I put in the ideas of tools a while ago, so I was happy to experiment with querying Lexemes and parsing the existing data :)

This is just a prototype to show what is possible to do with structured lexicographical data. The game doesn't have a lot of features and some issues may occur, but I hope it will give other people ideas to continue in this direction.

Also note that I adapted it for French, and the results reflect the current state of the French nouns in Wikidata (a lot of nouns ending with -ion were created, and they are currently very present in the game).

Cheers, Léa Auregann (talk) 13:11, 30 October 2018 (UTC)

And now a Danish Version by fnielsen, don't hesitate to make your own ;) (I'm a bit jealous it can't really be done in Breton :/ ). Cdlt, VIGNERON (talk) 11:13, 31 October 2018 (UTC)
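
For anyone who wants to try a version of their own, the core of such a guessing game can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the actual DerDieDas code: the noun list is hard-coded here instead of being queried from Lexemes, and the names NOUNS, pick_noun and check_guess are made up for the example.

```python
import random

# Definite article in the nominative singular for each German gender.
ARTICLE_FOR_GENDER = {
    "masculine": "der",
    "feminine": "die",
    "neuter": "das",
}

# Hypothetical sample data; a real game would load (lemma, gender)
# pairs from the lexicographical data instead.
NOUNS = [
    ("Hund", "masculine"),
    ("Katze", "feminine"),
    ("Haus", "neuter"),
]


def pick_noun(rng):
    """Pick a random (lemma, gender) pair using the given generator."""
    return rng.choice(NOUNS)


def check_guess(gender, guessed_article):
    """Return True if the guessed article matches the noun's gender."""
    return ARTICLE_FOR_GENDER[gender] == guessed_article.strip().lower()


if __name__ == "__main__":
    lemma, gender = pick_noun(random.Random())
    guess = input(f"der, die or das? ___ {lemma}: ")
    print("Correct!" if check_guess(gender, guess) else "Wrong.")
```

Adapting it to another language mostly means replacing the article table and the noun data.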