Wikidata talk:Lexicographical data

Place used to discuss any and all aspects of lexicographical data: the project itself, policy and proposals, individual lexicographical items, technical issues, etc.
On this page, old discussions are archived. An overview of all archives can be found at this page's archive index. The current archive is located at 2021/10.

Milestone - 200k lexemes

My bot just created spiritualistically (L200000), the 200,000th lexeme, while importing Wiktionary adverbs! It means "in a way relating to being spiritual".  – The preceding unsigned comment was added by SixTwoEight (talk • contribs) at 22:04, 11 October 2019 (UTC).

New tool: Lexemes Party


I created a new tool: Lexemes Party. It shows lexemes related to a list of Wikidata items. It was built with two uses in mind:

  • To give an overview of the coverage of lexicographical data in Wikidata for a small list of Wikidata items (example: Planets of the Solar System).
  • To work with a long list of Wikidata items to improve a specific language (example: Animals).

Several examples are available on the tool.

A "challenge" mode will be added soon. The idea is to offer a small list of Wikidata concepts for a limited time, so that the community can create or improve the corresponding lexemes in their languages. Progress will be saved. The aim is to encourage friendly competition between languages.

Feedback is of course welcome! There are probably bugs and room for improvement.

Cheers, Envlh (talk) 17:15, 26 July 2021 (UTC)

@Envlh: This is an interesting tool! One immediate concern I see is that for lexemes with multiple lemmata, such as those in Hebrew, Hindustani, or Malay, each lemma/sense combination is listed separately: in the 'colors of the rainbow flag' query, it looks like there are four Hindustani lexemes given for 'green' when there are actually two, each with a Devanagari and an Arabic lemma. Mahir256 (talk) 17:49, 26 July 2021 (UTC)
@Mahir256: Thank you! With your example, would this layout be better?
ہرا / हरा (L298913-S1)
سبز / सब्ज़ (L298914-S1)
Cheers, Envlh (talk) 20:41, 26 July 2021 (UTC)
@Envlh: Yes, this is better. Mahir256 (talk) 21:36, 26 July 2021 (UTC)
@Mahir256: Done. Envlh (talk) 21:47, 26 July 2021 (UTC)

The challenge mode has been added, with a first challenge about the Olympic Games. Cheers, Envlh (talk) 19:56, 1 August 2021 (UTC)

@Envlh: what a lovely tool! Can you please change the query for colors from the LGBT flag to neutral spectral colors: "SELECT ?concept { wd:Q43213808 p:P527 [ rdf:type wikibase:BestRank ; ps:P527 ?concept ; pq:P1545 ?rank ]} ORDER BY xsd:integer(?rank)"? --Infovarius (talk) 13:56, 4 August 2021 (UTC)
@Infovarius: thank you for your feedback! These are only examples to show how the tool can be used. You can create your own lists and then share them. For your query: Prismatic colors. Cheers, Envlh (talk) 10:48, 5 August 2021 (UTC)

Hello! Thanks to feedback, several improvements have been made to the tool (readability, more consistent use of language codes, etc.). Documentation has been started. We are at the 9th weekly challenge (statistics about previous challenges). Feel free to contact me privately (by email or by DM on Twitter, Telegram, ...) if you have ideas for new challenges. Cheers, Envlh (talk) 22:02, 3 October 2021 (UTC)

Chinese vs. Mandarin

Hey all, I was randomly searching for Chinese lexemes. At first I used

The following query uses these:

  • Items: Chinese (Q7850)
    SELECT ?lexeme ?lemma ?modified
    WHERE {
       ?lexeme dct:language wd:Q7850; wikibase:lemma ?lemma; schema:dateModified ?modified.
    }

Only 1 result (and rather new). Hm... strange, I thought. Then I tried to search using

The following query uses these:

  • Items: Mandarin Chinese (Q9192)
    SELECT ?lexeme ?lemma ?modified
    WHERE {
       ?lexeme dct:language wd:Q9192; wikibase:lemma ?lemma; schema:dateModified ?modified.
    }

and, lo and behold, there are a couple hundred of them.

So my question is, shouldn't they use Chinese (Q7850) instead of Mandarin Chinese (Q9192)?  – The preceding unsigned comment was added by Bennylin (talk • contribs) at 17:38, August 2, 2021 (UTC).

If you want to find lexemes for all varieties of Chinese, you could use dct:language/wdt:P279* wd:Q7850; instead. If you only want to find lexemes for Mandarin, then you should use the item for Mandarin, because Chinese is far more than just Mandarin. :)
If you're asking why we separate varieties of Chinese, they are largely considered different languages. Dictionaries, textbooks, language courses, etc, are normally for a specific variety of Chinese. They have separate language codes, separate Wikipedias, separate romanisation systems, separate pronunciation files. Merging them together would make some things easier or require less duplication, but it would also create a lot more work marking all senses, transliterations, audio files, etc, with which variety they apply to and would make querying the data for a particular variety harder.
- Nikki (talk) 07:26, 3 August 2021 (UTC)
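
A minimal sketch of the two query variants described above, written as a small Python helper that only builds the query text. The function name lexeme_query and its include_varieties flag are made up for illustration. The sketch assumes the query is run on the Wikidata Query Service, where the dct:, wd:, wdt:, wikibase: and schema: prefixes are predeclared, and that the varieties of interest are linked to Chinese (Q7850) through "subclass of" (P279), as stated in the discussion.

```python
def lexeme_query(language_qid: str, include_varieties: bool = False) -> str:
    """Build a SPARQL query listing lexemes for a language item.

    With include_varieties=True the language is matched through the
    property path dct:language/wdt:P279*, i.e. the item itself or any
    item reachable from the lexeme's language via "subclass of" (P279).
    """
    # Hypothetical sanity check: accept only IDs shaped like "Q7850".
    digits = language_qid[1:]
    if not (language_qid.startswith("Q") and digits.isascii() and digits.isdigit()):
        raise ValueError(f"not a Wikidata item ID: {language_qid!r}")

    path = "dct:language/wdt:P279*" if include_varieties else "dct:language"
    return (
        "SELECT ?lexeme ?lemma ?modified\n"
        "WHERE {\n"
        f"  ?lexeme {path} wd:{language_qid};\n"
        "          wikibase:lemma ?lemma;\n"
        "          schema:dateModified ?modified.\n"
        "}"
    )


if __name__ == "__main__":
    # Lexemes tagged with Chinese (Q7850) or any of its subclasses.
    print(lexeme_query("Q7850", include_varieties=True))
    # Lexemes tagged with Mandarin Chinese (Q9192) only.
    print(lexeme_query("Q9192"))
```

The first call covers every variety that is modelled as a subclass of Q7850; the second reproduces the Mandarin-only query from the original question.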
OK, as a Chinese speaker, all written forms are Chinese (Q7850). No matter how they're pronounced, they're written the same. No Chinese person would see a text, e.g. 省级行政区/省級行政區 (L504064), and say it is "Mandarin" or "官话" or "Northern Chinese" (北方方言). Only the most uninformed foreigner would make such a mistake. Mandarin Chinese (Q9192) is a spoken variety, and as the Wikidata item states, is a subclass of Chinese. So, for me, it doesn't make sense that the written forms are categorized as Mandarin, a spoken, albeit dominant, variety. They should all be changed to Chinese. As the first part of my comment indicates, we had only 1 (and then 0, because it was changed to Mandarin) lexeme for one of the most important languages in the world. That's baffling, to say the least! But no, apparently it was systematically re-categorized into something it's not.
Where is Wikidata's decision to separate varieties of Chinese documented, if I may ask? Or the discussion on whether to merge or separate them? I could only find 3 separate discussions (1, 2, 3), and they are far from reaching any consensus. I also noticed the lack of Chinese voices. Was there any consensus or discussion that I'm not aware of?
I'm aware there are Chinese, Cantonese, Wu, and Gan Wikipedia editions that use Chinese characters, but AFAIK there's no "Mandarin" Wikipedia. But as my title suggests, the question is only Chinese vs. Mandarin (not Chinese vs. Cantonese, or even traditional Chinese vs. simplified Chinese). Why categorize or re-categorize them as Mandarin? On what rationale? The lexemes are in written, not spoken, form, so I think we should change all of them to Chinese. Bennylin (talk) 14:26, 8 August 2021 (UTC)
First, this discussion is funny for me because Q7850 is "langues chinoises" (Chinese languages, plural) so it's obviously not "one" language for me at first glance. I feel that most of this discussion is about the label itself.
Then, "No Chinese person would" is a weak argument: most people don't think in terms of lexicography anyway. With the same logic, should we replace all "noun, verb, particle" with "word" because most people would say it's just a word? "The lexemes are in written, not spoken form" is also wrong: lexemes are conceptual, so they are both written and spoken (or neither, it doesn't really matter). For instance, we could have lexemes for non-spoken languages (and we do have a few, like AS18507S20600S33b00M518x538S33b00482x483S18507498x511S20600476x520/FlatO@Mouth-PalmBack (L8881), which is "kind of" written).
Plus, do you have references? For references about separate languages within Chinese, you might want to start with the international standard ISO 639 (Q33547). I vaguely remember that there are written differences between Chinese languages (especially for Southern Min (Q36495), which is actually the Chinese language with the most lexemes right now).
That said, there is indeed something very strange: it should be the pair zh + Chinese (Q7850) or cmn + Mandarin Chinese (Q9192), but zh + Mandarin Chinese (Q9192) is not really logical.
Cheers, VIGNERON (talk) 18:48, 25 August 2021 (UTC)

Enable all ISO 639 codes

Is there a way to enable all ISO 639-3 (Q845956) codes on Wikidata? Wikidata only supports a few hundred languages, whereas there are more than 7,000 languages in the world.

I tried creating a lexeme for bàbò (Nupe (Q36720) for Lagenaria siceraria (Q1277255)), but technical restrictions prevented me from doing so. I also could not add the Nupe name to Lagenaria siceraria (Q1277255), since Wikidata items cannot be linked to Incubator pages. In the future, I would like to add lexemes for dozens of African languages that do not yet have any officially launched wikis, but it appears that Wikidata cannot yet support this. Sabon Harshe (talk) 09:13, 21 August 2021 (UTC)

@Sabon Harshe: Hi, welcome to this section of Wikidata. Lexemes for which we do not yet have a selectable language code can be given "mis" as the language code. I created the lexeme for you, see bàbò (L585993). I can warmly recommend joining the Telegram group listed here to chat with the rest of the community. :) If you think the template for the "create new lexeme" page could be improved, you are very welcome to open a ticket here. --So9q (talk) 19:30, 21 August 2021 (UTC)
@So9q: Thank you! I am also now trying to learn how to add more lexemes using QuickStatements. Sabon Harshe (talk) 08:21, 26 August 2021 (UTC)

Adding link to the project in the sidebar


Talking with @Lepticed7:, we're wondering if we could add a link to the "main page" of the project, Wikidata:Lexicographical data, in the Wikidata sidebar (for instance, before "creating a new lexeme").

Regards, VIGNERON (talk) 07:26, 29 August 2021 (UTC)

I'd find it useful for sure. --Lexicolover (talk) 19:45, 7 September 2021 (UTC)
Hi, I would also find this change useful. Cheers, Envlh (talk) 22:05, 3 October 2021 (UTC)

Thank you page for data donators?

Hi, I recently asked a bunch of website owners to release their Yiddish proverbs as CC0 so we can import them. One asked if we can list the donation somewhere and link back to their site. Do we have a page like that? OSM has a quite prominent one here and another here. --So9q (talk) 16:08, 1 September 2021 (UTC)

@So9q: I thought (wrongly) that such a page existed, but all I could find is Wikidata:Data_donation#Organisations_who_have_worked_with_Wikidata. @LydiaPintscher: you created this page, does it ring a bell? A gln, VIGNERON (talk) 06:38, 3 September 2021 (UTC)
Hehe yes. But that is our landing page for anyone who wants to give us data, so I would very much discourage making it a laundry list of every organisation that ever gave us data :D --LydiaPintscher (talk) 16:33, 3 September 2021 (UTC)
OK, I agree we should put mentions of smaller donations somewhere else and then link to it. What about a subpage titled "List of all data donations"? --So9q (talk) 04:58, 4 September 2021 (UTC)

Multiword "nouns" for species

I am unsure how one should best record multiword "nouns" for species. For example, for killer whale (L42998) I see the lexical category set to noun (Q1084), while I have set grøn kølleguldsmed (L590625) (which consists of one adjective (Q34698) and one noun (Q1084)) to noun phrase (Q1401131) (I am unsure how that differs from nominal phrase (Q29888377)). For combines lexeme (P5238), how should we record that there is whitespace between the words? -- Finn Årup Nielsen (fnielsen) (talk) 16:22, 9 September 2021 (UTC)

@Fnielsen: I'd say "noun phrase" is fine for such words, although correcting uses of "noun" for them isn't absolutely necessary if the lexeme in question can't have its parts split when it is used. As for the whitespace issue, we could introduce a convention in the series ordinal (P1545) values where e.g. "1" and "2" refer to separate words, while "3.1" and "3.2" refer to parts of a single word. Mahir256 (talk) 16:32, 9 September 2021 (UTC)
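
A rough sketch of how the proposed series ordinal (P1545) convention could be read back into a lemma, assuming parts that share the same leading number are joined without whitespace and different leading numbers are separated by a single space. The helper names parse_ordinal and join_parts are hypothetical, and the split of "kølleguldsmed" into two parts is only illustrative, not taken from the lexeme's actual statements. Real compounds may also need linking elements, which plain concatenation ignores.

```python
from itertools import groupby
from typing import List, Tuple


def parse_ordinal(ordinal: str) -> Tuple[int, ...]:
    """Turn a series ordinal such as "3.2" into a sortable tuple (3, 2)."""
    return tuple(int(piece) for piece in ordinal.split("."))


def join_parts(parts: List[Tuple[str, str]]) -> str:
    """Rebuild a lemma from (series ordinal, text) pairs.

    Parts with the same leading number ("3.1", "3.2") belong to one word
    and are concatenated; different leading numbers ("1", "2") are
    separate words joined by whitespace.
    """
    ordered = sorted(parts, key=lambda part: parse_ordinal(part[0]))
    words = []
    for _, group in groupby(ordered, key=lambda part: parse_ordinal(part[0])[0]):
        words.append("".join(text for _, text in group))
    return " ".join(words)


if __name__ == "__main__":
    # Two separate words.
    print(join_parts([("1", "killer"), ("2", "whale")]))
    # One word followed by a two-part compound (illustrative split).
    print(join_parts([("1", "grøn"), ("2.1", "kølle"), ("2.2", "guldsmed")]))
```

Sorting on the parsed tuple rather than the raw string keeps "10" after "9", which a plain string sort would get wrong.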
Isn't noun phrase (Q1401131) more about how the lexeme is constructed than its actual lexical category? I mean, no matter how it's constructed and where it comes from, it acts as a noun, no? I'm not sure about the theory, but pragmatically at least (for instance for constraints and schemas/lexical masks), I guess sticking with only a few basic lexical categories would be better. noun phrase (Q1401131) is interesting and useful, but it should go elsewhere (maybe instance of (P31)? or a more specific property?). Cheers, VIGNERON (talk) 18:39, 11 September 2021 (UTC)
I believe that the lexical category should be phrase (Q187931) (or phraseme (Q5551966), word combination (Q1774041)?). --Infovarius (talk) 13:50, 13 September 2021 (UTC)