Wikidata talk:Lexicographical data/Archive/2021/08

This page is an archive. Please do not modify it. Use the current page, even to continue an old discussion.

Change regular expression in a property


Hello. I need to change the regular expression (format as a regular expression (P1793)) of the property Diccionario de la lengua española word (non-ID) (P7790) since the URL of this dictionary also allows the following characters:

ñÑáéíóúÁÉÍÓÚüÜ

Some examples: armiño, pingüino, rangífero

Can someone help me with this? --Hameryko (talk) 13:06, 17 August 2021 (UTC)

Done Pamputt (talk) 14:42, 17 August 2021 (UTC)

Thanks for your help. --Hameryko (talk) 21:34, 19 August 2021 (UTC)
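To illustrate the kind of change requested above, here is a minimal Python sketch. Both patterns are assumptions made up for the example; neither is the actual format as a regular expression (P1793) value stored on P7790, before or after the fix. It only shows that an ASCII-only character class rejects the three example words, while a class extended with the listed characters accepts them.

```python
import re
import unicodedata

# Characters listed in the request, normalized so each is a single code point.
EXTRA_CHARS = unicodedata.normalize("NFC", "ñÑáéíóúÁÉÍÓÚüÜ")

# Hypothetical patterns for illustration only (not the real P1793 values).
ASCII_ONLY = re.compile(r"[A-Za-z]+")
EXTENDED = re.compile(f"[A-Za-z{EXTRA_CHARS}]+")


def matches(pattern, word):
    """Return True if the whole word matches the pattern."""
    # Normalize to NFC so an accented letter is compared as one character.
    normalized = unicodedata.normalize("NFC", word)
    return pattern.fullmatch(normalized) is not None


examples = ["armiño", "pingüino", "rangífero"]
for word in examples:
    print(word, matches(ASCII_ONLY, word), matches(EXTENDED, word))
```

Normalizing to NFC matters because a letter such as ñ can also be encoded as n plus a combining tilde, which a character class of precomposed letters would not match.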

Chinese vs. Mandarin


Hey all, I ran a quick search for Chinese lexemes. At first I used Chinese (Q7850):
```
SELECT ?lexeme ?lemma ?modified
WHERE {
   ?lexeme dct:language wd:Q7850; wikibase:lemma ?lemma; schema:dateModified ?modified.
}
```


Only 1 result (and a rather new one). Hm... strange, I thought. Then I tried searching with Mandarin (Q9192):
```
SELECT ?lexeme ?lemma ?modified
WHERE {
   ?lexeme dct:language wd:Q9192; wikibase:lemma ?lemma; schema:dateModified ?modified.
}
```


Lo and behold, there are a couple hundred of them.

So my question is, shouldn't they use Chinese (Q7850) instead of Mandarin (Q9192)? – The preceding unsigned comment was added by Bennylin (talk • contribs) at 17:38, August 2, 2021‎ (UTC).

If you want to find lexemes for all varieties of Chinese, you could use dct:language/wdt:P279* wd:Q7850; instead. If you only want to find lexemes for Mandarin, then you should use the item for Mandarin, because Chinese is far more than just Mandarin. :)

If you're asking why we separate varieties of Chinese, it is because they are largely considered different languages. Dictionaries, textbooks, language courses, etc. are normally for a specific variety of Chinese. The varieties have separate language codes, separate Wikipedias, separate romanisation systems, and separate pronunciation files. Merging them together would make some things easier or require less duplication, but it would also create a lot more work marking which variety each sense, transliteration, audio file, etc. applies to, and it would make querying the data for a particular variety harder.

- Nikki (talk) 07:26, 3 August 2021 (UTC)
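The difference between the two query forms in the reply above can be sketched in Python. The helper `lexeme_query` and its `include_varieties` parameter are hypothetical names introduced for this example; the SPARQL text itself comes from the queries and the property path quoted in this thread. The sketch only builds the query string and does not contact the query service.

```python
import re


def lexeme_query(language_qid, include_varieties=False):
    """Build a SPARQL query listing lexemes for a language item.

    With include_varieties=True, the property path dct:language/wdt:P279*
    also matches lexemes whose language is a subclass (P279) of the item.
    """
    # Reject anything that is not a plain item ID such as Q7850.
    if not re.fullmatch(r"Q[1-9][0-9]*", language_qid):
        raise ValueError(f"not an item ID: {language_qid!r}")
    path = "dct:language/wdt:P279*" if include_varieties else "dct:language"
    return (
        "SELECT ?lexeme ?lemma ?modified\n"
        "WHERE {\n"
        f"   ?lexeme {path} wd:{language_qid}; "
        "wikibase:lemma ?lemma; schema:dateModified ?modified.\n"
        "}"
    )


# All varieties of Chinese versus Mandarin only.
print(lexeme_query("Q7850", include_varieties=True))
print(lexeme_query("Q9192"))
```

Because `P279*` matches zero or more subclass steps, the first query also returns lexemes tagged directly with Chinese (Q7850), in addition to those tagged with any item that is a subclass of it.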

OK, as a Chinese speaker, all written forms are Chinese (Q7850). No matter how they're pronounced, they're written the same. No Chinese person would see a text, e.g. 省级行政区/省級行政區 (L504064), and say it's "Mandarin" or "官话" or "Northern Chinese" (北方方言). Only the most uninformed foreigner would make such a mistake. Mandarin (Q9192) is a spoken variant and, as the Wikidata item states, is a subclass of Chinese. So, for me, it doesn't make sense that the written forms are categorized as Mandarin, a spoken, albeit dominant, variant. They should all be changed to Chinese. As the first part of my comment indicates, we had only 1 lexeme (and then 0, because it was changed to Mandarin) for one of the most important languages in the world. That's baffling, to say the least! But no, apparently they were systematically re-categorized into something they're not.

Where is Wikidata's decision to separate varieties of Chinese documented, if I may ask? Or any discussion about whether to merge or separate them? I could only find 3 separate discussions: 1, 2, 3, and they are far from reaching any consensus. I noticed the lack of Chinese voices as well. Was there any consensus or discussion that I'm not aware of?

I'm aware there are Chinese, Cantonese, Wu, and Gan Wikipedia editions that use Chinese characters, but AFAIK there's no "Mandarin" Wikipedia. But as my title suggests, the question is only Chinese vs. Mandarin (not Chinese vs. Cantonese, or even traditional Chinese vs. simplified Chinese). Why categorize or re-categorize them as Mandarin? On what rationale? The lexemes are in written, not spoken form, so I think we should change all of them to Chinese. Bennylin (talk) 14:26, 8 August 2021 (UTC)

First, this discussion is funny for me because Q7850 is "langues chinoises" (Chinese languages, plural) so it's obviously not "one" language for me at first glance. I feel that most of this discussion is about the label itself.

Then, "No Chinese person would" is a weak argument; most people don't think in terms of lexicography anyway. By the same logic, should we replace all "noun, verb, particle" with "word" because most people would say it's just a word? "The lexemes are in written, not spoken form" is also wrong: lexemes are conceptual, so they are both written and spoken (or neither, it doesn't really matter). For instance, we could have lexemes for non-spoken languages (and we do have a few, like AS18507S20600S33b00M518x538S33b00482x483S18507498x511S20600476x520/FlatO@Mouth-PalmBack (L8881), which is "kind of" written).

Plus, do you have references? For references about separate languages within Chinese, you might want to start with the international standard ISO 639 (Q33547). I vaguely remember that there are written differences between Chinese languages (especially for Southern Min (Q36495), which is actually the Chinese language with the most lexemes right now).

That said, there is indeed something very strange: it should be the pair zh + Chinese (Q7850) or cmn + Mandarin (Q9192), but zh + Mandarin (Q9192) is not really logical.

Cheers, VIGNERON (talk) 18:48, 25 August 2021 (UTC)

Enable all ISO 639 codes


Is there a way to enable all ISO 639-3 (Q845956) codes on Wikidata? Wikidata only supports a few hundred languages, whereas there are more than 7,000 languages in the world.

I tried creating a lexeme for bàbò (Nupe (Q36720) for Lagenaria siceraria (Q1277255)), but technical restrictions prevented me from doing so. I also could not add the Nupe name to Lagenaria siceraria (Q1277255), since Wikidata items cannot be linked to Incubator pages. In the future, I would like to add lexemes for dozens of African languages that do not yet have any officially launched wikis, but it appears that Wikidata cannot yet support this. Sabon Harshe (talk) 09:13, 21 August 2021 (UTC)

@Sabon Harshe: Hi, welcome to this section of Wikidata. Lexemes in languages for which we do not yet have a selectable language code can be given "mis" as the language code. I created the lexeme for you, see bàbò (L585993). I can warmly recommend joining the Telegram group listed here to chat with the rest of the community. :) If you think the template for the "create new lexeme" page could be improved, you are very welcome to open a ticket here. --So9q (talk) 19:30, 21 August 2021 (UTC)

@So9q: Thank you! I am also now trying to learn how to add more lexemes using QuickStatements. Sabon Harshe (talk) 08:21, 26 August 2021 (UTC)