Wikidata talk:Lexicographical data/Archive/2023/10

This page is an archive. Please do not modify it. Use the current page, even to continue an old discussion.

Import single-word senses from Wiktionary

To increase the number of senses in Wikidata, I would like to import senses from Wiktionary. Because the licenses are incompatible, this can only be done for items below the threshold of copyrightability. In the Telegram chat, it was assumed that limiting the import to senses consisting of a single word would be sufficient to meet this criterion. Additionally, senses should only be imported for lexemes without any existing senses, to avoid duplication. My current import suggestions can be found at https://static.karl.berlin/wikidata/ and I plan to use https://gitlab.wikimedia.org/toolforge-repos/twofivesixlex to execute the import.

Do you agree on the license situation?
Does anyone want to proof-read the suggestion list?
Is there anything else I should be aware of in this context? This is my first import.

Karlb42 (talk) 15:08, 1 October 2023 (UTC)
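
The two selection criteria in the proposal above (single-word glosses only, and only lexemes that have no senses yet) can be sketched as a small filter. This is a minimal illustration, not the code of the actual import tool: the `Candidate` record, its field names, and the sample rows are hypothetical, and "single word" is assumed to mean exactly one whitespace-separated token.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record for one suggested sense import."""
    lemma: str
    gloss: str
    existing_sense_count: int  # senses already present on the Wikidata lexeme


def is_single_word(gloss: str) -> bool:
    """True if the gloss is exactly one whitespace-separated token (assumption)."""
    return len(gloss.split()) == 1


def select_importable(candidates):
    """Keep single-word glosses for lexemes that have no senses yet."""
    return [
        c for c in candidates
        if is_single_word(c.gloss) and c.existing_sense_count == 0
    ]


# Made-up sample rows for illustration only.
candidates = [
    Candidate("chirp", "insects", 0),
    Candidate("chirp", "a short, sharp sound", 0),  # rejected: multi-word gloss
    Candidate("house", "dwelling", 2),              # rejected: lexeme already has senses
]
selected = select_importable(candidates)
print([(c.lemma, c.gloss) for c in selected])
```

Note that such a filter checks only form: the "chirp"/"insects" row passes, which is exactly the kind of gloss questioned in the replies below.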

@Karlb42: I don't think there's a copyright issue here, other than possibly the one associated with databases which we routinely ignore. I did look through your English list and I'm not sure these will be super useful. A sense of "chirp" is "insects"? But some of them will be helpful. It's also not all that long a list (915 for English) - is this all the single-word definitions in enwikt (for lexemes with no senses)? Oh - one other thing is how do you expect to handle cases where we have multiple lexemes (for different lexical categories for example - nouns and verbs etc.) that have the same string value? ArthurPSmith (talk) 17:57, 2 October 2023 (UTC)

> A sense of "chirp" is "insects"?

These terms make sense for disambiguating the word before translating it into different languages (birds and insects making sounds are different words in other languages), but I agree that they are not a good sense description. Limiting the glosses to single words probably overrepresents cases like this considerably compared to the total data set. I'll reconsider whether the approach is viable.

> is this all the single-word definitions in enwikt (for lexemes with no senses)?

@ArthurPSmith I restricted it to nouns for now to reduce the scope and limit the number of different problems. But the code works on other parts of speech, too. I also only included glosses with at least one translation in my subset of the Wiktionary dataset (the one used in www.wikdict.com). Apart from these limitations and potential bugs, that is all of them.

> do you expect to handle cases where we have multiple lexemes

Yes, I match by part of speech. Karlb42 (talk) 16:40, 3 October 2023 (UTC)
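
Matching by part of speech, as described in the reply above, can be sketched as a lookup keyed on lemma plus lexical category. This is a hedged sketch rather than the import tool's actual logic: the function names and the placeholder IDs are hypothetical, and skipping a candidate when more than one lexeme matches is an assumption, not something stated in the discussion.

```python
from collections import defaultdict


def index_lexemes(lexemes):
    """Group lexeme IDs by (lemma, part of speech)."""
    index = defaultdict(list)
    for lexeme_id, lemma, pos in lexemes:
        index[(lemma, pos)].append(lexeme_id)
    return index


def match_lexeme(index, lemma, pos):
    """Return the lexeme ID only if exactly one lexeme matches, else None."""
    matches = index.get((lemma, pos), [])
    return matches[0] if len(matches) == 1 else None


# Placeholder IDs, not real Wikidata lexeme IDs.
lexemes = [
    ("L-noun", "chirp", "noun"),
    ("L-verb", "chirp", "verb"),
]
index = index_lexemes(lexemes)
print(match_lexeme(index, "chirp", "noun"))
```

Because the key includes the part of speech, a noun gloss for "chirp" is never attached to the verb lexeme with the same spelling.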

Ok, well it seems to me it wouldn't hurt to do this especially limiting it to nouns; at least this should be better than no sense at all on the lexemes. ArthurPSmith (talk) 19:16, 3 October 2023 (UTC)

Can you please generate for Russian words too? --Infovarius (talk) 20:46, 5 October 2023 (UTC)

Unfortunately, there are hardly any single-word senses in the Russian Wiktionary, if I read the data correctly. Karlb42 (talk) 12:41, 8 October 2023 (UTC)

Hm, I thought you were extracting all languages from en-wikt? I suppose there are single-word senses of Russian words in en-wikt. --Infovarius (talk) 20:12, 8 October 2023 (UTC)

I'm working with each language's own Wiktionary, so ru.wiktionary.org for Russian (with potential limitations due to the extraction from the wiki markup done by the dbnary project). Karlb42 (talk) 12:08, 15 October 2023 (UTC)

So you take German words from German Wiktionary, English words from English Wiktionary etc.? I suppose your approach is not useful then. Such glosses are often not illustrative at all. It is probably better to take foreign words from each Wiktionary together with their "one-word" translations than native words with their one-word explanations. Any objections? Infovarius (talk) 08:19, 26 October 2023 (UTC)

Mapping toponyms, grouping them in their linguistic family (using lexicographical data)

Hello, in a previous post in the main Wikidata discussion (https://www.wikidata.org/wiki/Wikidata:Request_a_query#Mapping_toponyms_organized_in_their_linguistic_family) I asked whether it is possible to link each municipality of Colombia (Q2555896) to some linguistic information, specifically to its language family (Q25295). That way I can organize them and map them. Apparently there is no etymological or toponym (Q7884789) information in Wikidata that links, for example, a city name (Chía (Q1093102)) to its language family. There is a property called native label (P1705) that may work for this, but it is not used consistently. What confuses me most is that I can't find a connection between the toponym and any data that says something about its language family.

WD's Lexicographical data, I think, can be useful here, and as I was told in the other discussion: "[...] almost certainly this is where toponym to linguistic family information should be found." My concrete question is: Is it possible to link each municipality of Colombia, or each toponym, to something inside the Lexicographical data in order to identify each of them with a "linguistic root" or language family?

I am new to all this Wikimedia world. Thank you for reading! Duityors (talk) 16:39, 25 October 2023 (UTC)