Wikidata talk:Lexicographical data/Archive/2019/10

This page is an archive. Please do not modify it. Use the current page, even to continue an old discussion.

ASJP import? No

Today I learned about en:Automated Similarity Judgment Program, and that their dataset is CC0. The project tries to collect the words for a short, central list of 40 concepts in all the world's languages.

Shall we set up a Wikiproject to import it to Wikidata? Reach out to the ASJP folks, let them know?

I like this idea because it would help with setting up an initial set of lexemes for many languages, and thus we would have coverage no matter what we test. Who's interested? --Denny (talk) 16:55, 4 October 2019 (UTC)

@Denny: I'm not sure this data is very well suited to our lexeme approach. At the least, it would take considerable work to map it in, I think. Issues I have: (A) the "words" are recorded in what seems to be an idiosyncratic romanization, not in the native writing system (though I expect that many of these languages do not have a standard written form so that might work for them). See for example the words for "person" here, or the English page. (B) They appear to only have a single form for each word (which we could use, but that limits things). (C) They have 7655 "word lists" (I think that means languages?) which is way more than we support here - so we'd need to either chop the list down to what we know about, or figure out how to support MANY more languages here! All that said, it does seem like something very related to what we're trying to do, so I think reaching out to them would be a great first step! ArthurPSmith (talk) 18:08, 4 October 2019 (UTC)
Denny, where have you seen that the ASJP data are CC0? The main page says that the ASJP Database is licensed under a CC BY 4.0 licence. Pamputt (talk) 20:07, 4 October 2019 (UTC)
@Pamputt:, ah, darn, I misread the logo. My mistake, you are right. Sorry. This topic can be archived. :( --Denny (talk) 22:43, 4 October 2019 (UTC)
The licence matters only for definitions; spellings and translations are pure facts, which are not copyrightable at all. So we can use them, I suppose (adjusting wrong scripts, of course). --Infovarius (talk) 19:33, 7 October 2019 (UTC)
I am not a lawyer, so I do not know. May I ask a naive question, then? If the word lists are not protected by CC BY 4.0, which content of the website is covered by this licence? Do you think this is copyfraud? At least, I think we should contact them to get their opinion. Pamputt (talk) 13:38, 8 October 2019 (UTC)
"7655 languages"? That's cool, we should support them all! Yes, Amir? ;-) --Infovarius (talk) 19:33, 7 October 2019 (UTC)
User:Infovarius, I think it's the first time I have seen a website with content in more languages than jesusfilm.org! Even if it's very little content in each, it's still impressive.
And yes, we should support all of these languages eventually. It's knowledge, and we want all human knowledge, and this means all languages, even the extinct ones.
It's not exactly a dictionary but rather a project with a particular purpose that I'm personally less interested in; still, it could also be used as a dictionary.
They indeed use a somewhat unusual romanization, but from a quick look it appears to be consistent, so it should be usable. If not for Wikibase Lexemes, then maybe for Wiktionary. --Amir E. Aharoni (talk) 10:49, 8 October 2019 (UTC)

Storing word components

Some (many?) languages use word composition -- combining simple components (prefixes, roots, interfixes, suffixes, and endings) to create new words. Russian is that way for sure, but I think it is also common in German and Finnish. English has some of that too - in "prepend", "pre" implies "before", and the root of the word has the sense of addition/joining (?). We already have combines lexemes (P5238) and root (P5920) properties, plus the series ordinal (P1545) qualifier, implying two ways to store the data:

with combines lexemes (P5238)

Not sure how to indicate the type of the lexeme part here, or if it's even needed. We may even have to store parts with dashes, e.g. -suffix, -interfix-, +ending, prefix-, .... The dash/plus would also immediately make it clear that the given lexeme is not a word, but rather a part of a word.

"prepend" (en)
combines lexemes (P5238)  ->  link to "pre- (prefix)" lexeme
  series ordinal (P1545) = 1
combines lexemes (P5238)  ->  link to "pend (root)" lexeme
  series ordinal (P1545) = 2
with root (P5920) + ...
"prepend" (en)
prefix (new prop) -> link to "pre- (prefix)" lexeme
  series ordinal (P1545) = 1
root (P5920) -> link to "pend (root)" lexeme
  series ordinal (P1545) = 2

The second approach visually disconnects the different parts of the word across multiple properties, which is also not that great, but it allows the data user to tell word parts apart without looking at the part lexemes themselves... Which approach should we use? --Yurik (talk) 16:41, 10 September 2019 (UTC)

What about doing both? ArthurPSmith (talk) 17:14, 12 September 2019 (UTC)
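
For what it's worth, here is a minimal Wikidata Query Service sketch of how the first approach (combines lexemes (P5238) ordered by the series ordinal (P1545) qualifier) could be read back. It is only a sketch under the modelling assumed above; the English-language filter (wd:Q1860) is an illustrative choice, not something from this discussion.

# Sketch: read back word components stored with "combines lexemes" (P5238),
# ordered by the "series ordinal" (P1545) qualifier. Assumes the modelling
# sketched above; the language filter is only an example.
SELECT ?lexeme ?lemma ?part ?partLemma ?ordinal WHERE {
  ?lexeme dct:language wd:Q1860 ;        # English lexemes (example filter)
          wikibase:lemma ?lemma ;
          p:P5238 ?statement .
  ?statement ps:P5238 ?part ;
             pq:P1545 ?ordinal .
  ?part wikibase:lemma ?partLemma .
}
ORDER BY ?lexeme xsd:integer(?ordinal)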

What should be the lexeme of an adjective based on a noun

If I describe the word Athenian, should the lexeme be Athenian or Athena?

I am using Hspell to upload Hebrew lexemes, and they use the base word as the lexeme, even though it has a different part of speech. I changed some manually, but I'm not sure what the right way to deal with it is. Uziel302 (talk) 10:11, 26 October 2019 (UTC)

@Uziel302: It may be language-dependent; for English I definitely think the lexeme should exhibit the lexical category being referred to, but if it makes sense in Hebrew to do it the way you suggest then it doesn't seem like that would be harmful. ArthurPSmith (talk) 12:49, 26 October 2019 (UTC)
ArthurPSmith, thanks for your opinion; I tend to agree. It felt wrong to show the base word with the wrong part of speech, but since that's the way the Hspell folks built it, I think it is forgivable until I get to it and decide how to change it. It may not be obvious which form to select to represent the lexeme. Uziel302 (talk) 12:55, 26 October 2019 (UTC)

Storing "corresponds to" words

In gender-aware languages, nouns often have feminine and masculine versions. How should they link to each other? For example, a lexeme "doctor (feminine)" should have a connection to "doctor (masculine)", and the reverse. --Yurik (talk) 20:50, 11 September 2019 (UTC)

MachtSinn: new tool to quickly add Senses to Lexemes

We have a huge number of lexemes that lack senses, and often we also have items describing the very concept a sense of such a lexeme would describe. I therefore wrote a tool that matches the two and suggests missing senses for lexemes: MachtSinn. You can log in with your Wikidata account and quickly endorse or reject potential matches. If you endorse a match, the sense is automatically added to the lexeme and linked to the corresponding item with your account. It works with every language. Since I'm not good at design and CSS, the design of the site is a bit minimal – help is welcome. The code can be found on GitHub. -- MichaelSchoenitzer (talk) 20:43, 22 September 2019 (UTC)
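
For readers curious how such candidates might be surfaced, here is a rough Wikidata Query Service sketch of the same idea: nouns without senses whose lemma matches an item label. To be clear, this is not MachtSinn's actual implementation (as noted further down, the tool works from a pre-computed local database, and a naive query like this would likely time out); the German-language restriction is only an example.

# Sketch only: lexeme–item candidates of the kind MachtSinn suggests.
# Not the tool's own query; a broad query like this will likely time out
# on WDQS, which is why the tool pre-computes matches offline.
SELECT ?lexeme ?lemma ?item WHERE {
  ?lexeme dct:language wd:Q188 ;               # German lexemes (example)
          wikibase:lexicalCategory wd:Q1084 ;  # nouns
          wikibase:lemma ?lemma .
  FILTER NOT EXISTS { ?lexeme ontolex:sense ?sense . }
  ?item rdfs:label ?label .
  FILTER(LANG(?label) = "de" && STR(?label) = STR(?lemma))
}
LIMIT 50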

MichaelSchoenitzer, this is awesome! Could you please add keyboard shortcuts for each button, e.g. "s", "r", and "n", to quickly perform the command without the mouse? And also add them as tooltips on the buttons for easy discoverability? Also, please show the lexical category (noun/adjective/...) next to the word, and possibly some well-known top-level claims (e.g. grammatical gender and "has quality" values), and the list of forms?
And one other thing - some lexemes are duplicated on purpose despite having identical spellings: they correspond to different meanings, and might have different origins and different forms - you might want to warn users when the current word has more than one lexeme. For example, L99999 and L100000 are both "мир", one in the meaning of peace (so no plural forms), and another in the meaning of the world (which could have a plural, i.e. worlds), and we wouldn't want to attach the wrong sense. Thank you for an awesome tool! --Yurik (talk) 01:26, 23 September 2019 (UTC)
@MichaelSchoenitzer: Wow, that's addictive! I have noticed a few issues that could maybe be improved (let me know if I should add this at github): (1) It seems to repeat some matches after I had hit "next" on them (after I did many others in between). (2) It doesn't seem to check that the match is already there? Maybe this is due to a WDQS delay? For example I matched Q983927 to L24318 (by hand, after your system had suggested Q58795659) but then less than an hour later I was given that specific suggestion. (3) There seem to be a lot of suggestions from "heraldic figures" or elements of some genome, rather than what I would think would be more common links. Maybe suggestions should be prioritized by number of sitelinks or some other measure of popularity of the item? ArthurPSmith (talk) 18:58, 23 September 2019 (UTC)
3) Ideally, the tool would give the complete list of items with a specific label and let you mark the ones that should be added to a specific lexeme. But that is a dream :) --Infovarius (talk) 17:00, 24 September 2019 (UTC)
@Yurik, ArthurPSmith, Infovarius: Answering the questions: The tool at the moment only contains nouns (that already gives enough results for now). (2) 'Next' just fetches a new random potential match, so yes, if the pool of matches in your language is small, it might repeat soonish. "Reject" marks a match as a false positive so that it won't be shown again (to anyone). (3) The tool uses matches that are saved in a local database (WDQS wouldn't be fast enough), so yes, it might show a match that someone already added by hand – however, if a match is saved with the tool, it is removed from the pool and should never be shown again. In any case the tool checks whether the match is already there before saving, so it should never add duplicates.
I'm currently out of time to invest in the tool, but feel free to make pull requests; I'll merge them and update the tool. The hotkeys in particular sound like an awesome little improvement. -- MichaelSchoenitzer (talk) 20:41, 24 September 2019 (UTC)

This is brilliant! Thank you so much. And it is fun too! It looks like this tool helped increase the number of senses by several percent within a few hours – this is pretty awesome! --Denny (talk) 03:44, 25 September 2019 (UTC)

Warning! There's a problem with homonymous lexemes! The tool doesn't distinguish them and tries to add each sense to each of them :( --Infovarius (talk) 20:21, 26 September 2019 (UTC)

@Infovarius, Yurik: I blacklisted all homonyms (as well as duplicates). -- MichaelSchoenitzer (talk) 17:23, 28 September 2019 (UTC)

Fantastic work! Love it! Liamjamesperritt (talk) 04:37, 30 October 2019 (UTC)

Linking king and queen

Hello! How should we link king (L9670) and queen (L1380)? -Theklan (talk) 18:37, 23 October 2019 (UTC)

There are some cases (maybe ones like "actor" and "actress") where they should be put under the same lexeme as different forms. Particular senses can also be linked as synonyms or antonyms, which may be the better solution for "king" and "queen". They can also be linked through item for this sense (P5137) and the relations on the corresponding Wikidata items, which is perhaps the best generic solution. ArthurPSmith (talk) 18:55, 23 October 2019 (UTC)
Briefly: there is no common way to do this :) --Infovarius (talk) 22:03, 23 October 2019 (UTC)
I'm proposing a new property for this. -Theklan (talk) 13:43, 25 October 2019 (UTC)
I am not sure the lexemes themselves should be linked (rather than their senses). One problem for Danish (Q9035) is that skuespillerinde (L205164) means a female actor (actress), while skuespiller (L46039) in modern Danish means a gender-agnostic actor, though in older Danish it might have referred exclusively to male actors. Possibly the property should be at the sense level. In the Danish case, the "female" lexeme is currently linked by combines lexemes (P5238) due to a feminine suffix (Q71282088) (-inde (L52286)). — Finn Årup Nielsen (fnielsen) (talk) 14:34, 29 October 2019 (UTC)
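
To illustrate the P5137 route ArthurPSmith mentions, here is a small Wikidata Query Service sketch that hops from a sense to its item, follows an item-level relation, and comes back to other lexemes. The choice of opposite of (P461) is purely illustrative; whether the relevant items (e.g. for king and queen) actually carry such a statement is an assumption.

# Sketch: find "counterpart" lexemes via "item for this sense" (P5137)
# plus an item-level relation. P461 ("opposite of") is only an example;
# any suitable item-level property could be substituted.
SELECT ?lexeme ?lemma ?otherLexeme ?otherLemma WHERE {
  ?lexeme ontolex:sense/wdt:P5137 ?item ;
          wikibase:lemma ?lemma .
  ?item wdt:P461 ?otherItem .
  ?otherLexeme ontolex:sense/wdt:P5137 ?otherItem ;
               wikibase:lemma ?otherLemma .
}
LIMIT 50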

Milestone - 200k lexemes

My bot just created spiritualistically (L200000), the 200000th lexeme, while importing Wiktionary adverbs! It means "in a way relating to being spiritual".  – The preceding unsigned comment was added by SixTwoEight (talk • contribs) at 22:04, 11 October 2019 (UTC).

Congratulations! Cheers, — Envlh (talk) 19:51, 17 August 2022 (UTC)