Wikidata talk:Lexicographical data/Archive/2020/08
This page is an archive. Please do not modify it. Use the current page, even to continue an old discussion.
Lexeme Forms documentation improvement for variants
Variants seem to be in a very confusing state at the moment, with several overlapping practices. I would like this discussion to help improve the documentation, since either the standards are still evolving or the standards and best practices are simply not well documented. Let's discuss that here and improve it.
This page section https://www.wikidata.org/wiki/Wikidata:Wikidata_Lexeme_Forms#Multiple_variants lacks the information needed to help with the following problem:
How should we deal with phrases or idioms that include hyphens? (I am not asking about compound words, where the compound expresses a meaning different from that of the individual words it is made from, e.g. moonlight from moon and light, or sunflower from sun and flower. I am asking only about variants where the sense is the same and only the Form or its Representation differs.)
We already know that in many languages any phrase could potentially be written either hyphenated or unhyphenated. For example:
eye to eye https://www.wikidata.org/wiki/Lexeme:L190266
eye-to-eye https://www.wikidata.org/wiki/Lexeme:L190270
This may or may not affect pattern matching and tokenization efforts with Abstract Wikipedia later on. I don't know, but maybe @Denny: could chime in on that, so I think it's worth mentioning here. I do know that in some languages certain glyphs (like the hyphen, -) are considered important enough that they sometimes change the meaning of a phrase. In other languages such glyphs are optional and do not change the meaning. For English, my hunch is that these 2 Lexemes should be merged into one, with 2 Forms: the hyphenated form and the unhyphenated form. But I'm not 100% sure, based on my current reading of the documentation and the other Talk pages floating around.
Once someone can help me update the wiki docs with this kind of example and the best practices for variants, I think it will be much easier to understand how to handle the other kinds of spelling variation that occur for a given sense's form within a single language.
I also see spelling variants being recorded like so: ax | axe https://www.wikidata.org/wiki/Lexeme:L14679
I would also like to see the documentation updated to explain how that actually works, and what the spelling variant en-x- means.
It seems there are also side discussions and questions around 'variants' handling in the following:
- Property proposal https://www.wikidata.org/wiki/Wikidata:Property_proposal/Alternative_form
- Czech Lexicographical doc https://www.wikidata.org/wiki/Wikidata:Lexicographical_data/Documentation/Languages/cs#Spelling_variants
Thanks in advance for any advice! Thadguidry (talk) 17:36, 15 August 2020 (UTC)
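The pattern-matching concern raised above can be illustrated with a small sketch. It assumes the simplest possible policy: hyphens and whitespace are treated as interchangeable separators when comparing two representations. The helper name `match_key` is hypothetical, and this policy would be wrong for any language or phrase where the hyphen changes the meaning.

```python
import re

# Runs of whitespace, ASCII hyphens, and the Unicode hyphen characters
# U+2010 and U+2011 are all treated as one separator (an assumption).
_SEPARATORS = re.compile(r"[\s\-\u2010\u2011]+")


def match_key(representation: str) -> str:
    """Return a comparison key that ignores hyphen/space differences.

    Hypothetical helper for illustration only; it is not part of any
    Wikidata or Abstract Wikipedia API.
    """
    return _SEPARATORS.sub(" ", representation).strip().casefold()


# The two representations from L190266 and L190270 get the same key.
print(match_key("eye to eye") == match_key("eye-to-eye"))
```

Note that a key like this only bridges spaced and hyphenated spellings. It does not equate either of them with a closed-up spelling, so "sun-flower" and "sunflower" would still differ.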
COMMENTS
- On a meta level across languages, I don't think you can get around considering all three approaches (i.e. as a separate form, on the same form, as a separate lexeme).
- That said, from your comment it's not entirely clear if you are primarily interested in phrases (which have additional problems) or just the color/colour type of thing.
- Phrases can have the same meaning without the same words being present in every use of the phrase. As a linguist once summarized it, they have key elements that are always present. --- Jura 07:15, 16 August 2020 (UTC)
- @Jura1: In general it is about handling spelling variants (specifically about helping later with compound analysis like in Lucene's analysis class for Hyphen) where words change spelling if they are split across lines, like German's `backen`, which hyphenates to `bak-ken`. But the question is also about intra-word delimited phrases (words split into subwords), which you might liken to the WordDelimiterFilter, where the hyphens are often optionally kept as part of the phrase. I think I really need Denny's answer here, as he will understand my underlying question a bit better. I'm probably not explaining it well enough. --Thadguidry (talk) 00:06, 21 August 2020 (UTC)
- @ArthurPSmith: Maybe alternative form plays a role in this same discussion as well? --Thadguidry (talk) 00:22, 21 August 2020 (UTC)
- If the question is mainly about hyphens and hyphenation, have a look at hyphenation (P5279). As for L190266 and L190270, I'd merge them and use one lexeme with two forms. It can't really be on the same form, as creating a language code for the hyphen-less format would be odd. (BTW I had pinged its creator, maybe it was made as 2 for some purpose.) --- Jura 05:02, 21 August 2020 (UTC)
- @Thadguidry: In general I'd prefer having a single lexeme for hyphenated and non-hyphenated otherwise-identical phrases. However in cases where there may be a difference in meaning (not just of grammatical aspects) or in etymology, then separate lexemes may be justified. It does sound like we need to have separate forms. And yes the proposed property could be used for this. ArthurPSmith (talk) 13:50, 21 August 2020 (UTC)
- @ArthurPSmith: Hmm, merge doesn't complete...just spins (and nothing in the js console...darn). Also, do you have a preference where I should document our mutual agreement on this best handling for hyphenated words? -- Thadguidry (talk) 15:17, 21 August 2020 (UTC)
- It seems to me that there is nothing really special about that, with the exception that hyphenation (P5279) should generally be used. BTW, please see Help:Merge#Special:MergeLexemes about merging. --- Jura 07:31, 22 August 2020 (UTC)
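The "one lexeme with two forms" outcome suggested in the comments above can be sketched as a toy in-memory model. This is only an illustration under stated assumptions: the `Form` and `Lexeme` classes and the `merge_variant` function are hypothetical, they are not the Wikibase data model, and they do not reproduce what Special:MergeLexemes actually does.

```python
from dataclasses import dataclass, field


@dataclass
class Form:
    representation: str  # the written form, e.g. "eye-to-eye"


@dataclass
class Lexeme:
    lexeme_id: str
    lemma: str
    forms: list[Form] = field(default_factory=list)


def merge_variant(target: Lexeme, source: Lexeme) -> Lexeme:
    """Fold the source lemma and its forms into the target as extra forms.

    Hypothetical helper: duplicates are skipped by exact representation.
    """
    seen = {form.representation for form in target.forms}
    for form in [Form(source.lemma), *source.forms]:
        if form.representation not in seen:
            target.forms.append(form)
            seen.add(form.representation)
    return target


plain = Lexeme("L190266", "eye to eye", [Form("eye to eye")])
hyphenated = Lexeme("L190270", "eye-to-eye", [Form("eye-to-eye")])
merged = merge_variant(plain, hyphenated)
print([form.representation for form in merged.forms])
```

Keeping the two spellings as separate forms, rather than as two representations on one form, follows the point made above that there is no obvious language code to distinguish the hyphen-less spelling.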
Hello, because you deal with lexicographical data on Wikidata, could you give your opinion about this property proposal, especially on whether it should be applied to items or to lexemes? Pamputt (talk) 06:45, 25 August 2020 (UTC)
Looking for a co-speaker to present Lexemes to the Italian-speaking community
Hello all,
I got asked to present Wikidata Lexemes & how they can be used on Wiktionary at the Italian WikiCon (taking place online on October 24-25, the presentation would be in English). I can of course give an overview of Lexemes, the features, etc. but I think it would be much more interesting if there's also someone from the community who is editing Lexemes, and trying to connect Lexemes and Wiktionary content.
Would anyone be interested to work with me on giving this presentation? Thanks in advance :) Lea Lacroix (WMDE) (talk) 09:58, 31 August 2020 (UTC)
- @Lea Lacroix (WMDE): I'd be down to do it, as I've recently given similar introductions to speakers of Marathi and Sanskrit. (Not sure how connecting Wiktionary to lexemes has been going in the absence of work on phab:T212843 and its subtasks.) Mahir256 (talk) 12:26, 31 August 2020 (UTC)