Wikidata talk:Lexicographical data/Archive/2023/07

This page is an archive. Please do not modify it. Use the current page, even to continue an old discussion.


On Ukrainian Wikisource there are pages with word senses from proofread dictionaries (see призьба, for example). Would it be possible to somehow link these pages with lexemes? Bicolino34 (talk) 15:12, 18 June 2023 (UTC)

gloss quote (P8394) + described by source (P1343) + Wikisource link as reference may be used for this —Vis M (talk) 08:20, 22 June 2023 (UTC)
described by source (P1343) is indeed the best way to go (for any dictionary, on Wikisource or not). See ki (L69) for an example of a link to a Breton dictionary on the French Wikisource. Regards, VIGNERON (talk) 10:40, 1 July 2023 (UTC)

P5402 - homographs

Hi y'all,

For the record, I did some QuickStatements imports of the homograph lexeme (P5402) property based on the main lemma of Lexemes, around 50,000 in total. I did several batches, language by language, and only based on the main lemma. I didn't take secondary lemmata like "ama" on ama/𒂼 (L1), as a transcription doesn't seem to fit the definition of a homograph, and I also didn't look at the forms, as that was often too heavy for SPARQL queries and I'm not sure how it should be stored exactly, e.g. fils (L15917) and fils (L10371-F2).

Although this property may not be the most important one for Lexemes, as Lexemes become more and more numerous (and thus tools like SPARQL get heavier and time out more and more often), it seems important to do this import (and better a crude one than nothing). Also, it's more explicit and appears directly on the lexemes (thus signaling potential duplicates; I'm thinking in particular about the African languages pointed out above by GZWDer).

Pinging top users of the homograph lexeme (P5402) property @Jon Harald Søby, Hameryko, 白布飘扬, Nikki, عُثمان: (source: Navel Gazer)

Cheers, VIGNERON (talk) 13:30, 5 July 2023 (UTC)

@VIGNERON This seems fine, I'm mostly curious how you prepare these batches. I would like to try this but with diacritics stripped from lemmas in languages where it makes sense to treat them as homographs. عُثمان (talk) 13:48, 5 July 2023 (UTC)
@عُثمان: first I ran some exploratory queries (to see how many homographs were missing, in which languages, etc.), and then the main part was, language by language, a query like this:
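SELECT ?lemma ?l1 ?l2 WHERE {
  ?l1 dct:language wd:Q150 ;     # language (here French, Q150)
      wikibase:lemma ?lemma .
  ?l2 wikibase:lemma ?lemma .    # another lexeme with the same lemma
  FILTER (str(?l1) < str(?l2))   # list each pair once and skip self-matches
  MINUS { ?l1 wdt:P5402 ?l2 }    # drop pairs already linked in either direction
  MINUS { ?l2 wdt:P5402 ?l1 }
}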
And then I simply format the results in Calc and send them to QuickStatements.
I forgot to mention an important point: I mainly worked on languages written in the Latin alphabet. I could do the ones written in the Arabic abjad if needed (and if someone gives me some hints ;) I had a look at Hebrew, but there are some strange cases like (L68471) and שופכה/שָׁפְכָה (L68273): the lemma with diacritics is the same but the lemmata without are unexpectedly different, so I preferred not to touch them.
Regards, VIGNERON (talk) 16:59, 5 July 2023 (UTC)
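The formatting step described above (query results to QuickStatements) could also be scripted instead of done in a spreadsheet. The following is only a minimal sketch under stated assumptions: that the query results are exported as CSV with the columns lemma, l1 and l2 holding entity URIs, that QuickStatements accepts lexeme IDs as subject and value in its tab-separated syntax, and that the statement is wanted in both directions. The function names and the sample row are hypothetical and not taken from the thread.

```python
import csv
import io

HOMOGRAPH_PROPERTY = "P5402"  # homograph lexeme


def lexeme_id(value):
    """Return the bare ID (e.g. 'L10371') from an ID or an entity URI."""
    return value.strip().rsplit("/", 1)[-1]


def pairs_to_quickstatements(csv_text, symmetric=True):
    """Turn CSV rows with 'l1' and 'l2' columns into tab-separated commands.

    Assumption: one command per line in the form subject<TAB>property<TAB>value.
    With symmetric=True the reverse statement is emitted as well.
    """
    commands = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        first, second = lexeme_id(row["l1"]), lexeme_id(row["l2"])
        commands.append(f"{first}\t{HOMOGRAPH_PROPERTY}\t{second}")
        if symmetric:
            commands.append(f"{second}\t{HOMOGRAPH_PROPERTY}\t{first}")
    return commands


# Hypothetical sample row, shaped like a CSV export of the query above.
sample = (
    "lemma,l1,l2\n"
    "fils,http://www.wikidata.org/entity/L10371,"
    "http://www.wikidata.org/entity/L15917\n"
)
print("\n".join(pairs_to_quickstatements(sample)))
```

Whether both directions are needed, and whether a batch should be checked by hand first, is a judgment call for whoever runs the import; pass symmetric=False to emit only one direction.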
@VIGNERON: Thanks, I had been looking for a query to find this type of lexeme for some time, and this was useful. The duplicate lexemes in Spanish that did not have this property are already corrected (by adding the homograph lexeme (P5402) property or by merging). --Hameryko (talk) 20:55, 5 July 2023 (UTC)
@VIGNERON Thanks! I don't know much about Hebrew but I can give some tips regarding the Arabic script. I will preface this by noting that I treat affix, suffix, and prefix lexemes separately. Only if a suffix is a homograph to another suffix will I link them together, since I think linking a suffix to a homograph word can be misleading, since we never see suffixes independently from a word. (Unlike the Latin script, hyphens are not used to differentiate these.)
Generally, if the lemmas are stripped of any of the diacritics I have written over "be" here and compared to each other, they may be considered homographs, because in the context of a larger work any of them may be omitted at the discretion of the author: بِ بّ بُ بَ بْ بٰ بٖ . On the other hand, the fathatan-alef is something that I would leave and consider part of the full spelling اً (this results in a consonant sound at the end of a word and is typically included in writing). Alef madd is also always written and is not equivalent to plain alef آ
Persian has a simpler vowel phonology than most languages written with the Arabic script and accordingly the lemmas do not include diacritics as this is the preference in modern Persian writing. So you don't have to worry about any of the above for Persian.
For Punjabi (pnb) in particular, the letter ݨ can be treated as canonically equivalent to ن . They represent different sounds but the first letter is a newer incorporation into the alphabet and isn't used by everybody yet. So we may consider کھاݨا and کھانا to be homographs. This is not the case for other languages which use both of these letters.
Besides the Arabic script, for Devanagari and Gurmukhi we can consider these letters equivalent to their dotless counterparts:
Devanagari: क़ क, ख़ ख, ग़ ग, ज़ ज, फ़ फ, ड़ ड, ढ़ ढ
Gurmukhi: ਸ਼‌ ਸ, ਖ਼ ਖ, ਗ਼ ਗ, ਜ਼ ਜ, ਫ਼ ਫ, ਲ਼ ਲ عُثمان (talk) 09:10, 6 July 2023 (UTC)
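The tips above can be read as rules for building a comparison key per lemma, so that two lemmas with the same key are homograph candidates. The sketch below is one possible reading, not a tested import tool. The code points are my own mapping of the characters shown in the comment (fatha, damma, kasra, shadda, sukun, superscript alef, subscript alef), and the function name and the lang parameter are hypothetical. Fathatan and the maddah of alef madda are deliberately left in place, the nukta is dropped only after the base letters listed for Devanagari and Gurmukhi, and ݨ is mapped to ن only when the language code is pnb.

```python
import unicodedata

# Optional Arabic-script marks: fatha, damma, kasra, shadda, sukun,
# superscript alef, subscript alef. Fathatan (U+064B) is not included.
ARABIC_OPTIONAL_MARKS = frozenset("\u064E\u064F\u0650\u0651\u0652\u0670\u0656")

# Nukta signs for Devanagari and Gurmukhi.
NUKTAS = frozenset("\u093C\u0A3C")

# Base letters whose dotted variants were listed as equivalent.
NUKTA_BASES = frozenset(
    "\u0915\u0916\u0917\u091C\u092B\u0921\u0922"  # Devanagari ka kha ga ja pha dda ddha
    "\u0A38\u0A16\u0A17\u0A1C\u0A2B\u0A32"        # Gurmukhi sa kha ga ja pha la
)


def homograph_key(lemma, lang=None):
    """Return a comparison key; equal keys suggest homograph candidates."""
    kept = []
    # NFD splits precomposed letters (e.g. U+0958) into base + nukta.
    for ch in unicodedata.normalize("NFD", lemma):
        if ch in ARABIC_OPTIONAL_MARKS:
            continue
        if ch in NUKTAS and kept and kept[-1] in NUKTA_BASES:
            continue
        if lang == "pnb" and ch == "\u0768":  # noon with small tah
            ch = "\u0646"                     # plain noon
        kept.append(ch)
    # NFC recomposes alef + maddah back into alef madda (U+0622).
    return unicodedata.normalize("NFC", "".join(kept))


# The Punjabi example from the comment: equal keys only under pnb.
with_tah = "\u06A9\u06BE\u0627\u0768\u0627"
with_noon = "\u06A9\u06BE\u0627\u0646\u0627"
print(homograph_key(with_tah, "pnb") == homograph_key(with_noon, "pnb"))
```

Lemmas would then be grouped by key within one language; any group with more than one lexeme is a candidate list to review rather than a list to import blindly.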