Wikidata talk:Lexicographical data/Archive/2021/12

This page is an archive. Please do not modify it. Use the current page, even to continue an old discussion.

male and female variants of lexemes


In German, almost every lexeme that represents an occupation has a male (supposedly gender-neutral) and a female variant. Example:

Kanzler (L204275) → item for this sense (P5137) → (implicitly male) Chancellor of Germany (Q56022)

Kanzlerin (L204274) → item for this sense (P5137) → (explicitly female) Chancellor of Germany (Q56022)

Now, Chancellor of Germany (Q56022) represents neither an implicitly male head of the German government nor an explicitly female one. Since both senses have the property item for this sense (P5137) → Chancellor of Germany (Q56022), they could be considered synonymous, but that's not true. How do we model this discrepancy? I propose:

Kanzler (L204275) → item for this sense (P5137) → Chancellor of Germany (Q56022)
sex or gender (P21) (qualifier) → implicitly male (item does not exist yet)

Kanzlerin (L204274) → item for this sense (P5137) → Chancellor of Germany (Q56022)
sex or gender (P21) (qualifier) → female (Q6581072)
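
To make the discrepancy concrete, here is a minimal Python sketch of the proposal. It is not Wikidata's data model or API: the dictionary layout and both function names are hypothetical, and "implicitly male" is a placeholder string because no such item exists yet. It shows how a naive "same item means synonym" check conflates the two senses, while also comparing a sex or gender (P21) qualifier keeps them apart.

```python
# Hypothetical in-memory stand-ins for the two senses; not real Wikibase JSON.
SENSES = {
    # "implicitly male" is a placeholder: the item does not exist yet.
    "L204275": {"lemma": "Kanzler", "P5137": "Q56022", "P21": "implicitly male"},
    "L204274": {"lemma": "Kanzlerin", "P5137": "Q56022", "P21": "Q6581072"},
}


def naive_synonyms(a: str, b: str) -> bool:
    """Treat two senses as synonyms when they share 'item for this sense'."""
    return SENSES[a]["P5137"] == SENSES[b]["P5137"]


def qualified_synonyms(a: str, b: str) -> bool:
    """Additionally require the proposed P21 qualifier values to match."""
    return naive_synonyms(a, b) and SENSES[a]["P21"] == SENSES[b]["P21"]
```

With this toy data, `naive_synonyms("L204275", "L204274")` is true while `qualified_synonyms("L204275", "L204274")` is false, which is the distinction the qualifier is meant to capture.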

Thoughts? --Shisma (talk) 10:18, 21 November 2021 (UTC)

Suboptimal but this is what I use too. --Infovarius (talk) 18:40, 22 November 2021 (UTC)

@Shisma, Infovarius: After some discussion with @Nikki: about it, and considering how attributes of nouns and relations between them may be modeled, I have done the following:

Added a new sense on "Kanzler" for specifically male chancellors of Germany, keeping L204275-S2 for the meaning "chancellor, regardless of gender". (How to mark its use with this meaning as "controversial" is yet to be determined.)
Linked both L204275-S5 and L204274-S1 to L204272-S2 via "hyperonym", since a "male chancellor" and a "female chancellor" are both "chancellor"s.
Removed "item for this sense: Chancellor of Germany" from L204274-S1 (and did not add it to L204275-S5), since the item for "Chancellor of Germany" does not specify a particular gender for the holder of that position.
Added "has thematic relation: attribute" with qualifier "object sense: 'männlich'-S1/'weiblich'-S1" as appropriate to both L204275-S5 and L204274-S1, so that (along with the 'hyperonym's on both senses) paraphrases into "male chancellor" and "female chancellor" in languages which lack a similar pair to "Kanzler/Kanzlerin" become possible.

As with "has thematic relation"'s use on verbs, use of this property for such specifiers on nouns/adjectives is very likely to be more flexible in the long run than random qualifiers on P5137. Mahir256 (talk) 21:58, 22 November 2021 (UTC)
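
As a rough illustration of how the hyperonym and attribute links described above could drive a paraphrase, here is a small Python sketch. Everything in it is an assumption for illustration: the dictionary layout, the English glosses, the sense keys for 'männlich' and 'weiblich', and the `paraphrase` helper are hypothetical, and the hyperonym target ID is copied as written in the comment above.

```python
# Hypothetical, simplified view of the senses; glosses are illustrative English
# labels, not data read from Wikidata.
SENSES = {
    "L204272-S2": {"gloss": "chancellor"},  # hyperonym target as written above
    "L204275-S5": {"hyperonym": "L204272-S2", "attribute": "männlich-S1"},
    "L204274-S1": {"hyperonym": "L204272-S2", "attribute": "weiblich-S1"},
    "männlich-S1": {"gloss": "male"},
    "weiblich-S1": {"gloss": "female"},
}


def paraphrase(sense_id: str) -> str:
    """Return a gloss, or build one as '<attribute> <hyperonym>'."""
    sense = SENSES[sense_id]
    if "gloss" in sense:
        return sense["gloss"]
    return f"{paraphrase(sense['attribute'])} {paraphrase(sense['hyperonym'])}"
```

Here `paraphrase("L204274-S1")` yields "female chancellor" and `paraphrase("L204275-S5")` yields "male chancellor", without any gender information being attached to the item for this sense.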

Having a link to the German lexeme "männlich" is strange and hard to query, I think. I would advocate for some language-independent value. --Infovarius (talk) 21:59, 27 November 2021 (UTC)

@Infovarius: Describing German lexemes with other German lexemes (as opposed to, say, Bengali or Russian ones) seems quite appropriate to me, since it has been made manifestly clear, both through the numerous complaints about Wikidata's superclass structure and the 'tough love' Anasuya Sengupta gives to Wikidata, that the single ontology we have will eventually fail languages in different ways. If we are going to describe the components of individual languages, a good start would be to do this in terms of themselves, and the P5137/P9970 links that do exist should only serve to complement it rather than support it. Mahir256 (talk) 02:00, 28 November 2021 (UTC)

I agree, that's odd. But I also agree that giving the male form two senses (one specifically male) and connecting them via hyperonym (P6593) makes sense. Example: Hufschmied (L620801)/Hufschmiedin (L621137). It's fairly simple to implement. As for the rest, I'd rather stick with sex or gender (P21) or not do anything at all. It would just be nice to be able to query female forms of occupations – Shisma (talk) 12:45, 4 December 2021 (UTC)

Wikidata compared to Dux-Liliput


English entries in the bilingual dictionary without a lexeme: https://w.wiki/4Uft --- Jura 19:07, 30 November 2021 (UTC)

@Jura1: There's something wrong with your query, it's not picking up forms even though it looks like it should. "has", "his", "were" etc. are definitely in our English lex data. ArthurPSmith (talk) 18:28, 3 December 2021 (UTC)

@Jura1: +1 to ArthurPSmith; it should be ontolex:lexicalForm/ontolex:representation and not ontolex:lexicalForm. Other parts of this query are also oddly constructed (for instance, en-us is never used on Lexemes, and there are other codes like en-ca, and probably more in the future). Also, to sanitize the entry title, one could do something like BIND (if(contains(?title, " ("), strBefore(?title, " ("), ?title) AS ?lemma). Regards, VIGNERON (talk) 10:12, 5 December 2021 (UTC)
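
For readers less familiar with SPARQL string functions, the title clean-up suggested above can be sketched in Python. This is only an equivalent of the BIND expression, not part of the query itself; the function name and the sample titles are hypothetical.

```python
def lemma_from_title(title: str) -> str:
    """Drop everything from the first ' (' onward, mirroring the BIND above."""
    marker = " ("
    if marker in title:
        # Like strBefore(): keep the text before the first occurrence.
        return title.split(marker, 1)[0]
    return title
```

For example, a title such as "bank (river)" would be reduced to "bank", while a title without a parenthetical, such as "has", is returned unchanged.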

Yeah, the query could be improved. Ideally only forms would be checked, but somehow that would lead to incomplete results.

We may lack a good way to do such checks. Subcodes in forms are somewhat problematic. Maybe forms should include a normalized triple, either with the form as a plain string value ("flavour") or with the main language code ("flavour"@en), possibly both. Didn't we also plan to have reverse strings as triples? --- Jura 11:37, 6 December 2021 (UTC)

missing triples for forms


To follow up on Wikidata_talk:Lexicographical_data#Wikidata_compared_to_Dux-Liliput:

As a sample L:L34331#F1 currently includes the following triples for "flavour":

wd:L34331-F1 rdf:type ontolex:Form
wd:L34331-F1 wikibase:grammaticalFeature wd:Q3910936
wd:L34331-F1 ontolex:representation flavour@en-gb

The suggestion is to add (either or both):

wd:L34331-F1 wikibase:normalizedlangform flavour@en
wd:L34331-F1 wikibase:normalizedstringform flavour

The datatype of the first would be rdf:langString (as for monolingual strings); the datatype of the second would be a plain string.

For the Dux-Liliput usecase, the #1 would be better, but for more general string comparison #2 might be a better fit. The language could also be obtained from the lemma. I hesitate between one, the other, or both.
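
A minimal Python sketch of how the two suggested triples could be derived from an existing representation. The predicates wikibase:normalizedlangform and wikibase:normalizedstringform are only the names proposed in this thread and do not exist in the RDF export; the function name and the choice to keep only the primary language subtag are assumptions.

```python
def normalized_triples(form_id: str, representation: str, lang_tag: str):
    """Build the two proposed triples for one form representation."""
    # Assumption: "main language code" means the part before the first hyphen.
    primary = lang_tag.split("-", 1)[0].lower()
    return [
        (form_id, "wikibase:normalizedlangform", f'"{representation}"@{primary}'),
        (form_id, "wikibase:normalizedstringform", f'"{representation}"'),
    ]
```

Calling `normalized_triples("wd:L34331-F1", "flavour", "en-gb")` gives one triple with the object "flavour"@en and one with the plain string "flavour", matching the two options above.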

Compared to triples for other features, this doesn't include that much data. Possibly we could drop the "rdf:type ontolex:Form" one. I doubt it helps much or adds any information.

When completing these triples, we could also add reverse strings, i.e. "ruovalf" for the above. The triple would be:

wd:L34331-F1 wikibase:reversestringform ruovalf

I think this is indeed preferable to adding them as statements. --- Jura 09:53, 8 December 2021 (UTC)
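
A short Python sketch of the reverse-string idea. The predicate wikibase:reversestringform is only a proposal from this thread; the helper names and the index structure are hypothetical. Using reversed strings for suffix matching is an assumed use case, not something stated above.

```python
def reverse_string_form(representation: str) -> str:
    """Reverse a form by code point, e.g. for a reversed-string triple."""
    # Note: combining characters would need extra care in real data.
    return representation[::-1]


def forms_ending_with(suffix: str, reversed_index: dict) -> list:
    """Find form IDs whose original string ends with `suffix`.

    `reversed_index` is a hypothetical mapping of reversed string -> form ID,
    so a suffix search becomes a prefix search.
    """
    key = reverse_string_form(suffix)
    return sorted(fid for rev, fid in reversed_index.items() if rev.startswith(key))
```

With an index built from "flavour", looking up the suffix "our" reduces to checking which reversed strings start with "ruo".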

Initial version of Lua access to Lexemes on Bengali and Basque Wiktionaries


Hi everyone,

We’ve been collecting lexicographical data for quite a while now. The first applications are being built on top of it as well. One big remaining wish is to make it possible to access the data in Lexemes also from Wiktionary. We’re enabling it on the first Wiktionaries now.

On December 15th, we’ll enable the initial version of Lua access to lexicographical data on the following wikis:

Bengali Wiktionary https://bn.wiktionary.org/
Basque Wiktionary https://eu.wiktionary.org
Wikidata https://www.wikidata.org/
Test Wikidata https://test.wikidata.org/

The Lua interface (i.e. the available functions and methods) is documented at mw:Extension:WikibaseLexeme/Lua (you can see a live example, with links to the relevant template and module, at Beta English Wiktionary: cat). If you have suggestions for improving it (e.g. extra functions that would be useful), feel free to add them to phab:T294637.

Please note that the Lua interface is not stable yet; breaking changes may be made at any time, though usually not without some note on Phabricator first. We recommend not fully relying on this yet. If your wiki would nevertheless like to have Lua access while it is not stable yet, feel free to leave a note at phab:T294159. Otherwise, we hope to stabilize the interface soon, at which point we’ll start enabling this feature on more Wiktionaries. If you're active on another Wiktionary and would like to see the feature enabled, feel free to reach out to us after talking to your community, and we will make sure to include it in the list for future deployments.

Change dispatching, i.e. the automatic updating of local wiki pages when relevant changes on Wikidata are made, as well as integration with recent changes and the watchlist, ought to work; however, usage tracking (i.e. determining which changes on Wikidata affect the local wiki) is not very fine-grained yet: if a local page uses any data from a Lexeme, then any change to the Lexeme on Wikidata will cause the page to be rerendered and that change will be added to the recent changes on the wiki, even if the change is unrelated to the data which the page actually uses. We will improve this at a later point; in the meantime, wikis that heavily use lexicographical data may see “too many” Wikidata entries in their recent changes and watchlists.
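
The difference between the current coarse usage tracking and a finer-grained one can be sketched as follows. This is a toy Python model, not how Wikibase implements change dispatching: the page name, the Lexeme ID "L123", the part names, and both functions are hypothetical.

```python
# Hypothetical usage table: local page -> Lexeme ID -> parts of it the page uses.
USAGE = {"Page A": {"L123": {"lemma", "forms"}}}


def pages_to_rerender_coarse(changed_lexeme: str, changed_part: str) -> list:
    """Behaviour as described above: any change to a used Lexeme counts."""
    return [page for page, used in USAGE.items() if changed_lexeme in used]


def pages_to_rerender_fine(changed_lexeme: str, changed_part: str) -> list:
    """A finer-grained alternative: only changes to parts the page uses count."""
    return [
        page
        for page, used in USAGE.items()
        if changed_part in used.get(changed_lexeme, set())
    ]
```

In this model, a change to the senses of "L123" rerenders "Page A" under coarse tracking but not under fine-grained tracking, which is why heavily used Lexemes can currently flood recent changes.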

We are really happy to beta-test Lua access to Lexemes with you, and we are looking forward to your feedback and bug reports (they can be tracked under phab:T294159 or on the Report a Technical Problem page). We would love to know how you are using Lexemes on Wiktionary, so when you are running your first experiments, feel free to let us know!

Cheers, Lea Lacroix (WMDE) (talk) 14:28, 13 December 2021 (UTC)

Update: the feature has been successfully added to the wikis listed above. You can now start creating Lua modules using Lexemes, test the different calls, and let us know if you encounter any issues. Lea Lacroix (WMDE) (talk) 13:16, 15 December 2021 (UTC)

Memo: such Lua tech could interest Lingualibre, which has basic lexicographic needs and is an ideal place to code and prototype in peace. Yug (talk) 21:38, 18 December 2021 (UTC)