Wikidata talk:Lexicographical data/Archive/2023/02

This page is an archive. Please do not modify it. Use the current page, even to continue an old discussion.


Unified Chinese lexeme

Why we need this kind of lexeme:

  1. Most Chinese content is cross-topolectal, not restricted to Mandarin; and we can use location of sense usage (P6084) or the proposed Wikidata:Property proposal/variety of sense to tag terms or senses only used in one variety (see wikt:Wiktionary:Votes/pl-2014-04/Unified Chinese).
  2. If we only have lexemes in Modern Standard Chinese, it is impossible to express the etymology of Việt Nam (L1003547). Creating lexemes for Classical Chinese is a bad choice because (1) it would result in a large number of redundant senses; (2) the scope of Classical Chinese is vague (cf. m:Requests for new languages/Wikisource Literary Chinese); and (3) there are historical stages of Chinese that are neither Classical Chinese nor MSC, such as Old Mandarin (Q837169).
  3. Using a narrow variety (such as Cantonese (Q9186)) as the language makes it difficult for Wikidata to cover a term's usage across the entire macrolanguage (this is also why I propose to unify some languages). Examples in Chinese:
    1. wikt:嘎#Etymology_4 - covers multiple Mandarin dialects, and this cannot be described as one single MSC lexeme. Creating individual lexemes for the term in each Mandarin dialect is also bad, as senses must be duplicated and such lexemes have no other information (there is no romanization for most lects).
    2. wikt:Template:zh-dial-map/太陽 - if we only have lexemes for some narrow varieties, this cannot be expressed.

--GZWDer (talk) 19:46, 31 January 2023 (UTC)

This is the first draft of what a unified Chinese lexeme will look like: 朋友 (L1007712), (L1008271), 葫芦/葫蘆 (L1008653). This version of the modelling is far from perfect: some parts are not really good design. Feel free to suggest better ways to model such information (such as using additional lexemes for specific dialects together with the unified Chinese lexeme, though my expectation is that unified Chinese lexemes will eventually replace all lexemes for specific dialects). --GZWDer (talk) 19:55, 31 January 2023 (UTC)
In general this idea of unifying languages appeals to me; the one concern I have is if this might promote a bias toward one version of the language (for example Mandarin) over the others. Can we do this in a way that treats all the variants equally somehow? I think the structure provided in Wikidata allows that. ArthurPSmith (talk) 19:19, 1 February 2023 (UTC)

Maybe adding lexemes in Middle Chinese could be a way to go. Middle Chinese phonology (Q60988691) is one of the ancestral languages of the languages you used in your example, and the spelling of those Han characters in Middle Chinese is quite fixed thanks to the rich literature and language research. Supaplex (talk) 03:21, 3 February 2023 (UTC)

Propose to remove all chữ Nôm (Q875344) lexemes

Currently there are 59 lexemes using chữ Nôm (Q875344) as language, such as (L679721). These lexemes have senses duplicated from corresponding Vietnamese lexemes. As Vietnamese is a language with no inflection, we can store these chữ Nôm as forms of existing lexemes instead. GZWDer (talk) 14:08, 14 January 2023 (UTC)

@Mxn: Thoughts? Mahir256 (talk) 17:21, 16 January 2023 (UTC)

@GZWDer: The Nôm entries used to be combined with quốc ngữ entries, but this proved untenable because of the many-to-many relationship between quốc ngữ words and Nôm characters. For example, each sense of xanh (L705061) corresponds to a different set of multiple Nôm characters, and some of those characters correspond to multiple quốc ngữ words with different spellings and senses. In a modern context, it would be utterly pedantic to split the lexeme on each difference that would be relevant only in the historical Nôm logographic writing system but not in the quốc ngữ alphabet. Modern dictionaries and Vietnamese speakers never distinguish these words as homonyms.

Even in lexemes that have only a single sense, multiple Nôm characters can be used interchangeably, because chữ Nôm was never standardized. It would be inaccurate to return to representing each Nôm character as a separate form, as though there is any grammatical significance to each form. (I got some spirited pushback against this approach originally.) Moreover, Wikibase does not support assigning multiple representations of the same language to a single form. The only workaround would be to assign a character-specific language code to each character. For example, xanh (L705061-F1) would have additional representations such as 𩇢 for vi-x-Q109872913 and 𫕹 for vi-x-Q109901121. Obviously, it would be counterproductive to introduce thousands of language codes in this manner.

Duplicating senses is an unfortunate tradeoff. However, there is plenty of precedent for this duplication, for example at the English Wiktionary and in any chữ Nôm–quốc ngữ "translation" dictionary.

 –  Minh Nguyễn 💬 00:09, 17 January 2023 (UTC)
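The workaround described in the comment above can be sketched roughly as follows. This is only an illustration under a simplified assumption: a form's representations are modelled as a plain mapping from language code to a single string, and the helper names `add_representation` and `nom_language_code` are hypothetical, not part of Wikibase.

```python
# Hypothetical, simplified model: a form holds at most one representation
# per language code (this is an assumption made for illustration only).

def nom_language_code(character_qid):
    """Build a character-specific private-use code such as vi-x-Q109872913."""
    return f"vi-x-{character_qid}"


def add_representation(representations, language_code, text):
    """Add a representation, refusing a second one under the same code."""
    if language_code in representations:
        raise ValueError(f"{language_code!r} already has a representation")
    representations[language_code] = text
    return representations


# The form xanh (L705061-F1) with the two Nôm characters mentioned above.
form = {"vi": "xanh"}
add_representation(form, nom_language_code("Q109872913"), "𩇢")
add_representation(form, nom_language_code("Q109901121"), "𫕹")
print(form)
```

Every additional character needs its own code in this model, which is why the number of codes would grow with the number of characters.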

@Mxn: 1. "many-to-many relationship" - This is why I proposed Wikidata:Property proposal/form applies to sense (even if the examples there are all in English). 2. What I mean is to add a form for each Nôm character, so that one form contains only one Nôm character. 3. "any grammatical significance to each form" - Wikidata does not require this. The data is still unambiguously interpretable with no grammatical features present. 4. English Wiktionary does not treat Nôm characters as lemmas; they are only soft redirects to the main entries. --GZWDer (talk) 06:01, 17 January 2023 (UTC)

@GZWDer: Would any Vietnamese lexeme have chữ Nôm lemmata alongside the quốc ngữ lemma? In the very common case where a single Nôm character has multiple dialectal readings, such as 𡗶 (Q109809238), would we create an identical form in the lexemes for trời, giời, lời, etc.? In order for a Wiktionary to list the readings of 𡗶 via a module, we would need to build out coverage of Vietnamese reading (P5625), reading pattern of Kanji (P5244), and chữ Nôm reading (Q56066660) and ignore the lexemes entirely.

@AGutman-WMF @Ariel Gutman: Do you agree that Vietnamese lexemes can have multiple forms without any grammatical properties?

 –  Minh Nguyễn 💬 06:31, 17 January 2023 (UTC)

@Mxn: The lemma of the lexeme will not contain any chữ Nôm, since they are mostly obsolete in modern times. For chữ Nôm with multiple readings, forms will be added to each of the lexemes, and each form will carry little information by itself ((L679721) Han character in this lexeme (P5425) and the proposed "form applies to sense" property). --GZWDer (talk) 06:40, 17 January 2023 (UTC)

@GZWDer: Though chữ Nôm is obsolete, it’s quite within the scope of Wiktionary modules based on Wikidata lexemes. The indirection required to discover the multiple readings of a character would be quite challenging. (Basically, we’d have to rely on something like Listeria instead of built-in Wikibase functionality, but then we might as well eschew lexemes.)

There are other issues too. For example, you contend that the Nôm forms would have no other statements, but would the pronunciation (P7243) statement be duplicated across each form or reserved for the quốc ngữ form purely as a matter of convention? For all the flaws of the current multiple lexeme approach, I would contend that it’s less surprising to data consumers (such as SPARQL queries) that aren’t specifically special-casing Vietnamese.

As I see it, the main downside of the current approach is that it overloads translation (P5972). Maybe what we really need is a new property to express the precise relationship between the quốc ngữ and Nôm lexemes, something like Vietnamese reading (P5625) but typed as a lexeme rather than a monolingual string?

 –  Minh Nguyễn 💬 16:42, 17 January 2023 (UTC)

@Mxn: 1. pronunciation (P7243) is only used in the quốc ngữ form. It makes no sense to duplicate it in each Nôm form. 2. Not every Han character is a valid word in Vietnamese, but they still have translation (P5972) Vietnamese reading (P5625). Also, Wikidata has different lexemes for different parts of speech (for example, wikt:nữ#Vietnamese will be three lexemes), and using the lexeme datatype in Vietnamese reading (P5625) will be awkward in this case. --GZWDer (talk) 18:11, 17 January 2023 (UTC)
@GZWDer: That’s why I suggested proposing a new sense-typed property to accurately express the relationship between a given sense in a quốc ngữ lexeme and its corresponding sense in a Nôm lexeme, or vice versa. It could be specific to Vietnamese or generic enough to account for other languages in a similar situation, such as Tày (Q2511476). Minh Nguyễn 💬 21:51, 17 January 2023 (UTC)
@Mxn: If relations between quốc ngữ and Nôm can be expressed using forms (and items, for Han characters), it makes no sense to create separate lexemes for Nôm. Such lexemes will duplicate existing senses of quốc ngữ lexemes. --GZWDer (talk) 12:16, 18 January 2023 (UTC)
@GZWDer: But this is the crux of the issue: the relation cannot be expressed adequately using forms as they are today. Your proposed form applies to sense property is only a partial workaround. It would remain impractical to list the quốc ngữ associated with a given Nôm character – a feature not uncommon in Vietnamese dictionaries. Moreover, it would inaccurately represent that, for instance, trời and giời only happen to correspond to the same character 𡗶 by coincidence. I would be more than happy to refine the modeling of these Nôm lexemes, but so far the only argument for eliminating them is that they "make no sense". Well, they make some sense to me, a Vietnamese speaker attempting to facilitate the adoption of Wikidata at the Vietnamese Wiktionary. If not for the many-to-many relationship, it would be no big deal to structure Vietnamese lexemes similarly to Japanese lexemes: for example, 柴犬/しばいぬ (L2305) duplicates 柴犬/しばけん (L2306), but the relationship is clear due to homograph lexeme (P5402) and the duplicated lemmata. Minh Nguyễn 💬 21:44, 22 January 2023 (UTC)
@Mxn: (1) The vi-readings template should use Vietnamese reading (P5625) (or something like phab:T195411) instead of using any separate lexemes. (2) One option is to link different quốc ngữ forms that have the same meaning but different spellings with alternative form (P8530), so users can see they are related. GZWDer (talk) 10:26, 27 January 2023 (UTC)

@GZWDer: Great, we can remodel all the lexemes either once phab:T195411 is implemented along with a corresponding Lua method (rather less likely), or once Vietnamese reading (P5625) is deprecated in favor of a new property that uses senses as values. But using this sense-typed variation of Vietnamese reading (P5625) on senses would more robustly and conveniently solve the problem you raise than using it on items about characters, given that compound words also have quốc ngữ readings.

You're trying hard to make your proposal apply to this specific aspect of Vietnamese, but it just isn't working. alternative form (P8530) gets us no closer to a solution, as trời and giời are synonyms in different dialects, not alternative forms. What you are essentially saying is that 𡗶 is a synonym of 𡗶 – that makes no sense.

 – Minh Nguyễn 💬 05:59, 5 February 2023 (UTC)

@Mxn: We can use lexeme sense (P7018) (or a new property) as a qualifier of Vietnamese reading (P5625).--GZWDer (talk) 08:37, 5 February 2023 (UTC)
If there is no inflection in Vietnamese anyway, then I suppose it would be fine, since the "inflected forms" would not be used to store inflection but another type of variation. I still believe, however, that storing these orthographic forms as "spelling variants" is better, for instance using the code vi-x-Q875344 (and possibly some alternative codes, if you have more than one such variant). Ariel Gutman (talk) 18:49, 18 January 2023 (UTC)
@Ariel Gutman: The alternative codes would end up being the QIDs of the Han characters themselves – that's the extent to which we can generalize about the choice of one character over another. Minh Nguyễn 💬 21:35, 22 January 2023 (UTC)
I have created some remodeled Vietnamese lexemes at Việt Nam (L1003547), (L1003552) and (L1003559).--GZWDer (talk) 11:02, 31 January 2023 (UTC)

P31 on lexemes

Hi y'all,

I noticed that there are a lot of unexpected values for instance of (P31) on lexemes.

If we take a look at this query (counting the number of uses of each value):

SELECT ?instance ?instanceLabel (COUNT(?q) AS ?nb) (SAMPLE(?q) AS ?sample) WHERE {
  ?q dct:language ?lang ;
     wdt:P31 ?instance .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
GROUP BY ?instance ?instanceLabel
ORDER BY DESC(?nb)

The most common values are:

value | number of uses | sample
count noun (Q1520033) | 12847 | poŝtelefono/poshtelefono/posxtelefono (L406481)
Whitaker's Latin frequency type C (Q86850539) | 10314 | accuro (L254988)
toponym (Q7884789) | 9515 | Pollando (L270060)
compound (Q245423) | 7248 | poŝtelefono/poshtelefono/posxtelefono (L406481)
Whitaker's Latin frequency type F (Q86851213) | 6260 | accumulator (L254983)
Whitaker's Latin frequency type E (Q86851081) | 5913 | aceratus (L255011)
Whitaker's Latin frequency type D (Q86850702) | 3988 | acorion (L255101)
weak verb (Q60655) | 3251 | atmen (L301509)
mass noun (Q489168) | 2839 | akvo (L8286)
Whitaker's Latin frequency type B (Q86850447) | 2294 | accumbo (L254980)

Do these values really belong in instance of (P31)? It feels like some should go in more specific properties.

For instance, if we focus on verbs, there are 589 lexemes with instance of (P31) = transitive verb (Q1774805) and 200 with intransitive verb (Q1166153) (and more like these), which should probably be moved to transitivity (P9295), no?

Not sure what to do with the "Whitaker's Latin frequency type"...

Also, can't compound (Q245423) just be removed if there is already a combines lexemes (P5238) statement? (I'm not sure, but it looks like it's already implied.) The same goes for occupation name (Q116003388): if the sense links to an occupation, isn't it obviously an occupational term? (@AdamSeattle: for this last one).

Finally, I see a lot of plain mistakes (especially at the bottom, among values with a low number of uses) that should be corrected.

Cheers, VIGNERON (talk) 08:07, 12 February 2023 (UTC)

Some of these are usually handled by has characteristic (P1552), but in my opinion this is just another vague property. GZWDer (talk) 13:37, 12 February 2023 (UTC)
I have now proposed Wikidata:Property proposal/countability. GZWDer (talk) 14:05, 12 February 2023 (UTC)
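For the transitivity case raised in this thread, a worklist query could be generated along these lines. This is a rough sketch, not an agreed procedure: the function name `lexemes_missing_transitivity` is hypothetical, and it assumes that transitivity (P9295) would be stated on the lexeme itself rather than on its senses.

```python
import re


def lexemes_missing_transitivity(class_qid, limit=100):
    """Build a SPARQL query for lexemes typed via P31 that lack a P9295 statement."""
    if not re.fullmatch(r"Q[1-9][0-9]*", class_qid):
        raise ValueError(f"not an item ID: {class_qid!r}")
    return (
        "SELECT ?lexeme ?lemma WHERE {\n"
        "  ?lexeme dct:language ?lang ;\n"
        "          wikibase:lemma ?lemma ;\n"
        f"          wdt:P31 wd:{class_qid} .\n"
        "  # Assumption: transitivity (P9295) is stated on the lexeme itself.\n"
        "  FILTER NOT EXISTS { ?lexeme wdt:P9295 ?transitivity . }\n"
        "}\n"
        f"LIMIT {int(limit)}"
    )


# transitive verb (Q1774805), as mentioned in the thread
print(lexemes_missing_transitivity("Q1774805"))
```

The printed query can be pasted into the query service; if transitivity ends up being recorded on senses instead, the filter would need to be adapted.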

Lexemes for place names with multiple Wikidata objects

Hello, there are over 50 different suffixes for different island types in Swedish Finland (see article en:Skärgårdsnamn). There are also over 17 000 Wikidata objects (and Wikipedia articles in Swedish) for islands in Finland's archipelago, with around 13 000 unique names (that are also used in Sweden's archipelagoes).

How should place names with multiple Wikidata objects be reflected in lexemes? I suppose every unique island name should have a lexeme, and each could have the property combines lexemes (P5238) with at least the appropriate suffix, as well as one sense (en:"name of island") with one or multiple item for this sense (P5137) statements pointing at the Wikidata objects with the same name.

I have tried this for the lexeme Kalskär (Lexeme:L1016964). Is this the correct way of thinking, if I create more? Robertsilen (talk) 11:58, 13 February 2023 (UTC)

Previous discussion: Wikidata_talk:Lexicographical_data/Archive/2022/02#Please_advise_-_place_names. In my opinion it's preferred to have one sense for each different meaning of the name (and not just "place name"), but one lexeme for each unrelated (i.e. not cognate) etymology of the name.--GZWDer (talk) 13:44, 13 February 2023 (UTC)

Proposal for implementing frame semantics in Wikidata

Hello,

I have made proposals for a number of new properties meant to support an implementation of frame semantics in Wikidata. As one of these properties involves lexemes, it may be of interest to people here to read about my proposals. I post below a description of the first property ('frame element of'), plus a short bit about how my proposed frame semantics is meant to relate to a current lexeme project involving Sumerian and Akkadian.


I propose introducing a family of properties which allow the description of frame semantics on Wikidata, roughly following the schema established by the Berkeley MetaNet project for English (https://metaphor.icsi.berkeley.edu/pub/en/index.php/Category:Frame), which in turn roughly follows the schema of FrameNet. Many of the proposed properties can be externally linked to the MetaNet project or to FrameNet. The reason I turn to MetaNet is that the project's setup is specifically geared towards conceptual metaphors, which is also my interest. The property of frame element is equivalent to MetaNet's 'role', and is one of the most fundamental properties in frame semantics. In the MetaNet schema a role has a somewhat more generalized sense than in common English usage, describing not just an actor but also a specific action or result. Thus for the frame WALKING THE DOG, important roles include: DOG, DOG OWNER, WALKING, EXERCISE. The property of role is different from subframe or 'subcase', where one frame is a constituent element of another (e.g. DRIVING TO WORK may have the subframe STARTING THE CAR). I make these proposals because I wish to start a project to build a database of frames for a low-resource language (Akkadian), which currently has no representation involving frame semantics on the internet. Although there are web projects for frame semantics involving languages like English (e.g. FrameNet, MetaNet), much of the data of these projects cannot be readily imported for other languages due to the cultural idiosyncrasies behind some of their more complex frames. Hence the need to develop property types specifically within Wikidata.

Currently the only other property proposals in Wikidata involving frame semantics are for FrameNet Frame ID and FrameNet Lexical Unit ID. I propose to go beyond this with a minimal but functional set of properties describing general frames and their components.

These proposed properties are immediately relevant to the set of Akkadian (and Sumerian) lexemes being developed on Wikidata by Adam Anderson and Timo Homburg out of data from Oracc (an online repository of lemmatized Akkadian and Sumerian texts - http://oracc.museum.upenn.edu/). An example of such a lexeme is 'abala' (https://www.wikidata.org/wiki/Lexeme:L709438). With the addition of the proposed frame semantic properties, these lexemes would acquire added relevance to natural language processing projects involving deep semantic parsing. On the other hand, Anderson/Homburg's project is currently focused on Sumerian. My own project involving building a database of frames/metaphors in Akkadian would complement their work, starting out with Akkadian. We would hopefully 'meet in the middle' as it were. Sinleqeunnini (talk) 17:03, 13 February 2023 (UTC)

Only add suffix for P5238, ok?

There are roughly 50 suffixes for island place names in Swedish. If I create lexemes for a few thousand islands (that also have Wikidata/Wikipedia entries), is it ok if I add the property combines lexemes (P5238) with just the suffix lexeme, but not the other parts? My thinking is that I can do what's possible semi-automatically (which also makes statistics possible), and leave adding prefix lexemes for later. Is an incomplete P5238 ok? 86.114.223.210 18:47, 13 February 2023 (UTC)

No, it's not really ok. In fact, there is a constraint on combines lexemes (P5238) to prevent exactly that. Cheers, VIGNERON (talk) 15:34, 19 February 2023 (UTC)
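Since partial combines lexemes (P5238) statements are discouraged, one possible semi-automatic step is to split each name into both parts up front and only then look up lexemes for each. The sketch below is a mechanical illustration only: the function name `split_island_name` is hypothetical, the suffix list contains three illustrative entries rather than the roughly 50 mentioned above, and the resulting prefix still has to be matched to a real lexeme by a person.

```python
import unicodedata

# Illustrative suffixes only; the full list of roughly 50 is not reproduced here.
SUFFIXES = ["skär", "holm", "ö"]


def split_island_name(name, suffixes=SUFFIXES):
    """Return (prefix, suffix) using the longest matching suffix, or None."""
    # Normalize so that composed and decomposed forms of ä/ö compare equal.
    name = unicodedata.normalize("NFC", name)
    lowered = name.lower()
    candidates = [unicodedata.normalize("NFC", s) for s in suffixes]
    for suffix in sorted(candidates, key=len, reverse=True):
        # Require a non-empty prefix so a bare suffix is not treated as a compound.
        if lowered.endswith(suffix) and len(lowered) > len(suffix):
            return name[: len(name) - len(suffix)], suffix
    return None


print(split_island_name("Kalskär"))
```

Names for which no suffix matches return `None` and can be set aside for manual review.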

Default parameters on Special:NewLexeme

Hi, when creating a new lexeme, is there a way to start with default parameters? For example, if I use the ?lemma= parameter, the lemma gets preloaded (as in https://www.wikidata.org/wiki/Special:NewLexeme?lemma=abc).

Is there a way to preload Lexeme's language, Spelling Variant and lexical category? Joseph (talk) 19:10, 18 February 2023 (UTC)

Those parameters are lexeme-language, lemma-language and lexicalcategory. - Nikki (talk) 11:16, 19 February 2023 (UTC)
Thank you, it's working! Joseph (talk) 12:44, 19 February 2023 (UTC)
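A prefilled link using the parameters named above could be assembled as in this small sketch. The parameter names come from the reply in this thread; the value formats are an assumption (item IDs for `lexeme-language` and `lexicalcategory`, a language code for `lemma-language`), and the helper `new_lexeme_url` is hypothetical.

```python
from urllib.parse import urlencode

BASE = "https://www.wikidata.org/wiki/Special:NewLexeme"


def new_lexeme_url(lemma, lexeme_language=None, lemma_language=None,
                   lexical_category=None):
    """Build a Special:NewLexeme link with whichever fields are provided."""
    params = {"lemma": lemma}
    if lexeme_language:
        params["lexeme-language"] = lexeme_language   # assumed: item ID
    if lemma_language:
        params["lemma-language"] = lemma_language     # assumed: language code
    if lexical_category:
        params["lexicalcategory"] = lexical_category  # assumed: item ID
    return f"{BASE}?{urlencode(params)}"


# English (Q1860), spelling variant "en", noun (Q1084)
print(new_lexeme_url("abc", "Q1860", "en", "Q1084"))
```

`urlencode` also percent-encodes non-ASCII lemmas, so the same helper should work for lemmas outside the Latin alphabet.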