Wikidata talk:Lexicographical data/Archive/2022/11

This page is an archive. Please do not modify it. Use the current page, even to continue an old discussion.


How to search for lexemes?

I think Wikidata:Lexicographical data should describe how you can search for lexemes ... because that is a very important aspect. What I have gathered so far:

  • The search autocompletion on this wiki does not yield Lexemes at all.
  • The search on this wiki by default does not search the Lexeme namespace but that can be configured.
  • There are some third-party websites to search for lexemes listed on Wikidata:Tools/Lexicographical data.

Is there some search-as-you-type search for lexemes? If so, I think it should be linked from Wikidata:Lexicographical data.

--Push-f (talk) 08:20, 5 November 2022 (UTC)

The only way I know of is by prefixing a regular Wikidata search with "L:"; that yields no autocompletion. I do not know of a way to search with autocomplete for lexemes. عُثمان (talk) 17:37, 6 November 2022 (UTC)
Practically speaking, I find the Ordia tool linked on the third-party site list to be the easiest way to get an overview of the lexemes in a given language. عُثمان (talk) 17:38, 6 November 2022 (UTC)
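A note for anyone scripting this: the wbsearchentities API module also accepts type=lexeme, so a search-as-you-type lookup can be built outside the wiki UI. A minimal Python sketch, assuming the requests library (the function name is illustrative only):

    import requests

    def search_lexemes(term, language="en", limit=10):
        # wbsearchentities with type=lexeme returns prefix matches on lemmas;
        # "language" is the language the search is performed in.
        resp = requests.get(
            "https://www.wikidata.org/w/api.php",
            params={
                "action": "wbsearchentities",
                "search": term,
                "language": language,
                "type": "lexeme",
                "limit": limit,
                "format": "json",
            },
            headers={"User-Agent": "lexeme-search-example/0.1"},
        )
        resp.raise_for_status()
        return [(hit["id"], hit.get("label", "")) for hit in resp.json()["search"]]

    for lexeme_id, lemma in search_lexemes("water"):
        print(lexeme_id, lemma)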

Useful P5830 here?

I tried to represent the exact forms used in a proverb with combines lexemes (P5238), so I used subject form (P5830), as here: był w Pacanowie, wie jak kozy kują (L733871). What do you think, is it correct? Gower (talk) 11:52, 13 November 2022 (UTC)

@Gower: The appropriate qualifier to use for forms on combines lexemes (P5238) is object form (P5548); I have also added series ordinal (P1545) qualifiers to indicate the order in which the words appear. Mahir256 (talk) 15:21, 13 November 2022 (UTC)
Thank you for the answer. I knew about object form (P5548), but it seems inappropriate in this case, because the word is not derived from some form but rather is that form (variant). What do you think? Gower (talk) 16:24, 13 November 2022 (UTC)
@Gower: While object form (P5548) was originally created as a qualifier for "derived from", its applicability has since been widened to clarify the objects of statements more generally (such as the values of "combines" statements); similarly, subject form (P5830), originally a qualifier for "usage example", has been widened in applicability to clarify statement subjects (such as lexemes with certain described by source (P1343) and has characteristic (P1552) statements). Think of them as analogues of subject named as (P1810) and object named as (P1932), but with lexeme forms and senses as values rather than strings. (Perhaps this generality has not been carried over to languages other than English in the property labeling?) Mahir256 (talk) 16:35, 13 November 2022 (UTC)
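To see how these qualifiers look in the data, here is a sketch of a SPARQL query, run from Python with the requests library, that lists combines lexemes (P5238) values together with their object form (P5548) and series ordinal (P1545) qualifiers (the p:/ps:/pq: prefixes are the standard Wikidata statement layout; treat the details as illustrative):

    import requests

    QUERY = """
    SELECT ?lemma ?part ?form ?ordinal WHERE {
      ?lexeme wikibase:lemma ?lemma ;
              p:P5238 ?statement .
      ?statement ps:P5238 ?part ;
                 pq:P5548 ?form .
      OPTIONAL { ?statement pq:P1545 ?ordinal . }
    }
    LIMIT 20
    """

    resp = requests.get(
        "https://query.wikidata.org/sparql",
        params={"query": QUERY, "format": "json"},
        headers={"User-Agent": "combines-lexemes-example/0.1"},
    )
    resp.raise_for_status()
    for row in resp.json()["results"]["bindings"]:
        # ?form is the specific form entity picked out by object form (P5548)
        print(row["lemma"]["value"], "->", row["form"]["value"],
              row.get("ordinal", {}).get("value", ""))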

Spelling variants

We have e.g. Phoenician language (Q36734). At Help:Wikimedia_language_codes/lists/all we have (I think) a language code for Punic: xpu. Why doesn't it work when I enter xpu as a spelling variant, e.g. here: Lexeme:L738123? This kind of problem recurs with other languages. What should I do? Type "mis" in the "spelling variant" field? Gower (talk) 08:52, 18 November 2022 (UTC)

Even if a language code is available elsewhere, it has to be specifically enabled for lexemes, as far as I know. These can be requested on Wikimedia Phabricator, but the backlog has been accumulating for around a year and it is not clear when or if new codes will be added.
We do have the option of distinguishing mis codes by suffixing with -x-QID. I have updated your linked lexeme to use mis-x-Q36734 with the QID for Phoenician.
This does not work for glosses, however. I have been using the "gloss quote" property with the language as a qualifier for lack of a better option, but it is currently not clear how glosses in missing languages are supposed to be distinguished. عُثمان (talk) 20:01, 24 November 2022 (UTC)

Derivation property?

Do we have a property to mark or annotate derived lexemes on a lexeme record? If not, should we create one? Gower (talk) 16:02, 16 November 2022 (UTC)

@Gower: We do have derived from lexeme (P5191), possibly qualified with object form (P5548) and object sense (P5980) if necessary. Mahir256 (talk) 18:07, 16 November 2022 (UTC)
@Gower, Mahir256: and also mode of derivation (P5886) (among many other qualifiers). Cheers, VIGNERON (talk) 09:34, 26 November 2022 (UTC)

Property to link lexemes with Wiktionary entries?

Do we have a property to link lexemes with Wiktionary entries? If not, shouldn't we create one? Gower (talk) 12:53, 18 November 2022 (UTC)

We have an automatic tool: User:Nikki/LexemeInterwikiLinks.js. --Infovarius (talk) 19:29, 19 November 2022 (UTC)
@Infovarius thanks, but why isn't it a default option for everyone, like the "Wikidata item" link in the sidebar? LexemeInterwikiLinks.js works well on Wikidata, but doesn't work on local Wiktionaries, where we have interwikis only to the Wiktionaries in other languages… Gower (talk) 10:14, 20 November 2022 (UTC)
I think it is just a matter of implementing it on the given Wiktionary. I am in the very beginning stages of doing this on pnbwiktionary. On this entry the senses and inflection table are pulled from a lexeme (ਮਾਅਨਾ/معنیٰ (L729524)) using a module based on one implemented by @Mahir256 for Bengali Wiktionary. Sidebar links are probably possible as well in this case because pnbwiktionary is for the most part monolingual. (There are two Polish entries and a Chinese one but that's it for non-Punjabi words.) For wiktionaries like enwiktionary, an entry-to-lexeme sidebar could get quite long as entries there are all shared based on one string. I am quite interested in implementing some more human-readable renderings of lexicographic data for speakers/readers of the language. pnbwiktionary has not had active editors since 2014, but likely more for lack of resources for the language than lack of interest. Lexemes could be used to automate the creation of entries to a certain extent. عُثمان (talk) 19:49, 24 November 2022 (UTC)
@Gower: we don't need a property; just as interwikis are not stored on the Wiktionaries (the Cognate extension creates them based on the string), we don't need to store these links on Wikidata, and several gadgets already exist on both the Wikidata and Wiktionary sides (on the Wiktionary side, I'm using fr:wikt:Utilisateur:VIGNERON/LienLex.js). Cheers, VIGNERON (talk) 09:08, 26 November 2022 (UTC)
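For anyone curious what such a module reads: the actual pnbwiktionary module is written in Lua using mw.wikibase, but the same lexeme data is publicly available as JSON, so the data access can be sketched in Python (assuming the requests library; illustrative only):

    import requests

    lexeme_id = "L729524"  # the entry mentioned above
    data = requests.get(
        f"https://www.wikidata.org/wiki/Special:EntityData/{lexeme_id}.json",
        headers={"User-Agent": "lexeme-entry-example/0.1"},
    ).json()
    entity = data["entities"][lexeme_id]

    # Lemmas, senses and forms are what an entry page would render.
    for lemma in entity["lemmas"].values():
        print("Lemma:", lemma["value"], f"({lemma['language']})")
    for sense in entity["senses"]:
        for gloss in sense["glosses"].values():
            print("Gloss:", gloss["value"], f"({gloss['language']})")
    for form in entity["forms"]:
        for rep in form["representations"].values():
            # grammaticalFeatures is a list of Q-ids (case, number, ...)
            print("Form:", rep["value"], form["grammaticalFeatures"])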

How to model templatic pattern morphemes of Semitic languages?

Arabic languages or varieties have morphemes that are templatic patterns: they determine a skeleton, often of vowels and sometimes of prefixes, infixes and suffixes, into which the radicals (often consonants) of a root are inserted. (This applies to other Semitic languages as well, but is perhaps less codified than in Arabic; I myself am mostly familiar with the situation in Arabic.)

My question is: How should we model these templatic patterns? Especially: Should they get entities in the item namespace or the lexeme namespace? As they are not lexemes by themselves, I've started to create entities for some in the item namespace; a list can be found at User:Marsupium/Arabic morphological patterns.

But some qualities mostly expressed by lexemes in the lexeme namespace apply to these templatic patterns as well. For example, I'd like to express Fi3iiL (Q115287997) → derived from lexeme (P5191) → Fa3iiL (Q115287998). But that would violate derived from lexeme (P5191)'s allowed-entity-types constraint (Q52004125). How to solve this? Should these patterns be moved to the lexeme namespace, or should the property's constraint be widened? Thanks in advance for any comments, --Marsupium (talk) 18:23, 20 November 2022 (UTC)

While the item namespace is designed to describe objects in a language-independent fashion, with the labels in multiple languages primarily intended to aid the human reader/editor of these items, entries in the lexeme namespace are specific to each language. In order not to spend a huge amount of database memory on the language of work or name (P407) qualifier, entities specific to each language should preferably go into a language-specific part of Wikidata. It makes little sense to "translate" the "Fi3iiL" pattern label (or any of its "aliases") from Arabic into English, German, or any other of the 400 languages supported.
The lexeme entry format, listing senses and forms as sub-entries, does not, however, suit every possible language-specific construct, such as a morphological or word-order pattern. To the extent they are language-independent (spanning multiple languages) they might fit into the main item space, but I'm not convinced the triplet format is an optimal design element by which to define linguistic patterns. I'm therefore thinking in terms of adding yet another section to Wikidata.
While I first thought of it as a "grammeme" namespace (G-items), this may be too limited a purpose to also serve your morphological patterns. Before deciding how to implement them, I would suggest trying out a few different structures using custom entries in Wikimedia Commons databases (say, one file per language described). Keeping all the entries for a language in a single file also takes care of the notability requirement and helps avoid creating thousands of structural items.
Here are a number of language constructs or rule sets that could use a language-specific entry format that is not necessarily a lexeme (there may be partial overlap between some of these):
Rather than suggest a specific entry format for each one of these constructs, I'd prefer a generic format that can easily be adapted for several different purposes. SM5POR (talk) 07:44, 23 November 2022 (UTC)
Here is an example of how it could be done:
  1. Define a record format that will be sufficient for your immediate need, but also leave room for future expansion. I suggest the following columns/fields:
    • Language (ISO code "arz" or Q-item Egyptian Arabic (Q29919) as you prefer)
    • Type (Q-item, morphological pattern (Q6913446) in your case)
    • Object (Q-item, adjective (Q34698) from your example)
    • Pattern (probably a string like "Fi3iiL" or whatever you find convenient to process)
    • Options (if you need to express some limitations; think of it as a qualifier to the main Pattern statement)
    • Source (Q-item, such as a printed or online grammar)
    • Key (a string stating page number, entry keyword or other precise reference to said source)
  2. Create a file with a number of records according to this format (see the sketch below for one possible encoding).
  3. Upload the file to Wikimedia Commons as Data:Sandbox/(username)/Arabic_linguistic_patterns.tab or similar.
  4. Add a statement Egyptian Arabic (Q29919) → Sandbox-Tabular data (P4045) → (filename as above) to Wikidata, linking to any documentation page you may have written about your experiment in the Reference section.
While it may seem redundant to specify the language in every record when it also appears in the filename, doing it this way allows you to reorganize your records in a single file or multiple files as you find convenient without having to rewrite the records according to different formats.
I would advise against creating any items specifically to support your experimental data records only, such as Egyptian Arabic grammar (Q114419189), when they don't serve a more general need. Add more fields to your custom table entries instead if necessary. Since items are not meant to be repurposed once they have been found redundant, your gradual development work might risk wasting a lot of items, but you can delete and reuse the records of your table files without any such problem.
When you have developed some code to use your tables, and the format has matured enough to be used by multiple editors independently of each other, and also for some of the other pattern rules I suggested, it may be time to write a property proposal to replace the experimental Sandbox-Tabular data (P4045) property, but I believe we are far from there yet. SM5POR (talk) 05:57, 25 November 2022 (UTC)
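As a concrete illustration of steps 1-3: Data: pages on Commons are JSON documents with "license", "schema" and "data" keys (tabular data there must be CC0). The field names and sample row below are just one possible encoding of the columns proposed above, not a fixed format:

    import json

    table = {
        "license": "CC0-1.0",
        "description": {"en": "Experimental Arabic linguistic patterns"},
        "schema": {
            "fields": [
                {"name": "language", "type": "string"},  # e.g. "arz" or "Q29919"
                {"name": "type", "type": "string"},      # e.g. "Q6913446"
                {"name": "object", "type": "string"},    # e.g. "Q34698"
                {"name": "pattern", "type": "string"},   # e.g. "Fi3iiL"
                {"name": "options", "type": "string"},   # qualifier-like limitations
                {"name": "source", "type": "string"},    # Q-id of a grammar
                {"name": "key", "type": "string"},       # page number or entry keyword
            ]
        },
        "data": [
            ["arz", "Q6913446", "Q34698", "Fi3iiL", "", "", ""],
        ],
    }

    # This is roughly what would be pasted into
    # Data:Sandbox/(username)/Arabic_linguistic_patterns.tab
    print(json.dumps(table, ensure_ascii=False, indent=2))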

New Lexeme creation page will be live on Wikidata on November 2nd

Hi everyone,

Among our development goals this year is to make the lexicographical data part of Wikidata easier to understand for people not familiar with lexicography. This includes reworking the Lexeme creation page to improve the editing experience for users. We plan to replace the Special:NewLexeme page with the new one on November 2nd!

As you may recall, we made a number of tweaks to the old page and asked you to test it and give feedback (see the previous announcement). We addressed the issues the community raised, and we would like to thank everyone who participated in the testing and provided feedback.

While the new Special:NewLexeme is already scheduled to be deployed, we would still like to hear what you think. If you have any questions or suggestions please let us know on this talk page.

Cheers, -Mohammed Sadat (WMDE) (talk) 09:23, 21 October 2022 (UTC)

It's still worse than the current one for me. My biggest issue with the current page is how tedious it is to use, and the new one has managed to make it more tedious by making everything harder to enter.
More issues:
  • The fields aren't marked as required in the HTML any more.
  • The required marker does not follow the style used by MediaWiki elsewhere.
  • The required marker is tiny, has a weirdly large gap before it, has no tooltip and does nothing if you click on it.
  • Pressing enter after entering a language name or lexical category now tries to submit the form instead of selecting the top entry.
  • Page up and page down no longer work in the dropdowns for the language, lexical category or spelling variant.
  • The spelling variant field still incorrectly links to Help:Monolingual text languages, which is a completely unrelated page.
  • The spelling variant field no longer automatically selects the right language if you enter a language code (e.g. try "es" - on the current page you get Spanish, in the new one it gives you Esperanto)... even though it tells you to enter the language code.
  • There seems to no longer be any way to open the list of spelling variant languages... which means it's now almost impossible for people to work out how to enter unsupported languages if they don't already know what to do.
  • If you tab out of the language or lexical category field without selecting an item, there's no indication that the field is incomplete until you try to submit it and get an error.
  • Moving the terms of use/license info between the input fields and the submit button means that in most browsers, you now have to tab three times to get from the lexical category field to the submit button.
  • If you use a country variant like de-at (https://www.wikidata.org/wiki/Special:NewLexemeAlpha?uselang=de-at), the example lexeme shows the fallback language name when it shouldn't.
- Nikki (talk) 21:01, 22 October 2022 (UTC)
On https://test.wikidata.org/wiki/Special:NewLexeme?lexeme-language=Q1, the placeholder text suggests a language code (mis-x-Q26790) that can't be used there.
On https://test.wikidata.org/wiki/Special:NewLexeme?lexeme-language=Q1&lemma=a&lemma-language=mis&lexicalcategory=Q111 and https://test.wikidata.org/wiki/Special:NewLexeme?lexeme-language=Q111&lemma=a&lemma-language=mis&lexicalcategory=Q1, if you try to create the lexeme, it only shows an error about Q1 not existing for the language or lexical category after getting an API error - I would expect it to verify the input before trying to create the lexeme and then show an error message next to the corresponding field. - Nikki (talk) 16:31, 1 November 2022 (UTC)
Thank you! I've created the following tickets and added them to one of the upcoming sprints: phab:T322681, phab:T322683, phab:T322684, phab:T322685, phab:T322686, phab:T322687
For the language code in the spelling variant: What would be your preferred way to address it? Lydia Pintscher (WMDE) (talk) 19:56, 8 November 2022 (UTC)
@Lydia Pintscher (WMDE): BUG!! Language code for created lexeme is wrong: https://www.wikidata.org/w/index.php?title=Lexeme:L735823&oldid=1772213018. --Infovarius (talk) 13:05, 15 November 2022 (UTC)
@Infovarius: Looks like a misconfigured common.js? Mahir256 (talk) 07:44, 27 November 2022 (UTC)
Oh. Oh! My bad, sorry. How did this remain unnoticed by me for so long... User:Mahir256, you are a detective! --Infovarius (talk) 14:23, 28 November 2022 (UTC)

Listing lexemes the topic (QID) consists of

I didn't find any way of specifying in a Q-item the lexemes it consists of. Technically it is impossible because Wikidata's data types don't include a monolingual array of lexemes. So I'd like to open a discussion about such a possibility. It would open the way for automatic translations, choosing appropriate forms of the words the topic consists of, forming plurals, etc.
One possible solution is a property like combines lexemes (P5238), but for Q-items, that would list the lexemes of a topic. Because every language has its own lexemes for a topic, a simple lexeme data type is unsuitable: it would become an unordered mess of lexemes in different languages with order and language qualifiers. Such an approach is too complex even for machine processing.
Probably a proposal for a new data type would solve the problem. What do you think? D6194c-1cc (talk) 07:57, 25 November 2022 (UTC)

I'm not sure I understand exactly what you are looking for, but the most general property relating lexemes (or rather their senses) to items is item for this sense (P5137). Since there aren't (yet) specific items corresponding to every possible sense of a lexeme in any language, many senses either don't link to any item at all, or they link to the item of a related word. You can make a query in SPARQL to find statements using item for this sense (P5137) by either specifying the sense or the item, or by leaving both unspecified (I just did the latter, and found 136,445 statements).
As an example, the English noun water (L3302-S1) links to items liquid water (Q29053744) and water (Q283). Making a query "?sense wdt:P5137 wd:Q283" yields a list of 73 senses representing words for "water" in various languages. Given the current state of things, this is hardly enough to generate a useful pocket dictionary automatically, let alone translate a full sentence from one language to another, but it may form a rudimentary basis for future development work.
The part of your question that I don't understand is "lexemes the topic consists of". By using SPARQL queries you can certainly limit your matches to only the language you are interested in, say German or French. Translating a phrase such as "the water is cold", however, also involves identifying word classes (parts of speech), grammatical forms etc. and assembling the translated words in proper order. Part of that problem may be approached using the linguistic pattern data type we discussed in the previous question above (about morphological patterns in Semitic languages), but it's a complex problem that will certainly not be solved by merely adding another data type. Fortunately, we can use existing data types to simulate new ones; there is no need to make a formal proposal merely to experiment with this.
The property combines lexemes (P5238) is hardly of much use here as it merely maps between one lexeme and a list of constituent lexemes; it's not meant to be used with items. SM5POR (talk) 11:01, 25 November 2022 (UTC)
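For reference, the item for this sense (P5137) query mentioned above can be run against the public query service; a minimal Python sketch (assuming the requests library, and using the standard ontolex:/dct: lexeme vocabulary of the query service):

    import requests

    QUERY = """
    SELECT ?lexeme ?lemma ?languageLabel WHERE {
      ?sense wdt:P5137 wd:Q283 .          # senses meaning "water"
      ?lexeme ontolex:sense ?sense ;
              wikibase:lemma ?lemma ;
              dct:language ?language .
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
    }
    """

    resp = requests.get(
        "https://query.wikidata.org/sparql",
        params={"query": QUERY, "format": "json"},
        headers={"User-Agent": "p5137-example/0.1"},
    )
    resp.raise_for_status()
    for row in resp.json()["results"]["bindings"]:
        print(row["lemma"]["value"], "-", row["languageLabel"]["value"])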
Wikipedia doesn't support the extension that provides SPARQL, so that isn't a solution for me. Also, item for this sense (P5137) can't be used for phrases that consist of multiple words. Let me explain the idea by example: scholarly article (Q13442814) = scholarly (L13568) + article (L5515) (en). When you know the lexemes, you can find their abbreviations, translate them into another language (yes, it's a very complicated task), and also find the plural form of noun (Q1084) lexemes (articles (L5515-F2)). Specifying lexemes in items could help to automate many tasks. D6194c-1cc (talk) 11:32, 25 November 2022 (UTC)
Allow me to turn that approach around: Items don't specify any lexemes, so starting out with an item and going nowhere isn't a solution for me. But when I know the lexemes in the source language, I can look up their corresponding items, query Wikidata for matching lexemes in the target language and output the resulting words. It won't be pretty, but a human reader may be able to understand it anyway. Adding SPARQL support to Wikipedia could help automate this task.
Of course, I don't claim that adding SPARQL support to Wikipedia is done overnight, or that the result will not suffer from any performance problems, but let's consider the alternatives:
If the only phrase you will ever want to translate is "scholarly article", sure, that could essentially be done in less than a minute. But if the prerequisite is that we first have to find a technical format for "specifying lexemes in items", and then actually add those lexemes for, say, a million items and 400 languages, I'm not convinced we will finish that task before Wikipedia actually has SPARQL support.
We have two problems here: One is the apparent lack of a strategy for integrating Wikidata with other Wikimedia projects, and the other is the technical difficulty of automatically translating full sentences or multi-word phrases between different languages. I certainly hope that Wikipedia too will one day be able to take full advantage of all the effort that is put into Wikidata, but if Wikidata editors, in order to accommodate Wikipedia, have to spend most of their time worrying about item labels or designing robots to add inverse statements of those already added, I'm not sure Wikidata will ever reach its full potential.
And I use the phrase "apparent lack of strategy" to describe the impression I get, not to discredit the Wikimedia Foundation. There may well be things happening behind the scenes that I'm simply not aware of, but if so, why do we still have to add all those inverse properties? SM5POR (talk) 06:44, 26 November 2022 (UTC)
You can't get a lexeme by its name, because different lexemes have different meanings. For example, мир/міръ (L100000) has a plural form in Russian and мир/миръ (L99999) doesn't. Also, in mw.wikibase you can only fetch items directly; no queries are supported. So the only way to get the lexemes of an item is to specify the lexemes directly in the item. D6194c-1cc (talk) 08:14, 26 November 2022 (UTC)
  1. Items don't list lexemes. To get lexemes from items, someone has to specify them first. You propose doing that.
  2. Wikipedia doesn't do queries. To do queries in Wikipedia, someone has to implement it first. I propose doing that.
Why do you depend on an item to translate a phrase? You want to translate "scholarly article". There is an item scholarly article (Q13442814) labelled like that, which means you can translate it after specifying the lexemes "scholarly" and "article" in that item, right?
I want to translate "the water is cold". There is no item labelled like that, and I will not create it because it isn't notable. Instead I split the phrase into its constituent words "the", "water", "is" and "cold", search the lexemes for the corresponding forms and senses (this is a manual task, or one that requires AI) and do the translation.
Likewise, you want to split the item label "scholarly article" into its constituent words "scholarly" and "article", search the lexemes for the corresponding forms and senses (this is a manual task, or one that requires AI) and specify the resulting lexemes with the item.
We are essentially doing the same thing. The differences are:
  1. You require a preexisting item for the phrase you want to translate. I don't, I use the phrase directly.
  2. You specify the lexemes for all the existing items in advance. I specify the lexemes in the phrase I want to translate, only when I need them.
I'm right now quite puzzled as to your idea here; do you really expect Wikidata editors to begin listing hundreds of lexemes with every item, or do you hope it will be done automatically using AI? Because either way, that work is going to take time, and I don't see how you can possibly reduce the time needed to translate a single phrase by first preparing, say, a million items for future translation (there are around 100 million items in Wikidata, but most of them are unlikely to ever be looked up for translation).
As to the impossibility of doing database queries from Wikipedia, that's not some natural law, but an effect of what functionality has or has not been implemented yet. Wikidata has existed for merely ten years, and Wikipedia will hopefully continue to evolve with it for many years to come. It's a bad idea to redesign Wikidata on the assumption that Wikipedia will forever be stuck with its present capabilities only, especially if the Wikidata redesign will cost more than removing the Wikipedia limitations.
Maybe Wikipedia isn't currently the optimal translation tool? -- "If only this Christmas tree were built like a kitchen stove, we could boil eggs on it, and we wouldn't need a stove for that." -- SM5POR (talk) 13:03, 26 November 2022 (UTC)
Again, you cannot find a lexeme by its name. Different lexemes have the same name but different meanings. It's a task for AI (which determines context), rather than for simple automation. As for scholarly article (Q13442814) (in Russian: "научная статья"), the plural form in Russian would be "научные статьи" (this item was just an example). When you know the lexemes, it's not so hard to implement a language engine that changes the forms of words in a phrase. My task is to get phrases from Wikidata and use them as titles in modules. But I need to translate them and change their form. For this task I have made a workaround, but for the plural form I have to make exceptions for every language. D6194c-1cc (talk) 13:46, 26 November 2022 (UTC)
Identifying the proper lexemes (or rather their senses, which is what we want to do) certainly isn't trivial; we are in agreement there. The difference between our respective approaches is when we perform that first phase of the translation process, and what it costs to do it that way. Of course merely generating the translation will be faster (and can be done automatically) than parsing the original phrase plus generating the translation, but someone still has to do the parsing, and it will take at least the same effort.
How many modules (approximately) will you need titles for (hundreds, thousands, or millions), and how many times will you need to translate each individual item label (once, twice, or a hundred times)? Can those items you need translated be listed in advance so that we can make a better estimate of the work that has to be done, or do they pop up in a totally unpredictable fashion? SM5POR (talk) 15:07, 26 November 2022 (UTC)