Wikidata talk:Lexicographical data/Archive/2020/09

This page is an archive. Please do not modify it. Use the current page, even to continue an old discussion.

Verifying a Tamil noun, and how to create a batch of lexemes for pronunciation files from a Commons category?

Lexeme:L309431 This is one example of a Tamil noun with a pronunciation audio link from Commons. I hope that it is right. If so, how can I create a batch of lexemes for the audio files? --Info-farmer (talk) 08:09, 20 August 2020 (UTC)

Wikidata:Tools/Lexicographical_data has a few tools to create lexemes.

Currently, there seem to be only 236 existing ones [1], i.e. duplicates to avoid. --- Jura 19:54, 20 August 2020 (UTC)

Funny to have water (L3302) in this list. --Infovarius (talk) 16:12, 21 August 2020 (UTC)

@Infovarius: how can that word be kept out of the list? --Info-farmer (talk) 11:11, 27 August 2020 (UTC)

Now it is no longer there. --Infovarius (talk) 22:36, 6 September 2020 (UTC)

zxx

Hello. What is the status of this code? I tried to use it as "Spelling variant" (under lemma) with the pattern zxx-x-Qitem but it does not work.--MathTexLearner (talk) 23:28, 4 September 2020 (UTC)

@MathTexLearner: I'm not familiar with 'zxx', where did you find that? I've successfully used 'mis-x-Qitem' for this purpose. ArthurPSmith (talk) 00:37, 5 September 2020 (UTC)

@ArthurPSmith: "mis" is "for languages that have no code yet assigned", "zxx" is for "no linguistic content, not applicable". I would like to include LaTeX symbols, and they fall into that category (probably as zxx-x-Q5310). For instance, "\partial", which is the expression for the partial derivative symbol (Q2920327). There are around 14k LaTeX symbols, so it would be useful to have them classified here and properly linked to the symbols they represent, where applicable.--MathTexLearner (talk) 13:12, 5 September 2020 (UTC)

In answer to your question, more specifically you can find the definition of zxx here: https://en.m.wikipedia.org/wiki/ISO_639 --MathTexLearner (talk) 13:13, 5 September 2020 (UTC)

Some codes exist, but haven't been added yet. For those "mis" needs to be used. --- Jura 13:34, 5 September 2020 (UTC)

Hello all, I'm wondering if LaTeX symbols are meant to be stored in the Lexeme namespace, or if we should rather have them as Items? Any suggestions? Lea Lacroix (WMDE) (talk) 08:04, 7 September 2020 (UTC)

Good question, @Lea Lacroix (WMDE), MathTexLearner: I don't see how LaTeX codes or for example the symbols of a programming language ('if', 'for', etc.) would usefully be represented as lexemes - there is only a single form, no grammatical context, etc. To the extent they need a place in Wikidata I think they are best represented as string values for appropriate properties on the associated items. ArthurPSmith (talk) 13:30, 7 September 2020 (UTC)

Hello. The grammar of LaTeX is a different subject from the symbols of LaTeX represented by tags. In this case, I need a way to categorize the 14000 LaTeX tags and link each of them to the Wikidata item for the generic symbol it represents. They are not linguistic content, and as such neither "mul" nor "mis" is appropriate. It is possible for LaTeX tags to have different "senses", and it is hard to track them otherwise. It is also possible that different packages implement them. At the same time, there are several subsets of TeX, so it is more convenient to have them as Lexemes. I also believe that converting LaTeX tags into Lexemes can help machine learning operations that try to make sense of mathematical formulas written in that language. As it is now, the information about LaTeX is too unstructured, which is allowing private companies to offer better alternatives than the open-source option. I also believe that this demonstrator project could be used as a basis for a new Wikibase installation for CTAN.--MathTexLearner (talk) 12:50, 8 September 2020 (UTC)

"mul" seems to me more appropriate for any use case I see here. The symbols are actually used in linguistic contexts. ChristianKl ❪✉❫ 15:05, 7 September 2020 (UTC)
Maybe "mul" could indeed do. Let's say we create an entity for "SELECT" and then add a sense for its meaning in each query language? --- Jura 11:56, 16 September 2020 (UTC)

Alternative Lemmas, alternative Forms representation (Estonian)

Hi all. Could somebody please help clarify these aspects on the https://www.wikidata.org/wiki/Wikidata:Lexicographical_data/Documentation page:

1) how to properly represent (model) a situation, where 2 different representations exist for an (otherwise, same) lexeme, for example, in Estonian: sina/sa, mina/ma, meie/me, etc. Estonian has MANY such examples. Should those be two separate Lexemes (with some statement linking them)? Should a statement be added (which)? I personally would vote for SINGLE Lexeme, because otherwise all the properties of these lexemes would be identical, with forms, senses, etc. so this approach could potentially lead to over-duplication => errors/omissions/incompleteness, etc.

2) same question, but concerning the representation of alternating forms (e.g. Estonian "kodu" in the elative case could be given both as "kodust" and "kodunt"). I think I've read a recommendation about such cases somewhere, but can't find it right now, so I would like to have them documented on the above-mentioned page as well. Thanks!

– The preceding unsigned comment was added by 62mkv (talk • contribs) at 11:54, 6 September 2020 (UTC).

I think it depends on what type of forms they are. https://en.wiktionary.org/wiki/sa#Estonian just notes it as a "short form", but it also mentions that it's a pronoun, and pronouns are somewhat complex to model.
Maybe https://et.wiktionary.org can help you find a good approach for Estonian. --- Jura 07:58, 6 September 2020 (UTC)

- My question was not so much about Estonian per se. It was rather about how to properly model such a situation according to Wikidata norms, guidelines, standards, or whichever applies, when essentially the same word can have more than one representation. However, I think that I will just add alternative forms with the same grammatical features; maybe that's not a problem at all. --62mkv (talk) 16:47, 9 September 2020 (UTC)
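A minimal sketch of the approach settled on above: two forms on one lexeme that share the same grammatical features. The form layout follows the JSON shape commonly used when adding forms to a lexeme; the item IDs are placeholders rather than real Wikidata items, and the helper names `make_form` and `alternative_forms` are hypothetical.

```python
# Placeholder IDs: look up the real items for these grammatical features before use.
ELATIVE = "Q_ELATIVE_PLACEHOLDER"
NOMINATIVE = "Q_NOMINATIVE_PLACEHOLDER"
SINGULAR = "Q_SINGULAR_PLACEHOLDER"


def make_form(representation, language_code, features):
    """Build one new form as a plain dict (hypothetical helper)."""
    return {
        "add": "",  # marks the form as new in an edit payload
        "representations": {
            language_code: {"language": language_code, "value": representation}
        },
        "grammaticalFeatures": sorted(features),
        "claims": [],
    }


def alternative_forms(forms):
    """Return representations that share an identical set of grammatical features."""
    groups = {}
    for form in forms:
        key = tuple(form["grammaticalFeatures"])
        values = [rep["value"] for rep in form["representations"].values()]
        groups.setdefault(key, []).extend(values)
    # Keep only feature sets that have more than one representation.
    return {key: values for key, values in groups.items() if len(values) > 1}


forms = [
    make_form("kodu", "et", [NOMINATIVE, SINGULAR]),
    make_form("kodust", "et", [ELATIVE, SINGULAR]),
    make_form("kodunt", "et", [ELATIVE, SINGULAR]),
]
print(alternative_forms(forms))
```

Grouping by feature set makes the alternating forms easy to spot later, for example when deciding whether they need a statement that distinguishes them.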

Entry of data for old languages or languages that do not have an ISO code

I am working a bit with Middle Danish (Q12313492) and it is unclear to me how one can enter data about such a language. I do not think it has an ISO code. For "spelling variant" I have used "mis". For usage example (P5831) I have used "da", see oc (L312269). Any other suggestions? — Finn Årup Nielsen (fnielsen) (talk) 10:02, 8 September 2020 (UTC)

If possible, I'd try to use the same code in both uses.

The question is whether the code should eventually be a new code (e.g. oldda) or an extension of "da", e.g. "da-old". In the latter case, the lemma should have "da-x-Q12313492".

The rejection of a code for Old Swedish suggested using the code for Old Norse ("non") instead, and it mentioned Old Danish. In that case, "non-x-Q12313492" or "non-da" might be appropriate?

Maybe one could start with "mis" and eventually make a code request for Middle Danish. --- Jura 10:30, 8 September 2020 (UTC)

I have changed the lemma to "mis-x-Q12313492". However, for usage example (P5831) and similar properties with monolingual text, I am not sure how best to annotate the language. There are several old Danish languages: "gammeldansk"/"Middle Danish", Middle Danish (Q12313492) (1100-1500), and "olddansk"/"Old Danish", Old Danish (Q12330003) (800-1100) [2]. The language before that is "Urnordisk"/"Proto-Norse". That could mean that we need a da-mid and a da-old!? — Finn Årup Nielsen (fnielsen) (talk) 10:50, 8 September 2020 (UTC)

BTW, I added some details about the mis-x-qid system to User:Lea_Lacroix_(WMDE)/List_of_lists_of_languages#General_ideas (bottom of page). A query finds the most frequent ones: https://w.wiki/cMB --- Jura 11:48, 16 September 2020 (UTC)
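The "base-x-Qitem" pattern used in this thread (for example "mis-x-Q12313492") can be sketched as a small helper. This is only an illustration of the string pattern discussed here; the function names and the regular expression are assumptions, not part of any Wikidata tool, and the check does not tell you whether Wikidata accepts a given base code.

```python
import re

# Assumed shape: a 2-3 letter base code, the private-use marker "x", then an item ID.
PRIVATE_USE_CODE = re.compile(r"^(?P<base>[a-z]{2,3})-x-(?P<qid>Q[1-9][0-9]*)$")


def build_code(base, qid):
    """Combine a base language code and an item ID, e.g. 'mis' + 'Q12313492'."""
    code = f"{base}-x-{qid}"
    if not PRIVATE_USE_CODE.match(code):
        raise ValueError(f"not a base-x-Qitem code: {code!r}")
    return code


def parse_code(code):
    """Return (base, qid) for a base-x-Qitem code, or None if it does not match."""
    match = PRIVATE_USE_CODE.match(code)
    return (match["base"], match["qid"]) if match else None


print(build_code("mis", "Q12313492"))
print(parse_code("da-x-Q12313492"))
print(parse_code("da-old"))
```

Keeping the item ID in the code means the same helper covers both options raised above: "mis" as the base now, and "da" or "non" later if one of those is preferred.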

Need some help with LexData

Hi,

I'm using (or at least trying to use) LexData to create Esperanto lexemes. However, I can't add the claim "P8029" because the type (external-id) is not supported yet. Is there a way to get around this issue? Lepticed7 (talk) 08:57, 14 September 2020 (UTC)

If it's not supported, you could create the lexemes first and then add the identifiers with QuickStatements. --- Jura 11:52, 16 September 2020 (UTC)
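A rough sketch of the second step suggested above: generating a batch for QuickStatements once the lexemes exist. It assumes the tab-separated command format with the entity ID, the property and a quoted string value, and it assumes that QuickStatements accepts lexeme IDs in the first column. The lexeme IDs and identifier values below are made up for illustration.

```python
def quickstatements_lines(pairs, prop="P8029"):
    """Turn (lexeme ID, identifier) pairs into tab-separated command lines."""
    lines = []
    for lexeme_id, identifier in pairs:
        if not (lexeme_id.startswith("L") and lexeme_id[1:].isdigit()):
            raise ValueError(f"not a lexeme ID: {lexeme_id!r}")
        # String-like values are wrapped in double quotes.
        lines.append(f'{lexeme_id}\t{prop}\t"{identifier}"')
    return lines


# Hypothetical data: neither the lexeme IDs nor the identifiers are real.
pairs = [("L123", "example-id-1"), ("L456", "example-id-2")]
print("\n".join(quickstatements_lines(pairs)))
```

The pairs would come from whatever mapping the lexeme-creation step produced, so it is worth saving the new lexeme IDs while creating them.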

Anyone else who would like to comment on the misspellings property and definition of "common misspelling"?

There are 2 supporters, 1 opposer and 2 neutral. See Wikidata:Property proposal/common misspellings--So9q (talk) 18:56, 24 September 2020 (UTC)

Missing English adjectives

We have a lot of English adverbs like [3] where the adjective is missing. I would like to create a query to check for all of those: query lexemes ending in "ly", cut away the "ly", and check whether we have the remaining string as a lexeme among the adjectives. Would anyone like to help create such a query?

Then I would like to have the lexemes created by bot. Anyone volunteering to do that?--So9q (talk) 07:17, 29 September 2020 (UTC)

I'm not sure that's the right approach here. We do have the form "surprising" under the verb "surprise" - Lexeme:L3719; do we really need to create a separate lexeme as an adjective for that form? That doesn't seem entirely logical to me, given the way English participles work. On the other hand the senses may be somewhat unique to each form - "surprising" has a different meaning from "surprised". So maybe? But I would proceed with care, at least at first, and not just automate with a bot. Besides which what we really need is more senses, not more lexemes, I think... ArthurPSmith (talk) 12:42, 29 September 2020 (UTC)
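The check described in this thread can be sketched offline, assuming the adverb and adjective lemmas have already been exported (for example from a query) into two sets. The function names are hypothetical, and the extra "-ily" to "-y" rule is an added assumption beyond the plain "cut away the ly" step.

```python
def adjective_candidates(adverb):
    """Guess adjective lemmas for an adverb ending in 'ly'."""
    if len(adverb) <= 3 or not adverb.endswith("ly"):
        return []
    stem = adverb[:-2]
    candidates = [stem]
    if stem.endswith("i"):
        # Assumed extra rule: "happily" -> "happy".
        candidates.append(stem[:-1] + "y")
    return candidates


def missing_adjectives(adverbs, adjectives):
    """Map each adverb to its candidate adjectives when none of them exists yet."""
    missing = {}
    for adverb in sorted(adverbs):
        candidates = adjective_candidates(adverb)
        if candidates and not any(c in adjectives for c in candidates):
            missing[adverb] = candidates
    return missing


# Small in-memory sample standing in for real query results.
adverbs = {"surprisingly", "happily", "quickly"}
adjectives = {"quick", "happy"}
print(missing_adjectives(adverbs, adjectives))
```

The output is a review list rather than a to-do list for a bot: as noted above, a candidate such as "surprising" may already exist as a form of a verb lexeme.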

Lexeme utils Python script

Hi, I have typed in a few hundred lexemes via Lexeme Forms, but IMO it's just too much repetition.

I therefore intend to write a Python script that does a better job.

I need example code for creating a new lexeme from Python and for checking whether a lexeme already exists. Does anyone have that lying around?

Stage 2 is to make the script help with linking forms from two lexemes that are alternative spellings and other tedious tasks.

Stage 3 is to utilize or borrow Ordia's text-to-lexemes to make it possible to ingest e.g. the latest documents from http://data.riksdagen.se and create lexemes, with usage examples, for every single word there.

Wanna participate?--So9q (talk) 04:45, 28 September 2020 (UTC)

Pinging @MichaelSchoenitzer, Lucas Werkmeister, Yurik: whose work could probably help get you started :) Lea Lacroix (WMDE) (talk) 09:14, 29 September 2020 (UTC)

@So9q: I wrote LexData as a Python library for Lexicographical data. It should be easy to build on that. MachtSinn is built on top of it; its source is also available. -- MichaelSchoenitzer (talk) 15:17, 29 September 2020 (UTC)

Since you already know the Lexeme Forms tool, you could look at its code? E. g. build_lexeme and submit_lexeme to create a lexeme, and get_duplicates to find potential duplicates. Also, maybe you could use bulk mode to create lexemes with less repetition? (The input to bulk mode could be generated by a script.) --Lucas Werkmeister (talk) 20:31, 29 September 2020 (UTC)

Good idea :), had not thought about using the bulk mode for this.--So9q (talk) 07:40, 30 September 2020 (UTC)
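For the original question, a minimal sketch of the two requests involved, using only the standard library and no network access. It assumes the Wikibase API modules `wbsearchentities` (with `type=lexeme`) and `wbeditentity` (with `new=lexeme`); the helper names are hypothetical, the sample response is hand-written, and the assumption that search hits carry the lemma in `label` should be checked against a real response. Actually sending the edit needs a logged-in session and a CSRF token, which are left out here.

```python
import json

API_URL = "https://www.wikidata.org/w/api.php"


def search_params(lemma, language_code):
    """Parameters for a lexeme search on the lemma (used as a duplicate check)."""
    return {
        "action": "wbsearchentities",
        "search": lemma,
        "language": language_code,
        "type": "lexeme",
        "format": "json",
    }


def exact_matches(search_response, lemma):
    """Lexeme IDs in a search response whose label equals the lemma exactly."""
    hits = search_response.get("search", [])
    return [hit["id"] for hit in hits if hit.get("label") == lemma]


def new_lexeme_params(lemma, language_code, language_item, category_item, token):
    """Parameters for creating a lexeme with a single lemma and no forms."""
    data = {
        "type": "lexeme",
        "lemmas": {language_code: {"language": language_code, "value": lemma}},
        "language": language_item,
        "lexicalCategory": category_item,
    }
    return {
        "action": "wbeditentity",
        "new": "lexeme",
        "data": json.dumps(data),
        "token": token,
        "format": "json",
    }


# Hand-written sample response, not real API output.
sample_response = {"search": [{"id": "L3302", "label": "water"}]}
if exact_matches(sample_response, "water"):
    print("lexeme exists, skip creation")
else:
    # Q1860 = English, Q1084 = noun; the token is a placeholder.
    print(new_lexeme_params("water", "en", "Q1860", "Q1084", "TOKEN"))
```

An exact label match is deliberately strict: homographs in other languages or lexical categories would also match, so a real script should compare language and category as well before skipping a word.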