# Wikidata talk:Lexicographical data



 Place used to discuss any and all aspects of lexicographical data: the project itself, policy and proposals, individual lexicographical items, technical issues, etc.
 On this page, old discussions are archived. An overview of all archives can be found at this page's archive index. The current archive is located at 2020/09.

## Milestone - 200k lexemes

My bot just created spiritualistically (L200000), the 200000th lexeme, while importing Wiktionary adverbs! It means "in a way relating to being spiritual".  – The preceding unsigned comment was added by SixTwoEight (talk • contribs) at 22:04, 11 October 2019 (UTC).

## Abbreviations

Yesterday, I added all the United States Postal Service abbreviations (Q30619513) for geographic directionals and street suffixes to lexemes as forms with the feature United States Postal Service abbreviation (Q30619513). ArthurPSmith pointed out that this is a suboptimal way to represent abbreviations, because they don't necessarily have different grammatical properties than their spelled-out forms. I had been following an existing example of an abbreviation form in Special:PermanentLink/1230867700#F3, but I see Arthur's point that a property would be more flexible. It certainly would've shortened the federated query I was building in OpenStreetMap:SPARQL examples#Abbreviated street addresses in Cincinnati.

How should we model these abbreviations? Should I piggyback on Wikidata:Property proposal/Alternative form, qualifying alternative form statements with determination method (P459) set to United States Postal Service abbreviation (Q30619513)? Or should there be a separate property corresponding to United States Postal Service abbreviation (Q30619513)?

– Minh Nguyễn 💬 17:59, 3 August 2020 (UTC)

So the alternative I was vaguely thinking of here was having a property on a form (for example the existing form entry for "street") that has as value the standard abbreviation ("ST"), rather than having it as a form on its own. But maybe the separate form version is ok. It wouldn't be a property for a regular item like street (Q79007) though because it's specifically related to the English word it replaces, not its conceptual meaning. ArthurPSmith (talk) 18:57, 3 August 2020 (UTC)
Would such a property best belong to a form, or to a sense? Taking the street address "St" as an example, the corresponding Swedish abbreviations would be "g" for "gatan" (street) or "v" for "vägen" (road). Swedish street names are usually compound words expressed in definite form, such as "Storgatan" (Main Street) or "Byvägen" (Village Road), though when named after people they appear as separate words in indefinite form, with the person's name in genitive ("Olof Palmes gata", "Sernanders väg"). The abbreviations are the same anyway, regardless of form ("Storg", "Byv", "Olof Palmes g", "Sernanders v"). They are however restricted to street addresses only, never used in other contexts, senses or grammatical constructs involving said words. --SM5POR (talk) 02:20, 10 August 2020 (UTC)
I think attaching to the form is better; if there happen to be two different meanings for the same word in an address it would still have the same abbreviation, and the abbreviation for plural forms may be different from that for singular forms. ArthurPSmith (talk) 18:11, 10 August 2020 (UTC)
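
The two modelling options discussed in this thread can be contrasted in a rough sketch. It uses plain Python dataclasses, not the Wikibase data model or any Wikidata API; the `Form` class, its `abbreviations` field, the feature names and the `abbreviate` helper are all hypothetical and only illustrate the idea.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Form:
    """Hypothetical stand-in for a lexeme form; not the Wikibase schema."""
    representation: str
    features: frozenset[str]
    # Option B: abbreviations kept as a property-like value on the form.
    abbreviations: list[str] = field(default_factory=list)


# Option A: the abbreviation is a form of its own, marked by a pseudo-feature.
option_a = [
    Form("street", frozenset({"singular"})),
    Form("ST", frozenset({"singular", "USPS abbreviation"})),
]

# Option B: the abbreviation is attached to the form it replaces.
option_b = [
    Form("street", frozenset({"singular"}), abbreviations=["ST"]),
]


def abbreviate(forms: list[Form], word: str) -> str | None:
    """Return the first abbreviation stored on the form matching `word`."""
    for form in forms:
        if form.representation == word and form.abbreviations:
            return form.abbreviations[0]
    return None
```

Under option B a lookup such as `abbreviate(option_b, "street")` needs no knowledge of a special grammatical feature, and a plural form could carry a different abbreviation of its own.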

### 2

User:Jsamwrite recently added some abbreviations as synonyms to main Lexeme: example. I suppose it's better to have them as forms of such Lexeme, what do you think? --Infovarius (talk) 18:33, 10 August 2020 (UTC)

Ah, days of the week in Polish... I agree that abbreviations shouldn't have separate lexemes, just as mentioned in the case starting this thread (unless they have become words in their own right, such as "laser", or perhaps even pronounced acronyms like "DNA", "TV", "PhD"); as is pointed out above, they rarely (if ever) have different grammatical features than the fully spelled out versions.
But forms? There isn't a grammatical form "abbreviative" (and I don't think we should invent one). The abbreviations we are talking about here are merely written variants of the same words for use in special contexts; they are never pronounced that way (if they happen to be in some language, please educate me). If we begin adding abbreviations to lexemes without having dedicated support for that (such as a separate property), I'm concerned the simple lexeme model will either explode in expressiveness (but not necessarily in usability) or become an entangled mess of overlaid interpretations, as there are other written variants than abbreviations to consider: multiple writing systems (Latin or Cyrillic), numeric notations ("twelve", "12" or "XII"), separate graphemes ("and" or "&", "copyright" or "©", "dollar" or "$") and so on (I'll make a point about that expression "and so on" later, btw). If a spelled-out word and its abbreviation don't differ in pronunciation, I fail to see how they can have different plurals, cases or genders. Adding them as forms would either result in a combinatorial explosion with existing forms, or override all other forms with a non-grammatical variant character sequence. --SM5POR (talk) 10:17, 11 August 2020 (UTC)
I think the explanation we had received for pl is that acronyms can also have countless forms in that language. I don't know if it also applies to other abbreviations in pl.
As for abbreviations in most languages, I'd add them as a form on the entity with the unabbreviated form. --- Jura 17:57, 15 August 2020 (UTC)
Then I came across SPQR. (L285219) ;) --- Jura 18:04, 15 August 2020 (UTC)
In my opinion "TV" may be a lexeme covering all terms that are abbreviated as TV.--GZWDer (talk) 15:33, 8 September 2020 (UTC)

## Lexeme Forms documentation improvement for variants

Variants seem to be in a very confusing state currently, with several overlapping practices. I would like this discussion to help improve the documentation, since there seem to be evolving standards, or the standards and best practices are simply not well documented. Let's improve that and discuss it here.

How should we deal with phrases or idioms that include hyphens? (I am not asking about "compound words", since a compound word expresses a meaning different from that of the individual words it is made from (e.g. moonlight: moon, light; or sunflower: sun, flower). Instead I am simply asking about variants where the sense or meaning is the same and only the Form or its Representation differs.)

We already know that in many languages any phrase potentially could be hyphenated or not. For example:

This might or might not affect pattern matching and tokenization efforts with Abstract Wikipedia later on; I don't know, but maybe @Denny: could chime in on that a bit. So I think it's worth mentioning here. I do know that in some languages, various glyphs (like the hyphen -) are deemed important enough that they sometimes change the meaning of a phrase. In other languages, the optional glyphs are sometimes not important and have no effect on the meaning of a phrase. For English, my hunch is that these 2 Lexemes should be merged, with 2 Forms created instead: one for the hyphenated form and one for the regular form? But I'm not 100% sure based on my current reading of some documentation and other Talk pages floating around.

Once someone can help me to update the wiki docs with this kind of example and best practices for variants, I think it will be much more useful to understand how to handle more kinds of spelling variations for a particular sense's form that occur within a single language.

I also see alternatives of spelling variants being done like so: ax | axe https://www.wikidata.org/wiki/Lexeme:L14679

I would also like to see the documentation updated with regard to how that actually works and what the spelling variant en-x- means.

It seems there are also side discussions and questions around 'variants' handling in the following:

• On a meta level across languages, I don't think you get around considering all three approaches (i.e. as separate form, on the same form, as separate lexeme).
That said, from your comment it's not entirely clear if you are primarily interested in phrases (which have additional problems) or just the color/colour type of thing.
Phrases can have the same meaning without the same words being present in every use of the phrase. As a linguist once summarized it, they have key elements that are always present. --- Jura 07:15, 16 August 2020 (UTC)
• @Jura1: In general it is about handling spelling variants (specifically about helping later with compound analysis like in Lucene's analysis class for Hyphen) where words change spelling if they are split across lines, like German's backen, which hyphenates to bak-ken. But the question is also about intra-word delimited phrases (words split into subwords), which you would perhaps liken to the WordDelimiterFilter, where the hyphens are often optionally kept as part of the phrase. I'm thinking that I really need Denny's answer here; he will understand my underlying question a bit better. I'm probably not explaining it well enough. --Thadguidry (talk) 00:06, 21 August 2020 (UTC)
• @ArthurPSmith: Maybe alternative form plays a role in this same discussion as well? --Thadguidry (talk) 00:22, 21 August 2020 (UTC)
• If the question is mainly about hyphens and hyphenation, have a look at hyphenation (P5279). As for L190266 and L190270, I'd merge them and use one lexeme with two forms. It can't really be on the same form, as creating a language code for the hyphen-less format would be odd. (BTW I had pinged its creator, maybe it was made as 2 for some purpose.) --- Jura 05:02, 21 August 2020 (UTC)
• @Thadguidry: In general I'd prefer having a single lexeme for hyphenated and non-hyphenated otherwise-identical phrases. However in cases where there may be a difference in meaning (not just of grammatical aspects) or in etymology, then separate lexemes may be justified. It does sound like we need to have separate forms. And yes the proposed property could be used for this. ArthurPSmith (talk) 13:50, 21 August 2020 (UTC)
• @ArthurPSmith: Hmm, merge doesn't complete...just spins (and nothing in the js console...darn). Also, do you have a preference where I should document our mutual agreement on this best handling for hyphenated words? -- Thadguidry (talk) 15:17, 21 August 2020 (UTC)
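
As a rough illustration of the "one lexeme, two forms" outcome above, the following sketch groups lemma strings that differ only in hyphens or spaces, so that each group could become a single lexeme with one form per spelling. The function names and the sample strings are hypothetical (they are not the contents of L190266 and L190270), and the normalization rule is an assumption that would only suit languages where the hyphen does not change the meaning.

```python
import re
from collections import defaultdict


def variant_key(text: str) -> str:
    """Drop hyphens and whitespace so hyphenation variants share one key."""
    return re.sub(r"[-\s]+", "", text.casefold())


def group_variants(lemmas: list[str]) -> dict[str, list[str]]:
    """Group lemma strings that differ only in hyphens or spaces."""
    groups: dict[str, list[str]] = defaultdict(list)
    for lemma in lemmas:
        groups[variant_key(lemma)].append(lemma)
    return dict(groups)


# Hypothetical input: two spellings of one phrase plus an unrelated word.
candidates = group_variants(["ice cream", "ice-cream", "moonlight"])
```

Groups with more than one member are only merge candidates; as noted in the replies, a difference in meaning or etymology would still justify separate lexemes, so a human check remains necessary.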

## verify a Tamil noun and how to create a batch of pronunciation files from commons category?

Lexeme:L309431 This is one example of a Tamil noun with a pronunciation audio link from Commons. I hope that this is right. If so, how can I create a batch of lexemes for the audio files? --Info-farmer (talk) 08:09, 20 August 2020 (UTC)

Currently, there seem to be only 236 existing ones [1], i.e. duplicates to avoid. --- Jura 19:54, 20 August 2020 (UTC)
Funny to have water (L3302) in this list. --Infovarius (talk) 16:12, 21 August 2020 (UTC)
Infovarius, how can I avoid the word in that list? --Info-farmer (talk) 11:11, 27 August 2020 (UTC)
Now it is no longer there. --Infovarius (talk) 22:36, 6 September 2020 (UTC)

## Wikidata:Property proposal/WordNet 3.1 Synset Id

Hello, because you deal with lexicographical data on Wikidata, could you give your opinion about this property proposal, especially whether it should be applied to items or to lexemes? Pamputt (talk) 06:45, 25 August 2020 (UTC)

## Looking for a co-speaker to present Lexemes to the Italian-speaking community

Hello all,

I got asked to present Wikidata Lexemes & how they can be used on Wiktionary at the Italian WikiCon (taking place online on October 24-25, the presentation would be in English). I can of course give an overview of Lexemes, the features, etc. but I think it would be much more interesting if there's also someone from the community who is editing Lexemes, and trying to connect Lexemes and Wiktionary content.

Would anyone be interested to work with me on giving this presentation? Thanks in advance :) Lea Lacroix (WMDE) (talk) 09:58, 31 August 2020 (UTC)

@Lea Lacroix (WMDE): I'd be down to do it, as I've recently given similar introductions to speakers of Marathi and Sanskrit. (Not sure how connecting Wiktionary to lexemes has been going in the absence of work on phab:T212843 and its subtasks.) Mahir256 (talk) 12:26, 31 August 2020 (UTC)

## zxx

Hello. What is the status of this code? I tried to use it as "Spelling variant" (under lemma) with the pattern zxx-x-Qitem but it does not work.--MathTexLearner (talk) 23:28, 4 September 2020 (UTC)

@MathTexLearner: I'm not familiar with 'zxx', where did you find that? I've successfully used 'mis-x-Qitem' for this purpose. ArthurPSmith (talk) 00:37, 5 September 2020 (UTC)
@ArthurPSmith: "mis" is "for languages that have no code yet assigned", "zxx" is for "no linguistic content, not applicable". I would like to include LaTeX symbols, and they fall into that category (probably as zxx-x-Q5310). For instance, "\partial" which is the expression of partial derivative symbol (Q2920327). There are around 14k LaTeX symbols, so it would be useful to have them classified here, and properly linked to the symbols they represent, where applicable.--MathTexLearner (talk) 13:12, 5 September 2020 (UTC)
In answer to your question, more specifically you can find the definition of zxx here: https://en.m.wikipedia.org/wiki/ISO_639 --MathTexLearner (talk) 13:13, 5 September 2020 (UTC)
Some codes exist, but haven't been added yet. For those "mis" needs to be used. --- Jura 13:34, 5 September 2020 (UTC)
Hello all, I'm wondering if LaTeX symbols are meant to be stored in the Lexeme namespace, or if we should rather have them as Items? Any suggestions? Lea Lacroix (WMDE) (talk) 08:04, 7 September 2020 (UTC)
Good question, @Lea Lacroix (WMDE), MathTexLearner: I don't see how LaTeX codes or for example the symbols of a programming language ('if', 'for', etc.) would usefully be represented as lexemes - there is only a single form, no grammatical context, etc. To the extent they need a place in Wikidata I think they are best represented as string values for appropriate properties on the associated items. ArthurPSmith (talk) 13:30, 7 September 2020 (UTC)
Hello. The grammar of LaTeX is a different subject than the symbols of LaTeX represented by tags. In this case, I need a way to categorize the 14000 LaTeX tags, and link them with their Wikidata item corresponding to the generic symbol. They are not linguistic content, and as such neither "mul" nor "mis" are appropriate. It is possible for LaTeX tags to have different "senses", and it is hard to track them otherwise. Then it is also possible that there are different packages implementing them. At the same time, there are several subsets of TeX, so it is more convenient to have them as Lexemes. I also believe that converting LaTeX tags into Lexemes can help in machine learning operations that try to make sense of mathematical formulas written in that language. As it is now, the information about LaTeX is too unstructured, which is allowing private companies to offer better alternatives than the open-source option. I also believe that this demonstrator project could be used as a basis for a new Wikibase installation for CTAN.--MathTexLearner (talk) 12:50, 8 September 2020 (UTC)
• "mul" seems to me more appropriate for any use case I see here. The symbols are actually used in linguistic contexts. ChristianKl❫ 15:05, 7 September 2020 (UTC)
• Maybe "mul" could indeed do. Let's say we create an entity for "SELECT" and then add a sense for its meaning in each query language? --- Jura 11:56, 16 September 2020 (UTC)
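
For reference, the `base-x-Qitem` pattern used in this thread can be assembled and shape-checked with a small helper. This is only a sketch: the function name and the regular expressions are assumptions, and passing the check says nothing about whether Wikidata actually accepts the base code, which is exactly what is in question for "zxx".

```python
import re

# Assumed shapes: a 2-3 letter lowercase base code and a Wikidata item ID.
_BASE_CODE = re.compile(r"[a-z]{2,3}")
_ITEM_ID = re.compile(r"Q[1-9][0-9]*")


def variant_code(base: str, item_id: str) -> str:
    """Build a spelling variant code such as 'mis-x-Q5310'."""
    if not _BASE_CODE.fullmatch(base):
        raise ValueError(f"unexpected base code: {base!r}")
    if not _ITEM_ID.fullmatch(item_id):
        raise ValueError(f"unexpected item ID: {item_id!r}")
    return f"{base}-x-{item_id}"


# 'mis' is the base code reported above as working; Q5310 is the item
# mentioned in the thread for LaTeX.
code = variant_code("mis", "Q5310")
```

The helper would also produce a well-formed `zxx-x-Q5310`, but per the first post that code was not accepted at the time.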

## Alternative Lemmas, alternative Forms representation (Estonian)

1) How should one properly represent (model) a situation where 2 different representations exist for an (otherwise identical) lexeme, for example, in Estonian: sina/sa, mina/ma, meie/me, etc.? Estonian has MANY such examples. Should those be two separate Lexemes (with some statement linking them)? Should a statement be added (which)? I personally would vote for a SINGLE Lexeme, because otherwise all the properties of these lexemes would be identical, with forms, senses, etc., so this approach could potentially lead to over-duplication => errors/omissions/incompleteness, etc.

2) Same question, but concerning the representation of alternating forms (e.g. Estonian "kodu" in the elative case could be given both as "kodust" and "kodunt"). I think I've read the recommendation about such cases somewhere, but can't find it right now. So I would like to have them also documented in the above-mentioned page. Thanks!

– The preceding unsigned comment was added by 62mkv (talk • contribs) at 11:54, 6 September 2020 (UTC).

• My question was not so much about Estonian, per se. It was rather about "how to properly (from Wikidata norms, guidelines, standards, or whichever) model such situation" (when essentially same word could have more than one representation). However, I think that I will just add alternative forms with the same grammatical features, maybe that's not a problem at all. --62mkv (talk) 16:47, 9 September 2020 (UTC)

## Entry of data for old languages or languages that do not have an ISO code

I am working a bit with Middle Danish (Q12313492) and it is unclear to me how one can enter data about such a language. I do not think it has an ISO code. For "spelling variant" I have used "mis". For usage example (P5831) I have used "da", see oc (L312269). Any other suggestions? — Finn Årup Nielsen (fnielsen) (talk) 10:02, 8 September 2020 (UTC)

If possible, I'd try to use the same code in both uses.
The question is whether the code should eventually be a new code (e.g. oldda) or an extension of "da", e.g. "da-old". In the latter case, the lemma should have "da-x-Q12313492".
The rejection of a code for old Swedish suggested using the code for Old Norse ("non") for that and mentioned Old Danish. For that "non-x-Q12313492" or "non-da" might be appropriate?
Maybe one could start with "mis" and eventually make a code request for Middle Danish. --- Jura 10:30, 8 September 2020 (UTC)
I have changed the lemma to "mis-x-Q12313492". However, for usage example (P5831) or other similar properties with monolingual text I am not sure how best to annotate. There are various old Danish languages: "gammeldansk"/"Middle Danish" Middle Danish (Q12313492) (1100-1500) and "olddansk"/"Old Danish" Old Danish (Q12330003) (800-1100) [2]. The language before that is "Urnordisk"/"Proto-Norse". That could mean that we need a da-mid and a da-old!? — Finn Årup Nielsen (fnielsen) (talk) 10:50, 8 September 2020 (UTC)
BTW, I added some details about the mis-x-qid system to User:Lea_Lacroix_(WMDE)/List_of_lists_of_languages#General_ideas (bottom of page). A query finds the most frequent ones: https://w.wiki/cMB --- Jura 11:48, 16 September 2020 (UTC)
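
An offline analogue of "finding the most frequent ones" could look like the sketch below. It is not the query behind the short link above; it only tallies, from a local list of spelling variant codes, which items are most often attached to "mis". The helper names, the regular expression and the sample list are hypothetical.

```python
from __future__ import annotations

import re
from collections import Counter

# Assumed shape of a private-use code: base code, '-x-', Wikidata item ID.
_PRIVATE_USE = re.compile(r"([a-z]{2,3})-x-(Q[1-9][0-9]*)")


def split_code(code: str) -> tuple[str, str] | None:
    """Split 'mis-x-Q12313492' into ('mis', 'Q12313492'); None if no match."""
    match = _PRIVATE_USE.fullmatch(code)
    return (match.group(1), match.group(2)) if match else None


def tally_mis_items(codes: list[str]) -> Counter[str]:
    """Count how often each item ID appears in 'mis-x-Q...' codes."""
    counts: Counter[str] = Counter()
    for code in codes:
        parsed = split_code(code)
        if parsed is not None and parsed[0] == "mis":
            counts[parsed[1]] += 1
    return counts


# Hypothetical sample built from the codes mentioned in this thread.
sample = ["mis-x-Q12313492", "da", "mis-x-Q12313492", "mis-x-Q12330003"]
most_frequent = tally_mis_items(sample).most_common(1)
```

Items that show up often in such a tally would be natural candidates for a proper language code request.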

## Need some help with LexData

Hi,

I'm using (or at least trying to use) LexData to create Esperanto lexemes. However, I can't add the claim "P8029" because the type (external-id) is not supported yet. Is there a way to get around this issue? Lepticed7 (talk) 08:57, 14 September 2020 (UTC)

• If it's not supported, you could create the lexemes first and then add the identifiers with QuickStatements. --- Jura 11:52, 16 September 2020 (UTC)
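
The second step of that workaround could be prepared with a short script that turns a mapping of lexeme IDs to identifier values into tab-separated QuickStatements (V1) lines. This is a sketch under assumptions: the function name, the lexeme IDs and the identifier values are made up, and it presumes QuickStatements accepts lexeme IDs as subjects, as the reply suggests.

```python
def quickstatements_lines(identifiers: dict[str, str], prop: str = "P8029") -> list[str]:
    """Format one tab-separated V1 line per lexeme: subject, property, "value"."""
    lines = []
    for lexeme_id, value in identifiers.items():
        # Quotes or tabs inside a value would break the line format, so they
        # are rejected here rather than guessed at with an escaping rule.
        if '"' in value or "\t" in value:
            raise ValueError(f"unsupported character in value for {lexeme_id}")
        lines.append(f'{lexeme_id}\t{prop}\t"{value}"')
    return lines


# Hypothetical lexeme IDs and identifier values, for illustration only.
batch = quickstatements_lines({"L1000001": "example-id-1", "L1000002": "example-id-2"})
print("\n".join(batch))
```

The printed lines can then be pasted into the QuickStatements batch input once the lexemes themselves exist.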