
Wikidata talk:Lexicographical data















Place used to discuss any and all aspects of lexicographical data: the project itself, policy and proposals, individual lexicographical items, technical issues, etc.

On this page, old discussions are archived. An overview of all archives can be found at this page's archive index. The current archive is located at 2019/01.


Requests for deletion


Following the previous discussions about phonemes and graphemes, I created two requests for deletion :

A galon, VIGNERON (talk) 13:17, 29 December 2018 (UTC)

Modeling emphasis (stress) in pronunciation

Is there some established practice about how to indicate stress (i.e. which syllable is emphasized, in languages that use stress, e.g. most Indo-European and Semitic languages) on Wikidata?

One fairly standard way is to place an acute accent over the (vowel in the) stressed syllable. But producing a combining acute accent is difficult for many contributors (e.g. no doubt easy for French speakers, who have acutes on their keyboards, but not so much for German, English, Russian, or Swahili speakers). Also, acutes are not used (and may not be supported by fonts) in many non-Latin and non-Cyrillic languages, so other ways need to be found to mark stress in Hebrew, Arabic, etc. (Hebrew has a traditional mark, the meteg, that goes under the letters.)

But quite apart from a character to be placed in the Forms themselves, I wonder about more properly modelling emphasis, which would imply modelling syllabification. Would that be overkill?

Thoughts welcome. Asaf Bartov (talk) 15:27, 31 December 2018 (UTC)
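To illustrate the acute-accent approach mentioned above, here is a minimal Python sketch. It is not an established Wikidata practice; the function name `mark_stress` and the idea of passing the vowel position by index are assumptions made for this example only. It inserts U+0301 (COMBINING ACUTE ACCENT) after the chosen vowel and applies NFC normalization, so a precomposed letter is used where Unicode has one.

```python
import unicodedata

COMBINING_ACUTE = "\u0301"  # U+0301 COMBINING ACUTE ACCENT


def mark_stress(word: str, vowel_index: int) -> str:
    """Return `word` with an acute accent on the character at `vowel_index`.

    Hypothetical helper: the caller is assumed to know which character
    is the stressed vowel; no syllabification is attempted here.
    """
    if not 0 <= vowel_index < len(word):
        raise IndexError("vowel_index is outside the word")
    marked = word[: vowel_index + 1] + COMBINING_ACUTE + word[vowel_index + 1 :]
    # NFC composes base letter + accent into one code point when one exists.
    return unicodedata.normalize("NFC", marked)


if __name__ == "__main__":
    print(mark_stress("zamok", 1))
    print(mark_stress("\u0437\u0430\u043c\u043e\u043a", 1))
```

For Latin letters the result is a single precomposed character. For Cyrillic vowels no precomposed form exists, so the combining mark stays a separate code point, which is one reason font support matters.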

How would you indicate stress in the French word "été"? --Infovarius (talk) 00:32, 1 January 2019 (UTC)
Excellent question. I'm not a French speaker. Is there a standard way to indicate stress (in textbooks for learners, for instance)? Tagging Harmonia and Nicolas. Asaf Bartov (talk) 15:42, 2 January 2019 (UTC)
I have no idea whether there is a standard way to indicate stress in French. I remember that my textbooks used bold on the emphasized syllable, but I'm not sure that is a consistent practice. I'll try to ask people who teach French as a second language what they use. But acute accents are not used to mark stressed syllables in French; they change the pronunciation of the letter (é is not è is not e). A word written with an acute accent is a different word from the same letters written without it. (And by the way, we only frequently use é, è, à and ù; other accented letters are not easy to type on French keyboards.) --Harmonia Amanda (talk) 15:52, 2 January 2019 (UTC)
yes, of course, I did not think acutes indicate stress in French. My point was only about their relatively better availability to French speakers, but your point is well taken about not having easy acutes for all vowels! Asaf Bartov (talk) 17:27, 2 January 2019 (UTC)
@Ijon: If a language always uses stress accents in writing then those should be in the lexeme lemma and forms. Otherwise the stress should be indicated with the IPA stress marker in the IPA transcription (P898) property (or for any other phonetic transcription properties we may have now or in future). ArthurPSmith (talk) 16:31, 2 January 2019 (UTC)
Thanks, the IPA stress marker is a good idea! I think it would not be sufficient, though. IPA is not easily read by mere mortals, and it seems to me Lexeme should help learners easily understand the stressed syllable without requiring them to figure out IPA. Asaf Bartov (talk) 17:27, 2 January 2019 (UTC)
No idea, I'm not an expert on pronunciation, so I defer to Pamputt, who could maybe tell us more about it. Cheers, VIGNERON (talk) 22:23, 2 January 2019 (UTC)
I completely agree with what Arthur wrote. If stress is part of the spelling, then it is indicated in the lemma and forms. If it's just pronunciation, then we indicate it within the IPA transcription (P898) property with the IPA stress marker. Indeed, IPA is not easily read by people, but it is the universal way to write pronunciation. Pamputt (talk) 06:17, 3 January 2019 (UTC)
Many dictionaries have special transcriptions to indicate details of pronunciation that don't follow from the spelling. These transcriptions are often more widespread than IPA. We should include such transcriptions as well. —Rua (mew) 11:36, 3 January 2019 (UTC)
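As a sketch of how the IPA stress marker suggested above could be turned into something easier for learners to read, the following Python snippet finds which syllable carries primary stress in an IPA string. It assumes syllables are delimited either by "." or by a stress mark placed before the stressed syllable; transcriptions stored in P898 may not always mark syllable boundaries this way. The function names are hypothetical.

```python
from typing import List, Optional

PRIMARY_STRESS = "\u02c8"    # IPA primary stress mark
SECONDARY_STRESS = "\u02cc"  # IPA secondary stress mark


def split_syllables(ipa: str) -> List[str]:
    """Split an IPA string on '.' and before each stress mark."""
    body = ipa.strip("/[]")  # drop surrounding slashes or brackets
    syllables: List[str] = []
    current = ""
    for ch in body:
        if ch == ".":
            if current:
                syllables.append(current)
            current = ""
        elif ch in (PRIMARY_STRESS, SECONDARY_STRESS):
            # A stress mark opens a new syllable and stays attached to it.
            if current:
                syllables.append(current)
            current = ch
        else:
            current += ch
    if current:
        syllables.append(current)
    return syllables


def primary_stress_index(ipa: str) -> Optional[int]:
    """Return the 0-based index of the primary-stressed syllable, or None."""
    for index, syllable in enumerate(split_syllables(ipa)):
        if syllable.startswith(PRIMARY_STRESS):
            return index
    return None


if __name__ == "__main__":
    print(primary_stress_index("/\u02cc\u026an.f\u0259\u02c8me\u026a.\u0283\u0259n/"))
```

A display layer could use the returned index to show the stressed syllable in bold, as some textbooks do, without asking readers to interpret IPA themselves.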

add new language

How can we add a new language? I would like to start adding some words in Blackfoot (ISO code 'bla'). Thanks. Amqui (talk) 00:49, 4 January 2019 (UTC)

Hi Amqui,
(If I'm not mistaken) you should make a request for a new language to the m:Language committee (according to and following Help:Monolingual text languages).
That said, meanwhile, you can still create Lexemes and use the private code mis-x-Q33060 (which can be a good way to prove to the LangCom that you have an actual use for this code).
Regards, VIGNERON (talk) 09:07, 4 January 2019 (UTC)
What Vigneron said, the process for now is the same as described on Help:Monolingual text languages.
FYI, this is included in the discussions we're having at the moment on Wikidata:Identify problems with adding new languages into Wikidata, feel free to participate. Lea Lacroix (WMDE) (talk) 13:32, 7 January 2019 (UTC)
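The private code mentioned above follows the pattern "mis-x-" plus the item ID of the language. A minimal Python sketch of that pattern, with a hypothetical helper name and a simple format check on the item ID (the check is an assumption of this example, not a rule stated on this page):

```python
import re


def private_language_code(language_item_id: str) -> str:
    """Build a 'mis-x-Q...' code from a Wikidata item ID such as 'Q33060'.

    Hypothetical helper: it only formats the string and does not check
    that the item exists or that it describes a language.
    """
    if not re.fullmatch(r"Q[1-9][0-9]*", language_item_id):
        raise ValueError(f"not a Wikidata item ID: {language_item_id!r}")
    return f"mis-x-{language_item_id}"


if __name__ == "__main__":
    print(private_language_code("Q33060"))
```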

Improving lexeme creation

The creation process for lexemes could be improved: the "Language of Lexeme" and "Lexical category" fields could rank related results higher. I type "adv" and get "Adventure film" as the first suggestion, instead of "adverb". "Language" could also accept language codes; then it would be consistent with the "Add sense gloss" interface. After some reading I found the "Wikidata Lexeme Forms" tool, which aims to simplify the creation process. Maybe this could be mentioned in Special:NewLexeme? – Jberkel (talk) 09:33, 4 January 2019 (UTC)

To add to this, the resetting of the language field on the creation page for each new lexeme is frustrating, especially when creating lexemes for languages that require the mis code and/or that require many letters typed before the correct suggestion appears. Prioritizing items for languages over other items would go some way, but I expect that the vast majority of lexemes that an individual user creates belong to, say, 1-3 different languages. --Njardarlogar (talk) 12:52, 4 January 2019 (UTC)
If you are doing more than just 1 or 2 lexemes in your language, the Lexeme Forms tool is a huge help. If there aren't any forms yet in your language, talk to Lucas Werkmeister about it. Besides setting the language and category from the start, it also checks for existing lexemes with the same string value in your language, which helps a lot in avoiding duplicates. ArthurPSmith (talk) 18:27, 4 January 2019 (UTC)

Report of word forms in need of pronunciation audio?

Is there a way to generate a report with all word forms that have no value yet for pronunciation audio (P443)? Ideally separated by language. This would be useful for Lingua Libre and similar recording tools. — Sascha (talk) 19:24, 11 January 2019 (UTC)

@Sascha: Yes, it is possible, I use it daily for Polish. It is described in Lingua Libre help, see KaMan (talk) 19:47, 11 January 2019 (UTC)
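One possible way to produce such a report is a SPARQL query against the Wikidata Query Service. The Python sketch below only builds the query text; it is not necessarily the method described in the Lingua Libre help. It assumes the prefixes `wd:`, `wdt:`, `dct:` and `ontolex:` are predefined by the query service, and the function name and the `limit` parameter are hypothetical. The language is passed as an item ID; Q809 is used here on the assumption that it is the item for Polish, which should be verified before use.

```python
import re


def forms_without_audio_query(language_item_id: str, limit: int = 100) -> str:
    """Return SPARQL text listing forms that have no P443 statement.

    Hypothetical helper; run the returned text in the query service.
    """
    if not re.fullmatch(r"Q[1-9][0-9]*", language_item_id):
        raise ValueError(f"not a Wikidata item ID: {language_item_id!r}")
    if limit < 1:
        raise ValueError("limit must be positive")
    return f"""
SELECT ?lexeme ?form ?representation WHERE {{
  ?lexeme dct:language wd:{language_item_id} ;
          ontolex:lexicalForm ?form .
  ?form ontolex:representation ?representation .
  # Keep only forms without pronunciation audio (P443).
  FILTER NOT EXISTS {{ ?form wdt:P443 ?audio . }}
}}
LIMIT {limit}
""".strip()


if __name__ == "__main__":
    # Q809 is assumed to be the item for Polish.
    print(forms_without_audio_query("Q809", limit=50))
```

Running the query once per language item gives the per-language separation asked for above.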

Indicating allomorphs

How should one indicate allomorphs? Should that be as a P31 for the lexemes or forms? Or should that be indicated as a relation between two forms with a bespoke property? I have set up three allomorphs for the Danish language, see Ordia: or Wikidata . I am, however, not sure that this is the best way to do it. — Finn Årup Nielsen (fnielsen) (talk) 14:36, 13 January 2019 (UTC)

Are you referring to language units that are spelled (and pronounced) differently, or spelled the same and only pronounced differently? If the former I think the clear approach with Lexemes would be to create multiple forms, one for each allomorph. That might be the right solution in the latter case also I guess, just have several forms spelled the same but with alternate pronunciations? ArthurPSmith (talk) 14:36, 14 January 2019 (UTC)
See also Wikidata:Property proposal/precedes word-initial. ArthurPSmith (talk) 14:37, 14 January 2019 (UTC)
Following a Danish text (Den nødvendige grammatik: En kort oversigt (Q54314757)), I am referring to the situation where the root changes, e.g., stor, større, største in stor (L34787). Can we say that stor (L34787) is an instance of allomorph (Q1124301)? Or is it "større" that is an allomorph of "stor", or the set (stor, større, størst) that is the allomorph? The same problem arises for suppletion (Q324982), e.g., should we say that go (L3006) is an instance of suppletion (Q324982)? — Finn Årup Nielsen (fnielsen) (talk) 16:07, 14 January 2019 (UTC)
The question is where to (or whether to) put the instance of (P31)? For go (L3006) (and suppletion (Q324982)) I guess it might be useful to put it on L3006-F3, but maybe that's a special case as there's only one form that's affected. I can't say I know anything about Danish, so I'm really not sure what to recommend for you. ArthurPSmith (talk) 17:05, 14 January 2019 (UTC)
Why should suppletion be on the individual form? It seems to me to be more appropriate at the lexeme level. I regard a suppletion as a lexeme that has been merged from two lexemes. Regarding allomorph, I have now removed the three instances where I put it (I think I applied it the wrong way). I am still not sure how we can specify allomorphisms. Possibly with a property? But between what? — Finn Årup Nielsen (fnielsen) (talk) 21:18, 14 January 2019 (UTC)