Wikidata talk:Lexicographical data

Place used to discuss any and all aspects of lexicographical data: the project itself, policy and proposals, individual lexicographical items, technical issues, etc.
On this page, old discussions are archived. An overview of all archives can be found at this page's archive index. The current archive is located at 2019/05.


Participles, gerunds and verbal nouns, verb forms or not?

Words like these are tricky when it comes to lexical category. Traditionally we'd call them "verb", but they can be used in ways that are quite un-verblike. Participles regularly function as adjectives, and often inflect like adjectives in languages where that is a thing. Gerunds and verbal nouns are nounlike, can inflect like nouns in some languages, and can also have grammatical gender. All of this seems hard to model using just "forms".

A particularly interesting case is that of Irish (and the Goidelic languages in general). Every Irish verb has a verbal noun, which can take objects in certain cases and is also used in special verbal constructs where a regular noun would not fit. But at the same time, it can inflect for case and number like a noun. What's more, not all verbal nouns are the same: they are a principal part of the inflection paradigm and can't be predicted from the rest of the forms, because they are formed through a variety of means. They also have varying genders: some are masculine while others are feminine. To top it all off, many of them have additional senses that are independent of the verb. Look at wikt:obair for example. The noun has many senses, of which the verbal noun can be considered just one.

In Wikidata, such cases could be modelled by considering the verbal noun, and all of its forms, as part of the verbal inflection pattern. But then that leaves no room for the senses that do not belong to the verb. If we include the verbal noun's forms as verb forms, but then also include the noun as its own lemma, then we end up duplicating the forms: the same form would be included both as a verb form and as a noun form, when in reality these are not two different forms, they are forms of one and the same noun. The grammatical gender of the verbal noun is even trickier, we do not seem to have any way to indicate it in a set of verbal forms at all. Tagging every form of the verbal noun with "masculine" or "feminine" seems silly, since these forms are not gendered in the way an adjective might be, with forms for every possible gender. Instead, a verb would have either masculine verbal noun forms, or feminine ones, but not both, because the gender is inherent in the noun.

While the case of Irish is perhaps a more extreme one, similar situations crop up in other languages too. Participles can have senses independent from the verb too, and can even serve as the basis for deadjectival verbs or nouns. In Dutch, the gerund is identical to the infinitive, but has a gender. In German, it's identical in form, but is written with a capital letter like all nouns, has a gender, and also case forms. All of this makes it very unclear how such forms should be treated, and it's something I've been struggling with on Wiktionary as well. Maybe someone has some great insights? —Rua (mew) 11:39, 25 April 2019 (UTC)

A separate lexeme for the "verbal noun" maybe? With "derived from" property to look to the original verb? ArthurPSmith (talk) 23:36, 25 April 2019 (UTC)
There would have to be a counterpart "has verbal noun" property on the verb then, so that it is clear which verbal nouns a verb has. I say that in plural, because as Mahagaja pointed out to me today, Irish verbs can actually have multiple verbal nouns. "Derived from" would not be suitable for Irish verbal nouns, because some of them have existed as nouns since Proto-Celtic or even Proto-Indo-European times, and cannot be said to derive from the verb within Irish at all.
What is your take on the other cases I mentioned, such as the gerund in German? To take wissen (L2058) as an example, it also has a capitalised form Wissen, which is the gerund, has neuter gender, and also case forms. Given that the capitalisation is significant, we must include it separately from the infinitive of the verb, which is lowercase. Would this, too, be better considered a noun of its own? Keep in mind that every German verb automatically has such a capitalised gerund, it's always identical to the infinitive and it's always neuter. So if we make them separate lexemes, we'd have to have one noun to pair with each verb. And some again have additional senses distinct from the verb, like Essen, which is not only the gerund of essen "to eat", but also means "food". —Rua (mew) 14:55, 26 April 2019 (UTC)
Hmm, I guess I'd recommend a separate lexeme only when necessary (for example the case of additional senses). Is the gender of German gerunds always the same? But maybe it wouldn't hurt to have a separate lexeme for each verb - verbs seem to be much less numerous than nouns in general so it wouldn't add a lot of overhead. ArthurPSmith (talk) 15:52, 26 April 2019 (UTC)
I suppose all these features show that these are separate parts of speech. It may be a simplification, but a useful one. --Infovarius (talk) 14:53, 1 May 2019 (UTC)
  • In French, I avoided creating them. However, there are some combined forms that don't exist as verbs. --- Jura 13:57, 16 May 2019 (UTC)
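
To make the separate-lexeme proposal from this thread concrete, here is a minimal Python sketch. It is not Wikidata's actual data model: the `Lexeme` class, the `has_verbal_noun` link (standing in for the counterpart property suggested above, which is only a proposal) and the IDs `L_VERB` and `L_NOUN` are all hypothetical. It only illustrates that gender and extra senses can sit on the noun while the verb merely points to it, so no form is listed twice.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Lexeme:
    """Simplified stand-in for a lexeme; not the real Wikibase structure."""
    lexeme_id: str                 # placeholder ID, not a real L-number
    lemma: str
    category: str                  # "verb" or "noun"
    gender: Optional[str] = None   # inherent gender, set on nouns only
    forms: list = field(default_factory=list)
    senses: list = field(default_factory=list)
    # Hypothetical "has verbal noun" link: IDs of separate noun lexemes.
    has_verbal_noun: list = field(default_factory=list)


def verbal_nouns(verb, index):
    """Look up the noun lexemes that a verb points to."""
    return [index[lexeme_id] for lexeme_id in verb.has_verbal_noun]


# German example from the thread: essen "to eat" and the noun Essen.
essen_verb = Lexeme("L_VERB", "essen", "verb", forms=["essen"],
                    senses=["to eat"], has_verbal_noun=["L_NOUN"])
essen_noun = Lexeme("L_NOUN", "Essen", "noun", gender="neuter",
                    forms=["Essen"],
                    senses=["act of eating (gerund)", "food"])

index = {lex.lexeme_id: lex for lex in (essen_verb, essen_noun)}

for noun in verbal_nouns(essen_verb, index):
    print(noun.lemma, noun.gender, noun.senses)
```

Because the link is a list, a verb with several verbal nouns (as mentioned for Irish) would simply carry several IDs.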

How to add voice type of a pronunciation file without error?

I added a few pronunciation audio file links to a lexeme (Lexeme:L43699). Male and female voices are available for the lexeme. I am facing an issue when marking the female voice. How can I avoid that error? Kindly see the bottom of the page. --Info-farmer (talk) 03:15, 1 May 2019 (UTC)

I do not know what the best way is. I made this change. I think voice type (P412) is good as well. I think we should agree on one way to do it and write it down somewhere so that everyone can do the same in the future. Pamputt (talk) 08:02, 1 May 2019 (UTC)
I will follow your guidance in all the Tamil lexemes. One more question: may I add the place where the audio file was recorded? Like this. Tamil people live all around the world, and Tamil is an official language in a few countries, so pronunciation differs among Tamil speakers. --Info-farmer (talk) 03:21, 4 May 2019 (UTC)
I do not see any problem with doing so. Pamputt (talk) 17:07, 4 May 2019 (UTC)
  • Maybe something to do directly at Commons. --- Jura 13:55, 16 May 2019 (UTC)
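
As an illustration of what "one agreed way" could look like, here is a small Python sketch that attaches the voice and the recording place as qualifiers on a pronunciation audio (P443) statement. The dictionary layout is deliberately simplified and is not the real Wikibase JSON. The file name, the value `Q_FEMALE_VOICE`, and the place property `P_PLACE` with its value `Q_PLACE` are placeholders, since the thread does not name them.

```python
def audio_statement(filename, voice_type=None, extra_qualifiers=None):
    """Build a simplified statement for pronunciation audio (P443)."""
    qualifiers = {}
    if voice_type is not None:
        # voice type (P412), one of the options mentioned in the thread
        qualifiers["P412"] = [voice_type]
    for prop, value in (extra_qualifiers or {}).items():
        qualifiers.setdefault(prop, []).append(value)
    return {"property": "P443", "value": filename, "qualifiers": qualifiers}


# Placeholder values only; none of these IDs or file names are real.
statement = audio_statement(
    "Example-female-voice.ogg",
    voice_type="Q_FEMALE_VOICE",
    extra_qualifiers={"P_PLACE": "Q_PLACE"},
)
print(statement)
```

Keeping the voice and the place as qualifiers on each audio statement means a lexeme can carry several recordings without the details of one leaking onto another.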

Place of death (P20)

It asks for the date of death, but why? For figures from mythology the place is often well attested, but the dates of their lives are unknown. --Bahnmoeller (talk) 17:25, 5 May 2019 (UTC)

Who is asking for the place of death? You are on the page for lexicographical data here. A dictionary entry such as Schlamassel (L12605) is not meant to carry biographical details. Regards --Kolja21 (talk) 19:14, 5 May 2019 (UTC)
@Bahnmoeller: +1, I think you are on the wrong page. VIGNERON (talk) 12:11, 7 May 2019 (UTC)
Where would the right one be, then? de:Hippokorystes (Sohn des Aigyptos) Q15113335, together with all 49 of his brothers, most certainly has a place of death (Argos), but there are no life dates. --Bahnmoeller (talk) 12:18, 7 May 2019 (UTC)
@Bahnmoeller: there are a lot of places where you can ask, like Wikidata:Forum (in German) or place of death (P20). But now I understand your question: the constraint is not mandatory, just indicative here (because people who have a place of death (P20) should *usually but not always* have a date of death (P570)). Cheers, VIGNERON (talk) 13:07, 7 May 2019 (UTC)
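
The difference between an indicative and a mandatory constraint can be sketched in a few lines of Python. This is not how Wikidata implements constraints; the function name and the flat `claims` dictionary are assumptions made for the example. The point is only that the check produces a warning that an editor may ignore, rather than rejecting the data.

```python
def place_of_death_warnings(claims):
    """Return warnings (never errors) for P20 without P570."""
    if "P20" in claims and "P570" not in claims:
        return ["has place of death (P20) but no date of death (P570)"]
    return []


# Simplified claims for a mythological figure: place known, dates unknown.
mythological_figure = {"P20": ["Argos"]}
print(place_of_death_warnings(mythological_figure))
```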

What tools do you use for automated edits?

Hello all,

Thanks again to all the people who regularly edit and create Lexemes, Forms and Senses \o/

It's been almost one year since the first step for lexicographical data was deployed on Wikidata. We currently have more than 45K Lexemes in 335 languages. We can also note that we have only 6000 Senses so far, and still a lot of Lexemes without Forms.

We saw some great tools, queries, and the first reuse cases of the data emerge. The API can be considered stable, and the first tools have successfully tested the software and the data model, so we can now encourage people who want to build tools on top of the API to do so if they need to. Of course, I would still advise you to be careful when it comes to mass imports (make sure that the original data is public domain or CC0, and work closely with the organisations or communities who gathered the data in the first place).

As for people who already started doing semi-automated or automated edits, I'm curious to know more about them: what tools do you use? What scripts are you relying on, what languages?

Thanks, Lea Lacroix (WMDE) (talk) 09:19, 6 May 2019 (UTC)

I've been using Lukas' Lexeme Forms tools to add English words - my list of words comes from various word lists I've collected. When I add senses I usually pull them from the descriptions of associated Wikidata items - sometimes shortening them; if there's nothing associated I've been writing my own. ArthurPSmith (talk) 13:27, 6 May 2019 (UTC)
There are not a lot of tools right now. I use Ordia (by Fnielsen) a lot and I used QuickStatements once to add grammatical gender (P5185) to some French lexemes (Special:Diff/777318266, based on the endings and after careful review as French is a language full of exceptions). I would love to use QuickStatements more (AFAIK it is currently limited to editing existing lexemes and adding statements at the Lexeme level; I would love to add forms and, even better, to create Lexemes, but the latter would be a bit pointless without the former). Cheers, VIGNERON (talk) 12:09, 7 May 2019 (UTC)
  • So can we start using tools? I think @Lydia Pintscher (WMDE): was initially reluctant. Maybe this explains why L-growth stopped. --- Jura 13:54, 16 May 2019 (UTC)
    • Yes. Especially at the beginning I wanted us to move a bit slower to make sure we all get used to how Lexemes work, how the system behaves, can create processes and so on. I think we're in a good place by now to expand. --Lydia Pintscher (WMDE) (talk) 15:09, 17 May 2019 (UTC)
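
The ending-based workflow described above (guess a gender from the lemma ending, review by hand, then batch-add grammatical gender (P5185)) could be sketched as follows. This is a hedged illustration, not the script that was actually used: the lexeme IDs, the gender item `Q_FEMININE` and the single ending rule are placeholders, and the tab-separated subject, property, value line only follows the general style of QuickStatements commands, so the exact syntax should be checked against the QuickStatements help page before use.

```python
def gender_candidates(lemmas, ending_to_gender):
    """Yield (lexeme_id, lemma, gender_item) guesses based on lemma endings."""
    for lexeme_id, lemma in lemmas.items():
        for ending, gender_item in ending_to_gender.items():
            if lemma.endswith(ending):
                yield lexeme_id, lemma, gender_item
                break


def to_statement_line(lexeme_id, gender_item):
    """Format one tab-separated line: subject, property, value."""
    return "\t".join([lexeme_id, "P5185", gender_item])


# Placeholder IDs; the rule "-tion is feminine" is a heuristic with exceptions.
lemmas = {"L_EXAMPLE_1": "nation", "L_EXAMPLE_2": "bastion", "L_EXAMPLE_3": "chat"}
rules = {"tion": "Q_FEMININE"}

candidates = list(gender_candidates(lemmas, rules))

# Manual review step: "bastion" matches the ending but is masculine in French.
rejected_after_review = {"L_EXAMPLE_2"}
lines = [to_statement_line(lexeme_id, gender_item)
         for lexeme_id, _lemma, gender_item in candidates
         if lexeme_id not in rejected_after_review]

print("\n".join(lines))
```

Separating the guessing step from the export step keeps the human review in the middle, which matters for a language full of exceptions.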

How to model preposition after a verb?

Hello all,

I'm currently thinking about working on another German-learning game (after Der Die Das) and I'd like to focus on verbs and the prepositions coming after them - for example, sich kümmern is always followed by um, achten auf, etc. As a German learner I can say it's a real pain to learn by heart, therefore a good reason to make a game out of it :D

I noticed that we didn't really start modelling these prepositions, so before developing the game, I'd have quite some work filling up the data. I'm not even sure how I should add the preposition. Do you have any suggestion?

Also, what to do in the various specific cases, when the verb can have several prepositions?

  • sprechen an (+ recipient), sprechen über (+ topic)
  • In some cases, a different preposition changes the meaning of the verb, eg. sich freuen über (be happy about) and sich freuen auf (look forward to). Should it be connected to two Senses in the same Lexeme, or two different Lexemes?

Thanks in advance for your help :) Léa/Auregann (talk) 10:30, 8 May 2019 (UTC)

As a variation on requires grammatical feature (P5713), this probably requires a new property to document (lexicographical data in general is underserved by properties). "Constructed with" may be a reasonable name as it encompasses all adpositions, relativizers and even affixes where relevant. Circeus (talk) 16:16, 8 May 2019 (UTC)
@Auregann: Prepositions should be added as their own lexemes - for (L2989), über (L6736) with lexical category preposition (Q4833830). If a verb + preposition combination has a special meaning I guess it could have its own lexeme, just like any phrase. But the case where verbs only take one or a limited number of prepositions does also need some way to indicate it, perhaps with a new property as Circeus suggests... ArthurPSmith (talk) 17:56, 8 May 2019 (UTC)
I think they could have their own lexeme. In Den Danske Ordbog (Q1186741), words with an extra word (adverb usually) are listed on the same page but below. For instance, komme [1], corresponding to komme (L3065), has a series of adverbs/prepositions (BTW how do we know the word class?): komme af med, komme an, komme an på, komme bag på, etc. In DanNet (Q26932303), these constructs have their own entry (lexeme) dn:word-11027042-69 for "komme sig", dn:11027042-61 for "komme over", dn:11027042-32 "komme bag på", etc. I suppose there is a need for a new set of properties to model the relationship between the main lexeme (e.g. the verb) and its "sublexemes" (the verb and its — Finn Årup Nielsen (fnielsen) (talk) 07:30, 9 May 2019 (UTC)
With a few exceptions (e.g. the English particle verbs), I would not consider the verb+preposition a lexeme of its own. Heck, I don't consider that reflexive/pronominal verbs are separate lexemes! Circeus (talk) 13:44, 9 May 2019 (UTC)
The issue is that the verb+something may have different meaning, e.g., komme (L3065) means come in its basic sense, but "komme sig" means recover, "kommer over" means (approximately) manage/cope (but could also mean come over!), "komme bag på" means surprise and so on. I suppose my issue is slightly different than the issue started by Auregann: sprechen an and sprechen über is still talk (to/about). I suppose Auregann's issue is more about how do we specify grammar/function? — Finn Årup Nielsen (fnielsen) (talk) 14:57, 9 May 2019 (UTC)
I see that Auregann's second point also poses the question about a change of meaning with an added word. — Finn Årup Nielsen (fnielsen) (talk) 15:00, 9 May 2019 (UTC)
@Circeus: What is your stance on verb+adverb combinations, such as die off? I ask because these are written as one word in Dutch and German, but only in some of the inflections e.g. Dutch afsterven "to die off" but het sterft af "it dies off". English Wiktionary treats all of these as separate lemmas: wikt:die off, wikt:afsterven. Reflexive verbs are not given their own entries, however, but are instead given as plain verbs with a label saying "reflexive", like in wikt:vergissen. —Rua (mew) 13:09, 16 May 2019 (UTC)
I did say I consider particle verbs to be a single word. Actual full-on multi-word expressions are where things will get really weird with lexemes, e.g. wikt:kick the bucket, wikt:parler français comme une vache espagnole... Circeus (talk) 16:20, 16 May 2019 (UTC)
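
One of the options discussed in this thread, keeping a single verb lexeme and attaching the preposition to each Sense, can be sketched in Python as below. The `constructed_with` field stands in for the property name floated above, which is only a suggestion and does not exist; the `Sense` and `Verb` classes and the `quiz_items` helper are likewise hypothetical. The data comes from the sich freuen example in the original question.

```python
from dataclasses import dataclass, field


@dataclass
class Sense:
    gloss: str
    # Hypothetical "constructed with" link; plain lemmas instead of lexeme IDs.
    constructed_with: list = field(default_factory=list)


@dataclass
class Verb:
    lemma: str
    senses: list


def quiz_items(verb):
    """Flatten a verb into (lemma, preposition, gloss) triples for a game."""
    return [(verb.lemma, preposition, sense.gloss)
            for sense in verb.senses
            for preposition in sense.constructed_with]


sich_freuen = Verb("sich freuen", [
    Sense("be happy about", ["über"]),
    Sense("look forward to", ["auf"]),
])

for item in quiz_items(sich_freuen):
    print(item)
```

If the verb+preposition combinations were modelled as separate lexemes instead, the same helper would work with one `Verb` per combination, each with a single sense.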

Lexemes and gender of noun

Hi,

Do we have any clear model for how to deal with gender on Lexemes?

The situation is a bit complex as:

  • sometimes gender doesn't change the meaning much (Lehrer (L34167) and Lehrerin (L34168) are equivalent)
  • sometimes gender does change the meaning a bit (queen (L1380) can be either a female ruler or a relative of the ruler, spouse/widow/mother, which is at least 2 separate senses)
  • sometimes gender entirely changes the meaning (rare, but it appears at least in French where "chauffeuse"@fr can be either a female driver or a chair while "chauffeur"@fr is only a driver; "queen"@en in chess is close to this case too)

At least in the last case, two lexemes are absolutely needed, and I see that in the first two cases the two-lexeme solution is also often applied (there are 64 lexemes with a main lemma ending in "-rin" in German, for instance). I see pros and cons in this solution but I think we should go this way. What do you think? Do you have other meaningful examples?

Cdlt, VIGNERON (talk) 16:16, 12 May 2019 (UTC)

I suppose if declension (forms) and grammatical categories are the same for different meanings, they can fit in one Lexeme. That's what multiple Senses in one Lexeme were allowed for. Or do you want to separate each Sense into a different Lexeme? --Infovarius (talk) 10:07, 13 May 2019 (UTC)
I think when there is a separate word form for a different gender (Lehrer (L34167) and Lehrerin (L34168), or actor (L7011) and actress (L7012)) they should be separate lexemes as we have done up to now, I don't see a good reason to merge them into one lexeme. While the formation ("-rin" in German or "-ess" in English) may be somewhat common, it's not much different from other derived words that are different parts of speech (in English '-er' on verbs to create a noun, 'ly' on adjectives to create an adverb, etc.) and those are certainly different lexemes. ArthurPSmith (talk) 12:38, 13 May 2019 (UTC)
Agreed with ArthurPSmith Pamputt (talk) 13:42, 13 May 2019 (UTC)
I agree about Lehrer (L34167) and Lehrerin (L34168). I was talking about "chauffeuse"@fr in its 2 meanings, queen (L1380) in its 2 meanings, and other homonyms. --Infovarius (talk) 08:54, 15 May 2019 (UTC)
Lehrer (L34167) and Lehrerin (L34168) are not equivalent because it's not just the grammatical gender that differs, they refer to different things. One refers to a person of male natural gender, the other to a person of female natural gender. For the purpose of meaning these are two separate concepts, they are not synonymous and not interchangeable. —Rua (mew) 13:04, 16 May 2019 (UTC)
  • The question seems rather confusing. Maybe it's because too many languages are mixed together. --- Jura 13:52, 16 May 2019 (UTC)

Lists for Lexeme: Swadesh list(s); CEFRL levels

Hi all. I've been thinking for quite some time about a couple of cases where being able to specify that a specific lexeme is part of some list would be quite useful. But I could not figure it out for myself.

So, the first case is to link a (Sense, I guess?) to a specific position in one of the Swadesh lists (I've heard there are at least two of those). That would allow building automated extractions and comparisons. Admittedly, the usefulness of this is somewhat limited, because those lists are quite short and should be easy enough to retrieve for any language, I think. I am however very uncertain about the "position in the list": do we have such a property? Or will it require an identifier?

The second use case would be to be able to specify that a given Lexeme belongs to a certain "required dictionary" for a given CEFRL (Q221385) level for some language; for example, I am aware of such lists for Estonian and would be willing to contribute this information to Wikidata. But again, I do not have a really solid understanding of the data model implications. Probably a statement "member of": "CEFRL A2 (Estonian)" on a lexeme would do, where "CEFRL A2 (Estonian)" is an instance of "Q6499736", or .. ?

Has any discussion ever occurred on any of these subjects, and can anyone suggest a way to go about those?

Thanks! 62mkv (talk) 21:58, 16 May 2019 (UTC)
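
Both use cases in this question come down to a membership statement with an optional position. The Python sketch below shows that shape under loud assumptions: the `ListMembership` class, the list identifiers and the target IDs are all made up for the example, and it does not answer which Wikidata properties should carry the data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ListMembership:
    target: str               # placeholder lexeme or sense ID
    list_id: str              # placeholder for the item that represents the list
    position: Optional[int]   # rank in an ordered list, None if unordered


def members_of(memberships, list_id):
    """Return targets in a list, ordered by position (unordered ones last)."""
    selected = [m for m in memberships if m.list_id == list_id]
    selected.sort(key=lambda m: (m.position is None, m.position or 0))
    return [m.target for m in selected]


memberships = [
    ListMembership("SENSE_B", "SWADESH_LIST", 2),
    ListMembership("SENSE_A", "SWADESH_LIST", 1),
    ListMembership("LEXEME_X", "CEFRL_A2_ESTONIAN", None),
]

print(members_of(memberships, "SWADESH_LIST"))
print(members_of(memberships, "CEFRL_A2_ESTONIAN"))
```

An ordered list such as a Swadesh list needs the position, while a level-based vocabulary list only needs the membership itself.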

pronunciation audio (P443)

pronunciation audio (P443) has a low number, and at the time it was created, using it on normal items was a good idea. It seems to me that it would be better if it were now only used on lexemes (or maybe as a qualifier for name (P2561)). ChristianKl❫ 15:20, 17 May 2019 (UTC)
