Wikidata talk:Lexicographical data/Archive/2018/03

This page is an archive. Please do not modify it. Use the current page, even to continue an old discussion.

Lexemes categories and grammatical features vocabulary

The Lexemes data model is introducing two tag systems: one for lexemes, called lexical category, and one for forms, called grammatical features. One source of inspiration that looks very relevant to me is the universal dependencies tag set, which is now a widely used standard in natural language processing. It aims to be highly multilingual and provides a lexeme category set and grammatical features. Tpt (talk) 21:06, 26 February 2018 (UTC)

Thanks Tpt for this link, here is a first attempt at alignment with the items we have:
Items need some checking and clean up (and references ideally). It would be good to compare to the categories on the Wiktionary too. @Noé: can you take a look? Cdlt, VIGNERON (talk) 10:50, 27 February 2018 (UTC)
If you want to expand the list, you should have a look at wikt:fr:Projet:Informations grammaticales. noun (Q1084) is a good example of the dichotomy between a technical description made for lexicographers and a popular designation made for readers. In French Wiktionary, we chose nom commun instead of substantif because we thought the latter was less popular. Do you think it would be possible to display both and to have an option to display one or the other? I know Wikidata is supposed to be for machines more than for humans but at some point, some people may have trouble using it if only technical terms are used. -- Noé (talk) 11:13, 27 February 2018 (UTC)
@Noé: thanks, I'll look at that page. The idea is not only to expand the list but to confirm it is correct (see where I've put question marks) and that it fits most (if not all) cases. For nom commun and substantif, Wikidata will only display the current label (there is only one label, but it can be changed if needed, I'll start a discussion about this; plus there are aliases, a bit like redirects, so a user can type nom commun or substantif and they'll find noun (Q1084) just the same), but every re-user can customize it as they want. Cdlt, VIGNERON (talk) 11:40, 27 February 2018 (UTC)
In ru-wiktionary we also have the following parts of speech: predicative (Q1931259), participle (Q814722) and transgressive (Q904896). English can also have gerund (Q1923028). --Infovarius (talk) 13:13, 27 February 2018 (UTC)
Reading the page given by Noé and the items given by Infovarius, I have the impression all cases are subclasses of the classes given by universal dependencies, no? If so, we need to work on the ontology of the items and improve them (inflected form (Q4423888) is quite empty and has no label in English, if I understand correctly this is the result of the action of inflection (Q207857) but I don't speak Russian). And for the structure of our lexeme, I wonder when we need more precise values for lexical category and when the data should be stored in other properties (an obvious example, we don't need a class "lexical category : masculine noun", "lexical category : noun" and "gender : masculine" is better ; but it's probably not always true). Cdlt, VIGNERON (talk) 20:53, 27 February 2018 (UTC)
@Tpt, Lydia Pintscher (WMDE): technical curiosity: why is it a tag and not a direct property? Will we be able to put references or to check constraints violations? will/could/should the list of possible values be closed to a limited set? (it's not on the test project where I was able to put Lexical Category = German :/ ) Cdlt, VIGNERON (talk) 11:46, 27 February 2018 (UTC)
Hey! Thanks for your interesting questions :)
  • Language and lexical category are mandatory to create a Lexeme, and unique (only one of them per Lexeme). That's why they get this special format.
  • It will be possible to apply constraint checks on it (with a system a bit different from the existing constraints that are based on properties)
  • It won't be possible to have references on the language and lexical category. If you need to have one, you will have to create a statement (you can see that as we currently do with the label and the property "official name")
  • We will follow our current philosophy: don't block the user from doing anything, but suggest the relevant values to them :)
Lea Lacroix (WMDE) (talk) 15:49, 27 February 2018 (UTC)
Thanks Lea Lacroix (WMDE), curiosity satisfied! (more than satisfied in fact ;) ) Cdlt, VIGNERON (talk) 20:11, 27 February 2018 (UTC)
Sorry, Lea Lacroix? Lexical category should be unique for a Lexeme? So en:get will be split along several Lexemes? --Infovarius (talk) 07:40, 1 March 2018 (UTC)
@Infovarius: that is what I have understood since the beginning and I think this is what has been planned all along, see mw:Extension:WikibaseLexeme/Data Model; the lexical category is unique and attached to the page, which is a lexeme (not to lexeme subparts of the entry that would be a lemma). Is splitting "get" a problem? And if so, why? Cdlt, VIGNERON (talk) 08:25, 1 March 2018 (UTC)
Personally I feel that this can be a problem, as a part of speech is sometimes a vague category and it is difficult to say whether a lexeme should be classified into 1 or 2 parts of speech. I can't give you an exact example; one candidate could be ru:хорошо. Maybe there are not many such examples though. --Infovarius (talk) 22:05, 2 March 2018 (UTC)
@Infovarius: I don't think there are many cases like that and they can easily be solved by creating two lexemes (a bit like the "Bonnie and Clyde problem" for Q items), linking them together and attributing the sources for each. Noé gave an interesting example: "voilà"@fr, which is sometimes seen as a verb and sometimes as an interjection (same thing for other interjections, the lexical categories are not mutually exclusive). Cdlt, VIGNERON (talk) 13:10, 9 March 2018 (UTC)
Another example, in English: worth is an adjective or a preposition. And for a lot of South American languages, stative verbs may also be described as adjectives in other sources. It is not a rare exception, there are plenty of words with more than one lexical category and the data model has to offer a way to deal with this problem. Noé (talk) 13:27, 9 March 2018 (UTC)
@Noé: isn't the way I proposed a good solution? If not, there are other ways, for instance choosing one preferred lexical category and putting the multiple points of view as values in a property with references and qualifiers. Anyway, Wikidata deals every day with a lot of contradictory data (and also with false or obsolete data); as long as the contradiction is indicated, there is no problem (at least, I don't see any, where do you see a problem?). Cdlt, VIGNERON (talk) 14:12, 9 March 2018 (UTC)
Creating two entries is a weird idea, because it is still the same word, and it may then appear twice in reuses. Having « one preferential lexical category » is not very neutral and may be a source of conflicts between editors. Noé (talk) 14:43, 9 March 2018 (UTC)
See also Psychoslave's point of view on this matter
For the preferential solution, I agree and that's why I didn't suggest it first (but it will probably be used in other cases, for instance when all sources have the same data except one). Having two entries is quite common on Wikidata. True, you need to be careful when you reuse it, but as always, Wikidata stores and provides data, and it's up to the re-users to request what they want (plus, there are some failsafes: with ranks, some data don't show up by default, only when they are explicitly requested). Cdlt, VIGNERON (talk) 14:56, 9 March 2018 (UTC)

One L-item for "tour"@fr: bug or feature?

Hi @Denny, Lydia Pintscher (WMDE), Lea Lacroix (WMDE), Infovarius, Jberkel:

Thinking again about L-items, I stumbled upon a weird example: fr:wikt:tour, a word for 3 different nouns (so same lexical category) in the same language; this is kind of exceptional but not that unusual (see also fr:wikt:manche, en:wikt:desert). If I understand the data model correctly, there will be only one item for these 3 different nouns. So in this case, I guess there will be no general statements for the lexeme and only statements for the forms (as almost all data are specific to a form: different gender, different sound, different meaning, etc.).

Is my understanding correct? After some thinking, I don't really see any real problem, just maybe a small side effect: this means we can't put a constraint « "all lexemes" should have a "gender/pronunciation" property » as there are exceptions like this one (and it complicates my search for #Core properties).

Cdlt, VIGNERON (talk) 12:16, 7 March 2018 (UTC)

Thanks for this relevant question!
The system will allow you to create several Lexemes that have the same lemma in the same language and the same lexical category. Lemma, language and lexical category are three things that are mandatory to create a lexeme, but there is no technical restriction about having the same triple of lemma/language/lexical category for different lexemes.
Then, you could decide to model it in at least two different ways:
  • Having several different lexemes, all with the lemma "tour", the language "French" and the category "noun"
  • Having only one Lexeme and describe further in Forms and Senses
I hope that answers your question :) Lea Lacroix (WMDE) (talk) 13:14, 7 March 2018 (UTC)
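To make the rule above concrete, here is a minimal sketch in Python. It is not the WikibaseLexeme implementation: the `Lexeme` class, its field names and the ID allocator are hypothetical, chosen only to illustrate that lemma, language and lexical category are all mandatory while the same triple may be shared by several lexemes.

```python
from dataclasses import dataclass, field
from itertools import count

_next_id = count(1)  # hypothetical ID allocator, for illustration only


@dataclass
class Lexeme:
    # Lemma, language and lexical category are required to create a lexeme.
    lemma: str
    language: str
    lexical_category: str
    forms: list = field(default_factory=list)
    id: str = field(default_factory=lambda: f"L{next(_next_id)}")

    def __post_init__(self):
        # Reject empty values for any of the three mandatory fields.
        for name in ("lemma", "language", "lexical_category"):
            if not getattr(self, name):
                raise ValueError(f"{name} is mandatory")

    def triple(self):
        return (self.lemma, self.language, self.lexical_category)


# First modelling option: several lexemes sharing lemma, language and category.
tour_feminine = Lexeme("tour", "French", "noun")
tour_masculine = Lexeme("tour", "French", "noun")

# The triples are identical, yet the two lexemes stay distinct entities.
same_triple = tour_feminine.triple() == tour_masculine.triple()
distinct_ids = tour_feminine.id != tour_masculine.id
```

The second option (one Lexeme, with the differences described in Forms and Senses) would use a single instance and put the distinctions in `forms`.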
Oh my bad, I misunderstood the model and I thought the triplet label/lang/lexical category was unique (as in label/lang/description is unique for Q items, one and only one item can have the same combination).
If it's possible then I would say that having separate lexemes would be far better and clearer. Other interested people: what do you think about it?
Cdlt, VIGNERON (talk) 16:53, 7 March 2018 (UTC)
I agree with you. Having separate lexemes would be much better. Otherwise it would be very painful to model that "Construction élevée" ("tall structure") is a sense of the feminine noun and "Outil du potier" ("potter's tool") a sense of the masculine noun. Tpt (talk) 17:13, 7 March 2018 (UTC)
It will be interesting to see how several language "edge cases" can be modeled with Wikidata. Nouns with different plural forms, multiple grammatical genders, regional variants etc. A lot of this is covered on Wiktionary with usage notes. – Jberkel (talk) 12:31, 12 March 2018 (UTC)
@Jberkel: very interesting question. Besides what can be modeled, there is the question of how to model it. Multiple forms don't seem to be a problem; on the test project, I've created ki (L554) ('dog' in Breton) with two plurals: chas (L555) (common plural, created by suppletion) and kon (L556) ('regular' but unused plural). We need some qualifiers to be more precise but nothing impossible (and maybe some improvements, I'll open a new section below). The same goes for other inflections: we need precise and documented structures, but it seems quite easy to me. Regional variations, on the other hand, seem quite tricky, in particular deciding where variation within the same language ends and where dialectal variation (which can be seen as another language) begins: is X a word of French in the West of France or is X a word of West-French? I guess most of the time it's a bit of both, and here I don't have a clear idea of how to model that. If you have specific edge cases, please share them here so we can see how to deal with them. Cdlt, VIGNERON (talk) 13:58, 12 March 2018 (UTC)

First version of Lexicographical Data will be released in April

You can read the full announcement here :)

If you're involved in a Wiktionary community, it would be very helpful for us if you help spreading the word there in your language. I'm pinging the current ambassadors: @Helmoony, ԱշոտՏՆՂ, Aabdullayev851, Liuxinyu970226, geertivp, BD2412: @Pamputt, Gallaecio, Bigbossfarin, Nizil Shah, Shavtay, Epantaleo: @Jberkel, Satdeep gill, DonRumata, ToJack, HakanIST, викичи:

Thank you very much! Lea Lacroix (WMDE) (talk) 16:42, 7 March 2018 (UTC)

Exciting! Meanwhile, I'm still thinking about cases that would be good for testing the system and about how to structure data in it; I'll keep posting my reflections here. Cdlt, VIGNERON (talk) 17:13, 7 March 2018 (UTC)
  Done wikt-de Bigbossfarin (talk) 18:41, 7 March 2018 (UTC)
+1 --Liuxinyu970226 (talk) 00:19, 8 March 2018 (UTC)
  Done for trwiktionary thanks to User:Basak.-- Hakan·IST 12:02, 16 March 2018 (UTC)

Enriching Lexicographic data from Wikidata

As you already know all Wikidata entities (Q) include labels featuring their names in all available supported languages. Here, I have a question: Why don't we automatically add each distinct label as a lexeme (L) and consider its corresponding Wikidata entities (Q) as senses (S)? Consequently, we will have millions of new lexemes within several hours. However, what do you think of that? --Csisc (talk) 12:57, 10 March 2018 (UTC)

@Csisc: I think this is a terrible idea: both in the sense « very good » and « very bad » (en:wikt:terrible, this word alone is a good example of why we must be careful when it comes to words). Sure we could use the labels of the existing Q items; and it would be a shame not to use them at all. But we should absolutely not do it in « several hours »; it takes time and attention to make sure that each label is not a mistake (every week I correct errors and approximations in French labels) and really makes sense. Let's grow slowly but steadily. I would rather see tools like the wikidata game or mix'n'match, where a human validates item by item (and it's quite effective, we can create millions of words in months, which is already more than impressive). Comparison with multiple sources would also be a good idea. PS: most labels are clearly not notable enough as lexemes (the 4.2 million human (Q5) items, same for the million building (Q41176) items). PPS: senses won't be available right now. Cdlt, VIGNERON (talk) 19:52, 10 March 2018 (UTC)
Temperature-jump NMR study of protein folding: Ribonuclease A at low pH (Q43620894) makes for a terrible Lexeme. A large part of our items are like this. This idea might work for a subset of entities, but I wouldn't know how to define that subset. --Denny (talk) 00:23, 11 March 2018 (UTC)
@VIGNERON: What you have said concerning the polysemic words is accurate. However, as Wikidata's lexicographical data will not include senses for its first edition, this method will be very useful as what we need now is just to add lexemes in all available languages to the Lexicographical data. --Csisc (talk) 09:49, 11 March 2018 (UTC)
@Denny, VIGNERON: As you already know, entities having proper nouns as labels can be easily eliminated when checking instance of (P31). In fact, the Wikidata entities you are talking about are humans, towns, places, buildings, films, plays, games, scientific articles... --Csisc (talk) 09:49, 11 March 2018 (UTC)
@Csisc: I'm not bad with SPARQL but I don't see how to build a query to remove all the proper nouns, do you know how to do it? And even then, we would get a list of probable lexemes, good material for other tools but not for mass import. Cdlt, VIGNERON (talk) 10:46, 11 March 2018 (UTC)
@VIGNERON: Using Wikidata query service is not an excellent idea due to time limit. What we will use here is Pywikibot. --Csisc (talk) 10:52, 11 March 2018 (UTC)
@Csisc: as you wish, the choice of the extraction tool doesn't matter. The question here is: how can you tell if the label is a proper noun or not? Cdlt VIGNERON (talk) 10:58, 11 March 2018 (UTC)
@Csisc: It is according to instance of (P31) of Wikidata entities. For example, if a Wikidata entity is an instance of fruit, you cannot say that its labels are proper nouns. That's evident. --Csisc (talk) 11:09, 11 March 2018 (UTC)
@Csisc: that's not evident at all, and fruit is just a specific and small example, there are only 98 items in subclass of fruit (and I'm not sure it's *only* common nouns). What I expected is a more general algorithm (doing a check class by class is not something that can be done in just "hours"). Cdlt, VIGNERON (talk) 11:25, 11 March 2018 (UTC)
@VIGNERON: We can begin by eliminating people, localities, countries, buildings, scientific articles and artistic masterpieces from the initial list of Wikidata entities. What will remain from the list will just be less than one million entries. After that, we can verify them one by one. If you like to verify if this can be done in practice, I will do this list for you. --Csisc (talk) 12:10, 11 March 2018 (UTC)
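The elimination step proposed above can be sketched as follows. This is only an illustration under stated assumptions: the sample items, the placeholder class IDs `Q_ARTICLE` and `Q_FRUIT`, and the function name are hypothetical; only Q5 (human) and Q41176 (building) are taken from this thread. The sketch checks direct instance of (P31) values only and does not walk subclasses, and its output is a list of candidates for human review, not a mass import.

```python
# Classes whose instances are assumed to carry proper-noun or title labels.
# Q5 and Q41176 are named in the discussion; Q_ARTICLE is a placeholder.
EXCLUDED_CLASSES = {"Q5", "Q41176", "Q_ARTICLE"}

# Hypothetical sample of items with a label and their P31 values.
SAMPLE_ITEMS = [
    {"label": "apple", "p31": {"Q_FRUIT"}},
    {"label": "Some Person", "p31": {"Q5"}},
    {"label": "Some Tower", "p31": {"Q41176"}},
    {"label": "Temperature-jump NMR study of protein folding", "p31": {"Q_ARTICLE"}},
    {"label": "apple", "p31": {"Q_FRUIT"}},  # duplicate label, kept only once
]


def candidate_labels(items, excluded):
    """Return distinct labels of items that are not instances of an excluded class."""
    candidates = []
    for item in items:
        if item["p31"] & excluded:
            continue  # at least one P31 value is in the excluded set
        if item["label"] not in candidates:
            candidates.append(item["label"])
    return candidates


candidates = candidate_labels(SAMPLE_ITEMS, EXCLUDED_CLASSES)
```

Whatever remains would still need the one-by-one verification mentioned above, since excluding a few classes does not prove that the remaining labels are common nouns.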

Form: text or lexeme

Hi all,

I've tested the test project for a word of Breton : ki (dog). I've created 4 lexemes so far :

  • ki (L554) (masculine, singular, unmutated)
  • chas (L555) (masculine, irregular but normal modern plural, unmutated)
  • kon (L556) (masculine, regular but uncommon modern plural, but used in old texts, in formal register, and also regular modern in the south-east part of Brittany, unmutated)
  • kiez (L557) (feminine, singular, unmutated)

But there are a lot of other forms for this word:

  • gi (masculine, singular, soft mutation)
  • c'hi (masculine, singular, aspirate mutation)
  • kiezed (feminine, plural, unmutated)
  • giez (feminine, singular, soft mutation)
  • cʼhiez (feminine, singular, aspirate mutation)
  • giezed (feminine, plural, soft mutation)
  • cʼhiezed (feminine, plural, aspirate mutation)
  • gon (masculine, plural, soft mutation)
  • c'hon (masculine, plural, aspirate mutation)


When I entered the forms, I was a bit surprised to have a field for a textual representation. Shouldn't it be a lexeme? My point of view is that lexemes for forms probably have to be created to carry specific statements (in this case, different etymologies for the two plural forms, for instance) and it would be more precise (and correct) to have a lexeme instead. We could have both and also a property "link to the lexeme about the form" but that doesn't seem right to me. @Lydia Pintscher (WMDE), Lea Lacroix (WMDE): I'd like to have the tech point of view (maybe I'm biased by Breton).

This case is not an edge case (except maybe for having two irregular plurals, but that's not that uncommon either), most words in Breton are like that (and often have more forms). Other languages are in a similar situation (obviously I can think of Cornish and Welsh, which also have consonant mutation (Q557863)).

For the point of having two plurals, it's quite common in Breton even for everyday common words. It exists also in other languages like English and French but for uncommon words (for instance words borrowed from Latin often have a regular plural following the rules of English and a "savant" plural following the rules of Latin ; codex = codices or codexes in English, codex = codex or codices in French).

Cheers, VIGNERON (talk) 14:34, 12 March 2018 (UTC)

It's possible to add statements about forms and I hope it's going to be possible to have statements whose value is a form. I would only have a lexeme for "ki" and then forms with statements stating the usage and origin of each form. Tpt (talk) 09:53, 14 March 2018 (UTC)
(I answered directly to Vigneron but I'm going to summarize my answer here as well)
A representation is a unique string of characters, whereas a Lexeme includes a group of forms and variations. That's why the representation of a Form can't be a Lexeme.
In the case you present, one way is to model it like you started doing it, creating a Lexeme for each variation, and connecting them to each other via a statement such as "plural of". Another possibility is the one Tpt mentions, with only one Lexeme including all the variations as Forms. This would also avoid redundancy of information. Lea Lacroix (WMDE) (talk) 10:11, 15 March 2018 (UTC)
How is the search supposed to work? Will the suggestion box show Lexemes or Forms? Since I assume the search box will be the same as with Q-items, will there be any way of telling them apart? (With a different color, with an "L" in front, etc.) Micru (talk) 11:44, 15 March 2018 (UTC)
So, after more thinking, indeed using a form here doesn't really make sense. Still even with text, I'm not sure how to best handle this case. I see two scenarios:
Scenario 1: only one lexeme
  • ki, having forms:
    • ki
    • chas
    • kon
    • kiez
    • gi
    • cʼhi
    • kiezed
    • giez
    • cʼhiez
    • giezed
    • cʼhiezed
    • gon
    • cʼhon
Scenario 2: multiple lexemes (at least two)
  • ki, having forms:
    • chas
    • kon
    • kiez
    • gi
    • cʼhi
  • kiez, having forms:
    • ki
    • kiezed
    • giez
    • cʼhiez
  • (maybe also) chas, having forms:
    • ki
    • kon
  • (maybe also) kon, having forms:
    • ki
    • chas
    • gon
    • cʼhon
There is indeed redundancy (and I don't like redundancy) but the logic here is that « giez is a form of kiez, which itself is a form of ki »; saying « giez is a form of ki » feels too direct and a bit of a stretch. Scenario 2 is more horizontal (which has pros and cons) and maybe more precise but I'm not sure exactly how to structure it. What do you think?
Cdlt VIGNERON (talk) 16:01, 17 March 2018 (UTC)
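For comparison, Scenario 1 can be sketched as a single lexeme whose forms each carry a set of grammatical features. This is a minimal Python illustration, not the actual storage format: the dictionary layout and the `forms_with` helper are hypothetical, the features are the ones listed for each form earlier in this section, and the two irregular or uncommon plurals are simply tagged "plural".

```python
# Scenario 1: one lexeme "ki"; every form maps to its grammatical features.
KI_FORMS = {
    "ki": {"masculine", "singular", "unmutated"},
    "gi": {"masculine", "singular", "soft mutation"},
    "cʼhi": {"masculine", "singular", "aspirate mutation"},
    "chas": {"masculine", "plural", "unmutated"},
    "kon": {"masculine", "plural", "unmutated"},
    "gon": {"masculine", "plural", "soft mutation"},
    "cʼhon": {"masculine", "plural", "aspirate mutation"},
    "kiez": {"feminine", "singular", "unmutated"},
    "giez": {"feminine", "singular", "soft mutation"},
    "cʼhiez": {"feminine", "singular", "aspirate mutation"},
    "kiezed": {"feminine", "plural", "unmutated"},
    "giezed": {"feminine", "plural", "soft mutation"},
    "cʼhiezed": {"feminine", "plural", "aspirate mutation"},
}


def forms_with(features):
    """Return the representations whose features include all requested ones."""
    wanted = set(features)
    return [rep for rep, feats in KI_FORMS.items() if wanted <= feats]


masculine_plurals = forms_with({"masculine", "plural"})
feminine_plural_base = forms_with({"feminine", "plural", "unmutated"})
```

In this flat layout the relation « giez is a form of kiez » is not stored explicitly; it would have to be recovered from the shared features or added as a statement on the form, which is the trade-off discussed above.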

Why do forms have a language field?

I was just checking the Demo system and I have seen that for each form there is a language field. What is the point of this? Is it for dialectal forms? And why doesn't the entity selector work on those fields? Micru (talk) 14:26, 14 March 2018 (UTC)

Hello Micru,
Yes, the language field in the form can help to express the language more precisely. For example, a Lexeme that would have "color" as Lemma and "English" as language, could have a form "color" as representation and "en-us" as language, then a form "colour" as representation and "en-gb" as language.
You're right, for now, the language field in the Forms is a free field where people can type the language code they want, without choosing from a list of languages. This is meant to give people the freedom to choose the language code they want, even if it's not in an official list for example, but one can also argue that having an entity selector would help harmonize the content. What do you think? Lea Lacroix (WMDE) (talk) 15:56, 14 March 2018 (UTC)
I'd be in favor of an entity selector. --Denny (talk) 16:15, 14 March 2018 (UTC)
Agreed with the entity selector. Now we have to define what will be these entities? I think the list of all languoid (Q17376908) (and its sub-classes) could cover a lot of cases (dialect, pidgin, etc.) Pamputt (talk) 16:54, 14 March 2018 (UTC)
+1 for an entity selector. Do we need to limit it a priori (like suggested by @Pamputt:), a constraint check a posteriori could be good enough, no? Cdlt, VIGNERON (talk) 18:27, 14 March 2018 (UTC)
+1 for an entity selector too. A constraint check a posteriori should be enough, yes. Micru (talk) 08:07, 15 March 2018 (UTC)
Thank you very much for your feedback! We're going to discuss it within the team :) Lea Lacroix (WMDE) (talk) 09:53, 15 March 2018 (UTC)
Hello,
Here's some more explanation about the current status of the field:
The language field of the Forms is currently a text field. It is stored as such in the database. This field (and the language field of the Lemma, they both work the same) accepts two types of strings:
  • a language code, like "fr" or "en-gb". This language code is checked a priori against the list of languages of Wikidata, the same list as for the item labels
  • a language code augmented with a suffix, to let you be more precise about the language. The format is composed of a language code and an item: languagecode-x-wikidata-Qxxx. To take an example I know, if I want to describe the Lexeme "évènement", French noun, that also has a form "événement" described in the orthography reform of French in 1990, I may add in the language field: fr-x-wikidata-Q486561.
This is how it is right now. In the future, I agree that an entity selector could be nice. The interface will also be improved to display something nicer than "x-wikidata". It could also include an easier way to add this "suffix" without having to find the Qid related to the specificity of the language.
Thanks for your ideas, Lea Lacroix (WMDE) (talk) 16:06, 15 March 2018 (UTC)
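The two accepted shapes of the language field can be illustrated with a small parser. This is a sketch only: the function name and the regular expressions are assumptions, and the real check of the language code against Wikidata's list of languages is not reproduced here (the pattern below merely accepts lowercase codes such as "fr" or "en-gb").

```python
import re

# Assumed shape of a plain language code: "fr", "en-gb", ...
_CODE = re.compile(r"[a-z]{2,3}(-[a-z0-9]+)*")
# Item identifier used in the suffix: "Q" followed by digits.
_ITEM = re.compile(r"Q[1-9][0-9]*")
_MARKER = "-x-wikidata-"


def parse_language_field(value):
    """Split a language field into (language code, item id or None)."""
    code, marker, item = value.partition(_MARKER)
    if not _CODE.fullmatch(code):
        raise ValueError(f"not a language code: {code!r}")
    if marker and not _ITEM.fullmatch(item):
        raise ValueError(f"not an item id: {item!r}")
    return code, (item if marker else None)


plain = parse_language_field("en-gb")
augmented = parse_language_field("fr-x-wikidata-Q486561")
```

An entity-selector widget, as suggested in this thread, could build the augmented string from a chosen item so that editors never have to type the marker by hand.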
That sounds good. It would be great if the actual widget where this is entered behaves like an entity selector, even if a string with the code is then added. Just to reduce pain and errors when entering. But I wouldn't see this as a launch blocking wish. --Denny (talk) 16:16, 15 March 2018 (UTC)
@Lea Lacroix (WMDE): From what you say it seems that there will be two fields, one for the language code and another one for an item. As I see it the language code can be extracted from the item itself or from the items above in the ontology. For instance if I say that a word is in Mallorquí (Q951473), that item itself has no language code, but going up in the subclass tree eventually I reach an item with a language code, so why give editors more work for something that can be automated? Micru (talk) 08:23, 16 March 2018 (UTC)
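The automation suggested here could look roughly like the sketch below, which walks up a subclass chain until it meets an item that has a language code. Everything except Q951473 is a placeholder: the intermediate item IDs, the single-parent simplification (real items may have several parents) and the code "ca" on the ancestor are assumptions made for the example.

```python
# Hypothetical extract of a subclass hierarchy (child -> parent).
PARENT = {
    "Q951473": "Q_INTERMEDIATE",     # Mallorquí -> placeholder parent item
    "Q_INTERMEDIATE": "Q_LANGUAGE",  # placeholder parent -> placeholder language
}

# Items that carry a language code; "ca" is assumed for the example.
LANGUAGE_CODE = {"Q_LANGUAGE": "ca"}


def inherited_language_code(item):
    """Return the code of the nearest ancestor (or the item itself) that has one."""
    seen = set()
    while item is not None and item not in seen:
        if item in LANGUAGE_CODE:
            return LANGUAGE_CODE[item]
        seen.add(item)  # guard against cycles in the hierarchy
        item = PARENT.get(item)
    return None


code_for_mallorqui = inherited_language_code("Q951473")
```

With such a lookup, the editor would only pick the item, and the language-code half of the field could be filled in automatically.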

Multiple Spellings in Same Language and References

How should multiple spellings of the same lemma in the same language be handled? For example, રાત્રિ and રાત્રી are both correct spellings of one lemma in Gujarati (the word is read as Ratri and means Night in English). It is not like the colour (en-gb) and color (en-us) difference, where the forms have different languages. They would both have (gu)/Gujarati as a language. So should it be stored in the same way?-Nizil Shah (talk) 13:49, 15 March 2018 (UTC)

How will references for data extracted from a public domain dictionary be cited? For example: https://en.wiktionary.org/wiki/shock_and_awe has a reference of Oxford.-Nizil Shah (talk) 14:06, 15 March 2018 (UTC)

You can enter both forms in parallel. There are no constraints. --Denny (talk) 16:17, 15 March 2018 (UTC)
About references, just like in the Q-items, statements will also have qualifiers and references. So you will be able to store precise references.
For extracting these references, or adding some automatically, this would be the job of a tool built by the community in the future for automatic extraction. Lea Lacroix (WMDE) (talk) 16:20, 15 March 2018 (UTC)
Thanks for the clarification.-Nizil Shah (talk) 05:26, 16 March 2018 (UTC)

Feel free to show your modeling experiments

Hello all,

I'm very happy to see more editors interacting on this page, trying the data model and asking very interesting questions :) Feel free to ask and react as much as you want. As stated in the data model documentation, the structure is not entirely fixed, and is meant to leave the community free to describe words as they want. So there is not one unique model that the developers would have decided, but multiple ways to organize the data, which will continue to be discussed together.

If you are doing some tests on the beta system, feel free to share them here, and ask for feedback from other editors. For example, VIGNERON created ki (Breton, noun) with different Forms describing the variations of the word.

Cheers, Lea Lacroix (WMDE) (talk) 16:31, 15 March 2018 (UTC)

Lea Lacroix (WMDE) Played around a bit as well. As others have remarked, the forms aspect of the model is a bit confusing at the moment. What's the difference between a grammatical feature "female (Q30)" and a statement from the form to Q30? Is it just a UI convenience, or modeled internally in a different way? And how can I add glosses to a lexeme? – Jberkel (talk) 20:28, 15 March 2018 (UTC)
+1 to adding glosses. How are they added? Without them all the information entered is quite useless, I mean that is the basis of a dictionary, isn't it? Right now it feels like a car without wheels, it might be very shiny, but it cannot be used. Micru (talk) 08:29, 16 March 2018 (UTC)
As stated in the announcement, Senses (that contain glosses) will not be present in the first version deployed in April, but will be added very soon. Lea Lacroix (WMDE) (talk) 10:37, 16 March 2018 (UTC)
If the senses are supposed to be added "very soon", why not do the deployment when senses are ready? As it is now, the lexicographical data is not usable. Micru (talk) 13:13, 16 March 2018 (UTC)
It makes sense to get the foundations right first. – Jberkel (talk) 14:23, 16 March 2018 (UTC)
There are several ways to deploy software. One could develop the whole product without deploying it, waiting until it is perfect from the developers' point of view, and then give it to the users, and they have to handle it. Another method is to quickly deploy some usable parts of the software, step by step, even if it's not complete, so the users can try it and give feedback. We are committed to the second one, because we think this is the best way to get feedback from the community and build something that really fits their needs. That is why we chose to deploy a first version, which will be completed in the future based on the needs of the users.
Thanks for your note, we will definitely take into account the need to have the Senses added quickly. Lea Lacroix (WMDE) (talk) 18:23, 16 March 2018 (UTC)
I'm not saying that you should wait to have a complete product, because that is nonsense. What I said is that as it is now the lexicographical data development is not usable. It does not fulfill the requirements of a minimum viable product, and as such there is not much point in deploying something that cannot be used. But again this wouldn't be the first time that something not production-ready has been presented to the community (Visual Editor comes to mind), so who am I to judge. Micru (talk) 22:43, 16 March 2018 (UTC)
@Micru: « I mean that is the basis of a dictionary » but is the Lexicographical data meant to be a dictionary? We already have the Wiktionaries to be dictionaries, I feel like Lexicographical data should focus on data (there are still millions of things doable with gloss-less data, see Wikidata:Lexicographical data/Ideas of queries and Wikidata:Lexicographical data/Ideas of tools) and not try to reinvent a project that already works quite well. Cdlt, VIGNERON (talk) 08:55, 19 March 2018 (UTC)

Redirects

How will redirects be handled? See this. There are several reasons to have redirects as listed on this page. -Nizil Shah (talk) 05:30, 16 March 2018 (UTC)

Hello,
I'm not sure that I understand your question. Can you elaborate?
There are no redirects on Wikidata entities, and we are not going to modify the policies of Wiktionary communities. Editors can model words as they want on Wikidata. They can create different Lexemes for "work" and "worked", or only one and describe the different variations in the Forms, as they prefer. They can reuse the data from Wikidata in Wiktionary, if they want to. Lea Lacroix (WMDE) (talk) 07:25, 16 March 2018 (UTC)
Nizil Shah Redirects are the exception on Wiktionary. The page you linked to mostly lists reasons against redirects. The exception at the moment are longer English phrases which are unlikely to be reused in another language. – Jberkel (talk) 08:51, 16 March 2018 (UTC)
I was looking at a regular dictionary which has entries like: Foo - see Footoxlot. How will such redirects be handled in Wikidata, when one word is redirected to a different word? Should we consider them synonyms or forms of the same word? What if they are very different?-Nizil Shah (talk) 11:33, 16 March 2018 (UTC)
That's a different case, it depends on the word. It could be a synonym, short form, etc. Or maybe even both. I think these practices will come up as the first entries get created. It might make more sense to attach synonyms to individual senses, since they might not apply to the whole lexeme. They probably also need to get qualified somehow (perfect synonym, only in certain contexts etc). – Jberkel (talk) 14:17, 16 March 2018 (UTC)
OK. That makes some sense. Looking forward to how this turns out. -Nizil Shah (talk) 14:07, 30 March 2018 (UTC)

Linking lexemes to concepts

This is fairly basic, but I've not yet found the answer written down anywhere: how will we link lexemes like, for example, "cheese" and "fromage" to the item for the concept cheese (Q10943)? Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 13:03, 16 March 2018 (UTC)

And to add to this, how to link from one form to other forms? Micru (talk) 13:08, 16 March 2018 (UTC)
I imagine you would link a sense to an item and not the whole lexeme? cheese could also be Cheese (Q907797). I am also interested how we could link from Wiktionary back to Wikidata. – Jberkel (talk) 14:29, 16 March 2018 (UTC)
Hello,
Indeed, linking to cheese (Q10943) would be done in the Senses, for example with a property "refers to concept".
It will be possible to link a form to another form, for example with a property "is plural of".
@Jberkel: What do you mean by "link from Wiktionary back to Wikidata"? On Wiktionary, people could make links to the data stored in Wikidata (lexemes, forms, senses, or concepts) or could directly display this data on the pages. Lea Lacroix (WMDE) (talk) 18:18, 16 March 2018 (UTC)
Lea Lacroix (WMDE) Initially it would just be to link to a specific sense, without getting any data. Some Wiktionary senses already point to Wikidata concepts. We could do the same with sense ids (LXXXX-SY) once they are created. – Jberkel (talk) 19:10, 16 March 2018 (UTC)
Thanks for your explanation. Yes, linking to Senses from Wiktionaries will be possible in the future. Lea Lacroix (WMDE) (talk) 08:12, 19 March 2018 (UTC)
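
As a rough illustration of the linking discussed in this thread, the sketch below models a Lexeme whose Senses (rather than the Lexeme as a whole) point to Q-items, and whose Forms can point to other Forms. The class names, the `refers_to_concept` and `plural_of` fields, the glosses and the L1/L2 identifiers are assumptions made for the example; they are not the actual Wikibase data model or existing properties. Only the LXXXX-SY pattern for sense ids and the two Q-ids come from the discussion above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Sense:
    sense_id: str                            # follows the LXXXX-SY pattern, e.g. "L1-S1"
    gloss: str                               # illustrative gloss, not real data
    refers_to_concept: Optional[str] = None  # hypothetical property: Q-id of the concept


@dataclass
class Form:
    form_id: str                     # hypothetical identifier, e.g. "L1-F1"
    representation: str
    plural_of: Optional[str] = None  # hypothetical property: id of the singular Form


@dataclass
class Lexeme:
    lexeme_id: str
    language: str
    lemma: str
    forms: List[Form] = field(default_factory=list)
    senses: List[Sense] = field(default_factory=list)

    def concepts(self) -> List[str]:
        """Return the Q-ids reachable through this lexeme's senses."""
        return [s.refers_to_concept for s in self.senses if s.refers_to_concept]


cheese = Lexeme(
    "L1", "en", "cheese",
    forms=[Form("L1-F1", "cheese"), Form("L1-F2", "cheeses", plural_of="L1-F1")],
    senses=[
        Sense("L1-S1", "the food", refers_to_concept="Q10943"),
        Sense("L1-S2", "another item labelled Cheese", refers_to_concept="Q907797"),
    ],
)
fromage = Lexeme(
    "L2", "fr", "fromage",
    senses=[Sense("L2-S1", "l'aliment", refers_to_concept="Q10943")],
)
```

With this shape, "cheese" and "fromage" are connected only through the Sense that both link to Q10943, which is why the link is placed on the Sense and not on the Lexeme.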

Proposal for Senses

After reading the conversations and playing a bit with the demo system, I would like to propose reusing Q-items for Senses, adding the following features:

  • The descriptions from the Q-item are transcluded in the L-item where it is linked (maybe in the future those can be edited from there)
  • The label from a Q-item can link to the L-item if they are connected
  • Each alias of a Q-item can link to each Form of a L-item if they are connected and the characters match.
  • A bot or a gadget can move Forms to Aliases or the other way around if the community decides so

The advantages of this are that we can build on what we already have, and that it can be deployed incrementally (first just connection L-item->Q-item without transclusion, then transclusion, then links from Q-item to L-item). Also, using Q-items as the containers of Senses gives us a way to figure out the translations; that is, if several L-items link to the same Q-item, we know that the words are closely related enough to consider them translations of each other. The disadvantages are that we might have a few million items just for Senses, that a great part of them will not have any statement, and that we would have to revise WD:N. However, I believe that the benefits are still greater. Let me know what you think. Micru (talk) 09:53, 17 March 2018 (UTC)
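
To make the translation idea in this proposal concrete, here is a minimal sketch, assuming sense links are available as simple tuples. The lexeme ids, the tuple layout and both function names are hypothetical; this is not how Wikibase stores or exposes the data. It groups lemmas by the Q-item their senses link to, and separately checks which aliases of a Q-item match a Form character for character.

```python
from collections import defaultdict

# Hypothetical sense links: (lexeme id, language, lemma, linked Q-item).
sense_links = [
    ("L1", "en", "cheese", "Q10943"),
    ("L2", "fr", "fromage", "Q10943"),
    ("L3", "en", "rook", "Q137"),
    ("L4", "fr", "tour", "Q137"),
    ("L4", "fr", "tour", "Q12518"),
]


def translation_candidates(links):
    """Group lemmas by shared Q-item; keep groups spanning more than one language."""
    by_item = defaultdict(set)
    for _lexeme_id, language, lemma, qid in links:
        by_item[qid].add((language, lemma))
    return {
        qid: sorted(words)
        for qid, words in by_item.items()
        if len({lang for lang, _ in words}) > 1
    }


def matching_aliases(aliases, form_representations):
    """Return the aliases that exactly match one of the lexeme's Forms."""
    forms = set(form_representations)
    return [alias for alias in aliases if alias in forms]


candidates = translation_candidates(sense_links)
```

Here `candidates` pairs "cheese" with "fromage" and "rook" with "tour", while Q12518 is left out because only one language links to it in this sample. The result is only a list of candidates: whether a shared Q-item is close enough to count as a translation is the open question of this thread.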

@Micru: Please be cautious about greatly expanding Q-items to reflect different shades of meaning. Remember that a key purpose of Q-items is to be a controlled vocabulary for Wikidata statements. For that role, it is helpful for the vocabulary to be quite sparse. Making a controlled vocabulary richer can make it less functional, by introducing ambiguity as to what the correct value for a statement may be; by distributing statement-values over several items, making searching more difficult; and by making it more likely that similar concepts from different sources may get sitelinked (or External-ID linked) to different items, making it less likely to be able to read across from one to another.
Try to match senses to existing Q-items by all means; but I suggest to be quite cautious before considering any great inflation of the set of Q-items. Jheald (talk) 11:24, 19 March 2018 (UTC)
@Jheald: Could you give some examples where having more Q-items to reflect Senses could be a problem? I have my doubts that all the Senses present in Wiktionary will be imported into Wikidata (like cheese, the material, or cheese, the variety). There are advantages to having more Q-items, for instance there are several "cannon" (artillery gun (Q1246258), cannon (Q10901528), cannon (Q16323077), cannon (Q1723856)), but there could be some more that appear on the en-wikt page for cannon, like "cannon: a bone of a horse's leg, between the fetlock joint and the knee or hock". That would fill many gaps currently present in Q-items. I doubt that having more items would make the Wikidata vocabulary less functional, perhaps the opposite, it is hard to predict. Micru (talk) 12:26, 19 March 2018 (UTC)
@Micru: So look at a thesaurus return for eg fortification (or any other noun you want to think of). I think it may perhaps be quite useful that this full range is not available to people wanting to specify the instance of (P31) of something, but instead their available choice is more restricted, between a smaller number of distinct items. Jheald (talk) 15:31, 19 March 2018 (UTC)
@Jheald: Senses that are very close can be shared among L-items, there is no need to create a 1-to-1 match. For the example in particular I see that we already have fortification (Q57821), barricade (Q81715), citadel (Q88291), bastion (Q81851), battlement (Q23423), blockhouse (Q82101), breastwork (Q91745), castle (Q23413), earthworks (Q1349587), entrenchment (Q23498051), fort (Q1785071), garrison (Q88556), keep (Q91165), outpost (Q1321241), parapet (Q1286070)... So not that many new items would need to be created. Micru (talk) 16:00, 19 March 2018 (UTC)
Hello Micru,
Thanks for your suggestion. I have a simple question: currently, do the descriptions of Q-items really describe what it is? I have the feeling that descriptions are mostly used to disambiguate, not to describe, therefore I'm wondering if the descriptions we have are meant to be used as glosses. What do you think? Lea Lacroix (WMDE) (talk) 08:17, 19 March 2018 (UTC)
Micru, Lea Lacroix (WMDE): I agree with Micru. This is just what I discussed with several WikiIndaba participants this week. The question here is what a sense is. A sense is simply a set of relationships between the named entity and its close environment and this is just what the statements of Q-items define (NOT the description). Consequently, using Q-items as senses is evident (and even simpler). --Csisc (talk) 09:46, 19 March 2018 (UTC)
Hi Lea, at the moment it is mixed: some are definitions, some are disambiguations that could be expanded to become definitions, and some are empty. Here is a random list of words I checked:
Most of the descriptions are empty, so in general it might be a good thing to reuse them for Senses, so the probability will be higher that they will get used. Micru (talk) 09:52, 19 March 2018 (UTC)
Remember that the descriptions are systematically used by Wikipedia on mobile, as subtitles to article titles, and as disambiguations in search and suggestion pages, with Magnus's autodesc program used to fill in descriptions which are blank. In this role, it is desirable that descriptions are short terse fragments. If that is useful for senses, then good. But if "expanding" them means making them less short and focused, then please have a thought to their existing roles. Jheald (talk) 11:14, 19 March 2018 (UTC)
The idea is interesting, but why do transclusion? If senses are linked to a Q-item, then it's easy and trivial to retrieve the description from the Q-item with a tool. Cdlt, VIGNERON (talk) 12:15, 19 March 2018 (UTC)
@VIGNERON: If a L-item has several Q-items connected as senses, it doesn't give much information to have just the label (which is always the same). It would be more practical if the descriptions would show what each sense is referring to without having to access each individual item. Micru (talk) 12:34, 19 March 2018 (UTC)
@Micru: « the label (which is always the same) » not always, it's true when you're thinking inside one language but not in a multilingual context. rook (Q137) and tower (Q12518) have the same label in French (and other languages) but not in English (and other languages, see query for full list). And even inside one language, I'm not sure it's always true (plus, labels of Q-items are not 100% stable). I get the idea but I'm not sure about the solution (especially as most Lexemes will only have one sense and Lexemes with multiple senses can be disambiguated with other properties). Cdlt, VIGNERON (talk) 13:07, 19 March 2018 (UTC)
@VIGNERON: "it's true when you're thinking inside one language but not in a multilingual context". Most likely the L-items will be per language, so for "rook" (English) you will have all the Senses connected to that word in that language (rook (bird), rook (swindler), rook (chesspiece), rook (fortification)...), and most likely the label of the connected sense will be the same as the lexeme (there might be some cases where it is different, for instance when the same sense is shared between synonyms). For the case of French you will also have a L-item for "tour" and to that L-item all the senses will be connected (tour (tower), tour (chesspiece), tour (Machine de guerre), tour (Boîtier d’ordinateur)...)
It is not practical to have just a label in the L-item to represent the sense, because the labels for the senses that the user would see in the L-item for "tour" when browsing in French would be "tour", "tour", "tour"... Micru (talk) 13:48, 19 March 2018 (UTC)
This also might have value beyond Lexemes (though Micru I think makes a strong argument for why Lexemes might be particularly important case). But it's often enough that I've seen an ordinary statement on an ordinary item, when it would have been useful to have a pop-up of the description to confirm that the correct Q-item out of several potential choices with the same label had been made the object of the statement. Jheald (talk) 15:26, 19 March 2018 (UTC)

Embedded Senses AND Q-item Senses

I've been digging deeper in the concerns raised by Jheald, and it is true that for some cases it is difficult to justify creating an item. For instance, should we have a Q-item for the sense of pet peeve? Or for each sense of between a rock and a hard place? It seems that the Senses of those cases are more suited to be embedded in the L-item itself, while for others it is better to have a Q-item. I wonder if it is possible to have both, as each solution independently is not strong enough but both together could do the trick. Micru (talk) 09:32, 20 March 2018 (UTC)

I thought I understood that the lexicographical data in Wikidata is not designed to create a dictionary, so no senses are needed. I may have got it wrong. Noé (talk) 00:46, 21 March 2018 (UTC)
Noé, if you take a look to the data model you will see that senses are indeed there. They are an important part of lexicographical data. Micru (talk) 07:58, 21 March 2018 (UTC)

Separate installation for Wiktionary ?

Why was the possibility to provide a separate installation for Wiktionary that could provide the entire functionality of the current Wiktionary turned down? Reading Lydia's comment at Wikidata:Project chat, it seems that the current solution should only cover some of it.
--- Jura 13:31, 23 February 2018 (UTC)

Hello,
Since there are hundreds of Wiktionaries, installing a Wikibase instance for each of them would be very inefficient; the idea is to store all data in one place. This is what Wikidata is meant for.
Just like Wikidata was not created with the only purpose to support Wikipedia, lexicographical data on Wikidata doesn't have the only purpose of supporting Wiktionary. Other parties can find interest in data about words: students, researchers, applications developers... we want the data to be structured, accessible and reusable for everyone without distinction. Lea Lacroix (WMDE) (talk) 18:30, 25 February 2018 (UTC)
I think any Wikimedia project has that purpose.
The separate installation could be for all Wiktionaries, similar to Commons, which is available for all Wikimedia sites. It would be really inefficient to do that on a Wiktionary text page, and Wiktionary already hasn't really been favored by the use of MediaWiki.
I don't quite see much room for structured content at Wiktionary beyond what's planned to be a Wiktionary namespace in Wikidata. This is fundamentally different from how the current Wikidata relates to Wikipedia.
So, I suppose the answer to my question is no [that it wasn't evaluated].
--- Jura 10:27, 26 February 2018 (UTC) [edited]
I'm not sure I understand this suggestion. Why does it matter if the L-items are on wikidata.org or on some separate site called lexeme.wiktionary.org or whatever? No matter where it is created, data will probably be CC-0 in both cases, as this is data; and for « the entire functionality of the current Wiktionary », again no matter the place, Wikibase is not made to handle everything and make coffee, just to deal with data, which is already a big job. Cdlt, VIGNERON (talk) 08:38, 26 February 2018 (UTC)
If it doesn't matter, it could solve one of the points Lydia mentioned on project chat. Besides that, the Wiktionary community might feel substantially better served with the new tool. Not sure what you think they use Wiktionary.org for now, maybe you could detail things not covered by the new tool.
--- Jura 08:49, 26 February 2018 (UTC)
@Jura1: I don't really know if it matters or not, that is why I'm asking you. I just have the feeling that a separate installation will do nothing more or less (except that having yet another website is a bit more complicated to create and to maintain, while having all data in Wikidata seems easier for everybody). Cdlt, VIGNERON (talk) 09:07, 26 February 2018 (UTC)
It wouldn't be much different to Commons. Not sure if wiktionary.org is limited to data and coffee (reading your characterization above).
--- Jura 10:27, 26 February 2018 (UTC)
There is at least one big difference: Commons already exists and already works as a central place. Where the h*ll did you read that I said « wiktionary.org is limited to data and coffee »? What I said is almost the total opposite! Cdlt, VIGNERON (talk) 12:02, 26 February 2018 (UTC)
Well, the only actual detail you mentioned was that, but it might be worth checking if there are any gaps, I asked the question at Wikidata:Project_chat#Gap_between_Wiktionary.org_and_Wiktionary_namespace_at_Wikidata?.
--- Jura 17:21, 27 February 2018 (UTC)