Wikidata talk:Lexicographical data/Archive/2018/06

This page is an archive. Please do not modify it. Use the current page, even to continue an old discussion.

Structured Glosses - the last frontier?

Latest comment: 6 years ago12 comments5 people in discussion

Each linguistic unit has a Form and a Sense. If I take any word I could define it by those two characteristics. For instance I could say "test", but I can also specify "L397-F3" (available) plus "L397-S1" (not available yet). But this "L397-S1" is again a collection of words. I can use unstructured text and say that "L397-S1" = "A challenge, trial." or I could use structured text and say "L397-S1" = "A ("L3333-F3"+"L3333-S2") challenge ("L4444-F3"+"L4444-S5"), trial ("L5555-F1"+"L5555-S2")". This method has the advantage that it is more precise and it can help machines understand better the Glosses. It has the disadvantage that it is more time consuming to enter manually, and that it requires a big number of senses already present to start being useful, because at the begining without having any sense you wouldn't be able to link them. I have some ideas about how to live with this challenges (i.e. mix structured with unstructured text), but before going into detail I would like to know your opinion about this and if it is worth pursuing. --Micru (talk) 19:39, 29 May 2018 (UTC)

so we would have, in fact, "structured sentences"? As you suggest, you need something there to get something like this started. But I like the general concept. As I understand it now, Senses come with a "gloss" that is like the "description" on a regular wikidata item - i.e. it can be in multiple languages, but is pretty much a single string of words intended to convey meaning. So what you're suggesting is perhaps adding a "structured gloss", which would be a list of form-sense pairs? ArthurPSmith (talk) 20:06, 29 May 2018 (UTC)

The gloss of each language should be a "structured sentence". Your interpretation is correct (sorry that I was not able to be more specific in my description), but indeed each "structured gloss" would be a list of form-sense pairs. ~~I have no idea about how to get it started~~, or if it is possible to combine "structured glosses" with "unstructured glosses", but at least now the idea is out there and we can think about it.--Micru (talk) 20:58, 29 May 2018 (UTC)

Perhaps what I said it is not entirely true. How I envision it is to start with "unstructured glosses", and switch to "structured glosses" on a case by case, language by language basis, once there is a considerable amount of senses.--Micru (talk) 21:03, 29 May 2018 (UTC)

In Polish wiktionary we have all definitions and examples of usage structured (written in wikikode) and it really helps to understand meaning. KaMan (talk) 07:37, 30 May 2018 (UTC)

It looks like wikifying every word in Wiktionary descriptions. --Infovarius (talk) 13:12, 30 May 2018 (UTC)

Indeed, very similar, the only difference is that instead of 1 link per word, there would be 2 (1 to the form, and 1 to the sense). It could be interesting if done at large scale, or even in other projects, as it would facilitate machine learning patterns of which form comes with which sense depending on the context. I assume that it could be semi-automated, as machine translation software normally has the ability to recognize POS and disambiguate. I could check with some linguistic researchers if this would be interested for them. But the question is, is it interesting for us or too much work? --Micru (talk) 17:06, 30 May 2018 (UTC)

@Micru: I'm not sure to understand exactly your idea (but I still don't get why we need glosses in general, maybe it's just me). I particular: what is the difference between your idea and a proprerty? L397-S1 → L4444-S5 (it could even maybe be the prematurely created item for this sense (P5137) or an other new one). I'm also a bit confused by the notation « "L3333-F3"+"L3333-S2" », why isn't the main lemma of L3333-S2 enough? Cdlt, VIGNERON (talk) 17:18, 30 May 2018 (UTC)

@VIGNERON: I appreciate that you spoke up because it gives me the chance to reflect about it and go deeper into the basics. First of all, I do not know what you understand by "gloss", for me it means a "short explanatory text in natural language", similar to what we do in Item descriptions, I hope you agree with my definition. In practical terms, I believe that glosses are more aligned to the way we think (in natural language), whereas having Senses without glosses would force me to think of words in structured elements only, which is more limited and frustrating. Other than that, in my view each word/form is a compressed representation of something that an observer notices in their reality. I, as a person, cannot record/transmit a 1:1 video feed of my inner experience, so I need to compress it in words, sentences, so I can transmit it to another person and they can decode it in their own terms. Of course this is tricky, because for two parties to understand each other's compression protocol they need to agree on how to encode/decode the message, and here it is where glosses come handy. For me glosses are an expanded version of the reality that a word is trying to convey. It is useful because it forces the people engaged in conversation to collaborate to align their encoding methods in a way that makes communication more effective and mutual understanding possible. I find useful to record glosses because it gives me the chance to dig deeper into the reality of each person in natural language and understand how they think. Digging deeper into this topic, by understanding how a person thinks I can communicate in their (inner, personal) language, so that they understand my needs and then they have the choice to take them into account (cooperation), ignore them (passive aggression), or fight them (active aggression). I can also do the same for them, and do it with awareness. Without understanding each other most people choose (sometimes by ignorance) to either actively or passively attack each other, which I find sad. In a way, recording and thinking about glosses helps in daily life because when a person has this understanding they will be curious to know what reality is behind each word and will ask more questions to understand and cooperate more. Ideally we would need a Wikidata for each person, so that we would know how each person uses each word, but since that is neither practical, nor feasible (for now), at least we can agree or discover agreements (glosses) that have been done about words, and unfortunately it is easier to reinterpret those agreements in natural language. (I hope that with my explanation about why glosses are important for me resonates in you, if not, please do share why not, because for me it is interesting to know what glosses mean for you).

Then about what is the difference between my idea and a property, I have to say that the difference is mainly related to how we encode sentences in natural language vs claims. Sentences are complex, they have an underlying grammar, and they can form a long linear unit. Statements are simple, they only have a statement-qualifier grammar, and generally do not form long units of knowledge. In theory you could encode each word in a sentence as values of a property, but that would make it difficult to read. So the idea is to respect the user-friendliness of linear written text, while adding explicit structured elements. The structured elements (or wikification, as others have put it) do not change the meaning of the gloss, yet I feel it is a step closer in the direction of having explicit knowledge about language (as opposed to leaving many elements open to self-interpretation, or recreating natural language in our limited statement-qualifier structured language).

You say, why not L397-S1 → L4444-S5? Well, I wish it would be so easy, and indeed it might be the case for exact synonyms, but in my experience not all words have exact synonyms. For instance, take the Form "pear", which English wiktionary defines (among other definitions) as "An edible fruit produced by the pear tree, similar to an apple but elongated towards the stem." It is possible to translate it into properties, but some of it is difficult like "elongated towards the stem" (in English, our towards (P5051) has a different meaning, so we would need more "towards" properties). And there are more complicated cases like "freedom of speech" = "The right of citizens to speak, or otherwise communicate, without fear of censorship or prosecution." I feel that we are reinventing the language wheel by trying to express everything with statements. In my view with property-value-qualifier we have reached this far on Wikidata in the last 5 years, it is very far indeed, but still baby steps compared to the power of natural language. I would like to consider natural language my ally and not my enemy, and I find that structuring natural language could be the right way to do it. If some day we want machines to give clear answers based on a body of knowledge, I feel that an effort has to be made to bridge the gap between natural and machine language. Perhaps by going in this direction of explicitly recording what we mean, after some years a Structured Wikipedia will be possible, where text only appears to what the reader can understand. Who knows.

Regarding the notation « "L3333-F3"+"L3333-S2" », I invite you to process this sentence, which albeit a bit extreme, shows the problematic of ignoring the different roles that can take homograph forms. I also think this is a great opportunity to start thinking about how to make user-friendly the encoding of more than one attribute per word/group of words. It is technically (and socially) challenging, but still that doesn't mean that a workable solution cannot be found.--Micru (talk) 08:44, 31 May 2018 (UTC)

Changed some words (Senses->Glosses) because I noticed I was using them incorrectly.--Micru (talk) 09:15, 31 May 2018 (UTC)

@Micru: wow, that a very lengthy response, thank you. I've got a much clearer view on you point. Yes, we agree that a gloss is a short description. But what is short? "An edible fruit produced by the pear tree, similar to an apple but elongated towards the stem." is not really short for me. I think "fruit" will probably be enough as there is no other fruit called "pear". For me (and here we seem to disagree), glosses are here to disambiguate senses and homographs ; as "freedom of speech" is self explanatory, there is no real need for a gloss. My point of view is that we shouldn't put more than needed, there is no need for sentences. If you think in term of transmission, I consider everything superfluous and redundant to be unnecessary noise that perturb the signal. And for the users (who are indeed important), there will be other clues (like image, synonyms, evocation and denotation, etc.) and the concept (like description on items, see pear (Q13099586)).

« Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo » is very easy to process as it's only 3 lexemes (the proper noun, the noun and the verb) and always the non-inflected form (so in theory no need to specify the form). Maybe the French « Les poules du couvent couvent » (in English "Convent hens are brooding") is a better example as here for the second "couvent" we need to specify the form (3rd person plural). That said, I love where I think you are going with this, but isn't it for an external tool? (something like textrazor but with lexemes instead of items?). And If I'm not mistaken, this has nothing to do with glosses, yes?

Cdlt, VIGNERON (talk) 11:19, 31 May 2018 (UTC)

@VIGNERON: Thanks for posting the link to textrazor, I didn't know it and it looks as something it could be interesting for our projects if it could be editable. I'm not sure if it is better as external tool or as a tool for the community, but I suppose that if it were editable I would prefer the content to reside in our servers. I find interesting that you said that non-inflicted forms do not need to specify the form. Is that the case?

So in your view glosses should be used to disambiguate meaning, but not to express it because we have properties/items for that. That is a valid view, however I feel it limited because what you call "the signal" is very small in this case. And it is so because our way of working cannot handle more. I would prefer to have a system that could be grown into an even bigger repository of structured data, so that by contributing in one Wikipedia I would contribute to all of them. Does that dream appeal to you? --Micru (talk) 13:10, 1 June 2018 (UTC)

Rare and pra-languages

Latest comment: 6 years ago4 comments3 people in discussion

how should I enter some languages like Slovio (Q36819) (please correct Lexeme:L2069) or Proto-Slavic (Q747537) (Lexeme:L2072)? Which "language of lemma" should I choose/which code should I use in forms? --Infovarius (talk) 13:17, 31 May 2018 (UTC)

+ Polabian (Q36741) (Lexeme:L2077), Hittite (Q35668) (Lexeme:L2090). Or without code: Proto-Balto-Slavic (Q1703347) (Lexeme:L2083) + Proto-Indo-European (Q37178) (Lexeme:L2087). For a while I use a code "ru" if nothing correct is possible. --Infovarius (talk) 13:42, 31 May 2018 (UTC)

I would remove the part of (P361) = Slovio (Q36819) Lexeme:L2069 since Slovio (Q36819) is already stated in the lemma.

Based on the previous example and discussion, I put the IETF tag "mis-x-Q36819" for Slovio (Q36819) (I hope this part would be improved soon @Lea Lacroix (WMDE):

.

What do you think?

Cdlt, VIGNERON (talk) 16:20, 31 May 2018 (UTC)

+1 with Vigneron. Lea Lacroix (WMDE) (talk) 10:42, 1 June 2018 (UTC)

Lexemes with most statements

Latest comment: 6 years ago1 comment1 person in discussion

We don’t have querying support for lexicographical data yet, but I wrote a Quarry query to at least find the lexemes with the highest number of statements (including statements on forms and senses): query. The current top lexeme is вода, populated by Cinemantique and Infovarius. --TweetsFactsAndQueries (talk) 11:59, 1 June 2018 (UTC)

How to reference notability of lexemes

Latest comment: 6 years ago4 comments3 people in discussion

The notability barriers of lexemes are very low which is fine, but I would welcome criteria and best practice how to avoid made-up words. In my opinion every lexeme should be referenced in some way, the question is how to do it. To give some examples:

‎*wódr̥ (L2087) is an assumed word from Proto-Indo-European (Q37178). This should be referenced by linguistic literature
satanarchäolügenialkohöllisch (L129) is an artificial word created by author Michael Ende (Q76498) in the book Der satanarchäolügenialkohöllische Wunschpunsch, 1st edition (Q54166669)
sí/شِ (L2129) is an obviously existing word but this is only obvious with basic knowledge

Where can references to existing dictionaries, linguistic literature or other sources best be put to reference that the lexeme does actually exist? -- JakobVoss (talk) 10:20, 1 June 2018 (UTC)

You can have allok at Wikidata:Lexicographical data/Notability as a basis for discussion. Pamputt (talk) 12:48, 1 June 2018 (UTC)

Thanks, that's helpful. However, I was more looking for which kind of statements to use to source the notability of lexemes. For instance I now found described by source (P1343). -- JakobVoss (talk) 20:33, 2 June 2018 (UTC)

+1. described by source (P1343) is used for persons and it seems to work pretty good. Additionally, I use quotation (P1683) to prove that a word exists and to show in what content it is or was used. --Zitatesammler (talk) 21:28, 2 June 2018 (UTC)

Alternative forms: orthography reform

Latest comment: 6 years ago3 comments3 people in discussion

"Überschuss" (de; Lexeme:L2170): I would like to add "Überschuß" as alternative form (Wiktionary: "pre-1996"; German orthography reform of 1996 (Q666027)). Would German spelling in the 20th century (Q1203728) be the correct grammatical feature? --Zitatesammler (talk) 15:19, 2 June 2018 (UTC)

Yes, it seems correct, so "de-x-Q1203728" as the spelling variant (and not as the grammatical feature). « pre-1996 » is not a standard code, so I don't know what it means exactly (and assuming that it's not the Principense (Q36520) as written in 1996). Cdlt, VIGNERON (talk) 08:49, 3 June 2018 (UTC)

« pre-1996 » might not be a standard code but it's meaning is clear: The official (Duden) German spelling before the reform of 1996. Imho we need more precise items than German spelling in the 20th century (Q1203728). Till then we could use German Orthographic Conference of 1901 (Q2031873) and German orthography reform of 1996 (Q666027). --Kolja21 (talk) 01:35, 4 June 2018 (UTC)

Lexems <-> Wiktionary

Latest comment: 6 years ago4 comments3 people in discussion

Tracked in Phabricator
Task T195411

May be I something lost but I can't find a discussion about linking with Wiktionaries. Each Lexeme should be trivially linked with a (Extension:Cognate) set of Wiktionary pages and vice versa. It looks not hard. When will it be done? --Infovarius (talk) 14:32, 28 May 2018 (UTC)

@Infovarius: That is being tracked in the Phabricator ticket provided. I suppose that if we had a special page listing all lexemes/forms with the same spelling, then we could link that page to the Wiktionaries with the Cognate extension.--Micru (talk) 07:37, 29 May 2018 (UTC)

Ok. I've forgotten that 1 Wiktionary page corresponds to many Lexemes, ok. But each Lexeme has only 1 corresponding Wiktionary page (at each language edition). Is it planned to add something like sitelinks to Lexemes? Infovarius (talk) 10:12, 4 June 2018 (UTC)

@Infovarius: nope, a Wikidata lexeme can also correspond to multiple Wiktionary pages (the lexeme Lexeme:L1122 correspond to en:wikt:dog and en:wikt:dogs). If there is something like sitelinks, it could only be on an external tools or on a special page. Cdlt, VIGNERON (talk) 10:32, 4 June 2018 (UTC)

Language of Lemma / Lexeme

Latest comment: 6 years ago1 comment1 person in discussion

When I created my first lexemes, there was three fields in special:newlexeme Lemma, Language of lexeme and grammatical category. Sometimes this page wasn't submited in first try. But now sometimes after unsuccesfull submit fourth field appears: Language of Lexeme. And now I am confused - what is the difference between these two languages? And problem - how to select language which is not in the list for lexeme language? (maybe some people was talking about this recently, but I see this field for the first time now) JAn Dudík (talk) 07:02, 4 June 2018 (UTC)

Valencian

Latest comment: 6 years ago10 comments5 people in discussion

My mother tongue, Valencian (Q32641), is considered by scholars a variety of Catalan (understood as the group of languages spoken in Catalonia, the Balearic Islands and Valencia), however 52.4% of Valencian speakers [1] do not consider Valencian (understood as the group of languages spoken in Valencia) the same as Catalan (understood as the group of languages spoken in Catalonia), versus the 41.1% who do consider them the same (I assume by "the same", they mean mutually intelligible). In my opinion, the main root of this controversy is that Catalan (the group of languages spoken in Catalonia) has taken the same label as Catalan (the group of languages spoken in Catalonia, the Balearic Islands and Valencia). This might have been caused by a higher prominence of scholars of Catalan origin reflecting their view, without taking into consideration the identitary feelings of Valencian speakers. Some parties have attempted to correct this with mixed success, for instance there is a Catalan-Valencian-Balear dictionary (Q3026521), and there was a time when Etnologue was referring to the language(s) as Catalan-Valencian-Balear (we have the item Catalan-Valencian-Balear (Q8342780) as a result of this), now they use it only as alternate name. Valencian is officialized and it has a regulatory body (Acadèmia Valenciana de la Llengua (Q1468503)), and there was a proposal for it to get an independent language code. The proposal from SIL international [2] was to:

Change scope of [ca] / [cat] from individual language to macrolanguage
Change reference name of [ca] / [cat] to “Catalan-Valencian” or “Catalan (macrolanguage)”
Add new code element for Valencian
Add new code element for Catalan (individual language), excluding Valencian

In my view this is a fair solution, although I would prefer "Catalan-Valencian-Balear" as reference name for [ca] / [cat] to keep consistency with existing literature. But as of now, I do not know which code to use to refer to Valencian or to Catalan as individual languages, because I haven't found any official documents that refer to these codes. I assume that for Valencian I could use "ca-valencia" (there is this template on en-wiki, so it could be that it is already sort of official). However, afaik there is no language code for the Catalan spoken in Catalonia. I supose that following the same pattern as ca-valencia it could be called "ca-catalonia" (and same for "ca-balear"). In practical terms it would mean that common lexemes would be labeled as "ca", while particular lexemes to each variant would be labeled as "ca-catalonia", "ca-valencia" or "ca-balear" respectively, with an additional item to indicate dialects for instance "ca-valencia-Q2858334". There is at least a dialect shared between "ca-catalonia", and "ca-valencia", that is Q3571132, for it I guess it would be enough to indicate ca-Q3571132. Do you think this is a workable solution? I will inform about this also on the mailing list of Amical Wikimedia (Q16943393) so we can have more input on this topic.--Micru (talk) 17:42, 31 May 2018 (UTC)

I wonder why the languages are tagged with stange code in the Lexeme while it would be possible to use Q-item and thus to solve a lot of code issues, like this one. For the reste, I let the others reply. Pamputt (talk) 20:50, 31 May 2018 (UTC)

Pamputt, I agree with you. Maybe we should open a discussion to see if it is possible to remove language codes altogether and use only items (that at least can have more nuances for topics like this).--Micru (talk) 06:51, 1 June 2018 (UTC)

When entering language as code, it is easy, because no need to remember some Q123 and easy to typing for longer names or similar names like Slovak/Slovenian. But also sometimes problem with suggester (cs/scb)

When entering as name, there might be sometimes problem with suggester suggesting based on language code (Esperanto vs. es (Spanish)). JAn Dudík (talk) 07:06, 1 June 2018 (UTC)

A code will never be clearer that a full name. Pamputt (talk) 12:46, 1 June 2018 (UTC)

I've reached this talk from a Catalan/Valencian related conversation, so maybe my point has already been dealt with somewhere else, but, my opinion is that a solution should be used in this case (Catalan-Valencian-Balearic) that is common to other similar ones. Many other languages have dialects that have their own rules and ruling bodies. Kurdish even uses two different scripts! Swiss authorities use up to five varieties of Rumantsch. And the same goes for English (lift/elevator), German (Strasse/Straße), Spanish (tú sabes/vos sabés) and many more. So rather than a Catalan/Valencian solution, a wider one should be implementented. B25es (talk) 05:31, 3 June 2018 (UTC)

@B25es: Thanks for your input, I also would like to have generic solutions. In this case I think that this and other cases would be addressed by replacing codes by items in the language interface as stated in the section below.--Micru (talk) 09:58, 8 June 2018 (UTC)

For clarity, I'm answering here a question that Micru asked below « How would be encoded in BCP 47 the cases stated in the section above titled "Valencian"? ». First, in BCP 47 there can be subtleties so there not one unique and definitive answer. And I don't know precisely the Valencian situation. That said, here are codes available :

mis-x-Q32641, who means that Q32641 does not correspond to a variant nor close to any known ISO 639 code
ca-x-Q32641, who means that Q32641 is a variant or close of the ISO 639 code 'ca'

It would maybe need more reflections and obviously it might change if there is a new ISO 639 code for Valencian. Cdlt, VIGNERON (talk) 09:06, 8 June 2018 (UTC)

@VIGNERON: Thanks for answering my question here, but so far there is no code for Valencian, so the codes proposed are not a valid solution:

mis-x-Q32641 does not consider the view that Valencian and Catalan belong to the same macrolanguage.
ca-x-Q32641 does not consider the view that Catalan can be seen as individual language spoken only in Catalonia

So another solution is needed, like replace language codes by items.--Micru (talk) 09:58, 8 June 2018 (UTC)

@Micru: Why are they not valid? First, the BCP 47 codes are for the lemmata, not for the lexemes and your concerns seems to be about the lexeme not lemmata. Your two considerations are strange, how did you came to these? do you mean that the information are wrong in Q32641? The BCP 47 private codes only says what the items says. In both case, with the BCP including the item (mis-x-Q32641/ca-x-Q32641) or directly with item Q32641, the problem is the same, no?

As always on wikimedia project, the best solution is to look at references, for instance what code the websites in Valencian use (internet websites could only use BCP 47 codes) Acadèmia Valenciana de la Llengua use "ca-ES", vives.org only "ca" and www.racv.es only the invalid code "Val". I see that the IANA propose ca-valencia. Cdlt, VIGNERON (talk) 12:06, 8 June 2018 (UTC)

Pronominal forms

Latest comment: 6 years ago10 comments3 people in discussion

How should we describe the pronominal forms of nouns, verbs etc.? Pronominal forms are common in most (maybe all) Semitic languages and at least in some Turkic and Finno-Ugric languages too. The point is, that the flection (usually in the form of prefix) enables the word to express some relation (genitive, possesive, accusative...) to a personal pronoun. Some examples:

	noun in nominativ sg.	noun in nominativ sg. + pronoun 1st person sg.	noun in nominativ sg. + pronoun 1st person pl.	noun in nominativ pl. + pronoun 1st person sg.	noun in nominativ pl. + pronoun 1st person pl.
English equivalent	friend	my friend	our friend	my friends	our friends
Hebrew	יָדִיד	יְדִידִי	יְדִידֵנוּ	יְדִידַי	יְדִידֵינוּ
Arabic	خَلِيل	خَلِيلِى	خَلِيلُنَا	خُلَّانِى	خُلَّانَا
Hungarian	barát	barátom	barátunk	barátaim	barátaink
Turkish	dost	dostum	dostumuz	dostlarım	dostlarımız

The intuitive way would be adding the corresponding grammatical categories to the form: "nominativ" + "singular" + "pronominal form" + "1st/2nd/3rd person" + "singular/plural" (+ "gender" if needed), where the later categories from "person" on don't describe the word itself, but the connected pronoun. Alas, this is not possible, since the "singular/plural" category of the noun would interfere with the "singular/plural" category of the pronoun. For pronominal forms of verbs the problem would be even bigger, since the "person" category would collide too.

Possible solutions that came on my mind:

creating a set of items describing more specific grammatical categories ("pronominal form of 2nd person singular masculine" a s. o.),
creating a parallel set of generic grammatical categories for connected pronouns ("2nd person of connected pronoun", "singular of connected pronoun", "masculine of connected pronoun" a s. o.).

Personally, I don't like either of them. If anybody has another suggestion, I'd be glad to read about it. --Shlomo (talk) 19:57, 5 June 2018 (UTC)

@Shlomo: very good question. @Njardarlogar: make a simpler but similar question on #Missing items for grammatical features « I also note that we have e.g. first-person singular (Q51929218), but isn't first person (Q21714344) + singular (Q110786) good enough? ». While the parsimony principle is usually good, here, I'm wondering if the disconnected tags will be clear enough and comprehensible or not. On a different case, I had a similar question for Lexeme:L114-F3 : is « grammatical features : dual (Q110022), plural (Q146786) » understandable? by the way, here a little test for people who tried the grammatical features, without looking at a Breton grammar or looking at Lexeme:L114 other forms, how do you understand « dual, plural »? Cdlt, VIGNERON (talk) 21:01, 5 June 2018 (UTC)

« dual, plural » = paucal (Q489410) ;) I'm afraid the forms of Lexeme:L114 (br: lagad) are not "grandma fit" (Lexeme:L2349). --Kolja21 (talk) 21:34, 5 June 2018 (UTC)

@Kolja21: wrong answer, try again

. But nice try, paucal (Q489410) technically exists in Breton - even if very few grammar books talks about it - for this grammatical feature I used « singulative (Q1450795), plural (Q146786) », see Lexeme:L62 for an example.

Yes, Lexeme:L114 forms are not really omatauglich, but how would you do it to be grandma-compliant?

Cdlt, VIGNERON (talk) 08:35, 6 June 2018 (UTC)

I would start with a help page. Even though I never will learn Breton I would love to see more examples and read about the linguistic backgrounds. A list with the grammatical properties would be helpful: a) properties for every language, b) properties only for special languages. Is there an existing page which we can complete or do we need to create a new one? --Kolja21 (talk) 16:12, 6 June 2018 (UTC)

@Kolja21: good idea, documentation is always a good thing. No idea where and how to put it exactly but I guess a general page with sub-pages by lang would be a good start (and I can write for the /Breton sub-page).

Going back to Shlomo original question: with « dual, plural » my intention was to express « plural of a dual », in this case "daoùlagadoù"@br means « several pairs of eyes » (by opposition to "daoùlagad" « one pair of eyes » and "lagadoù" « several eyes - but not in pair »). Should I create an item « plural of a dual »? (same thing for « plural of a singulative » ; and if created, should I add « paucal » to it?) or will the help page be enough to make the combination of tags understandable?

Cdlt, VIGNERON (talk) 16:33, 6 June 2018 (UTC)

A « plural of a dual » is a paral (Q1754546). The definition in German WP: "Der Paral (lat. paralis) bezeichnet in der Sprachwissenschaft einen Numerus, der ein natürlich gegebenes paarweises Vorkommen ausdrückt, wie bei Augen, Händen, Schuhen etc." --Kolja21 (talk) 16:48, 6 June 2018 (UTC)

For « plural of a singulative » you can use collective noun (Q504952). Sorry, again German: "Das Walisische zum Beispiel hat dagegen Substantive, deren Grundform die Mehrzahl ausdrücken (Kollektiva)." (de:Singulativ) --Kolja21 (talk) 17:00, 6 June 2018 (UTC)

@Kolja21: thank you, I didn't know about the paral (and I don't recall seeing it in Breton grammar, which raises the question of the references) but that make sens. That said, if I understand correctly, paral is just a subclass of dual and here "daoulagad" is the paral, so I still need a « plural of a paral » for "daoulagadoù", no? For « plural of a singulative » collective noun (Q504952) doesn't fit "gwez"@br = trees is the collective (I used collective (Q694268) but maybe collective noun (Q504952) is better ), "gwezenn"@br = tree is the singulative and "gwezennoù"@br = some trees is the plural of he singulative (and now, I realised that you're right, I do need to set up an explanation page

). Cheers, VIGNERON (talk) 17:09, 6 June 2018 (UTC)

Thanks for the introduction into the problematic of paral, paucal, collectives and singulatives. Now, any ideas how to solve the problem of pronominal forms?--Shlomo (talk) 13:05, 8 June 2018 (UTC)

Letters: Language vs. Script

Latest comment: 6 years ago4 comments4 people in discussion

I would like to add the letters of the alphabets. More than 60 Wiktionarys have an article for the letter "A". The documentation only talks about words, phrases, and prefixes.

Quote form above thread ("A Chinese character is Lexeme or Item?"):

'A' is a letter not of a specific language, but of a specific script.

So how can we add letters as lexicographical data? One way would be allowing to chose between "Language" and "Script":

Lexeme A Script Latin alphabet Lexical Category letter

Any other suggestion? --Kolja21 (talk) 21:55, 6 June 2018 (UTC)

I will not answer to your question but a letter is of course part of a script but it is also part of a language. For example "ø" is only used in some languages. Pamputt (talk) 05:51, 7 June 2018 (UTC)

I don't think letters are lexicographical data, they are not lexemes and can have neither forms, nor grammatical categories like gender, nor senses. I'ld recommend adding letters as Q-items and not L-items, I expect we will find a way to integrate Q-items with Wiktionaries eventually. Ain92 (talk) 19:27, 7 June 2018 (UTC)
- Sure, letters are not typical lexemes but they share a lot of traits with lexemes, they can have different language (C'h (Q2344267) is a letter only in Breton), forms (lowercase and uppercase), homographs, etymologies, etc. For now, it's maybe best to only keep them only as Q-items but we need more thoughts and point of view on this. Cdlt, VIGNERON (talk) 09:53, 8 June 2018 (UTC)

Special pages now display the lemma

Latest comment: 6 years ago3 comments2 people in discussion

Hello all,

Several of you mentioned the fact that Lexemes appearing on special pages were only identified by their L-number. It was tracked on this ticket. This problem is now fixed. We replaced the display of "Lxxx" by "Lemma (Lxxx)" (or Lemma1/Lemma2 in the cases where several lemmas are entered) on the following pages:

And possibly some others that I didn't mention here.

Note that this is a first fix. Some editors mentioned that they would prefer a format like "Lemma (lexical category, language)". If you have suggestions about how this should be formatted, feel free to open a new ticket and give details.

If you notice any bug or page where it's not correctly working, you can ping me in the previous ticket. Thanks, Lea Lacroix (WMDE) (talk) 08:03, 7 June 2018 (UTC)

Great, thanks for this "first fix" which seems to works pretty well already (I can't wait to see more improvements!). Cdlt, VIGNERON (talk) 10:47, 7 June 2018 (UTC)

Update: we currenty encounter some issues with the right-to-left languages. This will be fixed on next Wednesday. Lea Lacroix (WMDE) (talk) 08:49, 8 June 2018 (UTC)

Link to Special:NewLexeme

Latest comment: 6 years ago7 comments4 people in discussion

I see a link to Special:NewItem at the left panel but subj. Can anyone with rights add the link? --Infovarius (talk) 13:07, 7 June 2018 (UTC)

Very good point, this link is indeed needed; but isn't it a bit too soon? I would like to have the search function working before make the creation more public (in order to avoid duplicate and especially as we don't have the merge function either). Cdlt, VIGNERON (talk) 13:37, 7 June 2018 (UTC)

This has been discussed in this ticket and this one, we suggested to wait for RC integration (done) and search integration (not done yet), so we have better tools to monitor the increase of new Lexemes that the link will bring. Lea Lacroix (WMDE) (talk) 13:45, 7 June 2018 (UTC)

Link from Special:NewItem

A link on NewItem might be worth it too. If someone has a suggestion for a nice wording, it could go on MediaWiki:wikibase-newitem-summary.
--- Jura 06:54, 8 June 2018 (UTC)

@Jura1: I don't understand your first sentence, there is already a link to NewItem (for more than 5 years). For the second sentence, yes, indeed we need a text on MediaWiki:Wikibase-newlexeme-summary (page which is empty but already called by Special:NewLexeme). Cdlt, VIGNERON (talk) 08:34, 8 June 2018 (UTC)

Sometimes it helps if one re-reads the text and the topic of a thread. There is no need to comment on every topic on Wikidata if it's too much a struggle.
--- Jura 08:47, 8 June 2018 (UTC)

My bad, I didn't notice that you were speaking of linking on (as everybody was speaking of linking to). Now I understand but I still have the same caveat: it's a bit too early but then yes, creation pages should link to each others. The wording could be something like « This page if for [concept|words] stored on [Item|Lexeme], for [Lexeme|Item] go to Special:New[Lexeme|item] » Cdlt, VIGNERON (talk) 09:38, 8 June 2018 (UTC)

Lexicographical properties template

Latest comment: 6 years ago3 comments3 people in discussion

I have created {{Lexicographical properties}}:

Please feel free to expand it, or subdivide the entries as needed. Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 15:01, 7 June 2018 (UTC)

What about a link to Wikidata:Property proposal/Lexemes? :) That could encourage more people to go and discuss about the property proposals, some of them seem quite blocked. Lea Lacroix (WMDE) (talk) 08:54, 8 June 2018 (UTC)

I've added "Phonetics". Other subdivisions could be:

Wikidata property for lexemes (Q54254515)
Wikidata property for items about languages (Q20824104)
properties that are being considered for deletion = under discussion (senses)

--Kolja21 (talk) 12:41, 8 June 2018 (UTC)

New list: verb categories

Latest comment: 6 years ago2 comments2 people in discussion

Wikidata:Lists/verbs/categories

In case you were looking for them.
--- Jura 08:46, 29 May 2018 (UTC)

List: noun categories

Wikidata:Lists/nouns/categories

JFYI - yet another category list, a little bigger. --Infovarius (talk) 14:32, 9 June 2018 (UTC)

Replacing language codes by items in the user interface

Latest comment: 6 years ago39 comments9 people in discussion

Recent conversation like this or this suggest me that it is complicated to work with language codes. There are many cases where there is no language code, or the language code itself doesn't represent the reality of the linguistic system where the lexeme/form originates. There are considerations about script variants, spelling reforms, regions of use, linguistic variants, that are complex to represent with language codes. As we dig deeper and deeper in the nature of each single lexeme/form the usability of language codes decreases. However, luckily in Wikidata we have the alternative of items. Items can contain language codes, can be considered subclasses of other items that in turn can contain other language codes, and I believe that they can offer an alternative to language codes at least for the user facing Wikidata.org proper. For data reusers, it should be possible to generate the language codes according to the items where possible. From my perspective I would prefer to drop language codes in Lexeme/Form "language" and "spelling variant" fields and use the same system as now it is used in the field "Grammatical features". I would like to have some input about this (or support/oppose with justification so when can create some consensus).--Micru (talk) 07:06, 1 June 2018 (UTC)

Oppose. Every lexeme should have language. And every "allowed" langage should have code. There must be some system. Without system there will be thousands of "languages", but many of them are in fact only dialects or "I-say-this-is-dialect"s. But system should accept codified variants like de-AT or pt-BR. Serbocroatian is considered as macrolanguage (Q152559) [3], as language is obsolete. sh=sr,hr,bs,cnr as one language is en.wikt POV. JAn Dudík (talk) 07:31, 1 June 2018 (UTC)

You say "Without system there will be thousands of "languages"", and in it is true there are thousands of languages. In reality every person has their own language, so there are 7 billion languages, however many people choose to understand each other and recognize each other as having enough points in common as to consider themselves as "speaking the same language" (even in reality it is not "the same", only reasonably mutually intelligible). Sometimes new agreements are found, or speakers decide that they do not want to understand the speakers of a language in a similar language variety, so in general codes are problematic (also items). The question is if we can get more flexibility with items to reflect changing situations, and if it is more user-friendly to use items or language codes.--Micru (talk) 09:20, 1 June 2018 (UTC)

After rewording it sounds better. If you mean replacing lang code in the top with name of language, I see it not necessary. Replacing with Name of language (code), sounds good. Input should be done both with code and name. 10:26, 1 June 2018 (UTC)

@JAn Dudík: When you say that it sounds better, do you mean that you feel comfortable removing your oppose tag, or are there more points that you need to address first? I do not understand what you mean by "Input should be done both with code and name", do you mean that the entity suggester should behave differently in those fields and prioritize language codes/names?--Micru (talk) 12:10, 1 June 2018 (UTC)

Oppose my point of view is that the current system is good and sane ; it's not perfect but we just need to make it easier to use. « There are many cases where there is no language code » with the "mis-x-Qid" you can add whatever you want and be as precise as you want, so there is always a code available (it would be nice to enter the name and get the Qid instead of entering the Qid itself, but that's a minor improvement that can be done once the all the bigger issues would be solved). True, sometimes, it can be complex to choose the "perfect" code but in the vast majority of cases, there is no problem and the code are useful so we should keep them. Without code, how would you make the distinction between two lemmas and tell which one is what? Getting rid of the code feels a bit like "throwing out the baby with the bath water" and doesn't seems to solve anything. Cdlt, VIGNERON (talk) 07:49, 1 June 2018 (UTC)

The code is just a name, same as an item ID, it is just a name. We can say that a lexeme is written in "Qid", and in that item we can specify which language code it has. The main difference is in user-friendliness, and usefulness. I do not find language codes user friendly, and I do not find useful having language codes that I cannot understand. I prefer to have a selector where I can type the name of the language, and have a link to an item where I can get more information about the code. Without the code you still would have items to specify the language. From your knee-jerk reaction I realize that I should have framed the title as "Replacing language codes by items in the user interface". I will make the change now.--Micru (talk) 09:20, 1 June 2018 (UTC)
- @Micru: oh, ok, thanks for the rewording and I agree, the system itself is good but the interface should have a better and more user-friendly rendering but I still think that's not the biggest issue right now. And the code is not just a name, it's a international recognized standard. I think it's important for alignment with things outside Wikidata. Cdlt, VIGNERON (talk) 09:39, 1 June 2018 (UTC)

@VIGNERON: It might be as "official" as you want, but for me it is still a name, and as any name it can be aligned with our name for the language (item). Why is it different for you to use codes in lexemes instead of items that contain the same codes? --Micru (talk) 09:51, 1 June 2018 (UTC)

The question is not just about the language and not just for me but for everything and everyone. A standard (which is not at all "official") widely used among lexicographers seems more suitable for lexicographical data. And structurally, IETF codes can be much more precise and more flexible than item : fr-x-Q815549 or mis-x-Q815549 allow to introduce a subtleties that wouldn't have Q815549 alone (consider it as a dialect of French or as a language in itself, here it's a dummy example of course). Cdlt, VIGNERON (talk) 09:52, 3 June 2018 (UTC)

@VIGNERON: Can you please explain how a tag has more subtleties than all the statements that an item contains?--Micru (talk) 11:56, 3 June 2018 (UTC)

@Micru: the IETF BCP 47 tag is made of subtag that you can combine. With the private use, you can also combine items. So mathematically, there is more tag then item (no matter how many item you create). With more tag, you can be more precise. I gave just above the example of fr-x-Q815549 or mis-x-Q815549 but you can go further with the script: fr-x-Q815549-Latn and fr-x-Q815549-Arab / mis-x-Q815549-Latn and mis-x-Q815549-Arab, and so on. True this can be also done with statements but as it is basically the same information, I find it way better to be together in one only place (and later when query will be available, it is easier to query one fields instead of checking if a property is there or not). Cdlt, VIGNERON (talk) 12:37, 3 June 2018 (UTC)

@VIGNERON: I still don't follow your reasoning. You say that tags are more flexible, and you gave the example of fr-x-Q815549 and mis-x-Q815549. How are you going to specify that both tags are valid? Or are they going to be entered twice for each lexeme? It seems to me that if you enter the item, then in the statements you can give as much "contradicting" information as you want. You can consider it a language AND a dialect, at least that is the idea behind Wikidata, that we can reflect different points of view without preferring any. With the language code you are being forced to choose a point of view, so in my opinion that forces us away from a plural point of view.

You also say that you can go further adding the script to the code, but why not have a field for the script instead? Wouldnt't make things easier for querying? In my opinion, it looks more complex to perform queries if all the tags are inside a language code...--Micru (talk) 21:08, 3 June 2018 (UTC)

For the validity, technically the system only allow valid tags (like the statements only allow valid Qids) and for veracity and uses, it depends on the sources.

Querying strings is not much more complicated than querying statements. I do daily queries on labels and descriptions to correct mistakes or make improvements. The system is not fully in place, we can't query efficiently anything right now for Lexemes so I cannot know for sure but I don't see any problem here. On the other hand, following a standard (like Wikidata follow the RDF standard of the query use the SPARQL standard) has a lot of inherent advantages for users.

And has already done on Wikidata on multiples occasions, you can have both tags and statements, a bit of redundancy can be good and useful ; not too much but here, indeed statements can be useful too (for checking for instance).

Cdlt, VIGNERON (talk) 21:33, 3 June 2018 (UTC)

@VIGNERON: Maybe I haven't explained myself properly, I will try again. Let's imagine that I want to enter "dîner" in Belgian French (Q815549). You say that both fr-x-Q815549 and mis-x-Q815549 are valid. Which one to use then? Or both? If I use only one I'm forcing a POV, if I'm using two it might be more clear but there is redundancy. However, if I just use the Qid, and inside the item I enter a statement for all the IETF BCP 47 codes of Belgian French, then I do not make a choice. And if more language codes are created in the future for that variant, they can go to the item too.

You say that querying strings is no more complicated that querying items, and I agree. However, you speak of Belgian French (Q815549) as if it is a unit, but it has in turn dialects. If I enter that a word is used in "Belgian French from Brussels", which language code will you use then? And how will you make sure that when you query the lexemes of Belgian French, also the lexemes of "Belgian French from Brussels" will appear too?

Regarding "you can have both tags and statements". How? Where?--Micru (talk) 07:03, 4 June 2018 (UTC)

When to use which code is a matter a references, we should simply do what the references says as always.

For Brussels, BCP 47 code already exists, BE-BRU is the ISO 3166 code for Brussels. So if you want fr-BE and fr-BE-BRU, just ask everything starting with fr-BE (or be more specific, that's up to you to query what you want). BCP 47 has already been thought to identify every string of every possible texts (in including the most exotic things you can imagine), and with the private use zone where we can use Wikidata items, there is nothing that can't be encoded.

« you can have both tags and statements » refers to what we already do on Wikidata, there is plenty of properties duplicating information from the label (all the naming properties) or description (instance of (P31) for starter). Properties are needed when we need a qualifier or references (which is not possible for label or description).

This discussion is already very long and is going nowhere: BCP 47 is the perfect system (or the closest to perfect that exists), I - personally - see no reason not to use it.

Cdlt, VIGNERON (talk) 22:37, 4 June 2018 (UTC)

I don't know if it's on purpose but "dîner" is actually a very interesting example, I will look how to model these two lexemes (dîner in "standard" French is in the evening meal - dinner@en - but in Belgium and other areas, it's the midday meal - lunch@en).

@VIGNERON: I still have one last question before ending this conversation, sorry that is so long but I want to make the most out of it :)

How would be encoded in BCP 47 the cases stated in the section above titled "Valencian"?--Micru (talk) 20:54, 6 June 2018 (UTC)

Thank you for replying there.--Micru (talk) 22:57, 8 June 2018 (UTC)

Comment per Wikidata_talk:Lexicographical_data#Spelling_variants, I think we should remove the mandatory distinct value constraint on these codes instead.
--- Jura 09:01, 1 June 2018 (UTC)
The notion that it is only language codes that represent language is a fallacy. For instance all of American English is en-USA. This is a combination of a language code and a country code. The problem with letting go of combinations of language codes with associated standards is that it becomes extremely easy to confuse content. The notion that a code is a just a label is another falacy. They are codes issued by ISO therefore they refer to standards. No disambiguity in that.

Thanks, GerardM (talk) 09:53, 1 June 2018 (UTC)

@GerardM: I do not understand your message. What is your posture? Do you feel language codes should be replaced by items? When you say "all of American English is en-USA", do you find that inappropriate? You say that "The notion that a code is a just a label is another falacy", why do you consider so? It is an agreed code (label) to refer to something by an organization, but which difference does it make for you? In general I find your message hard to understand because you do not state your point of view in terms that I can understand (for instance by saying "I think that..."), I would appreciate more clarity. Thanks.--Micru (talk) 10:10, 1 June 2018 (UTC)

I do not posture. The key thing for me is that whatever we use is backed by language codes. Without them it will be impossible to have an integrated system for lexical data for the expression of languages, dialects etc. Standards may be considered not concise enough but they prevent the endless bickering, splitting and lumping that you get otherwise. There is no user friendliness when it is not extremely obvious what it is you consider to be the language of the expression. Labels for languages exist in multiple languages and that in itself is already where ambiguity starts. Thanks, GerardM (talk) 11:19, 1 June 2018 (UTC)

@GerardM: What do you mean by "I do not posture"? Do you mean that you don't take a posture? If you don't take a stance, why should I consider what you are saying? So if I understood correctly, you prefer standards because then you can use it as a weapon to impose (the standard's will) on other people? Do you consider the labels in multiple languages problematic? --Micru (talk) 12:17, 1 June 2018 (UTC)

@Micru: Consider what "posture" means. Given what it means it is quite offensive. When you use words as "weapon" you use antagonising language. Yes I want us to impose some standards because the alternative is hopeless. PS I have been there and done that. Thanks, GerardM (talk) 12:24, 1 June 2018 (UTC)

@GerardM: The link you have posted ("posturing") has two definitions: (1) intention to impress or mislead, (2) place (someone) in a particular attitude or pose. Which one are you referring to? What is offensive? Is using antagonising language negative? If so, which other language could I have used? What is the "hopeless alternative" that you have experienced? Sorry that I make so many questions, but I really would like to understand your point better.--Micru (talk) 12:36, 1 June 2018 (UTC)

@GerardM: I want to add that I was using "posture" as a noun, not as a verb. In this case "posture" has two definitions, and I was referring to the second one "a particular approach or attitude". So when I asked "What is your posture?" I meant "what is your approach?"--Micru (talk) 12:46, 1 June 2018 (UTC)

Oppose. As far as I know, it was never said that language in lemma or form is ISO code. There is another standard that is a much better fit: BCP: 47. It is a common standard and it already gives you a lot of flexibility including `privateuse` part in which you can put "what-ever-you-want". Bekh6ex (talk) 10:33, 1 June 2018 (UTC)

@Bekh6ex:, what is the advantage of using BCP: 47 directly vs. using an item and putting the BCP: 47 code (together with other codes) in the item?--Micru (talk) 12:20, 1 June 2018 (UTC)

@Micru:, The main advantage is the guarantee that standard code present for Lexeme. Linked Item might not have code defined or the code might be changed/vandalized and it won't be displayed on Lexeme. Also, taking SPARQL into account: if we want to find all Lexemes of a certain language, English for example, and then we will have many specific items like "English spoken in 1990s in the south of USA" it will require user to spend quiet some time to understand how to write this query and the result becomes quiet unstable, because clusters of words may disappear from the result set because of only single edit that will touch some property value on one of the language Items. Bekh6ex (talk) 12:38, 7 June 2018 (UTC)

@Bekh6ex:, I think your concern applies to the whole of the Wikidata system, which is in fact very unstable, and at the same time remarkably stable. What you are saying about "cluster of words that may disappear" applies verbatim to items too, so I don't think you are bringing any new concern to the Wikidata system, it is known and we still live with it. Your concern about vandalism and not noticing changes in codes apply to all kind of IDs that are stored in Wikidata, so again nothing new. Summing up, I believe that your concerns are just rehashed concerns about the whole Wikidata system, which are understood and accepted as such.--Micru (talk) 08:49, 8 June 2018 (UTC)

Comment Please note that we opened a ticket: Decide on a way forward for acceptable languages for lemmas and representations to discuss about several options to have an list of acceptable languages. If you could also add your input and ideas there, that would give us a better perspective of the options! Thanks, Lea Lacroix (WMDE) (talk) 11:33, 1 June 2018 (UTC)
Support I let a comment on the Phabricator ticket. Pamputt (talk) 13:23, 1 June 2018 (UTC)
Support Wikidata is about structured data, this allows lexemes to be more automatically integrated with the rest of wikidata rather than having to interpret and parse string codes. ArthurPSmith (talk) 14:58, 1 June 2018 (UTC)
And I note that the Notability page (linked below) states "Any language is included as long as they are notable as Wikidata item." which suggests we should pull our language list directly from wikidata items, not an intermediate code list. ArthurPSmith (talk) 15:19, 1 June 2018 (UTC)
@ArthurPSmith: in the end, IETF tag and Qid are both strings. Which one would be easier to parse? (and not just for Wikidatians, for every possible re-user). And I don't the BCP 47 as an intermediate but more as an extension and an improvement of the Wikidata item which allow to be much more precise. Cdlt, VIGNERON (talk) 09:07, 3 June 2018 (UTC)

Outcome of the conversation

I believe that I have made a thorough effort to understand the concerns raised to my proposal of replacing language codes by items in the user interface. I also believe that I have heard, understood, and addressed those concerns. The way I see it is:

There are concerns about Wikidata not aligning with existing language standards. That concern is addressed by understanding that we can incorporate any code into a Wikidata item, and that we can create any item to represent any language (even if there is no language code for it yet).
There are concerns that Wikidata items are not flexible enough to represent the same information as BCP: 47. That concern is addressed by requesting/allowing the input of different items to represent different qualities of the lexeme/form (language system, script, spelling reform).
There are concerns about Wikidata items being too flexible and allowing too much freedom to editors. Those concerns are addressed by acknowledging that it might seem scary to allow editors to create and represent their own version of the truth, yet individual freedom must be respected if we want a truly open and collaborative project.
There are concerns that the same problems that affect Wikidata (vandalism, changes in structure) might be present by using items instead of language codes. That concern is addressed by acknowledging that it is an issue of Wikidata, that we indeed have to live with if we want to guarantee flexibility and openness of the project.
There are concerns that external re-users won't be able to use our data if we don't use language codes. That concern is addressed by explaining that from the item linked the re-user can access the language code. And not only BCP 47, but any other code that is present in the item and that the re-user chooses to use.

This is of course a my interpretation of the concerns raised, and my way of dealing with the concerns raised. If someone believes that I mischaracterized their concern or that I have not addressed it properly, please do speak up so we can discuss it further. However, I do believe that my concerns have not been heard, understood, and addressed properly, those concerns are:

I have the concern that we are not believing enough in the power of our project and our codes (Q-items) to represent reality. I believe that if we use external language codes, we will be relying too much on what external parties say that is the right way to represent reality, without valuing that our method to represent reality is at least as useful, valuable, and reputable as the system that external sources use (I would even dare to say our system is better). This concern is addressed by using items instead of language codes.
I also have the concern that no language code can address the situation that I raised above in the section called "Valencian", because no codes exist to address this particular case and perhaps this applies to other particular cases that might exist. This concern is addressed by using items instead of language codes.
I also have the concern that language codes display an unclear meaning to the user. This concern is addressed by using items instead of language codes.

So seeing all the concerns raised and the way they have/have not been addressed, and also considering the advantages expressed by Pamputt here, I believe that the right course of action to take here is to replace language codes by items in the user interface. I have not heard the possible concerns from the Development Team, but I believe that Léa can do that for us, and bring their concerns here if necessary.

Pinging all participants in the conversation: @JAn Dudík, VIGNERON, GerardM, Bekh6ex, Pamputt:@ArthurPSmith, Lea Lacroix (WMDE):

With my best intentions.--Micru (talk) 09:48, 8 June 2018 (UTC)

Thanks Micru for summarizing. I will ask input from the development team and bring it back in the next days. Lea Lacroix (WMDE) (talk) 11:33, 8 June 2018 (UTC)

It shows that you do not know the standards. It is perfectly possible to have a code for "Valencian". Consequently what you have written is a mishmash of a summary and personal opinion. Thanks, GerardM (talk) 13:33, 8 June 2018 (UTC)

@GerardM: As said, it is my personal interpretation, and you can challenge any part that you consider that needs further discussion. It might be possible to have a code for "Valencian" and for "Catalan" (as individual language spoken in Catalonia, not the macrolanguage), but as of now they do not exist, and as such the standard does not fulfill my need.--Micru (talk) 13:45, 8 June 2018 (UTC)

It is possible. Given your attacking style last time, I leave it ãt that. However, it is not about your need. It is about what serves us all best. Thanks, ~~

@GerardM: I did not attack you, there was a misunderstanding from your side with the meaning of the word "posture" that I tried to clarify (see my comments above). If you felt attacked, please do explain where, or how, so that we can clarify it because it is not my intention neither to attack you nor to offend you. You say that "It is about what serves us all best", and I agree because I include myself in the "all", so if something has to serve us "all", it has to serve me as well. --Micru (talk) 14:02, 8 June 2018 (UTC)

Whatlinkshere aka number of lexemes in langages

Latest comment: 6 years ago8 comments5 people in discussion

When I want to see how much lexemes is in which language, there is no easy possibility. Page special:Whatlinkshere does not show any page in NS 146 [4]. JAn Dudík (talk) 12:48, 24 May 2018 (UTC)

We have phabricator:T195302 to fix that. --Lydia Pintscher (WMDE) (talk) 15:30, 24 May 2018 (UTC)

My Ordia webservice has a rudimentary, work-in-progress and not-up-to-date list. For Danish here: https://tools.wmflabs.org/ordia/language/da. You can switch to your language of interest by editing the URL, e.g., for "cs": https://tools.wmflabs.org/ordia/language/cs — Finn Årup Nielsen (fnielsen) (talk) 11:09, 27 May 2018 (UTC)

Hey @JAn Dudík:, see also this tool showing the number of Lexemes per language :) Lea Lacroix (WMDE) (talk) 09:37, 6 June 2018 (UTC)

@Fnielsen: Could it be possible to refresh your Ordia webservice? Thanks in advance. KaMan (talk) 11:05, 11 June 2018 (UTC)

@KaMan: I have just updated Ordia, e.g, https://tools.wmflabs.org/ordia/L2742 (may be the latest) or https://tools.wmflabs.org/ordia/language/da — Finn Årup Nielsen (fnielsen) (talk) 12:48, 11 June 2018 (UTC)

Note there are currently(?) something wrong with the entity search on Wikidata. This problem also affects the Ordia search: You basically cannot see newly created lexemes. :( (I cannot seem to find a Phabricator task for this problem) — Finn Årup Nielsen (fnielsen) (talk) 13:18, 11 June 2018 (UTC)

I have now added https://phabricator.wikimedia.org/T196896 — Finn Årup Nielsen (fnielsen) (talk) 13:45, 11 June 2018 (UTC)

Citation

Latest comment: 6 years ago11 comments4 people in discussion

Hello y'all,

Right now, 4 lexemes use quotation (P1683) (list). This property was not designed for Lexemes and right now it has a constraint limiting it to references. I fully and totally acknowledge the need for such a property but the current use seems crooked. Hacking a property outside of its proper use is not solution : either, we change the constraint on P1683 or we create a new property. A third possibility, it to link to the document and to use quotation (P1683) in references (without any changes to the property). That what I proposed on Wikidata:Property proposal/attested but only two persons commented on it.

There is also the level question: should it be on the lexeme, on forms or on senses? (or any).

@Zitatesammler, Kolja21: as user of this property, what do you think?

Cdlt, VIGNERON (talk) 08:24, 9 June 2018 (UTC)

I think the third possibility is the best but I don't understand fully the necessity of a new property. Why not to use described by source (P1343)? And also there is some strange argument in the proposal: why do you suppose that "actor"@en, "actour"@en and "actress"@en are the same lexeme? I am sure that at least "actor" and "actress" should be at different lexemes. Infovarius (talk) 12:56, 9 June 2018 (UTC)

@Infovarius: very good points. For described by source (P1343), it's possible if we repurpose a bit the property, eg. the description could be changed from « dictionary, encyclopaedia, etc. where this item is described » to « document where this item or lexeme is described » and the constraints adapted. I would slightly prefer a new property for clarity but it's not a strong will nor need, both seems good to me.

For "actor/actour" and "actress", the example can be a bit confusing but my question still stand: does attestation refers to lexemes, forms (or even senses). I fell it's often better at the forms level as documents contains lemmata, not lexemes.

For the question: « is it one or two lexeme? » it's a different matter that we need to sort out. There has been some discussion already (on the demo system, there was the Leiter/Leiterin in German with both one and two lexemes for simulation of the model) but no definitive answer as far as I know. If two lemmata have the same properties, is it really two lexemes? And, if there is differences (and if you dig deep enough, it's always possible to find difference), how much differences are needed to flip from one lexeme to two lexemes? (see the example of Lexeme:L2331 and Lexeme:L2332 where I considered that the differences - senses and etymology - are big enough to make two lexemes but for "actor/actour/actoress/actress" I tend to see it as inflected forms of the same lexeme)

Cdlt, VIGNERON (talk) 13:27, 9 June 2018 (UTC)

Do we need to double all properties? Imho a "quote" is a quote. "Described by source" has even "dictionary" as example in the description. Is a quote from the Dictionary of Modern Written Arabic (Q4117629) an other type of quote if quoted in a different namespace? The description define a property. If the constraint limiting does not work, let's change it. --Kolja21 (talk) 15:22, 9 June 2018 (UTC)

And the second question: "... should it be on the lexeme, on forms or on senses?" Do we have senses? No. I'm happy for every source or example (property missing) that is given. First an editor needs a quote, than he might think about how to interpret the quote. --Kolja21 (talk) 15:41, 9 June 2018 (UTC)

@VIGNERON: Imho also the property image (P18) like you used it in Lexeme:L2330 (tour) is helpful. It gives an idea why the lexeme has been created. Later, when we have more possibilities, an other editor might move the picture to senses. --Kolja21 (talk) 22:38, 9 June 2018 (UTC)

image (P18) should probably not (almost never?) be used on the lexeme level. For instance "tour" may have the sense tower, but it has other senses, e.g., "Tour de France". IMHO we should generally wait on putting images on the lexeme pages until we have senses. — Finn Årup Nielsen (fnielsen) (talk) 13:15, 11 June 2018 (UTC)

@Kolja21: I didn't suggest to double *all* property, my suggestion is just that maybe a few property would be better to be split (a new attestation property - different from what exists as far as I can tell - and maybe citation). Repurposing a property is obviously a good idea too but it need discussions and consensus for each and every property we want to repurpose. It could take a lot of time and convincing where creating a new property is a bit more easy.

I did a quick try on Lexeme:L2785, and it seems a bit strange. Attestations doesn't described a word, they just use it. And the lexeme level feels wrong too, the forms level would be better, no? (or the sense level when it will be available).

@Fnielsen: absolutely, image (P18) (a good example of property that clearly don't need to be duplicated) should always be on the senses level. Meanwhile, I think it's ok to use it on lexeme level, it can be very useful and helpful. BTW, in "[Tt]our de France", "tour" is not Lexeme:L2330 nor Lexeme:L2332 but Lexeme:L2331 (and that's an example of how images can help to disambiguate homographic lexemes while we're waiting for the senses).

Cdlt, VIGNERON (talk) 12:36, 12 June 2018 (UTC)

I think we have an issue here: Is a lexeme a single sense or are there multiple senses attached to each lexeme? The data model [5] and the Leiter example [6] suggest multiple sense for one lexeme, so Lexeme:L2330, Lexeme:L2332 and Lexeme:L2331 should be the same lexeme. A Danish encyclopedia defines "leksem" as an abstract unit with different for forms [7] (so one could read that a leksem can have multiple senses). Lemon [8] shows multiple senses for a "ontolex:LexicalEntry". For instance,

   :troll a ontolex:LexicalEntry ;
     ontolex:denotes <http://dbpedia.org/resource/Troll> ;
     ontolex:denotes <http://dbpedia.org/resource/Internet_troll> .

The English Wikipedia [9] writes "A lexeme belongs to a particular syntactic category, has a certain meaning (semantic value)" seemingly suggesting one sense. I am not sure English Wikipedia is right. — Finn Årup Nielsen (fnielsen) (talk) 13:53, 12 June 2018 (UTC)

@Fnielsen: a lexeme can obviously have multiple sense (and if the data are available and granular enough, I would say that a lexeme always have multiple senses), no question here. But, Lexeme:L2330, Lexeme:L2331 and Lexeme:L2332 is still 3 lexemes as it's 3 totally different words (who happens to have homographics lemmata but they are three different words) which will have all multiple senses (L2330 will have "tower" and "rook", L2331 will "tour", "round", "turn", etc. and L2332 will have "wheel" and "lathe"). More precisely, a lexeme is a lexical unit. So the multiples senses should have a consistency to make it a unit (same data: same lexical category, same etymology, same derivates, etc.). Cdlt, VIGNERON (talk) 14:21, 12 June 2018 (UTC)

Ok, I see now that there is both feminin and masculine tour. — Finn Årup Nielsen (fnielsen) (talk) 16:13, 12 June 2018 (UTC)

Lexical category display on Special:NewLexeme

Latest comment: 6 years ago1 comment1 person in discussion

The label and qid might not be sufficient to pick the correct item. I think either the description or Wikidata usage instructions (P2559) of the items should be displayed as well.
--- Jura 10:57, 12 June 2018 (UTC)

Add input for sense to Special:NewLexeme

Latest comment: 6 years ago1 comment1 person in discussion

It would be good to have that there directly (once senses are available).
--- Jura 10:57, 12 June 2018 (UTC)

Conjugation tables as lexeme ?

Latest comment: 6 years ago16 comments5 people in discussion

Thanks to @Okkn: we got conjugation class (P5186) and word stem (P5187). Logically we would store the forms for each class somewhere. Should these be added as lexemes as well?
--- Jura 06:38, 24 May 2018 (UTC)

We talk about it a bit, see #Conjugation.

Conjugation are just forms, see what is done on Lexeme:L16 for a simple example.

conjugation class (P5186) and word stem (P5187) could be very useful to generate and check these forms. It strange that it's limited to Japanese, all languages have a similar pattern, should we expand these properties or create new ones?

Cdlt, VIGNERON (talk) 07:11, 24 May 2018 (UTC)

I think the above doesn't actually answer it. Besides "aller@fr" isn't really suitable for this feature.
--- Jura 07:26, 24 May 2018 (UTC)

Why not? True, "aller"@fr is a bit complicated as it's suppletive and have 3 stems ; but I think qualifiers can easily indicate explicitly when to use each stem. And "aller"@fr is very irregular but most others verbs (in French and in other languages) are regular (in Breton, for instance, there is only 5 irregular verbs ; see fr:Verbe_irrégulier#Nombre_de_verbes_irréguliers_par_langues for the number of irregular verbs by languages which is very low). Cdlt, VIGNERON (talk) 08:05, 24 May 2018 (UTC)

Interesting thought for aller@fr. Maybe should start with Breton though ;)
--- Jura 08:37, 24 May 2018 (UTC)

I've been bold and started "aller"@fr : Lexeme:L750. I've created a regular Breton verb too : Lexeme:L764 ("labourat"@br which is "to work"@en). There is still of works to do but I see no major obstacles (just some properties that need to be thought and created properly). Cdlt, VIGNERON (talk) 09:22, 24 May 2018 (UTC)

@VIGNERON: actually I've messed up, what did you mean by Lexeme:L764? Is it English phrase or a notion? --Infovarius (talk) 19:10, 24 May 2018 (UTC)

@Infovarius: sorry but I don't understand what you are saying. Lexeme:L764 is not in English, it's not phrase nor a notion. It's a Breton verb. But maybe it's linked to my next question to Jura just below. Cdlt, VIGNERON (talk) 20:14, 24 May 2018 (UTC)

@VIGNERON: sorry, I meant Lexeme:L766. User:Jura1, what type it has? It looks like ill-formatted page now. --Infovarius (talk) 12:14, 31 May 2018 (UTC)

@Infovarius: ohhh, very interresting! Thanks for sharing. I understand the idea and I like it but I'm not sure it's suitable for lexeme. At least the main lemma and the lexical category feels very wrong (no idea for a correct lemma but lexical category probably should be something like suffix (Q102047)). Maybe we could have a separate lexeme for each forms, what do you think. Cdlt, VIGNERON (talk) 12:27, 31 May 2018 (UTC)

I added conjugation class (P5186)=regular Breton conjugation (Q54083637) and word stem (P5187)="labour" to Lexeme:L764 (labourat) and created Lexeme:L766 for regular Breton conjugation (Q54083637). Hope this clarifies it.
--- Jura 09:57, 24 May 2018 (UTC)

@Jura1: thank you but it's still not entirely clear to me. What kind of values is expected in conjugation class (P5186), on Lexeme:L750 the value is a subclass of verb but on Lexeme:L764 the value is a subclass of conjugation. That's inconsistent. Is it on purpose? and if so could you explain please? Cdlt, VIGNERON (talk) 20:14, 24 May 2018 (UTC)

With "not suitable" for this feature, I meant "not applicable". For other but 4 Breton verbs, you should be able to define the stem and query the forms (once Query Server is available.)
--- Jura 07:40, 25 May 2018 (UTC)

Automating conjugation

Some wiktionaries have templates that generate the forms given an input. Could we have a property (for instance "conjugation generator") that would take an item with a qualifier for the stem as an input and then a bot will generate the forms based on that? For instance, in es-wikt there is wikt:es:Plantilla:es.v.conj.er that generates all the forms. You can see that in action for the verb "comer" (wikt:es:comer#Conjugación). For instance we could have something like:

conjugation generator

es.v.conj.er

edit

stem

com

0 references

add reference

add value

What do you think?--Micru (talk) 20:52, 24 May 2018 (UTC)

Why not just use the generators to populate the forms? I'd expect we'd still want to store all the conjugations. Or do you mean in addition to it? --Reosarevok (talk) 10:25, 13 June 2018 (UTC)

In addition to it.--Micru (talk) 11:22, 13 June 2018 (UTC)

Special:Allpages and Lexemes special pages

Latest comment: 6 years ago9 comments3 people in discussion

The next step seems to be that *https://www.wikidata.org/wiki/Special:AllPages?from=&to=&namespace=146 looks more like Special:Allpages, i.e. it will include the "label", but wont be sorted.

I think it would be good to have more separate Special pages directly in MediaWiki for:

Lexemes
Forms

selecting the text

by start of string
by end of string

either for

all languages or
a specific one.
--- Jura 21:03, 1 June 2018 (UTC)

@Jura1: are you referring to phabricator:T195382? Cdlt, VIGNERON (talk) 08:53, 3 June 2018 (UTC)

The question is what should be on Special:SpecialPages for Lexemes (what users get when they install Wikibase on MediaWiki). Personally, I don't think Special:Allpages is of much use for items and lexemes.
--- Jura 19:36, 3 June 2018 (UTC)

Tracked in Phabricator
Task T197145

@Lea Lacroix (WMDE): Given the absence of better suggestions, can you add this somehow to your agenda? Maybe four distinct special pages would be ideal:
- Special:AllLexemes
- Special:AllLexemeForms
- Special:AllLexemesReverse
- Special:AllLexemeFormsReverse
  Each would have a way to filter by language.
  --- Jura 06:48, 8 June 2018 (UTC)

Hello @Jura1:, I will create a ticket. Before that I need a bit more details about what the use would be for these pages.

What would contain the "reverse" ones?
Can you give me one usecase for each of these pages? i.e. a situation where you would need it?

Thanks, Lea Lacroix (WMDE) (talk) 07:12, 8 June 2018 (UTC)

It's fairly common to look up words by the way they are written. w:Special:Allpages at Wikipedia works that way. It's just that this wont work for Lexemes as there it ends up sorting by "L" and a number. At Wikipedia, you can do w:Special:Allpages/Sort.
Reverse would allow to find words with the same ending (e.g. "abort" and "sort").
--- Jura 07:26, 8 June 2018 (UTC)

@Lea Lacroix (WMDE): can you work with that?
--- Jura 07:54, 13 June 2018 (UTC)

@Lea Lacroix (WMDE): I had a look at the ticket, looks it isn't assessed yet. How will it go into your agenda?
--- Jura 05:26, 15 June 2018 (UTC)

It will be analyzed and prioritized by the team. For now, we're focusing on fixing the main user experience issues, and providing Senses. I can't tell you when this specific ticket will be added in the agenda. Lea Lacroix (WMDE) (talk) 09:25, 15 June 2018 (UTC)

Arabic diacritics

Latest comment: 6 years ago14 comments6 people in discussion

Imho Arabic lexeme should be written without tashkil (marks used as phonetic guides).

The lexeme قَلَمٌ (Lexeme:L203) should be changed to قلم (wiktionary:قلم)
The lexeme صِفْر (Lexeme:L12345) should be changed to صفر (wiktionary:صفر)

"The literal meaning of tashkīl is 'forming'" (en:Arabic diacritics). Signs like fatḥah and ḍammah should only be used in the section "forms". --Kolja21 (talk) 21:25, 4 June 2018 (UTC)

@Kolja21: Why not use both? Like for Lexeme:L2287? Cdlt, VIGNERON (talk) 22:05, 4 June 2018 (UTC)

That could work, if we have a rule what form is the basic form. (Otherwise there are too many variants.) Unfortunately Arabic is even more complicated since words like دب (Lexeme:L2320) can be used as a verb and as a substantiv. Should I divide the lexeme into:

دب (verb) = Lexeme:L2320
دب (substantiv) = ...

--Kolja21 (talk) 22:42, 4 June 2018 (UTC)

@Kolja21: Ok. But why would we need a rule for « basic » form? (I don't know well Arabic)

Apparently, we should have 4 lexemes for دب (according to wikt:دب).

Cdlt, VIGNERON (talk) 22:48, 4 June 2018 (UTC)

IMHO we don't need rule for "basic" form. Actualy we shouldn't have, because it would break the NPOV principle. There are both dictionaries using tashkil and dictionaries omitting it. We do need a rule how to mark correctly the representation with and without tashkil. Doing this, he database user can later choose his preferred form(s) via query. The same for Hebrew, Aramaic etc., and maybe also for languages using several different graphic systems like Uzbek, Ladino or Serbian.--Shlomo (talk) 07:52, 5 June 2018 (UTC)

I divided the lexeme into:

دب (verb) = Lexeme:L2320
دب (substantiv) = Lexeme:L2321

A guide (see "phonetic guide" above) is not the same as a rule. So I'm not sure if we can count the 4 versions of the English Wiktionary as 4 different lexemes. A tashkil is not the same as a German umlaut. Apfel and Äpfel are two words. In Arabic tashkils are usually not written. There are just a help for children and foreign readers and every Arabic country has it's own dialect. --Kolja21 (talk) 23:05, 4 June 2018 (UTC)

@Kolja21: I don't Arabic enough so I'm not sure either, but it seems to be 4 lexemes (different etymology, pronunciation, etc). "Apfel"@de and "Äpfel"@de is 1 lexeme, and "tour"@fr is 3 lexemes (Lexeme:L2330, Lexeme:L2331, Lexeme:L2332). Cdlt, VIGNERON (talk) 06:35, 5 June 2018 (UTC)

Well, the "Etymology 3" and "Etymology 4" at en.wikt are not exactly what should be considered different etymologies, in fact they are verbal nouns derrivated from the verb given in "Etymology 1". Nevertheless, I also advocate considering verbal nouns separate lexemes -- supposed there are some reliable sources for them, since the references at en.wikt are not very convincing... An Arabic verb can have several verbal nouns (up to 40) and every verbal noun has it's own set of inflected forms. I can't imagine how to store this complex structure as a single lexeme in the present data model.--Shlomo (talk) 07:39, 5 June 2018 (UTC)

@VIGNERON: I suppose that Lexeme:L2331 and Lexeme:L2332 should be merged - they have different senses but the same (yes?) forms - this perfectly lies into model 1Lexeme->nSenses. --Infovarius (talk) 07:24, 6 June 2018 (UTC)

@Infovarius: maybe but I don't think so, it's two different words with each have several senses that can be grouped in two different set of senses, first group being all related to "action" and the second group related to "tool". For instance and for etymology, if you put them all in one lexeme, how would you say that L2331-S1, L2331-S2, L2331-S-3 derived from lexeme (P5191) "tourner"@fro and that L2332-S1, L2332-S2 derived from lexeme (P5191) "tornus"@la? And in the other way, how to say that Lexeme:L2334 derived from lexeme (P5191) Lexeme:L2332 (but not Lexeme:L2331). See also the previous discussion Wikidata talk:Lexicographical data/Archive/2018/03#One L-item for "tour"@fr: bug or feature?. Cdlt, VIGNERON (talk) 08:31, 6 June 2018 (UTC)

how do existing dictionaries of Arabic handle these diacritics? We should not try to reinvent everything from scratch. -- JakobVoss (talk) 16:09, 5 June 2018 (UTC)

@JakobVoss: as said just above by Shlomo : « There are both dictionaries using tashkil and dictionaries omitting it. » Most wiktionnaries give the two writings too. Cdlt, VIGNERON (talk) 16:20, 5 June 2018 (UTC)

This happens also with Latin and Italian. Most dictionaries include diacritics or accents in lemma entries to facilitate reading but they are not used in plain text. --Vriullop (talk) 15:10, 15 June 2018 (UTC)

Thanks for all the information. Shlomo wrote "IMHO we don't need rule for 'basic' form. Actualy we shouldn't have, because it would break the NPOV principle." We might not need WD:N for lexicographical data but a help page Wikidata:Lexicographical data/Documentation/Languages would be great. We could start with collecting the given examples

German: "Apfel"@de and "Äpfel"@de is 1 lexeme (easy: singular / plural)
French: The "tour"@fr is 3 lexemes (complicated: same letters, same language, same lexical category)

with a space for explanation and comments. --Kolja21 (talk) 18:13, 5 June 2018 (UTC)

Missing items for grammatical features

Latest comment: 6 years ago7 comments4 people in discussion

I don't have a good overview of what we already have (or shouldn't have), but these grammatical features seem to not have their own items:

~~definite form~~ (indefinite number (Q53998049) exists for the indefinite form)
subjective case
objective case

Listing them here before I create anything.

I also note that we have e.g. first-person singular (Q51929218), but isn't first person (Q21714344) + singular (Q110786) good enough? --Njardarlogar (talk) 16:45, 25 May 2018 (UTC)

@Njardarlogar: we have definite (Q53997851) and indefinite (Q53997857) (thanks to Fnielsen), the second one maybe should be merged with indefinite number (Q53998049) (or maybe repurposed a bit "definite in Basque", not sure if it make sense here for these languages).

For first-person singular (Q51929218) vs. first person (Q21714344) + singular (Q110786), I had the exact same question. Both method seems correct and I would tend to prefer the second one.

Cdlt, VIGNERON (talk) 16:53, 25 May 2018 (UTC)

Regarding indefinite (Q53997857) and indefinite number (Q53998049): Should we have one item for each language? Or one common? A feature of indefinite (Q53997857) for Danish is that it (usually?) comes with a -(e)n suffix. That particular feature may be different from other languages. — Finn Årup Nielsen (fnielsen) (talk) 17:10, 25 May 2018 (UTC)

Personally, I don't imagine that we'd need language-specific items for definiteness any more than we seem to need language-specific items for grammatical genders (i.e. we should merge indefinite (Q53997857) and indefinite number (Q53998049)), but I am not an expert. --Njardarlogar (talk) 18:11, 25 May 2018 (UTC)

Well, I'm not sure but definiteness can vary a lot from on language to an other. In some languages, definiteness is lexical (marked by inflection of lemmata, and for some languages the inflection act like a grammatical number and in some like grammatical case...) and in some it's only structural (no inflection of lemmata, using a or several determiners for instance "the dog/a dog" in English). So maybe several items can be useful to represent this variation. That said, anyway, I'm more leaning toward only one item to use in the lexemes. Cdlt, VIGNERON (talk) 18:37, 25 May 2018 (UTC)

A structural (in)definiteness is not a matter of dictionary or lexicographical data. Morphological (in)definiteness (marked by inflection or aglutination) should appear as a property of form; the fact that the form of inflection indicates also gender/case/number/etc. at the same time should not disturb us. A possible lexical (in)definiteness (a property of the lexeme itself - maybe applicable for proper names?) could be probable added as a property of the lexeme.--Shlomo (talk) 08:06, 4 June 2018 (UTC)

I see now that we have oblique case (Q1233197), so do we already have an item for the objective case? --Njardarlogar (talk) 19:39, 15 June 2018 (UTC)

Avoid language domination

Latest comment: 6 years ago16 comments5 people in discussion

A short analysis of which language codes are used how often showed that more than hundred ISO 639-1 codes are not used in any of the 2250 lexeme labels. It would be nice if we get at least one lexeme per label at least before L9999. Contributions of native speakers are best, but it can also be done this way:

Find a dictionary of an uncovered language (Google Books may help to access parts of it).
Add an item for the dictionary (unless it already exists). This will also be helpful to source more words from the dictionary
Select and add a nice word from the dictionary as new lexeme.
Add a reference to the dictionary with described by source (P1343)

I did so with mamati (L2222) in Māori (Q36451), maybe you can guess the meaning. -- JakobVoss (talk) 19:25, 3 June 2018 (UTC)

Not sure if you read the bug report(s) above, but some codes can't be used. The general definition is currently being reviewed/recast. Not sure if I understood Lydia's comment correctly, but the ideal project focus may be languages that don't have an ISO 639-1 code. If you want to compare with one of the Wiktionaries: list.
--- Jura 19:33, 3 June 2018 (UTC)
- @JakobVoss: great idea, thank you for this list! For dictionaries, as books, they should follow the Wikidata:WikiProject Books model and ideally we should link to a specific edition (with edition number (P393) and publication date (P577) in the item, so they're wouldn't be a need to put them as qualifiers every time), not to works. @Jura1: a lot of available code are not used yet. I'll create some for languages I know enough about and that are still missing (I've already created Lexeme:L2282 and Lexeme:L2284 in Welsh (Q9309), @Llywelyn2000: ). And for codes not available right now, there is still the private use that can be used, so all languages can already be added on Lexemes. Cdlt, VIGNERON (talk) 21:55, 3 June 2018 (UTC)
In French wiktionary there are slightly more languages: wikt:fr:Catégorie:Langues. One can take a word per language from those categories. --Infovarius (talk) 10:17, 4 June 2018 (UTC)
@Infovarius: imports from wiktionaries are forbidden (or at best, strongly discouraged). It would be better to start from scratch (and to add a good reference like in JakobVoss example). Cdlt, VIGNERON (talk) 10:51, 4 June 2018 (UTC)
That's new Vigneron, when did you decide that?
--- Jura 09:47, 6 June 2018 (UTC)
Hello @Jura1:, I stated this on behalf of the development team, both in the first announcement in March and the one on the deployment day.

We kindly ask you to not plan any mass import from any source for the moment. [...] We strongly encourage you to discuss with the communities before considering any import from the Wiktionaries. Wiktionary editors have been putting a lot of efforts during years to build definitions, and we should be respectful of this work, and discuss with them to find common solutions to work on lexicographical data and enjoy the use of it together.

I would add that this last sentence is not only about definitions that Wiktionary editors are building, but also about how they structure information in general. Lea Lacroix (WMDE) (talk) 12:24, 6 June 2018 (UTC)
This is different from what you announced earlier and what Vigneron is deciding. Why did you change your mind? Are you considering changing the license for lexeme namespace?
--- Jura 12:29, 6 June 2018 (UTC)
I quoted my own announcements, there is no change. Vigneron is mentionning something based on this announcement as well. Lexeme namespace is licensed in CC0 and will stay that way. Lea Lacroix (WMDE) (talk) 12:40, 6 June 2018 (UTC)
Maybe he misunderstood it then. I'm not sure if you are aware, but Wiktionary communities can't change the license of their content by discussing it.
--- Jura 06:50, 8 June 2018 (UTC)
The announcement says « no import » and I say « no import » and somehow, you think that I did a misunderstanding? Plus, here, you're the only one who speak about license changes. Cdlt, VIGNERON (talk) 08:38, 8 June 2018 (UTC)
I think you misquoted the announcement.
--- Jura 08:47, 8 June 2018 (UTC)
No. VIGNERON (talk) 08:48, 8 June 2018 (UTC)
Can you add a reference for your quote?
--- Jura 08:55, 8 June 2018 (UTC)

@JakobVoss: would it be possible to update your file? Cdlt, VIGNERON (talk) 09:38, 16 June 2018 (UTC)

@VIGNERON: Done, I also added a file with base languages (90 so far) -- JakobVoss (talk) 14:30, 16 June 2018 (UTC)

FAQ clarification

Latest comment: 6 years ago5 comments3 people in discussion

I've tried to edit some passages in the FAQ to make them clearer, but I'm having trouble with these two in particular:

larger communities in a given Wiktionary project will be able to benefit from the data in order to check their data against Wikidata’s data, and at the same time check Wikidata’s data against their own locally stored data.

and

With Wikidata for the other projects we have seen that there was an influx of several thousand new contributors working on Wikidata (about half of the contributors), who have not been active in other Wikimedia projects before.

Can anyone help me clarify what exactly these passages intend to say? --Waldir (talk) 19:22, 26 May 2018 (UTC)

Hi Waldir,

I'll to make it as clearer as possible (some part of the message will be lost in simplification):

Wiktionaries have data, Wikidata have data ; let's compare them both way.

and

Wikidata attracts a lot of totally new users.

Cdlt, VIGNERON (talk) 09:08, 27 May 2018 (UTC)

Thanks for the clarification, VIGNERON. I think the first sentence is somewhat obvious so it doesn't add much to the FAQ; I would suggest simply removing it from its section. The second one is more relevant, but it's worded confusingly. I would suggest changing the sentence

It has been observed that many Wikimedia projects that integrated Wikidata into their editing workflow have received an influx of several thousand new contributors working on Wikidata (sometimes about half of the contributors), who have not been active in other Wikimedia projects before.

to:

It has been observed that many Wikimedia projects that integrated Wikidata into their editing workflow have received a significant influx of new contributors working on Wikidata-related tasks.

What do you think about both these proposals? --Waldir (talk) 11:42, 2 June 2018 (UTC)

@Waldir: It's been two weeks, but this change of yours sounds good. Mahir256 (talk) 20:29, 15 June 2018 (UTC)

Thanks, I'll go ahead and make the changes then :) --Waldir (talk) 19:20, 17 June 2018 (UTC)

English verbs

Latest comment: 6 years ago9 comments6 people in discussion

Let's see e.g. "test". It has a form "test" in one case of present tense but a form "test" in multiple other cases of present tense. What should we do: 1) use one form and add multiple features (first person (Q21714344), second person (Q51929049), plural person (Q51929154)...) or 2) create one form per each feature separately? Other form of question: what should we do with homoforms? --Infovarius (talk) 14:47, 28 May 2018 (UTC)

Good question. The solution 1 seems a bit strange, I fear it wouldn't be comprehensible; the solution 2 is more explicit but isn't it too much? Cdlt, VIGNERON (talk) 18:07, 29 May 2018 (UTC)

I'm also more in favor of solution 2. For the case of homoforms there is phab:T195411.--Micru (talk) 19:12, 29 May 2018 (UTC)

The potential problem of sol.2 is that in some languages some features can be divided very deeply (like cases in Finno-Ugoric languages), should we create homoform per each potential feature? --Infovarius (talk) 13:10, 30 May 2018 (UTC)

IMO yes. I'd like to see all 28 (or 29) forms for Estonian nouns separately, even if some happen to be the same for some specific words.--Reosarevok (talk) 10:16, 13 June 2018 (UTC)

@Reosarevok: actually I meant if you'd like to see 28 (or 29) forms (according to Estonian point of view) in some English word like "cat". I.e. form "cat" as an accusative case, form "cat" as an instrumentative case and so on... @VIGNERON, Lea Lacroix (WMDE): what do you think? --Infovarius (talk) 13:07, 13 June 2018 (UTC)

I think that each word should be described with the grammar that applies to its language :) Lea Lacroix (WMDE) (talk) 13:35, 13 June 2018 (UTC)

As Lea, I'd expect an English lexeme to not include any of these. Possibly with the exception of some pronouns, depending on whether we're going to approach them as cases or as separate lexemes. --Reosarevok (talk) 19:54, 13 June 2018 (UTC)

Does anyone have an example lexeme with a suggestion for an improved data model? There seems to be some degree of agreement here (prefer multiple forms), but test (L397) at least hasn’t been updated to that yet. --Lucas Werkmeister (talk) 12:37, 17 June 2018 (UTC)

French conjugation (in French)

Latest comment: 6 years ago4 comments3 people in discussion

Salut/Hello,

To avoid English domination and as this section concerns conjugation in French, it will be in French.

@Pamputt, Tpt, Infovarius: (et autres francophones que j'ai oublié)

Je viens de compléter à la main les conjugaisons du verbe "aimer"@fr : Lexeme:L47. J'aimerais ) savoir ce que vous en pensez (avec pour intention d'en faire un exemple pour les prochains verbes, voir pour les outils comme Wikidata Lexeme Forms ci dessus). J'ai fait quelques choix qui sont discutables en tout au moins à discuter (et à confirmer ou infirmer, à vous de me le dire) notamment :

entrer explicitement toutes les formes, même homographes ("aime" pour L47-F2, L47-F29, L47-F45).
- sauf les formes des temps composés (qui sont relativement triviales à construire, à part pour le choix de l'auxiliaire qui devrait être indiqué mais comment ?)
indiquer à la fois le temps et le mode
- mais parfois il existe un item spécifique regroupant le temps et le mode… que faire dans ces cas-là ? j'ai mis cet élément ainsi que le mode mais cela fait doublon du coup
j'ai ajouté quelques propriétés dont :
- ⟨ L47 ⟩ conjugation class (P5186) ⟨ conjugation of Group I French verbs (Q2993354)    ⟩
  (et améliorer ce dernier élément)
- ⟨ L47 ⟩ word stem (P5187) ⟨ aim ⟩
  (même si techniquement word stem (P5187) est toujours limité au japonais)
- described by source (P1343) et toutes les éditions du Dictionnaire de l'Académie française (dont une citation en exemple pour la première édition mais je ne suis pas très satisfait de cette structuration, cf. #Citation ; et peut-être officialized by (P5194) conviendrait-il mieux)

Je suis preneur de tout avis, remarques, précisions, manques et corrections.

PS: avec 16898 octets, c'est un des plus gros lexème pour le moment (même si derrière Lexeme:L189 et ses 24890 octets).

Cdlt, VIGNERON (talk) 14:19, 13 June 2018 (UTC)

Merci beaucoup pour cette exemple ! Voici mon grain de sel pour les points que tu as soulevé :

Entrer toutes les formes même homographes me semble la bonne manière de faire (il n'y a pas moyen de faire des disjonctions dans les "grammatical features"). J'espère que dans le futur il y aura une interface utlisateur "compacte" qui rapprochera les homographes.
Pour les temps composés, avoir un statement avec une propriété ad-hoc pour l’auxiliaire me semble une manière simple de faire.
Pour temps vs temps+mode c'est probablement quelque chose à harmoniser entre les langues (ce n'est pas spécifique au français). Je serait plutôt pour éviter les items temps+mode pour avoir quelque chose de plus facile à réutiliser (pour exemple, voici ce que fait UniversalDependencies pour les temps et les modes.
J'éviterais d'utiliser described by source (P1343) qui est très flou et assez fourre tout. Utiliser officialized by (P5194) ou quelque chose de spécifique me semble mieux et plus "utile". Tpt (talk) 12:18, 18 June 2018 (UTC)

@VIGNERON: Pour le présent, ne devrions nous pas utiliser present indicative in French (Q20977689) (spécifique au français, comme les autres éléments de temps que tu as utilisé) au lieu de present tense (Q192613) ? Tubezlob (🙋) 18:09, 18 June 2018 (UTC)

@Tpt, Tubezlob: merci pour vos retours. Je reprends les points :

ok pour les homographes, cela semble faire consensus
ok, propriété à proposer à la création
je savais que ce point ferais débat. J'avoue avoir du mal à peser voir clairement le pour et le contre. N'hésitez pas à développer vos arguments (voir à ouvrir une discussion spécifique sur ce point qui lui n'est pas spécifique au français).
bonne question mais officialized by (P5194) me semble vraiment étrange, un dictionnaire "officialise"-t-il vraiment ? (surtout qu'officialité c'est contextuel, le Dictionnaire de l’Académie française peut éventuellement être considéré comme officiel pour la France - et encore - mais pas vraiment pour le français, si ?) et l'officialisation peut-être vraiment être multiple ? (une fois officialisé, il me semble qu'il n'y a pas besoin de ré-officialiser de nouveau ou à nouveau). Ceci dit, described by source (P1343) n'est pas vraiment satisfaisant non plus. La propriété attested in (P5323) vient d'être crée, je la voyais plus spécifique (pour les attestations mais finalement la présence dans une dictionnaire est aussi une forme d'attestations) et pour les formes et pas pour les lexèmes, pensez-vous qu'elle pourrait convenir ici ?

Cdlt, VIGNERON (talk) 18:26, 18 June 2018 (UTC)

literal translation (P2441)

Latest comment: 6 years ago5 comments4 people in discussion

I noticed that ameččuk (L2446) has mappings to seven other languages via the property which titles this talk page section. Is this how we should be making links between words until a more formal mechanism is established? (@Reda Kerbouche: as the author of the item.) Mahir256 (talk) 00:49, 16 June 2018 (UTC)

Hi Mahir256 and Reda Kerbouche,

Well, I understand the idea but not, this is probably not how it should be done. First, literal translation (P2441) is intended as qualifier only, that should be changed if we want to use it on lexemes (if we want to use it on lexemes). Then, translation should be at the sense level (not available right now) not at the lexeme level. I've fixed some others details on this lexeme.

That said, I've got a general question (I've seen it for other properties like image (P18)) for all lexical-data-graphers, could and should we temporary store some sense level statements at the lexeme level until the senses are available? (I see pros and cons and I'd like other points of view)

Cdlt, VIGNERON (talk) 08:13, 16 June 2018 (UTC)

Temporarily using properties on lexemes that should be used on senses would increase the risk that someone puts in a lot of effort on populating lexemes with statements using sense properties without knowing that such statements will be removed eventually; and of course, someone or something will have to (re)move all these statements once senses are up and running.

I don't have a strong opinion, either way. At the very least, I think it can pay off to have a clear idea of where specific properties should be used. We might want to warn to users that are frequently using sense properties on lexemes that the statements are likely to be removed later. --Njardarlogar (talk) 11:39, 16 June 2018 (UTC)

@Njardarlogar: yes, I agree! It can lead to big problems. On the other hand, meanwhile, it can be very useful. So we are two to be on the fence, any other point of view? Cdlt, VIGNERON (talk) 17:57, 17 June 2018 (UTC)

Whether or not sense properties are allowed for now, I would say literal translation (P2441) should not be used; rather translations should be handled by a lexeme(sense)-valued property - see Wikidata:Property proposal/translation. ArthurPSmith (talk) 15:09, 18 June 2018 (UTC)

Cannot select item from "Grammatical features" suggester on tablet

Latest comment: 6 years ago2 comments2 people in discussion

Tracked in Phabricator
Task T197573

@Lea Lacroix (WMDE): I tried to enter this bugreport in Phabricator but I'm getting Access Denied "You do not have permission to edit task policies." when I'm trying to post it there. So I enter my bugreport below. Feel free to copy this to Phabricator.

I use newest Chrome for my tablet (build 67.0.3396.87, Android) in desktop mode of the interface. I have no problem with suggesters on Special:NewLexeme page (like selecting Language of Lexeme). I have no problem on newly created lexeme page when I select properties and theirs values for lexeme. The problem comes when I'm trying to enter any form to this lexeme. I enter string representation of the form and language of the representation and then click with my finger on "Grammatical features" field. Then I start typing string of some item (for example Q146786). Suggester finds this item and highlights it. Then I select this item from suggester with my finger by single click on it. Then strangely whole page reloads and page with Q146786 is opened. I lost page with my lexeme. This way I cannot select anything with this suggester on my tablet.

KaMan (talk) 09:53, 16 June 2018 (UTC)

Thanks for reporting this problem! I created a ticket. Lea Lacroix (WMDE) (talk) 08:20, 18 June 2018 (UTC)

Support Klingon

Latest comment: 6 years ago11 comments6 people in discussion

I'm surprised not to find a Klingon word among the easter eggs lexemes, did I miss something? Looks like Klingon cannot be chosen as lemma language, so how to fix this? The ISO 15924 code is "Piqd". Anyway, if I were responsible for introduction of Lexemes I'd rather be worried as Klingons get angry for less reason! -- JakobVoss (talk) 18:54, 29 May 2018 (UTC)

Klingon (Q10134) has also a ISO 639-3 code: tlh. It could be used by Wikimedia projects. Pamputt (talk) 19:16, 29 May 2018 (UTC)

From the criteria for inclusion of the English Wiktionary: “Languages originating from literary works should not be included as entries or translations in the main namespace, consistent with the above. However, the following ones should have lexicons in the Appendix namespace: Quenya, Sindarin, Klingon, and Orcish.”

But as to having Klingon words in Wikidata, why not? The language itself is notable after all (Klingon (Q10134)). -- IvanP (talk) 19:25, 29 May 2018 (UTC)

@JakobVoss: Devil's advocate: what is the state of Klingon copyright? (I remember that Paramount made some claims that they own the copyright) Cdlt, VIGNERON (talk) 20:12, 29 May 2018 (UTC)

There is no copyright on words. Trademark is a different issue. -- JakobVoss (talk) 21:00, 29 May 2018 (UTC)

About notability of languages, something was written. The inclusion is rather broad, and Klingon words are thus welcomed here. Pamputt (talk) 05:53, 30 May 2018 (UTC)

The question is, has the House of Paramount enough bat'leths to protect its POV with a moQbara, or will they just whine at FIPO's doorsteps and twirl like a half-dead racht?--Shlomo (talk) 16:36, 7 June 2018 (UTC)

@JakobVoss: as a measure of QA I added a klingon lexeme: qI'yaH/ (L3403). I only wasn't able to choose klingon as a lexeme language. I'm not sure of this is wrong though. After all it is a transliteration into english (latin characters) unless we would enter it using Conscript encoding--Loominade (talk) 13:36, 19 June 2018 (UTC)

Thanks Loominade, a transliteration of a language is still in this language, so I change the code to Klingon. And you should definetly should add the pIqaD via ConScript ;) Cdlt, VIGNERON (talk) 14:04, 19 June 2018 (UTC)

@VIGNERON: how did you do that? --Loominade (talk) 14:11, 19 June 2018 (UTC)

@Loominade: just enter the sequence of characters as a secondary lemma, like on ama/𒂼 (L1) for example. For the code, until a better solution, you can use mis-x-Q6421045. Cdlt, VIGNERON (talk) 14:16, 19 June 2018 (UTC)

Indicating inflection classes

Latest comment: 6 years ago7 comments2 people in discussion

So we have conjugation class (P5186) for verbs, but what about other word classes, like nouns and adjectives? Is there a reason that we have a property exclusively for verbs, when we instead could have a general inflection class property? Should we have separate properties indicating the inflection class for nouns, adjectives etc.?

Another interesting question is notability and references for such classes. For example, you might notice that a large number of lexemes in a language have identical inflection patterns, but no dictionary explicitly indicates any class for them, such that we have no references explicitly confirming the existence of such a pattern. On a wiktionary, we can simply create a template for this inflection pattern regardless of references - but such a pattern may not qualify for an item here, however? --Njardarlogar (talk) 17:14, 17 June 2018 (UTC)

Hi @Njardarlogar:,

This is very insterresting questions. Do you have an example of class for nouns? The absence of references is a big problem but it can be overcome (technically, an item can be created without references, see point 3 of WD:N, but it should be done wisely and should be use with a strong consensus) ; are you sure that no researcher in the field of natural language processing (Q30642) defined such classes? In the end, what you're describing seems more like something for a tool, not for stored data.

Plus, conjugation class is not exactly an inflection class. Inside one class, verb can have different system of inflection. For exemple in French, there is 3 classes : conjugation of Group I French verbs (Q2993354), conjugation of Group II French verbs (Q2993353), conjugation of Group III French verbs (Q2993358). The 2 first classes are mostly regular (not perfectly, but the lemma of the infinitive and the word stem (P5187) can help to find most the inflections), the third class is just all the irregular verbs with almost no pattern.

Depending on the answers, Okkn proposed conjugation class (P5186) for Japanese verbs but it was extended to all languages, maybe we can extend it to a general property for other categories of lexmes if there is a need.

Cdlt, VIGNERON (talk) 17:54, 17 June 2018 (UTC)

In the Nynorsk Dictionary, adjectives, nouns and verbs may all belong to inflection classes; for example a1 and a2 for adjectives, v1 and v2 for verbs and m1, f1, and n1 for nouns.

Then, as an example, we have some masculine nouns that have two alternative inflections, both of which are regular; but where the second inflection pattern has not been given its own name. Two examples from this group are sau and kobra (the unnamed inflection is -en, -er, -ene; the pattern -en, -ar, -ane is known as m1). On the English Wiktionary, I created a template for this group and named the (combined) pattern m2.

I think I have heard a name for the unnamed inflection pattern in some contexts, but then I think the name describes a group of words in Germanic languages more broadly, so I am not sure if this name is directly applicable to this inflection pattern (for example, the pattern in Norwegian Nynorsk also applies to nouns of non-Germanic origin, like kobra).

I don't see why you would not want to store inflection classes at lexemes. They would allow you to do things like querying for other lexemes with the exact same inflection. For the entirely regular inflection classes, it would also enable the user to generate all the forms of the lexeme themselves, so that they don't have to rely on the forms having been entered at the lexeme entity. --Njardarlogar (talk) 18:35, 17 June 2018 (UTC)

@Njardarlogar: thanks for the explanations. Are these a1 a2 classes an invention of this Dictionary? (and isn't it protected by copyright, see It is prohibited to copy from this book on the disclaimer).

It's not that I don't want to store this in the data, it's just that I'm wondering about the best way to store it. I totally understand the checking with request (not so much the generation as it's more a one time thing and are needed to store others data that can't be generated on the fly), but couldn't it be deduce from other properties and data? (I don't know Norwegian at all, for the two languages I know: in French flexions is easy, so the rules can be embedded directly in constraints: plural from is "-s", except if already ending with an "s", "x" or "z", except totally irregular exceptions - but class won't help here - ; in Breton - where there is dozens of plural suffixes depending on a lot of not always clear criteria - it's more difficult but a class is not helpful and queries can still do some preliminary checkings).

Cheers, VIGNERON (talk) 20:12, 17 June 2018 (UTC)

The dictionary is copyrighted, yes. It is supposed to reflect the decisions of the Language Council of Norway regarding spelling - no more, no less. The only original work in the dictionary should be definitions, examples, layout and similar. The grammatical classes tend to ultimately correspond strongly to the grammatical classes of Old Norse, and more recently and more directly to the Landsmaal published in the 1840s and 50s. Names like a1 and a2 might be their invention, however - I don't know how old those are.

I would say that for Norwegian, no, you generally cannot deduce the inflection for any class of words just by looking at the words. If you know the grammatical gender of a noun, that might sometimes be enough, but often it is not (for example ris and gris are both masculine, but have different inflections). --Njardarlogar (talk) 21:00, 17 June 2018 (UTC)

To test things out, I've created regular masculine, -ar indefinite plural (Q55088277) and regular neuter (Q55088370) and used them with conjugation class (P5186) at førar (L3287) and førarkort (L3288). --Njardarlogar (talk) 19:42, 18 June 2018 (UTC)

Interresting. If this is accepted (which sounds a good idea), then we should extend conjugation class (P5186) to all lexemes (and not just verbs) and other adaptations, like we did for word stem (P5187) (which again seems logical as both these properties are about morphological description). Any objections? Cdlt, VIGNERON (talk) 08:37, 19 June 2018 (UTC)

Spelling variants

Latest comment: 6 years ago15 comments6 people in discussion

Apparently, it is not possible to create a form with more than one representation in the same language. I wondered how to handle cases where there are several different spellings of the same form. For example, the German word Lexeme:L811 (meaning “nightmare”) can be both spelled Albtraum and Alptraum. It feels wrong to have two forms with the same grammatical features and statements for it. --Dominic Z. (talk) 20:15, 30 May 2018 (UTC)

I would also like to be able to make statements about different spellings. For instance, Albtraum has been established by the orthography reform of 1996 while Alptraum is a still valid pre-reform spelling. There is also the problem of determining when to speak of different spelling variants of the same word, cf. my remarks (“Was ich noch gerne hätte, ist eine klare […]”) here. -- IvanP (talk) 07:07, 31 May 2018 (UTC)

@Dominic Z., IvanP: but, if this is a different spelling, then it's a different form, no? And with two forms, you can use different lang codes (de and de-1996 ideally but de-x-Q666027 since de-1996 is not yet available). Cdlt, VIGNERON (talk) 07:52, 31 May 2018 (UTC)

Why is there the possibility to add multiple spelling variants (to the same Form), then? It already works with different language codes (now done). OK, de-1996 indicates that the spelling is valid according to the reformed orthography (actually amended up until 2017; should the code also be used for spellings introduced after 1996, like Clementine?) but there are also spellings that ceased to be valid (e.g., daß is invalid in the 1996 orthography, Katarr was deleted from the official word list in 2011, Wandalismus in 2017), there should be a way to indicate this. Note that de-1901 and de-1996 are described as redundant in the IANA Language Subtag Registry. -- IvanP (talk) 09:35, 31 May 2018 (UTC)

@IvanP: yes, indeed this solution you applied works well too (and maybe better). Both solutions seems more or less equivalent, I guess time will tell which one is best. And for the specificities, will probably need properties to be precise (maybe officialized by (P5194) can be used, you should ask on Property talk:P5194).

The IANA registry indicate a lot of subtag as redundant but that's for their context, in Lexemes we need to be more precise (especially as the system doesn't allow only one representation per language but for distinguishing two representations too, if we say Albtraum@de and Alptraum@de how to tell which one to use when? Albtraum@de-1996 and Alptraum@de-1901 is obviously needed here).

Cheers, VIGNERON (talk) 10:12, 31 May 2018 (UTC)

If you use some random subtag, doesn't that lead users to misidentify the variant just to ensure it passes a constraint check? Shouldn't individual variants be referenced in the Forms part instead?
--- Jura 10:21, 31 May 2018 (UTC)

@Jura1: sorry I don't understand, could you explain? The subtag is not random at all and I don't think the variant is misidentified, how would you correctly identify the variant? And the individual variants are already referenced in the forms (take a look at Lexeme:L811). Cdlt, VIGNERON (talk) 11:25, 31 May 2018 (UTC)

It might be easier if you try to create Lexeme based on the indications given by IvanP (or a comparable sample in another language).
--- Jura 15:20, 31 May 2018 (UTC)

Ok, done : Lexeme:L2127 (I'm not converting Lexeme:L114, still waiting for a decisive argument for one of the two methods). I still don't understand a word you said on the message above. Where is the randomness or the misidentify you're talking about? Cdlt, VIGNERON (talk) 16:06, 31 May 2018 (UTC)

I think you assume there is a distinct language code or at least a distinct item for the Form you attempt to describe. Reading the comment and sample given by IvanP, this doesn't seem to be the case. You can obviously attempt to use a random (distinct) language code or create a new item. For your sample, even if a spelling matches Q54555486 or Q54555509, this doesn't necessarily mean that this is the only accepted characterization of a spelling and therefore it may be misleading.
In line with our general approach to constraints, I don't think there should be a mandatory distinct value constraint for language codes of possible multiple spellings of a single lexeme.
--- Jura 09:37, 1 June 2018 (UTC)

Yes I assume that different lemmata are different, I don't see how different lemmata would be the same. I've re-read you multiple times and don't get you at all (I'm not even sure who you try to talk to, please indent correctly to make it clearer). EOT for me and for now unless you could be clearer. Cdlt, VIGNERON (talk) 10:18, 3 June 2018 (UTC)

Why do we not choose to create two differents Lexemes; one for Albtraum and one for Alptraum? What are the advantages creating only one Lexeme? Pamputt (talk) 08:50, 5 June 2018 (UTC)

@Pamputt: for me the obvious answer is « because it's not two lexemes », it's two spelling of the same word. If you split this lexeme in two, you would have to duplicate a lot of data (senses, translations, etc.), which would be inefficient and contradictory to the aim of Wikidata to centralize data. That said, I'm not entirely convinced by IvanP solution to put two representation for the same form. I think that storing Alptraum and Albtraum in two different forms will be better if you want to add statements; for instance a quote of the date of first known attestation only concern the lemma Alptraum or Albtraum, not both (that's why I split F1:lagadoù and F5:lagadeu on Lexeme:L114, which is not ideal either). Cdlt, VIGNERON (talk) 10:05, 8 June 2018 (UTC)

For Danish, I have run into a couple of these problems: Lexeme:L3830 has one form that can both be "tornystret" (Lexeme:L3830-F2) and "tornysteret" according to [10]. Furthermore, Lexeme:L3073 can both be "en", "een" or "én" (in the form for the common gender) [11]. — Finn Årup Nielsen (fnielsen) (talk) 22:36, 21 June 2018 (UTC)

@Fnielsen, Dominic Z.: that is why I preferred to enter variation as different forms (like I did on lagad (L114)), it's maybe not ideal but adding different variation to the same form seems too confusing. I think that anyway, separate forms are need for other informations (like pronunciation or citation). Cdlt, VIGNERON (talk) 07:50, 22 June 2018 (UTC)

How to enter proper nouns

Latest comment: 6 years ago12 comments6 people in discussion

(pinging KaMan)

How should we enter proper nouns? To me it seems natural to set the lexical category as proper noun (like this) rather than using a statement with instance of (P31) (like this). A proper noun is a noun, so using such a statement would technically be a duplication of information. We would also been using both the lexical category field and the statements section to store information on the lexical category rather than using just one of them (though are there examples were storing more complex information about the lexical category necessitates the use of the statement section?). --Njardarlogar (talk) 09:35, 22 June 2018 (UTC)

@Njardarlogar, KaMan:, yes, using proper noun (Q147276) directly as the Lexical Category seems the more natural and logical. That's what I did on Breizh (L1756) and same for several other items like Kraków (L1041). We could maybe even use a more precise Category like toponym (Q7884789) (like it's done on some dictionaries). More generally, I'm not sure that using instance of (P31) on Lexemes is a good idea. Cdlt, VIGNERON (talk) 09:45, 22 June 2018 (UTC)

Wouldn't that depend on sense?
--- Jura 09:50, 22 June 2018 (UTC)
@Jura1: how could it depend on sense? All sens of one lexeme are by definition and by contruction in the same lexical category. Cdlt, VIGNERON (talk) 10:15, 22 June 2018 (UTC)

To me lexical category should be instance of part of speech (Q82042). Is there any language where proper noun (Q147276) is not noun (Q1084). If not then we can sublass it and it could be fine to place it in lexical category. KaMan (talk) 09:51, 22 June 2018 (UTC)

@KaMan: I don't know any language where proper noun (Q147276) is not noun (Q1084) (but to be confirmed by references). I looked quickly but the items need to be improved, some declaratio seems strange (among other things, I suspect a confusion between "proper noun" and "proper name", which are both "nom propre" in French). Cdlt, VIGNERON (talk) 10:15, 22 June 2018 (UTC)

One issue. To me a lexical category is something that should be invariant during translation (and so, it can depend on sense, yes). For example, is "English" a proper name? (Because it is not in Russian translation). --Infovarius (talk) 11:30, 22 June 2018 (UTC)

~~"English" is an adjective, not a noun, so I think that's a separate concern.~~ Noted - it's early here... ArthurPSmith (talk) 11:46, 22 June 2018 (UTC)

@Infovarius: not sure if the invariant idea is true (AFAIK there is no lexical category that exists in all languages, so none can be truly invariant). When you says "English" what do you refers to? The language or the people? (@ArthurPSmith: both being noun, in addition the adjective and verb) And is it one or two lexemes? In French, "anglais" (lang) and "Anglais" (people) are definetly two lexemes, both being common noun. Cdlt, VIGNERON (talk) 11:50, 22 June 2018 (UTC)

I mean English as language. --Infovarius (talk) 14:47, 22 June 2018 (UTC)

Ok, thanks for the clarification.

Do you think that the people and language is one or two lexemes in English? about the lexical category, I've looked a bit more into it and that's strange. On the English Wiktionary, both are indicated as proper noun but if I look at an other word like "French", the language is indicated as a proper noun and the people as a noun (I'm gonna ask on the Tea room). So I've looked in English dictionaries (including but not limited to Cambridge Dictionary and Merriam-Webster) and most of them seem to only indicate noun. The Cambridge Dictionary says noun even for name that are clearly proper noun and proper name like England or London, where the Merriam-Webster says geographical name for England and London. I guess will may need a property (instance of (P31)?) to store multiples values, with references; for the Lexical Category, I suggest using the most precise uncontroversial item.

Going back to « lexical category is something that should be invariant during translation », the more I think about it, the less it make sense ; I see no reason why translation wouldn't use similar but distinct categories (especially if one category is the subclass of the other).

Cdlt, VIGNERON (talk) 15:50, 22 June 2018 (UTC)

Having thought some more about this (ahem) I think I agree with VIGNERON generally - a lexeme will generally correspond to multiple senses which will generally have different translations into another language, and I don't think it's reasonable to always require those translations to be identical lexical categories (even when the two languages agree on what such categories are). Preventing the use of proper noun as a category in English for this reason seems unnecessary. And while the wiktionaries are great, they are "works in progress" as much as wikipedia or wikidata, so I wouldn't rely on them to resolve issues like this! ArthurPSmith (talk) 21:21, 22 June 2018 (UTC)

Help with pronunciation

Latest comment: 6 years ago5 comments2 people in discussion

Can somebody help me with pronunciation in Wiktionary (L3402)? I'm trying to reflect in pronunciation audio (P443) and IPA transcription (P898) what is collected in https://en.wiktionary.org/wiki/Wiktionary#English with correct split and markup to male/female and UK/American/Canada/Indian. Thanks in advance. KaMan (talk) 07:58, 19 June 2018 (UTC)

@KaMan: for pronunciation audio (P443), just add the file and qualifiers. Same for IPA transcription (P898) ; but here I'm wondering how this property should deal with phonetic and phonemic pronunciation. I think the more logical would be to put the first as a direct property and the second as a qualifier as the corresponding IPA transcription (P898), doest it make sense (subtelties of pronunciation can be very confusing, we probably need to think more about this). When I look at other Lexemes, I see that these properties are often at the lexeme level but the form level seem better to me, no? Cdlt, VIGNERON (talk) 10:06, 22 June 2018 (UTC)

@VIGNERON: Ok, I gave it a try. I added IPA with qualifiers but still I don't know how to distinguish male/female voices in markup. And for markup with language of work or name (P407) qualifier I have only English (Q1860). How can I distinguish UK/American/Canada/Indian. KaMan (talk) 13:03, 22 June 2018 (UTC)

KaMan I replaced English (Q1860) with more specific items. For male/female, not sure what would the more appropriate property... Right now, I only see maybe the general applies to part (P518)? Cdlt, VIGNERON (talk) 13:19, 22 June 2018 (UTC)

@VIGNERON: Ok, I used applies to part (P518) but somehow I don't like the way IPA is duplicated now for male and female voice. There is no way to apply applies to part (P518) to pronunciation audio (P443) of IPA. KaMan (talk) 15:06, 23 June 2018 (UTC)

Lexical stress

Latest comment: 6 years ago4 comments4 people in discussion

Please explain to me how to specify the lexical stress? Should I place an stress mark inside the word form? This is very important for the Slavic languages. See w:Russian declension. -- DonRumata (talk) 14:01, 27 June 2018 (UTC)

It seems that it could be put in forms. I notify @Cinemantique, Infovarius: who know russian and lexemes. See also вода (L189) for an example of russian lexemes with forms. Cdlt, VIGNERON (talk) 14:09, 27 June 2018 (UTC)

I'd suggest to apply the same system as we do for the Semitic languages and insert both stressed and unsterssed representation of both lexeme and forms. The stressed representation can have as its language code somthing like ru-x-Q181767.--Shlomo (talk) 17:29, 27 June 2018 (UTC)

Interesting solution! I just wanted to propose another - with some new property ("stressed form" or "stress position") but they seem clumsier. --Infovarius (talk) 09:51, 28 June 2018 (UTC)

Page listing the existing tools

Latest comment: 6 years ago1 comment1 person in discussion

Hello all,

Since several people started developing tools, I created a subpage of Wikidata:Tools dedicated to lexicographical data. It's also added on the banner of this page. If you're one of the autors (@Lucas Werkmeister, Fnielsen, JakobVoss:) feel free to improve the description, add a screenshot, add your tool if it's not in the list yet.

If you have ideas of tools, you can still add them on this page.

Reminder from the announcement: please be gentle with our system, and avoid any mass import, especially from non CC0 sources. Thanks :) Lea Lacroix (WMDE) (talk) 08:53, 28 June 2018 (UTC)

Help us improving the user experience

Latest comment: 6 years ago3 comments3 people in discussion

Hello all,

The first version of Lexicographical Data has been deployed three weeks ago, and the development team is already working on the next features that you requested on this page or on Phabricator.

In order to understand your needs better, and to improve the user experience on Lexicographical Data in the future, our UX team would like to discuss with some of you. This would help us learning more about:

what you’re currently doing with the existing interface,
what your workflows are, and how they could be improved,
what motivates you when adding and editing Lexemes,
the things you would like to do with this new data.

Having a conversation and possible demonstration of your workflow will enable us to understand your needs in the context of your activities.

Since our UX team wants to spend time observing your interactions with the interface, and discussing with you individually, this will take place as individual interviews with one of our designers. Via the communication platform of your choice, you’ll be able to share your screen, show how you’re currently editing the data and chat with us.

The discussion would take between 30 and 60 min, we will set up the appointment depending on when you are available. The appointments will take place between June 19th and July 3rd Note that depending on your preferred language, the discussion can happen in English, German or French.

The outcomes of the discussions will be compiled anonymously and published on Commons after being analyzed by the UX team.

If you’re interested, please let a comment under this message and we will contact you via the email feature or your talk page.

Thanks again for your involvement in the project, and for helping us improving the interface of Lexemes! Cheers, Lea Lacroix (WMDE) (talk) 11:48, 12 June 2018 (UTC)

VIGNERON (talk)
Lea if you're still doing this, you can add me ArthurPSmith (talk) 15:44, 22 June 2018 (UTC)

So how did it go?
--- Jura 15:38, 29 June 2018 (UTC)

Monitoring per language

Latest comment: 6 years ago2 comments2 people in discussion

@Léa Lacroix:, is there any plan do deliver special page to monitor recent changes in Lexemes per language? One can be interested in monitoring changes in his/her language. In Wiktionaries we have Special:RecentChangesLinked together with category of language to be searched in. It would be valuable to have similar tool in wikidata. KaMan (talk) 09:01, 29 June 2018 (UTC)

Thanks for your request! We'll definitely look into it more in details. We could probably add to the existing filters of Recent Changes. Lea Lacroix (WMDE) (talk) 09:49, 29 June 2018 (UTC)

Lexemes as labels for items

Latest comment: 6 years ago12 comments4 people in discussion

If the label of an item could be a lexeme and not just a string, you could get the specific form of the label that is relevant for your use, rather than just one form that is potentially the wrong one (and which you might not even be able to know in advance that you'll get).

As an example, if you are interested in an item (say animal (Q729)) in the context of taxonomy, you might be most interested in the plural form. --Njardarlogar (talk) 17:40, 4 June 2018 (UTC)

I suspected that this would have been brought up earlier (couldn't find anything), but it hasn't?

To go into some more detail for an example, in the case of plant (Q756), the label for many languages is a singular form, whereas if you want to label the taxon (Plantae), the norm is to use a plural form. So when I use a module to automatically create an infobox for Rosa (Q34687) at nn:roseslekta, I get a singular form from Wikidata when I really wanted a plural form (the other relevant labels used in the infobox are plural, making the language inconsistent). You could argue either way whether the label should be a singular form or a plural form; but for different uses, you want different grammatical forms anyway, so the general principle still applies. While we do have taxon common name (P1843), there is no guarantee that there will be only one common name.

It will presumably be possible to get the corresponding sense from the item and then find the relevant lexemes from the sense, but then you will have no guarantee that you will get the lexeme that is being used as the label. Other lexemes with the sense might be rare or deemed inappropriate to use in the specific context - information that it might be possible to store at the lexeme in some way, but that you have no guarantee that will have been stored there yet, either way.

In sum, being able to do something like using lexemes as item labels seems not only to be very convenient (e.g. by reducing redundancy of maintenance, such as for spelling), but a necessity to be able to use Wikidata for certain purposes (like in the example above).

There might be other solutions (for example involving properties), too; so it might be worthwhile to also bring this up at Wikidata:Project chat. --Njardarlogar (talk) 11:17, 16 June 2018 (UTC)

At some point we are going to connect senses with items using item for this sense (P5137). When that happens it should be possible to have a link from each label matching a form to the lexicographical item.--Micru (talk) 11:22, 16 June 2018 (UTC)

This seems to be what I wrote about in my third paragraph. With such a property you will have the possibility to go to the lexeme that corresponds to the label (not sure if it is currently possible with Lua to see which entity has a given item set as the value for a given property, however), but you still cannot verify that it actually is the lexeme that corresponds to the label. Occasionally, there might be more than one lexeme that has both the given sense and the label as one of its forms. If these forms from the different lexemes also have the same grammatical features, there is no way to tell which of the lexemes it is that corresponds to the label. --Njardarlogar (talk) 12:46, 16 June 2018 (UTC)

@Njardarlogar: Can you give an example of two different lexemes with the same forms pointing to the same item? I think there are none, because it would mean that the lexeme is repeated, but I might be wrong.--Micru (talk) 13:06, 16 June 2018 (UTC)

I don't have an actual example at hand, but in principle it should be straight forward. Say you have two lexemes with the given sense in common (other senses might be different) and the same lemma, but with different grammatical genders. Then the label appearing at the item may well match the lemma form of both lexemes. Any other form of the lexemes could be different. --Njardarlogar (talk) 18:49, 16 June 2018 (UTC)

@Njardarlogar: I don't find it so straight forward as you say. Without concrete examples for me it sounds just like an hypothesis not based on any real case scenario.--Micru (talk) 20:55, 16 June 2018 (UTC)

I am thinking about this from a perspective of design and guarantees.

It is difficult to prove that no such example exists among attested languages, and impossible to prove that it never will. Thus the only way that the software can guarantee that you always get the lexeme entity corresponding to the label (as long as the information has been added) is by having a link between the label and the lexeme.

Such links are also part of what makes Wikidata powerful: where you don't have to presume that you are getting the right topic (like by parsing a media caption at Commons), but can know it for certain (as long as the correct information has been entered into the database).

From my perspective, it boils down to this: how important is it that a user can get the the lexeme of a label (if it exists)? If it is important, we should make it possible to guarantee that the user can get the lexeme, if we have the information stored. If it we deem it to be not that important, we could argue that we don't have to give that guarantee, but that will still make Wikidata less powerful.

When I develop templates using Wikidata, like in the example above, I am often thinking about what guarantees Wikidata can give me. If I cannot get a certain guarantee, then it will complicate the usage of the information because I cannot be absolutely certain about what information I will actually get from Wikidata, which is a weakness. --Njardarlogar (talk) 09:47, 17 June 2018 (UTC)

Just a quick remark, if we want to do things correctly, items should be linked to sense, not to lexeme. And the other way round (sense to items) seems better to me (and has already been discuss). Cdlt, VIGNERON (talk) 12:24, 17 June 2018 (UTC)

Also brought this up at the project chat now, with some details on how this could be implemented in practice. --Njardarlogar (talk) 11:50, 30 June 2018 (UTC)

To me this feels like a feature that's worth discussing but where it would be better to have the discussion in a year when we see how far the capabilities that linking senses to items bring us.

Doing this change might make life more complicated for data reusers who just download the Q-namespace without also downloading lexemes and it's unclear whether issues like that are worth the benefit. ChristianKl ❪✉❫ 16:17, 30 June 2018 (UTC)

+1 with ChristianKl, the idea is not bad but it's too soon and it will maybe not necessary. Cdlt, VIGNERON (talk) 17:55, 30 June 2018 (UTC)