Wikidata talk:Lexicographical data/Archive/2018/11

This page is an archive. Please do not modify it. Use the current page, even to continue an old discussion.

Relate Form to Sense

The forms of Bank (L34723) f. in German are not identical in the sense of bench and bank (e.g. nominative plural in the sense of bench is Bänke in contrast to Banken). How should these forms be related to the senses? --Mfilot (talk) 21:30, 1 November 2018 (UTC)

They should be entered as separate lexemes. As well as different forms, they also have different etymologies. - Nikki (talk) 21:37, 1 November 2018 (UTC)

Ok, makes sense. I created a new lexeme Bank (L34791) for the financial institution and cleaned up the translation (P5972) (see Bank (L34791), banque (L15448), bank (L3354)). --Mfilot (talk) 22:16, 1 November 2018 (UTC)

I think we still need a way to indicate forms that may be linked to specific senses. --- Jura 13:16, 4 November 2018 (UTC)
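As a rough illustration of the "separate lexemes" answer above, the two German homographs can be modelled as distinct entries that share a lemma but carry different plural forms. This is only a sketch: the `Lexeme` and `Form` classes and the `form_for` helper are hypothetical and do not mirror the actual Wikibase data model, and it assumes that L34723 keeps the "bench" sense after the split.

```python
from dataclasses import dataclass, field


@dataclass
class Form:
    representation: str
    features: tuple  # grammatical features, e.g. ("nominative", "plural")


@dataclass
class Lexeme:
    lexeme_id: str
    lemma: str
    gloss: str
    forms: list = field(default_factory=list)

    def form_for(self, *features):
        """Return the representation whose features match exactly, or None."""
        wanted = set(features)
        for form in self.forms:
            if set(form.features) == wanted:
                return form.representation
        return None


# Assumption: L34723 keeps the "bench" sense; L34791 is the new lexeme
# created for the financial institution.
bank_bench = Lexeme("L34723", "Bank", "bench", [
    Form("Bank", ("nominative", "singular")),
    Form("Bänke", ("nominative", "plural")),
])
bank_finance = Lexeme("L34791", "Bank", "financial institution", [
    Form("Bank", ("nominative", "singular")),
    Form("Banken", ("nominative", "plural")),
])

for lexeme in (bank_bench, bank_finance):
    print(lexeme.lexeme_id, lexeme.gloss, lexeme.form_for("nominative", "plural"))
```

With one lexeme per etymology, each plural is unambiguous and no form-to-sense link is needed for this particular case.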

Forms dependent on the first letter of the following word

What grammatical category should we use, or how should we tag forms that are dependent on the first letter of the following word? For example: the English article 'a' has the form 'an' if followed by a word beginning with a vowel; and the English prefix 'in-' has the form 'im-' if it is followed by a base word starting with 'p' or 'b', etc. How should we / should we distinguish such forms? Liamjamesperritt (talk) 01:09, 4 November 2018 (UTC)

I have such cases in Polish too. For example nad (L14478), which can have the forms "nad" and "nade" depending on the next word. I mark them with vocalic form (Q55082724) and non-vocalic form (Q55082712), as they are tagged in the Grammatical Dictionary of Polish (Q55214514), which I use as my source of grammatical forms. KaMan (talk) 07:13, 4 November 2018 (UTC)

Normally, we would list the required forms somewhere. Maybe with the same as for Latin/French? Wikidata:Property_proposal/requires_form. --- Jura 12:04, 4 November 2018 (UTC)
- Interesting idea. I left some thoughts about the "requires form" property on the proposal page, as I think something like this could solve the problem. Liamjamesperritt (talk) 00:18, 5 November 2018 (UTC)
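One way to picture what tagging such forms would enable is a small selector that picks a form from a condition on the following word. The sketch below is an assumption-laden illustration, not a proposal for the data model: `select_form` and the rule tables are hypothetical, and the rules are purely spelling-based, as in the question above, so they would misfire on words like "hour" where pronunciation and spelling disagree.

```python
VOWEL_LETTERS = set("aeiou")

# Each entry is (form, predicate on the following word); the first match wins.
# The last entry of each table is the default form.
ARTICLE_FORMS = [
    ("an", lambda nxt: nxt[:1].lower() in VOWEL_LETTERS),
    ("a", lambda nxt: True),
]
PREFIX_FORMS = [
    ("im-", lambda nxt: nxt[:1].lower() in {"p", "b"}),
    ("in-", lambda nxt: True),
]


def select_form(forms, following_word):
    """Return the first form whose condition holds for the following word."""
    for form, applies in forms:
        if applies(following_word):
            return form
    raise ValueError("no form applies")


print(select_form(ARTICLE_FORMS, "apple"))
print(select_form(ARTICLE_FORMS, "bench"))
print(select_form(PREFIX_FORMS, "possible"))
print(select_form(PREFIX_FORMS, "active"))
```

In a real model, the predicate would presumably be replaced by a statement on the form (for example a "requires form"-style property or a grammatical feature item), leaving the selection logic to consumers of the data.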

How to split or merge pronoun (Q36224)

Some individual pronoun (Q36224) forms can either get their own lexeme or be grouped into one lexeme. Consider the English we (L483), which groups we, us, our, ourselves, while in Danish (Q9035) I have split vi (L35288) (basic form) and vores (L35289) (possessive pronoun). For Danish (Q9035), the dictionary Den Danske Ordbog (Q1186741) splits these words/lexemes [1] [2]. I am unsure which way is best. If we do not split, it seems that individual forms can be attached to different word classes. The same might go for the etymology, where Danish (Q9035) vi/vor is based on vár/várr. — Finn Årup Nielsen (fnielsen) (talk) 17:23, 4 November 2018 (UTC)

In the Indo-European languages, many possessive determiners can inflect by themselves, unlike genitive cases which don't have any further inflections. This makes me think that they should be treated as lexemes in their own right. —Rua (mew) 20:19, 4 November 2018 (UTC)

I was surprised when I saw that "my" wasn't entered as a lexeme of its own. That's not what I was expecting at all, so I would be in favour of splitting them for English. The online OED also has separate entries for them all. - Nikki (talk) 20:29, 4 November 2018 (UTC)

I think I took care of most of the pronouns in English; the grouping was based on my reading of how to handle lexemes in such cases, such as this question and discussion. There aren't many of them so I suppose they could be split - but then how do we link them properly to indicate they have such related meanings? ArthurPSmith (talk) 23:01, 4 November 2018 (UTC)

I don't think we have anything suitable right now, we don't have many properties for linking lexemes. - Nikki (talk) 09:37, 5 November 2018 (UTC)

Not sure why some of these got merged. If there is a need for a consolidated one, maybe the detailed ones could be linked with Wikidata:Property_proposal/form_is_subject_of. --- Jura 15:56, 5 November 2018 (UTC)

Lexemes that are defined grammatically in terms of other lexemes

One way that English Wiktionary avoids having to re-define words several times is by using special definitions that refer to another lexeme. For example, a word might be defined as the verbal noun or passive of some other verb. An example is the Northern Sami pair gávdnat (L35329) "to find" and gávdnot (L35330) "to be found", where the latter is a passive derivation of the former. Is there a way to give something like "passive of gávdnat" as the sense, instead of repeating all the senses of the base verb but in passive form? —Rua (mew) 20:17, 4 November 2018 (UTC)

Why should you regard them as different lexemes and not forms of one lexeme? --Infovarius (talk) 10:39, 7 November 2018 (UTC)

Because they are full lexemes in their own right. Passive verbs are full verbs, and have an infinitive and all the forms that any other verb might have. You can even derive new lexemes from one. That latter point is important, because we can only derive lexemes from other lexemes with our properties. —Rua (mew) 11:19, 7 November 2018 (UTC)

For example, gávdnon (L35767) "occurrence" derives from the aforementioned passive verb. —Rua (mew) 11:28, 7 November 2018 (UTC)
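The idea of a sense expressed as "passive of gávdnat" can be sketched as a lexeme that points at its base lexeme with a named relation instead of repeating the base glosses. The `relation` and `base` fields and the `describe` method below are hypothetical and only illustrate the question; they do not correspond to existing Wikidata properties.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lexeme:
    lexeme_id: str
    lemma: str
    glosses: tuple = ()
    relation: Optional[str] = None  # hypothetical, e.g. "passive of"
    base: Optional["Lexeme"] = None  # the lexeme the relation refers to

    def describe(self):
        """Prefer a relational definition; fall back to the lexeme's own glosses."""
        if self.relation and self.base:
            return f"{self.relation} {self.base.lemma}"
        return "; ".join(self.glosses)


gavdnat = Lexeme("L35329", "gávdnat", glosses=("to find",))
gavdnot = Lexeme("L35330", "gávdnot", relation="passive of", base=gavdnat)

print(gavdnat.describe())
print(gavdnot.describe())
```

Because gávdnot remains a full lexeme in this sketch, it can still carry its own forms and serve as the source of further derivations such as gávdnon (L35767).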

Senses before forms

Senses are more important for identifying the word and are generally what people are more likely to look for than forms. Can they be listed before forms? This is especially important for words with dozens of forms. —Rua (mew) 12:09, 21 October 2018 (UTC)

I agree with that. Because of that, as a temporary solution for myself, I wrote a small script to quickly jump to senses from the ToC at the top of the page: User:KaMan/ToC_to_lexemes.js. KaMan (talk) 12:57, 21 October 2018 (UTC)

Information such as the grammatical type of the lexeme is nevertheless very important, as we have several lexemes for the same string (for an example of the same string with different grammatical genders, see L:L2330 and L:L2332), so this information may be important for disambiguation and for avoiding mistakes. (By the way, I noticed that there do not seem to be any differences in the type, gender and forms of two homographs, L:L2331 and L:L2332; why are they different lexemes?) author TomT0m / talk page 18:41, 21 October 2018 (UTC) I later removed this argument, as the discussion below proved it is not really founded. Semantics is carried by lexemes in « Lexical semantics », so senses are essential to identify an item. author TomT0m / talk page 11:57, 22 October 2018 (UTC)

They might differ in information that hasn't been provided yet. —Rua (mew) 19:44, 21 October 2018 (UTC)

@TomT0m: exactly as Rua said (etymology for instance, you can already see the reverse etymology for tour de Babel (L474) or tourneur (L2334)). These 3 lexemes "tour"@fr have already been discussed a lot; see the past discussions (Special:WhatLinksHere/Lexeme:L2330). Conversely, do you know of even one source that says it's one lexeme? (*All* the dictionaries describe them as at least 2 lexemes, often 3 lexemes.) Cdlt, VIGNERON (talk) 09:57, 22 October 2018 (UTC)

Sorry for the naive question, I don’t follow this page very closely :) I must admit I find articles such as fr:Lexème_(linguistique) hard to follow, as they are full of specialized terms and infinite nuances and lack examples. Maybe we need a Help:Lexeme page which sums up the current discussions, gives examples and details the definition and practices used on Wikidata for the layman? I’m just looking for guidelines right now, so I looked at the data model to see if there was a basic definition, and I see that in mw:Extension:WikibaseLexeme/Data_Model#Lemma it is written:

Two distinct lexemes with the same lexical category can exist in the same language if they have different morphology, that is, different forms.

That does not mention etymology for lexemes, so it’s inconsistent with actual use. My own intuition would have suggested that etymology is tied to senses, but that’s just me :). author TomT0m / talk page 10:36, 22 October 2018 (UTC)

@TomT0m: no problem, it's always better to ask. It is hard to explain simply what a lexeme is, just as it's hard to explain what a concept is for items. As a very simplistic approach: a lexeme = a word. A more precise approach would be: a dictionary entry (usually one lemma has only one entry, but sometimes there is something like « 1. tour and 2. tour » when one lemma is borne by several lexemes, see "tour" in the TLFi). And more technically: an entity with specific information, including but not limited to morphology.

« etymology is tied to senses », yes, but more exactly « etymology is tied to senses of a word ». Anyway, "tour"@fr is a weird exception where several homographs with similar information are different lexemes; don't focus too much on it (and just look at references ;) if dictionaries say it's two lexemes, just follow them).

Cdlt, VIGNERON (talk) 11:17, 22 October 2018 (UTC)

@VIGNERON: I understand that practical approach, but that does not really answer my question :) I finally took the approach of browsing the enwp article, and I found that the en:lexeme is defined by a field called en:Lexical_semantics, which actually takes into account the meaning of the different lexical entities and carries semantics, so that explains the fact a little bit more. author TomT0m / talk page 11:51, 22 October 2018 (UTC)

If there is general agreement that we should switch around Senses and Forms I'm happy to do that. Some more opinions please? --Lydia Pintscher (WMDE) (talk) 03:48, 26 October 2018 (UTC)

I support putting senses before forms (but after statements) on the page for a lexeme; people looking for a particular word should be informed quickly if they've gone to the wrong place. ArthurPSmith (talk) 15:13, 26 October 2018 (UTC)

Alright. I've opened phabricator:T208592 for it. --Lydia Pintscher (WMDE) (talk) 15:06, 2 November 2018 (UTC)

It's already live, though in older lexemes it appears after purge of page or some edit and refresh of the page. KaMan (talk) 10:09, 8 November 2018 (UTC)

Extensions of Ordia

The Toolforge tool Ordia at https://tools.wmflabs.org/ordia/ has now been extended. There are overviews of languages, lexemes, forms and senses. There is also a text-to-lexeme functionality https://tools.wmflabs.org/ordia/text-to-lexemes though currently only enabled for four languages. Example: [3] — Finn Årup Nielsen (fnielsen) (talk) 20:39, 1 November 2018 (UTC)

Nice. I had done something similar (offline). Just lacks the select lexical category and create buttons. It seemed to time-out when I tried with a Wikipedia article. --- Jura 13:18, 4 November 2018 (UTC)
- Rather than a time-out, I suspect it is a CORS problem that I need to look into. I hope to extend Ordia with a button for input. — Finn Årup Nielsen (fnielsen) (talk) 00:20, 8 November 2018 (UTC)

Vote: Do we allow phoneme in the Lexeme namespace?

Hello, from this discussion and this one, I understood that there is a relative consensus that we do not allow storing phonemes in the Lexeme namespace. Thus, I added this fact to Wikidata:Lexicographical data/Notability. However, Jura1 considers that the discussion is not over. Because I think it is (nobody has written a new message on that topic for a while), I propose a vote in order to validate this point. @KaMan, Nikki, VIGNERON, Njardarlogar, Rua, Infovarius: @Jura1, ArthurPSmith, Circeus, Lexicolover: I am pinging you because you participated in previous discussions on that topic.

Do we allow phoneme in the Lexeme namespace?

Support

Oppose

Phonemes have to be stored in the Q-namespace. Pamputt (talk) 07:46, 25 October 2018 (UTC)
Despite asking several times, there was no explanation, justification or reason why the Qitems are not enough. VIGNERON (talk) 07:57, 25 October 2018 (UTC)
IIRC I already told it twice, phonemes are not lexemes. KaMan (talk) 11:26, 25 October 2018 (UTC)
None of the lexeme features (language, lexical category, forms with grammatical features, etc) apply to phonemes, as far as I am aware, so I don't see any benefit to putting them in Lexeme namespace at all. ArthurPSmith (talk) 15:00, 25 October 2018 (UTC)
—Rua (mew) 15:58, 25 October 2018 (UTC)
Phonemes are not lexemes. Although it is theoretically possible to store non-lexeme phenomena as L-entities, such a practice is quite problematic and should only be allowed if there is a strong reason for it. The same applies to graphemes, btw.--Shlomo (talk) 06:40, 26 October 2018 (UTC)
--Njardarlogar (talk) 09:31, 29 October 2018 (UTC)

Discussion

@Pamputt: As you don't contribute actively to lexemes on Wikidata, it's not clear how this would affect you and why you'd vote on this. Maybe you could outline the problem you are trying to solve. What alternatives do you propose? What is the urgency of this point? I find your overall posts to this page rather nonconstructive (who would start topic called "L21070_should_not_exist" to seek positive input?). Please avoid breaking things. --- Jura 07:55, 25 October 2018 (UTC)
@Pamputt: What now with these voting results? KaMan (talk) 09:45, 9 November 2018 (UTC)
@KaMan: I am waiting until this section is archived, and then I will add a section specifying explicitly that phonemes and graphemes are excluded from the Lexeme namespace, with a reference to this vote (that is why I need the discussion to be archived, in order to use a perennial link). Something similar to this. Pamputt (talk) 12:25, 9 November 2018 (UTC)

Forms that also have idiomatic meanings

How are cases handled where a form has acquired meanings that can't be predicted grammatically from which form it is, but are idiomatic to that particular form? An example that comes to mind is English broken, which has meanings that don't follow from it being the past participle of break. —Rua (mew) 17:30, 10 November 2018 (UTC)

I would say create separate lexemes for new set of form(s) of new meaning(s). KaMan (talk) 18:07, 10 November 2018 (UTC)

How would you define its etymology? —Rua (mew) 18:34, 10 November 2018 (UTC)

With object form (P5548) of derived from lexeme (P5191). KaMan (talk) 18:40, 10 November 2018 (UTC)

@Rua, KaMan: Almost(?) all English participles can act as adjectives; I'm not sure it's really worth having a separate lexeme for all of them. And almost all the senses I see in enwiktionary for "broken" are also possible to associate with the original verb. But one or two of them perhaps not, so yes a distinct lexeme for that sort of case makes sense to me. ArthurPSmith (talk) 19:11, 10 November 2018 (UTC)

According to Wikipedia, participles are adjectival or adverbial by definition, so it's not surprising to see them acting like adjectives. The question is what to do with the ones that have semantically separated from the verb and become independent words. Even if "broken" is not a good example, there are plenty of examples across languages that are. Another example I can think of is ukudla in Zulu, which is both the infinitive of -dla "to eat" and lexicalised in the meaning "food". —Rua (mew) 23:00, 10 November 2018 (UTC)

Translations

I have been adding translation (P5972) to Dienstag (L6818) in the sense of day of the week. Currently (12:37, 4 November 2018 (UTC)) there are 16 translations. From each of these senses, a translation points back to the German sense and to each of the other senses, resulting in 272 translation entries. Of course this is a bit redundant, but this is the way it is intended to work, isn't it? I'm aware that most senses have item for this sense (P5137) pointing at Tuesday (Q127), but it might be cumbersome when looking for a translation to take the path over the Q-item instead of translation (P5972). --Mfilot (talk) 13:03, 4 November 2018 (UTC)
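The figure of 272 entries quoted above is consistent with each of the 17 senses (16 translations plus the German sense) linking to each of the other 16. A small sketch of that arithmetic, compared with one link per sense to a shared item, follows; the function names are made up for illustration.

```python
def pairwise_statements(n_senses):
    """Directed translation links when every sense points to every other sense."""
    return n_senses * (n_senses - 1)


def hub_statements(n_senses):
    """Links needed when each sense points to one shared item instead."""
    return n_senses


senses = 16 + 1  # 16 translations plus the German sense itself

print(pairwise_statements(senses))  # 272
print(hub_statements(senses))       # 17
```

The pairwise count grows quadratically with the number of languages, while the hub count grows linearly, which is the redundancy the comment points at.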

Maybe the idea is that people focus on a limited number of language pairs for an unlimited number of lexemes rather than the opposite.
The layout for these could be improved. The language should be visible at least as (language) code and the gloss could fall back to the language of the lexeme (sorry for the digression). --- Jura 13:16, 4 November 2018 (UTC)

That is a good thought to focus on limited number of languages which will happen anyway since for most lexemes it will not be that easy to identify the translations. I agree that the language of the translation (P5972) should be visible, and a more compact layout would also help. Some translations in Dienstag (L6818) display as code instead of label e.g. L34322-S1 instead of вівторок. Is this related to my language settings? --Mfilot (talk) 13:47, 4 November 2018 (UTC)

It happens if there is no gloss in your interface language. With "code" I had in mind the language code (rather than the id of the sense). I'm not even sure if it's a good idea to show the gloss in the interface language rather than the language of the linked lexeme. Maybe that should be a setting in preferences. --- Jura 14:01, 4 November 2018 (UTC)

Don’t we have items to represent meanings? This seems like the same situation as the interwikis on Wikipedias in the pre-Wikidata era. My understanding was that « item for this sense » had the same role: a central item representing a meaning, to which each of the senses with exactly that meaning would connect. This would avoid the translation explosion, and the exact translation pairs are easily queryable:

⟨ ?sense of a word in French ⟩ item for that sense ⟨ Wikidata item A ⟩

⟨ ?sense of a word in English ⟩ item for that sense ⟨ Wikidata item A ⟩

author TomT0m / talk page 21:43, 4 November 2018 (UTC)
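The hub pattern described in this comment can be sketched as grouping senses by the item they point to and reading translation pairs off each group. This is an illustrative sketch only: the sense ids other than L34322-S1 are assumed for the example, and the `translation_pairs` helper is hypothetical; on Wikidata itself the same join would more likely be expressed as a query.

```python
from collections import defaultdict
from itertools import permutations

# (sense id, language code, item for this sense).
# Only L34322-S1 and Q127 appear in the discussion; the other ids are assumed.
sense_items = [
    ("L6818-S1", "de", "Q127"),
    ("L34322-S1", "uk", "Q127"),
    ("L0000-S1", "en", "Q127"),   # placeholder id for an English sense
    ("L0001-S1", "en", "Q128"),   # a sense linked to a different item
]


def translation_pairs(rows):
    """Derive directed translation pairs from senses that share the same item."""
    by_item = defaultdict(list)
    for sense_id, language, item in rows:
        by_item[item].append((sense_id, language))
    pairs = []
    for senses in by_item.values():
        for (a, lang_a), (b, lang_b) in permutations(senses, 2):
            if lang_a != lang_b:  # a translation links two different languages
                pairs.append((a, b))
    return pairs


for source, target in translation_pairs(sense_items):
    print(source, "->", target)
```

Only three statements are stored for the shared item here, yet all six directed pairs among the three languages can be recovered on demand.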

Indeed item for this sense (P5137) can help avoiding translation explosion for nouns, since suitable Q-id can often be found, but then we would have to create Q-id to "translate" the conjunctions aber (L7879) and mais (L9261). --Mfilot (talk) 17:38, 5 November 2018 (UTC)

It seems redundant to have item for this sense (P5137) and translation (P5972) together in the same sense. Would it be possible to have a constraint that only one of those properties should be used for each sense? When the number of translation (P5972) statements grows beyond a certain number, a new item could be created for that sense to act as a hub for all languages. In my view a "sense" is similar to a Qitem that is embedded in a lexeme because it is convenient, but we shouldn't fear creating new Qitems when necessary.--Micru (talk) 18:44, 5 November 2018 (UTC)

@Micru: I've added a lot of senses (for English) - it seems relatively easy to find item for this sense (P5137) when the lexeme is a noun, but we almost never have existing Q items for other parts of speech (verbs or adjectives most commonly). Are we comfortable with adding Q items for verbs and adjectives? ArthurPSmith (talk) 20:05, 5 November 2018 (UTC)

@ArthurPSmith: I'm inclined to think that verbs and adjectives meet our notability criteria, since they refer to clearly identifiable entities (especially if they exist in several languages) and they would fulfill a structural need. Of course, I'm open to hearing more arguments about this.--Micru (talk) 21:16, 5 November 2018 (UTC)

General comment: implementing sense entities still seems like a cleaner approach to me than putting senses for verbs, adjectives, adverbs etc. in the main namespace. It should also have a lot of benefits, like making it much easier to search through existing senses. We would also not have to worry about the notability policy for the main namespace; if one lexeme uses a sense entity, that should be good enough to keep it. We could also hard-code certain sense-specific behaviours, like making it possible to set the narrowest hyperonym (supersense, like blue for light blue) in a special field on the entity rather than using a property, and listing up all the widest hyponyms (subsenses) as well to aid navigation. --Njardarlogar (talk) 21:18, 5 November 2018 (UTC)

@Njardarlogar: How is a "sense entity" different than a regular item? --Micru (talk) 22:06, 5 November 2018 (UTC)

@Micru: I don't think of the extent to which a sense entity would be different from an item as the most important point, but rather that it is actually a separate type of entity with its own namespace. Having senses as their own entity type does mean that we can tailor them to their specific purpose, like I suggested above, and which I think potentially could get quite useful; but I don't think it is the most important reason.

Simply by keeping senses separate in their own namespace, we get:

- improved experience with the user interface: now we have to first enter a gloss, then select item for this sense (P5137) as the property to use, and only then can we select an entity. With sense entities, we would have a special field that would only accept sense entities as input and that would not require a gloss - just start searching among senses right away.
- dedicated entities that do not contain content irrelevant for senses; for example, on a sense entity for a country, we would have no information on head of state, population, et cetera, et cetera. We could instead have a more prominent position for links to related senses, such as an inhabitant of the country, its language(s) and so on.
- easier navigation among existing senses: the main namespace is mostly composed of entries that will never be used as senses.

--Njardarlogar (talk) 13:12, 6 November 2018 (UTC)

@Njardarlogar: I understand your point, however I find that adding an additional namespace for senses would increase the complexity unnecessarily. 1) The feature that you suggest of a special field for senses could be designed to accept q-items as input, no additional namespace necessary. 2) Entering again the information for existing entities would mean duplication of effort for creation and maintenance. 3) There is nothing that indicates that adding an additional namespace would make navigation easier.

On the other hand, using items for senses where relevant doesn't require any additional infrastructure, we can start doing it now if there is the will.--Micru (talk) 22:14, 6 November 2018 (UTC)

@Micru: Regarding 1), you can still expect many irrelevant suggestions from the main namespace; items that will never be used on a lexeme for a sense. As long as we can e.g. set example lexemes on the sense entities (or have them generated automatically by the MediaWiki software), it must necessarily be easier to navigate a dedicated sense namespace than the main namespace because the main namespace is filled with irrelevant items. There would be duplication with sense entities, particularly for nouns; but it would likely not be 100% unless adverbs and similar concepts were included in the main namespace independently of their use by the lexicographical project. The essence of a sense should not change over time, so maintenance should mainly be about dated language in the descriptions/definitions (including altered classifications of the concept the sense corresponds to).

All that said, there is one potentially important difference between how we would use sense entities versus items: items are currently not supposed to have lengthy definitions; an item description is not meant for a dictionary definition but to be brief and act as a disambiguator. On sense entities, on the other hand, we could accommodate precise definitions and have a specific field for this purpose. Without definitions, the lexicographical project would be incomplete, surely. A property could potentially be used for this purpose on items, yes. --Njardarlogar (talk) 17:49, 7 November 2018 (UTC)

@Njardarlogar: Regarding irrelevant suggestions, that could be improved with a better suggester. From my experience it is not that bad, but if you have some examples of where the suggester was not offering you the items that you needed, then you should post them so that the developer team is informed.

As for the definitions, in the past I was under the impression that we need them. However the more I thought about it, the more I came to realize that the statements *are* the definition. For Wikidata it is not so relevant to come up with textual explanations of words (which btw normally have copyrights), that is the job of the wiktionaries, but what we can do is to transform those definitions into structured data (CC0).--Micru (talk) 12:24, 8 November 2018 (UTC)

@Micru: So far, all Wikidata items have been "conceptual or material entities", or some notable "thing". To start adding Q-Items for verb and adjective senses en masse would represent a substantial change to the essence of the Wikidata ontology. Since Wiktionary (and by extension, dictionary) entries are generally disallowed from entering Wikidata's Main namespace, it feels to me that entering senses of various lexical categories (verb, adjective, adverb, etc.) goes against that policy. A key reason the Lexeme namespace was created was to keep encyclopedic data separate from lexical data, and now we are coming to realise that lexemes alone are potentially insufficient to efficiently describe lexical data. Do we need to create another namespace to more fully attain to what the Lexeme namespace was meant to fulfil, or do we shift the usage of the Main namespace, and potentially go against the purpose of creating a separate lexical namespace in the first place? Either way, it seems to me that the current data model appears to be missing an important piece of the puzzle. Liamjamesperritt (talk) 21:28, 6 November 2018 (UTC)

@Liamjamesperritt: I have difficulties understanding why an adjective or a verb could not be considered a conceptual entity. In my view the definition of conceptual entity seems quite arbitrary and perhaps related to what can be considered encyclopedic, which generally does not apply to Wikidata. It is true that individual dictionary entries are generally disallowed from entering Wikidata's Main namespace, because for that we have the lexeme namespace, however here we are talking about senses that are shared among a high number of languages. Even if we created an item for the sense "important", we still would need to create lexeme entities for each of the languages that have lexemes that represent that concept, because each language has its own peculiarities regarding pronunciation, use, etc. By allowing the senses of verbs and adjectives in the main namespace, we are not going against the purpose of creating a separate lexical namespace, because such q-items would be a complement to lexeme entities, not a replacement. As you say, we are missing a piece of the puzzle.--Micru (talk) 22:14, 6 November 2018 (UTC)

@Micru: Although I still feel that adding senses for verbs, adjectives and adverbs en masse would represent a substantial shift in the usage of the Wikidata Main namespace, if it is eventually concluded that such senses are valid Q Items, then I agree that this would be a nice solution to the problem, as we are already using Q Items to link noun senses. The next question is then: what would we make these items instances of / subclasses of? Or would we instead link them to their noun counterparts with a new property (e.g. "run" -> "running"; "beautiful" -> "beauty")? Or both? Liamjamesperritt (talk) 22:45, 6 November 2018 (UTC)

@Liamjamesperritt: It would definitely be a new practice to add items for senses, and as such it should be discussed thoroughly. I find your question about "instances of / subclasses of" too generic, because each one will have a different value, plus several other statements might help outline their meaning. I would say that "beautiful"<indicates quality>"beauty", about "running" I am not so sure because the item seems to conflate different concepts (sport, terrestrial locomotion), so probably it should be split.--Micru (talk) 00:00, 7 November 2018 (UTC)

So I've been reading up a bit on WordNet - https://wordnet.princeton.edu/ - what they've done for English is sort of create the conceptual items we are talking about. See the section on "Relations" on that page for how they differently handle noun, verb, and adjective relations; the hierarchies or groupings are quite different. The total number of items that might need to be created for verbs, adjectives and adverbs could be estimated from their counts - I would guess it will be well under 100,000 (most of their synsets are nouns already). So I don't think it would be in any way a big burden on Wikidata to add these concepts as items, they'll be at about the 1/1000 level of items. ArthurPSmith (talk) 16:30, 7 November 2018 (UTC)

@ArthurPSmith: Very interesting. This idea of a "synset" does make a lot of sense. And with the current state of the Wikidata data model, it certainly seems as though Items would be the simplest choice for representing these concepts, especially if it is determined that it would not be a burden on the Main namespace. Perhaps there should be a discussion about whether this direction should be taken? Liamjamesperritt (talk) 02:48, 8 November 2018 (UTC)

I agree, we need more input on this. I have started a thread on the project chat.--Micru (talk) 13:48, 8 November 2018 (UTC)

I would say that the synsets for nouns are already the Q-items. At least that is how I have used it. For instance, tape recorder (Q213777) is linked to WordNet via http://wordnet-rdf.princeton.edu/wn30/04392985-n and exact match (P2888). — Finn Årup Nielsen (fnielsen) (talk) 01:50, 13 November 2018 (UTC)

Senses are now displayed before Forms

Hello all,

Based on several requests, we now display the Senses section before the Forms one on Lexemes.

If you still see the Forms first on a Lexeme, you should purge the page or do an edit on it.

If you have any issue, the related ticket is this one, you can also ping me.

Cheers, Lea Lacroix (WMDE) (talk) 15:17, 12 November 2018 (UTC)

I noticed this change Friday or Saturday, was wondering when it would be announced - thanks! ArthurPSmith (talk) 17:11, 12 November 2018 (UTC)
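For anyone wanting to purge several lexeme pages at once rather than one by one, the request could be prepared along the following lines. This is a hedged sketch: it only builds the POST body for the MediaWiki API `purge` action and does not send it, the helper name `build_purge_request` is made up, and the lexeme ids are simply examples taken from elsewhere on this page.

```python
from urllib.parse import urlencode

API_URL = "https://www.wikidata.org/w/api.php"


def build_purge_request(lexeme_ids):
    """Build the URL and POST body for purging the given lexeme pages.

    Nothing is sent here; the caller decides how and whether to POST it.
    """
    params = {
        "action": "purge",
        # Page titles are separated with "|" in the MediaWiki API.
        "titles": "|".join(f"Lexeme:{lexeme_id}" for lexeme_id in lexeme_ids),
        "format": "json",
    }
    return API_URL, urlencode(params).encode("utf-8")


url, body = build_purge_request(["L2330", "L2332"])
print(url)
print(body.decode("utf-8"))
```

The purge action expects a POST request, so the body above would be passed as the request data rather than appended to the URL.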

Lemmata for Latin verbs

In Wiktionary, the main lemma for most Latin verbs is the first person conjugation. However, most of the Latin verbs in Wikidata so far have the infinitive as the lemma. I want to start adding Latin verbs to Wikidata, but I'm not sure if I should set lemmata to the first person or the infinitive. Any advice? Liamjamesperritt (talk) 04:02, 11 November 2018 (UTC)

Just follow the existing ones, similar to other languages. --- Jura 04:16, 11 November 2018 (UTC)
There's no reason to have the "main" lemma in infinitive. AFAIK every serious dictionary uses first person, and so do many Wiktionaries, including the Latin one. Alas, the ~~English and~~ German Wiktionar~~ies~~y uses infinitive lemmas, so you can expect a strong opposition from this side.--Shlomo (talk) 07:13, 11 November 2018 (UTC)
Agreed with Shlomo, we should follow the way specialists and reference works on that language use. Pamputt (talk) 10:21, 11 November 2018 (UTC)
Yeah, let's follow what specialist contributors already do. As Wiktionaries shouldn't be copied, we can't really follow them. --- Jura 12:57, 11 November 2018 (UTC)
Many Wiktionaries follow standard dictionary protocols and contain a wealth of valuable lexical information, so why do you say that Wiktionaries should not be followed? Should we not do our best to align the Lexeme namespace with Wiktionaries in order to provide structured data support for Wiktionaries, as the Main namespace has done for Wikipedias, Wikiquotes, etc.? Liamjamesperritt (talk) 19:49, 11 November 2018 (UTC)
This discussion is not about the information itself, but about structural issues. There are good reasons why the data model of Wiktionaries shouldn't (and can't) be followed:
1. The software used for powering Wiktionaries is text-based, which is an appropriate solution for Wikipedias, Wikisources, Wikibooks etc., but not so much for a dictionary, which is primarily a database of lexical information. It can still be used after introducing many workarounds and strict rules concerning the pages' structure. Wikidata software is more appropriate for processing lexical information as data and doesn't have to follow all the Wiktionaries' workarounds and limitations.

2. There are many Wiktionaries and they are autonomous projects. Various Wiktionaries use different solutions for the problems mentioned under (1), and Wikidata can't follow all of them. Even in the case discussed here, we can follow the ~~en.wikt and~~ de.wikt (etc.) and use the infinitive as the "main" lemma, or we can follow la.wikt and fr.wikt (etc.) and use the 1st person sg. for this purpose. Or we can have multiple lemmas without saying which one is the main one, and let the user decide; this is not possible in Wiktionaries, but it is possible here.

--Shlomo (talk) 09:14, 12 November 2018 (UTC)
Um, just for your information, en.wiktionary uses the first-person singular present active indicative as the lemma for Latin verbs. See wikt:WT:Lemmas. —Rua (mew) 10:43, 12 November 2018 (UTC)
Ehm, thanks, mea culpa. It was stuck somewhere in my memory, which seems to be not so reliable any more ;)--Shlomo (talk) 16:48, 12 November 2018 (UTC)
As Wiktionaries aren't CC0, we can't import most of their content and model. Obviously we shouldn't otherwise these valuable projects would get aborted by another WMF project. We can still link to their pages and they can do the same. Lexemes at Wiktionary would probably have been preferable, but somehow a series of users we have hardly seen editing lexemes since wanted otherwise. Now people like you and me who actually contribute are stuck with the current situation. --- Jura 05:39, 13 November 2018 (UTC)
We should not be "stuck" with anything; everything can be changed. I agree Latin lemmata should follow standard practice among classicists (I am a lapsed classicist myself) and use the first-person present active indicative form for verbs. Ijon (talk) 00:07, 14 November 2018 (UTC)
Useful could be to include several forms in the lemma. --- Jura 12:57, 11 November 2018 (UTC)
That's possible. The question was about the "main" lemma (whatever it is), which should be, per definition, only one. The way I understood it is, that main lemma is the one with plain language code (in this case la). The infinitive can be added as alternative lemma, e.g. with the code la-x-Q179230.--Shlomo (talk) 15:28, 11 November 2018 (UTC)

I don't see any problem with having multiple lemmata. I feel that the first person present tense makes most sense for Latin lemmata and I will start contributing as such, but also adding the infinitive for people who are more familiar with the infinitive as a lemma could be useful as well. Liamjamesperritt (talk) 19:24, 13 November 2018 (UTC)
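The convention described above (the main lemma under the plain language code la, the infinitive as an alternative lemma under a code such as la-x-Q179230) could be sketched roughly as follows. This is only an illustration: the verb amo/amare is an assumed example, and main_lemma is a hypothetical helper, not part of any Wikidata tool.

```python
# Lemmas keyed by language code, in the shape of a lexeme's "lemmas" map.
# The verb (amo / amare) is an assumed example, not taken from the discussion.
lemmas = {
    "la": {"language": "la", "value": "amo"},
    "la-x-Q179230": {"language": "la-x-Q179230", "value": "amare"},
}


def main_lemma(lemmas):
    """Return the lemma stored under a plain language code (no "-x-" part), or None."""
    for code, lemma in lemmas.items():
        if "-x-" not in code:
            return lemma["value"]
    return None


print(main_lemma(lemmas))
```

With this reading, nothing in the data itself marks a lemma as "main"; it is purely a convention about which language code is treated as the default.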

Documenting dialects?

Does Lexeme support dialects yet? How can one mark a form one adds as belonging to a particular dialect, or as being a substandard (but very common) variant? Ijon (talk) 00:04, 14 November 2018 (UTC)

Or senses that only exist in a particular dialect. —Rua (mew) 00:26, 14 November 2018 (UTC)

There is a new property location of sense usage (P6084) that kind of addresses it (not so much dialect, but location), but I don't think there is anything like that for forms yet. Liamjamesperritt (talk) 01:07, 14 November 2018 (UTC)

List of properties for Lexemes and List of Lexemes by language

Where can I find a list of properties for lexemes? I think a list of properties with usage examples would be very helpful for understanding how and where to use them. It is quite difficult to understand which property should be used for what. A list of lexemes by language would also be very helpful for finding language-specific lexemes and editing them. Regards,-Nizil Shah (talk) 05:16, 6 November 2018 (UTC)

@Nizil Shah: The list of all lexeme-related properties is here: Template:Lexicographical properties. Some of the properties have Wikidata property example for lexemes (P5192) specified, but if you have a problem with how to use some property, just ask here. You can get a list of lexemes in your language in two ways. The easy one is to list everything that links to the language item from the Lexeme namespace. For example, for all lexemes in Gujarati (Q5137) see here. The second method is to run a query. Hope this helps. KaMan (talk) 11:17, 6 November 2018 (UTC)
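The "easy" method above could be sketched like this, assuming the Lexeme namespace on wikidata.org has the number 146 and that Special:WhatLinksHere accepts a namespace parameter. The helper name lexemes_linking_to is hypothetical, and this is not necessarily the exact link given in the comment.

```python
from urllib.parse import urlencode

LEXEME_NAMESPACE = 146  # assumed namespace number for "Lexeme:" on wikidata.org


def lexemes_linking_to(language_item):
    """Build a Special:WhatLinksHere URL restricted to the Lexeme namespace."""
    query = urlencode({"namespace": LEXEME_NAMESPACE})
    return f"https://www.wikidata.org/wiki/Special:WhatLinksHere/{language_item}?{query}"


# Q5137 is the item for Gujarati, as mentioned in the comment above.
print(lexemes_linking_to("Q5137"))
```

The resulting list includes every lexeme that links to the language item in any way, so it may contain a few lexemes of other languages that merely reference Gujarati.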

@Nizil Shah: Besides the template that KaMan has linked, there is also Wikidata:List of properties/linguistics, or you can browse properties using Prop explorer. In any case, you can also look at the showcased lexemes or already existing lexemes in your language and find inspiration there.--Micru (talk) 12:42, 6 November 2018 (UTC)

@Nizil Shah: You can also get a list of properties in Ordia: https://tools.wmflabs.org/ordia/property/ — Finn Årup Nielsen (fnielsen) (talk) 01:36, 13 November 2018 (UTC)

Thank you for the lists. I will look into them and ask for clarification if I need help understanding any property. I will propose any properties that are missing.-Nizil Shah (talk) 06:25, 16 November 2018 (UTC)

List of your lexemes that need senses

This URL lists all the lexemes you’ve created until 18 October 2018, the date senses became available. (Ever since, you’ve always added senses to your new lexemes, right? 😉) Perhaps it’s time to add some sense(s) to them? --Lucas Werkmeister (talk) 23:15, 7 November 2018 (UTC)

The other way is to query for all lexemes without any sense in your preferred language: here is an example for Esperanto. KaMan (talk) 10:00, 8 November 2018 (UTC)
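A query of the kind mentioned above might look roughly like the sketch below. It is not necessarily the query behind the link; it assumes the prefixes dct:, wikibase:, ontolex: and wd: are predefined by the Wikidata Query Service, and the function name is hypothetical. The code only builds the query text and does not contact the service.

```python
def lexemes_without_senses_query(language_item, limit=100):
    """Return SPARQL text selecting lexemes of one language that have no sense."""
    return f"""
SELECT ?lexeme ?lemma WHERE {{
  ?lexeme dct:language wd:{language_item} ;
          wikibase:lemma ?lemma .
  FILTER NOT EXISTS {{ ?lexeme ontolex:sense ?sense }}
}}
LIMIT {limit}
""".strip()


# Q143 is the item for Esperanto.
print(lexemes_without_senses_query("Q143"))
```

The LIMIT keeps the result small enough to work through by hand; for languages with many lexemes it can be raised or removed.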

@KaMan: yes, depending on language that’s more or less feasible. Nikki also suggested querying just for one’s own lexemes that don’t have senses yet, but that turned out to be a problem due to a query service bug – see T209034. --Lucas Werkmeister (talk) 12:58, 16 November 2018 (UTC)

Also, I’d like to advertise the following snippet I added to my common.css:

/* make the “add form/sense” button *really* obnoxious if a lexeme doesn’t have any forms/senses yet */

.wikibase-lexeme-forms.wikibase-listview:empty + .wikibase-addtoolbar,
.wikibase-lexeme-senses.wikibase-listview:empty + .wikibase-addtoolbar {
  transform: scale(3);
  transform-origin: left top;
  margin-bottom: 6ex;

  border-radius: 2px;
  border-style: dotted;

  animation: 0.5s cubic-bezier(1,0,0,1) infinite alternate flash_yellow_background_red_border;
}

@keyframes flash_yellow_background_red_border {
  0% {
    background-color: transparent;
    border-color: transparent;
  }
  100% {
    background-color: yellow;
    border-color: red;
  }
}

With this, the “add sense/form” link will be huge and flashing bright yellow if a lexeme doesn’t have senses or forms yet, to remind you to add some. (For users of the Wikidata Lexeme Forms tool, this snippet became more useful with the change to move senses above forms last week, since it means the “add sense” link is no longer at the bottom of the page.) It’s probably not for everyone – in fact I originally wrote it half as a joke – but it’s actually grown on me, and I’d recommend you to at least try it out. --Lucas Werkmeister (talk) 12:58, 16 November 2018 (UTC)

@Lucas Werkmeister: Looks cool, thanks! KaMan (talk) 13:39, 16 November 2018 (UTC)

Linking with Wiktionary

Now we have nearly all main features (Forms, grammatical categories, Senses, translations...) I would like to have links to Wiktionary pages (wasn't it the main idea to have Wiktionary repository in Wikidata?). When (and in which form) is it planned? --Infovarius (talk) 15:11, 25 October 2018 (UTC)

@Infovarius: Wiktionary URLs are based on lemmata, so it's trivial to generate a link (one fitting your needs, either to the main lemma or to a specific form; not to senses, as Wiktionaries don't have a structure for senses, and some Wiktionaries have a different structure for anchor links to languages and sections inside a page). Someone from the Lexeme team can confirm (or deny), but I remember that there is no plan to explicitly store links to Wiktionary on Wikidata (some people did so with some hacks, though). Cdlt, VIGNERON (talk) 16:26, 25 October 2018 (UTC)
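Generating such a link from a lemma or form could be sketched as below. The helper wiktionary_url is hypothetical; it assumes the usual https://LANG.wiktionary.org/wiki/TITLE pattern with spaces turned into underscores, and it deliberately ignores section anchors, which differ between Wiktionaries.

```python
from urllib.parse import quote


def wiktionary_url(wiki_language, lemma):
    """Build a Wiktionary page URL from a lemma or form; spaces become underscores."""
    title = lemma.replace(" ", "_")
    return f"https://{wiki_language}.wiktionary.org/wiki/{quote(title)}"


print(wiktionary_url("de", "Bank"))
print(wiktionary_url("de", "Bänke"))
```

Note that this only produces a page address. It says nothing about whether that page exists or which of the lexemes sharing the spelling it describes.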

That's exactly the problem. Wiktionary pages are based on lemmata, so that one page can contain several lexemes with the same lemma. Wikidata lexicographical pages are based on lexemes and one page can contain several lemmata. Meaningful linking is in this situation surely not trivial. It could be done via statements (on WD side) and wikilinks in appropriate sections of each wiktionary, but I can't imagine the maintenance of such system.--Shlomo (talk) 05:45, 26 October 2018 (UTC)

Yes, there are several lemmata in "each" Wiktionary page, so I suppose we should have some template (using Lua access to WD Lexemes) in each section of it. But inversely, most Lexemes should correspond to unique Wiktionary page so it should be simple to add such linking. --Infovarius (talk) 08:07, 26 October 2018 (UTC)

Many Wikidata Lexemes (L-items) have multiple lemmas and may (or may not) correspond to several Wiktionary pages. We don't know.
The fact that there is a Wiktionary page with a name corresponding to a lemma doesn't mean that the Wiktionary page contains a section with a lexeme corresponding to the Wikidata L-item.

--Shlomo (talk) 09:16, 26 October 2018 (UTC)

Pardon, Shlomo? Can you please provide an example of a Lexeme which has multiple lemmas? I cannot imagine it... --Infovarius (talk) 12:12, 29 October 2018 (UTC)

Sure. Check these: colour/colour/color (L1347), color/colour/colour (L791), mazal/מזל (L12373), вода/voda (L2068), 大きな/おおきな (L661), מים/מֵם (L8305), ном/ᠨᠣᠮ (L7957).--Shlomo (talk) 16:45, 29 October 2018 (UTC)
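To illustrate why one lexeme with several lemmas can point at several Wiktionary pages, here is a small sketch using colour/colour/color (L1347). The language codes attached to each spelling are assumed for illustration, and candidate_page_titles is a hypothetical helper.

```python
# Lemmas of colour/colour/color (L1347); the language codes are assumed.
lemmas = {
    "en-gb": "colour",
    "en-ca": "colour",
    "en": "color",
}


def candidate_page_titles(lemmas):
    """Return the distinct spellings in first-seen order; each may be its own page."""
    return list(dict.fromkeys(lemmas.values()))


print(candidate_page_titles(lemmas))
```

Each distinct spelling is only a candidate: as noted above, a page with that title may exist without containing a section for this particular lexeme.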

And this is just for the main lemma (which has just an indicative value); each form is a different Wiktionary entry and each sense is a different section of these entries. I'll try to make a schema to make it clearer how complex linking is here. Cdlt, VIGNERON (talk) 11:51, 31 October 2018 (UTC)

We have no separate "forms" in ru-Wiktionary, so it's not a problem. Links to sections are not necessary either. But "color/colour" is a problem, yes. Maybe we can suppose they are not numerous and just ignore them? Infovarius (talk) 14:35, 1 November 2018 (UTC)

I created phab:T195411 a while back asking for a special page which could be used with Cognate to make it possible to navigate between Wikidata and Wiktionary. I get the impression the developers aren't convinced though. - Nikki (talk) 11:28, 26 October 2018 (UTC)

Yes, I think it's time to add them. Aren't they already all centrally stored? So one could easily display at least the ones leading to the Wiktionary in the same language. --- Jura 18:39, 25 October 2018 (UTC)
I've been somewhat disappointed that enwiktionary seems to have pretty much everything I look at covered already pretty well. However, I just worked on fencing (L33095) where I realized the number of senses one gets from wikidata items with that label is quite a bit more than what enwiktionary had. So there's hope that we actually can be useful beyond interlanguage linking :) ArthurPSmith (talk) 20:59, 25 October 2018 (UTC)
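ArthurPSmith, ти дійсно вважаєш, що англійський Вікісловник настільки повний? На жаль, все не так добре як би хотілось. (In English: "do you really think the English Wiktionary is that complete? Unfortunately, things are not as good as one would like.") Hopefully you get the point. :) --Base (talk) 14:22, 17 November 2018 (UTC)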

To answer Vigneron's question: there is no plan for a specific development to add Wiktionary links, as it can easily be covered by statements. This would also mean that we don't have to follow the 1-n rule of the Wikipedia interwikilinks. Depending on how you decide to model it, a Lexeme could link to several Wiktionary pages, and several Lexemes could link to the same Wiktionary page. Lea Lacroix (WMDE) (talk) 09:37, 26 October 2018 (UTC)

Sounds good. Thanks for the quick reply. --- Jura 10:51, 26 October 2018 (UTC)

Could you explain what you have in mind when you say it can be easily covered by statements? I can't see any sane way of doing it. - Nikki (talk) 11:28, 26 October 2018 (UTC)

In the FAQ, the very first question is "Why will this project be useful for Wiktionary editors?" and the reply talks about Indonesian Wiktionary populating Estonian words from Wikidata. Was that just a dream, or has anything like that been implemented? Maybe the reply should be changed into something that resembles actual reality? --LA2 (talk) 17:44, 26 October 2018 (UTC)

@LA2: as far as I know, nothing has been implemented yet (no surprise there, Lexemes are still at an early stage) but this section and the example still is true, it's up to the wiktionaries to decide to use Wikidata data (or not). Cdlt, VIGNERON (talk) 17:42, 27 October 2018 (UTC)

Yeah, I had it in mind! I thought that the initial plan was to have "central repository of lexicographical data" for Wiktionaries. But how can Wiktionary use it if they are not linked to each other? --Infovarius (talk) 12:12, 29 October 2018 (UTC)

@Infovarius: true, explicit links can make our life easier but per se, they are not needed to reuse Wikidata data. For proof, see all the templates on Commons who explicitly call for a specific Qid, Commons:Template:Creator or commons:Template:Artwork for the more common examples. Cheers, VIGNERON (talk) 11:51, 31 October 2018 (UTC)

It would be good to set up a version of Wikibase lexemes for Wiktionaries that are interested in having structured data without having to pay a high price. --- Jura 08:07, 30 October 2018 (UTC)

Showcase lexemes page

Hi! Looking at Wikidata:Showcase items I've created Wikidata:Showcase lexemes page with some initial proposition of criteria for inclusion. KaMan (talk) 17:21, 18 November 2018 (UTC)

Understanding properties and data model

I have been following the Lexeme project since its proposal. Over the years, the words used in the data model and properties have become too technical to understand for new people, as well as for a person like me who has no linguistic knowledge. Sometimes I could not even understand a property or where to use it. I am working with the small Gujarati Wiktionary and other Gujarati Wiki people who were waiting for Senses in order to add 200000 words from a public domain dictionary. Now we are stuck because it has become difficult to explain the data model and which part of a normal print dictionary should go where. Some technical terms like "Gloss" are difficult to understand and explain. A broad and simple explanation in the context of a print dictionary would be a great help to people like us. We tried to map our public domain print dictionary to Wikidata Lexemes (which thing should go where), but we are stuck. The print dictionary has a limited range of data in it. If we can map it, we might be able to create a simple editing tool via OAuth to edit Wikidata Lexemes without being confused by too many things while editing. We could not even figure out Gujarati labels for properties and other technical labels due to the lack of a simple explanation. In short, people need a simpler explanation in the context of a print dictionary, because not all editors are linguists. Properties should also be explained this way, with simple, clear examples. Can we have it? If Wikidata Lexemes wants to attract editors, it needs simplicity in explaining technical things. Regards,-Nizil Shah (talk) 05:47, 6 November 2018 (UTC)

Maybe it's easier if you list a sample entry and we try to figure out how to map it. The problem with properties is that many haven't been created yet. --- Jura 06:30, 6 November 2018 (UTC)
@Nizil Shah: Is Wikidata:Lexicographical data/Glossary in any way helpful? KaMan (talk) 11:23, 6 November 2018 (UTC)
The model of BhagwadGomadal Gujarati dictionary is something like this:

Word | Meaning No. (one word can have multiple meanings) | Origin: Origin Language + Origin Word in Gujarati with its Gujarati Meaning (there can be one or several origin words, each with its meaning) | Grammar Category | (Subject of the word for this meaning, e.g. Music or Computing) | Meaning: Gloss sentence? + Synonyms | More info: More info (detailed article-like info or a short 2-3 sentence note) + More info related sentences etc. + Meaning of these sentences | Example: Example sentence + Example sentence Reference | Multiple or single Phrases: Phrase No. + Phrase + Its Meaning + Explanation of the Meaning
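As a starting point for such a mapping, the model above could first be written down as plain records, for instance as in the sketch below. All class and field names are hypothetical and only mirror my reading of the columns listed above; nothing here says yet where each field would go in a Wikidata lexeme.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Origin:
    language: str   # origin language
    word: str       # origin word, written in Gujarati
    meaning: str    # Gujarati meaning of the origin word


@dataclass
class Example:
    sentence: str
    reference: str = ""


@dataclass
class Phrase:
    number: int
    text: str
    meaning: str
    explanation: str = ""


@dataclass
class Meaning:
    number: int
    grammar_category: str
    gloss: str
    subject: Optional[str] = None          # e.g. Music or Computing
    origins: List[Origin] = field(default_factory=list)
    synonyms: List[str] = field(default_factory=list)
    more_info: str = ""
    examples: List[Example] = field(default_factory=list)
    phrases: List[Phrase] = field(default_factory=list)


@dataclass
class Entry:
    word: str
    meanings: List[Meaning] = field(default_factory=list)


# Placeholder values only; not a real dictionary entry.
entry = Entry(word="(word)", meanings=[Meaning(1, "(category)", "(gloss)")])
print(len(entry.meanings))
```

Having the entries in a shape like this would make it possible to discuss the mapping field by field, one row of the model at a time.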

@KaMan, Jura1: It is complex and I have already drawn flowchart to organise the information but could not understand which data can be handled where in wikidata. Feel free to ask for clarification in above model. I will try to answer. The digital non-machine readable dictionary is available here.-Nizil Shah (talk) 07:20, 16 November 2018 (UTC)

@KaMan:Please explain with example: what is gloss and what is not?-Nizil Shah (talk) 07:38, 16 November 2018 (UTC)

How to handle two lexemes in derived from lexeme (P5191) property? For example, lexeme Grihpati is derived from grih + pati.-Nizil Shah (talk) 06:14, 19 November 2018 (UTC)

@Nizil Shah: Instead of derived from lexeme (P5191) you can use combines lexemes (P5238). See högskoleutbildning (L33696) for how it can be used.

As for the gloss, it is like the description in Q-items of Wikidata. Some of us use two or three words in such a description, some of us write full definitions in it, like in a classic dictionary. For examples of short glosses see Mars (L8627). For an example of a longer gloss see Polska (L9751). KaMan (talk) 07:05, 19 November 2018 (UTC)
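The Grihpati case could be pictured as two combines lexemes (P5238) statements, one per part. The sketch below assumes the parts are ordered with a qualifier such as series ordinal (P1545); the lexeme IDs are placeholders, since the real IDs for grih and pati are not given in this discussion.

```python
# Placeholder lexeme IDs; the real IDs for "grih" and "pati" are not known here.
parts = [
    {"property": "P5238", "value": "L_PATI", "series_ordinal": 2},
    {"property": "P5238", "value": "L_GRIH", "series_ordinal": 1},
]


def ordered_parts(statements):
    """Return the combined lexeme IDs sorted by their ordering qualifier."""
    ordered = sorted(statements, key=lambda s: s["series_ordinal"])
    return [s["value"] for s in ordered]


print(ordered_parts(parts))
```

The ordering matters because the statements themselves have no guaranteed order, while the compound is grih followed by pati.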

How to deal with en.wiktionary's multiple-words-per-etymology scheme?

As you may know, en.wiktionary's entry layout prioritises etymologies over lexemes: the lexeme is nested under the etymology, and in many cases multiple lexemes are nested under the same etymology. I find this problematic, so I recently started a discussion on en.wiktionary regarding a move towards a more lexeme-centered approach to etymologies. This would better match how Wikidata lexemes are structured, and IMO also better reflect the reality that no two lexemes truly have the exact same etymology. There is some support for the change, but also some opposition, as you can see in the discussion. The main point given by opposers is that it is valuable to group etymologically related words together, which I do not dispute as such, but my counterpoint is that these words still have separate etymologies even if they are related, and this difference would be expressed in Wiktionary's etymologies if those words appeared on different pages (i.e. their lemma forms weren't homographs).

Because Wikidata has a lexeme-oriented approach to etymologies, how does this mesh with Wiktionary's format? How would Wikidata handle etymologies for multiple homographic lexemes that Wiktionary has merged into one common etymology (e.g. wikt:fine or wikt:حرس)? Are there other Wiktionaries that use this structure? How can Wikidata's etymological data be useful to Wiktionaries if there is such a big difference in data structure? And what is your take on en.wiktionary's approach in the first place, versus Wikidata's approach? (You can comment in the Wiktionary discussion, but I'm not trying to forum shop here, just trying to get different perspectives and solutions.) —Rua (mew) 13:50, 20 November 2018 (UTC)

@Rua: I have to say I like the enwiktionary organization, though (it is a wiki) I can't say I agree with every distinction or grouping made there. By "multiple lexemes nested under the same etymology" you are referring to homographs that are in different lexical categories, right? I think I would like it even more if we could do something similar without requiring the words to be homographs - i.e. grouping all the words with a similar original etymology together somehow. Simple cases like geology, geologist and geological seem like they should be grouped together, but we don't really have a way to do that here or in wiktionary right now. I agree that each lexeme does separately have its own etymological origin which we can represent just fine in Wikidata, so what's the best way to represent the relationship? For example it doesn't make much sense to repeat essentially the same etymology on each lexeme. Is there generally one lexeme that would be considered the 'original' in the language, from which the others are then derived? ArthurPSmith (talk) 16:30, 20 November 2018 (UTC)

That is usually my experience, yes. For the example of fine, in the first etymology section, all the other lexical categories derive originally from the adjective, which is first. It gets a bit more messy with things like that, which is both a pronoun and a determiner but could still be considered one lexeme. The same with prepositions: the two lexical categories of under, although clearly separate in meaning and use, can't be easily separated etymologically because the lexical boundary between adverbs and prepositions is rather fluid in most Indo-European languages. Going back to Proto-Indo-European, there wasn't even any distinction between prepositions and adverbs to begin with. —Rua (mew) 18:03, 20 November 2018 (UTC)

As for grouping related terms, you have to ask how related. English Wiktionary has categories for words derived from a particular PIE root, which is a sensible category of relationship I think. But a root is not really special, any morpheme can establish a relationship in some way. brotherhood and sisterhood are related by having a common suffix -hood, which English Wiktionary also has categories for. However, you can take this quite far. For something like geo-, do you consider every word with that morpheme to be related to each other? geothermal, geometry etc? And the same with -log-, the suffix? What about cases where words once contained a morpheme, but it has eroded so far down that it's not considered a morpheme anymore? —Rua (mew) 18:07, 20 November 2018 (UTC)

Input needed regarding IPA transcriptions

Some input is needed in Property talk:P898#Include delimiters?, about how to best express the distinction between phonemic and phonetic transcriptions. The linguistic standard is to enclose these in / / and [ ] respectively, but someone has also suggested using a qualifier instead. —Rua (mew) 18:15, 20 November 2018 (UTC)
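To make the delimiter convention concrete, here is a minimal sketch that classifies a transcription by its enclosing characters. The function name and the "unmarked" label are hypothetical; it only illustrates the / / versus [ ] distinction and says nothing about how the property should finally be modelled.

```python
def transcription_kind(ipa):
    """Classify a transcription by its enclosing delimiters."""
    text = ipa.strip()
    if len(text) >= 2 and text[0] == "/" and text[-1] == "/":
        return "phonemic"
    if len(text) >= 2 and text[0] == "[" and text[-1] == "]":
        return "phonetic"
    return "unmarked"


for sample in ["/kæt/", "[kʰæt]", "kæt"]:
    print(sample, transcription_kind(sample))
```

A value stored without delimiters falls into the "unmarked" case, which is exactly the situation where a qualifier would be needed to tell the two kinds apart.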

Marking of language genre/style/variety

In Danish dictionaries, a word may be associated with an indication of the style of usage of the word. For instance, vovhund (L194) would be associated with children's language (Q1741898) and fucking (L37283) associated with vulgarism (Q1521634). Other possibilities could be slang (Q8102), technical jargon, informal, etc. For vovse (L128), I have been using has characteristic (P1552) setting it to children's language (Q1741898). Is that property the best way or have we another property or way? — Finn Årup Nielsen (fnielsen) (talk) 17:05, 15 November 2018 (UTC)

has characteristic (P1552) doesn't seem quite right; maybe part of (P361) or a (new?) subproperty would be better? In some cases instance of (P31) might also work. Also this seems like it would be more associated with specific senses than the lexeme as a whole, generally, no? ArthurPSmith (talk) 18:53, 15 November 2018 (UTC)

Of the properties already available, I would agree that part of (P361) is the most appropriate. It should also definitely be placed on specific senses, not on the lexeme. Liamjamesperritt (talk) 08:08, 16 November 2018 (UTC)

I don't find part of (P361) appropriate for vulgarism (Q1521634), euphemism (Q83464) or humorous (Q58233068). I prefer instance of (P31) and I already used it a lot. In Template:Lexicographical properties there is section for categorisation of senses. KaMan (talk) 08:19, 16 November 2018 (UTC)

Maybe a new property for this? Somehow I'd have thought that senses would have a field for that, just like lemma and forms have one. Maybe we just have to wait till the GUI is finished before adding senses. --- Jura 06:29, 17 November 2018 (UTC)

Thanks for the input. I have now taken the liberty to suggest a dedicated property for these cases, and it would be interesting to have your feedback on whether this approach is the way forward. I now see that it should be on the sense (as noted above), not the lexeme. — Finn Årup Nielsen (fnielsen) (talk) 18:39, 20 November 2018 (UTC)

The property suggestion is here: Wikidata:Property proposal/language use. — Finn Årup Nielsen (fnielsen) (talk) 12:39, 23 November 2018 (UTC)

Sense-writing guide?

Could someone write a guide of some sort to document the best practices for writing senses? I'm not really sure what a proper sense should look like, especially for non-English, so I've avoided writing any. I'm only used to writing definitions for en.wiktionary, but I got the impression that senses should not be written that way. —Rua (mew) 12:54, 16 November 2018 (UTC)

The guidance given by the GUI seems to be that it should have 17 characters maximum (or is my browser broken?). We still need a couple of properties to add senses in a structured way. Supposedly QS will support them too, so you might want to simply link wiktionary at this stage. --- Jura 06:32, 17 November 2018 (UTC)
- I can see about 45 characters for the gloss fields. I think the interface has issues though. If I used the size of the input fields as a guide, I wouldn't be able to add lemmas with more than 4 characters. - Nikki (talk) 13:55, 19 November 2018 (UTC)
- I think you are probably using the wrong interface to enter lemmas: I get >80 chars. --- Jura 16:22, 21 November 2018 (UTC)
  - I'm talking about the lexeme-specific UI on lexeme pages. The page for creating a lexeme is fine, but that seems to use normal MediaWiki form fields. - Nikki (talk) 16:50, 21 November 2018 (UTC)
    - The approach you are trying to use is not the primary way to enter lemmas. --- Jura 17:16, 21 November 2018 (UTC)

Amend properties for use in lexemes

Please make the required changes to transliteration or transcription (P2440) for its use in lexemes. Add the required properties to it, with appropriate changes in constraints.
language of work or name (P407) was suggested for use in lexemes during Wikidata:Property proposal/derived from language(s). It also needs changes.
writing system (P282) (script) would be used with transliteration or transcription (P2440), so it also needs changes.
The determination method (P459) qualifier would also be used with transliteration or transcription (P2440), so it probably needs some changes too. Usage example: શ્રુતિ (L1992).
main subject (P921) needs changes too. Lexemes may need subjects like computing, music, economics. They are used in dictionaries, e.g. trojan.

Regards,-Nizil Shah (talk) 05:22, 19 November 2018 (UTC)

I'm unsure about the need for transliteration or transcription (P2440). IIRC some proposed to use another lemma and form representation for it.
What changes to language of work or name (P407) do you propose? I used it in biały (L17693) in derived from lexeme (P5191) without problems.
I'm also unsure about the need for main subject (P921). I agree that classic dictionaries have it, but we have item for this sense (P5137) and all the statements in the target item, so do we really need to duplicate statements of concepts?

KaMan (talk) 07:14, 19 November 2018 (UTC)

If we are going to use transliteration or transcription (P2440), it should really go on the forms, since each form will have a different transliteration. I do think it would make sense to have a property for marking senses which are specific to a particular field, but I don't think main subject (P921) is the right one to use for that, the semantic relation between works and subjects seems completely different to the one between senses and fields to me. - Nikki (talk) 13:12, 19 November 2018 (UTC)

Note that we have ISO 15919 transliteration (P5825) for ISO 15919 transliterations. - Nikki (talk) 13:12, 19 November 2018 (UTC)

main subject (P921) is entirely improper for lexemes. The closest we have is really facet of (P1269). A property specific to lexemes would be needed for field-of-use labels. Circeus (talk) 11:13, 25 November 2018 (UTC)

1000 usage examples

I reached 1000 usage examples in Polish (all sourced) and I'm kind of surprised that adding them isn't popular. There are only 22 in English and 0 in French. I would like to somehow encourage people to add them. KaMan (talk) 09:43, 22 November 2018 (UTC)

I think it isn't popular because in Wiktionary a usage example (P5831) is added to the sense. Using subject sense (P6072) this way is counter-intuitive. Don Rumata 11:02, 22 November 2018 (UTC)

It depends on the Wiktionary version. I find subject sense (P6072) very easy to use, and the number of my usage examples shows it. KaMan (talk) 11:16, 22 November 2018 (UTC)

No, it adds a mess and makes it difficult to understand what an example illustrates. Don Rumata 11:25, 22 November 2018 (UTC)

Not at all. Just try it: it displays the sense right inside the field of the statement. And when you click it, it takes you to the sense declaration. KaMan (talk) 11:28, 22 November 2018 (UTC)

It doesn't display the sense, it displays something like "L2981-S5". To see what L2981-S5 means, the user has to move to another section of the page. Using a link can make the navigation easier, but it's still another place. You can see the problem even better the other way round: Some user (or script) looks at the senses and wants to see an example for this sense. He has to scroll up the page and look through all the examples to see if some of them applies to the particular sense he's interested in. Moreover, the way you use the statement (e.g. in styczniowy (L2981)) doesn't provide the information, whether the purpose of the example is to demonstrate the use of a particular form or the use of the lexeme in a particular sense, or some property of the whole lexeme. Using the examples on appropriate place would easily resolve all these issues.--Shlomo (talk) 08:59, 23 November 2018 (UTC)

No. It displays "L2981-S5" only if there is no gloss in your language. When a gloss is defined in your language, it displays that instead. Once the UI is improved to consider your preferred languages, like in Q-items, it will be better. KaMan (talk) 09:32, 23 November 2018 (UTC)
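The display behaviour described here (show the gloss if one exists in the reader's language, otherwise show the bare sense ID) amounts to a simple fallback. The sketch below uses a sense record in the shape of a sense's "glosses" map; the gloss text is a placeholder rather than the real content of L2981-S5, and sense_label is a hypothetical helper, not the actual UI code.

```python
def sense_label(sense, language):
    """Return the gloss in the given language, falling back to the sense ID."""
    gloss = sense["glosses"].get(language)
    return gloss["value"] if gloss else sense["id"]


# The gloss text is a placeholder, not the real gloss of L2981-S5.
sense = {
    "id": "L2981-S5",
    "glosses": {"pl": {"language": "pl", "value": "(Polish gloss)"}},
}

print(sense_label(sense, "pl"))
print(sense_label(sense, "en"))
```

A reader whose interface language has no gloss therefore sees only the ID, which is the situation complained about above.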

The big number of your usage doesn't indicate that your practice is good, it just indicates you're happy with it. The low number of other people's usage on the other hand can indicate they're not so happy. Maybe it's time to consider the possibility, that they may have a good reason not to be.--Shlomo (talk) 08:59, 23 November 2018 (UTC)

No. Placing usage examples in senses is already possible, and I counted them. Yesterday there were only 10 usage examples in senses. KaMan (talk) 09:32, 23 November 2018 (UTC)

Stop deleting them from senses and you will see what happens. Don Rumata 13:06, 23 November 2018 (UTC)

Adding usage examples to senses shouldn't be a problem, as long as subject form (P5830) is also added as a qualifier. Forms also should be included in usage examples (hence the original idea of adding them to the lexeme). Liamjamesperritt (talk) 04:07, 24 November 2018 (UTC)
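The qualifier rule suggested here could be checked mechanically, roughly as sketched below. The assumption, which is my reading of the thread rather than an agreed rule, is that an example stored on the lexeme needs both subject form (P5830) and subject sense (P6072), while an example stored on a sense only needs the form. The form ID is hypothetical.

```python
SUBJECT_FORM = "P5830"
SUBJECT_SENSE = "P6072"


def missing_qualifiers(qualifiers, placed_on):
    """List the qualifier properties a usage example still needs.

    placed_on is "lexeme" or "sense". An example stored on a sense is assumed
    to identify its sense already, so only the form qualifier is required.
    """
    if placed_on == "sense":
        required = {SUBJECT_FORM}
    else:
        required = {SUBJECT_FORM, SUBJECT_SENSE}
    return sorted(required - set(qualifiers))


# L2981-F1 is a hypothetical form ID; L2981-S5 is the sense ID quoted above.
print(missing_qualifiers({"P6072": "L2981-S5"}, "lexeme"))
print(missing_qualifiers({"P5830": "L2981-F1"}, "sense"))
```

Either placement ends up carrying the same information, which is why the choice is mostly a question of where readers and scripts expect to find the example.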

@KaMan: Thanks for your work on that. I have been focusing in other areas (like adding senses in English - over 2500 now), but I think you have set a good pattern for how to do this, which is definitely important going forward. ArthurPSmith (talk) 16:16, 23 November 2018 (UTC)

Nice work. I think it's a good idea in general. As I haven't added any to Wiktionary, I couldn't really mass-import them. I'm not entirely convinced by the sense you give to "referenced" in that context. --- Jura 10:58, 25 November 2018 (UTC)
@Jura1: Thanks, but I don't understand your comment. Why do you mention Wiktionary here? I do not import any examples from Wiktionaries. The sense of "referenced" is that the examples are taken from real, live language in national text corpora. KaMan (talk) 11:12, 25 November 2018 (UTC)
- I see. Seems reasonable. In that case, the sample would be an attestation that is considered especially illustrative. Good to hear Polish Wikipedia has made it into the national text corpora. --- Jura 05:35, 26 November 2018 (UTC)

BabelNet IDs for senses

Latest comment: 5 years ago3 comments2 people in discussion

Should we add BabelNet ID (P2581) to senses of lexemes? Liamjamesperritt (talk) 04:14, 24 November 2018 (UTC)

For English only lexemes? KaMan (talk) 11:15, 25 November 2018 (UTC)

For any sense of a lexeme that is found in BabelNet (BabelNet is multilingual). Is it OK to add such external identifiers to senses? Liamjamesperritt (talk) 20:29, 25 November 2018 (UTC)

A proposed U.S. Geologic Names Lexicon ID

Latest comment: 5 years ago3 comments2 people in discussion

Does the creation of a proposed U.S. Geologic Names Lexicon ID fall within this project's domain? Please advise so I can include the appropriate parties in the discussion at U.S. Geologic Names Lexicon (Q59159827). Many thanks. -Trilotat (talk) 16:30, 24 November 2018 (UTC)

What does an example entry look like in this lexicon? KaMan (talk) 11:13, 25 November 2018 (UTC)

Any of the lithostratigraphic units, e.g. Aztec Sandstone at reference. I ask, because the site is a "lexicon" described as a "National compilation of names and descriptions of geologic units." Perhaps they misuse the term "lexicon". -Trilotat (talk) 06:07, 26 November 2018 (UTC)

Additional ISO 639-3 languages

Latest comment: 5 years ago13 comments6 people in discussion

What needs to happen for Lexeme to accept additional languages? I am going to be presenting to some speakers of the (indigenous) Noongar (or Nyungar) language of Western Australia, but Lexeme does not accept 'nys' as a valid language code. Ijon (talk) 00:11, 14 November 2018 (UTC)

There are still many language codes that have not yet been added to Wikidata. You can request to add language codes here. Liamjamesperritt (talk) 08:38, 14 November 2018 (UTC)

That page describes the process specifically for monolingual text properties. That's not what lexemes use. Perhaps @Lea Lacroix (WMDE): can tell us what the process for lexemes is. - Nikki (talk) 09:07, 14 November 2018 (UTC)

Could you describe what you're doing? nys is in the list of available languages, so it should be possible to use it. - Nikki (talk) 09:07, 14 November 2018 (UTC)

@Ijon: I tested the following process:

Enter the lemma
In the field Language of Lexeme, type "Noongar" and select the item
In the field Spelling variant of the lemma that appears, type "nys" and select the language
Enter the lexical category

It seems to work. At which stage do you encounter an issue? Lea Lacroix (WMDE) (talk) 09:52, 14 November 2018 (UTC)
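The steps above can be sketched as the data a new lexeme ends up carrying. This is a minimal sketch of the general shape of a lexeme record (lemma keyed by its language code, plus a language item and a lexical category item), not a statement of what the creation form sends. The helper `build_lexeme` is hypothetical, and the two item IDs are deliberately placeholders because the real Q-IDs are not given in this discussion.

```python
import json


def build_lexeme(lemma, lemma_language_code, language_item, lexical_category_item):
    """Assemble a lexeme-shaped dict from the four values entered in the form.

    lemma_language_code: the spelling-variant code, e.g. "nys".
    language_item / lexical_category_item: item IDs (placeholders here).
    """
    return {
        "type": "lexeme",
        "lemmas": {
            lemma_language_code: {
                "language": lemma_language_code,
                "value": lemma,
            }
        },
        "language": language_item,
        "lexicalCategory": lexical_category_item,
    }


lexeme = build_lexeme(
    lemma="balga",
    lemma_language_code="nys",
    language_item="Q_NOONGAR_PLACEHOLDER",        # placeholder, not a real ID
    lexical_category_item="Q_CATEGORY_PLACEHOLDER",  # placeholder, not a real ID
)
print(json.dumps(lexeme, indent=2))
```

Note that the language code appears twice: once as the key under `lemmas` and once inside the lemma value, which is why an unsupported code blocks the lemma even when the language item itself exists.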

It was in adding a 'nys' gloss to an existing lexeme in another language. 'nys' was not accepted as a language there. Ijon (talk) 12:28, 15 November 2018 (UTC)

That's weird, I just tested here and it works. What kind of error do you get? Lea Lacroix (WMDE) (talk) 14:18, 15 November 2018 (UTC)

@Lea Lacroix (WMDE): I notice you didn't answer Ijon's original question. How do we get missing languages added? Noongar might already be there, but there are already plenty of others we're missing. - Nikki (talk) 21:46, 15 November 2018 (UTC)

The list is the same as for monolingual text. For now, we should apply the same process. In the next weeks we will also start a discussion about how to improve the existing process. Lea Lacroix (WMDE) (talk) 09:45, 20 November 2018 (UTC)

@Lea Lacroix (WMDE): can you confirm that Lexicographical data uses the same set of languages as monolingual text? In phab:T210293, Nikki wrote “Lexemes don't use the monolingual text list of languages, so adding these [= the languages that were requested on the ticket] for monolingual text won't make them available for lexemes.” Do you know which files need to be changed for phab:T210293? Sascha (talk) 19:22, 29 November 2018 (UTC)

It doesn't use the list for monolingual text. You said in phab:T195740 that it's not the same as the list for monolingual text and I'm still not able to use language codes which were added for monolingual text, so the situation doesn't seem to have changed. For example, Southern Ndebele was added for monolingual text a long time ago (phab:T155430) but was not supported when I reported phab:T209282 a couple of weeks ago. - Nikki (talk) 21:09, 29 November 2018 (UTC)

I've just created balga (L37468) in Nyungar without problem. Meanwhile, it is always possible to use the general code "mis" (alone or with a private-use subtag for precision: mis-x-Qid) as a (more or less) temporary solution if no ISO code is working/available. Regards, VIGNERON (talk) 07:38, 16 November 2018 (UTC)
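The "mis-x-Qid" workaround mentioned here can be sketched as a small helper that builds the private-use code from a language item ID. This is an illustration only: the function name `private_use_code` is hypothetical, the ID pattern check is an assumption about what a well-formed Q-ID looks like, and "Q12345" is an arbitrary example rather than the item for any particular language.

```python
import re

# Assumed shape of an item ID: "Q" followed by a positive integer.
_QID_PATTERN = re.compile(r"Q[1-9][0-9]*")


def private_use_code(language_qid=None):
    """Build a fallback language code based on "mis".

    With no item ID, return the bare "mis" code; otherwise append the
    item ID as a private-use subtag, giving "mis-x-<Qid>".
    """
    if language_qid is None:
        return "mis"
    if not _QID_PATTERN.fullmatch(language_qid):
        raise ValueError(f"not a well-formed item ID: {language_qid!r}")
    return f"mis-x-{language_qid}"


fallback_plain = private_use_code()             # bare "mis"
fallback_precise = private_use_code("Q12345")   # arbitrary example ID
```

The private-use form keeps different uncoded languages distinguishable from one another, which the bare "mis" code cannot do.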

Thanks, everyone, it does seem to work now. Perhaps it was a connectivity issue disguised as a general error. I was on very dodgy WiFi. Cheers. Ijon (talk) 16:58, 22 November 2018 (UTC)

Phoneme, grapheme and Wikidata:Lexicographical data/Notability

Latest comment: 5 years ago18 comments4 people in discussion

Hi, Jura1 does not agree with what I added to Wikidata:Lexicographical data/Notability. I was hoping that the vote would close the debate, but that doesn't seem to be the case. Do any of you disagree with the version I added? Pamputt (talk) 10:47, 25 November 2018 (UTC)

I think you opened a discussion about phonemes, didn't address any questions that were raised, wrote a summary that didn't reflect the topic of the discussion you opened and, further on, accused people who disagreed with you of trolling. All this in a field to which you hardly contribute actively. --- Jura 10:53, 25 November 2018 (UTC)
@Jura1: Could you be more precise on each point you raise?
"didn't address any questions that were raised"

Which question was not addressed? It seems to me the question of the vote was pretty clear.
"wrote a summary that didn't reflect the topic of the discussion you opened"

Could you give a diff of what you are talking about?
"accused people who disagreed with you of trolling"

I explained to you why, and it looks like you have not changed your way of discussing.
"All this in a field to which you hardly contribute actively."

It does not matter (this is not an argument); all the voters contribute actively and supported the position that phonemes (and by extension graphemes) are not part of the lexeme namespace. Pamputt (talk) 20:47, 25 November 2018 (UTC)
Jura1's recalcitrance is wasting our time. The question in the vote was clear, the votes were clear, the result was clear, and Jura1 did not raise any question during nearly a month of voting. The problem is that the problematic lexemes raised in the vote were authored by Jura1, and he seems unable to maintain an objective POV. I strongly support Pamputt's actions. KaMan (talk) 11:04, 25 November 2018 (UTC)
Thanks KaMan. I will wait for other opinions and, if we all agree, I will think about what to do. Pamputt (talk) 20:47, 25 November 2018 (UTC)
- I find this comment highly inappropriate, not only towards myself but also towards other users who contributed in the field. Supposedly KaMan already considers having successfully discouraged others from participating. It seems like neither they nor Pamputt are actually interested in contributors' opinions. This may be consistent with Pamputt hardly contributing to lexicographical data in Wikidata, and with actual contributors not being pinged. --- Jura 05:31, 26 November 2018 (UTC)
  - Jura1, I asked you above to develop your arguments but you have not done so... Pamputt (talk) 06:41, 26 November 2018 (UTC)

Phonemes don't have forms, as they are aural entities and have no consistent written representation in language (IPA is an alphabet, not a language), nor senses, as they are meaningless by themselves, so I agree that they don't really belong as lexemes. The lexeme namespace should not be used to describe every facet of language, but just lexemes in each language (the main namespace can cover the rest). Liamjamesperritt (talk) 21:55, 25 November 2018 (UTC)

@Liamjamesperritt: phonemes can be represented in different ways, each can be a separate form. The approach varies by language, so storing samples and other information about them in L-namespace seems much more convenient than in Q-namespace. The sense feature can easily link back to a general item. However, here the question is mainly if we voted about graphemes or Pamputt replied to questions they were asked. --- Jura 05:31, 26 November 2018 (UTC)

@Jura1: which questions are you talking about? This is the third time I have asked for this information. Pamputt (talk) 06:41, 26 November 2018 (UTC)

@Jura1: I think I see where you're coming from in terms of convenience, however I think that adding the ways that phonemes can be represented in different languages as forms is an abuse of the data model (as forms should be referring to the way a lexeme can be inflected within a given language). That's just my opinion of course. And on the subject of the voting, etc.: I'm not familiar with what has already taken place, although let this be my vote, for whatever it's worth. Liamjamesperritt (talk) 05:40, 26 November 2018 (UTC)

@Liamjamesperritt: it's really up to us to give substance to the technical data model. I think forms should be used to store strings how they are actually being found, be it in text or phonetic transcription. --- Jura 05:45, 26 November 2018 (UTC)
@Jura1: That's fair enough I suppose. However, the other issue is that the data model requires lexemes to be local to a specific language, and phonemes are language independent. I just can't reconcile the IPA being used as a language, since it is just not a language. Language implies meaning being conveyed, and the phonemes in the IPA are individually meaningless. So, I still don't consider phonemes to be valid lexemes. But it's not my decision, as you said it is up to all the contributors to decide the data model, and it could be convenient (although I place convenience below validity in importance). Liamjamesperritt (talk) 05:55, 26 November 2018 (UTC)

@Liamjamesperritt: here is a sample: invalid ID (L21067). I don't think the language should be IPA. --- Jura 06:06, 26 November 2018 (UTC)

Jura1, I would be interested to understand why you cannot achieve what you want to do by storing phonemes in the Q-namespace. I do not see why you cannot use open-mid front rounded vowel (Q80731). Why is invalid ID (L21067) better than open-mid front rounded vowel (Q80731)? Pamputt (talk) 06:41, 26 November 2018 (UTC)

@Jura1: I see. I was going off of invalid ID (L21070). But invalid ID (L21067) is certainly more acceptable in my opinion as it is local to a language. However I still personally feel that the 'Sense' is being abused, as it is not a meaning, but only a link to a Q-item describing the phoneme. Although I see that having text representations could be useful, the forms of invalid ID (L21067) are just restating the information from the Q-item, especially the X-SAMPA (Q614484) (which is clearly not a Grammatical Feature). Lexemes need not be abused and treated like Q-items (as has been done with invalid ID (L21067) as well as with -are (L1654)); that's what Q-items are there for. Liamjamesperritt (talk) 07:00, 26 November 2018 (UTC)
@Liamjamesperritt: well, the sample is language-specific, and if we add more forms that are language-specific, they wouldn't be of much help on a non-language-specific item, and the features of the lexeme namespace wouldn't work there. For English, some or all of those listed on w:Pronunciation_respelling_for_English#Traditional_respelling_systems could be included. I suppose it could be seen as abusive if the entity didn't clearly define its features, but I think it does here. Do you see any definition issues? (e.g. if one would call it an "acronym", see section below). --- Jura 07:11, 26 November 2018 (UTC)

@Liamjamesperritt: invalid ID (L21070) was changed that way by @Infovarius:, maybe he has an explanation for it. --- Jura 07:11, 26 November 2018 (UTC)

(in English) Is "B" an acronym, a letter, a grapheme, a noun, or all four?

Latest comment: 5 years ago5 comments4 people in discussion

Currently, it seems to be a noun or an acronym ( https://www.wikidata.org/w/index.php?title=Lexeme:L4624&oldid=722004492 ) or a letter/grapheme ( https://www.wikidata.org/w/index.php?title=Lexeme:L20818&oldid=745745418 ).

I'm somewhat dubious about a single letter being an acronym, but supposedly by extension it could be. Supposedly any letter/grapheme can be considered a noun, but I'm somewhat dubious about this being the main use in a dictionary.

@SR5: seems to be active in the field. --- Jura 05:31, 26 November 2018 (UTC)

Talking about English, it is probably a noun, an abbreviation, and a symbol. If linguists specializing in English consider "letters" a separate word class (part of speech) rather than a more or less special case of common noun, let it be a letter; I haven't found any basis for that yet, though.

I don't agree that any grapheme can be considered a noun; the English name of the letter B is "B" with plural being "Bs" or "B's", which implies it can be a noun. On the other hand, the English name of the grapheme ; is "semicolon" and you probably won't find a dictionary which lists ";" as a lexeme or ";s" as its plural form. The name of the letter Д is "Д" in Russian (at least according to some dictionaries), making it a noun in Russian; I doubt that "Д" is a legitimate lexeme in English.--Shlomo (talk) 07:29, 26 November 2018 (UTC)

I would say that in Polish "B" could be two lexemes: one a noun meaning the letter, and one an acronym meaning whatever the acronym stands for (if there were such an acronym in Polish). Fortunately, I don't know of any acronym B in Polish, only in multiple languages (Q20923490) (like blood type, byte, boron), so I would stay with a single lexeme for the noun denoting the letter. For "A" I have at least five meanings as a noun: https://sjp.pwn.pl/sjp/;2438340.html KaMan (talk) 07:56, 26 November 2018 (UTC)

B as blood type is more a symbol than an acronym; the same may also apply to byte and boron, I'm not sure though. Anyway, it's only a matter of terminology, IMHO not an essential one.--Shlomo (talk) 09:33, 26 November 2018 (UTC)

An acronym also has a lexical category separate from it being an acronym. Acronyms can have plurals and other kinds of inflections, and of course within a sentence it acts like the lexical category it represents. So using "acronym" as a lexical category is wrong. The same reasoning can be applied to letters and graphemes as well. —Rua (mew) 12:43, 26 November 2018 (UTC)