Wikidata talk:Lexicographical data/Archive/2019/11

This page is an archive. Please do not modify it. Use the current page, even to continue an old discussion.

Process for adding a new language code?

Can you point me to the documentation for adding a new language option for strings to Wikidata? Is having an official ISO code a prerequisite?

The language in question is the constructed language Toki Pona. Currently, I'm using "mis-x-Q36846" as a language identifier for the lemma (like it's done in L1), and "mis" for the usage example (because that field won't accept the longer version), see the lexeme pona. Being able to use "tok" would be great. Does that seem feasible? --blinry (talk) 15:07, 27 October 2019 (UTC)

It's a bit of a mess - see Wikidata talk:Identify problems with adding new languages into Wikidata - that links to the current process which involves creating a ticket on Phabricator. It may take several months... ArthurPSmith (talk) 19:04, 28 October 2019 (UTC)
But the criteria seem to be fulfilled in this case! Thanks a lot, Arthur, I opened a ticket! --blinry (talk) 18:09, 30 October 2019 (UTC)
Didn't we just remove Tokemon as valid language? --- Jura 08:12, 31 October 2019 (UTC)
@Jura1: Tokemon? I don't know Tokemon. But Toki Pona (Q36846) is clearly a valid language for Lexeme, that said there is no ISO 639 code (yet) for this language. As the last application for an ISO code was in 2007 and the rejection said « If Toki Pona survives the next few years and continues to develop, both in applications and in user base, then the RA will be open to consider a new request for assignment of a code element for Toki Pona. » indeed a new application should be done (which in turn would unblock the creation of a Wikimedia code). Cheers, VIGNERON (talk) 07:31, 3 November 2019 (UTC)
In this case, I suppose we can wait until they have assessed that; otherwise we will end up deleting things once again. If WMF deletes it in the meantime, one can assume that they might have come to the opposite conclusion.
As we already have problems defining codes for monolingual text and lexemes for valid cases (taking months or years), I don't think we should repeat the experiment with that code. --- Jura 07:51, 3 November 2019 (UTC)

Lexeme mistakes

Referring to Wikidata:Project_chat#Lexeme_mistakes

Hi there, I tried to fix the label of Ägyptologe (L72724) to uppercase "Ägyptologe". I was using Help:QuickStatements with the following command:

L72724	Lde	"Ägyptologe"

I got an error, where is my mistake? Bigbossfarin (talk) 22:04, 5 November 2019 (UTC)

@Bigbossfarin: "Lde" is for label but Lexemes have no label. AFAIK, QuickStatements doesn't not support lemma in Lexemes (and I don't know other tool that does it either). Cheers, VIGNERON (talk) 08:38, 6 November 2019 (UTC)

Why does Kadazandusun (Q5317225) have a spelling variant option?

Kadazandusun (Q5317225) is written in Latin script (Q8229) only. So, there's no need for the "Spelling variant of the Lemma" option. Can anyone help with this? --Tofeiku (talk) 13:01, 8 September 2019 (UTC)

What's the problem? Just don't use it. --Infovarius (talk) 15:39, 11 September 2019 (UTC)
I mean, is there a way to make the option disappear? I don't see this option when I'm adding Malay or English lexemes. That's why I asked. If it's not possible then never mind. --Tofeiku (talk) 09:45, 12 September 2019 (UTC)
Also, I'm trying to add Brunei Bisaya (Q3450611) lemmas into Lexeme but it says "The supplied language code was not recognized." and I need to choose an option in "Spelling variant of the Lemma". So which option should I choose? --Tofeiku (talk) 06:44, 14 September 2019 (UTC)
Please give example Lexemes to see the situation. --Infovarius (talk) 22:32, 14 September 2019 (UTC)
I want to add the word "lampun", which is a Brunei Bisaya (bsb) noun meaning "durian". --Tofeiku (talk) 12:01, 15 September 2019 (UTC)

I think this is the same problem as the one described below. That is, when a new lexeme is being created, the form automatically shows an additional field called "Spelling variant of the Lemma" if the "Language of the Lexeme" field is filled with some languages, which is very confusing. You can try it out by entering e.g. English (Q1860) vs Kadazandusun (Q5317225) in the language field of the lexeme creation form. I get the same issue with Cape Verdean Creole (Q35963). I'll comment on the section below where the issue is more clearly identified. --Waldyrious (talk) 17:29, 15 November 2019 (UTC)

Spelling variant of the Lemma option

I created sebang (L184375), which is a West Coast Bajau (Q2880037) lexeme. But when I want to create a lexeme in that language, a "Spelling variant of the Lemma" option shows up and I need to choose one. This language is written in the Roman/Latin script, so the only option that I could choose there is "mis". Can anyone help? --Tofeiku (talk) 04:37, 22 September 2019 (UTC)

You should probably ask the Language Committee (@Amire80:) to add "bdr" as a Wikidata language code. --Infovarius (talk) 16:57, 24 September 2019 (UTC)
I'll add it to Universal Language Selector. It also has to be added to Wikibase. Add a subtask under https://phabricator.wikimedia.org/T144272 . --Amir E. Aharoni (talk) 12:03, 25 September 2019 (UTC)

I'm getting the same problem with Cape Verdean Creole (Q35963), and it looks like this also happens for Kadazandusun (Q5317225), if I'm reading this correctly. In the case of Cape Verdean Creole (code: kea), I had already created a task in Phabricator (phab:T127435) a while ago, which has allowed me to enter labels and descriptions for Wikidata items in the language. VIGNERON suspects the lexeme creation form issue (for kea) may have been due to a "no value" entry for the language code of Cape Verdean Creole (Q35963), which he recently removed; but the form still shows the same behavior... --Waldyrious (talk) 17:38, 15 November 2019 (UTC)

Also, maybe this is a topic for a separate discussion, but is it correct to call the extra field "Spelling variant of the Lemma", when what it expects is apparently a regular language code? Am I missing something? That naming is definitely adding to the confusion here. --Waldyrious (talk) 17:42, 15 November 2019 (UTC)
First, a general comment: the whole process of using a language on Wikidata seems broken. I won't talk much about it, pages like Wikidata:Language barriers input already do so extensively. But now, can we think about a solution? I may be bold, but given the result of Wikidata talk:Identify problems with adding new languages into Wikidata, should we just stop following the LangCom rules and make our own rules? (for instance: allowing all valid codes) Any opposition if I start an RfC? (and I could use some help writing it, so anyone is welcome ;) )
For the bug, more exactly, I suspect that both problems come from the same source.
And +1 for "Spelling variant of the Lemma".
Cheers, VIGNERON (talk) 18:23, 15 November 2019 (UTC)

Verb forms that act as a noun

Hello! In some languages (maybe in most of them) some verb forms can act as a noun. For example, in Spanish, the infinitive can be used as a noun (comer vs. el comer). In Basque this is even more interesting, because a verb can have a form called nominalized verb (Q74674960). We can model it as a form of a verb, but, interestingly enough, if it acts as a noun, it can be inflected in many different ways (65, to be more exact). So, should we have a property to link the verb form with exactly the same noun form where all the declensions appear? How could we model it? -Theklan (talk) 14:35, 12 November 2019 (UTC)

@Theklan: even if it's a form of the verb, if it acts like a noun of its own then you should have 2 distinct lexemes.
Old tune: if it's the same meaning and the same main lemma, just linking the sense to the same item with item for this sense (P5137) is enough to find one from the other. To be redundant, you can also use derived from lexeme (P5191) if the grammar considers the noun to be derived from the verb (but is that always the case? In other languages - in French at least - it can go either way; it depends).
Cheers, VIGNERON (talk) 16:02, 12 November 2019 (UTC)
@VIGNERON: Yes we should have two different lexemes but we should link them somehow. Maybe with a property called nominalized form, for example. -Theklan (talk) 16:07, 12 November 2019 (UTC)

@VIGNERON: I have created a property proposal. -Theklan (talk) 20:41, 12 November 2019 (UTC)

Cognate property missing?

I did not find it. Do we infer these automatically? What about da: skræmme, sv: skrämma, where the vowel changes?--So9q (talk) 20:37, 18 November 2019 (UTC)
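As noted above, a dedicated cognate property does not seem to exist, so such pairs are not recorded explicitly. As a rough workaround, candidate pairs can be surfaced indirectly through item for this sense (P5137); a sketch, assuming both lexemes carry P5137 statements and using the items for Danish (Q9035) and Swedish (Q9027):

SELECT ?daLexeme ?daLemma ?svLexeme ?svLemma ?item WHERE {
  ?daLexeme dct:language wd:Q9035 ;        # Danish
            wikibase:lemma ?daLemma ;
            ontolex:sense/wdt:P5137 ?item .
  ?svLexeme dct:language wd:Q9027 ;        # Swedish
            wikibase:lemma ?svLemma ;
            ontolex:sense/wdt:P5137 ?item .
}

These are only candidates via a shared sense item; whether a given pair is actually cognate (e.g. skræmme/skrämma) is not stated anywhere.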

Duplicate lexeme script

I would like to duplicate lexemes. This would be very helpful when dealing with similar lexemes between Norwegian, Swedish and Danish. I know some JS and would like to know how to go about coding this if it does not exist yet.--So9q (talk) 10:06, 19 November 2019 (UTC)

Hi, a bit off-question, but have you thought about using the lexeme forms tool for creating those lexemes instead? I'm not sure about the amount of similarity between the languages you mention, but if you only had to change the spelling of one or two words for every duplicated lexeme, I can almost see the tool being faster in the end. (If you still want a script for that, you could at least make it pass the data through the tool first for verification; see the tool's documentation for how.) --Adrijaned (talk) 18:26, 19 November 2019 (UTC)

Cleanup of use of Q1182686 in lexemes

Could someone help find and clean up so that definite (Q53997851) is used instead?--So9q (talk) 12:55, 19 November 2019 (UTC)
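For finding candidates, a generic sketch like the following lists lexeme-namespace entities (lexemes, forms, senses) that reference Q1182686 anywhere, together with the predicate used; the results can then be fixed by hand or fed into a batch tool:

SELECT ?entity ?predicate WHERE {
  ?entity ?predicate wd:Q1182686 .                                    # any statement pointing at Q1182686
  FILTER(STRSTARTS(STR(?entity), "http://www.wikidata.org/entity/L")) # keep only the lexeme namespace
}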

Replace P31 with "language use" on senses

Hi, Finn created language style (P6191), which is very nice I think, but in many places P31 is still used instead. Can someone clean it up?--So9q (talk) 17:46, 19 November 2019 (UTC)

query for finding those places --Adrijaned (talk) 18:16, 19 November 2019 (UTC)
SPARQL magic :). Unfortunately there are almost only false positives on that list: https://www.wikidata.org/wiki/Lexeme:L40658#S2 https://www.wikidata.org/wiki/Lexeme:L55191#S2
(they use P31, but we only have to fix them when the value is one of these)
I tried this but I dunno yet how to make queries inside queries:
SELECT ?l ?form WHERE{
  values ?val {
    SELECT ?mening WHERE {
    ?mening wdt:P279 wd:Q183046.
    }
  }
  ?l a ontolex:LexicalEntry;ontolex:sense ?form.?form wdt:P31 ?val
}
Try it!
--So9q (talk) 19:22, 19 November 2019 (UTC)
This query seems to work:
SELECT ?l ?sense WHERE {
  {
    SELECT ?mening WHERE {
    ?mening wdt:P279 wd:Q183046.
    }
  }
  ?l a ontolex:LexicalEntry ; ontolex:sense ?sense .
  ?sense wdt:P31 ?mening .
}
Try it!
--Adrijaned (talk) 19:56, 19 November 2019 (UTC)
Thanks! I found a working query myself at the same time just before I saw your post. 46 items to fix.--So9q (talk) 20:06, 19 November 2019 (UTC)

Separation of senses and lexemes

I read about the data model and failed to understand why the current data model of lexemes does not separate lexemes and senses. I would prefer to have senses in a separate normal namespace, with labels and translations via labels, and to link to them from words and vice versa.

This would make it much easier to handle one-to-many (and many-to-many) translations, because a lemma (Danish: bil) would link to a sense W:1:

da: motorkøretøj med fire hjul (motor vehicle with four wheels)
en: A wheeled vehicle that moves independently, with at least three wheels, powered mechanically, steered by a driver and mostly for personal transportation; a motorcar or automobile.
fr: Véhicule terrestre à quatre roues, de une à sept places, muni d’un moteur et d’une réserve d’énergie pour celui-ci, ce qui rend ce véhicule autonome sur plusieurs dizaines à centaines de kilomètres. (One-to-seven-seater four-wheeled land vehicle with a motor and a power reserve for it, which makes this vehicle autonomous over several tens to hundreds of kilometers.)

and from there to another language of the sense and from there to the word(s) in that language covering that sense (english: automobile, car).

Currently, the data model of Wiktionary has led the individual language Wiktionaries to develop their own ways of handling this (interwiki links between lemmas), and the result is not that good: as seen in this example, the English and French definitions are much more precise than the Danish equivalent. The Danish and French Wiktionaries require the vehicle to have 4 wheels but the English one does not. If these were collected in Wikidata, I'm quite sure these 3 definitions could be merged and we could agree on what a car is:

  • something to transport humans in
  • something with at least 3 wheels
  • something that drives on land
  • something that is often covered against weather
  • something that usually requires a driver

So this is also a car. As it is right now, a learner who tries to grasp what this is called in different languages could get confused by the 3 different senses in the different Wiktionaries, even though they clearly describe the exact same concept.

As I understand it, this is hard or impossible to implement using the current model. In our current model, all senses have to be translated on every single lemma where they appear. This is IMO clearly not the best solution. It is also prone to duplicate work, because all lexemes with this sense would also link, in the sense, to an image of a car, the concept of car, etc. I would like to have all this in one place in the sense namespace (here W).

  1. Are there any advantages of the current model?
  2. How are translations between senses (not words) supposed to work?--So9q (talk) 14:22, 18 November 2019 (UTC)

After having read the tickets related to senses, I see that there are also technical advantages to having a separate namespace for senses. This would alleviate all the problems we currently have with senses not being exported to SPARQL, and with a senses UI that is problematic and very different from the labels UI (= more to learn for newcomers, = a burden on developers to implement a special UI for senses of lexemes).--So9q (talk) 16:13, 18 November 2019 (UTC)

@So9q: item for this sense (P5137) links senses on lexemes to regular Wikidata items, and from discussions on this page (check the archives) I think is the main solution to the problems you raise. If you look at Lexeme:L3648 you'll see the first sense links to the associated Wikidata item; translations can be found through such links rather than adding translations directly on the sense. How would you distinguish your proposed sense-namespace from the regular item namespace in Wikidata? ArthurPSmith (talk) 18:29, 18 November 2019 (UTC)
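For illustration, a minimal sketch of that indirect lookup, using L3648 (the example above) as the starting point: follow its senses' item for this sense (P5137) links and list other lexemes whose senses point to the same item, together with their language.

SELECT ?item ?otherLexeme ?lemma ?language WHERE {
  wd:L3648 ontolex:sense/wdt:P5137 ?item .        # item(s) behind L3648's senses
  ?otherLexeme ontolex:sense/wdt:P5137 ?item ;    # other lexemes pointing to the same item
               wikibase:lemma ?lemma ;
               dct:language ?language .
  FILTER(?otherLexeme != wd:L3648)
}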
Thanks for your reply :). Good question, I had not thought about translation happening on the Q-item, which is of course best when there is a concept described there. For verbs, e.g. hoppe (L42726), I just added item for this sense (P5137) jumping (Q1151752), which forced me to add a sense also.
Let's take another example: og (L3833), where I added item for this sense (P5137) logical conjunction (Q191081), which is only partly true. It also works as a glue in sentences, but I could not find a Q-item for that. Does that mean it would be a good idea to create such a Q-item for the concept "something that works as glue in sentences"? If not, how would translations work? I'm guessing I would have to add it both to the English sense and the Swedish sense and the Norwegian sense, etc. This means in theory up to 5000 statements for every language under translation (P5972)? This seems like a bad way to handle translations for senses that do not have an item for this sense (P5137) statement.
To answer your question: I think my proposed abstract sense namespace would only have to interact with the L-namespace, and we would only create senses there when the Q-namespace is not appropriate because there is no thing or concept there, e.g. the sense of and that means "linking words and sentences". Thinking further about it, we might as well put everything in the Q-namespace and avoid the distinction and confusion that 2 separate namespaces entail.
I went ahead and created this place (Q75618710). WDYT?--So9q (talk) 19:03, 18 November 2019 (UTC)
I think that is the general approach we have decided on here, yes. Not sure on the meaning of the specific example here, but in general this should be fine. There are relatively few special types of words like conjunctions so they should not add too much to the scope of the regular item space. ArthurPSmith (talk) 14:59, 19 November 2019 (UTC)
Replying to myself: I just found Fnielsen's reply here, which was very helpful in understanding the chosen data model of lexemes.--So9q (talk) 15:28, 20 November 2019 (UTC)

Work on language uses

Hi, I created linguistic usus (Q75810558) specifically for the purpose of being able to easily list our current language uses, and added the ones I know of as instances of it. See the list via SPARQL or via the special page.
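The linked list is roughly equivalent to a query of this shape (a sketch):

SELECT ?use ?useLabel WHERE {
  ?use wdt:P31 wd:Q75810558 .   # instances of linguistic usus
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" . }
}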

I updated our template accordingly.

For reference, here is a list of labels sorted by count in en:WT, where the top 40 (with subjects like "anatomy" and grammar like "countable" excluded) are:

transitive	82639
intransitive	37742
obsolete	29892
colloquial	28380
slang	23539
archaic	21237
rare	20097
informal	17601
uds.	17469
historical	14164
idiomatic	13661
dated	12805
figuratively	12101
literary	11804
reflexive	11189
US	9054
figurative	7710
vulgar	7646
chiefly	7356
dialectal	7010
Jersey	6966
UK	6603
derogatory	5119
pejorative	5053
by extension	5032
Min Nan	4158
regional	3902
Cantonese	3847
Hokkien	3702
poetic	3422
British	3273
ambitransitive	3191
in the plural	3024
formal	2997
relational	2975
humorous	2888

Some questions arise when looking at this list:

  • do we have a way to mark regional differences within English and Chinese?
  • how do we mark transitivity?
  • where do the Wiktionarians draw the line between dated and archaic? Between historical and archaic? --So9q (talk) 09:50, 20 November 2019 (UTC)

Improve the new lexeme form (3 suggestions)

Hi. I suggest we restrict the search in the last field to this query.

Also, it would be nice if the form defaulted to my chosen WD language, or used a cookie remembering the language set the last time I submitted the form.

Additionally, I suggest we make it smarter so it can guess the forms of the word after e.g. being given the imperative form of a verb. I recently rewrote a script doing that on en:WT. The tricky part is that some verbs consist of multiple words in Danish, like https://en.wiktionary.org/wiki/tage_fat, where only the first part is inflected and the particle is unchanged. We could detect this by looking for spaces.

The interesting part of that script is the data for the inflections in the languages, here for English:

// Per-language inflection data: for each part of speech, the headword template to insert,
// plus prompts for generating candidate forms from `title` (the page title, i.e. the lemma).
en: {
		noun: ['{{en-noun', '}}', ['Plural', '', [
			[title + 's', ''],
			[title + 'es', '|es'],
			['(uncountable)', '|-'],
			['input', 'Other: ', '|']
		]], ],
		verb: ['{{en-verb', '}}', ['Third-person singular present tense', '', [
				[title + 's', '|' + title + 's'],
				['input', 'Other: ', '|']
			]],
			['Present participle', '', [
				[title + 'ing', '|' + title + 'ing'],
				['input', 'Other: ', '|']
			]],
			['Past tense', '', [
				[title + 'ed', '|' + title + 'ed'],
				['input', 'Other: ', '|']
			]],
			['Past participle', '', [
				[title + 'ed', '|' + title + 'ed'],
				['input', 'Other: ', '|']
			]]
		],
		adjective: ['{{en-adj', '}}', ['Comparative', '', [
				['more ' + title, '', "necdata['en-adjective3']='most '+title"],
				[title + 'er', '|er', "necdata['en-adjective3']=title+'est'"],
				['(not comparable)', '|-', "necdata['en-adjective3']='(not comparable)'"],
				['input', 'Other: ', '|', '', 'necfunction6("en-adjective3","en-adjective-3-3")']
			]],
			['Superlative', '', [
				['most ' + title, '', "necdata['en-adjective2']='more '+title"],
				[title + 'est', '', "necdata['en-adjective2']=title+'er'"],
				['(not comparable)', '', "necdata['en-adjective2']='(not comparable)'"],
				['input', 'Other: ', '|', '', 'necfunction6("en-adjective2","en-adjective-2-3")']
			]]
		],
		adverb: ['{{en-adv', '}}', ['Comparative', '', [
				['more ' + title, '', "necdata['en-adverb3']='most '+title"],
				[title + 'er', '|er', "necdata['en-adverb3']=title+'est'"],
				['(not comparable)', '|-', "necdata['en-adverb3']='(not comparable)'"],
				['input', 'Other: ', '|', '', 'necfunction6("en-adverb3","en-adverb-3-3")']
			]],
			['Superlative', '', [
				['most ' + title, '', "necdata['en-adverb2']='more '+title"],
				[title + 'est', '', "necdata['en-adverb2']=title+'er'"],
				['(not comparable)', '', "necdata['en-adverb2']='(not comparable)'"],
				['input', 'Other: ', '|', '', 'necfunction6("en-adverb2","en-adverb-2-3")']
			]]
		],
		pronoun: ['{{en-pron', '}}'],
		conjunction: ['{{en-con', '}}'],
		interjection: ['{{en-interj', '}}'],
		preposition: ['{{en-prep', '}}'],
		propernoun: ['{{en-proper noun', '}}', ['Plural', '', [
			['None', ''],
			[title + 's', '|s'],
			['input', 'Other: ', '|']
		]]],
		contraction: ['{{en-cont', '}}'],
		prefix: ['{{en-prefix', '}}'],
		suffix: ['{{en-suffix', '}}']
	},

As you can see, these are the possible endings of the different forms. With this data, a tool like Lexeme Forms could be extended with the possibility to generate the forms based on the stem/imperative form typed in by the user. The user can then choose the correct ones and save.

I honestly don't know if this is worth the extra effort here on WD, but the script became popular on Wiktionary and many new entries have been created with the help of it.--So9q (talk) 18:27, 20 November 2019 (UTC)

Regarding the lexeme forms tool - automatic generation of the forms has been suggested, and rejected, as being out of scope. We do however have a custom lexeme form generation tool in the making externally, at least for Czech (which is fine since there are like 4 people working on Czech lexemes, me included) --Adrijaned (talk) 19:52, 20 November 2019 (UTC)
@So9q: I think few people use the "Create a new Lexeme" form directly - I've created most of the lexemes I've worked on (in English) using the Wikidata Lexeme Forms tool; I've also looked into Ordia a bit, as well as some of the others on the Wikidata:Tools/Lexicographical data page. These are open-source tools so I think you would be welcome to help them out with github issues or pull requests, etc. ArthurPSmith (talk) 20:33, 20 November 2019 (UTC)
@So9q: An English module is not interesting, as there is almost no inflection in this language. We have a module for Russian nouns! See wikt:ru:Модуль:inflection. And it generates pretty inflection tables in almost all articles about Russian nouns now. --Infovarius (talk) 21:39, 21 November 2019 (UTC)

Import of wikisource:Diccionario de Educación

Is that ok? I'm not that good at Spanish, would someone like to join?--So9q (talk) 06:07, 21 November 2019 (UTC)

100 most translated concepts using lexemes

Hi :)

I just learned some more SPARQL and came up with this query, resulting in this:

 

Now I want to come up with a query that shows me a list of all of these missing a lexeme in Danish (so I can work on them). Can someone help with that?--So9q (talk) 17:42, 21 November 2019 (UTC)

I've managed to get a list of meanings for which a specific language (here: Russian) is present:
SELECT DISTINCT ?meaning ?meaningLabel ?count WHERE
{
  ?l a ontolex:LexicalEntry ;
     dct:language ?language ;
     ontolex:sense ?sense .       # get the sense
  ?sense wdt:P5137 ?meaning .     # extract the meaning
  FILTER (?language = wd:Q7737)
  {
    SELECT ?meaning (COUNT(?l) AS ?count)
    WHERE {
      ?l a ontolex:LexicalEntry ;
         ontolex:sense ?sense .   # get the sense
      ?sense wdt:P5137 ?meaning . # extract the meaning
    }
    GROUP BY ?meaning             # this is to avoid "bad aggregate", see https://www.wikidata.org/wiki/Wikidata:SPARQL_tutorial#Painting_materials
    ORDER BY DESC(?count)         # rank by the most translated concepts
    LIMIT 100                     # only show the 100 highest to avoid clutter
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" }
}
Try it!
But then one should subtract these results from the initial query, and I don't know how to subtract queries... --Infovarius (talk) 22:36, 21 November 2019 (UTC)
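One way to "subtract" is to wrap the language-specific pattern in FILTER NOT EXISTS (or MINUS); a sketch that keeps the ranking subquery and drops concepts that already have a Danish (Q9035) lexeme:

SELECT ?meaning ?meaningLabel ?count WHERE {
  {
    SELECT ?meaning (COUNT(?l) AS ?count) WHERE {
      ?l a ontolex:LexicalEntry ;
         ontolex:sense/wdt:P5137 ?meaning .
    }
    GROUP BY ?meaning
    ORDER BY DESC(?count)
    LIMIT 100
  }
  FILTER NOT EXISTS {
    ?da a ontolex:LexicalEntry ;
        dct:language wd:Q9035 ;              # Danish; swap in any language item
        ontolex:sense/wdt:P5137 ?meaning .
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" . }
}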

Property to use for word list

I am not entirely sure when to use which property for a lexeme. A current problem is:

Here is my attempt at being consistent:

  • If the word appears in a dictionary, a linguistic book or a linguistic article where there is more than just a listing, e.g., where it is explained to belong to a specific class, I use described by source (P1343)
  • If the word appears in a written work, not explained, but used, and the work is copyrighted, then I use attested in (P5323)
  • If the word appears in a written work, not explained, but used, and the work is out of copyright I use it with usage example (P5831) together with stated in (P248) for the reference.

Finn Årup Nielsen (fnielsen) (talk) 16:31, 28 August 2019 (UTC)

@Fnielsen, Infovarius, Pamputt, Jura1: could we find one common consistent solution? I discovered today that for Swadesh list (Q152392) only part of (P361) is used: https://w.wiki/C46 (and dog (L1122) is using the sub-item English Swadesh list (Q3242663)), and it triggers a constraint violation :/ Cheers, VIGNERON (talk) 12:08, 14 November 2019 (UTC)
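A query of roughly this shape finds such lexemes (a sketch, limited to the two targets mentioned above):

SELECT ?lexeme ?lemma ?list WHERE {
  VALUES ?list { wd:Q152392 wd:Q3242663 }   # Swadesh list and the English sub-item
  ?lexeme a ontolex:LexicalEntry ;
          wikibase:lemma ?lemma ;
          wdt:P361 ?list .                  # part of
}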

Antonym missing symmetric constraint

I tried adding the antonym sell (L3873) to buy, but it did not complain that sell did not have an antonym relation back to buy. I fixed it by adding a symmetric constraint, similar to the one on synonym--So9q (talk) 22:13, 18 November 2019 (UTC)

Can it be added automagically somehow? -Theklan (talk) 10:04, 22 November 2019 (UTC)

Linking senses between languages

One of the features I think lexicographical data will have in the near future will be making better automatic translation systems. But I can't find how to link senses between languages. Is there any way? -Theklan (talk) 18:05, 16 October 2019 (UTC)

@Theklan: Right now there is translation (P5972) for direct translation links and item for this sense (P5137) for indirect links. See the "translations" section of Wikidata:Lexicographical data/Statistics (which I don't think has been updated for a long time). ArthurPSmith (talk) 18:12, 16 October 2019 (UTC)
@ArthurPSmith: Thanks! It seems that it is still not very used. -Theklan (talk) 18:15, 16 October 2019 (UTC)

As the linking should be symmetric.. is there any tool to do this? -Theklan (talk) 18:17, 16 October 2019 (UTC)

The indirect linking via item for this sense (P5137) is automatically symmetric (but you have to add the property on all the relevant language/lexemes that it applies to). So far the number of senses is far less than the number of lexemes (about 30,000 vs 200,000) so that should probably be addressed first! ArthurPSmith (talk) 18:20, 16 October 2019 (UTC)
I should add that MatchSinn is a great tool for working on this! ArthurPSmith (talk) 18:22, 16 October 2019 (UTC)

<old man still yells at cloud>Do we really need the property translation (P5972) when you can easily and obviously get the same results with item for this sense (P5137)? If we add all the possible values in translation (P5972), I fear the lexemes won't be usable anymore.</old man still yells at cloud> Cheers, VIGNERON (talk) 19:46, 16 October 2019 (UTC)

@VIGNERON: Well, the lexeme for water in Basque has three meanings. I think we could find item for this sense (P5137) for them. But most of the words don't have a item for this sense (P5137) for the sense, I guess. -Theklan (talk) 20:10, 16 October 2019 (UTC)
It does not hurt to create the item … author  TomT0m / talk page 20:18, 16 October 2019 (UTC)
Looks like sooner or later, good idea or not, such things will have to happen, unfortunately. But I agree with you. One other thing that is cool with item for this sense (P5137) is that it allows us to use the usual properties like subclass of (P279) to find hyperonyms. For example, if sense A of an English word links to an item, then we can find a « close » match in French even if they are not an exact match:
⟨ lapdog(en) ⟩ item for this sense (P5137) ⟨ companion dog (Q38499) ⟩
and
⟨ companion dog (Q38499) ⟩ subclass of (P279) ⟨ dog (Q144) ⟩
and ⟨ chien(fr) ⟩ item for this sense (P5137) ⟨ dog (Q144) ⟩, so we automatically know that chien(fr) is a hyperonym of lapdog(en)… It turns out this is an actual example with the connections on Wikidata already set up, checked afterwards :) author  TomT0m / talk page 20:18, 16 October 2019 (UTC)
What would be the item for this sense (P5137) for the words what, would, be, the and for? Lexemes and items may be connected, but not always. -Theklan (talk) 20:30, 16 October 2019 (UTC)
@Theklan: what's the problem? You can use what (Q20656446), will (Q364340), definite article (Q2865743), cause (Q2574811) (and many more, and you can always create some if needed). I don't see why we couldn't always connect lexemes to an item. Meanwhile, in the end we will probably have more than 10 000 lexemes for "water"; do they really all need to link to all the others? (knowing that when an item has 5000 statements, it already usually breaks things). Cheers, VIGNERON (talk) 10:03, 17 October 2019 (UTC)
@VIGNERON, Theklan, TomT0m: I agree that translation (P5972) is very problematic. However, the big issue currently with only using item for this sense (P5137) is that the labels for Wikidata items are almost exclusively nouns, and so there is not a clear way to link verb, adjective, adverb, and functional words to items. There is a proposal for a potential solution at Property_talk:item for this sense which is to use item for this sense (P5137) to also link non-noun senses to items that reference the same concept as the non-noun sense. Would be good to get some more opinions on that. Liamjamesperritt (talk) 01:22, 29 October 2019 (UTC)
@VIGNERON, Liamjamesperritt, TomT0m: I understand where the technical problem is, but we have exactly the same problem with taxon common name (P1843); we could easily have more than a thousand representations for dog (Q144). Also, if we can define each lexeme in all the languages, we may end up having thousands of definitions for every lexeme in hundreds of languages. Linking with items is relatively easy with dog (Q144), but consider that in some languages the lexeme for dog (Q144) (not the item) is also an insult, and this can vary by language. For example, txakur (L73419) also means police (Q35535) (insulting), but I would link it to French flic or Spanish madero, which are not related to a dog. The same happens for adverbs, verbs and other parts of speech. So, I don't know which is the best technical solution for having a thousand translations, but in the end we will have a thousand translations somewhere.
So, I think that the best solution is having translations inside the lexemes, and finding a way to make them symmetrical. Something like interwiki links could be a good solution. -Theklan (talk) 08:32, 29 October 2019 (UTC)
@Theklan: Based on "For example, txakur (L73419) also means police (Q35535) (insulting), but I would link it to French flic or Spanish madero, which are not related to a dog", I'm not sure you fully understand the proposal. Of course it's not related to « dog », and in the proposal it does not have to be at all! The sense txakur (L73419)-policeman would be linked to the « policeman » item, never to the « dog » one. I don't get your issue. author  TomT0m / talk page 09:31, 29 October 2019 (UTC)
@Theklan: same, I'm a bit lost too. Obviously, L73419-S1 (the "dog" sense) and L73419-S2 (the "police" sense) are 2 different senses (and I'm even wondering: shouldn't they be 2 distinct lexemes?), but that's exactly why they are stored as 2 different senses, each with a separate item for this sense (P5137); how and when could anyone get confused?
@Liamjamesperritt: on your proposal, everyone (including me) agrees. There are some caveats (as always) but the general idea of the proposal sounds good and sane. I think we should go with it and stop limiting item for this sense (P5137) to nouns (which we don't really do anyway: https://w.wiki/AzH - there are already 1983 lexemes - out of 16655, so ~12% - that are not a noun (Q1084)).
Cheers, VIGNERON (talk) 13:23, 29 October 2019 (UTC)
I think I explained the issue in the wrong way. Of course, we can link txakur (L73419) to police (Q35535) and then police (Q35535) to flic, but I think it is more straightforward to link the insulting words directly rather than indirectly, because the indirection can create confusion. If you search for ways to say police (Q35535) in different languages, you can end up with a list of insulting ways if you don't specify it. With lexemes the correspondence is direct and clearer, and the tone of the lexeme becomes more obvious.
On the other hand, no, there are not two lexemes: that's why we have more than one sense.
My point is that in some years we will finally have thousands of language-related items stored somehow. Take ama/𒂼 (L1): we could have there all the words for mother written down. This would be exactly the same case as having the direct translation below, and I think it is better to have a translation than a sense if I can't have both. -Theklan (talk) 16:41, 29 October 2019 (UTC)
@Theklan: I get your point, what I don't get is why you think it's a good idea.
My view is that you suggest making a lot of links when a few are enough and easier (to create, to maintain, to re-use, etc.). To explain my point of view, let's re-use these images about the interwiki links before and after Wikidata:
  (image: interwiki links before Wikidata) vs. (image: interwiki links after Wikidata)
In the case of the interwiki links at least, the direct links were a never-ending nightmare: never up-to-date, always problems of symmetry, bots reverted by bots that were themselves reverted by bots, and so on. I may be wrong, but I'm pretty sure that it will be the same situation with translations.
On the other hand, linking the senses of thousands of lexemes to a single item and then doing a simple query to get the same result seems better to me. It is easier, simpler, even more elegant, and it will resist and scale better.
Cheers, VIGNERON (talk) 11:34, 3 November 2019 (UTC)
@VIGNERON: I understand your point, and you also give the solution. Instead of linking to Qs, we should have a way, similar to interwikis, to link senses. So, every time we add a link between senses, all the senses with the same meaning get automatically linked. That way, instead of linking to an item and then doing a reverse search to look for all lexemes linked to that item, we could have direct linking between senses, independent (or not) from the Q. Consider that a lot of tools could directly use this translation instead of needing queries for each item linked to a sense. I honestly think that this solution is easier to maintain and more resilient. -Theklan (talk) 11:51, 3 November 2019 (UTC)
How do you imagine the linking if there would be 2 English and 3 Russian lexemes connected to 1 item sense? --Infovarius (talk) 22:05, 4 November 2019 (UTC)
@Infovarius: that would be the most common case (most words have synonyms), what is wrong with that? Cheers, VIGNERON (talk) 12:56, 23 November 2019 (UTC)
@theklan: This crucial problem of how a term is used (slang, vulgar, offensive) has been solved by Fnielsen with language style (P6191), which is added to senses. This makes it easy to get all vulgar senses in a language, or to make a children's dictionary without offensive senses, etc.--So9q (talk) 05:30, 23 November 2019 (UTC)
@VIGNERON: Sounds good. I'll go ahead and update the usage examples for item for this sense (P5137) to make the increase in scope explicit. Liamjamesperritt (talk) 01:02, 30 October 2019 (UTC)

New section for lexeme queries in examples

Hi, I created a new section for lexeme queries. Feel free to add your queries there for beginners to learn! E.g.

--So9q (talk) 17:32, 20 November 2019 (UTC)

It would be nice to have a maintenance query with the nouns that link to a q from a sense which is missing an image.--So9q (talk) 08:55, 25 November 2019 (UTC)

You mean the Q missing an image, or the sense? Because here deleting images from senses is being discussed...
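Assuming it is the Q-item that should carry the image (P18), a sketch of such a maintenance query:

SELECT ?lexeme ?lemma ?item WHERE {
  ?lexeme a ontolex:LexicalEntry ;
          wikibase:lexicalCategory wd:Q1084 ;   # noun
          wikibase:lemma ?lemma ;
          ontolex:sense/wdt:P5137 ?item .
  FILTER NOT EXISTS { ?item wdt:P18 ?image . }  # item has no image
}
LIMIT 100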

Discussion started about extending P6191 language use to forms

see Property talk:P6191--So9q (talk) 22:57, 25 November 2019 (UTC)

Handling of trademark (Q167270)

How can we handle trademark (Q167270)? I have examples for Danish lexemes here: københavnerstang (L204254) and LEGO (L57918), where instance of (P31) is used. It is unclear to me whether a trademark (Q167270) is a lexeme. A trademark (Q167270) is typically also associated with a certain graphical representation, e.g., "LEGO" and not "Lego" or "lego". A trademark (Q167270) also seems mostly to be associated with a sense, e.g., "KØBENHAVNERSTANG"/københavnerstang (L204254) is associated with non-alcoholic drinks [4]. — Finn Årup Nielsen (fnielsen) (talk) 11:03, 17 October 2019 (UTC)

I don't think we should have a word that only means the name of a company as a lexeme (out of scope). If it also means something other than the company (and it does, e.g. in: har du set mit lego? (have you seen my lego bricks?)), it is in scope IMO. I'm not in favor of creating lexemes for every company in the world.--So9q (talk) 09:29, 26 November 2019 (UTC)

Best practices

A few quick questions regarding best practices! :) --blinry (talk) 13:51, 27 October 2019 (UTC)

  • When adding an adjective, should I link to an item via item for this sense? For example, should a lexeme sense meaning "simple" link to the item "simplicity"?
  • Is it a best practice to link senses to other senses in other languages using translation?
  • Should the gloss of a sense be a translation or a description? Assume that I want to add an English gloss for a non-English lexeme.
  • Should the lexical categories be as specific as possible? For example, should I use verb or transitive verb?
  • Very interesting and important questions. And most of them are without consensus now, so I can only provide my IMHO:
  1. The most arguable. It does no harm to add such a sense at least. The other solution is to create a special item with the meaning "simple" - more arguable.
  2. This practice seems not effective (similar to pre-Wikidata interwiki linking) but it is advisable practice at the moment.
  3. I would add a brief translation + short description. The main purpose of a gloss is to distinguish between senses, I think. And if there is some translation claim or "item for this sense", then the gloss can be shorter.
  4. I would not use specific categories, but add
    ⟨ subject ⟩ has characteristic (P1552) ⟨ transitive verb (Q1774805) ⟩
    (or even create a specific property "transitivity").
Regards, --Infovarius (talk) 15:38, 28 October 2019 (UTC)
@blinry, Infovarius: In regards to your first question, there is currently no clear consensus on linking verb, adjective and adverb senses to items using item for this sense (P5137). My own opinion is that we should be able to link an adjective sense such as "simple" to the item labelled "simplicity". There is a discussion about this at Property_talk:item for this sense. Would be good to get some more opinions on the matter and hopefully reach a consensus. Liamjamesperritt (talk) 03:05, 29 October 2019 (UTC)

Here is my IMHO (and open for discussion):

  1. You should item for this sense (P5137)-link adjectives when possible, but not "simple" to "simplicity". These words are different.
  2. If there is no item, then a link through translation may be practice, while perhaps not best practice (should we start creating items for verbs and adjectives?).
  3. I usually do a translation, ";" as separator, and a description, e.g., "cat; domesticated feline" for kat (L40929)
  4. Lexical categories should be as unspecific as possible: verb, noun, adjective, etc. I used instance of (P31) (not has characteristic (P1552), in contrast to Infovarius) in Danish to specify specific categories, e.g., countable noun, mass noun, productive suffix, transitive verb, etc.

Finn Årup Nielsen (fnielsen) (talk) 14:52, 29 October 2019 (UTC)

Keep in mind that Lexeme Senses refer to words, whereas Items do not refer to words, but rather concepts. So there is no notion of the lexical category of an Item. It is therefore expected that we will link Senses to Items whose labels are not the same as the Sense's lemma. Furthermore, the concept denoted by "simple" is the concept of "simplicity". And since we will not be adding Items whose labels are adjectives, my opinion is that linking these is the best way to achieve indirect synonyms and translations for non-noun Senses. Liamjamesperritt (talk) 04:30, 30 October 2019 (UTC)
Thanks for this clarification.--So9q (talk) 06:50, 24 November 2019 (UTC)
@Liamjamesperritt, So9q: I would not say that "simple" and "simplicity" are synonyms. I would say that the Q-items are synsets (in the WordNet parlance), and "simple" and "simplicity" are separate synsets in WordNet. It is unclear to me why we should not have Q-items corresponding to non-nouns. We already have some(!), as color names, at least in Danish and English, are adjectives. — Finn Årup Nielsen (fnielsen) (talk) 20:55, 25 November 2019 (UTC)
We only have colors because there are no noun equivalents AFAIK.
What about my 2 examples, offentliggøre (L222655) & sjællandsk (L54383), where I added the Q-sense of the noun and also gave it a quality? That would spare us a few thousand Q-items, but maybe there are downsides to that approach (although I can't think of any).--So9q (talk) 23:56, 25 November 2019 (UTC)
@fnielsen: I agree with P31 over P1552. I don't agree with adding the lexeme to the sense with a ";"; this is clutter IMO. Would you consider removing it?--So9q (talk) 06:50, 24 November 2019 (UTC)
@So9q: What would you then propose? For "cat; domesticated feline", would you then just write "cat" (then how do we disambiguate?), "domesticated feline", or something else, such as "domesticated feline, i.e., cat"? — Finn Årup Nielsen (fnielsen) (talk) 20:40, 25 November 2019 (UTC)
@fnielsen: I meant that repeating the lemma has little or no value. I changed it to follow the Danish sense, which was good IMO: domesticated feline that can catch mice.--So9q (talk) 23:56, 25 November 2019 (UTC)
@So9q: I am not sure that is the best way for kat (L40929). If we look at the English Wiktionary [5], the Danish word "kat" is explained as the English word "cat". The lemma is not repeated, because the English word "cat" is not present anywhere else on the Wikidata page, except perhaps through item for this sense (P5137). In the Svensk-Dansk Ordbog, the Swedish word "katt" is explained as the Danish word "kat". — Finn Årup Nielsen (fnielsen) (talk) 02:55, 26 November 2019 (UTC)
I don't think your reference to enWT is relevant. They never explain anything in the other languages except when there is not yet an English equivalent term. If you want to see how others do it, look at kat in DDO: mindre pattedyr med smidig krop, kort snude, spidse, opretstående ører og blød pels; er god til at fange mus og holdes ofte som husdyr (smaller mammal with a supple body, short muzzle, pointed, erect ears and soft fur; is good at catching mice and is often kept as a pet); my sense was a short version of that. I don't see a problem with only having "kat" in item for this sense (P5137) on that page. Anyway, I view the Lexeme UI as a kind of machine room; it is not the way we are going to present words to a user trying to look something up in a dictionary. For that we have Ordia or similar services that make it pretty, easy to navigate, show links to other languages, etc. So not having the word "cat" outside item for this sense (P5137) is not an issue IMO.--So9q (talk) 10:07, 26 November 2019 (UTC)

Mapping Lexemes and OmegaWiki

Hi y'all,

There is already OmegaWiki Defined Meaning (P1245) (with only 627 uses!) but it's for items and not Lexemes. Should we create an equivalent property for Lexemes? I'm not sure how OmegaWiki is structured and how it can be mapped (most probably at the sense level, though). I'm also wondering: should the new property be in addition to OmegaWiki Defined Meaning (P1245), or replace it? Is it again the same question of explicit-direct vs. implicit-inferred - see the discussions about translation - it seems so, but it should be confirmed.

Pinging the participant of the P1245 property proposal: @Bigbossfarin, SERutherford, John Vandenberg, purodha:

Cheers, VIGNERON (talk) 12:31, 30 October 2019 (UTC)

An equivalent property for Lexemes would be nice and OmegaWiki Defined Meaning (P1245) should only be used as a property for Senses in Wikidata not anymore in the Q-namespace. Bigbossfarin (talk) 17:29, 30 October 2019 (UTC)
What is the sense of using the property on Senses (sorry for the word game) if it is already on the corresponding item? It can be queried then and is redundant. --Infovarius (talk) 19:01, 30 October 2019 (UTC)
I agree with Infovarius. Senses are attached to a specific lemma (or Expression), whereas Defined Meanings are expression non-specific. It therefore makes more sense to map Items to Defined Meaning, and you can then map Senses to those Items using item for this sense (P5137). Liamjamesperritt (talk) 01:29, 31 October 2019 (UTC)
@Bigbossfarin, Infovarius, Liamjamesperritt: Ok, I understand and it makes sense (pun intended), "DefinedMeaning" should indeed stay on items, which is the best place for this identifier. So now, how can we do the mapping? (and have more than 1% of Meanings linked on Wikidata ;) ). Cdlt, VIGNERON (talk) 12:44, 4 November 2019 (UTC)
@VIGNERON: Good question. As far as I'm aware, OmegaWiki already has its own mappings from Defined Meanings to Wikidata Items, with pretty good coverage. Using those existing mappings, it should be pretty easy to automate the creation of the inverse statements on Wikidata. Liamjamesperritt (talk) 22:20, 4 November 2019 (UTC)
@Liamjamesperritt: Wikidata Items is just the DefinedMeaning for Wikidata Q identifier (Q43649390) (I've added the id on this item). Is there an actual mapping on Omegawiki side? (I've looked for it without success). Cdlt, VIGNERON (talk) 09:48, 5 November 2019 (UTC)
@VIGNERON: Unfortunately, it doesn't look like OmegaWiki's implementation of their MediaWiki extension is consistently compatible with the "What links here" tool, so I can't find a way to see a list of usages of the Wikidata ID as an identifier. However, it is widely used as an identifier (since OmegaWiki does not implement Properties) under the "Semantic Annotations" section of Defined Meaning pages throughout OmegaWiki. For example fluorescent lamp has a Wikidata mapping, but the corresponding Wikidata item fluorescent lamp (Q182925) does not have an OmegaWiki mapping. Try using the Random/DefinedMeaning special page to see some more examples (although if your browser is using caching then the Special:Random page may just return the same result everytime). Best, Liamjamesperritt (talk) 22:33, 5 November 2019 (UTC)
@Liamjamesperritt: thank you a lot, I was fooled too by the lack of links in the "What links here" tool. Is there another way to find them? I see at the end of Special:SpecialPages that there are some export/search/download pages, but I'm not sure I understand exactly how they work and whether I can get the Wikidata mapping from their side. Does somebody here have an idea? If not, I'll ask directly on the International Beer Parlour of OmegaWiki. Cheers, VIGNERON (talk) 10:38, 6 November 2019 (UTC)
@VIGNERON: It doesn't look like there is any in-browser tool to easily query the data in OmegaWiki. I would say your best chance would be to download the data and query it locally. After looking at the OmegaWiki database layout, I think this query should do the trick:
SELECT object_id AS defined_meaning_id, text AS wikidata_id FROM uw_text_attribute_values WHERE attribute_mid = 1434066;
It should return over 10784 results. Best, Liamjamesperritt (talk) 13:09, 6 November 2019 (UTC)
@VIGNERON: Hey, I decided to download the OmegaWiki database and run the query myself which returned 10919 results. I sent you an email with the query results. Let me know if you want any help getting a bot built/approved to add all the statements to Wikidata. Best, Liamjamesperritt (talk) 01:34, 7 November 2019 (UTC)
@Liamjamesperritt: mail received, I randomly looked at some values and everything seems good. Since it's a (relatively) small dataset with simple data, in this case I think we can just use QuickStatements. Do you want to do it or should I? Cheers, VIGNERON (talk) 07:36, 7 November 2019 (UTC)
@VIGNERON: If you feel that it is small enough to not need approval, then I'll go ahead and do it. Liamjamesperritt (talk) 08:40, 7 November 2019 (UTC)
@VIGNERON: It's done. FYI, previously there were 632 uses of OmegaWiki Defined Meaning (P1245) in Wikidata, now there are 10665. I had to filter out a few hundred that were duplicates or that were for items that had been deleted or merged and were no longer valid, so not all 10919 statements were added. Best, Liamjamesperritt (talk) 22:24, 8 November 2019 (UTC)
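For reference, the current figure can be re-checked with a one-line sketch:

SELECT (COUNT(DISTINCT ?item) AS ?uses) WHERE {
  ?item wdt:P1245 ?id .   # OmegaWiki Defined Meaning
}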
Thank you! There still a lot to do but that's already way better! VIGNERON (talk) 10:04, 9 November 2019 (UTC)
Nice work! :)--So9q (talk) 10:10, 26 November 2019 (UTC)

Picture dictionary for kids and next step

Today I made an example that filters based on language use, resulting in a picture dictionary for children. It was remarkably easy to do. I love SPARQL! 😃 The next step is specialized dictionaries for particular areas of knowledge like law, anatomy, etc. For this we need to improve the categorization of the terms among the Q-items, which is currently very lacking.--So9q (talk) 12:50, 23 November 2019 (UTC)
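A sketch of that kind of query, assuming the filtering goes through item for this sense (P5137), image (P18) on the item, and language style (P6191) on the sense (here simply requiring that no style such as slang or vulgar is set), with Danish (Q9035) as the example language:

SELECT ?lemma ?item ?image WHERE {
  ?lexeme dct:language wd:Q9035 ;       # Danish
          wikibase:lemma ?lemma ;
          ontolex:sense ?sense .
  ?sense wdt:P5137 ?item .              # the concept behind the sense
  ?item wdt:P18 ?image .                # which has a picture
  FILTER NOT EXISTS { ?sense wdt:P6191 ?style . }   # skip senses with any language style set
}
LIMIT 200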

Can you please share your results? --Infovarius (talk) 20:15, 24 November 2019 (UTC)
Sorry, I typed this on mobile. I put it with the others in the examples. Note that we currently lack a way to indicate offensive language use (it might be because we do not have offensive senses of lexemes yet, but I highly doubt that). When someone has created the relevant Q-item, I think we should add it to the query.--So9q (talk) 20:34, 24 November 2019 (UTC)
@So9q: couldn't you use language style (P6191) for "offensive"? (it's not exactly always the same thing but often it is quite close) Cdlt, VIGNERON (talk) 09:00, 26 November 2019 (UTC)
Yes, what I meant was that offensive (Q76500861) was missing (I just created it). It is now used here: https://www.wikidata.org/wiki/Lexeme:L226732.--So9q (talk) 09:08, 26 November 2019 (UTC)

Unknown lexical category

Because the lexical category is mandatory to fill in on the lexeme form, I give it the Q-item for unknown when I don't know the category. E.g. https://www.wikidata.org/wiki/Lexeme:L226749. This can easily be queried and cleaned up by others, so I suggest you do the same :)--So9q (talk) 11:54, 26 November 2019 (UTC)
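Such a cleanup query could look like this (a sketch; wd:Qxxx is a stand-in for whatever "unknown" item is actually used, e.g. the one on L226749):

SELECT ?lexeme ?lemma WHERE {
  ?lexeme wikibase:lexicalCategory wd:Qxxx ;   # replace with the placeholder Q-id used for "unknown"
          wikibase:lemma ?lemma .
}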
