Wikidata talk:Lexicographical data/Archive/2019/12

This page is an archive. Please do not modify it. Use the current page, even to continue an old discussion.

Use of combined grammatical features like “second-person plural”

We have a number of items that combine several grammatical features, such as second-person plural (Q51929403) = second person (Q51929049) + plural (Q146786). Should these be used as grammatical features of forms – and, if yes, instead of or in addition to the individual items?

Personally, I think we should not use these items. The data model allows for any number of grammatical features per form, so I don’t see why it would be necessary to combine several into one item – it just makes it more annoying to use the data (e. g. in queries), because you have to account for several possible combinations to model the same information. (There are even items that combine more than two features, e. g. second-person plural feminine (Q55098010).)

In the Wikidata Lexeme Forms templates, those combined items are used almost nowhere – the only occurrence so far is third-person singular (Q51929447) in English verb, where I think it’s somewhat defensible because there are no distinct forms for any other grammatical person or number (though I wouldn’t object to replacing this with third person (Q51929074) + singular (Q110786) either). But the work-in-progress template for Czech perfective verbs (CC Adrijaned and Strepon) proposes using the items together, e. g. having a form with first person (Q21714344), singular (Q110786) and first-person singular (Q51929218), so I think it would be good to clear this up now. What do you think? --Lucas Werkmeister (talk) 15:18, 17 November 2019 (UTC)

I agree with you for all the reasons you have presented. I think only the individual items should be used. The only advantage I see to using combined items is that they are faster to add by hand, but because Wikidata is meant to be filled by bots, this argument is weak. This means, imho, that the combined items should be replaced by the individual items and then deleted so that they cannot be used again later. Pamputt (talk) 18:20, 17 November 2019 (UTC)
My opinion was more like "I don't care", so if there is an agreement to use the single items only (which is reasonable I think), I'm happy to follow it. --Strepon (talk) 19:27, 22 November 2019 (UTC)
Okay, I’ve just updated the English verb template so it no longer uses third-person singular (Q51929447) (CC ArthurPSmith as the original author of that template). I haven’t updated the existing lexemes yet, though. --Lucas Werkmeister (talk) 13:13, 30 November 2019 (UTC)
That's fine with me, thanks! ArthurPSmith (talk) 14:59, 2 December 2019 (UTC)
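For reference, the existing forms that still carry the combined item can be listed with a query along these lines (a sketch; Q1860 = English, Q51929447 = third-person singular):

SELECT ?lexeme ?form ?representation WHERE {
  ?lexeme dct:language wd:Q1860 ;
          ontolex:lexicalForm ?form .
  ?form wikibase:grammaticalFeature wd:Q51929447 ;
        ontolex:representation ?representation .
}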

Help with cleanup of Q24133704 by bot

Hi, I found wd:Q24133704 used on 418 English verbs. It has been merged with present participle (Q10345583) and should be replaced by the latter. Unfortunately this is not yet supported by QS2, so I suggest we do it by bot via the wonderful LexData library. Pinging the ones who I know have bots that work on lexemes: @Uziel302, yurik:--So9q (talk) 11:07, 29 November 2019 (UTC)
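A query along these lines should give a bot its work list (a sketch; Q1860 = English):

SELECT ?lexeme ?form WHERE {
  ?lexeme dct:language wd:Q1860 ;
          ontolex:lexicalForm ?form .
  ?form wikibase:grammaticalFeature wd:Q24133704 .
}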

So9q, is there a reason there is no support for redirects? It sounds like a lot of work to run a bot every time something becomes a redirect. Uziel302 (talk) 20:09, 1 December 2019 (UTC)

Deprecation of P5972

I suggest we deprecate translation (P5972) and store all translations in the labels of Q-items instead. For verbs this often means creating a Q-item which denotes the action (starting the description with "action of" or "method"). This alleviates the problem of creating translation (P5972) symmetrically for every sense in every language of the world and utilizes the power of WD Q-items instead.

As an aside, this also partly avoids the annoying bug of senses not being searchable when trying to add translation (P5972) to a sense. WDYT? --So9q (talk) 09:56, 20 November 2019 (UTC)
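For comparison, indirect translations can already be pulled through item for this sense (P5137) today, without any symmetric statements (a sketch pairing Danish and French lemmas via a shared concept; Q9035 = Danish, Q150 = French):

SELECT ?lemma_da ?lemma_fr ?concept WHERE {
  ?l1 dct:language wd:Q9035 ; wikibase:lemma ?lemma_da ; ontolex:sense ?s1 .
  ?l2 dct:language wd:Q150 ; wikibase:lemma ?lemma_fr ; ontolex:sense ?s2 .
  ?s1 wdt:P5137 ?concept .
  ?s2 wdt:P5137 ?concept .
}
LIMIT 100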

  • I think the suggestion was already discussed before and discarded.
The new point seems to be that it's meant as a bugfix (or a way to cover up a missing feature). That seems silly to me. --- Jura 10:04, 20 November 2019 (UTC)
I just found https://www.wikidata.org/w/index.php?title=Wikidata:Properties_for_deletion&oldid=950463144#oversættes_til_(P5972). I'm somewhat surprised to see you could not reach consensus on the matter. I guess I will have to be patient.
I see no problem with creating Q-items for every concept in the world, be it a special word that links other words or phrases (like and (L1385), which you can view as one of our greatest historical linguistic inventions because it made binding together concepts of speech possible) or an item describing any human action. I did not see any arguments for keeping concepts like this out of the Q namespace. I created and (Q75815335) as an example. In Wiktionary there are just about 3.1 million definitions, and I would guess we have at least half and maybe 2/3 of those already. In other words, what I am proposing is creating the ~1 million missing concepts (mostly human actions like Q75801718 and Q75800566 that I created).
I realise that with our current manpower and lack of license interoperability this work is probably going to take years. If we solve the relicensing of senses from Wiktionary by convincing them to relicense senses as CC0, this creation of items could be done by bot in less than a month, and only the linking to senses and cleanup would be left.--So9q (talk) 10:37, 20 November 2019 (UTC)
I think the solution is rather the opposite .. we move lexemes over to Wiktionary. --- Jura 11:01, 20 November 2019 (UTC)
Care to share any arguments for that? I disagree; I don't think Wiktionary is a good platform for dictionary work at all.--So9q (talk) 12:03, 20 November 2019 (UTC)
The platform doesn't need to change and the content is already there, even in the form of triples. It's actually surprising how far they got over the years .. (the same was discussed further up on this page) --- Jura 18:57, 21 November 2019 (UTC)
@So9q: how did you get the number "3.1 mln"? It isn't on the mentioned page and I guess it is hard to derive from the millions of inflected forms which are stored in English Wiktionary. --Infovarius (talk) 20:07, 21 November 2019 (UTC)
It is the sum of all glosses. But now that you point my attention to it, the number is wrong, because I assumed that all glosses (all lines in the dump starting with #) are something we need to add as a Q-item. This is false, as it is only the English glosses we need to cover; the others are most of the time just referring to an English gloss. So say there are 800,000 valuable glosses/definitions/concepts in en:WT and we have somewhere between 1/2 and 2/3 of them already (I get this number when sorting by glosses for English: 715409).--So9q (talk) 20:37, 21 November 2019 (UTC)
Agreed, translation (P5972) is a dead end that will only cause nightmarish problems. As the lexemes grow, we will end up with several thousand statements on each lexeme, which is not manageable. Cheers, VIGNERON (talk) 13:54, 22 November 2019 (UTC)
  • At WikiProject_sum_of_all_paintings, we had a discussion about the items used for "peinture" and "tableau" (translated in English as "painting" and "painting on a moveable support").
Two distinct concepts, but not really useful as actual classes for instances of items and not necessarily likely to have separate articles in the same Wikipedia.
The result is that these two ended up being merged. If they had been used as values for "item for this sense", this would likely have resulted in a loss of meaning for indirect translations. --- Jura 08:06, 3 December 2019 (UTC)

Pictures on senses - bad idea?

Hi, I wonder if it is a good idea to link directly from senses to images, because these could instead be linked on the corresponding Q-item (and probably already are). Linking to images on every single sense of every single lexeme in all languages seems like a bad idea to me. WDYT?

I'm aware that ORDIA currently displays images, which is very nice, but it could just as well look those up on the P5137-linked Q-item.--So9q (talk) 18:52, 21 November 2019 (UTC)

Could you elaborate why this "seems like a bad idea"? Pamputt (talk) 21:23, 21 November 2019 (UTC)
Ok, let's take an example: police (Q35535) has two pictures already. Now 5000 people come along and create a lexeme for police in every single language on the planet and add a picture on the sense too. This is a waste of time and resources, and if you multiply this by every noun in the world it becomes an enormous waste. Let's utilize RDF and pull in the image from the Q-item instead, and update it in one place if the image is not suitable.--So9q (talk) 22:19, 21 November 2019 (UTC)
I agree that if we have the item link, this information may be automatically added, so we don't need to add it again and again. Can it be done? -Theklan (talk) 10:03, 22 November 2019 (UTC)
Yes (I added this query to the examples).--So9q (talk) 11:47, 22 November 2019 (UTC)
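Such a lookup could look roughly like this (a sketch; it fetches images from the Q-item linked via item for this sense (P5137) rather than from the sense itself):

SELECT ?lexeme ?lemma ?image WHERE {
  ?lexeme wikibase:lemma ?lemma ;
          ontolex:sense ?sense .
  ?sense wdt:P5137 ?concept .
  ?concept wdt:P18 ?image .
}
LIMIT 100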
+1, I see no good reason to put images on lexemes (at the sense level or elsewhere) and we should remove them. Cheers, VIGNERON (talk) 13:45, 22 November 2019 (UTC)
  • The datamodel chosen for lexemes includes subentities for senses. As such senses can and should have statements on these subentities. If the chosen datamodel had linked items directly instead, we wouldn't have these options.
Given that we don't have a definite answer on How to tell how many senses has a lexeme, I don't see how this could change.
I do agree that creating 5000 senses for police isn't a good idea, but I think most contributors didn't embark on that. Focus could rather be to list potential senses of a single lexeme and, if desired, illustrate this.
I would be interested in a series of lexemes for words related to snow in Eskimo–Aleut languages, each with a picture on each sense. --- Jura 13:04, 23 November 2019 (UTC)
Yes, obviously senses *can* have statements, but that doesn't mean they *should* have statements (an obvious example is language of work or name (P407): we could add it on lexemes but we don't, as we have better ways to do it). In particular, we are talking here about storing pictures twice, once on the item and once again on the sense linked to the item. That's redundant and unnecessary, and it will only lead to problems (ultimately lexemes too big to be useful). Cheers, VIGNERON (talk) 14:27, 23 November 2019 (UTC)
Don't worry about lexeme size. Until most lexemes reach the mere number of forms of pl-lexemes, there is space left. It will be more accurate if you illustrate each sense with a picture, even if you end up linking a third of them to items that might or might not have an image. --- Jura 20:43, 24 November 2019 (UTC)
I worked on a lot of lexemes creating the picture dictionary. Is there any example where you think putting an image on the sense is better than on the Q-item? Your example of snow is not bad, but I don't see a reason why not to create a Q-item for each different type. In Danish we have the following snow-related nouns: slask, slud, sne, snetæppe, snefald, snebyge, snefnug, snestorm, grødis, isflage, frost, islag, isslag, tøsne, sne med skorpe, sneskorpe, dybsne, snevejr, sneklump, snebold, sneplov, snefog, snefygning, pulversne, snehule, snehytte, snebold, sneblindhed, sneboldeffekt, sneboldkamp, snebrille, snebunke, snedrive, snedrys, snedynge, snedække, snekanon, snekaster, snekrystal, snekæde, snelag, snelandskab, snemand, snemark, snemelding, snerydning, hagl, haglvejr...
Non-nouns: sne, sne til, sne inde, oversne, sneglat, snerig, snescooter, snesjap, snesko, sneskovl, sneskraber, sneskred, sneslynge, snesmeltning, snetykning, snevejrsdag, snevinter, --So9q (talk) 08:17, 25 November 2019 (UTC)

I have been using the pictures on lexeme senses for quite a while. Here are some thoughts:

  1. A sense may have been created where there is yet no Q-item. In Danish, there are currently 71 cases where there is an image but no link to a Q-item, e.g. pandekagerulle (L54596), å (L36714) and L37311-S3 (cf. L37311-S2); see https://w.wiki/Ckf . The counter-argument is to say that the Q-item should just be created, but that may entail a good deal of effort, e.g., what is an "å" compared to a river or a stream?
  2. Senses are tied to a language and Q-items are not. For the Danish word politi (L43239), you (at least I) could expect that a sense would display an image of Danish police. Indeed politimand (L43299) (policeman) shows an image of a Danish policeman. Likewise, for senses in a language such as Vietnamese you would expect to see a Vietnamese policeman.
  3. I do not think that adding an image to the sense would make the lexeme size unmanageable. It results in very few triples compared to the many forms in highly inflected languages.
  4. I already use the Wikidata graph and the connection via Q-items to pull in the images in Ordia with something like "?lexeme1 ontolex:sense / wdt:P5137? / wdt:P18 ?lexeme1Image". It is not a problem to use both; see the "Compound and derivation graph" in, e.g., https://tools.wmflabs.org/ordia/L2310
  5. I am unsure about how well we can align senses in different languages to concepts. In Linking ImageNet WordNet Synsets with Wikidata (Q50347076), I wrote about my difficulties in aligning WordNet's synsets with Wikidata's Q-items. There is also this report: Merging DanNet with Princeton Wordnet (Q65923665) (fulltext: https://zenodo.org/record/3463358). These alignments are not necessarily easy: perhaps an image might help to clarify the meaning of a sense.

Finn Årup Nielsen (fnielsen) (talk) 21:35, 25 November 2019 (UTC)

Here are my replies:
  1. I fixed å; you are right that sometimes it is not clear where the distinction lies, e.g. between river and stream vs. between å and flod. This uncertainty should be added to the Q-item and clarified on the sense in a usage note, see below.
  2. Using both might be the best solution, see below. I found some examples where you put a good image on the sense and the Q-item had no image; that should be avoided, I think. We can see it as sharing the image with the rest of WD when it is not strictly necessary to keep it on the lexeme. With my "solution" below, the Q-items will gain a lot from our efforts. An image on a sense should be a very rare thing, I think, and only to illustrate something explained in a note next to it.
  3. I agree; UI-wise it might be better to have images on senses, as there would be a very long list of images on Q-items if my proposal below is accepted.
  4. I "solved" this by adding multiple images to the Q-item and gave them language of work = Danish. This could theoretically result in 5000 images on a single item, one for each language in the world, but realistically it might only be around 200, as one for each state in the example driver's license (Q205647). Note that the cases where multiple states have the same concepts, like military, police, driving license, boat motor permission (fr), fishing permission, are relatively few. I can think of some other concepts like person, but that could be solved like woman and man by ranking a collage of them highest.
  5. I agree that adding an image may sometimes be helpful. In Wiktionary they have usage notes, and we should have that on senses as well to clarify use. Can someone propose a property for this to be used at form, sense and lexeme level? (I suggest not restricting it unless a good reason for restriction comes up.) A use case is e.g. https://www.wikidata.org/wiki/Lexeme:L54385 compared with https://en.wiktionary.org/wiki/-sk#Danish --So9q (talk) 08:37, 26 November 2019 (UTC)

Talk about a MachtSinn-like script for WD UI that infers sense-labels from Q-items

see https://www.wikidata.org/wiki/Lexeme_talk:L55575 --So9q (talk) 09:41, 26 November 2019 (UTC)

Should suffixes be prefixed with a hyphen/dash?

If so, with which one? I am primarily asking for German here, but I am interested in the matter more generally, as well as in the related question of whether prefixes should be suffixed accordingly. I suppose the answer is yes to both for languages where hyphens or dashes are commonly used for hyphenation, and in that case I suggest that there be some sort of constraint check (I have not looked into this myself yet). --Daniel Mietchen (talk) 22:51, 1 December 2019 (UTC)

It is useful. In ru-wikt we also set prefixes to be suffixed with "-" (like "pre-") and interfixes to be surrounded with "-" (like "-abil-"). But we use a leading hyphen for flexions, while suffixes are surrounded with "-" too. Infovarius (talk) 15:22, 6 December 2019 (UTC)
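Until a formal constraint check exists, an ad-hoc query can flag suspicious lemmas (a sketch, assuming suffix (Q102047) as the lexical category; Q188 = German; it lists suffixes whose lemma lacks the leading hyphen):

SELECT ?l ?lemma WHERE {
  ?l dct:language wd:Q188 ;
     wikibase:lexicalCategory wd:Q102047 ;
     wikibase:lemma ?lemma .
  FILTER (!STRSTARTS(?lemma, "-"))
}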

German language lexeme cleanup

On project chat, there was recently a problem report about lexemes in German: see Wikidata:Project_chat/Archive/2019/11#Lexeme mistakes.

Looking at other creations by the same user, I noticed that it's still unresolved and possibly went unnoticed by German-language lexeme editors. --- Jura 08:26, 3 December 2019 (UTC)

What do you propose? I cleaned up a handful of them, but it's a lot of work... ArthurPSmith (talk) 21:45, 3 December 2019 (UTC)
Any of the ways suggested so far should do. Deletion is probably the worst option, but I could batch-nominate them. --- Jura 15:32, 4 December 2019 (UTC)

For the capitalization part, this is typically the kind of thing that should be corrected by bot (if there is indeed no exception). In order to help, here are all lemmas starting with a letter between a and z (BTW, does someone know how to simply ask for all lowercase letters?):

SELECT DISTINCT ?l ?lemma WHERE {
?l a ontolex:LexicalEntry ; dct:language wd:Q188 ; wikibase:lexicalCategory wd:Q1084 ; wikibase:lemma ?lemma .
FILTER regex (?lemma, "^[a-z]").
}
Try it!

And the same for the lemmata at the form level:

SELECT DISTINCT ?l ?lemmata WHERE {
?l a ontolex:LexicalEntry ; dct:language wd:Q188 ; wikibase:lexicalCategory wd:Q1084 ; ontolex:lexicalForm ?form .
?form ontolex:representation ?lemmata .
FILTER regex (?lemmata, "^[a-z]").
}
Try it!
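As for the side question about matching any lowercase letter (not just a-z): the query service's regex implementation appears to accept Unicode character classes, so a filter like this should work (a sketch, assuming Java-style \p{Ll} is supported):

SELECT DISTINCT ?l ?lemma WHERE {
  ?l a ontolex:LexicalEntry ; dct:language wd:Q188 ; wikibase:lexicalCategory wd:Q1084 ; wikibase:lemma ?lemma .
  FILTER regex (?lemma, "^\\p{Ll}").
}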

Speaking of languages needing help, I could use some help with French, which is also in very bad shape (a lot of missing information and the second-worst ratio of forms per lexeme, below 1 - meaning that most lexemes have 0 forms! - and only Nynorsk (Q25164) is worse). Cdlt, VIGNERON (talk) 14:55, 4 December 2019 (UTC)

It's also something that can be done by bot (and was requested) .. neither probably needs any fixing. It's not in bad shape as such .. --- Jura 15:32, 4 December 2019 (UTC)

Senses with proverb words

Hi, I've had a bit of a problem with writing senses for some lexemes recently. In particular, I have problems with words that are used as parts of proverbs and take on a special meaning in them. For instance, take mák (L10888), which in Czech means either "poppy seed" or "poppy flower". However, when used in "ani za mák" ("not even for a poppy seed"), it means "not at all", and when used in "jako máku" ("like poppy seeds"), it means "a lot". How would I write those two senses of the lexeme while maintaining that it must be used together with those words, and perhaps even noting down that it is a proverb? (Would just L10888-S4 instance of (P31) proverb (Q35102) make sense for that?) --Adrijaned (talk) 19:00, 4 December 2019 (UTC)

Maybe they fit one of these: Wikidata:Lists/locutions/types ? --- Jura 19:12, 4 December 2019 (UTC)
Personally I would not add these as additional senses of the word. It is clearly a case of the idiom/phrase thing - the word itself does not have that meaning; that meaning exists for that specific collocation only. In my opinion these should be independent lexemes. For example, I created bílá vrána (L221052) with a lexical category of phraseme (Q5551966) (well, I do not know much about terminology in this area, it can always be changed into something more suitable later on), but it would need some more research on how to work with Czech phrases. And I am not sure how to link (for example) černá ovce (L221051) with černý (L221043) and ovce (L10561) together. Any ideas? --Lexicolover (talk) 20:58, 4 December 2019 (UTC)
I would personally definitely be against multi-word lexemes. Could the solution maybe be requesting a new property "used in proverb" for use on lexemes? No idea as to what its data type would be, though. Could proverbs get their own Q-items? --Adrijaned (talk) 21:48, 4 December 2019 (UTC)
I kind of understand why one would be against multi-word lexemes, but these already exist here (e.g. nominal phrases) and by the definition of lexemes it is not wrong. And data-wise it has advantages to have phrases as separate units from their parts. --Lexicolover (talk) 22:32, 4 December 2019 (UTC)
Alright, after some reading, I think I will follow the path of phraseme (Q5551966) lexemes too. Thanks! --Adrijaned (talk) 15:18, 5 December 2019 (UTC)
However, one thing that leaves me wondering is what the forms of such phrasemes should be. They don't have such nicely defined grammatical categories as noun (Q1084) or adjective (Q34698) --Adrijaned (talk) 15:25, 5 December 2019 (UTC)
You are right. There are many more issues (especially when it comes to verb phrases) where it is not clear how to deal with them. As for forms, I am not entirely sure, but I think we could follow whatever part of speech they substitute for. For example ani za mák (L228043) acts as an adverb and bílá vrána (L221052) acts as a noun; so we can use the respective set of forms as in adverb or noun lexemes for these phrasemes. --Lexicolover (talk) 22:43, 6 December 2019 (UTC)
What about doing something like ani za mák (L228043) subject has role (P2868) adverb (Q380057), for instance for ani za mák (L228043)? (Or perhaps using object of statement has role (P3831) instead, I'm not sure about their difference) --Adrijaned (talk) 23:05, 6 December 2019 (UTC)
Not sure if it is scientifically correct and what advantages it would have. There would be different roles depending on the context and point of view. But if you think it is useful, feel free to do it. --Lexicolover (talk) 23:42, 6 December 2019 (UTC)

Homophone property missing?

I did not find it. I was trying to enter the ones found here: https://en.wiktionary.org/wiki/here --So9q (talk) 20:24, 18 November 2019 (UTC)

I believe one was proposed at some point, but we decided it was better to find homophones by looking for matching IPA transcriptions. ArthurPSmith (talk) 15:00, 19 November 2019 (UTC)
Ok, thanks--So9q (talk) 17:42, 19 November 2019 (UTC)
Actually, I recall there being a "homonym" proposal, which failed to distinguish between homophones and homographs, and was accordingly voted against. I don't know/remember if there was an actual separate proposal for a homophones category. Circeus (talk) 00:13, 9 December 2019 (UTC)
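For reference, the matching-IPA approach might look like this (a sketch; P898 = IPA transcription, Q1860 = English; it pairs forms of different lexemes that share a transcription):

SELECT ?form1 ?form2 ?ipa WHERE {
  ?l1 dct:language wd:Q1860 ; ontolex:lexicalForm ?form1 .
  ?l2 dct:language wd:Q1860 ; ontolex:lexicalForm ?form2 .
  ?form1 wdt:P898 ?ipa .
  ?form2 wdt:P898 ?ipa .
  FILTER (?l1 != ?l2)
}
LIMIT 100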

How to exclude terms efficiently from a children's picture dictionary?

If we are to generate a precise and well-defined children's picture dictionary, we need some efficient way to mark and exclude lexemes with certain concepts, like anal sex (Q8398), which has a very informative picture but is not well suited to a picture dictionary for young children.

One route is to introduce another usage label like "not suitable for young children". Another is to exclude them directly in the query, which seems to work quite well:

 # Exclude out-of-scope concepts
 MINUS { ?q_concept wdt:P31 wd:Q3624078. } # countries
 MINUS { ?q_concept wdt:P31 wd:Q608. }     # sex
 MINUS { ?q_concept wdt:P31 wd:Q8386. }    # drugs
 MINUS { ?q_concept wdt:P31 wd:Q15142889. } # weapon family

WDYT?--So9q (talk) 23:31, 2 December 2019 (UTC)

I don't think an automated approach will work for this. ArthurPSmith (talk) 21:44, 3 December 2019 (UTC)
@ArthurPSmith: Not sure what you mean. Could you elaborate? Here is the English picture dictionary query FWIW.--So9q (talk) 08:16, 4 December 2019 (UTC)
@So9q: I also think that trying to do automatic exclusion will never be efficient. Plus, this kind of query will always forget to exclude some things; here, for an obvious example, sex organs are not excluded, nor is everything related to violence, like murder. And the other way around, it will exclude things you may want (why exclude countries here? and why not all the geo/topo/exo/odo/oro/hydro-nyms?). I guess a better approach is the opposite: to include only what you want. Cheers, VIGNERON (talk) 11:20, 4 December 2019 (UTC)
Ok, I understand.--So9q (talk) 11:36, 4 December 2019 (UTC)
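An include-only variant would replace the MINUS clauses with a hand-picked set of root classes, e.g. (a sketch; the class list is purely illustrative - Q2095 = food, Q729 = animal, Q11422 = toy):

 VALUES ?root { wd:Q2095 wd:Q729 wd:Q11422 } # food, animal, toy
 ?q_concept wdt:P31/wdt:P279* ?root .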

Help with question about OpenRefine

https://github.com/OpenRefine/OpenRefine/issues/2240#issuecomment-563417852

Indeed! Lexemes are not supported at the moment. We would be very interested to add support for this, but as far as I am aware nobody has planned to spend time working on this. If anyone is keen, I would be happy to discuss potential architectures. One first step towards this happens in Wikidata-Toolkit: Wikidata/Wikidata-Toolkit#437

It would be useful if you could give a few examples of the datasets you would import with OpenRefine if that feature was available. As a user, how do you expect this integration to work?

--So9q (talk) 10:23, 10 December 2019 (UTC)

Modeling spelling variants

Maybe of some interest here, but trying to avoid crossposting: I’d been wondering about how best to model spelling variants in the project chat before I saw that this talk page might have been a better place.―BlaueBlüte (talk) 12:10, 10 December 2019 (UTC)

MachtSinn update and help wanted

Hi, Michael updated the English database with my improved query and there are now ~7000 high-quality matches waiting for you :p

Michael invites you to help with pull requests to fix the outstanding issues with the back- and frontend.

Some tips:

Co-maintainers needed

Hello all,

Just sharing a message from user:MichaelSchoenitzer who is looking for co-maintainers for the tool MachtSinn and the Lexeme Python library.

If you have some knowledge in Python and Javascript and would be willing to help, feel free to contribute or get in touch. Great tools like these deserve some time and care :) Thanks in advance, Lea Lacroix (WMDE) (talk) 07:35, 18 December 2019 (UTC)

Aren't senses language independent?

Hello! I am a pretty inexperienced but interested fan of Wikidata, just checking out the new lexicographical part. As I was thinking about the data model, I was asking myself if it really makes sense to have the senses of a lexeme attached to the lexeme. As I understand senses, they are language-independent entities, similar to the existing items (Q values). So my questions are:

  1. Why aren't senses in their own namespace, making them language-independent?
  2. Wouldn't it have made sense to use (or create) items as senses?

--Franzsimon (talk) 21:00, 3 December 2019 (UTC)

@Franzsimon: The data model is fairly new and this is definitely something we've discussed extensively - see, just on this page, the discussions about using item for this sense (P5137), for instance. I think the model in Wikidata was based in part on the way senses are defined in the Wiktionaries; some of the distinctions can be very subtle and not necessarily deserving of an item in themselves. But maybe that level of detail isn't really necessary, or maybe those items really should be created where the distinctions are clear enough? Anyway, this is definitely an area that we may want to work out a different solution for in the long run. ArthurPSmith (talk) 21:51, 3 December 2019 (UTC)
@Franzsimon: I'm not fully sure I follow your idea. Yes and no: senses are mostly language-independent but they aren't totally independent. Right now, in most cases - for better and worse - a sense is just a gloss and a link to the corresponding Q-item (which is a different, language-neutral namespace, so independent in a way). But you need this link, as there will always be a relation somewhere between lexemes and items. The current way is maybe not perfect but I think it works quite well (I would have a lot to say about glosses but that's a different matter); like Arthur I would like to hear more about your idea. Cdlt, VIGNERON (talk) 11:57, 4 December 2019 (UTC)

I suppose the Wikidata lexicographic data is inspired by wordnets (WordNet) and the Lemon model. These resources distinguish between word, word sense and synset, which would correspond to L-, L-S-, and Q-entities. You can read more about the Lemon model at https://lemon-model.net/ --Finn Årup Nielsen (fnielsen) (talk) 14:26, 5 December 2019 (UTC)

Different languages draw the boundaries around words slightly differently. Having senses be language-specific allows us to be more specific about how a given language draws its boundaries. ChristianKl 16:06, 20 December 2019 (UTC)
@ChristianKl: This can also be achieved with a separate namespace for senses by creating separate Wikidata entities for senses that differ slightly. Senses that seem identical can still share the same entity, such that a potential to avoid a lot of duplication remains. --Njardarlogar (talk) 17:28, 22 December 2019 (UTC)

New property needed?

Hi, the lexeme drone in Danish has been imported twice. I would like to model this by either adding derived from on each sense, or by adding one derived from on the lexeme with 2 values and the qualifier "applies to sense". This qualifier does not seem to exist. WDYT about proposing it? --So9q (talk) 12:03, 25 December 2019 (UTC)

See Wikidata:Project_chat#Changes_needed_to_the_Wikidata:Notability_for_the_lexeme_project_to_be_able_to_succeed--So9q (talk) 22:18, 16 December 2019 (UTC)

Discussion closed and now archived at Wikidata:Project_chat/Archive/2019/12#Changes_needed_to_the_Wikidata:Notability_for_the_lexeme_project_to_be_able_to_succeed. Cheers, VIGNERON (talk) 12:29, 26 December 2019 (UTC)

lexicographical data too big...

Please help: how can I get the grammatical features of the forms of Russian nouns?

SELECT ?l ?lemma ?word (GROUP_CONCAT(DISTINCT ?grammaticalFeature; SEPARATOR=', ') AS ?lem)
WHERE {
  ?l a ontolex:LexicalEntry ; dct:language ?language ;
    wikibase:lexicalCategory wd:Q1084 ;
    wikibase:lemma ?lemma ;
    ontolex:lexicalForm ?form .
  ?form ontolex:representation ?word ;
    wikibase:grammaticalFeature ?grammaticalFeature .
  ?language wdt:P218 'ru'.
#  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }
} GROUP BY ?l ?lemma ?word ?lem
LIMIT 10
Try it!

This query times out. --Infovarius (talk) 20:49, 26 December 2019 (UTC)

@Infovarius: probably not with the SPARQL Wikidata Query Service; indeed, it's too big (2 475 622 results right now!). If you really want to use it, you need to filter and/or do a simpler query:
SELECT ?l ?lemma ?word ?grammaticalFeature
WHERE {
  ?l a ontolex:LexicalEntry ; dct:language wd:Q7737 ; wikibase:lexicalCategory wd:Q1084 ; wikibase:lemma ?lemma ; ontolex:lexicalForm ?form .
  ?form ontolex:representation ?word ; wikibase:grammaticalFeature ?grammaticalFeature .
}
LIMIT 100
Try it!
Cdlt, VIGNERON (talk) 11:30, 27 December 2019 (UTC)
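One way to keep the grouped query under the timeout is to batch it by the first letter of the lemma and run it once per letter, e.g. (a sketch for lemmas starting with Cyrillic "а"):

SELECT ?l ?lemma ?word (GROUP_CONCAT(DISTINCT ?grammaticalFeature; SEPARATOR=', ') AS ?features)
WHERE {
  ?l a ontolex:LexicalEntry ; dct:language wd:Q7737 ; wikibase:lexicalCategory wd:Q1084 ; wikibase:lemma ?lemma ; ontolex:lexicalForm ?form .
  ?form ontolex:representation ?word ; wikibase:grammaticalFeature ?grammaticalFeature .
  FILTER (STRSTARTS(?lemma, "а"))
} GROUP BY ?l ?lemma ?word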