Wikidata talk:Lexicographical data

Lexicographical data
Place used to discuss any and all aspects of lexicographical data: the project itself, policy and proposals, individual lexicographical items, technical issues, etc.
On this page, old discussions are archived. An overview of all archives can be found at this page's archive index. The current archive is located at 2019/07.


Again letters

I propose to merge a (L20817) and a (L45484). It is a letter of the same script, independently of language. And we can use "multiple languages" or similar instead of language in the headings. @Airon90, Liamjamesperritt, Jura1: any objections? --Infovarius (talk) 20:36, 21 June 2019 (UTC)

  • We could use a separate sense for each language .. what's the advantage of merging them? --- Jura 20:39, 21 June 2019 (UTC)
    • Separate sense? Are the Italian letter "a" and the English letter "a" really different? --Infovarius (talk) 21:43, 23 June 2019 (UTC)
  • @Infovarius, Jura1: Most dictionaries include letters as dictionary entries, so it should be fine to at least include them. I've just been making them language specific since letters in each language have their own etymology that could be modeled. Additionally you could also add a Translingual Lexeme as well, as Wiktionary currently does... Liamjamesperritt (talk) 01:42, 22 June 2019 (UTC)
  • @Airon90, Liamjamesperritt, Jura1, Infovarius: Do we need these lexemes at all? (and is it even a lexeme? can't the item be enough?) And if yes, why? Depending on the need, we may keep them separate, merge them or delete them both, and it's hard to say without specifics. But, point-blank, I don't think the merge is a good idea (nor a translingual lexeme, A (Q9659) is already there for that). Liamjamesperritt: letter etymology depending on language? Could you give an example? Cdlt, VIGNERON (talk) 11:14, 22 June 2019 (UTC)
    • I don't have a strong opinion. Probably Lexemes can contain multilingual stuff and model it a bit differently than items do. --Infovarius (talk) 21:43, 23 June 2019 (UTC)
    • Actually to be honest, the etymology of letters could still be modeled as Items. I personally don't believe letters qualify as Lexemes (as they don't have a proper sense), but Jura has been insistent that letters be added as Lexemes. Liamjamesperritt (talk)
  • Maybe a more general question is if we or you actually need lexeme namespace. What do you use it for? Do you create any? Do you plan to contribute? --- Jura 11:18, 22 June 2019 (UTC)
    • If you are asking me, I contribute from time to time. I find Lexemes useful in querying, more than Wiktionary articles. --Infovarius (talk) 21:43, 23 June 2019 (UTC)
    • @Jura1: The amount of contributions someone makes shouldn't give them more or less of a voice. Wikidata is not just for people who contribute data, but also for people who use the data. Every Wikidata user's opinion should be valid here. Liamjamesperritt (talk) 13:50, 29 June 2019 (UTC)
      • Not really. Last time Infovarius edited 1 of some 30 similar items to change it to something he preferred. He didn't edit any of the other 29, so obviously, the whole thing became problematic. When his edit later came up for discussion, he didn't participate. So if his objective is to query data, one should examine how this can be achieved and how the present solution doesn't allow it. --- Jura 14:00, 29 June 2019 (UTC)
  • I say they should not be merged. In Danish, the letter has grammatical gender, definiteness, plural/singular and pronunciation, see https://ordnet.dk/ddo/ordbog?query=a. That would be different from another language. — Finn Årup Nielsen (fnielsen) (talk) 17:01, 8 July 2019 (UTC)
  • Agree also with Finn. How do you intend to treat pronunciation in a merged item? V!v£ l@ Rosière /Murmurer…/ 03:47, 19 July 2019 (UTC)

Shaping language variants occurrences

Hello! In a few days we will have a really big bunch of Basque nouns uploaded as Lexemes in Wikidata. Every word has 46 forms, and they will also have their senses. This can help us to make some good experiments using Basque words as a base, for example in Wiktionary. But we have started thinking about future developments based on these words, and we have thought of the project Ahotsak, which records people talking about many things and then makes exact phonetic transcriptions of what is being said.

Basque is a language with lots of variants and subvariants, but the words are standardized. Take aita (L49255), the word for father. This is written as aita, but can be pronounced as aitá, áita, aitte, atxa... These subvariants can be well formatted on Wikidata, and we can even add an audio file. But we can also record where this word has been recorded (with coordinates or, most commonly, locality), so we can build, in the future, isoglosses with this information. How can we model this location information, so that we have in the future phonetic and written testimony of the variants' extension? -Theklan (talk) 14:38, 28 June 2019 (UTC)

Hi Theklan !
Great idea!
First, I would say that each variant deserves to have its own separate form.
You can start by looking at mądry (L24242) (140 forms, our record so far and with several audio files).
For the precise localisation, I don't know (it has never been done as far as I can tell). Plus, I'm wondering: shouldn't it be done on the Wikimedia Commons side ? (not sure it's enough).
The only case I see where something like that is explicitly indicated is if the variant is specific to a dialect (or any lect for that matter). Then you can use this lect in the Spelling variant, for instance eu-x-Q17354876 for Q17354876.
Cdlt, VIGNERON (talk) 15:48, 28 June 2019 (UTC)
If it's spelled the same then I don't think different forms are warranted. If the dialect/variant is named or somehow conceptualizable as an item, then we have pronunciation variety (P5237) which can be attached as a qualifier to the IPA transcription (P898) or pronunciation audio (P443) statements on the form, and the geographic coordinates etc. can be attached to the item. If that doesn't really make sense, then I think you can just add additional qualifiers to the pronunciation audio (P443) etc. statements on the form. ArthurPSmith (talk) 15:54, 28 June 2019 (UTC)
@VIGNERON: I'm not talking about forms, but about pronunciation variants, which are not officially coded but exist. In the example I gave, aita (L49255), you have all the forms stated, but the pronunciation of most of these forms will vary depending on the place. Take for example oui (L9089). In most places (afaik) it is pronounced [wi], but you know that it is not uncommon to hear it as [we] or even [ue]. The word is /oui/, but the pronunciations can be geographically shaped without being real variants or forms. In Basque this is very evident: you can tell where a speaker comes from if, instead of pronouncing [etxe] (house), they pronounce [etxí]. But you can also hear, for the definite form etxea, the pronunciations [etxea], [etxie], [etxia], [etxiya] or [etxiye]. And these can't be subforms of the definite form, but they can definitely be shaped as data: (someone) can write the pronunciation, we can have separate audio files and we can record a place for the recording. The issue is: how can we model this well, so that it works not only for the Basque language?
@ArthurPSmith: Indeed, we can shape it as language variants, but these variants could be too many, as the dialect can be named but it's not something official. -Theklan (talk) 19:20, 30 June 2019 (UTC)
@Theklan:: Hello, I would do it as follows (and as I understand it, VIGNERON suggests an almost identical way):
FORM: etxea (grammatical features)
      STATEMENT: <IPA>: [etxea]
                 QUALIFIER: <pronunciation variety>: dialect 1 item
      STATEMENT: <IPA>: [etxie]
                 QUALIFIER: <pronunciation variety>: dialect 2 item
                 QUALIFIER: <pronunciation variety>: dialect 3 item
      STATEMENT: <IPA>: [etxiya]
                 QUALIFIER: <pronunciation variety>: dialect 4 item
I think that would be the correct approach. It might be tricky for languages that do not have well-described dialects, but it should still be possible to just use a region instead of a dialect. But the issue I see is that you might be doing (your own) research in this area, so you would be getting a lot of data here. Wikibase is a great tool to analyse data, but I am not so sure Wikidata is so great in this area (well, it depends on what we expect from it, so I am not against it).--Lexicolover (talk) 19:55, 30 June 2019 (UTC)
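The structure proposed above can be sketched in Wikibase claim JSON roughly as follows. This is only an illustration, not what any existing bot does: the property IDs P898 (IPA transcription) and P5237 (pronunciation variety) are the ones named in this thread, while the dialect item IDs are hypothetical placeholders that would have to be replaced by real dialect or region items.

```python
import json

IPA_TRANSCRIPTION = "P898"       # property named in this discussion
PRONUNCIATION_VARIETY = "P5237"  # property named in this discussion


def item_snak(prop, qid):
    """Build a qualifier snak whose value is a Wikidata item."""
    return {
        "snaktype": "value",
        "property": prop,
        "datavalue": {
            "value": {"entity-type": "item", "id": qid},
            "type": "wikibase-entityid",
        },
    }


def ipa_statement(ipa, variety_qids):
    """One IPA statement, qualified by one or more dialect/region items."""
    return {
        "type": "statement",
        "rank": "normal",
        "mainsnak": {
            "snaktype": "value",
            "property": IPA_TRANSCRIPTION,
            "datavalue": {"value": ipa, "type": "string"},
        },
        "qualifiers": {
            PRONUNCIATION_VARIETY: [
                item_snak(PRONUNCIATION_VARIETY, qid) for qid in variety_qids
            ]
        },
    }


# Hypothetical placeholder IDs: substitute real dialect or region items.
pronunciations = {
    "etxea": ["Q_DIALECT_1"],
    "etxie": ["Q_DIALECT_2", "Q_DIALECT_3"],
    "etxiya": ["Q_DIALECT_4"],
}

# Claims block for the form "etxea": one statement per pronunciation.
claims = {
    IPA_TRANSCRIPTION: [
        ipa_statement(ipa, qids) for ipa, qids in pronunciations.items()
    ]
}

print(json.dumps(claims, indent=2, ensure_ascii=False))
```

Keeping one statement per pronunciation means each qualifier stays attached to the transcription it describes, which is the ambiguity raised about forms that carry several unqualified pronunciations.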
@Theklan: oh, my bad, since you wrote it down I thought it was about spelling variants too and not just pronunciation variants (I'm probably biased by Breton here). In that case, I would put the several pronunciations in several statements on the same form, with the precision in a qualifier. And you raise a good question: it should be consistent for all languages, but I'm not sure we actually have a definitive structure for that (and it makes me think that mądry (L24242) may not be correct, as with several pronunciations it wouldn't be clear which statement refers to which pronunciation; Lexicolover's proposal sounds better, ping @KaMan: what do you think?). So thanks again for raising the question but sadly I don't have a definitive answer :/ But hopefully, the community will soon agree on it ;) Cheers, VIGNERON (talk) 20:00, 30 June 2019 (UTC)
@Lexicolover: This proposal sounds GREAT! -Theklan (talk) 20:44, 30 June 2019 (UTC)
@Theklan: good to see that Basque content will increase. One question: how do you plan to add this "really big bunch of nouns"? Do you plan to do it by hand or did you code a bot for that? In the second case, could you run your bot on a few examples so we can evaluate what we will get at the end? Thanks in advance. Pamputt (talk) 08:58, 29 June 2019 (UTC)
@Pamputt: The data is being uploaded by Elhuyar Fundazioa using a bot. They have been evaluated and accepted for that. -Theklan (talk) 19:20, 30 June 2019 (UTC)
@Theklan: Ok, what has been done by Elhuyar_Fundazioa looks fine because it concerns only Form. However, I wonder about the Senses because they are copyrighted data. I would like to be sure that the Elhuyar Dictionary is licenced under CC0 or equivalent. It does not seem to be the case according to this page that says that the data are licenced under CC by-nc-nd. Pamputt (talk) 20:11, 30 June 2019 (UTC)
@Pamputt: They are uploading it within an agreement with the user group, so yes, now the data will be under cc0, and they have inserted a link to their dictionary, so we can mutually benefit from each other (they provide translations and soon they could take also images from Commons to illustrate their dictionaries). In the same way, magic (L3) has a link to OED, that is not free -Theklan (talk) 20:21, 30 June 2019 (UTC)
@Theklan: I have no doubt that you are working with them. My point is if we start to upload data from their dictionary, then they should update their licence to CC0 otherwise it is a licence violation. Or maybe they could send a ticket to OTRS in order to say officially they release their dictionary under CC0. About magic (L3), as far as I know, this is different case because no data of this lexeme come from OED (this is only a link). Pamputt (talk) 20:58, 30 June 2019 (UTC)
@Pamputt: I think we are mixing their multilanguage dictionary (which is under cc-nc-nd) and the definitions, which are not covered in this online dictionary they are linking, and will be uploaded as senses but are not online there with that license. -Theklan (talk) 21:13, 30 June 2019 (UTC)
@Theklan: sorry if I mix the multilanguage dictionary and the definitions. This is indeed the case. So, the question becomes, where does the definition come from? Even if they are offline, there is a licence on them (the same as for the paper dictionary) so I would like to be sure that they are licenced under CC0. Is there any "proof" somewhere (a simple email from Elhuyar Fundazioa to OTRS should be enough)? Just to be sure I understand correctly, is there already a Basque lexeme with one definition? Pamputt (talk) 21:29, 30 June 2019 (UTC)

Ok @Pamputt:! I have written them so they can say something here or take action. It will take some days, though. -Theklan (talk) 08:44, 1 July 2019 (UTC)

FYI, if there is the need to authenticate data providers and to state that the data is released under CC0, you can use the OTRS queue at info@wikidata.org. Lea Lacroix (WMDE) (talk) 08:23, 7 July 2019 (UTC)

masculine inanimate in Polish language (Q52943434) and inanimate masculine (Q54020181)

Hello. What is the purpose of having masculine inanimate in Polish language (Q52943434)? I think it should be merged with inanimate masculine (Q54020181). I do not see any advantage in having "masculine in French", "masculine in Spanish", "masculine in Italian", "masculine in German", ... Pamputt (talk) 18:26, 1 July 2019 (UTC)

@Paweł Ziemian: because you created this item, could you tell us what you think (reply in Polish if you want :)). And maybe KaMan has some opinion on this as well. Pamputt (talk) 08:21, 6 July 2019 (UTC)
When I created masculine inanimate in Polish language (Q52943434) as "rodzaj męskorzeczowy / masculine inanimate" to use it for words in Polish, the inanimate masculine (Q54020181) did not exist yet. The suffix "in Polish" was added later. See the history and talk page of the items. Paweł Ziemian (talk) 21:53, 6 July 2019 (UTC)

Bot creation to move lexicographical data

Hello.
I just started working on a bot for Wikidata which would be able to import part of the lexicographical data from the Lo Congrès online dictionary into Wikimedia.
The project concerns Lexemes from 3 languages (French, Occitan Lengadocian and Occitan Gascon) and will add, at first, the lemmas, the forms, the translation relationships between words and the variant relationships. This bot will take the data from a .csv file and create new Lexemes from it.
I would also like to share the code with anyone interested, so I am trying to make this bot reusable for other languages.
I am just at the beginning of this project but I first wanted to introduce myself and the project to you. --Aitalvivem (talk) 18:36, 3 July 2019 (UTC)

I'm not entirely sure, but it looks like this data is provided under a CC-BY license from this page. If you can get in touch with the owners of the content you should probably check that they are ok with your importing this data into Wikidata. From previous discussions of such sources, it seems clear that at least definitions (for senses) could not be imported by a bot, without further clarity on the license. ArthurPSmith (talk) 22:09, 3 July 2019 (UTC)
Yes, Wikidata is licenced under CC0, so all the data imported into it has to be CC0, at maximum, as well. To get an idea of what is copyrightable in lexicographical data, you can read this legal analysis by a lawyer from the WMF. Pamputt (talk) 05:36, 4 July 2019 (UTC)
@ArthurPSmith @Pamputt Indeed some dictionaries used by Lo Congrès are provided under a CC-BY license but some others are free (3 dictionaries). I am working for Lo Congrès so we are aware of this constraint and we will only import free data to Wikidata.--Aitalvivem (talk) 13:03, 4 July 2019 (UTC)
Welcome Aitalvivem/AitalvivemBot :)
If we need to have an official statement from the organization at some point, to state that the data is released under CC0, we can use the Wikidata OTRS queue at info@wikidata.org. Lea Lacroix (WMDE) (talk) 15:13, 4 July 2019 (UTC)

Hello, I wrote a function to create a Lexeme, but when I try it on the test environment I always get the same error:

{
    "error": {
        "code": "failed-save",
        "info": "The save has failed.",
        "messages": [
            {
                "name": "wikibase-api-failed-save",
                "parameters": [],
                "html": {
                    "*": "The save has failed."
                }
            }
        ],
        "*": "See https://test.wikidata.org/w/api.php for API usage. Subscribe to the mediawiki-api-announce mailing list at <https://lists.wikimedia.org/mailman/listinfo/mediawiki-api-announce> for notice of API deprecations and breaking changes."
    },
    "servedby": "mw1287"
}

Any idea how I could fix it? --Aitalvivem (talk) 09:23, 8 July 2019 (UTC)

@Aitalvivem: Can you show us the request you send (without the edittoken, that is private), that led you to this? --Hoo man (talk) 09:59, 8 July 2019 (UTC)
@Hoo man: Of course, here is an example of a request generated by my code:
	{
		'action': 'wbeditentity',
		'format': 'json',
		'new': 'lexeme',
		'token': CSRF_TOKEN,
		'data': {'labels':{'ostal':{'type':'lexeme', 'lemma':'ostal', 'language':'oc'}}}
	}

You can find my code here https://github.com/aitalvivem/AitalvivemBot --Aitalvivem (talk) 10:14, 8 July 2019 (UTC)

@Aitalvivem: isn't labels data for items only? Not sure (neither the error message nor the wbeditentity doc is clear) but maybe you can try with lemmas instead? Cdlt, VIGNERON (talk) 08:50, 10 July 2019 (UTC)
@VIGNERON: I tried to change "labels" to "lemmas". I got some clear error messages, so I had to adapt a few parameters in my request. But now I am stuck on the same error message again :/
Here is my new request:
	{
		'action': 'wbeditentity',
		'format': 'json',
		'new': 'lexeme',
		'token': CSRF_TOKEN,
		'data': {'lemmas':{'oc':{'type':'lexeme', 'lemma':'ostal', 'value':'ostal', 'language':'oc'}}}
	}
I tried without the 'type' and 'lemma' parameters, but every time the API answers with the same "failed-save" error message.--Aitalvivem (talk) 12:25, 10 July 2019 (UTC)
@Aitalvivem, VIGNERON: Here's the content of data that produced ljusgul (L54797). Note that you have to include the Q-id of the language and lexical category. --Vesihiisi (talk) 12:43, 10 July 2019 (UTC)
{
  "type": "lexeme",
  "lemmas": {
    "sv": {
      "language": "sv",
      "value": "ljusgul"
    }
  },
  "language": "Q9027",
  "lexicalCategory": "Q34698",
  "forms": [
    {
      "add": "",
      "representations": {
        "sv": {
          "language": "sv",
          "value": "ljusgul"
        }
      },
      "grammaticalFeatures": [],
      "claims": []
    }
  ]
}
@Vesihiisi:Thank you very much, that was exactly what I needed !--Aitalvivem (talk) 15:39, 10 July 2019 (UTC)
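For anyone hitting the same "failed-save" error, here is a minimal sketch of how the working payload above could be assembled in Python. The helper name build_new_lexeme_params is hypothetical, the token is a placeholder, and the example values are the Swedish ones from the payload above. The sketch assumes that wbeditentity expects "data" as a serialized JSON string, and it only builds the parameters without sending anything.

```python
import json


def build_new_lexeme_params(lemma, lang_code, language_qid, category_qid, token):
    """Build wbeditentity parameters for creating a new Lexeme.

    Hypothetical helper: it builds the request parameters and sends nothing.
    """
    data = {
        "type": "lexeme",
        # Lemmas are keyed by language code; there is no "labels" key.
        "lemmas": {lang_code: {"language": lang_code, "value": lemma}},
        # Both of these are item IDs (Q-ids), not language codes.
        "language": language_qid,
        "lexicalCategory": category_qid,
    }
    return {
        "action": "wbeditentity",
        "format": "json",
        "new": "lexeme",
        "token": token,
        # Assumption: the API takes "data" as a JSON string, not a nested dict.
        "data": json.dumps(data, ensure_ascii=False),
    }


# Values taken from the ljusgul (L54797) payload; the token is a placeholder.
params = build_new_lexeme_params(
    lemma="ljusgul",
    lang_code="sv",
    language_qid="Q9027",
    category_qid="Q34698",
    token="CSRF_TOKEN_PLACEHOLDER",
)
print(params["data"])
```

The resulting dictionary would then be sent as the POST body to the API with whatever HTTP client the bot already uses.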

@Aitalvivem: I have deleted over 80 items you created because they did not comply with our notability policy (like Q65316588 "Pronom possessif 3e personne du singulier", Q65295676 "Pronom personnel réfléchi tonique 1ere personne du pluriel", or Q65247119 "Adjectif masculin pluriel"), which suggests that you are not familiar with how Wikidata is modelled nor with our policies. Most of the deleted items were created in one single day, which makes me think that you were in a rush. Also, I have performed several merges of items that you created as duplicates of already existing items. Wikidata's learning curve is not insurmountable, but it does take time. I bet you mean well and the task you are proposing seems to have great potential; however, at this point I think it would be unwise for you to engage in mass edits. Please take some more time to get to know Wikidata. As you can see, there will always be plenty of people willing to answer any doubts, but you do need to ask for help when needed/unsure. Andreasm háblame / just talk to me 06:08, 23 July 2019 (UTC)

@Andreasmperu: Hello, I'm sorry for the trouble. I was trying to add the missing lexical categories that I will use to insert the Lexemes with my bot. I have now modified my file, which converts lexical categories into item ids, to only use items that already exist in Wikidata. Just to be sure, could you tell me what the problem with those items was? I thought they would meet the third criterion of the notability policy but I may have misunderstood it. Aitalvivem (talk) 09:38, 23 July 2019 (UTC)
@Aitalvivem: the "good" way is to add the categories in several parts. For instance singular (Q110786) + masculine (Q499327) is enough; there is no need for "singulier masculin" (and it's often easier to query afterwards). That said, it's not always clear nor possible (I still don't know for sure how to model some lexical categories like "plural of plural" for lagad (L114); here a new item is probably needed. I've been here for almost 7 years and I think I know Wikidata, but I'm still not always sure @Andreasmperu:), so don't hesitate to ask; indeed there are a lot of people who can help you here. Cdlt, VIGNERON (talk) 16:53, 23 July 2019 (UTC)
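A rough sketch of the "several parts" approach, in case it helps with the bot's conversion file: split a composite label from the source dictionary into its parts and map each part to an existing item. The function name and the mapping table are hypothetical, and only the two item IDs mentioned above are filled in; the rest would have to be looked up on Wikidata.

```python
# Hypothetical mapping from source-dictionary terms to existing Wikidata items.
# Only the two IDs mentioned in this thread are included.
FEATURE_ITEMS = {
    "singulier": "Q110786",  # singular
    "masculin": "Q499327",   # masculine
}


def grammatical_features(label):
    """Split a composite label into parts and map each to an existing item.

    An unknown term raises KeyError, so it gets looked up (or asked about
    here) instead of being turned into a new composite item.
    """
    return [FEATURE_ITEMS[word] for word in label.lower().split()]


print(grammatical_features("Singulier masculin"))
```

The returned list could then go in the "grammaticalFeatures" field of a form.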

New user script to simplify adding forms on lexemes

Hi everyone! I’ve written a user script (documentation) to make it easier to add Forms to Lexemes that don’t have any Forms yet: when you view a Lexeme without Forms, it will determine the matching template(s) of the Wikidata Lexeme Forms tool and add links to them below the regular “add Form” link (see the announcement tweet for screenshots). I hope that this will be useful to some of you! --Lucas Werkmeister (talk) 13:06, 8 July 2019 (UTC)

Lexemes to delete

Hey there,

While checking some numbers on Ordia I came across a few Lexemes that are probably mistakes (people who tried to create an item, or at least entered as language something that is not a language):

Can I let someone check them and delete if needed? :)

I'm wondering what kind of query could help spot these mistakes. Lea Lacroix (WMDE) (talk) 15:56, 15 July 2019 (UTC)
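One possible starting point, offered as an untested sketch: list Lexemes whose language item is not an instance of (a subclass of) language (Q34770). It assumes the usual Wikidata Query Service prefixes and the dct:language predicate of the Lexeme RDF mapping. The variable name is hypothetical, and the class check is a guess that will likely return false positives (for example dialects typed differently) and may need a tighter limit to avoid timeouts.

```python
# SPARQL sketch for the Wikidata Query Service; the query is only printed
# here so that it can be pasted into the query editor.
SUSPECT_LEXEMES_QUERY = """
SELECT ?lexeme ?lemma ?language WHERE {
  ?lexeme a ontolex:LexicalEntry ;
          wikibase:lemma ?lemma ;
          dct:language ?language .
  # Assumption: a valid language is an instance of some subclass of Q34770.
  FILTER NOT EXISTS { ?language wdt:P31/wdt:P279* wd:Q34770 . }
}
LIMIT 200
"""

print(SUSPECT_LEXEMES_QUERY.strip())
```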