Wikidata talk:Lexicographical data

Lexicographical data
Place used to discuss any and all aspects of lexicographical data: the project itself, policy and proposals, individual lexicographical items, technical issues, etc.
On this page, old discussions are archived. An overview of all archives can be found at this page's archive index. The current archive is located at 2019/10.


Lexemes to delete

Hey there,

While checking some numbers on Ordia I came across a few Lexemes that are probably mistakes (people who tried to create an item, or at least entered as language something that is not a language):

Can I let someone check them and delete if needed? :)

I'm wondering what kind of query could help spot these mistakes. Lea Lacroix (WMDE) (talk) 15:56, 15 July 2019 (UTC)

Many queries can help; some are already on Wikidata:Lexicographical data/Ideas of queries. Most of the time, filtering on the count helps to find outliers (which may be false positives, but most often are not). For instance, here is the list of lexical categories used only once: https://w.wiki/6Xs. Most need correction and some need deletion (I already put invalid ID (L46090) and invalid ID (L43552) in WD:RFD; I am not sure what to do with กาเบรียล (L43417): can a transcription of a proper name be kept or not?).
There are also invalid ID (L58089). I have made a RfD [1]. — Finn Årup Nielsen (fnielsen) (talk) 08:11, 16 September 2019 (UTC)
There are various other lexemes where it is not clear whether the lexeme is totally wrong or just a wrong beginning: invalid ID (L58719), invalid ID (L57857), invalid ID (L61908), AMC (L157090). — Finn Årup Nielsen (fnielsen) (talk) 08:48, 16 September 2019 (UTC)
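The "filter on the count" idea above can be sketched offline as well. This is only an illustration, assuming the lexemes have already been exported into simple records; the record layout and the sample IDs are hypothetical, and the function name is made up for this example.

```python
from collections import Counter


def rare_categories(lexemes, max_count=1):
    """Return lexical categories used by at most `max_count` lexemes.

    Rarely used categories are only candidates for review: some will be
    legitimate, as noted in the discussion above.
    """
    counts = Counter(lex["lexical_category"] for lex in lexemes)
    return sorted(cat for cat, n in counts.items() if n <= max_count)


# Hypothetical sample records (not real lexemes).
SAMPLE = [
    {"id": "L1", "lexical_category": "Q1084"},  # noun
    {"id": "L2", "lexical_category": "Q1084"},  # noun
    {"id": "L3", "lexical_category": "Q5"},     # "human" used as a category: suspicious
]

if __name__ == "__main__":
    print(rare_categories(SAMPLE))
```

The same grouping could be done on the language field to spot lexemes whose language is not a language.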

Different meanings of the word for the direction left

Does anyone know the meaning of this item: different meanings of the word for the direction left (Q18340392)? It looks like an early attempt to create lexicographical data. --Kolja21 (talk) 17:25, 22 August 2019 (UTC)

@Diwas: could you tell us the purpose of this item? Pamputt (talk) 18:13, 22 August 2019 (UTC)
I'm guessing the idea was to link to pages like en:Sinister - but there never seem to have been any actual page links on the page. I'd recommend it for deletion now. ArthurPSmith (talk) 18:22, 22 August 2019 (UTC)
See https://tools.wmflabs.org/reasonator/?q=Q18340392&lang=de. There are some disambiguation pages about words meaning left, sinister, gauche, ... in variations. Wikidata should be able to show those groups of disambiguation pages and other groups of objects. different meanings of the word for the direction left (Q18340392) shows a way to link objects that are not the same but have, in part, the same or similar meanings. I do not need this data object, but Wikidata loses a value-added service if it forgets those links. I hope Wikidata:Lexicographical data will help to provide services that show users links between similar objects, like those, or like a link between a profession and a professional when one Wikipedia language version has only an article about the profession and another language version has only an article about the professionals in this profession. --Diwas (talk) 02:41, 23 August 2019 (UTC)
I suppose that Lexemes could help: all these disambiguations can be linked to Lexemes which can be linked to some central item like different meanings of the word for the direction left (Q18340392). Look e.g. left (L3350). --Infovarius (talk) 14:58, 23 August 2019 (UTC)
Ah, it was being used as a class to group disambiguation pages. That's an interesting approach, I guess it doesn't hurt to keep it then! ArthurPSmith (talk) 17:55, 23 August 2019 (UTC)
It should still have some sort of properties, but I'm rather stumped as to what they ought to be. I'll ask over at Project Chat. Circeus (talk) 19:09, 23 August 2019 (UTC)

Merging duplicates--is it possible?

(L1027) and aluminum/aluminium (L18179) are duplicates but the "Merge Wizard" doesn't appear to support merging of lexemes. Is there another way to merge these two lexemes together? Dhx1 (talk) 18:10, 24 August 2019 (UTC)

@Dhx1: There is Special:MergeLexemes. --Shinnin (talk) 19:48, 24 August 2019 (UTC)
@Dhx1, Shinnin: Done. Merged. The trick is that, to be merged, Lexemes need to have the same main lemma for the same language (which was not the case, so I changed it in Special:Diff/1004219174 before merging to unblock the merge). Cheers, VIGNERON (talk) 15:04, 28 August 2019 (UTC)

Adding reference URL to lexeme forms

Hi, I tried to add a reference URL for this lexeme form and, based on the warning I get, I suppose that the reference should be inserted for a statement, not as a statement itself. Could anyone suggest the correct way to do that if I want to reference the form, not any statement? --Strepon (talk) 14:17, 25 August 2019 (UTC)

A usage example (P5831) statement seems like the most obvious option. As you seem to be basically saying "this is here because it's in dictionary X", though, maybe attested in (P5323) is a better option (big maybe; I can't say I'm enamored of it, though: it's a little too "let's import this dictionary wholesale" for my taste, and I assume that website is not likely to be free and open content). Mind you, a regular "form of" shouldn't require any sort of sourcing whatsoever IMO (because it just follows from the normal properties, i.e. classes/declensions, of the word combining with the language's normal rules). I'd source it separately only if it doesn't follow from any of the usual patterns in the language (e.g. the past historic of verbs in -traire, which are a source of considerable hesitation in books that teach French conjugation). Circeus (talk) 03:37, 26 August 2019 (UTC)
You pointed out the problem which I am not sure about: how to source forms. I agree that natural and predictable patterns do not require any reference (on the other hand, if someone raises doubts about them, how can we prove they are correct?); however, sometimes I don't know which variants are codified precisely and I need to look in the dictionary. Then I think it is proper to add the source. Is there any consensus or rule regarding this topic?
From your suggestions, attested in (P5323) sounds better for me, as I'm not adding examples; but I understand your concerns related to non-free sources. --Strepon (talk) 20:01, 27 August 2019 (UTC)

Batch import

Hello, for a project I am working on we have developed a series of curated vocabularies which include entries in various languages (mostly en and de, but also fr and it) and in their various forms (some RDF here: https://github.com/swiss-art-research-net/vocab ). I would like to add in batch the ones not present on Wikidata, but I cannot find a way to do it with QuickStatements or OpenRefine. Do you know how I can add them?  – The preceding unsigned comment was added by Wpbloyd (talk • contribs) at 14:40, August 26, 2019 (UTC).

  • @Wpbloyd: There's a new "batch" mode in the Wikidata Lexeme forms tool - for example here for English nouns. If you're going to be importing on the order of thousands of entries you should probably get "bot" approval first - i.e. do a handful the way you plan to handle these and have people review them on the Wikidata:Requests for permissions/Bot page. ArthurPSmith (talk) 17:23, 26 August 2019 (UTC)
  • @Wpbloyd: What's the license of your data (and, specifically, the whole collection)? Because it should be CC0 in order to be imported to Wikidata Lexemes. --Infovarius (talk) 09:11, 27 August 2019 (UTC)
@ArthurPSmith: Thanks for the link! There are not that many entries, only around 100 to 150, so this should work! @Infovarius: thanks, that is true. I will add the CC0 license to GitHub too! (Wpbloyd (talk) 13:10, 27 August 2019 (UTC))

@Wpbloyd: I suppose there is also the issue of how to indicate that the word is listed in SARI. — Finn Årup Nielsen (fnielsen) (talk) 08:55, 16 September 2019 (UTC)

@Fnielsen: From our side, that would not really be necessary in this case. However, it would probably be important to discuss it further in case other projects want to deliver lexemes to Wikidata. It could be seen as an incentive.

@Wpbloyd: See also: the code of a bot developed to import data from an external source, and a python library to create your own bot. As already mentioned, the data imported into Wikidata must be CC0 and clearly identified as such. Thanks a lot for working on this project and keep us updated! :) Lea Lacroix (WMDE) (talk) 10:01, 25 September 2019 (UTC)
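For anyone preparing a small import like this one, a rough sketch of building one record per vocabulary entry before review may help. The field names below follow my understanding of the lexeme JSON layout (type, lemmas, language, lexicalCategory) and should be checked against the Wikibase Lexeme documentation before use. The helper name is hypothetical, nothing here contacts Wikidata, and the language and category items (Q1860 for English, Q1084 for noun) are example inputs.

```python
import json


def build_lexeme_payload(lemma, lang_code, language_qid, category_qid):
    """Build a dict describing a new lexeme (assumed JSON layout, to be verified)."""
    return {
        "type": "lexeme",
        "lemmas": {lang_code: {"language": lang_code, "value": lemma}},
        "language": language_qid,          # item for the language, e.g. Q1860 (English)
        "lexicalCategory": category_qid,   # item for the category, e.g. Q1084 (noun)
    }


if __name__ == "__main__":
    payload = build_lexeme_payload("fresco", "en", "Q1860", "Q1084")
    # Serialise for manual review before any upload is attempted.
    print(json.dumps(payload, ensure_ascii=False, indent=2))
```

Writing the records out first makes it easy to have a handful reviewed, as suggested above, before any tool or bot submits them.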

Multiple Pronunciations per Form

There are often cases when the same written word form has different pronunciations. Each pronunciation has a number of properties itself, e.g. the sound file, IPA, region in which it is used, or the references to scholarly works about it. I would like to propose that we put pronunciation-related properties as qualifiers to a single property. The property itself would have the written form of the word, possibly repeated, with the linguistic stress marks applied to it (important for some languages like Russian).

Lexeme = <word>  e.g. "поперчивший"  (<he> peppered ...)
Forms = [
  {
    Form = <word-form>   e.g.  "поперчивший" (primary form is the same as lexeme)
    Statements = [
       Pronunciation Form = <word-form-pronounce1>    e.g. "попе́рчивший"  (one common usage, with the stress mark on the 2nd syllable)
          IPA = [pɐˈpʲert͡ɕɪfʂɨɪ̯]
          Sound = sound1.ogg
       Pronunciation Form = <word-form-pronounce2>    e.g. "поперчи́вший"  (another common usage, stress on the 3rd syllable instead)
          IPA = [pəpʲɪrˈt͡ɕifʂɨɪ̯]
          Sound = sound2.ogg
          Region = ...
          References = ...
    ]
  },
  ...
]

In some cases, even the stress will be the same, e.g. the word сессия (session) has two forms - [ˈsɛsʲ(ː)ɪɪ̯ə] and [ˈsʲesʲ(ː)ɪɪ̯ə] (see link for sound files). In this case we will simply have two identical values for the pronunciation forms but with a different set of qualifier values. If there are no objections, I would like to create a new property.
P.S. Another good example -- моветон -- 4 different pronunciations of the same word (with both IPA and sound files), and two of them reference a source. --Yurik (talk) 21:04, 27 August 2019 (UTC)

Property request created, please support. --Yurik (talk) 02:06, 28 August 2019 (UTC)
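The proposed layout can also be read as a small data model: one form, several pronunciation values, each carrying its own qualifiers. The sketch below is only an illustration of that shape; the class and field names are made up here and do not correspond to existing Wikidata properties.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Pronunciation:
    """One pronunciation value with its qualifiers (names are illustrative)."""
    stressed_form: str                 # written form with stress mark applied
    ipa: str
    sound: Optional[str] = None        # sound file name
    region: Optional[str] = None
    references: List[str] = field(default_factory=list)


@dataclass
class Form:
    representation: str
    pronunciations: List[Pronunciation] = field(default_factory=list)


# The two pronunciations from the example above.
form = Form(
    representation="поперчивший",
    pronunciations=[
        Pronunciation("попе́рчивший", "[pɐˈpʲert͡ɕɪfʂɨɪ̯]", sound="sound1.ogg"),
        Pronunciation("поперчи́вший", "[pəpʲɪrˈt͡ɕifʂɨɪ̯]", sound="sound2.ogg"),
    ],
)

if __name__ == "__main__":
    for p in form.pronunciations:
        print(p.stressed_form, p.ipa, p.sound)
```

For a word like сессия, the two entries would share the same `stressed_form` and differ only in the qualifier values.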

LexData: easy to use python library to edit Lexemes

Since I couldn't find any library for editing lexicographical data, I wrote my own. It's still "beta" but working properly and quite easy to use. You can find it here [2] and the documentation (incl. code example) here. Have fun! I already created a tool like "Wikidata games" to add senses to existing lexemes – I will publish that as soon as it's presentable. -- MichaelSchoenitzer (talk) 23:20, 27 August 2019 (UTC)

Thanks for posting MichaelSchoenitzer! I have been writing to MW directly using the minimalistic pywikiapi library as part of my Russian lexeme import project lexicator, but there could be cases where you want simpler, higher-level abstractions like lexeme/form/sense. I also replied in your GitHub (issues). --Yurik (talk) 00:08, 28 August 2019 (UTC)
Thanks a lot, that's amazing! I hope that will help people creating more tools for lexicographical data :) Lea Lacroix (WMDE) (talk) 08:47, 28 August 2019 (UTC)
@MichaelSchoenitzer: Gamification of adding senses is very welcome! There is some possibility in https://tools.wmflabs.org/hauki/browse/ru?sense=false but it is not ideal. --Infovarius (talk) 10:26, 28 August 2019 (UTC)

Incorrect datatype in the ttl file

While loading Wikidata on Stardog on AWS, I am getting an error that '-240000000-01-01T00:00:00Z' is not a valid value for datatype http://www.w3.org/2001/XMLSchema#dateTime. Can we fix this issue? I believe there are multiple such occurrences in the data, which would create an issue in any loading process.  – The preceding unsigned comment was added by Tushar1080 (talk • contribs).

@Tushar1080: How is 240 million years ago related to any human language? --Infovarius (talk) 10:27, 28 August 2019 (UTC)
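To see how many such values a dump contains before loading it, one could scan the Turtle file for dateTime literals with unusually long years. This is a minimal sketch: the regular expression only covers the common literal shapes, the four-digit threshold is an assumption about what a given loader accepts, and the sample triples are invented for illustration.

```python
import re

# Matches "value"^^xsd:dateTime or "value"^^<...XMLSchema#dateTime>.
DATETIME_LITERAL = re.compile(
    r'"(?P<value>-?\d+-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?Z?)"'
    r'\^\^(?:xsd:dateTime|<http://www\.w3\.org/2001/XMLSchema#dateTime>)'
)


def find_long_year_datetimes(lines, max_year_digits=4):
    """Return (line number, literal) pairs whose year has more digits than allowed."""
    hits = []
    for lineno, line in enumerate(lines, start=1):
        for match in DATETIME_LITERAL.finditer(line):
            value = match.group("value")
            year = value.lstrip("-").split("-", 1)[0]
            if len(year) > max_year_digits:
                hits.append((lineno, value))
    return hits


# Invented sample lines; only the first literal comes from the report above.
SAMPLE = [
    'wd:Q1 wdt:P580 "-240000000-01-01T00:00:00Z"^^xsd:dateTime .',
    'wd:Q2 wdt:P580 "2019-08-28T00:00:00Z"^^xsd:dateTime .',
]

if __name__ == "__main__":
    for lineno, value in find_long_year_datetimes(SAMPLE):
        print(lineno, value)
```

For a real dump, the lines would come from iterating over the opened file rather than from a list.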

Property to use for word list

I am not entirely sure when to use which property for a lexeme. A current problem is:

Here is my attempt at being consistent:

  • If the word appears in a dictionary or a linguistic book or a linguistic article where there is more than just a listing, e.g., where it is explained to belong to a specific class, I use described by source (P1343)
  • If the word appears in a written work, not explained, but used, and the work is copyrighted, then I use attested in (P5323)
  • If the word appears in a written work, not explained, but used, and the work is out of copyright I use it with usage example (P5831) together with stated in (P248) for the reference.

Finn Årup Nielsen (fnielsen) (talk) 16:31, 28 August 2019 (UTC)
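The three rules above amount to a small decision table, sketched here for clarity. The function name and the boolean inputs are made up for this illustration; only the property IDs come from the list above.

```python
def choose_properties(explained: bool, copyrighted: bool) -> dict:
    """Pick the property for linking a lexeme to a work, following the rules above.

    explained:   the work describes the word (more than a bare listing or usage)
    copyrighted: the work is still under copyright
    """
    if explained:
        return {"statement": "P1343"}                     # described by source
    if copyrighted:
        return {"statement": "P5323"}                     # attested in
    # Out-of-copyright usage: quote it, and cite the work in the reference.
    return {"statement": "P5831", "reference": "P248"}    # usage example + stated in


if __name__ == "__main__":
    print(choose_properties(explained=False, copyrighted=False))
```

Note that the first rule does not depend on copyright status, so `copyrighted` is ignored when `explained` is true.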

  • What is 120 Danish words (Q66809857)? Is it notable? --Infovarius (talk) 14:31, 29 August 2019 (UTC)
    • It is a word list with 120 Danish words that "floats around" among Danish language teachers. Several products are based on the list [3] [4] [5]. As far as I understand they are based on Hyppige Ord i Danske Børnebøger (Q66810132) (I am currently trying to find out whether there is any intellectual property associated with it). — Finn Årup Nielsen (fnielsen) (talk) 15:34, 29 August 2019 (UTC)
      • I am clearly not a lawyer, but concerning the copyright of such a list, I think you should have a look at database rights. Pamputt (talk) 16:27, 29 August 2019 (UTC)
        • @Pamputt: I don't see how this kind of automatically generated list can hold rights (neither by copyright, since these are the most used words and anyone can compute the same list, nor by database right, the latter lasting only 15 years (when was this list created?) and having specific requirements, including "substantial investment"). BTW, Wikipedias and Wiktionaries have already copied similar lists (the Swadesh lists, for example), considering that no right applies to them. VIGNERON (talk) 10:04, 30 August 2019 (UTC)
          • Indeed, my message was not clear. I wanted to say that if there is any copyright on such a list, it is probably protected by database rights. That said, this one is probably not (as for the Swadesh list on the Wiktionaries). Pamputt (talk) 16:45, 30 August 2019 (UTC)
        • @Fnielsen: catalog (P972) feels strange but so does attested in (P5323). Maybe a dumb idea but why not just use part of (P361)? Cheers, VIGNERON (talk) 10:04, 30 August 2019 (UTC)
          • P5323 seems appropriate. --- Jura 09:03, 16 September 2019 (UTC)

Why does Kadazandusun language (Q5317225) have a spelling variant option?

Kadazandusun language (Q5317225) is written in Latin script (Q8229) only, so there's no need for the "Spelling variant of the Lemma" option. Can anyone help? --Tofeiku (talk) 13:01, 8 September 2019 (UTC)

What's the problem? Just don't use it. --Infovarius (talk) 15:39, 11 September 2019 (UTC)
I mean, is there a way to make the option disappear? I don't see this option when I'm adding Malay or English lexemes. That's why I asked. If it's not possible then never mind. --Tofeiku (talk) 09:45, 12 September 2019 (UTC)
Also, I'm trying to add Brunei Bisaya (Q3450611) lemmas into Lexeme but it says "The supplied language code was not recognized." and I need to choose an option in "Spelling variant of the Lemma". So which option should I choose? --Tofeiku (talk) 06:44, 14 September 2019 (UTC)
Please give example Lexemes to see the situation. --Infovarius (talk) 22:32, 14 September 2019 (UTC)
I want to add the word "lampun" which is a Bisaya Brunei (bsb) noun which means "durian". --Tofeiku (talk) 12:01, 15 September 2019 (UTC)

Storing word components

Some (many?) languages use word composition -- combining simple components (prefixes, roots, interfixes, suffixes, and endings) to create new words. Russian is that way for sure, but I think it is also common in German and Finnish languages. English has some of that too - "prepend" -- "pre" implies "before", and the root of the word has the sense of addition/joining (?). We already have combines (P5238) and root (P5920) properties, plus the series ordinal (P1545) qualifier, implying two ways to store data:

with combines (P5238)

Not sure how to indicate the type of the lexeme part here, or if it's even needed. We may even have to store parts with dashes, e.g. -suffix, -interfix-, +ending, prefix-, .... The dash/plus will also immediately make it clear that the given lexeme is not a word, but rather a part of the word.

"prepend" (en)
combines (P5238)  ->  link to "pre- (prefix)" lexeme
  series ordinal (P1545) = 1
combines (P5238)  ->  link to "pend (root)" lexeme
  series ordinal (P1545) = 2

with root (P5920) + ...

"prepend" (en)
prefix (new prop) -> link to "pre- (prefix)" lexeme
  series ordinal (P1545) = 1
root (P5920) -> link to "pend (root)" lexeme
  series ordinal (P1545) = 2

Second approach visually disconnects different parts of the word across multiple properties, which is also not that great, but it allows the data user to tell word parts apart without looking at the part lexemes themselves... Which approach should we use? --Yurik (talk) 16:41, 10 September 2019 (UTC)

What about doing both? ArthurPSmith (talk) 17:14, 12 September 2019 (UTC)
  • I wouldn't worry too much about the way a solution looks in the current Wikidata GUI. BTW, there is also a qualifier to indicate which form (of a lexeme) is being combined.--- Jura 17:25, 12 September 2019 (UTC)
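To compare the two layouts (or to do both, as suggested), the same list of parts can be rendered either way. This is a sketch only: the claim dictionaries are an ad hoc representation rather than the Wikibase JSON format, and the prefix property does not exist yet, so a clearly hypothetical placeholder stands in for it.

```python
def with_combines(parts):
    """Approach 1: one 'combines' (P5238) claim per part, ordered by P1545."""
    return [
        {"property": "P5238", "value": lemma, "qualifiers": {"P1545": str(i)}}
        for i, (lemma, _kind) in enumerate(parts, start=1)
    ]


def with_typed_properties(parts, kind_to_property):
    """Approach 2: a dedicated property per kind of part, still ordered by P1545."""
    return [
        {"property": kind_to_property[kind], "value": lemma, "qualifiers": {"P1545": str(i)}}
        for i, (lemma, kind) in enumerate(parts, start=1)
    ]


PARTS = [("pre-", "prefix"), ("pend", "root")]
KIND_TO_PROPERTY = {
    "prefix": "P_PREFIX_HYPOTHETICAL",  # placeholder: no such property exists yet
    "root": "P5920",
}

if __name__ == "__main__":
    # "Doing both" is simply the concatenation of the two claim lists.
    for claim in with_combines(PARTS) + with_typed_properties(PARTS, KIND_TO_PROPERTY):
        print(claim)
```

With the first layout a data user has to open each part lexeme to learn its kind; with the second the kind is visible from the property alone, which is the trade-off described above.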

Storing "corresponds to" words

In the gender-aware languages, nouns often have feminine and masculine versions. How should they link to each other? For example, a lexeme "doctor (feminine)" should have a connection to the "doctor (masculine)", and the reverse. --Yurik (talk) 20:50, 11 September 2019 (UTC)

Reminder: your input needed about integration of Lexemes in Wiktionaries

Hello all,

As the question about integration of Lexemes on Wiktionaries is regularly raised, I wanted to remind you that in order to help us understand better what you need and therefore what technical solutions we could provide, you can leave a comment under this ticket. Feel free to explain why and how you would use Lexemes on Wiktionary, what kind of tool you would need (parser function, templates, Lua, etc.) and to provide a few examples of how Lexemes stored in Wikidata could be used.

More input from people who are working on Lexemes and/or Wiktionaries, or simply sharing enthusiasm for the feature, will help us move forward with this project :)

Thanks in advance, Lea Lacroix (WMDE) (talk) 10:19, 12 September 2019 (UTC)

Regarding the issue of Wiktionary/Wikidata: I now see a large number of Russian nouns being created on Wikidata, e.g., авиагарнизон (L83061). They seem to come from Russian Wiktionary by YurikBot. @Yurik: I am wondering whether the Russian Wiktionary or Wikidata has discussed this issue? — Finn Årup Nielsen (fnielsen) (talk) 10:42, 12 September 2019 (UTC)
@Fnielsen: This was mainly discussed on the Russian Wikimedia Community server in Discord between Wiktionary, Wikidata and Wikipedia participants. We asked several questions on this forum, and now we are discussing in Russian Wiktionary the possibility of relicensing senses under CC0. Iniquity (talk) 10:52, 12 September 2019 (UTC)
Also a few more discussion links: YurikBot bot flag discussion (discussing why it is needed), import announcement, and linking wiktionary pages to lexemes (every wiktionary page now has a link to corresponding lexeme), plus the above link about re-licencing word definitions (and other copyrightable content) under CC0. This is a long shot project, so for now I'm not importing any of them, just the non-copyrightable things like word forms, etc. --Yurik (talk) 15:00, 12 September 2019 (UTC)
Thanks for the update and for the links! Which one links to the discussion about the licence? I don't speak any Russian but I'd still try to follow what's happening :) I'd love to get an update from time to time about the reactions from the Russian Wiktionary community. Lea Lacroix (WMDE) (talk) 16:37, 12 September 2019 (UTC)
This one :) Iniquity (talk) 16:43, 12 September 2019 (UTC)
Thanks. I quickly looked at a translated version, and it doesn't look like a clear community consensus to me.
I'd like to state one more time that we should all be careful and import only data that is public domain or CC0 into Wikidata. This has an impact on future projects as well as on the image that people have of Wikidata, its community, and how seriously we treat free licenses.
I think it would be great to have more Russian-speaking people involved in this project: @Infovarius, DonRumata, Kareyac:. Lea Lacroix (WMDE) (talk) 10:18, 16 September 2019 (UTC)
I think we need more reasons for integrating Wiktionary with Wikidata. This process should be more mutually beneficial for both projects. Most important is to use Wikidata to collect general data which is the same in all Wiktionaries in any language. Don Rumata 12:21, 16 September 2019 (UTC)
I participated in preparing the import. I consider it quite complete and useful. Probably everything except senses and dependent (synonyms/hypernyms/...) information has been parsed and imported. --Infovarius (talk) 11:22, 18 September 2019 (UTC)
@Lea Lacroix (WMDE): as mentioned elsewhere, the biggest reaction was "what, you can't use getEntity('L123') from Lua?!?". Seriously :). There is already some discussion on how to use this data directly from Wikidata (obviously it will take some time for all things to fall into place, but having Lua access is obviously the biggest obstacle). One other option could be to have a dedicated bot update wiktionary from wikidata, but that is obviously not as convenient, and will require a lot more work. --Yurik (talk) 02:14, 15 September 2019 (UTC)
  • Now that we have more experience in working with multiple Wikibases (such as the one on Commons), maybe it's time to review the optimal place in the WMF universe to place lexemes. An alternative could be to have it directly at lexeme.wiktionary.org .. this would also solve the main obstacle to adoption on some of the Wiktionaries. --- Jura 09:01, 16 September 2019 (UTC)
Apart from the technical difficulties to access a non-Wikidata Wikibase from sister projects and other arguments that have been widely discussed over the past years, changing the whole system now would require a huge amount of resources that we prefer involving in moving the project forward, for example by experimenting arbitrary access on some Wiktionaries. Lea Lacroix (WMDE) (talk) 11:12, 16 September 2019 (UTC)
It seems that many of the supporters of the current solution never really contributed to lexemes on Wikidata.
Personally, I had mostly ignored Wiktionary once it had been started, as I found it too complicated to prepare in Wikitext. Returning there years later, I notice that they went quite far with templates and Lua and, e.g. the French Wiktionary, went quite far in terms of coverage. Given their achievement, we should try to support that and avoid anything that could be considered cannibalizing them.
Once structured data on Commons is running, I think the additional work of setting up another Wikibase instance should be marginal. Besides, it seems to be a goal of WMF to be able to set up such instances easily. One might just need to wait a few months. For Wikidata, the impact should be negligible.
With the grant spent on converting Wiktionary senses to triples, maybe more content is already available for such a Wikibase instance at Wiktionary than currently here in the lexeme namespace at Wikidata. We should make sure the resources spent are put to good use.
Obviously, a request for such a Wikibase instance would need to come from the Wiktionary community. As long as the proposals are of the type "I deleted some lexemes and now it is smaller than Wiktionary", you probably won't have to worry about accommodating community input for your IT plans to support the community. --- Jura 21:46, 19 September 2019 (UTC)

common noun (Q498187) vs. common noun (Q2428747)

There is an ontological problem of possibly major significance for the lexicographical data on Wikidata. In Russian Wikipedia, there are two articles for common noun/appellative and two Wikidata items: common noun (Q498187) vs. common noun (Q2428747). Other Wikipedia language editions link to both of them. Machine translation of the two Russian Wikipedia articles does not clarify to me what could or should be done. I see Wikidata lexeme entities linking to both items. That is probably not what we want. @ DonRumata, Yurik, Infovarius: I am wondering if any Russian-speaking people could clarify? Should we move all Wikipedia language links (except the Russian) to common noun (Q2428747) as well as all lexicographic annotation, and leave common noun (Q498187) stranded with a single (Russian) Wikipedia link? — Finn Årup Nielsen (fnielsen) (talk) 12:45, 16 September 2019 (UTC)

They are synonyms, but wikt:ru:апеллятив (appellative) is a linguistic term, commonly used as an antonym for wikt:ru:оним (onym). Don Rumata 13:16, 16 September 2019 (UTC)
The word wikt:ru:апеллятив is used only in scientific literature and most Russians do not know it, but wikt:ru:имя нарицательное is well known, because it is taught in schools. Don Rumata 13:34, 16 September 2019 (UTC)
In Danish, we have "appelativ" (appellative) and "fællesnavn" (common name), but I would say they are the same concept, and subsequently belong on the same item. It is not clear which item we should use for labeling appellatives in Wikidata. Do you have an opinion? If the only difference between the two Russian words is who is using the term, then should the Russian Wikipedia articles be merged? — Finn Årup Nielsen (fnielsen) (talk) 16:30, 16 September 2019 (UTC)
Definitely, we can merge these articles. I suggest using a more recognizable common noun (Q2428747). Don Rumata 17:15, 16 September 2019 (UTC)
I do not know whether there is a subtle difference, so that Russian Wikipedians would like to maintain both. I suppose that if you merge the Russian articles and we subsequently merge the Wikidata items, then the merge should (as usual) go to the one with the lower identifier number, i.e., common noun (Q498187). The Russian label can then be changed to what is most appropriate. — Finn Årup Nielsen (fnielsen) (talk) 09:16, 17 September 2019 (UTC)

Thoughts about sense future and use cases (on Wikidata mailing list)

You can read and participate in this thread, in English. The question is the future of Wikidata lexicographical functionalities like senses, and how researchers and the community could collaborate to build tools that help us use and enrich the data. author  TomT0m / talk page 14:38, 20 September 2019 (UTC)

I ran a query for Wikidata items with one-word lower-case labels (in English) that had P279 statements (i.e. likely to be generic terms, or in other words common nouns), and have been gradually adding them manually as lexemes with the sense linked to the item(s). This could probably be automated to a considerable degree - the English gloss on these can be taken directly from the Wikidata description (if any), since that's also CC-0. This obviously isn't special to English either. ArthurPSmith (talk) 17:26, 20 September 2019 (UTC)

Spelling variant of the Lemma option

I created sebang (L184375), which is a West Coast Bajau (Q2880037) word. But when I want to create a lexeme in that language, a "Spelling variant of the Lemma" option shows up and I need to choose one. This language is written in Roman/Latin script, so the only option that I could choose there is "mis". Can anyone help? --Tofeiku (talk) 04:37, 22 September 2019 (UTC)

You should probably ask the Language Committee (@Amire80:) to add "bdr" as a Wikidata language code. --Infovarius (talk) 16:57, 24 September 2019 (UTC)
I'll add it to Universal Language Selector. It also has to be added to Wikibase. Add a subtask under https://phabricator.wikimedia.org/T144272 . --Amir E. Aharoni (talk) 12:03, 25 September 2019 (UTC)

MachtSinn: new tool to quickly add Senses to Lexemes

We have a huge number of lexemes that lack senses; often we also have items describing the concept that a sense of a lexeme describes. I therefore wrote a tool to match those and suggest missing senses of lexemes: MachtSinn. You can log in with your Wikidata account and quickly endorse or revoke potential matches. If you endorse a sense, it is automatically added to the lexeme and linked to the corresponding item with your account. It works with every language. Since I'm not good at design and CSS, the design of the site is a bit minimal – help is welcome. The code can be found on GitHub. -- MichaelSchoenitzer (talk) 20:43, 22 September 2019 (UTC)

MichaelSchoenitzer, this is awesome! Could you add some common shortcuts for each button please, e.g. "s", "r", and "n" to quickly perform the command without the mouse? And also add that as a tooltip for each of the buttons for easy discoverability? Also, please show the lexical category (noun/adjective/...) next to the word, and possibly some well-known top-level claims (i.e. grammatical gender and "has quality" values), and the list of forms?
And one other thing: some lexemes are duplicated on purpose despite having an identical word; they correspond to different meanings, might have different origins, and different forms. You might want to warn users when the current word has more than one lexeme. For example, L99999 and L100000 are both "мир", one in the meaning of peace (so no plural forms), and another in the meaning of the world (could have a plural, i.e. worlds), and we wouldn't want to attach the wrong sense. Thank you for an awesome tool! --Yurik (talk) 01:26, 23 September 2019 (UTC)
@MichaelSchoenitzer: Wow, that's addictive! I have noticed a few issues that could maybe be improved (let me know if I should add this at github): (1) It seems to repeat some matches after I had hit "next" on them (after I did many others in between). (2) It doesn't seem to check that the match is already there? Maybe this is due to a WDQS delay? For example I matched Q983927 to L24318 (by hand, after your system had suggested Q58795659) but then less than an hour later I was given that specific suggestion. (3) There seem to be a lot of suggestions from "heraldic figures" or elements of some genome, rather than what I would think would be more common links. Maybe suggestions should be prioritized by number of sitelinks or some other measure of popularity of the item? ArthurPSmith (talk) 18:58, 23 September 2019 (UTC)
3) Ideal is to give the complete list of items with specific label and mark some of them that should be added to specific lexeme. But it is a dream :) --Infovarius (talk) 17:00, 24 September 2019 (UTC)
@Yurik, ArthurPSmith, Infovarius: Answering the questions: The tool at the moment only contains nouns (that already gave enough results for now). (2) 'Next' just gets a new random potential match, so yes, if the pool of matches in your language is small it might repeat soonish. "Reject" marks them as false positives so that they won't be shown again (to anyone). (3) The tool uses matches that are saved in a local database (WDQS wouldn't be fast enough), so yes, it might show a match that someone already added by hand – if a match is saved with the tool, it is however removed from the pool and should never be shown again. In any case the tool checks if the match is already there before saving, so it should never add duplicates.
I'm currently out of time that I can invest in the tool, but feel free to make pull requests, I'll merge them and update the tool. Especially the hotkeys sound like an awesome little improvement. -- MichaelSchoenitzer (talk) 20:41, 24 September 2019 (UTC)

This is brilliant! Thank you so much. And it is fun too! It looks like this tool helped increase the number of senses within a few hours by several percent, this is pretty awesome! --Denny (talk) 03:44, 25 September 2019 (UTC)

Warning! There's a problem with homonymous lexemes! The tool doesn't distinguish between them and tries to add each sense to each of them :( --Infovarius (talk) 20:21, 26 September 2019 (UTC)

@Infovarius, Yurik: I blacklisted all homonyms (as well as duplicates). -- MichaelSchoenitzer (talk) 17:23, 28 September 2019 (UTC)

Lexemes and Wikimedia translation

I think it would be a good idea to make an effort to integrate Wikimedia translation tools with Wikidata lexeme data. Both sides have something to gain:

  • translators could benefit from the linguistic data and lexeme definitions, for example through an "item for this sense" search that suggests completions for translations.
  • lexeme data could benefit from translators' contributions when they identify a missing sense for a lexeme.

A feature that could help with this in MediaWiki would be tagging wikitext/articles with lexemes and senses. What do you think?

author  TomT0m / talk page 07:05, 23 September 2019 (UTC)

Searching for Lexeme:thanks or Lexeme:Danke leads to error message

I filed a bug for it, but I found it curious enough to state it here: when you search for Lexeme:danke or Lexeme:thanks the system returns an error. --Denny (talk) 22:24, 24 September 2019 (UTC)

ASJP import? No

Today I learned about en:Automated Similarity Judgment Program, and that their dataset is CC0. The project tries to collect the words for a short, central list of 40 concepts in all the world's languages.

Shall we set up a Wikiproject to import it to Wikidata? Reach out to the ASJP folks, let them know?

I like this idea because it would help with setting up an initial set of lexemes for many languages, and thus we would have coverage no matter what we test. Who's interested? --Denny (talk) 16:55, 4 October 2019 (UTC)

@Denny: I'm not sure this data is very well suited to our lexeme approach. At the least, it would take considerable work to map it in, I think. Issues I have: (A) the "words" are recorded in what seems to be an idiosyncratic romanization, not in the native writing system (though I expect that many of these languages do not have a standard written form so that might work for them). See for example the words for "person" here, or the English page. (B) They appear to only have a single form for each word (which we could use, but that limits things). (C) They have 7655 "word lists" (I think that means languages?) which is way more than we support here - so we'd need to either chop the list down to what we know about, or figure out how to support MANY more languages here! All that said, it does seem like something very related to what we're trying to do, so I think reaching out to them would be a great first step! ArthurPSmith (talk) 18:08, 4 October 2019 (UTC)
Denny, where did you see that the ASJP data are CC0? The main page says that the ASJP Database is licensed under a CC BY 4.0 licence. Pamputt (talk) 20:07, 4 October 2019 (UTC)
@Pamputt:, ah, darn, I misread the logo. My mistake, you are right. Sorry. This topic can be archived. :( --Denny (talk) 22:43, 4 October 2019 (UTC)
License matters only for the case of definitions. But spelling and translations are pure facts which are not copyrightable at all. So we can use them, I suppose (adjusting wrong scripts, of course). --Infovarius (talk) 19:33, 7 October 2019 (UTC)
I am not a lawyer so I do not know. So, may I ask a naive question? If the word lists are not protected by CC BY 4.0, which content of the website is covered by this licence? Do you think this is copyfraud? At the least, I think we should contact them to get their opinion. Pamputt (talk) 13:38, 8 October 2019 (UTC)
"7655 languages"? That's cool, we should support them all! Yes, Amir? ;-) --Infovarius (talk) 19:33, 7 October 2019 (UTC)
User:Infovarius, I think that it's the first time that I see a website that has content in more languages than jesusfilm.org! Even if it's very little content in each, it's still impressive.
And yes, we should support all of these languages eventually. It's knowledge, and we want all human knowledge, and this means all languages, even the extinct ones.
It's not exactly a dictionary, but a project with a particular purpose in which I'm personally less interested, but it could also be used as a dictionary.
They indeed use a somewhat unusual romanization, but from a quick look it appears to be consistent, so it should be usable. If not for Wikibase Lexemes, then maybe for Wiktionary. --Amir E. Aharoni (talk) 10:49, 8 October 2019 (UTC)

Milestone - 200k lexemes

My bot just created spiritualistically (L200000), the 200000th lexeme, while importing Wiktionary adverbs! It means "in a way relating to being spiritual".  – The preceding unsigned comment was added by SixTwoEight (talk • contribs).

Linking senses between languages

One of the uses I think lexicographical data will have in the near future is improving automatic translation systems. But I can't find how to link senses between languages. Is there any way? -Theklan (talk) 18:05, 16 October 2019 (UTC)

@Theklan: Right now there is translation (P5972) for direct translation links and item for this sense (P5137) for indirect links. See the "translations" section of Wikidata:Lexicographical data/Statistics (which I don't think has been updated for a long time). ArthurPSmith (talk) 18:12, 16 October 2019 (UTC)
@ArthurPSmith: Thanks! It seems that it is not used very much yet. -Theklan (talk) 18:15, 16 October 2019 (UTC)

As the linking should be symmetric... is there any tool to do this? -Theklan (talk) 18:17, 16 October 2019 (UTC)

The indirect linking via item for this sense (P5137) is automatically symmetric (but you have to add the property on all the relevant language/lexemes that it applies to). So far the number of senses is far less than the number of lexemes (about 30,000 vs 200,000) so that should probably be addressed first! ArthurPSmith (talk) 18:20, 16 October 2019 (UTC)
I should add that MatchSinn is a great tool for working on that! ArthurPSmith (talk) 18:22, 16 October 2019 (UTC)
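A minimal sketch of why the indirect linking via item for this sense (P5137) comes out symmetric: senses that point to the same item are grouped, and each sense's translations are simply the other members of its group. The function name, the sense labels, and the sample data are illustrative assumptions, not real Wikidata content; the sketch also assumes one item per sense, although a sense may carry several.

```python
from collections import defaultdict


def translations_via_items(sense_items):
    """Map each sense to the other senses sharing its P5137 item.

    sense_items: dict of sense label -> item ID (hypothetical sample data).
    """
    by_item = defaultdict(set)
    for sense, item in sense_items.items():
        by_item[item].add(sense)
    # Every sense gets its group minus itself, so the relation is symmetric.
    return {sense: by_item[item] - {sense} for sense, item in sense_items.items()}


# Illustrative data only: three senses assumed to point to the same item.
sample = {
    "water (en)": "Q283",
    "ur (eu)": "Q283",
    "eau (fr)": "Q283",
    "chien (fr)": "Q144",
}
links = translations_via_items(sample)
print(sorted(links["ur (eu)"]))
```

A sense with no P5137 statement never enters any group, which is why the property has to be added on every relevant lexeme for the links to appear.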

<old man still yells at cloud>Do we really need the property translation (P5972) when you can easily and obviously have the same results with item for this sense (P5137)? If we add all the possible values in translation (P5972), I fear the lexeme won't be usable anymore.</old man still yells at cloud> Cheers, VIGNERON (talk) 19:46, 16 October 2019 (UTC)

@VIGNERON: Well, the lexeme for water in Basque has three meanings. I think we could find an item for this sense (P5137) for them. But most words don't have an item for this sense (P5137) for the sense, I guess. -Theklan (talk) 20:10, 16 October 2019 (UTC)
It does not hurt to create the item … author  TomT0m / talk page 20:18, 16 October 2019 (UTC)
Looks like sooner or later, good idea or not, such things have to happen, unfortunately. But I agree with you. Another cool thing about item for this sense (P5137) is that it lets us use the usual properties like subclass of (P279) to find hypernyms. For example, given a sense of an English word, we can find a « close » match in French even if it is not an exact match:
< lapdog(en) > item for this sense (P5137) < lapdog (Q38499) >
and < chien(fr) > item for this sense (P5137) < dog (Q144) >: we automatically know that chien(fr) is a hypernym of lapdog(en) … It turns out this is an actual example, with the connections already set up on Wikidata (checked afterwards) :) author  TomT0m / talk page 20:18, 16 October 2019 (UTC)
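The lapdog example can be sketched as a small graph walk: follow subclass of (P279) upwards from the item of one sense, then collect the senses whose item is among the ancestors. This is a sketch on toy in-memory data, not a query against Wikidata. The function names are hypothetical, and the P279 edge from Q38499 to Q144 is written out here as an assumption for the demonstration.

```python
def ancestors(item, subclass_of):
    """Return all items reachable from `item` via subclass of (P279) edges."""
    seen = set()
    stack = [item]
    while stack:
        current = stack.pop()
        for parent in subclass_of.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def hypernym_senses(sense, sense_items, subclass_of):
    """Senses whose P5137 item is a (transitive) superclass of this sense's item."""
    targets = ancestors(sense_items[sense], subclass_of)
    return {s for s, item in sense_items.items() if item in targets}


# Toy data mirroring the example in the comment above.
sense_items = {"lapdog (en)": "Q38499", "chien (fr)": "Q144"}
subclass_of = {"Q38499": ["Q144"]}  # assumed edge: lapdog is a subclass of dog

print(hypernym_senses("lapdog (en)", sense_items, subclass_of))
```

The walk is transitive, so a sense linked to a more distant superclass would also be reported as a hypernym. The `seen` set keeps it from looping if the P279 data contains a cycle.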
What would be the item for this sense (P5137) for the words what, would, be, the and for? Lexemes and items may be connected, but not always. -Theklan (talk) 20:30, 16 October 2019 (UTC)
@Theklan: what's the problem? You can use What? (Q20656446), will (Q364340), definite article (Q2865743), cause (Q2574811) (and many more, and you can always create some if needed). I don't see why we shouldn't always connect every lexeme to an item. Meanwhile, in the end, we will probably have more than 10 000 lexemes for "water"; do they really all need to link to all the others? (Knowing that when an item has 5000 statements, it usually already breaks things.) Cheers, VIGNERON (talk) 10:03, 17 October 2019 (UTC)

Handling of trademark (Q167270)

How can we handle trademark (Q167270)? I have examples for Danish lexemes here: københavnerstang (L204254) and LEGO (L57918), where instance of (P31) is used. It is unclear to me whether a trademark (Q167270) is a lexeme. A trademark (Q167270) is typically also associated with a certain graphical representation, e.g., "LEGO" and not "Lego" or "lego". A trademark (Q167270) also seems to be mostly associated with a sense, e.g., "KØBENHAVNERSTANG"/københavnerstang (L204254) is associated with non-alcoholic drinks [6]. — Finn Årup Nielsen (fnielsen) (talk) 11:03, 17 October 2019 (UTC)
