Wikidata talk:Lexicographical data/Archive/2019/09

This page is an archive. Please do not modify it. Use the current page, even to continue an old discussion.

Lexemes to delete


Hey there,

While checking some numbers on Ordia I came across a few Lexemes that are probably mistakes (people who tried to create an item, or at least entered as language something that is not a language):

Can I let someone check them and delete if needed? :)

I'm wondering what kind of query could help spot these mistakes. Lea Lacroix (WMDE) (talk) 15:56, 15 July 2019 (UTC)

Many queries can help; some are already on Wikidata:Lexicographical data/Ideas of queries. Most of the time, filtering on the count helps to find outliers (which may be false positives, but most often are not). For instance, here is the list of lexical categories used only once: https://w.wiki/6Xs. Most need correction and some need deletion (I already put invalid ID (L46090) and invalid ID (L43552) in WD:RFD; I am not sure what to do with กาเบรียล (L43417): should transcriptions of proper names be kept or not?).
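
The count-based filtering described above can be sketched in Python. This is a minimal illustration, not the query behind the short link: the SPARQL text is an assumption about how such a query might be written (using the `ontolex:LexicalEntry` class and the `wikibase:lexicalCategory` predicate), `singleton_categories` is a hypothetical helper name, and the sample identifiers are made up.

```python
from collections import Counter

# Assumed query text: counts lexemes per lexical category and keeps the
# categories used exactly once. It is NOT copied from the linked short URL.
SINGLETON_CATEGORY_QUERY = """
SELECT ?category (COUNT(?lexeme) AS ?count) WHERE {
  ?lexeme a ontolex:LexicalEntry ;
          wikibase:lexicalCategory ?category .
}
GROUP BY ?category
HAVING (COUNT(?lexeme) = 1)
"""


def singleton_categories(pairs):
    """Return the categories that occur exactly once in (lexeme, category) pairs."""
    counts = Counter(category for _, category in pairs)
    return sorted(cat for cat, n in counts.items() if n == 1)


# Illustrative data only; lexeme and category identifiers are invented.
sample = [
    ("L1", "Q1084"),
    ("L2", "Q1084"),
    ("L3", "Q24905"),
    ("L4", "Q999999"),
    ("L5", "Q24905"),
]
print(singleton_categories(sample))
```

A category that appears only once is a candidate for review, not proof of an error, so the result is best treated as a worklist for manual checking.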

There is also invalid ID (L58089). I have made an RfD [1]. — Finn Årup Nielsen (fnielsen) (talk) 08:11, 16 September 2019 (UTC)

There are various other lexemes where it is not clear whether the lexeme is totally wrong or just a wrong beginning: invalid ID (L58719), invalid ID (L57857), invalid ID (L61908), AMC (L157090). — Finn Årup Nielsen (fnielsen) (talk) 08:48, 16 September 2019 (UTC)

appellative (Q498187) vs. common noun (Q2428747)


There is an ontological problem of possibly major significance for the lexicographical data on Wikidata. In Russian Wikipedia, there are two articles for common noun/appellative and two Wikidata items: appellative (Q498187) vs. common noun (Q2428747). Other Wikipedia language editions link to both of them. Machine translation of the two Russian Wikipedia articles does not clarify to me what could or should be done. I see Wikidata lexeme entities linking to both items. That is probably not what we want. @DonRumata, Yurik, Infovarius: I am wondering if any Russian-speaking people could clarify? Should we move all Wikipedia language links (except the Russian) to common noun (Q2428747), as well as all lexicographic annotation, and leave appellative (Q498187) stranded with a single (Russian) Wikipedia link? — Finn Årup Nielsen (fnielsen) (talk) 12:45, 16 September 2019 (UTC)

They are synonyms, but wikt:ru:апеллятив (appellative) is a linguistic term, commonly used as an antonym for wikt:ru:оним (onym). Don Rumata 13:16, 16 September 2019 (UTC)

The word wikt:ru:апеллятив is used only in scientific literature and most Russians do not know it, but wikt:ru:имя нарицательное is well known, because it is taught in schools. Don Rumata 13:34, 16 September 2019 (UTC)

In Danish, we have "appelativ" (appellative) and "fællesnavn" (common name), but I would say they are the same concept, and consequently belong on the same item. It is not clear which item we should use for labeling appellatives in Wikidata. Do you have an opinion? If the only difference between the two Russian words is who is using the term, then should the Russian Wikipedia articles be merged? — Finn Årup Nielsen (fnielsen) (talk) 16:30, 16 September 2019 (UTC)

Definitely, we can merge these articles. I suggest using the more recognizable common noun (Q2428747). Don Rumata 17:15, 16 September 2019 (UTC)

I do not know whether there is a subtle difference, so that Russian Wikipedians would like to maintain both. I suppose that if you merge the Russian articles and we subsequently merge the Wikidata items, then the merge should (as usual) go to the one with the lower identifier number, i.e., appellative (Q498187). The Russian label can then be changed to whatever is most appropriate. — Finn Årup Nielsen (fnielsen) (talk) 09:16, 17 September 2019 (UTC)

Reminder: your input needed about integration of Lexemes in Wiktionaries


Hello all,

As the question about integration of Lexemes on Wiktionaries is regularly raised, I wanted to remind you that you can leave a comment under this ticket, to help us better understand what you need and therefore what technical solutions we could provide. Feel free to explain why and how you would use Lexemes on Wiktionary, what kind of tool you would need (parser function, templates, Lua, etc.) and to provide a few examples of how Lexemes stored in Wikidata could be used.

More input from people who are working on Lexemes and/or Wiktionaries, or simply sharing enthusiasm for the feature, will help us move forward with this project :)

Thanks in advance, Lea Lacroix (WMDE) (talk) 10:19, 12 September 2019 (UTC)

Regarding the issue of Wiktionary/Wikidata: I now see a large number of Russian nouns being created on Wikidata, e.g., авиагарнизон (L83061). They seem to come from Russian Wiktionary by YurikBot. @Yurik: I am wondering whether the Russian Wiktionary or Wikidata has discussed this issue? — Finn Årup Nielsen (fnielsen) (talk) 10:42, 12 September 2019 (UTC)

@Fnielsen: This was mainly discussed on the Russian Wikimedia Community server in Discord between Wiktionary, Wikidata and Wikipedia participants. We asked several questions on this forum, and now we are discussing in Russian Wiktionary the possibility of relicensing senses under CC0. Iniquity (talk) 10:52, 12 September 2019 (UTC)

Also a few more discussion links: YurikBot bot flag discussion (discussing why it is needed), import announcement, and linking Wiktionary pages to lexemes (every Wiktionary page now has a link to the corresponding lexeme), plus the above link about re-licencing word definitions (and other copyrightable content) under CC0. This is a long-shot project, so for now I'm not importing any of them, just the non-copyrightable things like word forms, etc. --Yurik (talk) 15:00, 12 September 2019 (UTC)

Thanks for the update and for the links! Which one links to the discussion about the licence? I don't speak any Russian but I'd still try to follow what's happening :) I'd love to get an update from time to time about the reactions from the Russian Wiktionary community. Lea Lacroix (WMDE) (talk) 16:37, 12 September 2019 (UTC)

This one :) Iniquity (talk) 16:43, 12 September 2019 (UTC)

Thanks. I quickly looked at a translated version, and it doesn't look like a clear community consensus to me.

I'd like to state one more time that we should all be careful and import only data that is public domain or CC0 into Wikidata. This has an impact on future projects as well as on the image that people have of Wikidata, its community, and how seriously we treat free licenses.

I think it would be great to have more Russian-speaking people involved in this project: @Infovarius, DonRumata, Kareyac:. Lea Lacroix (WMDE) (talk) 10:18, 16 September 2019 (UTC)

I think we need more reasons for integrating Wiktionary with Wikidata. This process should be mutually beneficial for both projects. Most important is to use Wikidata to collect general data that is the same in all Wiktionaries, whatever the language. Don Rumata 12:21, 16 September 2019 (UTC)

I participated in preparing the import. I consider it quite complete and useful. Probably everything except senses and dependent information (synonyms, hypernyms, ...) has been parsed and imported. --Infovarius (talk) 11:22, 18 September 2019 (UTC)

@Lea Lacroix (WMDE): as mentioned elsewhere, the biggest reaction was "what, you can't use getEntity('L123') from Lua?!?". Seriously :). There is already some discussion on how to use this data directly from Wikidata (it will take some time for all things to fall into place, but the lack of Lua access is clearly the biggest obstacle). One other option could be to have a dedicated bot update Wiktionary from Wikidata, but that is not as convenient and will require a lot more work. --Yurik (talk) 02:14, 15 September 2019 (UTC)

Now that we have more experience in working with multiple Wikibases (such as the one on Commons), maybe it's time to review the optimal place in the WMF universe to place lexemes. An alternative could be to have it directly at lexeme.wiktionary.org .. this would also solve the main obstacle to adoption on some of the Wiktionaries. --- Jura 09:01, 16 September 2019 (UTC)

Apart from the technical difficulties of accessing a non-Wikidata Wikibase from sister projects and other arguments that have been widely discussed over the past years, changing the whole system now would require a huge amount of resources that we prefer to invest in moving the project forward, for example by experimenting with arbitrary access on some Wiktionaries. Lea Lacroix (WMDE) (talk) 11:12, 16 September 2019 (UTC)

It seems that many of the supporters of the current solution never really contributed to lexemes on Wikidata.

Personally, I had mostly ignored Wiktionary once it had been started, as I found it too complicated to prepare in Wikitext. Returning there years later, I notice that they went quite far with templates and Lua and that some editions, e.g. the French Wiktionary, went quite far in terms of coverage. Given their achievement, we should try to support that and avoid anything that could be considered cannibalizing them.

Once structured data on Commons is running, I think the additional work of setting up another Wikibase instance should be marginal. Besides, it seems to be a goal of WMF to be able to set up such instances easily. One might just need to wait a few months. For Wikidata, the impact should be negligible.

With the grant spent on converting Wiktionary senses to triples, maybe more content is already available for such a Wikibase instance at Wiktionary than currently here in the lexeme namespace at Wikidata. We should make sure the resources spent are put to good use.

Obviously, a request for such a Wikibase instance would need to come from the Wiktionary community. As long as the proposals are of the type "I deleted some lexemes and now it is smaller than Wiktionary", you probably won't have to worry about accommodating community input for your IT plans to support the community. --- Jura 21:46, 19 September 2019 (UTC)

Thoughts about sense future and usecases (on Wikidata mailing list)


You can read and participate in this thread, in English. The question is the future of Wikidata lexicographical functionalities like senses, and how researchers and the community could collaborate to build tools that help us use and enrich the data. author TomT0m / talk page 14:38, 20 September 2019 (UTC)

I ran a query for Wikidata items with one-word lower-case labels (in English) that had P279 statements (i.e. likely to be generic terms, or in other words common nouns), and have been gradually adding them manually as lexemes with the sense linked to the item(s). This could probably be automated to a considerable degree - the English gloss on these can be taken directly from the Wikidata description (if any), since that's also CC-0. This obviously isn't special to English either. ArthurPSmith (talk) 17:26, 20 September 2019 (UTC)
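
The label filter described in the comment above could be automated roughly as follows. This is a sketch under stated assumptions: `is_candidate_label` and `candidate_lexemes` are hypothetical names, the input is assumed to be items already fetched with their English label, optional description, and a flag for whether they have a P279 statement, and the sample identifiers are made up.

```python
def is_candidate_label(label):
    """True for a single-word, all-lower-case label such as 'river'.

    Labels with spaces, hyphens, digits, or capital letters are rejected.
    """
    return label.isalpha() and label.islower()


def candidate_lexemes(items):
    """Yield (item_id, lemma, gloss) for items that look like common nouns.

    Each item is a dict with the hypothetical keys 'id', 'label',
    'description' (may be missing) and 'has_p279'.
    """
    for item in items:
        if item.get("has_p279") and is_candidate_label(item.get("label", "")):
            # The description is reused as the gloss, as suggested above.
            yield item["id"], item["label"], item.get("description", "")


# Illustrative data only; the item identifiers are invented.
items = [
    {"id": "Q1", "label": "river", "description": "natural watercourse", "has_p279": True},
    {"id": "Q2", "label": "Paris", "has_p279": False},
    {"id": "Q3", "label": "ice cream", "has_p279": True},
]
for row in candidate_lexemes(items):
    print(row)
```

The output is only a list of candidates; each one would still need a manual check before a lexeme is created and its sense linked to the item.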

Lexemes and Wikimedia translation


I think it would be a good idea to make efforts to integrate Wikimedia translation tools with Wikidata lexeme data. Both sides have something to gain:

Translators could benefit from the linguistic data and lexeme definitions, for example by searching the item for a given sense to suggest completions for translations.
Lexeme data could benefit from translators' contributions if they identify a missing sense for a lexeme.

A possibly useful feature that could help do that in MediaWiki would be to tag wikitext/articles with lexemes and senses. What do you think?

author TomT0m / talk page 07:05, 23 September 2019 (UTC)

Searching for Lexeme:thanks or Lexeme:Danke leads to error message


Tracked in Phabricator
Task T233763

I filed a bug for it, but I found it curious enough to state it here: when you search for Lexeme:danke or Lexeme:thanks the system returns an error. --Denny (talk) 22:24, 24 September 2019 (UTC)

Batch import


Hello, for a project I am working on we have developed a series of curated vocabularies which include entries in various languages (mostly en and de but also fr and it) and in their various forms (some RDF here https://github.com/swiss-art-research-net/vocab ). I would like to add in batch the ones not present on Wikidata, but I cannot find a way to do it with QuickStatements or OpenRefine. Do you know how I can add them? – The preceding unsigned comment was added by Wpbloyd (talk • contribs) at 14:40, August 26, 2019 (UTC).

@Wpbloyd: There's a new "batch" mode in the Wikidata Lexeme forms tool - for example here for English nouns. If you're going to be importing on the order of thousands of entries you should probably get "bot" approval first - i.e. do a handful the way you plan to handle these and have people review them on the Wikidata:Requests for permissions/Bot page. ArthurPSmith (talk) 17:23, 26 August 2019 (UTC)
@Wpbloyd: What's the license of your data (and, specifically, the whole collection)? Because it should be CC0 in order to be imported to Wikidata Lexemes. --Infovarius (talk) 09:11, 27 August 2019 (UTC)

@ArthurPSmith: Thanks for the link! There are not that many entries, only around 100 to 150, so this should work! @Infovarius: thanks, that is true. I will add the CC0 license to GitHub too! (Wpbloyd (talk) 13:10, 27 August 2019 (UTC))

@Wpbloyd: I suppose there is also the issue of how to indicate that the word is listed in SARI. — Finn Årup Nielsen (fnielsen) (talk) 08:55, 16 September 2019 (UTC)

@Fnielsen: From our side, that would not really be necessary in this case. However, it would probably be important to discuss it further in case other projects want to deliver lexemes to Wikidata. It could be seen as an incentive.

@Wpbloyd: See also: the code of a bot developed to import data from an external source, and a Python library to create your own bot. As already mentioned, the data imported into Wikidata must be CC0 and clearly identified as such. Thanks a lot for working on this project and keep us updated! :) Lea Lacroix (WMDE) (talk) 10:01, 25 September 2019 (UTC)