Wikidata talk:Lexicographical data/Archive/2018/05

This page is an archive. Please do not modify it. Use the current page, even to continue an old discussion.

Moving existing properties for linking Q to L items

Hi,

AFAIK and IIRC, there has been no general discussion about moving existing properties for linking Q to L items (feel free to give links if I missed them). There are some properties like demonym (P1549) or female form of label (P2521) that right now have the monolingual text datatype (and maybe some with String, like name in kana (P1814)) where a lexeme datatype might be more appropriate (it could solve grammatical problems such as needing qualifiers for masculine/feminine or singular/plural).

What do you think? If the idea seems good, could we start a list of concerned properties.

Cdlt, VIGNERON (talk) 13:19, 13 April 2018 (UTC)

Wouldn't it be preferable to see if things work out rather than break two things at a time? Maybe next year.
--- Jura 13:25, 13 April 2018 (UTC)
@Jura1: of course it's not for today (and maybe not even for next year) but we can think about it nonetheless (and thinking doesn't break anything). My two main questions are: has it already been discussed? (I don't think so, but maybe it has been) and is it a good idea in itself? (maybe it can't work for some reasons, lemma vs. lexeme for instance, that I didn't think about yet). Cdlt, VIGNERON (talk) 13:36, 13 April 2018 (UTC)
I think long-term both demonym (P1549) and female form of label (P2521) should be replaced by inverse properties. Once we have the data in inverse properties I would advocate deleting both (maybe in a year). ChristianKl 17:07, 6 May 2018 (UTC)

New date of deployment: May 23rd

Hello all LexData enthusiasts,

And thanks a lot for your patience :)

We've been delaying the deployment of the first version of lexicographical data in order to fix some minor issues and work on security of the code. I'm glad to announce that the first release of Lexicographical data will take place on Wednesday, May 23rd.

Everything that I wrote here is still valid. I will continue giving you some news before and after the deployment. The documentation will be improved and of course, I'll be happy to collect all your questions and feedback on this page.

In the meantime, you can still:

Thanks for your support, Lea Lacroix (WMDE) (talk) 07:40, 2 May 2018 (UTC)

Draft for the RDF mapping

Hello all,

One of the things we expect most from lexicographical data on Wikidata is the ability to run queries. As previously announced, this will not be available for the first release on May 23rd, but you can already add some ideas of queries.

One of the steps towards being able to query the data is to have an RDF mapping ready. This task has been started by Tpt (thanks!), who created a draft for the RDF mapping of Wikibase Lexeme. If you have knowledge of the topic, feel free to have a look and leave comments directly on the talk page.

Cheers, Lea Lacroix (WMDE) (talk) 12:56, 8 May 2018 (UTC)

Conjugation

Hi all,

How can we indicate the conjugation of verbs? It's a lot of different forms (for example the French verb tenir). Tubezlob (🙋) 08:37, 25 April 2018 (UTC)

I think such details can better be answered with people that have been involved in Wiktionary. Otherwise we will likely make old mistakes again and reinvent wheels. -- JakobVoss (talk) 08:53, 25 April 2018 (UTC)
@Tubezlob: aren't conjugations just forms? I did a test last week for "aller"@fr, what do you think? On the Wiktionaries, there are templates that generate the conjugation semi-automatically (with parameters to override when the conjugation is irregular, which is not unusual). Cdlt, VIGNERON (talk) 09:05, 25 April 2018 (UTC)
@JakobVoss: Sure, I read the help pages of Wiktionary, there is a lot of important and interesting information.
@VIGNERON: Yes I agree that these are forms. But for the verb "aller", there are 48 forms (× 2 with compound tenses). It will be very difficult to see anything with one hundred forms. We need a way to order them by tense and person for display (maybe with a script). For regular verbs, I think we need a script/gadget that automatically adds the forms (like the Wiktionary templates). Tubezlob (🙋) 10:26, 25 April 2018 (UTC)
Please don't reinvent wheels here. That has already been done many times on the Wiktionaries. Some of the inflection code is fairly complex (depending on the language) and can't be done with simple gadgets. The LexData "roadmap" includes automatic form generation but I'm unsure how this will be implemented. If it is in Lua perhaps some existing code could be reused. – Jberkel (talk) 13:32, 20 May 2018 (UTC)
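
As a rough illustration of the kind of script discussed in this thread, here is a minimal Python sketch that generates the present indicative of a regular French -er verb and orders the forms by tense and person for display. All names (`PERSONS`, `conjugate_er_present`, `sort_forms`) are hypothetical. This is not how the Wiktionary templates or the planned LexData form generation work; it deliberately ignores irregular verbs and spelling changes (for example manger, nous mangeons), which is exactly the complexity Jberkel warns about.

```python
# Hypothetical sketch: regular French -er verbs, present indicative only.
PERSONS = ["1sg", "2sg", "3sg", "1pl", "2pl", "3pl"]
ER_PRESENT_ENDINGS = dict(zip(PERSONS, ["e", "es", "e", "ons", "ez", "ent"]))
TENSE_ORDER = ["present"]  # extend with other tenses as needed


def conjugate_er_present(infinitive):
    """Return (tense, person, representation) tuples for a regular -er verb."""
    if not infinitive.endswith("er"):
        raise ValueError("only regular -er verbs are handled in this sketch")
    stem = infinitive[:-2]
    return [("present", p, stem + ER_PRESENT_ENDINGS[p]) for p in PERSONS]


def sort_forms(forms):
    """Order forms by tense first, then by person, for display."""
    return sorted(forms, key=lambda f: (TENSE_ORDER.index(f[0]), PERSONS.index(f[1])))


forms = sort_forms(conjugate_er_present("parler"))
for tense, person, text in forms:
    print(tense, person, text)
```

A real tool would need per-language rule tables and an override mechanism for irregular verbs such as "aller", which is why reusing existing Wiktionary inflection code is suggested above.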

Presenting the development team

Hello all,

I did a huge revamp of Wikidata:Lexicographical data/Development. It now presents the structure of the development team working on lexicographical data, as well as the volunteers who may be involved in the core development or in building tools.

I hope it will make it easier for you to understand who is working on the technical side, and who to contact if you have questions (spoilers: mostly me). If you have any question that is not answered by this page, if you would like to know more about a specific part of the development process, please let me know.

The page is translatable but I don't have the time to provide translations, I'm very sorry about that. If some of you could provide a translation (this should not take more than 15min), it would be wonderful. (edit: I also updated the banner)

Thanks! Lea Lacroix (WMDE) (talk) 12:26, 15 May 2018 (UTC)

That's great! Thanks a lot, merci beaucoup ! -- Noé (talk) 22:32, 15 May 2018 (UTC)

First release

Hello all,

As some of you may have noticed, today is the day where the deployment was announced ;) We're currently working on the last technical steps. Once everything is ready, I will do a proper announcement to the Wikidata and Wiktionaries communities. I will let you know if anything unexpected occurs.

Thanks for your patience :) Lea Lacroix (WMDE) (talk) 08:12, 23 May 2018 (UTC)

First experiment of lexicographical data is out

Hello all,

After several years of discussing it, and one year of development and discussion with the communities, the development team has now released the first version of lexicographical data support on Wikidata.

Since the start of Wikidata in 2012, the multilingual knowledge base was mainly focused on concepts: Q-items are related to a thing or an idea, not to the word describing it. Starting now, Wikidata stores a new type of data: words, phrases and sentences, in many languages, described in many languages. This information will be stored in new types of entities, called Lexemes, Forms and Senses. It will allow editors to describe precisely all words in all languages, and will be reusable, just like the whole content of Wikidata, by multiple tools and queries, everything that the community creates to play with words. Lexicographical data can be reused inside and outside the Wikimedia projects, and can provide support for Wiktionary.

The first release

A new namespace and several new entity types have been created in order to model words and phrases. If you’re new to this project, you can learn more by looking at the documentation, briefly describing the data model and the interface. The technical structure is set, but the editors remain free to model and organize data as they prefer, with the usual open discussions and community processes that we apply on Wikidata. Some discussions about new properties to create have already started: if you want to be involved in the early stage of the project to shape it, please participate!

Please note that the version that is now deployed is a first experiment, that will be continuously improved in the future. Some features are missing, some bugs may certainly occur. Here are the features that are included in the first release:

  • Add, edit and delete Lexemes, Forms, statements, qualifiers, references
  • Link between the different entity types (Item to Lexeme, Form to Item, etc.)
  • Entity suggestion when adding a property or a value

And the following features will not be included in the first version, but are planned for the future:

  • Find Lexemes and Forms via Special:Search
  • RDF support (which also means: the ability to query it with query.wikidata.org)
  • Support for Senses
  • Merging of Lexemes
  • Including the data on other Wikimedia projects, such as Wiktionary

How to try it?

The features described above are now deployed on Wikidata.org. Here are some suggestions of what you can do to explore this new territory:

  • If you’re not familiar with the structure of Lexemes, have a look at the documentation
  • Look at what is already existing. Please note that Special:Search and the search bar on the top right corner of pages is not supporting Lexemes yet. We’re working on this.
  • Create a new Lexeme with Special:NewLexeme
  • If a property that you need is missing, you can suggest it here
  • Discuss how to model words and ask questions on Wikidata talk:Lexicographical data
  • Report bugs or issues that you may encounter: either on the talk page or on Phabricator, if you’re comfortable using it (create a task, add the tag Lexicographical data, and add Lea_Lacroix_(WMDE) as a subscriber)

About mass imports and tools

We kindly ask you not to plan any mass import from any source for the moment. There are several reasons behind that. First of all, as mentioned above, the release is a first version and we need to observe how our system reacts to manual edits before considering automatic ones. The system may not be ready for massive imports at the beginning. The second reason is legal. Lexicographical data in Wikidata is released under CC0, and it is the responsibility of each editor to make sure that the data they add is compatible with CC0. For more information, you can have a look at the advice of the WMF Legal team. Finally, we strongly encourage you to discuss with the communities before considering any import from the Wiktionaries. Wiktionary editors have put a lot of effort over the years into building definitions, and we should be respectful of this work, and discuss with them to find common solutions to work on lexicographical data and enjoy the use of it together.

We also suggest that you wait a bit before building tools or scripts on top of lexicographical data. The interface and its API are probably going to evolve over the next months, and the system may not be stable enough to support such tools. We will inform you as soon as it becomes possible.

Next steps

After this first release, improvements will be made on a very regular basis (new deployments every week). Once you have tried playing with the new data, feel free to give us feedback. We especially want to know which features are the most important for you, so that we can work on them next.

  • What did you experience while editing lexicographical data? What went wrong or was unexpected?
  • What bugs or problems did you encounter during the process?
  • What are the features that are, in your opinion, the most important? Which one should we work on next?

If you’re interested in following the discussions and further announcements about lexicographical data, I encourage you to follow Wikidata:Lexicographical data and its talk page, where we will discuss how to organize and structure data, new features to be added, ideas for tools and queries, and a lot of other things.

Additional note: with this new kind of data enabled on Wikidata, we expect some new editors to take an interest in it, edit Lexemes, suggest properties or ask questions. They may not be familiar with all of our community processes and our ways of organizing content. They will need help and support as well as links to useful resources to understand how the Wikidata community works. I hope that we will all be kind and patient, both with other editors and with the software that may not work exactly as we want it to at the beginning :)

Thanks to the people who tested the model and the interface before the release, who showed support and curiosity about lexicographical data on Wikidata!

If you have any question or idea, feel free to write on Wikidata talk:Lexicographical data or contact me. Lea Lacroix (WMDE) (talk) 11:34, 23 May 2018 (UTC)

UTF coding issue

See the language name in https://www.wikidata.org/wiki/Lexeme:L23?uselang=ca or https://www.wikidata.org/wiki/Lexeme:L23?uselang=es. It seems it is not coded UTF-8. --Vriullop (talk) 12:07, 23 May 2018 (UTC)

I believe this is filed as https://phabricator.wikimedia.org/T195359 ·addshore· talk to me! 12:11, 23 May 2018 (UTC)

Wrong suggestion based on the test site

Hi,

Maybe it's just a cache problem that will disappear by itself with time, but right now, on Special:NewLexeme, when you enter a language, for instance "fr", you get two suggestions: French (Q150) and Europe (Q46). The second one is a mistake: on Wikidata Q46 is Europe, but Q46 is French on the test site: Q46. Strangely, two suggestion lists appear one over the other, and sadly the list on top is the wrong one.

Same problem for the lexical category: if I enter "noun" I get Q1084 (the right one, from Wikidata) and Q8 (the wrong one, from the test site).

Cdlt, VIGNERON (talk) 12:19, 23 May 2018 (UTC)

I believe https://phabricator.wikimedia.org/T191526 might be the issue you are describing? ·addshore· talk to me! 13:37, 23 May 2018 (UTC)
@Addshore: yes, thanks. My wrong suggestions are only mine because I tried on the test website. Cdlt, VIGNERON (talk) 14:05, 23 May 2018 (UTC)

Matching lexemes to items

I added the lexeme Lexeme:L249, "Andrew". How do I link that to Andrew (Q18042461), to show the relationship between them?

What about the lexeme "cheese" and cheese (Q10943)? Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 13:00, 23 May 2018 (UTC)

This is not possible for now, since we don't have Senses enabled yet. In the future, a property "refers to concept" included in the Sense could be one option. Lea Lacroix (WMDE) (talk) 13:03, 23 May 2018 (UTC)
Actually the property item for this sense (P5137) has already been created with exactly that meaning - but yes, it expects its domain to be the sense, not the lexeme itself. "Andrew" may only have one possible meaning, but for many words there are many, so the link should be from the sense, not the lexeme itself. ArthurPSmith (talk) 13:12, 23 May 2018 (UTC)
Yes, I would advise against using properties that are meant for Senses on Lexemes. What is currently in Lexeme:L249 is not compliant with the data model. Please be patient, Senses will be available soon :) Lea Lacroix (WMDE) (talk) 15:33, 23 May 2018 (UTC)

Full and short names

I've now created Lexeme:L263 "Andy"; how do I indicate that that is a short form of Lexeme:L249 "Andrew"? Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 13:09, 23 May 2018 (UTC)

you make a property proposal! ArthurPSmith (talk) 13:15, 23 May 2018 (UTC)
Something like "derived from" could work to make the link between the two Lexemes. Lea Lacroix (WMDE) (talk) 13:16, 23 May 2018 (UTC)
I would also say that "derived from" would suffice (maybe with qualifiers like determination method (P459) = hypocorism (Q1130279) or diminutive (Q108709); I'm not sure about P459, maybe a specific property is needed here). If not, you can make a new property proposal. Remember that lexemes have only existed for a few hours: the technical structure is here but the community still needs to think about how to use it, and to create the properties and the documentation. Cdlt, VIGNERON (talk) 13:23, 23 May 2018 (UTC)
Well "derived from" now exists - derived from lexeme (P5191). ArthurPSmith (talk) 13:46, 23 May 2018 (UTC)

Template:Property documentation needs fixing

The Property documentation template doesn't recognize the new lexeme example properties - see Property talk:P5191, there is an example given on the main property page, but it doesn't show it in the documentation template. @Jura1: do you know how to fix this? ArthurPSmith (talk) 13:45, 23 May 2018 (UTC)

default display of lexeme needs to show something more than just (Lxxx)

Similar to the issue with "what is already existing", the "what links here" page also shows just the Lxxx in parentheses, for example the (L298) here. I think the default display of a lexeme needs to at least show the title - preferably also it should show language and lexical category. ArthurPSmith (talk) 14:26, 23 May 2018 (UTC)

Yes. Which places do you consider most important so I can have the team focus on these first? (Obviously they will all come eventually.) --Lydia Pintscher (WMDE) (talk) 14:29, 23 May 2018 (UTC)
@Lydia Pintscher (WMDE): I would say that Special:AllPages is the most important; this way we can do a Ctrl+F search (which can make up for the lack of "real" search). Cdlt, VIGNERON (talk) 14:51, 23 May 2018 (UTC)
Ok. I added phabricator:T195382. --Lydia Pintscher (WMDE) (talk) 15:01, 23 May 2018 (UTC)

Forms with the same representation

Hello, sorry if this has been asked before, but I am not quite sure how to handle lexemes with several forms with same representation. Should they all be entered separately (as I did it in Lexeme:L211), or should they be grouped (and one form would therefore have several grammatical cases in grammatical features)? The first way seems to be more logical, but I would like to be sure. --Sintakso (talk) 14:41, 23 May 2018 (UTC)

Hi @Sintakso:,
Yes, it has been asked before but better safe than sorry so don't hesitate to ask.
And yes, even if several forms have the same representation they should be entered separately. I'm not sure it's even possible to group them (and even if it is possible, it would be nightmarishly confusing). Later, when lexemes are fully deployed and the structure is more stable, we should probably build tools to help maintain and add forms.
Cdlt, VIGNERON (talk) 15:06, 23 May 2018 (UTC)
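
To make the "enter them separately" answer concrete, here is a small Python sketch of how several forms can share one representation while keeping their own grammatical features. The `Form` class, the IDs and the German example ("Haus", whose nominative, accusative and dative singular are written the same) are illustrative assumptions, not the content of Lexeme:L211 and not the actual Wikibase data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Form:
    form_id: str          # made-up ID for illustration, e.g. "L1-F1"
    representation: str   # the written string
    features: frozenset   # grammatical features, as plain labels here


# Three separate forms that happen to be spelled identically.
forms = [
    Form("L1-F1", "Haus", frozenset({"nominative", "singular"})),
    Form("L1-F2", "Haus", frozenset({"accusative", "singular"})),
    Form("L1-F3", "Haus", frozenset({"dative", "singular"})),
    Form("L1-F4", "Hauses", frozenset({"genitive", "singular"})),
]


def forms_by_representation(all_forms):
    """Index forms by their written string; one string may map to many forms."""
    index = {}
    for form in all_forms:
        index.setdefault(form.representation, []).append(form)
    return index


index = forms_by_representation(forms)
for form in index["Haus"]:
    print(form.form_id, sorted(form.features))
```

Keeping one form per feature set means each form can later carry its own statements, while a tool can still group them by representation for display.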

Easter Eggs

For those interested in finding all the hidden special L items, we're searching for them and compiling them here: Wikidata:Humour#Surprise_L_items. At the time of writing, we've found 17 but not been able to explain the justification for all of those yet. Wittylama (talk) 16:04, 23 May 2018 (UTC)

Sandbox properties and lexeme

They are at

@Lea Lacroix (WMDE): can we have Lexeme:L1234 as a second sandbox? Cdlt, VIGNERON (talk) 16:34, 23 May 2018 (UTC)
I'm sorry, but it's not possible to add special lexemes anymore. Feel free to create a second sandbox when you have need for one. Lea Lacroix (WMDE) (talk) 18:17, 23 May 2018 (UTC)

Query Service

When will it happen? It doesn't seem to work yet, but it isn't mentioned on Wikidata:Project_chat#Next_steps.
--- Jura 16:27, 23 May 2018 (UTC)

@Jura1: see #Draft for the RDF mapping. VIGNERON (talk) 16:32, 23 May 2018 (UTC)
It is mentioned in the following features, but we don't have precise date for now. We need to have the RDF mapping ready and reviewed before moving on to the integration into the Query Service. Lea Lacroix (WMDE) (talk) 16:43, 23 May 2018 (UTC)
Oh .. ok. Maybe another two months? I think I had forgotten how it was without it.
--- Jura 19:14, 23 May 2018 (UTC)

Structuration by languages

Hi,

Even if all the lexemes are in the same namespace, all languages are different (some radically so), that's how the world is. How should we structure the community around that fact? Should we have separate places for separate discussions (like the thematic projects for Q items) or should we keep this single page [Wikidata talk:Lexicographical data]? Maybe, at least, we could have a list somewhere of people who are willing to help with the lexemes of a specific language?

I know I'm thinking a bit too far ahead, but I just stumbled upon Lexeme:L465, where an IP did a lot of edits on this lexeme in Arabic (+3000 and then -3000 bytes, that's unusual). I don't speak the language and only know the very basics of Arabic. I could try to find an active user in Category:User ar but there are 230 of them so it's hard to choose and besides, knowing a language and knowing the linguistics of this language are two different things. Hence my idea of at least having a list of potential contacts.

What do you think?

Cdlt, VIGNERON (talk) 17:46, 23 May 2018 (UTC)

As a start, there's an attempt at a list of contacts here :) Lea Lacroix (WMDE) (talk) 18:12, 23 May 2018 (UTC)
I think we should create subpages of this project (for example Wikidata talk:Lexicographical data/French) for every language with specific documentation for the language (written originally in English) and a specific talk page (at least for main languages). Tubezlob (🙋) 20:02, 23 May 2018 (UTC)
I believe it would still be very nice to keep most of the discussion in a cross-language space. Trying to build a representation system that is consistent between languages would make the content easier to reuse (with tools and templates able to work in multiple languages). Tpt (talk) 20:40, 23 May 2018 (UTC)
@Tpt: very true, but I fear that's a trade-off: what we'll gain in consistency, we'll lose in specificity and in the integration of monolingual users (or just non-English-speaking users). As I said, « I'm thinking a bit too far ahead »: my idea is that, right now, we need a common ground here first, but we'll probably need specific grounds later (and it's important: in addition to, not in place of, just as the Wikidata projects didn't replace the project chat or bistro). One of my favourite quotes is by Konstantin Tsiolkovskii (Q41239): « The Earth is the cradle of humanity, but mankind cannot stay in the cradle forever. » We should enjoy and make the most of our « cradle » now but we should think of other ways to reach users too. Cdlt, VIGNERON (talk) 21:16, 23 May 2018 (UTC)
Yes, starting global before creating sub-projects sounds like a very good approach. It seems to work quite well with the items so we can hope it will also work well for lexemes. The shared maintenance of properties also seems a good way to keep consistency across sub-projects. So, +1 to this proposal Tpt (talk) 21:20, 23 May 2018 (UTC)
A little message to support VIGNERON. Working in other languages than English is required if we want this project to be a success. If there is only an English-speaking community, I can bet that it will fail. Pamputt (talk) 05:34, 24 May 2018 (UTC)

Special:NewLexeme - confusing interface

Hi, I shared the new Lexicographical data with some friends for evaluating the new feature. We find Special:NewLexeme to be very confusing:

  • "Language of Lexeme" - the placeholder says "enter item ID e.g Q10" - people don't know IDs. Instead it should be "Enter language" (the suggester translates it to an ID under the hood anyway)
  • "Lexical category" - this should probably have a closed set of options, as it isn't quite clear what to fill in there. If you write "noun" it suggests many irrelevant options (such as Noun (Q2445359)). The description is not displayed, which makes it even more confusing.

Eran (talk) 20:30, 23 May 2018 (UTC)

Thank you very much for your report. We will work to improve the interface during the next weeks. You can look at the Phabricator board to see the progress. Lea Lacroix (WMDE) (talk) 20:46, 23 May 2018 (UTC)

Mojibake

When adding a lexeme in Bokmål (Q25167), there is actual mojibake shown instead of the "å" in "Bokmål", here (even if I change to another language using &uselang). How is this possible in this brave new Unicode world? 😊 Jon Harald Søby (talk) 21:24, 23 May 2018 (UTC)

Thank you Jon Harald Søby. It has already been mentioned here: #UTF coding issue, and there is also a bug, phab:T195359. Cdlt, VIGNERON (talk) 21:53, 23 May 2018 (UTC)
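
For readers wondering how this kind of mojibake typically arises, here is a minimal Python sketch. It shows one common mechanism (UTF-8 bytes decoded as Latin-1); whether this is the actual cause of phab:T195359 is an assumption, not something confirmed in this discussion.

```python
original = "Bokmål"

# Encode correctly as UTF-8, then decode with the wrong charset (Latin-1).
garbled = original.encode("utf-8").decode("latin-1")
print(garbled)

# When no bytes were lost, reversing the two steps restores the text.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired)
```

The "å" becomes two characters because it occupies two bytes in UTF-8, and Latin-1 maps every byte to exactly one character.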

Template:L

I created {{Lexeme}} (and {{L}} as a redirect), based on {{Property}}. However, {{L|249}} gives an "invalid ID" error. Can anyone advise? Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 13:11, 23 May 2018 (UTC)

Data access via Lua and parser function is still not enabled. So that probably fails because of this. --Lydia Pintscher (WMDE) (talk) 13:18, 23 May 2018 (UTC)
Right. The error message comes from {{Label|L249}} which needs access via Lua. Anyway, there is no multilingual label as usual. Instead it should retrieve somehow the "lemma-value" and "lemma-language". --Vriullop (talk) 15:43, 23 May 2018 (UTC)
I think we might want L:L100 as well, just like P:P31? — regards, Revi 12:09, 24 May 2018 (UTC)
@-revi: sounds reasonable IMHO. --Lucas Werkmeister (talk) 15:15, 24 May 2018 (UTC)
@-revi: (switches hat) I filed T195493 for this. --Lucas Werkmeister (WMDE) (talk) 15:17, 24 May 2018 (UTC)
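
To illustrate Vriullop's point that a lexeme has a "lemma-value" and "lemma-language" rather than a multilingual label, here is a small Python sketch that builds a display title from a lexeme's lemmas. The JSON shape (a `lemmas` object keyed by language code, each entry holding `language` and `value`) is an assumption about the entity serialization, the sample payload is illustrative and may not match the real L249, and the function names are hypothetical.

```python
import json

# Illustrative payload only; the real content of L249 may differ.
SAMPLE = """
{"id": "L249", "type": "lexeme",
 "lemmas": {"en": {"language": "en", "value": "Andrew"}}}
"""


def lemma_pairs(entity):
    """Return (language code, lemma text) pairs for a lexeme entity."""
    return [(lemma["language"], lemma["value"])
            for lemma in entity.get("lemmas", {}).values()]


def display_title(entity):
    """Join all lemmas into one display string, falling back to the ID."""
    pairs = lemma_pairs(entity)
    if not pairs:
        return entity["id"]
    return " / ".join(f'"{value}"@{lang}' for lang, value in pairs)


entity = json.loads(SAMPLE)
print(display_title(entity))
```

A template such as {{L}} would need something equivalent on the Lua side once data access for lexemes is enabled.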

Wait, search is not working??!!!

I hadn't quite realized the full implications of this:

Look at what is already existing. Please note that Special:Search and the search bar on the top right corner of pages is not supporting Lexemes yet. We’re working on this.

Is there no way to search for an existing lexeme? I've been adding some to populate examples, but I was assuming that my search not finding anything meant that somebody hadn't already added them - now I'm not so sure! "Advanced" search page here has boxes for "Lexeme" and "Lexeme Talk" but they don't seem to do anything!? How do we avoid duplicates? The "already existing" linked page doesn't really help, it just lists them by ID, it doesn't show the text or anything else about the lexeme. ArthurPSmith (talk) 14:18, 23 May 2018 (UTC)

For now the only place is unfortunately when adding a new statement that links to a Lexeme or a Form. Special:Search and showing Lemmas in listings is next on my list. --Lydia Pintscher (WMDE) (talk) 14:28, 23 May 2018 (UTC)
Oh, you're right, it does work on the property entry where a lexeme is expected, that's definitely a good clue regarding duplicates, ok, I'm a little happier. ArthurPSmith (talk) 15:38, 23 May 2018 (UTC)
  • Ideally Forms (similar to aliases) and language and grammatical categories (similar to descriptions) would be searchable too.
    --- Jura 10:41, 24 May 2018 (UTC)
Yes, it's planned for the long run. Lea Lacroix (WMDE) (talk) 10:45, 24 May 2018 (UTC)

Hack to search for a Lexeme: go to the sandbox, add or edit a statement with the property Sandbox-Lexeme (P5188), and type the word that you're looking for in the value field. Search is working here, you will see the lexeme displayed with the language and lexical category. Lea Lacroix (WMDE) (talk) 10:45, 24 May 2018 (UTC)

What's a representation?

While translating Wikibase Lexeme into Hungarian, I have encountered something that I don't quite get. What are exactly form and representation, and what is the difference between them? I get that form is kind of the idea of the form (with an ID and statements), while representation is the actual string that represents the given form, but how can one form have multiple representations? – Máté (talk) 14:36, 23 May 2018 (UTC)

Hi @Máté:,
Your understanding seems correct (at least to me and assuming that we are talking about the same things) and as far as I know, a form only has one representation. Where did you see that a form can have multiple representations?
Cdlt, VIGNERON (talk) 14:58, 23 May 2018 (UTC)
@VIGNERON: From the “+” sign which is part of the editing interface of forms :). I haven't tried to save it, but it does actually allow entering multiple representations for the same form. Can it be for multiple scripts like in the case of Serbian? (I am pretty sure it's not for alternate spellings since those should have a different set of statements as far as usage goes.) – Máté (talk) 15:04, 23 May 2018 (UTC)
@Máté: oh yes, I see (obvious, sorry). I'm not sure here, maybe it's like the "color"@en-US/"colour"@en-GB situation (two representations, but except for the spelling variation, all other data are the same - more or less*). More exactly, on Lexeme:L114, I created the forms "daoulagad"@br (L114-F2) and "deulagad"@br-x-Q2924576 (L114-F4), but maybe these should/could be grouped. @Tpt: what do you think? Anyhow, these kinds of cases are very limited. * I said more or less since at least the pronunciation varies too. Cdlt, VIGNERON (talk) 15:13, 23 May 2018 (UTC)
Yes, I believe that having multiple representations is for the "color"@en-US/"colour"@en-GB case and other cases of small spelling variations or writing in different alphabets. I would say that if the grammatical features are the same and there is no need for a statement about just one of the variants, then we should have one form with multiple representations. In short, if nothing applies to one variant and not the other, the two should be merged into one form. But in most cases I assume that forms will only have one representation. Tpt (talk) 15:31, 23 May 2018 (UTC)

Some languages have several writing systems (Arabic, Latin...).--Cinemantique (talk) 15:33, 23 May 2018 (UTC)

Yes, but even if the two systems are equivalent in everyday life, there is different data to store about them. For example, 中国 (zh-hans) and 中國 (zh-hant) are the same word and the same form (China in Chinese) but the two representations were created more than 2 millennia apart. The first is attested since the 20th century while the second dates back to the 10th-7th century BC (the first known attestation is on the He zun (Q5689054)). So for Wikidata, it could be best to store 中国 and 中國 as two different forms. Cdlt, VIGNERON (talk) 16:17, 23 May 2018 (UTC)
This is also true of Vietnamese, which switched writing systems in the 20th century as well. Almost every Latin-alphabet word has at least one Han representation that's considerably older. – Minh Nguyễn 💬 16:19, 24 May 2018 (UTC)

So, would I be correct to go with my initial intuition to translate representations as “alakváltozat” (lit. form variation), which is used in Hungarian to describe words that are basically one lexeme (one entry in a dictionary) but have variations (veder–vödör; e-mail–ímél) with only minor stylistic differences? – Máté (talk) 15:39, 23 May 2018 (UTC)

Yes, that's the idea. For example, a noun lexeme in English usually has two forms, one singular and one plural. Tpt (talk) 15:47, 23 May 2018 (UTC)
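
The rule of thumb Tpt gives in this thread (one form with several representations only when the grammatical features match and no statement applies to just one variant) can be sketched in Python as follows. The dictionary layout, the `can_share_form` name and the "first attested" statements are hypothetical illustrations of the color/colour and 中国/中國 examples above, not actual Wikidata content.

```python
def can_share_form(variant_a, variant_b):
    """True if two spelling variants could be representations of one form."""
    same_features = variant_a["features"] == variant_b["features"]
    nothing_variant_specific = (
        not variant_a["own_statements"] and not variant_b["own_statements"]
    )
    return same_features and nothing_variant_specific


color = {"text": "color", "features": {"singular"}, "own_statements": {}}
colour = {"text": "colour", "features": {"singular"}, "own_statements": {}}

# Variants that each need their own statement (here: a first attestation).
simplified = {"text": "中国", "features": set(),
              "own_statements": {"first attested": "20th century"}}
traditional = {"text": "中國", "features": set(),
               "own_statements": {"first attested": "10th-7th century BC"}}

print(can_share_form(color, colour))
print(can_share_form(simplified, traditional))
```

Under these assumptions the first pair can share one form, while the second pair stays as two forms because each variant carries a statement of its own, which matches VIGNERON's argument.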

Allows creation of duplicates

When creating a new lemma there seems to be no checking done to see if it has not already been created, and as far as I can see no way to search for a lemma. Danrok (talk) 18:45, 23 May 2018 (UTC)

For the search, yes it's problematic but it's in progress and temporary (see multiple previous discussions and warnings above).
For multiple lexemes with the same lemma (as already said, see #One L-item for "tour"@fr: bug or feature?), it is intended: multiple lexemes can have the same lemma (with the same language and, in some rare cases, even the same lexical category).
Cdlt, VIGNERON (talk) 18:50, 23 May 2018 (UTC)
Good point, a warning message of some kind is what's needed. Danrok (talk) 19:23, 23 May 2018 (UTC)
I created a ticket with a possible workflow, feel free to comment! Lea Lacroix (WMDE) (talk) 11:04, 24 May 2018 (UTC)
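
As a sketch of the "warn, don't block" idea from this thread, the following Python snippet looks for existing lexemes with the same lemma and language and produces a warning message instead of refusing the creation, since identical lemmas are sometimes legitimate ("tour"@fr). The in-memory list, the made-up IDs and the function names are assumptions for illustration and do not describe the workflow proposed in the ticket.

```python
# Hypothetical in-memory list of existing lexemes (IDs are made up).
existing = [
    {"id": "L-a", "lemma": "tour", "language": "Q150", "category": "Q1084"},
    {"id": "L-b", "lemma": "tour", "language": "Q150", "category": "Q1084"},
    {"id": "L-c", "lemma": "maison", "language": "Q150", "category": "Q1084"},
]


def possible_duplicates(lexemes, lemma, language):
    """Return existing lexemes sharing the lemma and language of a new entry."""
    return [lx for lx in lexemes
            if lx["lemma"] == lemma and lx["language"] == language]


def creation_warning(lexemes, lemma, language):
    """Build a warning message, or None when nothing similar exists."""
    matches = possible_duplicates(lexemes, lemma, language)
    if not matches:
        return None
    ids = ", ".join(lx["id"] for lx in matches)
    return f"'{lemma}' already exists in this language ({ids}); create anyway?"


print(creation_warning(existing, "tour", "Q150"))
print(creation_warning(existing, "chat", "Q150"))
```

The editor stays free to confirm the creation, which keeps homographs possible while making accidental duplicates less likely.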

Plural forms

Hi,

I created the word "maison" in French (Lexeme:L525) and the form F1 "maisons" (plural). I added plural (Q146786) as a grammatical feature of this form, but should I also add feminine (Q1775415), even if it's already stated for the main form with grammatical gender (P5185)? Tubezlob (🙋) 19:37, 23 May 2018 (UTC)

@Tubezlob: good question. I don't really have an opinion but I see pros and cons: con, more work and redundancy; pro, adding feminine (Q1775415) would be consistent with other lexemes (so easier to query, etc.). Cdlt, VIGNERON (talk) 20:19, 23 May 2018 (UTC)

I would add feminine on the form only if it is different from the gender of the Lexeme. I don't think that pluralization changes the gender, normally. --Denny (talk) 00:17, 24 May 2018 (UTC)

@Denny: FYI in French there is the strange case of Lexeme:L471 which can be masculine or feminine in plural (with a slight change in meaning but there are no senses right now so we can't model that part now). Cdlt, VIGNERON (talk) 06:15, 24 May 2018 (UTC)
@VIGNERON, Denny: if I recall correctly, there are exactly 3 fr lexemes which are masculine in singular form, and feminine in plural form: amour, délice, and orgue. French grammar is totally full of that kind of exception, and that is such a joy ;D
in French, we call it "l'exception qui confirme la règle" -> the exception that proves the rule --Hsarrazin (talk) 09:39, 24 May 2018 (UTC)
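The rule suggested in this thread (state the gender on a form only when it differs from the gender of the lexeme) can be sketched as a simple fallback. This is an illustration under assumptions, not the Wikibase data model: the `Form` and `Lexeme` classes and the `gender_of` method are hypothetical names, and plain strings stand in for the items used as grammatical features.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Form:
    representation: str
    features: Set[str] = field(default_factory=set)
    # Hypothetical override: set only when the form's gender differs
    # from the gender stated on the lexeme.
    gender: Optional[str] = None


@dataclass
class Lexeme:
    lemma: str
    gender: str
    forms: List[Form] = field(default_factory=list)

    def gender_of(self, form: Form) -> str:
        """Use the form's own gender if stated, else the lexeme's gender."""
        return form.gender or self.gender


# Regular case: the plural inherits the gender of the lexeme.
maisons = Form("maisons", {"plural"})
maison = Lexeme("maison", "feminine", [Form("maison", {"singular"}), maisons])

# Exception mentioned above: masculine in the singular, feminine in the plural.
amour_sg = Form("amour", {"singular"})
amours = Form("amours", {"plural"}, gender="feminine")
amour = Lexeme("amour", "masculine", [amour_sg, amours])

print(maison.gender_of(maisons))  # inherited from the lexeme
print(amour.gender_of(amours))    # overridden on the form
```

The trade-off raised earlier in the thread still applies: queries over forms are simpler if the gender is repeated on every form, while the fallback avoids redundant statements.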

Confusing Form layout

I find the layout for Forms confusing, so I have opened a ticket about it with a suggestion about how to improve it.--Micru (talk) 07:09, 24 May 2018 (UTC)

@Micru: small change but very good idea, the alignment you propose is indeed much clearer. Cdlt, VIGNERON (talk) 07:13, 24 May 2018 (UTC)

Prefil language/lexical category

For property creation, the special page can be pre-filled with the label/description. Can we do the same for new lexemes, to fill language and lexical category? I tried:

but neither worked.
--- Jura 07:29, 24 May 2018 (UTC)

Hello Jura, thanks for your suggestion, I created a Phabricator task. Lea Lacroix (WMDE) (talk) 08:18, 24 May 2018 (UTC)

Showcase item

Any showcase item?--Jklamo (talk) 08:26, 24 May 2018 (UTC)

@Jklamo: it's a bit early for that, few lexemes or properties have been created. But you can take a look at Lexeme:L99 or Lexeme:L666 which are probably the best - and at least good - examples of what we have right now. Cdlt, VIGNERON (talk) 09:32, 24 May 2018 (UTC)
You can also take a look at Lexeme:L403.--Micru (talk) 09:48, 24 May 2018 (UTC)
For Russian, a language with many inflected noun forms, вода is a good example. It is also the best (and probably the biggest) article in ru:wikt :) --Infovarius (talk) 15:32, 24 May 2018 (UTC)

Fill form on lexeme creation ?

For new Lexemes, maybe the label should also be added directly as form.
--- Jura 09:46, 24 May 2018 (UTC)

Thanks! I edited the ticket to include both the representation and the spelling variants. Lea Lacroix (WMDE) (talk) 10:03, 24 May 2018 (UTC)
Good idea!--Micru (talk) 11:36, 24 May 2018 (UTC)

"Handling of values for "wikibase-form" data type is not yet supported."

When I try to enter Wikidata property example for lexemes (P5192) on conjugation class (P5186) I get the above message. I thought linking to Forms was supported? --Micru (talk) 09:46, 24 May 2018 (UTC)

Hey Micru, thanks for your report. It's a strange bug that we're already monitoring on phab:T195402. Lea Lacroix (WMDE) (talk) 09:56, 24 May 2018 (UTC)
Re @Micru:, can you try to delete cookies from your browser, and try again? It should solve the problem. Lea Lacroix (WMDE) (talk) 14:34, 24 May 2018 (UTC)
Thanks, Léa! It worked! --Micru (talk) 14:50, 24 May 2018 (UTC)

Capitalization: “Lexical Category” vs. “Grammatical features”

Is there a reason why the last word in MediaWiki:Wikibaselexeme-field-lexical-category-label/en is capitalized while the last word in MediaWiki:Wikibaselexeme-form-grammatical-features/en is not? -- IvanP (talk) 09:55, 24 May 2018 (UTC)

Hello, thanks for your report. It is still a bit fuzzy right now. In general, we would like to capitalize Lexeme, Form and Sense, to avoid confusion with the other meanings of these words, especially for the last two. We would also suggest doing so for Properties and Items. For the other labels, we should apply the same rule as usual, which is having only the first letter capitalized. Lea Lacroix (WMDE) (talk) 09:59, 24 May 2018 (UTC)

simple past (Q1392475) vs. past tense (Q1994301)

I would welcome comments to this: Lexeme Talk:L4. -- IvanP (talk) 10:36, 24 May 2018 (UTC)

Sorting of forms

I noticed that in Lexeme:L469 the forms are sorted L469-F1, L469-F10, L469-F11... L469-F2, L469-F3... Logically they should be in numerical order. JAn Dudík (talk) 12:42, 24 May 2018 (UTC)

Yeah we need to solve the ordering. The thing is that the form ID itself doesn't have any meaning so isn't really suitable for ordering. We have phabricator:T176405 for it. --Lydia Pintscher (WMDE) (talk) 15:29, 24 May 2018 (UTC)


I think defined order per language and lexical category based on grammatical features would be ideal.
--- Jura 18:03, 24 May 2018 (UTC)
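The ordering reported at the top of this thread is what plain string comparison produces ("F10" sorts before "F2"). A minimal sketch of numeric ordering follows, assuming form IDs always look like `L469-F10`; the `form_sort_key` function is a hypothetical helper, not part of Wikibase. As noted above, the ID carries no linguistic meaning, so this only fixes the display order and is not the feature-based ordering proposed for each language and lexical category.

```python
import re
from typing import Tuple


def form_sort_key(form_id: str) -> Tuple[int, int]:
    """Split an ID such as 'L469-F10' into its numeric parts (469, 10)."""
    match = re.fullmatch(r"L(\d+)-F(\d+)", form_id)
    if match is None:
        raise ValueError(f"not a form ID: {form_id!r}")
    return int(match.group(1)), int(match.group(2))


# Order reported in the thread (lexicographic string order).
form_ids = ["L469-F1", "L469-F10", "L469-F11", "L469-F2", "L469-F3"]

sorted_ids = sorted(form_ids, key=form_sort_key)
print(sorted_ids)
# ['L469-F1', 'L469-F2', 'L469-F3', 'L469-F10', 'L469-F11']
```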

small focus issue

Once I click "+ add Form", it would be nice if the focus were automatically placed on the "Representation" field. Now I have to do an extra click for every form. A little annoying. KaMan (talk) 13:43, 24 May 2018 (UTC)

Not essential but yes, +1 (and may I add that very often on Wikidata the focus is not where I expect it; for me, who uses my mouse very little, this can be annoying). Cdlt, VIGNERON (talk) 13:47, 24 May 2018 (UTC)

grammatical aspect

How to add grammatical aspect to the verb? https://en.wikipedia.org/wiki/Grammatical_aspect KaMan (talk) 14:33, 24 May 2018 (UTC)

First we should decide whether it is a constant property of a lexeme or a variable property (and so, a property of a form). In Russian we can take it as a constant and thus create verbs with different aspects as different lexemes. In this case I'd propose a new property "grammatical aspect". In the second alternative there is no need for a new property as it can be filled in the "grammatical features" field. --Infovarius (talk) 15:53, 24 May 2018 (UTC)
In Polish it is a constant property of a lexeme as well. KaMan (talk) 16:51, 24 May 2018 (UTC)

Some Japanese-related questions

Greetings. My name is Jim Breen and I am coordinator of the JMdict Japanese-multilingual dictionary project. In one form or another it has been going for about 27 years, and manages the main dictionary of 180,000 entries and a named-entity dictionary of about 720,000 entries.

I am very interested in exploring linkages with the Wikidata lexicographical data, with such things as including lexeme codes into our entry information. As part of this a few questions and comments come to mind, from the perspective of Japanese lexicography:

  • how would multiple surface forms be handled in the Wikidata project? In Japanese it is common to find a word/term being written in several different forms, for example the loanword derived from the English word "diamond" can be either ダイヤモンド or ダイアモンド (the former is more common, but the latter is quite accepted.) In my opinion they are different forms of the same lexeme.
  • how will you handle parts-of-speech where there are families with different properties? For example Japanese has several classes of adjectives which inflect in different ways. They can be all labelled "ADJ", but there needs to be additional meta-information to enable users and software to respond appropriately. (In JMdict we use distinct POS tags, e.g. "adj-i" and "adj-na".)

That's probably enough for now. I hope I can explore these and other issues. JimBreen (talk) 07:41, 24 May 2018 (UTC)

First I notify Okkn who could better answer these questions than me.
Then, I would say (Okkn please correct me if I'm wrong) that :
  • indeed this is clearly two forms of the same lexeme
  • yes, this information needs to be stored. It could be directly in the lexical category but I would tend to just put 'adj' in the lexical category and add a specific property for the specific class. Or maybe the classes can simply be inferred from the lemma (I'm unsure but it seems that these classes are based solely on the suffixes, no?).
Cheers, VIGNERON (talk) 08:00, 24 May 2018 (UTC)
Not entirely. The "i" adjectives have an inflecting part at the end (い), e.g. 高い, 寒い, which can become 高さ or 寒くない, etc. Others such as 確乎 or 綺麗 belong to different classes and take different suffixes. (Inferring from the lexeme is always risky.) JimBreen (talk) 09:34, 24 May 2018 (UTC)
Dear Dr. JimBreen: As you may all know, in Japanese grammar studied at elementary and secondary schools in Japan, "形容詞" (adjective) only refers to "i-adjective". And "na-adjective" is called "形容動詞" (literally "adjective verb", adjectival noun). So I think the part-of-speech of "綺麗", for instance, should not be adjective (Q34698), but adjectival noun (Q1091269), and only "i-adjective" lexemes should belong to adjective (Q34698). --Okkn (talk) 12:48, 24 May 2018 (UTC)
Thank you, Okkn-san. I'm aware of the 形容詞/形容動詞 situation. My question was about the labelling of lexemes such as 高い, 綺麗 and 確乎 in the Wikidata lexicographical database. If they all have the simple "ADJ" POS it obscures that they have different syntaxes with, for example, adverbial forms of 高く, 綺麗に and 確乎と respectively. I'm not suggesting a vast array of language-specific POSs; rather that there be additional classifiers available with the main POS tags to handle these sorts of situations.JimBreen (talk) 23:49, 24 May 2018 (UTC)
@JimBreen: According to this discussion, part-of-speech (lexical category) data will be stored as QID of Wikidata, such as verb (Q24905), adjective (Q34698), adjectival noun (Q1091269), noun (Q1084), proper noun (Q147276), pronoun (Q36224), adnominal adjective (Q11639843), adverb (Q380057), conjunction (Q36484), interjection (Q83034), auxiliary verb (Q465800), and Japanese particle (Q1480213). So I think the lexical categories in Wikidata and the POS tags in JMdict will be almost in one-to-one correspondence. --Okkn (talk) 04:17, 25 May 2018 (UTC)
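The near one-to-one correspondence suggested above could be recorded as a lookup table. The sketch below covers only the two JMdict tags and the two items named in this thread; the `JMDICT_TO_WIKIDATA` table and the `lexical_category` function are hypothetical names, and the pairing follows the suggestion made above rather than any agreed mapping.

```python
from typing import Optional

# Hypothetical lookup table: JMdict POS tag -> Wikidata lexical category item.
JMDICT_TO_WIKIDATA = {
    "adj-i": "Q34698",     # adjective (形容詞), as suggested above
    "adj-na": "Q1091269",  # adjectival noun (形容動詞), as suggested above
}


def lexical_category(jmdict_tag: str) -> Optional[str]:
    """Return the item ID for a JMdict tag, or None if no mapping is recorded."""
    return JMDICT_TO_WIKIDATA.get(jmdict_tag)


print(lexical_category("adj-i"))
print(lexical_category("adj-na"))
```

Returning `None` for unmapped tags keeps the gaps visible, which matters for classes such as that of 確乎, where the thread has not settled on a lexical category.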

not grammatical features markup

How can one mark up non-grammatical features of a form when one lexeme has two forms with the same grammatical category and feature (like the plural of a noun) but one of them is rare, obsolete or depreciative? KaMan (talk) 10:55, 24 May 2018 (UTC)

Maybe a new property could be created to indicate this? It would also be useful as a qualifier for other properties (e.g. one of the senses of the lexeme "man" could have a statement P5190 (P5190)-dude with a qualifier such as "restricted usage"-colloquial language (Q901711)). --Sintakso (talk) 05:19, 25 May 2018 (UTC)

Invariant words

Can Lexemes representing an invariable word such as the have a Form (representing the only form there is for that word)? -- IvanP (talk) 17:47, 24 May 2018 (UTC)

If so, I would use IPA transcription (P898) only for Forms; if added to the whole Lexeme, the pronunciation of the basic form can be meant, but it would be redundant to have the property both on the Lexeme as well as the Form representing the basic form. Since there are forms that are not basic forms for which a pronunciation can be stated, I would also use IPA transcription (P898) for Forms representing a basic form. But what about a Lexeme representing an uninflectable word like because, is it to be left without a Form (so that its pronunciation has to be added to the whole Lexeme)? -- IvanP (talk) 18:21, 24 May 2018 (UTC)

In Lexeme:L2, language of work or name (P407) English (Q1860) is stated in regards to the IPA transcription. Is that not redundant as well? -- IvanP (talk) 18:21, 24 May 2018 (UTC)

On forms for invariant words, I would say yes, we should have at least one form and one sense for every lexeme. On the language of work or name (P407) qualifier - actually I think that is wrong, the better qualifier would be what was proposed in Wikidata:Property proposal/pronunciation variety (it would be nice if somebody created that!) ArthurPSmith (talk) 18:43, 24 May 2018 (UTC)
Yes, pronunciation variety, but only if the value is something more specific than English (Q1860), I would say, because the lexeme is already classified as English. General American English (Q3308526), for instance. -- IvanP (talk) 18:49, 24 May 2018 (UTC)
As I understand this qualifier was added because of constraints for IPA transcription (P898). I've already asked about its necessity there. --Infovarius (talk) 19:07, 24 May 2018 (UTC)

Two questions

  1. How should a grammatical gender be represented? Should we use e.g. grammatical gender (P5185)masculine animate (Q54020116) or two statements such as grammatical gender (P5185)masculine (Q499327) and grammatical gender (P5185)animate (Q51927507)?
  2. Should pronunciation properties (pronunciation audio (P443), IPA transcription (P898)) be added both to the whole lexeme and to the forms or just to the forms? The data model seems to suggest the latter, but it is currently used both ways.

Thanks in advance for answers. --Sintakso (talk) 14:05, 24 May 2018 (UTC)

Hi Sintakso,
  1. I would do it separately but both are correct I think
  2. For me, pronunciations should only be on forms. Pronunciation of a lexeme makes no sense to me as a lexeme is multiple lemmata with very different pronunciations. I see a lot of things right now that are temporary and not ideal, but that's normal, it's only the beginning of Lexemes.
Cdlt, VIGNERON (talk) 14:14, 24 May 2018 (UTC)

Lexical category – why not as a property?

There are different part-of-speech systems and some words are classified differently by different grammars, e.g., German lauter (as in vor lauter Freude) is seen as an adjective by Duden, an adverb by DWDS and a quantificative article by Progr@mm (cf. Wiktionary talk). kein is an indefinite pronoun according to Duden, but a negation article according to Peter Eisenberg. According to canoonet, the word solch- can be called a pronoun/determiner as well as an adjective.

These views could be represented by multiple values for a property lexical category. A preferred rank could be used if some classification is to be considered as the best (by linguistic or community consensus); for instance, some grammars consider a word such as pleite an “adcopula” (Adkopula) and not an adjective because it is restricted to predicative use, but this is not widely accepted, I would use the preferred rank for adjective (Q34698).

Is there a good reason why this should not be done? -- IvanP (talk) 23:25, 24 May 2018 (UTC)

@IvanP: In my opinion, based on what you explained, it seems to be a good idea to have a "lexical category" property for the cases where it is not enough with the provided field. Go ahead and propose it! --Micru (talk) 06:54, 25 May 2018 (UTC)
But why even have this field, then? -- IvanP (talk) 08:30, 25 May 2018 (UTC)
Me too, I don't understand why we have this kind of fields for lexical category (and even for forms): it's impossible to have multiple values (with ranks), to add qualifiers and references… Tubezlob (🙋) 08:38, 25 May 2018 (UTC)
@IvanP, Tubezlob: Personally I find it practical. Most forms have a simple "lexical category" and do not require multiple values with ranks, references, etc. so it saves work to have this field instead of having to create a statement every time.--Micru (talk) 08:40, 25 May 2018 (UTC)
It is the same with labels and descriptions on items. We intend to use them in listings and suggesters and other places so they need to be more restricted and more easily accessibly programatically. However there can be additional properties where this makes sense like for names on items. --Lydia Pintscher (WMDE) (talk) 08:43, 25 May 2018 (UTC)
I think we need a property for that, but that doesn't mean at all that the lexical category is not useful. Same goes for the lemma and the language by the way, where multiple values can be needed (for lemmata, we can already have multiple values but not in the same language, I had to cheat a bit on Lexeme:L95). Ad absurdum, are we going to remove lemma and language too and have nothing left? In most cases, a word only has one lemma, one language and one category, so the system seems good to me. Cdlt, VIGNERON (talk) 09:04, 25 May 2018 (UTC)

First try

Hi,

I created some lexemes to see. And I already have so many questions for the community on how to structure them; here are some quite easy questions on basic structure:

  • should we add the non-inflected form as a form? i.e. lagad for Lexeme:L114
  • how to deal with the plural of a dual? For instance on Lexeme:L114 the word is "lagad"@br (eye), the plural is "lagadoù"@br (eyes), the dual is "daoulagad"@br (a pair of eyes). But there is a plural of the dual: "daoulagadoù"@br (several pairs of eyes). I've put « Grammatical features dual, plural » but is it correct? (wouldn't it be understood as « plural and dual » instead of « plural of the dual »?)
  • how to deal with dialects? For instance "daoulagad"@br is "deulagad"@br in Gwenedeg (Q2924576) (a dialect of Breton (Q12107)). I tried to enter either Q2924576 or a code like br-56 or br-x-Vann but nothing works...

Cdlt, VIGNERON (talk) 11:59, 23 May 2018 (UTC)

Regarding the first issue: I have added the non-inflected form for some entries, although I am not entirely sure that is the way to go. — Finn Årup Nielsen (fnielsen) (talk) 14:12, 23 May 2018 (UTC)

@Fnielsen, Tpt: the more I test the lexeme pages, the more I am convinced it's the right way to go. For instance, there are properties that are (or will be) specific to forms, so we need all the forms, including the non-inflected one. Also, you can link to a specific form; again, the non-inflected one is needed or the link is impossible. Cdlt, VIGNERON (talk) 09:15, 24 May 2018 (UTC)
Additionally about "non-inflected" form. Some languages use infinitive (Q179230) of the verb as initial form (Q17524785), some first-person singular (Q51929218). So to be able to distinguish them (and which of them is "non-inflected"?) we ought to add all forms. --Infovarius (talk) 12:12, 24 May 2018 (UTC)
I also support adding the non-inflected form. The singular form is chosen to be the lemma (the representation of the lexeme) in English, in French and in many languages but this is not necessarily the case for every language. Tubezlob (🙋) 09:32, 26 May 2018 (UTC)

First 1000 Lexemes

Hi all,

Since there is no search yet and to see where we are already, here is the list of the first 1000 lexemes :

And at least, it allows us to see some probable mistakes (lemmata used for translations instead of variations), duplicates and a lexeme that should probably be deleted, Lexeme:L528 (I'll ask for it).

Cheers, VIGNERON (talk) 16:08, 25 May 2018 (UTC)

Quick comment: nynorsk (L476) and (L900) are technically duplicates, but depending on how it is decided to treat Norwegian here, converting one to Bokmål might be a better solution than merging them or deleting one. So I'd suggest that those two are left for now until things become clearer (pinging @pmt). --Njardarlogar (talk) 16:29, 25 May 2018 (UTC)
Njardarlogar I agree, let's wait. I don't know Norwegian well but clearly it needs some thinking to decide how to deal with it precisely. My amateur feeling is that Bokmål and Nynorsk should have separate lexemes when the lemmata are different and only one lexeme when the lemmata are identical (but as I said, I'm not an expert in Norwegian). For the duplicate, I was thinking about Lexeme:L478 and Lexeme:L467 (already spotted by the creator Danrok). Cdlt, VIGNERON (talk) 16:38, 25 May 2018 (UTC)
Thanks! And thank you all for playing around and experimenting with the Lexemes, as well as for all the useful and interesting discussions on this page <3
If you're interested in following the evolution of the Lexeme creations, here's the link to RC with the right filtering parameters :) (note: the lemmas are not displayed yet, but we have a ticket for that). Lea Lacroix (WMDE) (talk) 16:37, 25 May 2018 (UTC)
Maybe the list could be moved to (a) separate page(s), formatted as table, include a column with the forms and be updated regularly by bot (hourly?). This way one could search the list as long as one can't use internal search on Lexeme namespace.
--- Jura 05:39, 26 May 2018 (UTC)
Note that user:fnielsen wrote a tool to search for lemmas and forms. Lea Lacroix (WMDE) (talk) 07:30, 26 May 2018 (UTC)
Note that the search index in my tool is not yet set up to update automatically, so new lexeme items will not be indexed and searchable. Only up to L1229 is indexed. — Finn Årup Nielsen (fnielsen) (talk) 08:42, 26 May 2018 (UTC)
That's helpful, but I think a basic way to do this on wikidata.org itself is needed.
--- Jura 09:05, 26 May 2018 (UTC)
@Jura1: why not, but I'm not entirely sure; an hourly update seems too much (the lexemes don't change that much hour by hour, not even day by day; I'll do some updates from time to time, and formatting as a table is a good idea). As for a subpage, why not, but it will quickly become obsolete (the internal search should arrive quite soon, and in a subpage the syntax I've done by hand would be replaced by the #Template:L; later a SPARQL query would do a better job). My idea was just to temporarily give a quick overview and insight, not to build a permanent tool. Cdlt, VIGNERON (talk) 11:37, 26 May 2018 (UTC)

Lexeme features I miss on items

The descriptions built from language and lexical categories are really helpful. Try adding a lexeme as value of a statement, e.g. on Lexeme:L123
--- Jura 08:41, 24 May 2018 (UTC)

AutoDesc does automatic descriptions based on instance of/subclass of, but unfortunately they do not show up in the tooltips. See: Wikidata:Automating descriptions.--Micru (talk) 11:36, 24 May 2018 (UTC)

Type the text of a Form and get the Lexeme ..
--- Jura 07:58, 25 May 2018 (UTC)

@Jura1: Can you please explain the context of this? Type where? Get the Lexeme for what?--Micru (talk) 08:13, 25 May 2018 (UTC)

The handling of language identifiers, especially when not defined.
--- Jura 09:33, 25 May 2018 (UTC)

It looks like that part still needs some work.
--- Jura 09:27, 27 May 2018 (UTC)

Properties

Hi all,

Could you all take a look at Wikidata:Property proposal/Lexemes ? There is a lot of property proposals waiting for more input, please join in.

Cdlt, VIGNERON (talk) 09:24, 25 May 2018 (UTC)

Sure but we are not in a hurry. Now that lexemes can be created, property proposals should include existing examples anyway before further consideration. -- JakobVoss (talk) 07:22, 27 May 2018 (UTC)
Indeed, there is no hurry, we can take our time and we should! But I see a lot of people here or creating lexemes and very few on the proposal. Cdlt, VIGNERON (talk) 09:22, 27 May 2018 (UTC)

I created Wikidata:Lexicographical data/Properties to give an overview of properties used on Lexemes, Forms, and Senses. Unfortunately ListeriaBot does not support properties (?) so we don't have automatic tables yet. -- JakobVoss (talk) 08:01, 27 May 2018 (UTC)

Thank you! Cdlt, VIGNERON (talk) 09:22, 27 May 2018 (UTC)

A Chinese character is Lexeme or Item?

I have proposed properties about Chinese characters (including Japanese kanji and other CJKV characters):

I am trying to decide the domain of these properties, and I am faced with a problem. There are Chinese characters on Wiktionary (ex. wikt:然#Translingual), but can we regard them as Lexemes? Are other alphabetic characters also Lexemes? What is the lexical category for characters? Many characters are used in multiple languages, so if they are Lexemes, do we have to store the same character as different Lexemes for each language? Or should characters (including Chinese characters) be Items? Let me know what you think. Thanks, --Okkn (talk) 11:45, 25 May 2018 (UTC)

 
Links between characters (A comes from Α which itself comes from 𐤀).
I would say both and anyhow, definitely acceptable as lexeme. It could be useful for producing trees like the one in the image on the right. And there are probably plenty of other uses (some characters can have a name/pronunciation, like 1 is "one"@en or Ꝃ is "Ker"@br, or we can add info so people stop confusing homograph letters like A , Α and А). And the Universal POS tags have a tag for "symbol" (see old discussion on Wikidata talk:Lexicographical data/Archive/2018/03). Cdlt, VIGNERON (talk) 11:58, 25 May 2018 (UTC)
PS: another example more related to Chinese: 东 and 東 are two different characters (with two Unicode code points: U+4E1C and U+6771) but it's one word so I guess we need 3 lexemes: 东 (character, lang: zh-Hans?), 東 (character, lang: zh-Hant or maybe mis-x-Q53764732 ?), and 东/東 (word, meaning "East"@en, lang: zh - here I'm quite sure). Each lexeme could store specific information. What do you think?
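The code points mentioned above, and the difference between the look-alike letters, can be checked with Python's standard `unicodedata` module. This is only a small illustration; `code_point` is a helper name made up for the example.

```python
import unicodedata


def code_point(char: str) -> str:
    """Format a single character's code point as 'U+XXXX'."""
    return f"U+{ord(char):04X}"


# The two Chinese characters, then Latin A, Greek Alpha and Cyrillic A.
for char in ["东", "東", "A", "Α", "А"]:
    print(char, code_point(char), unicodedata.name(char))
```

The three capital letters look identical in most fonts but are separate characters in separate scripts, which is the confusion that per-character data could help resolve.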

If 东 and 東 represent the same word, but in two different scripts, then I would say this should be a single lexeme? --Denny (talk) 16:52, 25 May 2018 (UTC)

@Denny: yes agreed, the word "East" in Chinese is only one word and should have only one lexeme (with 东 and 東 as lemmata). But the question here is: should the characters themselves have lexemes of their own? (and by extension, all characters, like Latin letters, Cyrillic letters and so on, for most if not all Unicode characters in fine). Cdlt, VIGNERON (talk) 17:03, 25 May 2018 (UTC)

Hmm, I don't know. And would be good to get more people to chime in on this.

Characters should be in Wikidata somewhere, agreed on that. And the question is, either as Lexemes or as Items. On a bit of consideration, I would think they would fit better as items - 'A' is a letter not of a specific language, but of a specific script, and a number of languages use this. If it were a Lexeme, would it be a different 'A' for English than the 'A' for German? (note that this is not the same as the lexeme for the indefinite determiner 'a'). So, I would like to have more discussion on this, but for now I would lean towards having letters - and Graphemes in general - as items. But I have to admit that I do not know enough about CJK languages to understand the ramifications of such a choice. --Denny (talk) 17:33, 25 May 2018 (UTC)

@Denny: yes. Lexico-wikidatian please join and give your points of view and references ! Cdlt, VIGNERON (talk) 18:42, 25 May 2018 (UTC)
  Comment "东" (simplified Chinese characters (Q185614), used in zh-Hans) is derived from "東" (traditional Chinese characters (Q178528), used in zh-Hant and ja), and both "东@zh-Hans" and "東@zh-Hant" should be in the same Chinese noun lexeme. We already have some Chinese characters as items, such as (Q3594986), so I can agree that we store Chinese characters as items. --Okkn (talk) 05:27, 26 May 2018 (UTC)
@Okkn:, ok, then if we are logical and consistent, should we delete stroke count (P5205) and recreate it for items? Cdlt, VIGNERON (talk) 09:23, 27 May 2018 (UTC)
@VIGNERON: I think there is no need to do so, because in the proposal of this property I had already suggested the domain could be Items before this property was created. Also, if our community will decide to treat CJK characters as Lexemes, we will use this property on Lexemes. The problem is not the matter of this property, but our decision about how to deal with CJK characters on Wikidata. --Okkn (talk) 10:07, 27 May 2018 (UTC)

Grammatical categories

Should we create a property for every constant grammatical category (gender, animacy, countability/uncountability, etc.) or just have only one property, "grammatical category"? I still don't understand how to add information about animacy, I don't agree that it's gender.--Cinemantique (talk) 03:45, 27 May 2018 (UTC)

Cinemantique,
Do you have references that animacy is not a gender? I don't know exactly what animacy is and how it works; that said, you provided a link to the article animacy which is categorized under Grammatical gender and mentions several times that animacy is like a gender, at least for some languages. And inanimate (Q51927539) is a subclass of gender (ping creator @Tubezlob:).
That said, I don't really see the need for grammatical gender (P5185), it's just a duplicate of the Grammatical features of the forms.
Cdlt, VIGNERON (talk) 08:11, 27 May 2018 (UTC)
@VIGNERON: Constant properties (gender, animacy, countability/uncountability...) belong to lexeme, changeable properties (case, number) belong to forms.--Cinemantique (talk) 10:15, 27 May 2018 (UTC)
@Cinemantique: I'm not a specialist of animacy, but I'm sure that animate and inanimate are genders. What is it otherwise? Tubezlob (🙋) 09:02, 27 May 2018 (UTC)
@Tubezlob: In Russian gender means what set of endings will be used in declension, animacy means what ending will be used in accusative case (like nominative or genitive). So, these are two different categories.--Cinemantique (talk) 10:15, 27 May 2018 (UTC)
It varies between languages. The Czech language has three 'genders' (m, f, n) and the masculine is divided between masculine animate (Q54020116) and masculine inanimate (Q54020181). JAn Dudík (talk) 13:28, 27 May 2018 (UTC)

New filters for Recent changes and Watchlist

With the Lexicographical data, and so many languages that I don't understand, the usefulness of the pages Recent changes and Watchlist has been reduced. For this reason I propose to improve them with new Filters to be able to filter by language or type of data. See Phabricator ticket for more details. --Micru (talk) 09:39, 24 May 2018 (UTC)

Thanks for your suggestion. FYI, a filter showing only Lexemes is already available (click on "Namespaces" and select Lexeme). But it doesn't allow filtering by language yet. Lea Lacroix (WMDE) (talk) 06:38, 28 May 2018 (UTC)

Choice of the language

Hi,

I don't understand why I can't choose the language Old French (Q35222) for a lexeme. The language code fro exists in Wikidata for monolingual datatype but it doesn't appear on the list… Tubezlob (🙋) 09:02, 26 May 2018 (UTC)

Same thing for frm. Cdlt, VIGNERON (talk) 09:11, 26 May 2018 (UTC)

Maybe it's just that nobody added it to the item?
--- Jura 09:52, 26 May 2018 (UTC)

Nope, ISO 639 code (and other codes) has been added year ago. And as far as I know, the code used and stored in the lexeme for the lemma(s) have nothing to do with the items. Cdlt, VIGNERON (talk) 11:13, 26 May 2018 (UTC)
The link between an item and a code needs to be defined somewhere. This might be a list occasionally retrieved from Wikidata.
--- Jura 11:25, 26 May 2018 (UTC)
Yes but that's not exactly the problem here. Only a small fraction of the ISO codes indicated in items are available for monolingual strings (is this list available somewhere by the way?) but here, apparently, we have yet another bug, as a code that works in monolingual strings for items (fro or frm at least but probably others too) doesn't work on monolingual strings for lexemes. @Lydia Pintscher (WMDE), Lea Lacroix (WMDE): could you confirm whether my assumption and analysis are correct? Cdlt, VIGNERON (talk) 11:30, 26 May 2018 (UTC)
Q20923490 has most things defined, but it can't be used (maybe it actually shouldn't, but that's another problem). It seems to be a built-in list of the drop-down tool.
--- Jura 11:45, 26 May 2018 (UTC)
Hello,
Currently, the suggestions that are made for language of the lemma and representations are based on a separated list, that is not a list of items. It's not the same as the monolingual string list. This list is based on the language codes used by the Wikimedia projects.
We're just starting thinking about how we could move forward with this list, provide more freedom to the users while avoiding mistakes. For more information, see this ticket. Your opinion is welcome to help finding a solution! Lea Lacroix (WMDE) (talk) 08:39, 28 May 2018 (UTC)
I think we already had a ticket for that and discussed it there before. Maybe it's better to have this and future discussions here (it better separates editorial questions from pure implementation of choices made and avoids that our input gets lost in the phabiverse).
--- Jura 08:45, 28 May 2018 (UTC)

Statements

When will you be able to add L-items to Statements? --94.254.240.89 08:43, 27 May 2018 (UTC)

  • Technically, you can add them as statements to other Lexemes, but adding them to other entities (Q-items, P-properties) had to be stopped on Friday. Apparently it currently breaks other uses of items. For many meaningful uses, the "Senses" feature is missing. Not sure what the target date for that is.
    --- Jura 09:25, 27 May 2018 (UTC)
Indeed, if we're talking about adding Lexemes to statements on Items, it's been disabled for now because it caused problems on client (Wikipedia, etc.) See this ticket for more information. So, to answer the question: when this issue will be fixed. I can't give any timeframe right now. Lea Lacroix (WMDE) (talk) 08:19, 28 May 2018 (UTC)

form lookup and lexeme lookup

When trying to add forms as values, e.g. at Lexeme:L123#P5189, the tool displays "[grammatical features of form] for [lexeme]: [language] [grammatical features of lexeme]".

When trying to use a form as value for a lexeme, e.g. at Lexeme:L123#P5188, the same input omits [grammatical features of form]. It would be interesting to display this too.
--- Jura 15:40, 27 May 2018 (UTC)

Yes, the value of a property whose datatype is Lexeme can only be a Lexeme. So the property suggester will suggest only Lexemes, with the formatting that we have for that (language and lexical category). A Lexeme doesn't have grammatical features of its own. Lea Lacroix (WMDE) (talk) 08:20, 28 May 2018 (UTC)

Grammatical gender in Q-items?

Is it OK?--Cinemantique (talk) 09:04, 28 May 2018 (UTC)

I've added it for the languages. Please tell me what you think.
Un saludo. --Romulanus (talk) 09:12, 28 May 2018 (UTC)
@Romulanus: But it's for L-items. We can create a lexeme Pacidia@Spanish and use grammatical gender there.--Cinemantique (talk) 09:23, 28 May 2018 (UTC)
Oops. I'm undoing everything I've added. Since there is grammatical gender in Latin, I thought we could add it to the Q-items from Latin. I'm sorry.
Un saludo. --Romulanus (talk) 09:31, 28 May 2018 (UTC)
Just in case, I'm removing it from all the Q-items. i.e.. --Romulanus (talk) 10:22, 28 May 2018 (UTC)

Linking from items to lexemes has been disabled

Hello there,

On Friday, we noticed that adding statements with a Lexeme value on items was causing issues on the clients reusing data from these items (e.g. Wikipedia). As a quick fix, we removed these statements and added an abusefilter, so people can't add Lexeme values in item statements for now.

We will work on fixing the issue so you can eventually start adding Lexemes in items' statements again, but for now, and for an unknown period, it's not possible to link items to lexemes.

If you want to be updated on technical details, you can keep an eye on the ticket. Thanks for your understanding, Lea Lacroix (WMDE) (talk) 09:33, 28 May 2018 (UTC)

Linking items from lexemes

Eventually we might, I mean, we will have "Senses" and should be able to link items corresponding to a given "Sense" from there. This might be a couple of months away. In the meantime, we might end up with plenty of lexemes.

I think it would be useful to attempt to link QIDs already now, at least for nouns and maybe a few other lexical categories. I think it's especially useful for lexemes where there is currently no alias to an existing item with the same label and language.

I added a few links with Sandbox-Item (P369), but maybe we could use item for this sense (P5137) for that.
--- Jura 10:54, 26 May 2018 (UTC)

@Jura1: Sandbox-Item (P369) is a sandbox property; it shouldn't be used except for short, temporary tests. The senses will not be there for months, so it doesn't seem like a good idea to me. item for this sense (P5137) was intended for senses, so it's a bit strange to use it outside senses, but why not experiment a little bit (I don't really see the use as it's not on senses, but if you see it, you could try).
Finally, you said « at least for nouns » and added it on Lexeme:L20, but that is the numeral, not the noun (where the link to 20 (Q40292) would fit better). Is that on purpose?
Cdlt, VIGNERON (talk) 08:25, 27 May 2018 (UTC)
I don't understand your question, but I note your disagreement with the suggestion. Thanks for voicing your opinion. For clarity's sake, I think it would be good if you would attempt to add an indicator that you're voicing one. "IMO" can do.
--- Jura 09:20, 27 May 2018 (UTC)
It's not really my opinion; I didn't invent the word "sandbox", which means "test". So, my opinion is that testing is testing and should stay testing.
For the question. Let's make it clearer: you said « at least for nouns and maybe a few other lexical categories », then Lexeme:L20 is not a noun. Ergo: do you consider it to be « other lexical categories » where you « think it would be useful to attempt to link QIDs already now »?
Cdlt, VIGNERON (talk) 10:01, 27 May 2018 (UTC)
Hello @Jura1:,
I would strongly recommend not using properties that are made for Senses directly in Lexemes. This is not how the data model is designed.
We estimate the maximum timeframe for enabling Senses at 3 months. Lea Lacroix (WMDE) (talk) 08:28, 28 May 2018 (UTC)
@Jura1: it's not a whim that we shouldn't link Lexemes to items. You've linked Lexeme:L20 to 20 (Q40292), but it can also mean United States twenty-dollar bill (Q7893199) and probably other things. That's why it is a Lexeme and not an Item. --Infovarius (talk) 14:22, 28 May 2018 (UTC)
  • I'm curious what you think of the initial question. Obviously, it's something that needs to be done. Apparently the testing phase without "senses" should only last a couple of weeks.
    --- Jura 07:44, 29 May 2018 (UTC)

English verbs and identical nouns

What's the suggested way of linking them together? I take it that they should be on different items even when spelled identically in one Form.
--- Jura 09:43, 27 May 2018 (UTC)

Not sure but couldn't derived from lexeme (P5191) be used for that? Do you have an example? Cdlt, VIGNERON (talk) 10:09, 27 May 2018 (UTC)
What about wikt:face#English and wikt:present#Etymology_2? --Okkn (talk) 12:29, 27 May 2018 (UTC)
I suppose the problem is to determine if the noun is derived from the verb or the other way round. Perhaps it is better with (a) dedicated property/ies? — Finn Årup Nielsen (fnielsen) (talk) 15:59, 27 May 2018 (UTC)

Not only verbs and nouns. English is an adjective, proper noun, noun and verb. This also happens in other languages. With the same etymology and the same primary meaning, isn't this the same lexeme? --Vriullop (talk) 10:17, 29 May 2018 (UTC)

Deletion proposed for prematurely created Senses properties

Wanted to give a heads up that P5190 (P5190) and item for this sense (P5137) are being proposed for deletion at Project:Properties for deletion#P5190 and Project:Properties for deletion#P5137 respectively for being created prematurely. – Pizza1016 (talk | contribs) 22:39, 28 May 2018 (UTC)

Thanks for the heads-up, I have given my view accordingly.--Micru (talk) 07:34, 29 May 2018 (UTC)

How to express scripts?

ⱂⱔⱅⱐ/пѧть (L1031) is the Old Church Slavonic word for five. It is currently given in two scripts, Cyrillic and Glagolitic. I expected to enter the language code cu-cyrl for the first script and cu-glag for the second in the Spelling variant field, but neither of them is accepted. Am I doing it wrong? (For now I kept one as hr, although it is wrong, because if I have two cu it seems to overwrite one; I can't have both simply as cu.) --Denny (talk) 05:11, 25 May 2018 (UTC)

Based on what I have seen in Lexeme:L1, I think the script has to be identified with the Wikidata identifier: cu-x-Q8209 and cu-x-Q145625. -- IvanP (talk) 07:24, 25 May 2018 (UTC)
Lexeme:L1 seems to be wrong, at least in my definition. Under "ama" appears mis-x-Q36790 (Sumerian (Q36790)), and it should be at least mis-x-Q8229 (Latin script (Q8229)).--Micru (talk) 07:49, 25 May 2018 (UTC)
For Sumerian, there is the language code sux, so I guess sux-x-Q8229 (Sumerian, Latin script) and sux-x-Q401 (Sumerian, cuneiform). What does Léa think? -- IvanP (talk) 08:23, 25 May 2018 (UTC)
Hello,
The plan is to have language codes that are augmented with the item of your choice, to describe precise dialects or versions of an orthography for example, like sux-x-Q8229. Unfortunately, this feature is not available yet: for now, it is only possible to add the x-item part to the mis code. That's why we entered mis-x-Q8229 in L1 for now. In the future, it will be sux-x-Q8229.
For L1031, in the future it will be cu-x-Q8209 and cu-x-Q145625, but for now, if you want, you can add mis-x-Q8209 and mis-x-Q145625, which is not very precise but is the solution we have for now.
I hope that answers your question :) Lea Lacroix (WMDE) (talk) 08:59, 25 May 2018 (UTC)
Hi Léa, would it be possible to link each part of the code to its corresponding Wikidata item? For instance cu-x-Q8209 would look like this: cu-x-Q8209 (And is the -x- really necessary?). The reason is that as a reader I do not have any means to know what each part of the code means, but with links+tooltips I would be able to understand better the codes. As an editor some sort of autocomplete to create the code would be useful too.--Micru (talk) 09:32, 25 May 2018 (UTC)
The interface will definitely be improved over the next months. We know that this raw code is not optimal. Ideally, in the long run, it should not even be exposed to the user; they could choose a language and an optional item from a list without having to type the Q-id. In the meantime, I'll see if it's possible to add links. Lea Lacroix (WMDE) (talk) 09:58, 25 May 2018 (UTC)
If cu-Cyrl and cu-Glag are really needed, you can request new language codes on Phabricator. (see Help:Monolingual_text_languages#Getting_a_language_code_added) --Okkn (talk) 11:05, 25 May 2018 (UTC)
Hello all,
I'd like to correct my previous statement: "Unfortunately, this feature is not available yet: for now, it is only possible to add the x-item part to the mis code". This is not right. It is already possible to add augmented language codes such as de-x-Q980. The problem with L1 was that Sumerian has no language code listed in Wikidata, so we had to find a workaround. For all the other languages that we have a language code for (the ones you can use for monolingual strings as well), you can build augmented language codes in the spelling variant field.
Sorry for the confusion :) Lea Lacroix (WMDE) (talk) 13:54, 25 May 2018 (UTC)
Léa, thanks for the clarification. One more question: is the -x- really needed? Couldn't it just be cu-Q8209, with the -x- added when exporting if it is needed to comply with some standard? --Micru (talk) 14:00, 25 May 2018 (UTC)
The x is there indeed to comply with a standard. We can think about how to hide it in the interface. Lea Lacroix (WMDE) (talk) 14:28, 25 May 2018 (UTC)
@Micru: the -x- comes from the IETF language tag (Q1059900); it means we use a private code because no code exists yet. That is a bit strange here, as the codes cu-Cyrl and cu-Glag do exist, but they are not yet implemented in the Wikidata system. Should we ask for all ISO 15924 subtags to be added, or should we ask one by one? (Not all combinations make sense, but on the other hand some exotic and unexpected combinations may be needed for Lexemes, like the inscription on Bornholm amulet (Q25056639), which is in la-Runr!) Cdlt, VIGNERON (talk) 16:48, 25 May 2018 (UTC)
@VIGNERON: Ideally we should be able to combine any language code with any script code. For the example of L1 we would need sux-latn and sux-xsux, so I wonder if the software can handle having language codes and script codes, without creating a unified language code+script code. Your suggestion of adding ISO 15924 script codes seems legitimate as long as the software can handle it. Do we have items for all the scripts? Maybe that could be the first step.--Micru (talk) 18:07, 25 May 2018 (UTC)

Agree with Vigneron. It feels weird not to be able to just use cu-Cyrl and cu-Glag, which I expected to be able to use. --Denny (talk) 16:54, 25 May 2018 (UTC)

@Denny, VIGNERON, Lea Lacroix (WMDE): Based on this conversation I have opened this ticket.--Micru (talk) 08:48, 30 May 2018 (UTC)
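
For readers following this thread, the augmented codes discussed above (a language code, the private-use marker -x-, then an item ID, as in cu-x-Q8209) can be sketched in a few lines of Python. This is only an illustration: the function names and the regular expression are assumptions made for this example and are not taken from the Wikibase code, which may validate codes differently.

```python
import re

# Assumed shape, inferred only from the examples in this thread:
# a 2-3 letter language code, "-x-", then a Wikidata item ID.
# The real Wikibase validation may accept more (or fewer) forms.
AUGMENTED_CODE = re.compile(r"(?P<lang>[a-z]{2,3})-x-(?P<item>Q[1-9][0-9]*)")


def build_code(lang: str, item: str) -> str:
    """Join a language code and an item ID into 'lang-x-Qnnn' (hypothetical helper)."""
    code = f"{lang}-x-{item}"
    if AUGMENTED_CODE.fullmatch(code) is None:
        raise ValueError(f"not a valid augmented code: {code!r}")
    return code


def parse_code(code: str) -> tuple[str, str]:
    """Split 'lang-x-Qnnn' back into (language code, item ID) (hypothetical helper)."""
    match = AUGMENTED_CODE.fullmatch(code)
    if match is None:
        raise ValueError(f"not a valid augmented code: {code!r}")
    return match.group("lang"), match.group("item")


if __name__ == "__main__":
    # Old Church Slavonic in Cyrillic (Q8209) and Glagolitic (Q145625) script.
    print(build_code("cu", "Q8209"))
    print(build_code("cu", "Q145625"))
    print(parse_code("sux-x-Q8229"))
```

A script-subtag form such as cu-Cyrl is deliberately rejected by this sketch, since the thread reports that such codes were not accepted in the spelling variant field at the time.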

Homonymous forms

Only singular (Q110786) is listed as a grammatical feature of L11-F1 (visualize) but one would use visualizes for the third-person singular. How to handle this, should we have five Forms with the label visualize (first-person singular, second-person singular, first-person plural, second-person plural, third-person plural)? Three (first-person singular, second-person singular, plural)?

And again, can forms in a compound tense also be included? If German perfect forms should not be recorded in Wikidata, a property auxiliary verb may be created (Betterknower wanted to suggest it) to indicate which verb is used for perfect forms of a verb (haben or sein). -- IvanP (talk) 22:05, 29 May 2018 (UTC)

@IvanP: isn't this (more or less) the same question as #English verbs? (We should avoid scattering discussions in different places.) Cdlt, VIGNERON (talk) 07:18, 30 May 2018 (UTC)

Formats for grammatical features input field

The usual input fields for items allow pasting QIDs and urls of entities. It would be good if that worked here too.
--- Jura 11:16, 27 May 2018 (UTC)

Yes, it is working: you can paste either a QID or a URL, for both grammatical features and lexical category. Once the ID or URL is pasted in the field, press the Enter key and it will be converted. Lea Lacroix (WMDE) (talk) 08:24, 28 May 2018 (UTC)

Encoding issues should now be fixed

Hello,

Many of you noticed some encoding issues on the labels of items displayed on Lexemes, in several languages (French, Hungarian, Korean, etc.)

This issue should be fixed in a few hours. If you encounter any similar problem while watching or editing Lexemes, please let us know on this ticket.

Cheers, Lea Lacroix (WMDE) (talk) 16:48, 30 May 2018 (UTC)

Update: it's now fixed. If you still see a problem, try to purge the page. If it's still there, let us know in the ticket :) Lea Lacroix (WMDE) (talk) 13:31, 31 May 2018 (UTC)

Serbo+-Croatian

What will we do with the Serbo-Croatian language(s)? Depending on the point of view, there can be from 1 (sh) to 3 (sr+hr+sh) languages. How many will we use? E.g. Lexeme:L2081. There is also a problem with codes for script variants. --Infovarius (talk) 13:49, 31 May 2018 (UTC)

You can have a look here. Pamputt (talk) 14:56, 31 May 2018 (UTC)