Wikidata talk:Lexicographical data/Archive/2016/11

This page is an archive. Please do not modify it. Use the current page, even to continue an old discussion.

Wikidata: only for English Wiktionary -- or -- do you want to destroy other language versions?

There is hardly any information about your plans in other language versions (in the respective language). For example, I couldn't find any information in German ...
So Wiktionarians of these other language versions have no chance to react, no chance to ask questions and no chance to illustrate potential problems. Most of them do not agree with copying their hard work from cc-by-sa Wiktionary to cc0 Wikidata! If you don't care, you are going to lose most Wiktionarians .... If you don't care about destroying the Wiktionaries of other language versions, carry on ... --Agruwie  talk   02:27, 28 September 2016 (UTC)

@Agruwie: I completely agree with your points. About the German version, most of the pages here are translatable. They are just waiting for someone who could translate them into German. Thus, if you speak German, maybe you could translate some of these pages (it has already been done for French). Pamputt (talk) 22:54, 1 October 2016 (UTC)
As we also had the same feeling, we opened a local discussion page on this topic on the French Wiktionary and after a while we tried to summarize our ideas to post them here. Maybe not the best way to proceed, but it helped us at some point. Noé (talk) 09:06, 9 November 2016 (UTC)

Impact on contributors

@Denny: do you have any numbers about the Wikidata community? I mean, has the number of Wikipedia contributors (let us say on the English Wikipedia, or the German one if you know it better) increased or decreased since Wikidata was launched, 4 years ago? How is the number of Wikidata contributors evolving? If a decrease has been observed, is it faster than before the Wikidata launch? The hidden question is whether Wikidata "stole" contributors from the Wikipedias. Having such stats may be interesting. Pamputt (talk) 21:07, 9 November 2016 (UTC)
We did some analysis back for Wikimania '13 or '14 IIRC (Lydia might remember better). Back then the Wikidata community was about half the size it is today (it has been growing pretty consistently, and still continues to grow - see [1] and [2]). Back then we saw that about half of Wikidata contributors were new, i.e. had not been Wikimedians before that. I would expect that the proportion of new editors, who have not been Wikimedians before, has since then further increased (if there was an effect of contributors moving from Wikipedia to Wikidata, this would be expected in the early days, not later on).
We found no evidence of cannibalization of Wikipedia communities by Wikidata. There was a very measurable effect of decreasing number of edits and size of certain Wikipedias, in particular for the smaller projects - but this was mostly due to the end of the many edits that were needed to maintain the interwiki system. This can actually be seen very nicely in this chart: [3]. There is this explosion of edits in one month of 2013, which was the removal of many of the interwiki links, and after that the edit rate has decreased visibly compared to before.
I hope that answers your questions. In general it shows that Wikidata was capable of creating its own community, and was not a threat to the existing Wikipedia communities. In fact, it relieved them of certain tasks (in particular interwiki link maintenance) and thus allowed the existing contributors in the Wikipedias to be more effective and pursue higher-value goals. I very much expect a similar development for the Wiktionary projects. --Denny (talk) 01:38, 10 November 2016 (UTC)
Thanks a lot Denny. It is exactly what I wanted to know. Like Noé, I will translate this analysis into French, if you agree, because some people are asking this question on the French Wiktionary. Pamputt (talk) 06:49, 10 November 2016 (UTC)
Thank you Pamputt. No need to ask me to translate - feel free to translate anything I say. --Denny (talk) 17:11, 10 November 2016 (UTC)

Improvement for connectivity

Hi @Lea Lacroix (WMDE), Denny, Lydia Pintscher (WMDE):

Today, in an effort to help you clarify how cool Wikidata will be for Wiktionary, I am interested in the advantages for connectivity and have three questions. Noé (talk) 09:46, 10 November 2016 (UTC)

  1. Currently, there are many links from Wiktionary definitions to Wikipedia articles, but they are managed by hand. They are mainly in a See also section. Could Wikidata help to maintain those? Through which process? By a link between an L-Item and a Q-Item, or with a specific link chosen locally?
  2. There are also connections with Wikisource in references for illustrative citations, also added by hand. Would it be possible to auto-detect whether a reference is present on Wikisource and add a link in Wiktionary after it is published (by scanning the whole database?)?
  3. Will it be possible in Wikisource to have a list of Wiktionary pages using a specific page, similar to how Commons offers a list of external pages using a picture?
Thanks for your questions. I'm looking for the precise answers and I'll reply as soon as possible :) Lea Lacroix (WMDE) (talk) 10:15, 10 November 2016 (UTC)
I'd say that Wikidata can't help with these. Especially #1. Imagine that a Q-item has many sitelinks in many languages. Each language can have its own spelling of the title (not to mention synonyms...) and thus its own link to Wiktionary. How would all these different L-items be linked in one Q-item? I can imagine this only for disambiguation pages. --Infovarius (talk) 15:49, 10 November 2016 (UTC)
Hello @Noé:, after discussing with my colleagues, here are the answers to your questions.
  1. Yes. In the Lexemes (string of characters), we will have one or several Senses (meaning of the word). In these Senses, we will be able to add a link to a Q-id, that means linking a meaning to a concept. Then, in this item, we already have the external links to Wikipedia articles. With this structure, you will be able to automatically display the Wikipedia articles links related to a definition if you want to.
  2. Short version of the answer: no, it's probably better to do this with a bot. We will provide the structure of the data to allow users to create automatic scripts like this, but we will not provide the feature ourselves for the moment.
  3. It's not possible for now. In our data model, links to Wikisource will be added as a reference in a statement. Currently, we don't have a proper data model for these references, so it's not possible to track them as you suggest. Maybe in the future this will be possible.
If that was unclear or some questions remain, please ask! Lea Lacroix (WMDE) (talk) 15:15, 11 November 2016 (UTC)
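The chain described in answer 1 (Lexeme, then Sense, then Q-item, then Wikipedia sitelink) can be pictured with a minimal Python sketch. The class names, field names and the placeholder identifier "Q1" are assumptions made for illustration only; the actual data model had not been published at this point.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Item:
    """Hypothetical stand-in for a Q-item with its sitelinks."""
    qid: str
    sitelinks: Dict[str, str] = field(default_factory=dict)  # wiki code -> article title


@dataclass
class Sense:
    """Hypothetical stand-in for one meaning of a lexeme."""
    gloss: str
    item_qid: Optional[str] = None  # link from the meaning to a concept


@dataclass
class Lexeme:
    """Hypothetical stand-in for a lexeme with its senses."""
    lemma: str
    language: str
    senses: List[Sense] = field(default_factory=list)


def wikipedia_links(lexeme: Lexeme, items: Dict[str, Item], wiki: str) -> List[Tuple[str, str]]:
    """Return (gloss, article title) for each sense whose item has a sitelink on `wiki`."""
    links = []
    for sense in lexeme.senses:
        item = items.get(sense.item_qid) if sense.item_qid else None
        if item is not None and wiki in item.sitelinks:
            links.append((sense.gloss, item.sitelinks[wiki]))
    return links


# Placeholder data: "Q1" is not a real identifier.
items = {"Q1": Item("Q1", {"frwiki": "Pain"})}
pain = Lexeme("pain", "fr", [Sense("baked food", "Q1"), Sense("sense without item")])
print(wikipedia_links(pain, items, "frwiki"))
```

A template on a Wiktionary could, under these assumptions, call something like `wikipedia_links` to fill a See also section instead of keeping the links by hand.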

Hi @Lea Lacroix (WMDE):. Thank you for your answers.

  1. Good. I hope there is documentation to enhance our template with this somewhere. If not, I beg you to add this at the end of your to-do list. Well, if it works nicely, it is a plus for an integration of Wiktionaries into Wikidata.
  2. Ok. I think it can be a joint development at some point in the future.
  3. I wonder whether you have swapped your answers. I felt 3 was more doable by bot and 2 with a different data model. Well, making Wikisource able to track external reuse of its data is not so closely linked with Wikidata; maybe it is only MediaWiki development. So, I can just ask for it in the ongoing survey héhéhé Noé (talk) 15:08, 16 November 2016 (UTC)
Yes, we are aware that's missing and we're working on providing better documentation about the data model we plan to deploy. When it is deployed, there will also be documentation.
Feel free to add requests in the survey, that's what it's made for :) Lea Lacroix (WMDE) (talk) 16:10, 16 November 2016 (UTC)

Hello all,

Interwiki links are the first step of our lexicographical data project. This is actually not related directly to Wikidata: we use two MediaWiki extensions, Cognate and InterwikiSorting, that will allow automatic interwiki links from a Wiktionary to another, based on identical page names.
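To illustrate the idea of linking "based on identical page names", here is a minimal Python sketch. It is not the Cognate implementation; the function names and the input shape (a dictionary from wiki code to page titles) are assumptions chosen for the example.

```python
from collections import defaultdict


def build_interwiki_index(pages_by_wiki):
    """Map each title to the sorted list of wikis that have a page with exactly that title."""
    index = defaultdict(list)
    for wiki, titles in pages_by_wiki.items():
        for title in set(titles):
            index[title].append(wiki)
    return {title: sorted(wikis) for title, wikis in index.items()}


def interwiki_links(index, wiki, title):
    """Return the other wikis that hold a page with the same title."""
    return [w for w in index.get(title, []) if w != wiki]


pages = {"en": ["tom", "Tom"], "fr": ["tom"], "de": ["Tom"]}
index = build_interwiki_index(pages)
print(interwiki_links(index, "en", "tom"))  # wikis with a page titled exactly "tom"
```

With a strict comparison like this one, "tom" and "Tom" are different titles and never link to each other.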

A testing environment has been created for the extension on Wikimedia Labs made of the following sites:

You can see several examples:

  • EN, FR, DE - Pages showing a regular title that exactly matches other titles
  • EN, DE - Pages with a slightly different character but matching anyway (here, ... and …)
Good job! And interwiki links for Categories, Appendices and other namespaces can have ordinary Q-linking? --Infovarius (talk) 15:50, 10 November 2016 (UTC)
Yes, exactly. --Denny (talk) 18:35, 10 November 2016 (UTC)

How to try in the testing environment:

  • go to one of the URLs above
  • create a new page (replace "Main Page" in the URL with something else, then click "create this page", add test content in the page and save)
  • go to another of the test wikis, type the same string in the URL to create an identical page
  • automatic links are created in the left column, and you can switch from one language to another

Please feel free to run some tests and give us some feedback! We're working on this right now, and if you have special needs or questions, this is the best moment :) Lea Lacroix (WMDE) (talk) 10:15, 10 November 2016 (UTC)

Orthography

Orthography - including majuscule/minuscule, diacritics - is relevant. wikt:en:Appendix:Variations of "tom" is a simple example; is finding these variants useful if it fails to find wikt:en:Thom? - Amgine (talk) 16:59, 10 November 2016 (UTC)
Can you point to how current Wiktionary pages do it that would be done wrong by the extension? wikt:en:tom seems to link to wikt:fr:tom, and wikt:en:Tom to wikt:fr:Tom and vice versa - that would be the behavior of the extension now. None of them link directly to any variations on other language editions. I am probably missing the exact use case you have in mind. Thank you! --Denny (talk) 18:35, 10 November 2016 (UTC)
On en.WT there is a template in section 0 - {{also|Tom|TOM|tóm|tõm|tǫ̂m|t.o.m.|tom'|Appendix:Variations of "tom"}} This is regularly maintained via bots as well as manually.
The exact use case is interwikiing each of these variants and any others for each language wherever they exist. Assuming just the 7 currently listed as variants on en.WT, starting from wikt:en:tom, that is currently 78 links on each page, 40 total wiki languages. I believe the maximum number of alternative unicode combinations for 'tom' is upwards of 3^76 (tṫẗťƭʈƫṭțţṱṯŧⱦȶoòóôõōŏȯöỏőǒȍȏơǫọɵøồốỗổȱȫȭṍṏṑṓờớỡởợǭộǿɔœƍⱺƣmḿṁṃɱ), so the number of links on en.WT and across all languages will likely be higher. Is this haystack of links useful? Should wikt:en:Tom (given name, nickname, large bell) link to wikt:vi:tõm (onomatopoeia of splash)? or should, instead, Tom only interwiki to exactly the same spelling and case on other language wiktionaries? - Amgine (talk) 19:21, 10 November 2016 (UTC)
Hi Amgine. I think what you are talking about is not interwiki. To me, interwiki only applies to exactly identical written forms (except for apostrophes and capital letters). So, to answer your last question, wikt:en:Tom must link only to wikt:vi:Tom (and other languages) and not to wikt:vi:tõm. Pamputt (talk) 09:32, 11 November 2016 (UTC)
Hi Pamputt. The second set of examples linked Foo%E2%80%A6 and Foo..., which as you can see use slight variant unicode characters. This suggests the algorithm may find all 78 of the links I manually counted, and others as well. - Amgine (talk) 01:45, 12 November 2016 (UTC)
@Amgine: I still do not understand the link with interwiki stuff. If we consider your example, let us suppose the extension is able to detect that "tom", "Tom" and "TOM" are the same. What will the interwiki links look like? To me, I expect "tom" in French to link only to "tom" in other languages, not to Tom, TOM or anything else. What you are talking about is not interwiki and not related to the "Cognate" extension. If developers do what you want, it will be in another extension and will not be related to interwiki. Pamputt (talk) 08:14, 12 November 2016 (UTC)
@Pamputt:: if you look at http://frwiktionary-cognate.wmflabs.org/index.php/bleu%E2%80%A6, you will see there are two interwiki links to de.WT: one is to bleu…, one is to bleu... These are two different titles, one using the '…' (single character ellipsis) and one using '...' (three dots.) This suggests other combining multibyte unicode characters may result in multiple spellings being linked as interwikis. - Amgine (talk) 06:36, 13 November 2016 (UTC)
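The behaviour described here can be reproduced with a small Python sketch. The normalization rule below (treating the single-character ellipsis U+2026 as three dots) is an assumption inferred from the bleu… example, not the extension's actual code, and the function names are hypothetical.

```python
def match_key(title):
    """Hypothetical normalization: treat the single-character ellipsis like three dots."""
    return title.replace("\u2026", "...")


def links_from(title, other_wiki_titles):
    """Return every title on the other wiki whose normalized form matches `title`."""
    key = match_key(title)
    return sorted(t for t in other_wiki_titles if match_key(t) == key)


# A wiki that holds both spellings receives two links from one page.
de_titles = ["bleu\u2026", "bleu...", "bleu"]
print(links_from("bleu\u2026", de_titles))
```

Any normalization that maps several distinct titles to one key has this effect whenever a single wiki holds more than one of those titles.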
@Amgine: In my opinion, it should not happen, since each project chooses to use either "..." (three dots) or '…' (single character). As far as I know, no Wiktionary project uses both representations. Usually they choose one and use redirects for the other (so that interwikis work). So in this case, if the French Wiktionary chooses to use three dots and the German Wiktionary uses the single character, it will match automatically, nothing more. Pamputt (talk) 10:01, 13 November 2016 (UTC)
@Lea Lacroix (WMDE): could the dev plan to add a feature allowing to check such cases? Pamputt (talk) 10:02, 13 November 2016 (UTC)
I'm sorry, I'm not sure I understood what you would like to check. Can you explain please?
Also, in the case of bleu… there are indeed two links to the de:wiki. What would you expect instead? Thanks Lea Lacroix (WMDE) (talk) 10:15, 15 November 2016 (UTC)
@Lea Lacroix (WMDE): That depends on the approach of the Wiktionary. On en.WT the community prefers each unique spelling should interwiki only to each exactly the same spelling, so only one of those should be an interwiki. As EncycloPetey points out below, the ellipsis is not the only combining unicode character, there are others which likely have the same outcome. And different Wiktionaries have chosen different standards as regards interwikis with this specific issue, so for them they should both be present. - Amgine (talk) 16:34, 15 November 2016 (UTC)
@Lea Lacroix (WMDE): what I would like to check is whether there are several links to the same Wiktionary on a given page. In the example of bleu%E2%80%A6 , I would expect there to be a page on which this page would be listed. Basically, at least on the French Wiktionary, we do not have any page with two (or more) interwiki links to the same Wiktionary. So if such cases exist, I would like to know about them in order to fix them (considering the convention we use on our Wiktionary). Pamputt (talk) 18:06, 15 November 2016 (UTC)
Thanks for your answers. So, if I understand correctly, in the example of bleu%E2%80%A6, this should ideally not happen, but will happen sometimes. This is not the extension behaving the wrong way. You need a tool to check these "multiple links" cases and fix them. Is that correct?
What would you like this feature to look like? A general list of all the pages that contain more than one link to a Wiktionary? A special page by Wiktionary page? A gadget to highlight the multiple links, by coloring them in red for example?
To understand better the issue: once you notice that on your main Wiktionary, the French one, on the page bleu%E2%80%A6 there are two links directing to two slightly different pages on German Wiktionary, what will you do then? How would you like to "fix" this? Lea Lacroix (WMDE) (talk) 10:35, 16 November 2016 (UTC)
@Lea Lacroix (WMDE): sorry for the delay. Here is what I would like. Basically, I am thinking of a single page (a global Special page for example) that lists all the pages with two or more interwiki links to a given Wiktionary. Once we know these pages, we can fix them manually on the other Wiktionary. If we keep the example of bleu..., it is probable that the German Wiktionary would like to redirect one page to the other. Pamputt (talk) 07:01, 21 November 2016 (UTC)
@Pamputt: No problem. I tried to summarize what we talked about in this ticket. Please tell me if I missed something! Lea Lacroix (WMDE) (talk) 10:13, 22 November 2016 (UTC)
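The kind of report requested here could be produced along the following lines. This is only a Python sketch under assumed inputs: a dictionary from page title to its list of (target wiki, target title) links. Neither the function name nor the input format comes from the extension.

```python
from collections import Counter


def pages_with_duplicate_targets(interwiki_links):
    """Return {page: [wikis linked more than once]} for pages that have such wikis."""
    report = {}
    for page, links in interwiki_links.items():
        counts = Counter(wiki for wiki, _title in links)
        repeated = sorted(wiki for wiki, n in counts.items() if n > 1)
        if repeated:
            report[page] = repeated
    return report


links = {
    "bleu\u2026": [("de", "bleu\u2026"), ("de", "bleu..."), ("en", "bleu\u2026")],
    "tom": [("de", "tom"), ("en", "tom")],
}
print(pages_with_duplicate_targets(links))
```

The output lists only the pages that need attention, which matches the workflow of fixing them manually on the other Wiktionary.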
@Lea Lacroix (WMDE): looks fine to me. Thank you. Pamputt (talk) 14:17, 22 November 2016 (UTC)

Makaf

I'm not sure if you should try to solve this, but different Wiktionaries sometimes standardise differently. For example, the English Wiktionary encodes makafs, a kind of Hebrew hyphen, as <־> (U+05BE), which is the proper way in my opinion, while the Hebrew Wiktionary uses <-> (U+002D), I think mainly for historical reasons. Similarly for gereshes and gershayims. The solution for this has been to create redirects on both sides, though this is currently done manually, so some are missing. Another example is the Hungarian Wiktionary, which, if I recall correctly, standardised on forms with nikud (i.e. vowel points) while others did not. So entries like "על־יד" vs "על-יד" and "בע״מ" vs "בע"מ" and "חתול" vs "חָתוּל" are the same. You can find many similar examples for different languages. Enosh (talk) 15:33, 12 November 2016 (UTC)
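As a rough illustration of how the three Hebrew pairs above could be made to compare equal, here is a Python sketch. The mapping table and the decision to drop every nonspacing mark are assumptions for the example; a real rule set would have to be agreed per language, and stripping all marks would be far too aggressive for most other scripts.

```python
import unicodedata

# Assumed punctuation equivalences, taken from the examples in the comment above.
PUNCTUATION_MAP = str.maketrans({
    "\u05be": "-",   # makaf -> hyphen-minus
    "\u05f3": "'",   # geresh -> apostrophe
    "\u05f4": '"',   # gershayim -> quotation mark
})


def hebrew_match_key(title):
    """Drop nonspacing marks (which include nikud) and unify the punctuation listed above."""
    decomposed = unicodedata.normalize("NFD", title)
    without_marks = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
    return without_marks.translate(PUNCTUATION_MAP)


with_makaf = "\u05e2\u05dc\u05be\u05d9\u05d3"    # spelled with makaf (U+05BE)
with_hyphen = "\u05e2\u05dc-\u05d9\u05d3"        # spelled with hyphen-minus (U+002D)
print(hebrew_match_key(with_makaf) == hebrew_match_key(with_hyphen))
```

Two titles would then be linked when their keys are equal, while each Wiktionary keeps its own preferred spelling as the page title.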

There is a similar problem for certain French and Slovak characters, such as Ľ / ľ and ď. The coding across Wiktionaries is not yet consistent. In some projects these are coded as a single character, but in others as two characters. --EncycloPetey (talk) 16:33, 12 November 2016 (UTC)
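If the two-character form is a base letter followed by a combining caron (an assumption; it could also be a letter followed by an apostrophe, which this sketch does not cover), Unicode normalization already makes the two encodings comparable. A minimal Python sketch:

```python
import unicodedata

precomposed = "\u013e"   # LATIN SMALL LETTER L WITH CARON, one code point
decomposed = "l\u030c"   # "l" followed by COMBINING CARON, two code points


def same_title(a, b):
    """Compare two titles after bringing both to the composed form (NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(precomposed == decomposed)            # raw comparison of the two strings
print(same_title(precomposed, decomposed))  # comparison after normalization
```

The same comparison applies to ď (U+010F) and its decomposed form.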
@Enoshd: interesting. These are typically the cases that have to be tested. I think we just need to make the devs aware of them so that they can take them into account. This is more or less already the case with the apostrophes. Do you think you could summarise all the pairs that exist? Pamputt (talk) 22:17, 12 November 2016 (UTC)
No, I'm probably only aware of a fraction of the cases like this. There are similar issues in Ancient Greek, where the coronis that ends an abbreviated form may be rendered either with the Unicode character for the coronis, or with an apostrophe. The independent coronis was added to the Greek Unicode set rather late, so many people aren't aware the character even exists. --EncycloPetey (talk) 23:25, 12 November 2016 (UTC)
Hmmm, it is a pity. However, it would be really nice to know which pairs are used on your Wiktionary. Maybe you could start a discussion on the beer parlour or elsewhere to find out. Pamputt (talk) 23:44, 12 November 2016 (UTC)
Knowing for just one Wiktionary won't help. We have to know where the differences are, and the differences exist between Wiktionaries. The core problem is sixfold: (1) Each Wiktionary makes independent decisions about coding special characters and diacritics. (2) Those decisions are usually made by one or two editors who work in the relevant language. (3) These choices are seldom documented. (4) Editors come and go, so there may not even be anyone at a particular project who currently works in a particular language. (5) These choices may be made without regard for, or knowledge of, what other Wiktionaries are doing for that same issue. (6) The editors themselves may be unaware of the issue or even the possibility of a different approach.
So, even if I were aware of all the possible choices and solutions at the English Wiktionary, that would not help discover all the variations at all the other Wiktionaries. And there may be many more such issues at the English Wiktionary in languages that I did not edit. I think there was a discussion some years ago about a problem with initial letters in Arabic, but it's not a language I read so I didn't follow the details. Also, I haven't been very active on the English Wiktionary for the past several years, and a lot may have changed since then. I just happen to know about the ones that I mentioned because I know some French, some Ancient Greek, and have a passing familiarity with Western Slavic languages. --EncycloPetey (talk) 00:13, 13 November 2016 (UTC)
<nods> This is the same issue I tried to point out above. (privately rages @ EncycloPetey for not being active on en.WT.) - Amgine (talk) 21:28, 13 November 2016 (UTC)
@EncycloPetey, Amgine: Thanks for your answers. As far as I understand, it's not realistic to try to create from scratch a complete list of all the pairs of matching characters and all the differences in usage between all the Wiktionaries. Besides, this list would probably evolve over time, depending on the decisions made by the communities.
Then, how can we provide a solution that fits your needs and makes the interwiki links as accurate as possible? For example, should we provide a tool/page where you can add/edit the matching characters of your own Wiktionary, to improve the behaviour of the extension? Lea Lacroix (WMDE) (talk) 10:43, 16 November 2016 (UTC)
@Lea Lacroix (WMDE): per-language configuration: something like STRICT ($title === $otherWiktionaryTitle) || TIGHT (one link per foreign wiki) || RELAXED (all interwikis) || PROMISCUOUS (fuzzy matching) with the local community determining which best meets their local needs. - Amgine (talk) 19:20, 16 November 2016 (UTC)
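One possible reading of these four modes, written as a Python sketch. The exact semantics of each mode, the `normalize` rule and the fuzzy cutoff are assumptions made here for illustration; the comment above only names the modes.

```python
import difflib
from enum import Enum


class MatchMode(Enum):
    STRICT = "strict"            # identical titles only
    TIGHT = "tight"              # at most one link per foreign wiki
    RELAXED = "relaxed"          # every title with the same normalized form
    PROMISCUOUS = "promiscuous"  # fuzzy matching


def normalize(title):
    """Hypothetical normalization: single-character ellipsis equals three dots."""
    return title.replace("\u2026", "...")


def select_links(mode, title, foreign_titles, cutoff=0.8):
    """Choose which titles on ONE foreign wiki get linked from `title`."""
    if mode is MatchMode.STRICT:
        return [t for t in foreign_titles if t == title]
    equivalent = sorted(t for t in foreign_titles if normalize(t) == normalize(title))
    if mode is MatchMode.TIGHT:
        # Prefer the exact title; otherwise keep a single equivalent one.
        return [title] if title in equivalent else equivalent[:1]
    if mode is MatchMode.RELAXED:
        return equivalent
    # PROMISCUOUS: similarity-based matching from the standard library.
    limit = max(1, len(foreign_titles))
    return sorted(difflib.get_close_matches(title, foreign_titles, n=limit, cutoff=cutoff))


de_titles = ["bleu\u2026", "bleu...", "vert"]
for mode in MatchMode:
    print(mode.name, select_links(mode, "bleu\u2026", de_titles))
```

Each community would pick one mode in its local configuration, so the en.WT preference and the preference of other Wiktionaries could coexist.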
I am afraid that these words with different characters are currently not connected together, so the Cognate extension will make this case neither better nor worse. So this is a feature request, not a blocker. JAn Dudík (talk) 21:28, 21 November 2016 (UTC)

Improvement

There is a joke in French saying "you offer a hand, they want the whole arm", so I am back with a suggestion to improve this not-yet-finished extension! We've been chatting with Lyokoi about this extension, and it appears it could be even better if the link had a different color when the other project has a section for the language of the project you are reading. I mean, you are on the French Wiktionary entry pain and there is a link to the English Wiktionary entry pain, but is it only because a page with that title exists there, or because the French word is also defined there? It would be better to know that without visiting the other projects! Another improvement, maybe easier to develop, might be to indicate in parentheses the number of language sections in the other projects. To keep the pain example, I'll be very jealous to discover that the English Wiktionary has eleven sections and the French Wiktionary only four, and maybe I'll become eager to contribute more by copying some (sourced if possible) sections from en to fr! Well, that said, we are not in a hurry. It is already tremendously awesome to see devs working on Wiktionary improvements! Noé (talk) 15:45, 14 November 2016 (UTC)

I expected no less from you ;) Suggestions are always a good thing, even if we can't promise to achieve all of them.
Thanks for your ideas and the related use cases. I'll let you know if this is technically possible or not. Lea Lacroix (WMDE) (talk) 10:31, 15 November 2016 (UTC)
@Noé: I created two tickets: T150842 and T150841. Feel free to add more details if necessary. After discussing with the devs, I must tell you that this is quite difficult on the technical level. We understand the interest of these features but they will probably not be integrated in the extension for now. Lea Lacroix (WMDE) (talk) 11:17, 16 November 2016 (UTC)
Thank you Léa! I am sure it is tricky to do and I hope someone will be interested in taking up the challenge. I think nothing happens if nobody speaks about the possibilities, so I am writing and passing on proposals. Noé (talk) 14:57, 16 November 2016 (UTC)

End of the first testing phase

Hello,

Thanks to all for your bug reports and feature ideas. We really appreciate that you took the time to test our very first version of interwiki links and give us feedback. We tracked all of these in Phabricator, and considered all of them. Now, we are moving to a new phase of testing, before the deployment of a first version on the live Wiktionaries.

The bugs that you reported are now fixed. (phab:T150517, phab:T150514)

The rest of your suggestions were requests for new features. We will not be able to work on them for this first version of the extension, for technical or prioritization reasons.

I will inform you as soon as we have new testing tools to show you, and once I know when and how the feature will be deployed on the Wiktionaries. Of course, we still welcome all the bug reports and feature ideas that you send us - you can do it here and I'll take care of the Phabricator tickets.

Thanks again for helping us to make this feature fit to your needs. Feel free to ask me any question :) Lea Lacroix (WMDE) (talk) 17:12, 23 November 2016 (UTC)

Do we want creation of lexical content directly in Wikidata?

Hi all,

I am quite convinced by the intention to store data in a structured database, but we discussed several problems caused by the creation of new content in Wikidata, and the potential emergence of a fork of Wiktionaries in Wikidata. I asked above how to deal with conflicts between lexical values in Wikidata and those in Wiktionaries, and it is as puzzling as it is now with the independence of each project. Well, I am wondering now: Do we want creation of lexical content directly in Wikidata?

In my opinion, no. Nowadays, through manual or automatic processes, adding data in Wikidata is as boring/hard to learn as adding data in Wiktionaries, considering that neither interface is adapted for lexical purposes. And I would prefer tech to improve the already quite efficient Wiktionary interfaces rather than create a new one from scratch for Wikidata and let the Wiktionary wikicode editor, VisualEditor and robust bots perish. I do not want Wiktionaries to become presentation layers. I would prefer that they remain the only entry point for new data.

In this perspective, well-informed data created in Wiktionary can be stored in Wikidata, and the storage can be improved by tagging or whatever operation people may imagine. If the future Wikidata community wants more data than is available by harvesting Wiktionary data, they just have to open a new tab to Wiktionary and create new content there. This is a way to keep trained collaborative lexicographers overseeing content creation, and to avoid the mess of having machine-enthusiast newcomers with an over-simplistic vision of languages playing with lexical data without any knowledge of lexicography. Plus, as there will be two projects, Wiktionary and Wikidata, contributors will have to move from one to the other anyway, at least as long as we don't have a proper editing tool, and I feel it is better to ask the new ones to come to the older project than to ask the trained contributors to move to a new platform. Finally, I see it as the best way to keep the door open for adjustment and diversity.

Ok, I have set out my point of view. Please point out the gaps in my argument. And what are the arguments for creating new lexical data in Wikidata? Noé (talk) 13:03, 24 November 2016 (UTC)

Hello Noé,
Thanks for sharing your thoughts. I understand your point of view, and I'm not going to try to take down your arguments, only to give you my own point of view.
I have the feeling that we're all trying to defend our home-project, without paying attention to the big picture. We are not playing Wikidata vs Wiktionary. We are all part of the Wikimedia projects, a galaxy with the goal to provide free knowledge, with several tools and several communities who work on their expertise domain the best they can.
On Wikidata, we are good at storing and structuring data. We have the infrastructure for that, we have perfectionist editors who love to maintain data until it's accurate and complete, to create tools to make it easier and to build crazy queries to ask all kinds of unexpected questions about the world.
On the Wiktionaries, we work hard on providing not only every definition of every word in every language, but also context, explanations, quotes and examples. We have passionate people who provide their expertise in linguistics, love to learn new things about languages and share them with the world.
We are not working against each other, and we don't want to. We don't want to steal the content and the communities from one project for another. We want to work together with the expertise we have on both sides: storing data / providing context. If we see our work as a part of the global Wikimedia world, then it's okay to store data on Wikidata and display it, enriched, on Wiktionaries. And no one will feel dispossessed that way, provided that we build together some tools to make every editor happy and confident, but I'll talk about that later.
Then, I really think we should not argue about who would "have custody" of the content, and which editors would have to change their habits to edit it.
I know that habits are strong, that each person has his or her favorite Wikimedia project where he or she feels comfortable and useful. I know that Wikidata editors are perfectly happy with their interface, which they find modern and usable, while other Wikimedians discovering the project are completely lost because there is no "edit" button anymore :o I know that Wiktionary editors feel perfectly comfortable with editing the code, using templates they have improved for years, and feel that the best way to add content is to directly access a raw text file. And I know that both communities feel unsure about the idea of opening a new tab and editing the other project. But this is exactly where we need to work together to build a tool that will fit as well as possible with all the people, all the working processes, all the previous and new habits.
That's what we (the Wikidata development team) are going to work on over the next years: trying to understand Wiktionaries, and not only the data structure, but also how the editors work, the differences and complementarities between the language versions, the user habits, tools and gadgets, what they expect from this new structure and the future visual editor on Wiktionary. That will take some time, because we want to do this well. That's also why we are already working on creating the lexicographical data structure on Wikidata, and why, as soon as possible, we will invite you to experiment with it on Wiktionary, to understand what you (both Wiktionary and Wikidata communities) need to make the lexicographical knowledge even better, and what we can provide you to do so.
Like I said, this will take time. I can't tell you right now if and when a visual editor dedicated to Wiktionary and Wikidata content will be created, because we need to understand how to do it, to find time and resources to do it, and time again to do it well. I can tell you that we are working on the same kind of tool for Wikipedia. We plan to improve the existing visual editor to better integrate Wikidata data and allow contributors to edit it easily. This is a long-term effort, and we hope we will provide a first prototype before the end of 2016, and the rest will follow.
In the meantime, we will keep working on getting to know the lexicographical data and the people working on it better, try new features and let you test them to get feedback, and try to reduce the fears that some people may have about the idea of cross-project contributing.
Thanks for reading this long text. I'm ready to translate it into French to share it on your page if you want ;) Lea Lacroix (WMDE) (talk) 15:21, 24 November 2016 (UTC)
Thank you for your reply :) It would be very kind of you to translate your words into French. I try to get the whole picture but it is hard. I really appreciate the work you all are doing. I am very patient, so it's ok if it takes years of collaborative work; we have plenty of projects for Wiktionary that are not tied to Wikidata. I try not to be in a fighting position but to drive the discussions to help everyone clarify potential issues. Above, Amgine and I raised the possibility that Wikidata becomes a fork of Wiktionary. Since the beginning, we have been very concerned by the duplication of data, and I feel we now have a way to better control this issue by controlling where the data are added. To make the idea developed above explicit, let's take an English speaker who wants to add data about French for some reason. I think there is a chance that this new information is imperfect, because French is deadly irregular, grammatical descriptive traditions are different, the source used by the contributor was not accurate (too old, wrong interpretation, non-neutral, etc.) or for another reason. If this new information is added in Wikidata, there is almost no chance that a French-speaking editor will check it. If it is added to the French Wiktionary first, there is a much better chance of a quick a posteriori check. So, I suggest not letting people create new entries directly in Wikidata. Maybe there are other solutions, like improving the watchlist to allow people to follow data creation tagged with a specific language and have notifications in any project they like. Maybe you will prefer this second proposal, but I feel it is hard to develop for plenty of reasons: a transwiki watchlist, a recent changes list by language, the creation of a new community discussing in pidgin in Wikidata. And I feel it is less efficient.
But if you have other solutions or arguments, please, you are welcome to take down my argument and provide yours on this question (Lea or someone else, I don't want you to have nightmares or to quit because of my depressing comments). Noé (talk) 09:57, 25 November 2016 (UTC)
Hey,
No problem, as long as we're trying to understand each other and our mutual concerns, these discussions are fine and useful :)
I don't think that Wikidata will become a fork of Wiktionary. Yes, some data will necessarily be duplicated at the beginning, but if we successfully manage to organize the data, deciding which information will be stored in Wikidata and which will stay in Wiktionary, after some time this will not be a problem.
Thanks for the example, I understand your concern about watching the data related to one language. About that, I have good news: it's possible to do powerful things with the watchlist and recent changes pages. On Wikipedia, it's already possible to display in your Wikipedia watchlist the changes on Wikidata related to the Wikipedia articles you follow. (You can test it by activating the feature on Wikipedia, via Preferences -> Watchlist -> Show Wikidata edits... I write it in English here but it's also available on French Wikipedia.) This has been done to meet a need from Wikipedians who worried about something happening on Wikidata and modifying the content of the articles they follow without them being notified. We could imagine the same for Wiktionaries: display, on your favorite Wiktionary version, the changes made on Wikidata.
As it's possible to create a lot of filters for the watchlist and the recent changes pages, we could also imagine a gadget filtering the recent changes page to display only the changes on words (lexemes) that have one or more senses in French, or only the changes where one of these senses has been modified. Then, editors who want to watch and maintain French data in Wikidata could easily get an overview of what's going on. Would this solution meet the need you expressed?
Can you translate your first message on fr:wkt:Discussion Projet:Coopération/Wikidata then I'll translate mine, and so on? :) Lea Lacroix (WMDE) (talk) 12:20, 25 November 2016 (UTC)
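The recent-changes filter imagined above could be sketched roughly as follows. This is only an illustration under stated assumptions: the `Change` record, its fields and the `filter_changes` function are hypothetical names invented here, not part of MediaWiki or Wikidata.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    """Hypothetical record for one recent change on a lexeme."""
    lexeme_id: str              # hypothetical identifier of the edited lexeme
    sense_languages: frozenset  # languages in which the lexeme has senses
    sense_modified: bool        # True if the edit touched one of the senses


def filter_changes(changes, language, senses_only=False):
    """Keep changes on lexemes that have at least one sense in `language`.

    With senses_only=True, keep only the edits that modified a sense.
    """
    kept = []
    for change in changes:
        if language not in change.sense_languages:
            continue
        if senses_only and not change.sense_modified:
            continue
        kept.append(change)
    return kept


# Made-up sample data for illustration.
recent = [
    Change("lexeme-1", frozenset({"fr", "en"}), sense_modified=False),
    Change("lexeme-2", frozenset({"de"}), sense_modified=True),
    Change("lexeme-3", frozenset({"fr"}), sense_modified=True),
]

french_changes = filter_changes(recent, "fr")
french_sense_edits = filter_changes(recent, "fr", senses_only=True)
```

Here `french_changes` would back a "French data" recent-changes view, while `french_sense_edits` corresponds to the stricter variant that only reports modified senses.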
As said, if the filters can be easily adapted, it may solve some of the issues raised, but it still forces Wiktionarians to jump to Wikidata often, it may disrupt connections with notes, annexes or other content (rich content?), and the data has to be checked in the Wiktionaries as well. So, in my understanding, it still doubles the revision process.
I hadn't seen your message before summarizing my idea in French Wiktionary and in Wikidata:Bistro to ask for others' opinions. Maybe it would be more efficient for us if you answered a summarized version of this discussion there as well? Noé (talk) 15:19, 25 November 2016 (UTC)
I am sorry, @Noé:, I'll try to argue with you. So do you prefer to enter the same information (pronunciation, declension, translations and so on) for the same word in ~180 Wiktionaries manually? Really, ~180 times? Instead of entering this information once, at Wikidata. --Infovarius (talk) 12:47, 25 November 2016 (UTC)
Hi, @Infovarius:! No, I prefer to enter the information once: French in French Wiktionary, English in English Wiktionary, etc. The data for other languages displayed in each project can then be stored in Wikidata or locally, depending on each community. There are more languages than projects, so languages without a dedicated project, as well as protolanguages, can be created in Wikidata; that will create fewer conflicts than creating entries in languages that already have a strong dedicated project. Noé (talk) 13:22, 25 November 2016 (UTC)
I don't understand how you think that would work. The English Wiktionary does not only contain English words, the French Wiktionary does not only contain French words, etc. For example, the pronunciation of the German word "Katze" has already been entered on the German Wiktionary (and many others) but how does that help French speakers? When they go to wikt:fr:Katze, the pronunciation is still missing. There isn't a way to automatically include the pronunciation from the German Wiktionary page in the French Wiktionary page. However, if the pronunciation were stored in Wikidata, it would be possible (if the community wanted to!) to fetch the pronunciation from Wikidata and/or to add a maintenance category so that someone could check it and then add it manually. - Nikki (talk) 19:12, 25 November 2016 (UTC)
@Nikki: Yes, I agree with this workflow. I understood that Wikidata will make it possible to display data in all Wiktionaries dynamically, and I am not opposed to that. I am opposed to letting anyone change the data stored in Wikidata directly in Wikidata for languages that have a dedicated project. So, to get back to your example, let's say a French contributor notes a pronunciation for Katze during a trip to Switzerland, [ˈkɒ.tzø]. She wants to add this wrong pronunciation to fr.wikt, where there is no pronunciation yet. I imagine three scenarios:
  1. now: she adds it to French Wiktionary, and German editors will never know about it or revise it. It will probably stay for a long time, because there are no German editors in French Wiktionary (it depends on the month, but let's keep this idea; it is true for other languages).
  2. your proposal: she adds it to Wikidata (through a future editor tool, or on a new page in Wikidata, with a risk of losing her in a different project), and it is automatically displayed in all Wiktionaries with a Katze entry, but without any checking by German contributors. How is it displayed? With a note describing it? How is it connected with the actual pronunciation? How many conflicts may appear with this indirect way of contributing?
  3. my proposal: she is gently sent to de.wikt when she tries to add the information. She can read how the data is built and how pronunciations are encoded (local rules for German), and add the data in the specific place where it will not harm existing content. Then German admins will check it and can send her feedback saying they don't do syllable separation in German Wiktionary (it is customary in French Wiktionary), and they can help her improve the way she contributes, as usual. Well, I do not exclude the possibility that she becomes intimidated by all the German writing around her, having overestimated her German skills at first. Then she may decide not to contribute at all. Still, I think the benefit-to-time ratio is much better with this option: for the project, because all the non-data content is optimally connected with the data; for editors, because they edit in a warmer environment with trained contributors to help them; for admins, who can keep an eye on the evolution of the data.
It is quite late, so I hope I have clarified my idea with this simple example. Noé (talk) 23:51, 25 November 2016 (UTC)
So, to try to make a technical analogy, you seem to be in favor of some kind of "flagged data", like the "flagged revisions" that some Wikipedias use for articles (changes need to be approved by someone before being published in the main view), with a subset of people allowed to flag the data? Let's imagine - just for the sake of argument, I'm not even suggesting we implement something like that - that we split the linguistic data in Wikidata per language, like "French-related data", "English-related data" and so on, and that there is a "flag" right for each group: for each language, a group of users who are allowed by the community to flag such data before it can be used on Wiktionaries. Would such a process make you happy? author  TomT0m / talk page 11:16, 26 November 2016 (UTC)
First question: no. Last question: yes. Having specific Recent changes pages by language seems a good idea. It was not my initial idea, and I never pushed for an "approved before being publicly displayed" system. I am interested in which website is the entry point for data that will affect Wiktionary, and I think having two competing places where people can add similar kinds of information is problematic. Noé (talk) 15:32, 26 November 2016 (UTC)
I don't understand your answer: my two questions go together, and you can't really answer no and yes in that order, since answering "no" to the first question makes the second pointless. I don't really think it could work like that, as French-related data can, at the moment, be entered in the English Wiktionary, and if fr wikt is the entry point right now, you don't really seem to be interested in controlling French data in the English Wiktionary. It seems to me that the requirement you are trying to meet would force an English editor to go through fr wikt to enter data about French. Second: there are actually other entry points in Wikidata, like bots importing data from external datasets. What's the point of making them respect the formats of fr wiktionary when, once we have a Wikidata data structure for lexicographical data, they will be able to easily produce data in that format, and that data will be consumed by every Wiktionary? Side note: I don't think it will be difficult to migrate the tools that import data so that they migrate data directly into frwiki. This migration is likely to simplify all these tools, as they won't have to care about wikitext formatting. Being very open and wanting to control everything seem to me to be two quite contrary goals. Things don't really add up, especially if the rate of data imports increases in the future for whatever reason and the size of the community remains stable. author  TomT0m / talk page 16:22, 26 November 2016 (UTC)
Option 3 has the same problem I already mentioned. Assuming your example editor is happy to add the pronunciation to the German Wiktionary instead of the French Wiktionary... how does it benefit users of the French Wiktionary? They still have no information about the pronunciation.
Option 2: It is completely up to the local Wiktionaries how they decide to integrate information from Wikidata. For example:
One local Wiktionary could decide to only automatically display the pronunciation if it has a reference.
Another local Wiktionary could decide to display the pronunciation even if it doesn't have a reference but add "(needs verifying)" after it.
A third local Wiktionary which is sceptical about Wikidata could decide to continue to always store pronunciations locally but use Wikidata to find missing data and to compare existing data. They would never display anything from Wikidata but instead automatically add the maintenance categories "Category:Words with pronunciation data available in Wikidata" and "Category:Words with pronunciations different from Wikidata" to their pages so that editors can add/fix data manually.
Because Wikidata is structured data, it is easy to query. For example, German speakers could set up queries to list German words where the pronunciation is missing a reference.
- Nikki (talk) 11:26, 26 November 2016 (UTC)
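The three local policies listed under Option 2 can be sketched as one small function. This is a minimal illustration, not an existing Wiktionary or Wikidata API: the function name `integrate`, the policy labels, and the rule that a locally stored pronunciation takes precedence under the first two policies are all assumptions made here. Only the two category names and the "(needs verifying)" marker come from the discussion.

```python
CAT_AVAILABLE = "Category:Words with pronunciation data available in Wikidata"
CAT_DIFFERENT = "Category:Words with pronunciations different from Wikidata"


def integrate(policy, local_ipa, wikidata_ipa, has_reference):
    """Return (text to display or None, maintenance categories).

    policy is one of the hypothetical labels:
      "referenced_only" - show Wikidata data only if it has a reference
      "flag_unverified" - show it anyway, marking unreferenced data
      "local_only"      - never show Wikidata data, only add categories
    """
    if policy == "local_only":
        categories = []
        if wikidata_ipa and not local_ipa:
            categories.append(CAT_AVAILABLE)
        elif wikidata_ipa and local_ipa and wikidata_ipa != local_ipa:
            categories.append(CAT_DIFFERENT)
        return local_ipa, categories

    if policy not in ("referenced_only", "flag_unverified"):
        raise ValueError(f"unknown policy: {policy}")

    # Assumption: locally stored data wins over Wikidata when both exist.
    if local_ipa:
        return local_ipa, []
    if not wikidata_ipa:
        return None, []
    if has_reference:
        return wikidata_ipa, []
    if policy == "referenced_only":
        return None, []
    return f"{wikidata_ipa} (needs verifying)", []


# Illustrative values: the Wikidata string is a sample, the local one is
# the deliberately wrong pronunciation from the example above.
WIKIDATA_IPA = "ˈkat͡sə"
LOCAL_IPA = "ˈkɒ.tzø"

shown, cats = integrate("flag_unverified", None, WIKIDATA_IPA, has_reference=False)
```

Because the decision is taken on the consuming side, each Wiktionary can change its policy without any change to the data stored in Wikidata.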
I don't get your criticism of Option 3. Local Wiktionaries have the same options as the ones you listed, and French Wiktionary can display the information with or without limitations, just as with solution 2. I am not talking about whether to include data or not; I am imagining where people will go to add data, and I am trying to develop a solution that avoids duplication of data by spreading the entry points across several Wiktionaries rather than having Wikidata as the only place to add new data. Again, I am not talking about linking or tagging; I let Wikidatians do what they want with the data, as long as they don't break the work already done in Wiktionary on information that is not data, such as etymology, notes, etc.
Adding maintenance categories is very interesting. I like those possibilities very much, but I hope a maintenance category such as "Category:Words with pronunciation data available in Wikidata" will be used in VisualEditor to display suggestions during the editing process. If not, I can't imagine any pleasure in doing this operation manually, by exploring the category. I suppose we may find a couple of people to dedicate time to this, or the Foundation will pay for it. Haha. Still, this point is not an answer to the problem I mentioned above. Well, I fear my skills at explaining myself in English may be too low for this discussion. No matter, I will continue, because we will have many conversations with non-native English speakers in the future. Noé (talk) 15:32, 26 November 2016 (UTC)
The problem with a "Category:Words with pronunciation data available in Wikidata" is that people with bots will dump the data into Wiktionaries. However, pronunciation is not that simple in some languages. In English, the pronunciation of lead (the metal) is not the same as lead (to guide), and there are similar differences between tear (to rip) and tear (from the eye), and between contract (the noun) and contract (the verb). The pronunciation of a word is tied to a particular meaning or meanings, and not to the spelling of the word. Nor is English the only language that has this feature; I know that Latin has it as well.
And that doesn't consider that the pronunciation of words also varies by region, and that regions can be identified differently. If the pronunciation of a French word is given, are Quebec and Ontario listed separately? Are they grouped together? Are they included as Canadian, or what? There is a whole level of geographical identification support that must be considered because of regional differences in pronunciation.
Neither of these issues is even close to being addressed yet by Wikidata. --EncycloPetey (talk) 16:47, 26 November 2016 (UTC)
Of course it's not. We don't have anything running yet. But we'll be able to make statements with qualifiers and all to model all this on the relevant entities. It's up to the community; we'll be able to deal with this as we do with the many similar problems we have on items. This is written in the proposals, isn't it? author  TomT0m / talk page 20:00, 26 November 2016 (UTC)
No, not that I've seen. The coding of geographic, dialectical, or other variation in pronunciation has not been part of any proposal I've seen. --EncycloPetey (talk) 20:28, 26 November 2016 (UTC)
The purpose of this proposal isn't to define exactly how every single piece of Wiktionary data should be entered; it's only determining the underlying structure. Lexemes will support statements and statements will have qualifiers, so for the purposes of this proposal, regional pronunciations are no problem: it will be possible to have pronunciation statements with qualifiers specifying where the pronunciation is valid. Questions like whether Quebec and Ontario should be listed separately are things the community will need to decide later, once we actually have support for lexical data. - Nikki (talk) 17:17, 27 November 2016 (UTC)
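To make the idea of pronunciation statements with qualifiers concrete, here is a minimal sketch. Everything in it is hypothetical: the `Statement` class, the `valid_in` qualifier name, the `PART_OF` table and the placeholder values are invented for illustration and are not taken from the proposal or from any Wikidata property.

```python
from dataclasses import dataclass, field


@dataclass
class Statement:
    """Hypothetical pronunciation statement with free-form qualifiers."""
    value: str
    qualifiers: dict = field(default_factory=dict)


# Hypothetical containment table; a real system would query region items.
PART_OF = {"Quebec": "Canada", "Ontario": "Canada"}


def listed_under(statement, region):
    """True if the statement should be shown under the heading `region`.

    A statement qualified with "Quebec" is also listed under "Canada";
    a statement without a `valid_in` qualifier is listed everywhere.
    """
    place = statement.qualifiers.get("valid_in")
    if place is None:
        return True
    while place is not None:
        if place == region:
            return True
        place = PART_OF.get(place)
    return False


def select(statements, region):
    """Return the statements to display for `region`, in original order."""
    return [s for s in statements if listed_under(s, region)]


# Placeholder values; no real pronunciations are implied.
statements = [
    Statement("pron-quebec", {"valid_in": "Quebec"}),
    Statement("pron-ontario", {"valid_in": "Ontario"}),
    Statement("pron-general"),
]

canadian = [s.value for s in select(statements, "Canada")]
quebec_only = [s.value for s in select(statements, "Quebec")]
```

With data stored this way, whether Quebec and Ontario appear separately or grouped as Canadian becomes a display choice made by the consumer rather than something fixed in the data model.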
I'm not sure why we don't understand each other about option 3. :( If someone else can see what the problem is, please tell us. :) - Nikki (talk) 17:17, 27 November 2016 (UTC)

Making conventions explicit

What I get from the preceding discussion is a more general observation. If we want to improve collaboration, both within the Wiktionary and Wikidata communities and between them, we do not just need the data model as specified. There are many conventions on things like orthography, pronunciation and grammar that need to be applied consistently to get useful results. Often a different convention will work as well, but not if you use it at the same time and place. My favorite analogy here is keeping left or right on the road: either convention will work, but at a specific time and place you want everyone to follow the same rule.

Wouldn't it be possible to create Wikidata items containing explicit descriptions of the more important conventions used by a Wiktionary? They could then be used as qualifiers in the lexical data, more or less like referencing a source. This approach could have several advantages:

  1. Checking a new entry against an explicit description is probably easier and leads to fewer arguments;
  2. These descriptions can help editors (especially new ones) to get it right, whether they contribute on Wiktionary or on Wikidata;
  3. When there are conflicting but internally consistent views, they can both be represented within Wikidata;
  4. A Wiktionary can then use just the data consistent with the conventions it wants to use by selecting on the qualifier;
  5. These descriptions (and possible differences between them) can be valuable linguistic data in themselves.
--MarcoSwart (talk) 22:19, 26 November 2016 (UTC)
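As a rough illustration of point 4 in the list above, selecting data by a convention qualifier could look like the sketch below. The convention identifiers, the `Form` class, the per-Wiktionary settings and the form strings are all hypothetical placeholders, assumed here only to show the selection step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Form:
    """Hypothetical word form tagged with the convention it follows."""
    text: str
    convention: str  # identifier of an item describing the convention


# Hypothetical settings: which conventions each Wiktionary accepts.
ACCEPTED = {
    "nl.wiktionary": {"nl-official-orthography"},
    "en.wiktionary": {"nl-official-orthography", "nl-unofficial-orthography"},
}


def forms_for(wiktionary, forms):
    """Return only the forms whose convention the given Wiktionary accepts."""
    accepted = ACCEPTED.get(wiktionary, set())
    return [f for f in forms if f.convention in accepted]


# Placeholder forms; the strings stand in for real spellings.
forms = [
    Form("form-official", "nl-official-orthography"),
    Form("form-unofficial", "nl-unofficial-orthography"),
]

dutch_view = [f.text for f in forms_for("nl.wiktionary", forms)]
english_view = [f.text for f in forms_for("en.wiktionary", forms)]
```

Both views read from the same stored forms; only the set of accepted conventions differs, which is what lets conflicting but internally consistent conventions coexist.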
It seems useless: if all the conventions are equivalent, then we just have to settle on one reasonable convention once and for all. Consumers will then know it, and it will be possible to automatically translate the data into whatever convention a given Wiktionary uses. Of course, that holds only if all conventions are equivalent; if they are not, more thinking is required. But this is way out of scope at this stage. What exactly do you have in mind? author  TomT0m / talk page 17:24, 27 November 2016 (UTC)
In a lot of cases a single convention will do. But there are clearly cases where results are not equivalent. I'll give two examples, both relating to Dutch, because that is where I have some experience.
A. Orthography. For the Dutch language, with an official international body taking care of orthography, it makes sense to have word forms entered in conformity with the standard. This is currently an enforced policy at Dutch Wiktionary: a word form has to follow the official spelling valid during the specified period; other forms are only acceptable for periods predating the official orthographies. If Wikidata simply started collecting word Forms without specifying the orthography, the data collected would not be of much use for the Dutch Wiktionary. I can see how this can be different for other languages, so it would make sense to have some conventions covering just the entries in Dutch. Knowing that we have a history of fierce debates on orthography, at some point in time it might be a valuable solution to allow Dutch Forms spelled according to a different system to be entered too, as long as there is a convention to describe the spelling used at the Form level. If, for instance, the English Wiktionary prefers to show unofficial Dutch Forms too, it could then simply accept a wider set of word Forms from Wikidata.
B. In a language with much compounding, like Dutch, it can be a challenge to draw a boundary between words often used as the first part of compounds and true prefixes. Dutch Wiktionary has made this distinction based on a few considerations (grammatical, historical) that suit our purpose. So we distinguish between in- as a real Dutch prefix (an intensifier), in- as a prefix derived from foreign languages (2 Senses) and the adverb in- as the first part of many compounds, to give the reader a full understanding of the Dutch language. If the main goal is to provide readers in another language with translations, I can imagine that it is easier to present most of it under a single prefix in-. But what clearly will not work is using both conventions simultaneously without a possibility to select the data according to your preferred convention.
I am thinking along the lines of what TemplateData is trying to achieve for VisualEditor. I think it would be helpful to have clear descriptions of language-related policies that help users to use the data model in an effective way and ensure cooperation with Wiktionaries from the start. Describing the conventions presently used by Wiktionaries could take some time, so I feel it is wise to raise this matter now. --MarcoSwart (talk) 01:30, 28 November 2016 (UTC)
It seems to be exactly the kind of thing that should be possible in Wikidata indeed. In regular Wikidata we can classify entities into classes with the properties instance of (P31) and subclass of (P279). I can't see why we could not create classes like "historical Dutch word" and use those properties, or other properties, to do the same with lexemes and the different usages of a lexeme. author  TomT0m / talk page 18:24, 28 November 2016 (UTC)
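A minimal sketch of how classification through instance of (P31) and subclass of (P279) works in principle: an entity belongs to a class if one of its direct classes is that class or reaches it through a chain of subclass links. The two dictionaries, the lexeme identifier and the class names other than "historical Dutch word" are hypothetical sample data, not real Wikidata content.

```python
def is_instance_of(entity, target_class, instance_of, subclass_of):
    """True if `entity` is an instance of `target_class`, directly or
    through any chain of subclass links.

    instance_of maps an entity to its direct classes (the role of P31);
    subclass_of maps a class to its direct superclasses (the role of P279).
    """
    pending = list(instance_of.get(entity, ()))
    seen = set()
    while pending:
        cls = pending.pop()
        if cls == target_class:
            return True
        if cls in seen:
            continue  # guard against cycles in the class hierarchy
        seen.add(cls)
        pending.extend(subclass_of.get(cls, ()))
    return False


# Hypothetical sample data.
instance_of = {"lexeme-x": ["historical Dutch word"]}
subclass_of = {
    "historical Dutch word": ["Dutch word"],
    "Dutch word": ["word"],
}

is_dutch = is_instance_of("lexeme-x", "Dutch word", instance_of, subclass_of)
```

A consumer that wants only current official forms could then exclude everything for which the check against "historical Dutch word" succeeds.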