Wikidata talk:Lexicographical data/Archive/2016/09

This page is an archive. Please do not modify it. Use the current page, even to continue an old discussion.

Reactions to FAQ


Why will this project be useful for Wiktionary editors? : I think you miss a point here. The way Indonesian (to keep this example) describes the morphosyntactic categories of Estonian differs from the way English or Polish describes them, because there is no single ontology of linguistic concepts for describing languages, and because of simplification. Firstly, diversity: sorry, but linguistics is still a young discipline, and its terminology is still evolving in some parts of the world while being frozen by tradition in others. So the operative concepts (the column names, if you imagine a table with a paradigm in it) differ from one language to another and from one description to another. Secondly, simplification: we do not use all linguistic concepts in Wiktionary, because they would appear too cryptic. Take the example of the optative (Q527205), which describes a wish or a choice. Good, but most readers have never heard of this concept! Wiktionary has to adapt it to a more generic and culturally appropriate concept. It is often rendered as the subjunctive because the idea is much the same, but some languages may prefer another choice. How can Wikidata deal with that? Is it recorded in Wikidata or in the editor tool in Wiktionary?

Second part of this paragraph: Can you provide any suggestion of tools that may emerge thanks to a lexical database? (I have some ideas, but it may be clearer for all participants to have a better picture of the future and to desire it). Noé (talk) 14:31, 13 September 2016 (UTC)

Hello @Noé:, thanks for your feedback.

The Wikidata structure is already ready for complexity and modeling different ways, with a lot of statements and non-restrictive properties. The development team will provide the tools and we will let the linguistic experts decide on which tools to use in which language. We also can store very complex and complete information on Wikidata, and sort it to display only some chosen parts on Wiktionary.

The ideas will come both from Wikidata and Wiktionaries communities, but to give you just one example, having cross-structured data will allow an external language-learning project, like Parley, to parse several Wiktionaries, which is not possible for now. Lea Lacroix (WMDE) (talk) 15:41, 13 September 2016 (UTC)

We already have an example of such a tool on fr.wiktionary: fr:wikt:Wiktionnaire:Recherche avancée. It is difficult to generalize because I extract data from the French xml dump with a parser specific to the French project. If we can use the structured data from Wikidata, we could create a more general lexical search tool much more easily, and for every language. Darkdadaah (talk) 16:22, 13 September 2016 (UTC)

Only if one extracts the data from every language. The Wikinews tool, for example, failed due to the high maintenance cost of extracting the data from only a half dozen or so Wiktionary languages. - Amgine (talk) 16:26, 13 September 2016 (UTC)

As an en.wikt admin, I don't feel like our input is a part of this process, and I approach this with much trepidation. We could really use interwikis to be centralised on Wikidata so we can stop relying on bots to update those. However, everything else mentioned seems incapable of accounting for the complexities in how different Wiktionaries handle basic questions like what counts as which language, which languages are supported, how to lemmatise, and more. These are surmountable, but this page seems to be unaware of most of these issues. Noé's question above just scratches the surface, and I find the answer to be very underwhelming in terms of allowing for the autonomous approaches that each Wiktionary chooses to take. Metaknowledge (talk) 20:01, 14 September 2016 (UTC)

@Metaknowledge: Before we met at Wikimania in June, I was also very concerned about how much the Wikidata team knows about lexicography. You raise some of the most challenging issues and I agree with you: it will be a very long and puzzling journey, as it still is within each project. Defining which languages are supported means managing linguistic and political interests, and we still have a hard time with Glottolog/Ethnologue divergences. Nevertheless, my feeling on this matter is not that gloomy. I think we are (or will be) the experts on collaborative lexicography, something new and loosely described, because we do not define our own activity much. So, in order for us to be considered experts collectively, I tried to initiate some work in this direction, such as a list of Wiktionarian skills (what we have learned by contributing to Wiktionary). And now we can teach the people of good will from Wikidata and help them formulate the appropriate questions that can drive the discussion. Well, I am not sure I am the right person to do so, but there are plenty of us, and we may find it interesting to collaborate on defining the limits and possibilities of Wiktionary's future. You're welcome

Noé (talk) 09:07, 15 September 2016 (UTC)

@Metaknowledge: We definitely need to involve the different Wiktionary communities in this process. Unfortunately the discussion/process seems to be fragmented across a plethora of mailing lists, Phabricator tickets, and talk pages on Wikidata and the Wiktionaries. Maybe this talk page could serve as a central hub?

@Noé: I imagine that there will be another mapping / translation of the data stored in Wikidata to what the end user actually sees (in phase gamma: provide arbitrary access from data on Wiktionary). So optative could be rendered as "subjonctif" in French. Another example would be Portuguese, where the grammatical terminology sometimes varies between countries (conjuntivo in Portugal, subjuntivo in Brazil). We could store both labels and then render the "correct" one based on the user's language settings. – Jberkel (talk) 13:39, 15 September 2016 (UTC)
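
The mapping described in the comment above (one stored concept, several reader-facing labels chosen by the user's language settings) could look roughly like the following. This is only a sketch under assumptions: the `LABELS` table, the `render_label` and `fallback_chain` helpers, and the fallback to English are all hypothetical and do not describe any existing Wikidata or Wiktionary feature. `Q527205` (optative) is taken from the discussion; the `subjunctive` key is a placeholder, not a real item ID.

```python
# Hypothetical label store: concept -> locale code -> label shown to readers.
# "Q527205" (optative) comes from the discussion; the "subjunctive" key is a
# placeholder for illustration, not a real Wikidata ID.
LABELS = {
    "Q527205": {"en": "optative", "fr": "subjonctif"},
    "subjunctive": {
        "en": "subjunctive",
        "pt-PT": "conjuntivo",
        "pt-BR": "subjuntivo",
    },
}


def fallback_chain(locale):
    """Return locale codes to try, most specific first, ending with English."""
    chain = [locale]
    if "-" in locale:
        chain.append(locale.split("-", 1)[0])  # e.g. "pt-BR" -> "pt"
    if "en" not in chain:
        chain.append("en")  # assumed last-resort language
    return chain


def render_label(concept, locale, labels=LABELS):
    """Pick the label for a concept according to the reader's locale."""
    by_locale = labels.get(concept, {})
    for code in fallback_chain(locale):
        if code in by_locale:
            return by_locale[code]
    return concept  # nothing stored: show the raw identifier


print(render_label("subjunctive", "pt-PT"))
print(render_label("subjunctive", "pt-BR"))
print(render_label("Q527205", "fr"))
```

The point of the sketch is only that the stored statement stays the same while the displayed term varies; which label each Wiktionary prefers would remain a community decision.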

In my opinion, no, this page cannot be central. This is foreign/hostile territory for most Wiktionarians. The people you see commenting here are the most motivated and vocal members of a community which is, by nature, shy and retiring. The people here are histrionic compared to much of the community, something which does not seem to be getting much notice. - Amgine (talk) 15:42, 15 September 2016 (UTC)

@Amgine: So what could be the central page for this discussion? --Denny (talk) 19:07, 15 September 2016 (UTC)

I do not know of any successful central communications channels within the Wiktionary project. Thus the ongoing inability to begin a discussion to homogenize logo usage across the project. If someone would kindly integrate phpbb again, it would be trivially possible to have a single (or many) forum across the languages, but of course that is NMH. gods help us if we were to use an off-the-shelf solution we don't have to maintain. </rant> Discussions on Meta are just as unlikely to actually reach day-to-day Wiktionarians. Even local discussion pages are largely ignored due to drama, but they seem to be the most effective. (Note: exactly one communication from Wikidata was posted to either en.WT's Beer Parlour or Grease Pit in the past 13 months, and the cross-post to Wiktionary-l was from Nemo_bis. One assumes Wikidata is not actually interested in communicating with Wiktionary based on such history.)

Which I suppose means you have either a massive adoption problem if you implement this from on high without the necessary-to-get-it-right user input, or a massive communications project to communicate at the local level in every active wiktionary. -- User:Amgine
Noé's point on what Wiktionary editors gain is central. From the current state of things I think integration with Wiktionary is impossible, for the reasons described by Amgine. On the other hand, the major premise of this work seems a desire to start from scratch. The biggest opportunity to start involving Wiktionary editors in Wikidata was to use Wikidata for interwiki links just like all other projects, but that ship has sailed.

If some feature comes out of this development that will actually simplify the life of wiktionarians, then maybe we'll see some subset of some Wiktionary slowly adopt some Wikidata data in some years from now, just like Wikidata properties are being slowly put in use by templates on other projects. Otherwise, this new dictionary/ontology will just have a life of its own.

Starting a competitor of Wiktionary within Wikimedia isn't necessarily a bad thing, as long as licensing headaches are avoided, the two keep some communication and Wikidata doesn't steal editors from Wiktionary. Wikidata might become the meeting place for the many researchers who build ontologies, dictionaries and various other derivatives from Wiktionary. They might start importing some of their representations of Wiktionary data, share code and practices, merge other ontologies like WordNet and slowly reduce the difficulty of such Wiktionary parsing in the future. If even just a dozen users (NLP/bot programmers) use Wikidata this way and all users continue using Wiktionary in the same way, we'll have something valuable. --Nemo 07:53, 16 September 2016 (UTC)
I know my biases, but this sounds like a bad thing for Wiktionary. It's hard to know how to contribute to a project that is unsure whether it wants to support the Wiktionaries or compete with them. Metaknowledge (talk) 01:06, 18 September 2016 (UTC)
To the best of my knowledge, no one is proposing to start a competitor to the Wiktionary projects. Wikidata has not developed into a competitor to Wikipedia, why would it become a competitor to Wiktionary? --Denny (talk) 04:31, 18 September 2016 (UTC)
What do you mean by "the ship has sailed" regarding using Wikidata for interwiki links just like all other projects? Interwikis are the main topic for Phase Alpha of the project. --Denny (talk) 20:25, 16 September 2016 (UTC)
Senses

I think you don't need to create new entries for senses because there are already millions of explanations at the top of most Wikidata entries (and, vice versa, thousands of Wiktionarians would care about the descriptions in Wikidata; many Wiktionaries even have their own rules for them). I also think that other information related to the meaning, like hyponyms and hypernyms, is better off in the "Q" entries. --79.201.82.18 19:06, 15 September 2016 (UTC)

I used to think that too, but I got convinced otherwise by reading lexicographical works. A sense can refer to a Wikidata item, but two different words never share a sense. Roughly speaking, Items are too coarse to capture the difference in the senses of "walking" and "going", never mind between different languages. --Denny (talk) 19:59, 15 September 2016 (UTC)
Maybe you misunderstood me: in w:House (Disambiguation) "structure used for habitation by People", wikt:house#Noun "A structure built or serving as an abode of human beings." and Q3947 "structure intended for living in" you can find basically the same explanations. Is it not possible to make use of this correspondence? --79.201.82.18 22:34, 15 September 2016 (UTC)

Probably there is, but I would consider that almost an optimization, one that should be tackled a bit later, once we have the basic structures in place and see the first usage patterns emerging. But as I say: it should be tackled. It is similar for inflections of verbs. Yes, these things can be more automated. But let's not block on getting that right from the beginning. I rather believe we won't get it right anyway, and thus we should move slowly but steadily towards an ever-improving situation, where we see what works and what doesn't and adapt the platform based on the emerging usage patterns. Does that make sense? --Denny (talk) 17:52, 16 September 2016 (UTC)
Of course there are more important things - I just wanted to admit my discovery :) And thank you for your engagement! --79.201.65.25 11:53, 17 September 2016 (UTC)
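
The distinction drawn in this thread (a sense may refer to a Wikidata item, but two different words never share a sense) can be illustrated with a small sketch. It is an assumption-laden illustration, not the actual data model: the `Lexeme` and `Sense` classes and the `lemmas_for_item` helper are hypothetical, and the gloss given for "maison" is invented for the example. `Q3947` and the gloss for "house" are quoted from the discussion.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)  # eq=False: senses compare by identity, never by content
class Sense:
    gloss: str
    item: Optional[str] = None  # optional reference to a Wikidata item ID


@dataclass(eq=False)
class Lexeme:
    lemma: str
    language: str
    senses: List[Sense] = field(default_factory=list)

    def add_sense(self, gloss, item=None):
        """Create a sense owned by this lexeme only and return it."""
        sense = Sense(gloss, item)
        self.senses.append(sense)
        return sense


def lemmas_for_item(lexemes, item):
    """List the lemmas that have at least one sense pointing at the item."""
    return [lx.lemma for lx in lexemes if any(s.item == item for s in lx.senses)]


house = Lexeme("house", "en")
maison = Lexeme("maison", "fr")

s1 = house.add_sense(
    "A structure built or serving as an abode of human beings.", item="Q3947"
)
# Illustrative gloss, invented for this example.
s2 = maison.add_sense("Building used as a dwelling.", item="Q3947")

print(s1 is s2)                                   # distinct sense objects
print(lemmas_for_item([house, maison], "Q3947"))  # both refer to the same item
```

Here the correspondence the anonymous contributor noticed is kept as a link from each sense to the item, while each lexeme still owns its own senses.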
RDF / query for new entities?

Is there a plan for how the new entity types will be represented in RDF, and how the wikidata query service will be able to query them? ArthurPSmith (talk) 18:29, 16 September 2016 (UTC)

I don't think that has been tackled yet, but eventually it should be rather analogous to the way Wikidata items and properties are represented today (in the end, it's all mostly statements anyway, and these are the same as in Wikidata). But yes, it is an open task as far as I can tell. --Denny (talk) 19:34, 16 September 2016 (UTC)
Correct ;-) --Lydia Pintscher (WMDE) (talk) 20:15, 16 September 2016 (UTC)
Language specific properties

I'm pretty convinced that Wikidata can be very valuable to Wiktionary. Of course there are lots of difficulties to overcome. I have read the proposal and will try to point out some of these difficulties and possible solutions. Please view my contributions today and in the following days as encouragement to go on.

Languages can define lexical categories and lexical properties in different ways:

not all categories and properties apply to all languages, especially more refined categories and properties can be language specific;

some categories and properties have similar, but not quite identical definitions in different languages;

sometimes there is legitimate discussion between linguists about the validity of categories and properties: sometimes the consensus can change over time.

A few examples.

Spanish has a "pretérito indefinido" and a "pretérito imperfecto", Dutch has just an "onvoltooid verleden tijd" but none of the three are synonymous.

In English "prefix" usually has a wider use than "voorvoegsel" in Dutch, because in a compounding language the contrast with derivation is more important (resulting in a different categorization of Dutch prefixes on English and Dutch wiktionary).

There is a growing consensus that adverbial use of an adjective in Dutch is standard and does not warrant a separate definition as an adverb.

The proposed data model would be more robust if it took these differences explicitly into account. Would it be possible to have an entity, described language, consisting of:

a language taken from Wikidata items

a short name, specifying the particular description of the language

the set of possible lexical categories according to this description, taken from Wikidata items

the set of possible lexical properties according to this description, taken from Wikidata items.

Probably there is a lot more to say about a specific description, so maybe a sitelink would be in order.

A lexeme would then have a described language instead of a language, creating the possibility of having differently defined sets of lexemes for a language.

This would have a few advantages:

a. it becomes easier to present contributors with a list of valid categories/properties

b. wrong attributions are easier to find and correct

c. collecting these data for the existing Wiktionary languages and translating them in the other Wiktionary languages is by itself a useful framework for other translations

d. in case of linguistic differences or developments it is possible to create a separate described language.

--MarcoSwart (talk) 14:04, 18 September 2016 (UTC)
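
The "described language" entity proposed above, together with advantages (a) and (b), might be sketched as follows. Everything here is an assumption made for illustration: the `DescribedLanguage` class, the `validate_lexeme` function, and the plain-text stand-ins such as `"Dutch"`, `"adjective"` or `"degree"`, which take the place of the Wikidata items the proposal would actually reference.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class DescribedLanguage:
    language: str               # stand-in for the language's Wikidata item
    short_name: str             # names this particular description
    categories: FrozenSet[str]  # lexical categories this description allows
    properties: FrozenSet[str]  # lexical properties this description allows


def validate_lexeme(desc, category, properties) -> List[str]:
    """Return the problems found; an empty list means the lexeme fits."""
    problems = []
    if category not in desc.categories:
        problems.append(f"category not allowed: {category}")
    for prop in sorted(properties):
        if prop not in desc.properties:
            problems.append(f"property not allowed: {prop}")
    return problems


# Hypothetical description of Dutch; the sets are illustrative only.
dutch_a = DescribedLanguage(
    language="Dutch",
    short_name="example description A",
    categories=frozenset({"noun", "verb", "adjective", "adverb"}),
    properties=frozenset({"gender", "number", "degree", "tense"}),
)

print(sorted(dutch_a.categories))  # (a) the list to offer contributors
print(validate_lexeme(dutch_a, "adjective", {"degree"}))
print(validate_lexeme(dutch_a, "postposition", {"grammatical case"}))  # (b)
```

A second `DescribedLanguage` with the same `language` but different sets would correspond to point (d), a separate description for the same language.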

According to the model proposed on Wikidata:Wiktionary/Development/Proposals/2013-08/en (which I believe is still the same), the lexical category of a lexeme links to a normal Wikidata item, which means we can create new lexical category items as necessary and if we use a property like P2439 (P2439) on the items, we can also link them to the relevant languages. Since the language of a lexeme also links to a normal Wikidata item, (a) and (b) shouldn't be difficult without needing any changes to the data model. I don't really understand (c) or (d) so I can't comment on those. - Nikki (talk) 20:10, 18 September 2016 (UTC)
Thank you for showing that the present model can accommodate many possible preferences. Still, there are some preferences that can co-exist only as long as they are not mixed. In traffic, keeping left or right are equally valid choices, but it is of vital importance that all drivers sharing roads adhere to the same choice. When describing languages, different valid choices exist too. For a meaningful result, these choices need to be coherent, especially if you want to make the results machine-readable. For instance, point 3 mentioned above: we can create a second lemma for each adjective to describe it as an adverb too, or we can have a general rule that adjectives can also be used as adverbs. Both are legitimate choices with pros and cons, but if no choice is made there is no result the user can rely on. My proposal gives the community the possibility to "agree to disagree" in an orderly fashion. Is there an easy way to reference a coherent set of entities and rules for using them? --MarcoSwart (talk) 13:41, 19 September 2016 (UTC)
We started experimenting in the field, e.g. has grammatical case (P2989) already gives a nice overview by Tobias1984. I should put some work into the others.
--- Jura 20:32, 18 September 2016 (UTC)
~~This looks neat! --MarcoSwart (talk) 13:43, 19 September 2016 (UTC)~~