Wikidata talk:Lexicographical data/Archive/2014/10

This page is an archive. Please do not modify it. Use the current page, even to continue an old discussion.

Problems?

Could somebody tell me which problems have come up so far? It's hard to read everything about Wiktionary here. Greetings Impériale (talk) 01:16, 6 August 2014 (UTC)

@Impériale: There are two proposals; each has its own issues.

1. The complex proposal made in Wikidata:Wiktionary/Development centres on storing sense data in one place and transcluding it into various languages. The biggest problems with that are (1) about as many Wiktionarians have expressed opposition to it as non-Wiktionarians have expressed interest in it, and that's because (2) it would be linguistically unsound and maybe unmaintainable. It would be unsound because words other than nouns are not often fully synonymous even on a denotative level (contrast e.g. wikt:de:gehen and English wikt:en:go), and even nouns are often not synonymous on a connotative level, so a definition that was sufficiently abstract to cover words in multiple languages would represent a regression and loss of information from the detailed definitions currently in the various Wiktionaries. And I doubt a repository of abstract sense data would be maintainable because I doubt enough people would leave their native-language Wiktionaries (or come from other places) to work on it. There are also technical barriers, some of which are outlined in Wikidata:Wiktionary/Development.
2. A simpler proposal has been made that Wikidata centralize access to and management of interwiki links between the Wiktionaries. That would be straightforward, since a main-namespace page on one Wiktionary should always and only link to all identically-titled pages on other Wiktionaries, with a limited set of exceptions, namely the pages for characters which MediaWiki does not allow to be used in pagetitles. (Also, some Wiktionaries use e.g. curly apostrophes where other Wiktionaries use straight apostrophes, but this can continue to be handled in the way the Wiktionaries currently handle it as long as interwiki links can continue to exist between full pages on one project and redirects on another.) Many Wiktionarians have expressed support on this page and elsewhere for that proposal, but it has yet to be implemented. There may be some technical barrier to it(?), and there may be some opposition to it among the people who prefer proposal 1.

-sche (talk) 16:54, 16 August 2014 (UTC)
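
The linking rule in the simpler proposal above (a page links to all identically-titled pages on other Wiktionaries) can be sketched roughly as follows. This is only an illustration, not how Wikidata or any interwiki bot is implemented: the function names, the `(wiki, title)` input format and the sample pages are assumptions made for the example, and the exceptions mentioned above (disallowed characters, curly versus straight apostrophes) are ignored.

```python
from collections import defaultdict


def build_central_links(pages):
    """Group (wiki, title) pairs by title: one central record per page title.

    Hypothetical helper: the input format is an assumption for this sketch.
    """
    central = defaultdict(set)
    for wiki, title in pages:
        central[title].add(wiki)
    return central


def interwiki_links(central, wiki, title):
    """Return the other wikis that hold an identically-titled page."""
    return sorted(central.get(title, set()) - {wiki})


# Illustrative sample data, not taken from any real dump.
pages = [
    ("en", "go"), ("de", "go"), ("lv", "go"),
    ("en", "gehen"), ("de", "gehen"),
]
central = build_central_links(pages)

print(interwiki_links(central, "en", "go"))     # ['de', 'lv']
print(interwiki_links(central, "lv", "gehen"))  # ['de', 'en']
```

With a central record like this, adding a new Wiktionary's page for a title is a single update to one set, rather than one edit on every wiki that already has the page.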

@-sche: I agree that it would be much easier to start with the simple proposal that is fully endorsed by the Wiktionary community, which would mean storing lexemes (Wiktionary pages, all lexemes together regardless of their language) as items containing Wiktionary interwiki links. This can be done now without much effort and it would have an immediate benefit.

Later on, when the structure in the proposal is ready ("W123" in the example), it would be possible to link both.

Regarding the concerns that "nouns are not often fully synonymous even on a denotative level", I must say that if two items are not fully synonymous at all levels, that would mean that there should be two distinct items. I wonder if Wiktionarians are interested in exploring ontological descriptions as a complement to hard-coded text glosses. Our property vocabulary is improving over time; it just needs more people interested in defining items using properties in order to identify which ones are missing.

I would recommend taking a look at recent advances like Editing Wikidata from infobox and w:ru:Википедия:WE-Framework. They are still at an early stage, but they show that it is possible to edit Wikidata without leaving one's wiki.

No idea about the technical barriers that you are mentioning, pinging @Lydia Pintscher (WMDE), Daniel Kinzler (WMDE): to see if they can add some details.--Micru (talk) 14:16, 21 August 2014 (UTC)

I am very hesitant to go for a half-assed solution now if we know we'll do it differently later. We have some plans as laid out here but nothing is set in stone at this time. It'll need a lot more collaboration with the Wiktionary community for example to get this going. We're concentrating on Commons for the next months so from my side this is put on hold until at least beginning of next year. If you want to work out a detailed critique of this proposal as well as a detailed counter-proposal that'd be awesome because I lack insight into some of the details of Wiktionary obviously and will need to catch up on those. Daniel has spent considerably more time and thinking on it already. --Lydia Pintscher (WMDE) (talk) 14:26, 21 August 2014 (UTC)

@Micru: the reason I point out that words are not often fully synonymous is twofold. (Apologies for the long reply!)

1. It means moving word data to Wikidata would accomplish nothing other than changing where the data was hosted. You observe that if "two items are not fully synonymous at all levels [...] there should be two distinct items", and it's already the case that there are two items. If the items are e.g. an English verb and a German verb, then en.Wikt has an entry for the English verb (item 1) and de.Wikt has an entry for the German verb (item 2). And en.Wikt has an entry for the German verb with a gloss, and de.Wikt has an entry for the English verb with a gloss, which would just go into the gloss parameter on Wikidata under Proposal 1. Moving the items and glosses to Wikidata would accomplish nothing other than just that: relocating them from the various language editions of the WMF project that is actually devoted to writing definitions and glosses and translations, to another project.
   → Contrast that with what moving interwiki data to Wikidata would do. That would accomplish a reduction of work. The format of a Latvian interwiki link to en.Wikt's entry on go/sea/etc doesn't differ from the format of a German interwiki link to en.Wikt's go/sea/etc (whereas the Latvian gloss/translation obviously does differ from the German gloss/translation, namely in that one's written in Latvian and the other's in German), so a list of interwikis could be centralized. Then, users could add new interwikis to the central list on Wikidata with one edit and be done, whereas they currently have to add them manually or by bot in a hundred edits to a hundred projects.
2. It's my expectation, based on experience, that > 0 % of the people moving word data would not speak all of the languages whose data they were moving fluently, and so many non-synonymous words/senses would end up being treated as if they were fully synonymous (representing a regression and loss of information).

-sche (talk) 01:05, 22 August 2014 (UTC)

@-sche: You are talking about problems that in the context of Wikidata are non-problems. I will reply point by point:

moving word data to Wikidata would accomplish nothing other than changing where the data was hosted: It would give Wiktionary a platform for the community to enter machine-readable data. For instance, look at this item Q12237236. I don't speak Arabic, but just looking at the item I know what the item is about, and so can any person in any language provided that there are labels for the items linked. Even more, from just that information it is possible to generate a short text (Reasonator does that already). So basically what you would accomplish is immediate intelligibility of a lexeme and its gloss in any language instead of repeating the same text 10, or 100, or 1000 times.
so many non-synonymous words/senses would end up being treated as if they were fully synonymous: I probably didn't explain myself well enough when I said that in Wikidata "two distinct concepts get two distinct items". What I meant is that each sense of the word "gehen" is a distinct concept, so if there are 23 kinds of "gehen", there could be 23 items, one for each sense of "gehen", and each of them can be defined in a way that is intelligible in all languages. Of course, for convenience they are all grouped together in the proposal (S2011, S1989), but each one can be considered equivalent to a Wikidata Q item. Also bear in mind that whatever structure is used internally in Wikidata is irrelevant for Wiktionary, since you can display it in any way you wish.

Summing up, yes you can achieve some reduction of work by centralizing the links, but you can achieve a tremendous reduction of work and data re-usability by using machine-readable descriptions as much as possible.--Micru (talk) 08:02, 22 August 2014 (UTC)

@Lydia Pintscher (WMDE): the linking which Proposal 2 envisions is the kind of linking that currently exists (status quo), the proposal just envisions that the linking be done by Wikidata in one place rather than by a dozen bots on a hundred Wiktionaries. According to Wikidata's main page, centralization of interwiki data is one of Wikidata's stated missions. So, I wouldn't characterize such linking as "half-assed" (if it is what you were characterizing that way), but as fulfilment of Wikidata's stated mission.

The linking envisioned in Proposal 1 is something which it is my impression, based on comments from fellow Wiktionarians, would not be adopted on en.Wiktionary even if Wikidata could offer it. Basically, Wiktionary thinks it can do a better job of being Wiktionary than Wikidata can do of being Wiktionary. (Does anyone think that writing definitions/glosses/translations on Wikidata is not "Wikidata trying to be Wiktionary"?) Hence, it seems to me that Wikidata should offer Proposal 2 linking (because it would be so simple from a technical standpoint, and would fulfil one of Wikidata's stated missions) even if it hopes to offer Proposal 1 later, out of recognition that Proposal 1 is so dramatic a change that it is unlikely that all Wiktionaries will accept it (and there is evidence some will reject it).

-sche (talk) 01:23, 22 August 2014 (UTC)

I think you make a very valid point that either the tool is embraced by the community or it would be all for nothing. When arbitrary access is ready and with the addition of in-wiki editable templates, Wikidata could already support almost all features that Wiktionary needs. The missing ones would be "declension tables" and putting all senses together (in Wikidata, since in Wiktionary that can already happen), and to be honest I do not know if they are that necessary at the early stages.

So yes, I agree that a partial solution would allow Wiktionarians to start to get acquainted with the tool, to think about how they want to use it and to gather a deeper understanding whether this is the tool they want or if it should be like this or like that. Without gaining that understanding by "touching" the tool, by thinking about the tool, and by making mistakes with the tool and learning from them, all is empty talk, and empty arguments. In Wikidata's knowledge tree we already can make place for a category of a "lexeme in all languages", and little by little the whole structure can be built by Wiktionarians as they wish it. --Micru (talk) 08:58, 22 August 2014 (UTC)

@-sche: - it is not true that the proposal at "Wikidata:Wiktionary/Development centres on storing sense data" - it centres around lexemes. Senses are subordinate to lexemes. Each lexeme is attached to one and only one language, and each sense is attached to one and only one lexeme. Senses can point to each other, but the naive assumption that go@en and gehen@de are the same is nowhere to be found in that proposal. --Denny (talk) 15:07, 21 August 2014 (UTC)

I think we may to some extent be talking past each other and meaning different things by "centres on". Your proposal subordinates senses to lexemes, but suggests a link between "apple" : "(gloss) (en) fruit of the apple tree" — "(gloss) (de) Frucht des Apfelbaumes" : "Apfel", which necessarily is a link between senses. (One could not sensibly link lexemes (on the level of meaning, rather than e.g. homophony or homography) except by linking senses; the basis for listing "Apfel" as a translation of "apple" and not of "kitten" must be that "Apfel" shares a sense with "apple" and not with "kitten".) You don't name "go" — "gehen" as an example of terms someone might try to link, but I'm supplying it as one: it's an example of a set of terms which have semi-synonymous senses (each term is sometimes used to translate the other), but not synonymous senses — an example of an IMO large category of things people would incorrectly try to link (cf point 2 of my reply to Micru). Re "each sense is attached to one and only one lexeme": that is itself interesting, since there can be some technical terms (particularly inside a single language) that are strictly synonymous, like regmaglypt and piezoglypt seem to be. -sche (talk) 01:27, 22 August 2014 (UTC)

@-sche: As said above if two concepts are different (even slightly different), they do not belong together, and this can be reproduced as having different items.

@Denny: After all this time using Wikidata I think it was a very good decision not to impose any constraints at all, so the community can build them, discuss them, and change them. I think the same can be done for Wiktionary: data is data no matter how you organize it, and linguistic data is no exception. What the proposal calls "senses" (S2011, etc.) already exists as Q items, what it calls "form" already exists as labels, and what it calls "lexeme" could be another Q item. The most interesting additions would be "declension tables" (which could be a namespace to store automatically generated declension tables as plain JSON) and the feature to group lexemes that share the same form (which can be done by being able to link a label with an item representing the lexeme). With those 2 things (or only with the last, which is more important), Wiktionary is already here.

It is a problem not to be able to add qualifiers to labels and that limitation is pushing us to replicate names as statements to be able to add what kind of name it is, and so on: General proposal for names. If sitelinks can have badges, could labels have them too but not limited to a fixed set? If you or anyone else could come up with a solution for this I would be most grateful to avoid reinventing the wheel :) --Micru (talk) 08:58, 22 August 2014 (UTC)

@-sche:, I am sorry, there was a major mistake in my proposal. I have fixed that one. Thanks for finding it! I hope you will also find the others.

As you point out, sense S2011 has two glosses, one in (de), "Frucht des Apfelbaums", and one in (en) "tree of the genus Malus". The German gloss is a gloss of this sense of the English lexeme "apple" (W123). The English gloss is also a gloss of that same sense. In the example, there is a property called "translation" pointing to another sense (here was the mistake - it used to point to a lexeme, but it should point to a sense). Translations in my opinion should be connections between senses.

Even more important - and this was not clear in the example at all - is that 'translation' is a community-defined property. It is per consensus of the community to make it a connection between two senses. They can also decide differently. They could make it a connection between words. They could omit the creation of such a property. They could create completely different properties. It is up to the community. It would be presumptuous of us to decide now which properties should exist and how they should be defined. We only define a minimal structure: lexemes, forms, and senses, and a small set of attributes on them: lexemes have a lemma, a lexical category, a language, a set of forms and a set of senses; forms have a representation and a set of lexical properties; senses have one gloss per language. All three can have further properties as defined by the community.

So if the community agrees with you and thinks that senses should not be linked with the translation property, that is totally OK for the proposal.

In theory you could already try to model that in Wikidata, as @Micru: suggests, but the usability would break down, and a few things would be confusing and would not fit really well. But in general, almost all of the structure would be decided upon by the community. The only preexisting structure, according to my proposal, would be the lexeme - forms - senses triangle with their attributes. This is similar to Wikidata, where we decided on a minimal preexisting structure - items, with labels, descriptions, aliases (all in different languages), and sitelinks, and then free statements, and their internal structure (i.e. with qualifiers and references). Everything else is built by the community.

Sorry, I recognize that the example in the proposal was a bit unclear with regards to that, I tried to make it clearer. I hope this makes it a bit more acceptable -- all that the proposal says is that there are lexemes with forms and senses. --Denny (talk) 22:39, 29 August 2014 (UTC)
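
The minimal structure described in the comment above (lexemes with a lemma, lexical category, language, forms and senses; forms with a representation and lexical properties; senses with one gloss per language) might be sketched as plain data classes like this. It is only an illustration under assumptions, not the proposal's actual schema: the class and field names, the `statements` mapping for community-defined properties, the form labels and the target sense ID `S-APFEL-1` are all hypothetical. The IDs W123 and S2011 and the glosses are taken from the discussion.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Sense:
    sense_id: str
    # One gloss per language code, e.g. {"en": "...", "de": "..."}.
    glosses: Dict[str, str] = field(default_factory=dict)
    # Community-defined properties (assumed shape: property name -> values).
    statements: Dict[str, List[str]] = field(default_factory=dict)


@dataclass
class Form:
    representation: str
    lexical_properties: List[str] = field(default_factory=list)


@dataclass
class Lexeme:
    lexeme_id: str
    lemma: str
    lexical_category: str
    language: str  # each lexeme belongs to exactly one language
    forms: List[Form] = field(default_factory=list)
    senses: List[Sense] = field(default_factory=list)


apple = Lexeme(
    lexeme_id="W123",
    lemma="apple",
    lexical_category="noun",
    language="en",
    forms=[
        Form("apple", ["singular"]),  # property labels are hypothetical
        Form("apples", ["plural"]),
    ],
    senses=[
        Sense(
            sense_id="S2011",
            glosses={
                "en": "fruit of the apple tree",
                "de": "Frucht des Apfelbaumes",
            },
            # "translation" points to another sense, not to a lexeme;
            # the target ID is a made-up placeholder.
            statements={"translation": ["S-APFEL-1"]},
        )
    ],
)

print(apple.senses[0].glosses["de"])
```

Because each sense hangs off exactly one lexeme, linking "apple" to "Apfel" here means adding a statement from one sense to another, which the community could define differently or leave out entirely.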

It's quite hard for me to understand the whole text above, so I will just write down the thoughts I have had about Wiktionary and Wikidata. The first point is that we should separate the process into several phases.

Phase 1: creating the Wikidata entries by using the already existing interlanguage connections between the different entries (this should be no problem, and perhaps even easier than before, because we already did this with Wikipedia and we don't have brackets after the lemmas like Wikipedia does, right?)
Phase 2: connecting the different meanings by creating a new namespace for them. I have no idea what their lemma would look like, but the description could be imported from the English Wiktionary. In this entry all translations and synonyms should be listed. The problem here is that the order of the different meanings differs from Wiktionary to Wiktionary. One option would be that the meanings in all Wiktionaries have to be sorted like, for example, the meanings in the English Wiktionary, so that they could be imported by bot. Another option would be that we have to manually type the number of the meaning from, for example, the German Wiktionary into the entry for each meaning in Wikidata.
Phase 3 (optional): adding additional information to the entries of the words (as in the German Wiktionary) with the help of the meaning namespace (e.g. word family, synonyms, generic term).

I hope you can understand my ideas. Please respond in easy English :) --Impériale (talk) 17:54, 9 October 2014 (UTC)

New proposal

I just wanted to let people know that I'm working on an evolution of this proposal. It's currently a work in progress, but I welcome feedback (on the corresponding talk page). Note that, at this point, my new proposal assumes a working knowledge of this current proposal. I look forward to the day when Wiktionary data is stored in a structured way. — GPHemsley (talk) 11:21, 17 October 2014 (UTC)

Thanks! I commented on the talk page. I quite like it. --Denny (talk) 15:42, 17 October 2014 (UTC)

I also left my comments there. Thanks!--Micru (talk) 18:39, 17 October 2014 (UTC)

Thank you! --Impériale (talk) 17:26, 18 October 2014 (UTC)