Wikidata talk:Lexicographical data/Archive/2013/08

This page is an archive. Please do not modify it. Use the current page, even to continue an old discussion.

Should glosses/definitions or lexemes/senses be the primary organization entity?


(sense) S2011

    (gloss) (en) tree of the species Malus
    (gloss) (de) Baum der Spezies Malus

I think this is going about it the wrong way. In the original proposal, the glosses were the "senses". With links between glosses forming wider "senses". - Francis Tyers (talk) 15:33, 2 August 2013 (UTC)

Perhaps to clarify the above point:

I don't believe in defining global language-independent "senses" and then tacking definitions onto them. This is looking at the problem in the wrong way. I prefer to use the definitions as the base type, and then make links between definitions. Then the language-independent "senses" could be extracted automatically by looking at the correspondences between definitions. Imagine:

station
  (en:1) A regular stopping place in a transportation route

estació
  (ca:1) Indret on es deturen habitualment els trens

estación
  (es:1) En los ferrocarriles y líneas de autobuses o del metropolitano, sitio donde habitualmente hacen parada los vehículos.

(estación:es:1, station:en:1) could form one strong bond. They are equivalent, although the definitions are not translations of each other. Note that, in this case, these definitions would also do well for (parada:es, stop:en). There could also be bonds between words, so you could make a strong bond between (estació, estación) because they share the same word in all definitions, or a weak one in the case of (estació, station) where there is overlap between the definitions.

Anyway, this is my general idea. In any case, in my opinion, the idea of having a general language-independent set of senses which form the "key" for entries is a mistake. The definitions themselves should be the senses. - Francis Tyers (talk) 13:36, 3 August 2013 (UTC)
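To make the "extract senses from links between definitions" idea concrete, here is a minimal sketch in Python. It treats each definition as a node, each bond as an edge, and reads off the connected groups as candidate language-independent senses. The identifiers follow the (word:language:number) notation used above; the bond involving estació:ca:1 and the function name `extract_senses` are assumptions made for illustration, not part of the proposal.

```python
from collections import defaultdict

# Definitions from the example above, keyed as word:language:number.
definitions = {
    "station:en:1": "A regular stopping place in a transportation route",
    "estació:ca:1": "Indret on es deturen habitualment els trens",
    "estación:es:1": (
        "En los ferrocarriles y líneas de autobuses o del metropolitano, "
        "sitio donde habitualmente hacen parada los vehículos."
    ),
}

# Bonds between definitions. The first comes from the comment above;
# the second is an assumption added so that all three definitions connect.
bonds = [
    ("estación:es:1", "station:en:1"),
    ("estació:ca:1", "estación:es:1"),
]


def extract_senses(definition_ids, bonds):
    """Group definitions into connected components using union-find."""
    parent = {d: d for d in definition_ids}

    def find(d):
        # Follow parent pointers to the root, halving the path on the way.
        while parent[d] != d:
            parent[d] = parent[parent[d]]
            d = parent[d]
        return d

    for a, b in bonds:
        parent[find(a)] = find(b)

    groups = defaultdict(set)
    for d in definition_ids:
        groups[find(d)].add(d)
    return sorted(sorted(group) for group in groups.values())


senses = extract_senses(definitions, bonds)
for sense in senses:
    print(sense)
```

Each printed group is one derived "sense". A definition with no bonds simply forms a group of its own, so nothing needs a language-independent key up front.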

I don't agree with Fran, but I do think that translation should also be possible at the lemma level: a sense level translation relationship between (dog, hound)@en and (madra, cú)@ga is fine for a coarse-grained gist, but for fine-grained translation, it should be dog/madra and hound/cú. It can even be more fine-grained than that, with some translations only being valid at the level of individual forms. -- Jimregan (talk) 15:10, 3 August 2013 (UTC)

Yes definitely, it should be possible to define a "translation/equivalence link/bond" at any level. - Francis Tyers (talk) 15:17, 3 August 2013 (UTC)

I actually agree with Francis. I have rephrased the proposal to make it clearer that every sense is connected and depends on exactly one lexeme. Senses are not independent of their lexemes, i.e. words. I hope that counters the worry.

Also agreeing with Jim: statements can be made on every level, i.e. on lexemes, forms, and senses. --Denny (talk) 10:31, 6 August 2013 (UTC)

I would advise against this: certain statements should really only be made on some classes, and it is preferable to enforce these constraints to help data creators get things right. I would say, then, that certain statements such as synonymy should only be made at the sense level, to avoid ambiguous data. Johnmccrae (talk) 15:07, 7 August 2013 (UTC)

Right. I was talking about translation only, and I think the properties should be different, because they express different phenomena: it's possible for a pair of words to be exact equivalents at the lemma level (sharing all senses(/definitions/glosses)) but it's not the common case, and should be expressed as such. That relationship should be between lemmata, but it would be desirable to have sense-level equivalence inferred. In the more specific case, where translations are restricted to certain forms, the relationship should be at the form level, and lemma- or sense-level translational relationships should not be inferred. -- Jimregan (talk) 16:11, 7 August 2013 (UTC)
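The inference rule described in the comment above can be sketched as follows: a lemma-level translation also implies sense-level equivalence, while sense-level and form-level translations imply nothing beyond their own level. This is only an illustration of that rule; the class `Translation`, the function `inferred_levels`, and the use of dog/madra as a lemma-level link are assumptions, not an agreed data model.

```python
from dataclasses import dataclass

LEVELS = ("lemma", "sense", "form")


@dataclass(frozen=True)
class Translation:
    level: str   # one of LEVELS
    source: str  # e.g. "dog@en"
    target: str  # e.g. "madra@ga"


def inferred_levels(link):
    """Return the levels at which this translation link may be assumed to hold."""
    if link.level not in LEVELS:
        raise ValueError(f"unknown level: {link.level}")
    if link.level == "lemma":
        # Exact lemma-level equivalents share all senses,
        # so sense-level equivalence can be inferred.
        return {"lemma", "sense"}
    # Sense-level and form-level links do not generalize to other levels.
    return {link.level}


print(inferred_levels(Translation("lemma", "dog@en", "madra@ga")))
print(inferred_levels(Translation("form", "form-a@en", "form-b@ga")))
```

Keeping the level on the link itself, rather than using one generic "translation" property, is what lets a consumer decide which inferences are safe.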

I don't think definitions should be dependent on lexemes, rather the reverse, the definition should be the point from which everything starts. Consider "slav" (-er) and "slav" (-ar) in Swedish.


Lexical categories are language dependent

Has it been taken into consideration that a lexical category is language dependent ? Thanks, GerardM (talk) 11:01, 6 August 2013 (UTC)
: If you mean that "the set of lexical categories in Japanese is different than the set of lexical categories in Italian", yes, this is taken into account, because a lexical category can be any Wikidata item, it is not a closed, predefined set. --Denny (talk) 14:08, 6 August 2013 (UTC)

Dependent data on a lexical category

In a language a specific lexical category implies other variables .. eg gender and noun in many languages.. This is also very much language dependent. Thanks, GerardM (talk) 11:03, 6 August 2013 (UTC)
: They can be represented as statements, but the initial implementation will not require them. Since this would indeed be useful, once we have a sufficiently large dataset we can optimize the UX by making it easier to enter the data. This is similar to the approach that Nilesh, one of the Wikidata GSoC students, is currently implementing for data, where one property on an item (say "instance of Person") might imply that another property should be filled out (say, "date of birth"), etc., although the implication is even stronger here than for items. --Denny (talk) 14:11, 6 August 2013 (UTC)
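As a rough illustration of "one property implies that another should be filled out", the sketch below keeps a per-language table of suggested statements for a lexical category and reports which ones a lexeme still lacks. The table contents, the property label "grammatical gender", and the function `missing_suggestions` are hypothetical; nothing here reflects an actual Wikidata feature.

```python
# Hypothetical table: (language code, lexical category) -> suggested properties.
# The entries are language dependent, as noted in the question above.
SUGGESTED = {
    ("de", "noun"): ["grammatical gender"],
}


def missing_suggestions(language, category, statements):
    """List suggested properties that are not yet present on a lexeme.

    `statements` maps property labels to values. Suggestions are hints
    for the editing interface, not requirements.
    """
    expected = SUGGESTED.get((language, category), [])
    return [prop for prop in expected if prop not in statements]


print(missing_suggestions("de", "noun", {}))
print(missing_suggestions("de", "noun", {"grammatical gender": "masculine"}))
print(missing_suggestions("en", "noun", {}))
```

Because the table is data rather than code, each language community could maintain its own rows without affecting others.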

"sense" properties

For each sense we should record its date and frequency of use, and also whether another sense of the word is its parent (e.g. through metonymy). This would allow us to sort our definitions by popularity or in chronological order afterwards. JackPotte (talk) 12:26, 6 August 2013 (UTC)
: Yes, that would be great, and can be represented through the statements. --Denny (talk) 14:12, 6 August 2013 (UTC)
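A small sketch of what such sense-level statements would allow, assuming each sense carries a first-attestation year, a relative frequency, and an optional parent sense. The field names and all values below are invented placeholders, not real lexicographic data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Sense:
    gloss: str
    first_attested: int            # year; placeholder values below
    frequency: float               # relative frequency of use; placeholder values
    parent: Optional[str] = None   # gloss of the sense this one derives from


senses = [
    Sense("sense A", 1900, 0.2),
    Sense("sense B", 1500, 0.7),
    Sense("sense C", 1700, 0.1, parent="sense B"),  # e.g. derived by metonymy
]

# Chronological order: oldest attestation first.
by_date = sorted(senses, key=lambda s: s.first_attested)

# Popularity order: most frequent first.
by_popularity = sorted(senses, key=lambda s: s.frequency, reverse=True)

print([s.gloss for s in by_date])
print([s.gloss for s in by_popularity])
```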

Why?

Once again, this is a proposal that does not explain why it is being done or what the practical added value to Wiktionaries could be. If this added value is not demonstrated, then any such proposal would only have a negative effect, because some talented Wiktionary contributors would leave the Wiktionaries to work on Wikidata fruitlessly. This has already happened: e.g. Kipmaster left en.wikt for Omegawiki. Less ambitious proposals, such as a common database for a picture dictionary, seem more promising. Lmaltier (talk) 16:56, 7 August 2013 (UTC)

: I just chose some entry, bake. On English Wiktionary, it contains the simple past 'baked' or 'book' (dialectal, Northern England). The German Wiktionary contains 'baked'. The French, Italian, Polish, and Croatian Wiktionaries do not contain any of the simple past forms. If this were available in Wikidata, those Wiktionaries could choose to use this information from Wikidata, if they so wanted. They could provide more, and more complete, information than they currently do, while at the same time reducing their maintenance effort.
: Also, since the goal of Wikidata is to support the other Wikimedia projects, like Wiktionary, contributors who partially move to Wikidata still contribute to Wiktionary, and potentially in a broader way (by making it accessible to all Wiktionary language editions).
: Again: This structured data from Wiktionary would be stored in Wikidata primarily to support the Wiktionaries. If none of the Wiktionaries choose to use the data, I expect that this data will not be entered into Wikidata.  --Denny (talk) 04:57, 8 August 2013 (UTC)

:: Denny, after the Wikimania meeting, I, too, am still bumping up against how this might benefit the [language] Wiktionary project. This is the element of the proposal I least understand.

:: As I understand the proposal, which may be faulty, this proposal suggests Wiktionaries store separate lexemes in wikidata. These would not be directly editable via the Wiktionary, and would require extensive re-development of a given project's presentation template systems. Editorial control over the entries would be within the Wikidata project, as would curation, although Wiktionaries would have contributor access to the content of course.

:: There seems to be very little benefit to the Wiktionary project, nor does it further its mission (and may actively harm it.) Conversely, it would add extensive structured content to Wikidata, and force wiktionarians to work on that project. (I realize there are many potential real benefits to readers and 3rd party softwares, but I am addressing this solely from the point of view of Wiktionarians as a community.) - Amgine (talk) 14:51, 10 August 2013 (UTC)

::: Sorry, I thought the advantage of structured data for a dictionary would be self-evident. Basically every printed dictionary today is based on structured data, it is an amusing coincidence of history that Wiktionary is not. Here is an example for the advantage of structured data in Wiktionary:
::: Take the Hungarian word “ház” (translated to “house” in English). As you can see on the English Wiktionary, “ház” has 46 different forms, or declensions, like the plural allative, the first person singular possessive, etc. These forms are all given in the actual wikitext. The Hungarian, English, and Chinese Wiktionaries have all 46 of these declensions. Bosnian, French, and Korean contain 34 of the forms, but not the possessives. German has 24 forms, some of them possessives. I did not check whether the actual forms are different, but there is obviously nothing built into the current system to facilitate or enable that – if at all, bots might be checking the consistency of these forms over the different language versions; I do not know if this is happening.
::: In Estonian, Basque, Farsi, Suomi, Ido, Italian, Limburgian, Latvian and 11 further languages, the word “ház” has an entry, but it lacks the conjugated forms of the word completely. They are simply not there.
::: In Japanese, Spanish, Arabic, Hebrew, Croatian, Uzbek, and the 126 other languages of Wiktionary the word simply does not have any entry; for readers in those languages the knowledge is completely hidden.
::: If Wikidata supported Wiktionary, it could offer the different word forms. In this case, a template like “Hungarian noun conjugation” could, if the community wanted, query Wikidata to get these forms and display them. This means that, once the data is in Wikidata, the Farsi and Italian Wiktionaries could simply add that one template to their entry, and it would display all forms. Furthermore they would not need to worry that someone comes in and vandalizes a single form on a Hungarian word, which might remain undiscovered for a long time. It also avoids repeating the 46 different word forms on potentially 158 different wikis.
::: Each Wiktionary decides for every single word whether they want to use Wikidata, and how much of Wikidata’s data they want to use in this case. If the English Wiktionary decides not to use Wikidata for English verbs in order to keep their own control and processes on it, that’s absolutely OK. They might even do something like the English Wikipedia: instead of simply asking Wikidata for a value and displaying the response, they keep their own data in their local templates, but nevertheless ask Wikidata. They then compare whether Wikidata and their local data agree, and if not, the entry is put in a hidden maintenance category. This way an additional layer of quality assurance – both for Wikipedia and for Wikidata – is created, and the probability of effective vandalism is reduced tremendously. English Wikipedia does that, for example, for IMDB identifiers.
::: I think the example for “ház” demonstrates the immediate benefit such a system would have for the Wiktionaries.
::: I hope my answer was understandable. As said, I am sorry for not explaining it earlier, but I seem to be already so deep into this stuff that I might imagine some things to be self-evident, when, obviously, they are not. On the other hand you mention that Wikidata might potentially harm Wiktionary and its users, and now that I offered my answer, I obviously would also like to see yours. --Denny (talk) 15:30, 12 August 2013 (UTC)
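The "keep local data but compare it with the central data" workflow described above could look roughly like the sketch below. It only models the comparison step on plain dictionaries; how a template would actually fetch forms from Wikidata is not specified in the proposal, so no real API is used here. The function names, the feature labels, and the misspelled local value "hazak" are assumptions for illustration.

```python
def compare_forms(local_forms, central_forms):
    """Compare a wiki's locally stored forms with the centrally stored ones.

    Both arguments map a grammatical feature label to a word form.
    """
    shared = set(local_forms) & set(central_forms)
    return {
        "mismatched": sorted(k for k in shared if local_forms[k] != central_forms[k]),
        "missing_locally": sorted(set(central_forms) - set(local_forms)),
        "missing_centrally": sorted(set(local_forms) - set(central_forms)),
    }


def needs_maintenance(report):
    """True if the entry should go into a hidden maintenance category."""
    return any(report.values())


# Two forms of "ház" as they might be stored centrally.
central = {"nominative plural": "házak", "accusative singular": "házat"}

# Local template data with one damaged form (hypothetical) and one form absent.
local = {"nominative plural": "hazak"}

report = compare_forms(local, central)
print(report)
print(needs_maintenance(report))
```

The local wiki keeps displaying its own data throughout; the comparison only decides whether a human should take a look.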

:::: Well, which part of your proposal would you like me to praise?
::::: The part that you find praise-worthy, obviously. And suggest improvements on the rest. That would be my favorite outcome. --Denny (talk) 18:14, 13 August 2013 (UTC)

:::: If, on the other hand, you would like me to attempt constructive criticism, please let me make the following points (based on my possibly faulty understanding):
::::* This project will draw its data from all the wiktionaries. The wiktionaries can do this themselves, and do.
:::::: This is incorrect: as shown for the "ház" example, they do not. A vast majority of Wiktionaries do not even have an entry. Of those that do, a vast majority lack basic things like conjugations.
:::::: Also, I am not completely sure what you mean by "This project will draw its data from all the Wiktionaries." Nowhere in the proposal is such a thing said, and it is up to the community to decide which data sources they want to use. --Denny (talk) 18:14, 13 August 2013 (UTC)

::::* This project will curate the data. This is something the wiktionaries currently do, and so it will either duplicate effort or it will divide available contributor efforts. An alternative is to regularly (continuously?) monitor sources for updates, or use dumps as sources (which has the minor drawback of not reflecting the latest status.)
:::::: This is mostly incorrect. Since most of the Wiktionaries lack most of the data, they also do not provide curation for them. Instead, they could join forces and share curation effort and energy if they so choose. This will probably remain irrelevant for the English, French, or German Wiktionary, but I see about a hundred Wiktionary projects who might welcome this form of cross-lingual collaboration. --Denny (talk) 18:14, 13 August 2013 (UTC)

::::* This project will not undertake presentation systems. Wiktionaries already have presentation systems, but will need to develop additional systems to use wikidata's repository.
:::::: This is correct - as they already do with templates for conjugations, eg. --Denny (talk) 18:14, 13 August 2013 (UTC)

:::: Up to this point I believe there is no value to the wiktionaries, but these next points have specific value.
:::::: So what about the conjugations of "ház"? Why are they not of any value to the Wiktionaries? --Denny (talk) 18:14, 13 August 2013 (UTC)

::::* This project likely will develop a database of metadata richer than is available in any one language's content.
::::* This project's metadata will be more 'findable', for display on wiktionaries.
:::::: I do not understand what you mean here with 'metadata'. --18:14, 13 August 2013 (UTC)

:::: The second and third points represent, in my opinion, serious costs to the wiktionary communities, from their point of view. Curation is extremely important to most of the wikis; if you expect the wiktionaries to maintain the data I believe wikidata must provide a flexible interface for them to do so on their own terms (I also assume this is already planned; I'm mentioning it because I do not know how you plan to implement your project.) The cost of developing and maintaining the display templates and modules is very high for projects with small contributor communities. Each has different priorities and layouts, and one-size solutions may not be possible.
:::: Another note about curation: who gets to decide what should, or should not, be included? That is, who has deletion rights? The criteria for inclusion in Wiktionary are not consistent across wiktionaries, and there are standing disputes within at least the English wiktionary. There are strong disagreements regarding what constitutes a distinct language within the project; for example, on the English wiktionary most forms of languages/dialects related to Croatian-Serbian - regardless of script - are collapsed into the unique localization language code 'sh', this despite the existence of Bosnian, Croatian, Serbian, and related wiktionaries. I have no idea how you might handle this question, but it's sure to be raised.
:::: - Amgine (talk) 00:34, 13 August 2013 (UTC)
:::::: Indeed, every Wiktionary would keep complete autonomy over what it chooses to include and what it does not want to include. Most of the display templates are already available, they would only need to be extended to use Wikidata instead of / in addition to local data, if the given community so wishes. The proposal does not suggest to create one template that fits all entries in all languages. The creation of display templates is done completely autonomously by each community. The proposal does not suggest to create entries into a Wiktionary, if they do not have them already. The proposal does not suggest to force any Wiktionary to agree on the existence of four different south slavic languages. The proposal does not suggest that local curation is impossible. The data is available, but it does not force itself into any project - it has to be explicitly included and displayed. Therefore the ultimate question of curation remains with the local Wiktionary community. And I fail to see why you assume any of these problems you raise would be the case given the proposal. The proposal is rather short, and for any of your conclusions I do not see where the proposal would suggest to do so.
:::::: All the proposal does is to suggest a data structure for lexicographic data, which then can be used by the Wiktionaries if they so want. The question is, is this data structure suitable to fulfill this task.
:::::: I am completely disagreeing with you on the cost / benefit calculation of the project for the Wiktionaries, and I repeatedly point to the one example of "ház" and its conjugations. I think that it would be beneficial, especially for smaller Wiktionaries, to provide access to a structured data source offering these conjugations for inclusion into their project. And my one question is: do you think that this would be beneficial too? --Denny (talk) 18:14, 13 August 2013 (UTC) 
::::::: I desperately want you to continue this work. I seem unable to communicate to you that what you are doing will have costs to the wiktionaries, which is for me a fundamental concept to any conversation about this project. And since you said below that I am wrong even though I am right (that is, you said I am right and you said I am wrong), I will stop communicating on this project with you. - Amgine (talk) 19:11, 13 August 2013 (UTC)
:::::::: And I am trying to understand you, but I seem to fail. Maybe we should try on IRC? Also, are you taking into account the costs for the Wiktionaries of *not* doing this project? --Denny (talk) 04:41, 14 August 2013 (UTC)
::::::::: Zero, as that would not alter their status quo. - Amgine (talk) 05:24, 14 August 2013 (UTC)
:::::::::: So keeping the Wikipedia language links the way they were last year would also have a cost of zero (it was the status quo), whereas switching to centralized language links had a higher cost. Or put differently, any change, no matter how beneficial, has costs. Whereas keeping the status quo has no costs (which I find a questionable assumption). Is that what you are saying? --Denny (talk) 19:03, 14 August 2013 (UTC)
:::::Re inclusion criteria: For items connected to Wikipedia articles, there's a general rule that if any Wikipedia judges it includable (ie has an article on it), it's includable in Wikidata. See WD:N. That could probably work for Wiktionary as well. --Yair rand (talk) 05:13, 13 August 2013 (UTC)
:::::: The problem with this model, Yair, is that it exercises de facto editorial control over the wiktionaries. To give an entirely mythical example, if wiki A says possessive inflections are not words, yet they exist on wiki B, they will be included in any output of inflections of a term - including automated population templates on wiki A. If wiki A does not restrict the inclusion of neologisms, they will be found using the search on wiki B as "words which don't exist here yet" - redlinks which are the most common entry point for wiktionary contributors - or which will display with the wikidata content for language A ("would you like to view this word in A?") 
:::::: The simplest way of avoiding this issue is to use the already-present metadata of whether the word exists in the requesting wiki and if not omit it from the results. Unfortunately, that substantially reduces the value of this project for the wiktionaries. Which is why I asked the questions. - Amgine (talk) 15:48, 13 August 2013 (UTC)
:::::: "To give an entirely mythical example, if wiki A says possessive inflections are not words, yet they exist on wiki B, they will be included in any output of inflections of a term." -> This is simply and completely wrong. And I do not know why you think that. If wiki A does not want to include possessive inflections, it does not have to include possessive inflections. That is like claiming that every Wikipedia article has to include every piece of data that Wikidata has about the associated item. No, this is simply wrong. Any and each piece of data will only be included if it is explicitly requested. If your request is for everything, and everything is displayed, ah well, yes, then everything is displayed -- but I would be extremely surprised if such a blunt template would survive more than 2 hours on any project.
:::::: I will say it again, and again: every project is completely autonomous regarding the amount of data, and if any, they want to use and display or peruse for any other purpose. Every Wiktionary can decide for every single page how much exactly they want to take from Wikidata and display, if they want to take anything at all. I fail to understand why you believe that the proposal is suggesting otherwise. --Denny (talk) 18:27, 13 August 2013 (UTC)
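The point that data is only included when explicitly requested can be sketched as a simple selection step: a wiki's template names the features it wants, and everything else stays out of the output. The function `select_forms` and the feature labels are hypothetical and do not describe an actual template interface.

```python
def select_forms(central_forms, requested):
    """Return only the explicitly requested forms, in the requested order."""
    return {k: central_forms[k] for k in requested if k in central_forms}


# Centrally stored forms of "ház", including a possessive.
central = {
    "nominative plural": "házak",
    "accusative singular": "házat",
    "first person singular possessive": "házam",
}

# Wiki A does not treat possessive inflections as words, so its template
# simply never asks for them.
wiki_a_request = ["nominative plural", "accusative singular"]

print(select_forms(central, wiki_a_request))
```

Under this model the central store never pushes a form into a page; the requesting community's template is the only gate.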

Userscript which already provides similar functionality

I have already developed a userscript which provides certain interwiki functionality by querying the wiktionary api for interwikilinks. See more at Show interwikis for non-existing words. -- Stratoprutser (talk) 18:45, 10 August 2013 (UTC)

: Thanks, we'll take a look! Although we would need to implement it server-side, but that's a good start! --Denny (talk) 15:44, 12 August 2013 (UTC)

About rhymes

In the proposal we find (statement) rhymes with -> paddles (F404). I don't like this idea. Instead:
* the lexeme could have one or several pronunciation forms, which could have IPA and sound sample representations, 
* and the usual written form may have statements "is pronounced". 

Rhymes may be inferred from pronunciation forms, and you could adapt the scope with more criteria, for example "what rhymes with the English written word representation "apple" in Quenya?" --Psychoslave (talk) 15:18, 12 August 2013 (UTC)
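A minimal sketch of inferring rhymes from pronunciation forms instead of storing "rhymes with" statements: take the IPA transcription, keep everything from the vowel of the last primary-stressed syllable onward, and compare those tails. This is a deliberate simplification (it ignores secondary stress, diphthong symbols outside the small vowel set, and language-specific rhyme conventions), and the function names are assumptions.

```python
# Small vowel inventory, sufficient for the examples below only.
VOWELS = set("aeiouæɑɒɔəɛɪʊʌ")


def rhyme_part(ipa):
    """Return the transcription from the last primary-stressed vowel onward."""
    ipa = ipa.strip("/").replace(".", "")
    # Everything after the last primary stress mark (or the whole string).
    tail = ipa[ipa.rfind("ˈ") + 1:]
    for i, ch in enumerate(tail):
        if ch in VOWELS:
            return tail[i:]
    return tail


def rhymes(ipa_a, ipa_b):
    """Two different pronunciations rhyme if their rhyme parts are identical."""
    return ipa_a != ipa_b and rhyme_part(ipa_a) == rhyme_part(ipa_b)


apple = "/ˈæp.əl/"
dapple = "/ˈdæp.əl/"
paddle = "/ˈpæd.əl/"

print(rhymes(apple, dapple))
print(rhymes(apple, paddle))
```

Since the comparison works on transcriptions only, restricting the candidates to another language's pronunciation forms would answer cross-language questions like the one above.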

: That is an example of a possible property to use. It's up to the community to actually decide which properties to use. --Denny (talk) 15:45, 12 August 2013 (UTC)