Wikidata talk:Lexicographical data/Archive/2014/03

This page is an archive. Please do not modify it. Use the current page, even to continue an old discussion.

Good proposal, some suggestions

After the conversation a few days ago I would like to endorse this proposal, as it seems a good starting point for Wiktionarians to share their work with other languages effectively when needed (and wanted) without forcing any rigid structure upon them. From the usability point of view I still have the feeling that there are some unaddressed topics (like how to navigate from one lexeme to the next that has similar forms, or how to link from Wikidata to Wiktionary and back), but that is definitely solvable. On the other hand, I think it is a good idea to leave etymology and pronunciation out of the initial stages and see how it goes. If the community wants it, it should be possible to add it later on.--Micru (talk) 21:30, 12 August 2013 (UTC)

Those are good questions. Especially since, unlike for Wikidata items and Wikipedia, we won't have sitelinks. This will also be interesting for Wikipedia, as soon as arbitrary item access is enabled. Thanks, we will definitely need to think about that! --Denny (talk) 18:51, 13 August 2013 (UTC)
I have started a new conversation about how to display links to/from sources in Wikipedia. It might be useful to collect some ideas on the topic to be ready when that might happen.--Micru (talk) 13:32, 14 August 2013 (UTC)
Basic etymology can easily be expressed by statements connecting two lexemes, or sometimes a lexeme and a form. E.g. Dutch: paard developed from Middle Dutch: pard. --46.115.115.74 21:32, 25 March 2014 (UTC)

From time to time, here and elsewhere, people (who are often not Wiktionarians) make the naive, linguistically unsound proposal that Wikidata link senses in one language and edition of Wiktionary (e.g. wikt:en:cat) to senses in another language and edition (e.g. wikt:de:Katze). The flawed nature of these proposals and the opposition they generate mean they don't go anywhere ... but neither does anything else. Wiktionarians have pointed out repeatedly that Wikidata could handle Wiktionary's main-namespace interwiki links using the same simple logic that powers the bots which currently maintain the links on each individual Wiktionary: "check which Wiktionaries have pages titled X, and link each one of them to all of the others". Interwiki links between content-having pages and redirects would continue to function, as they currently do, to connect en.Wikt pages which use ' as the basic apostrophe to fr.Wikt pages which use ’, in the way this table by wikt:User:TAKASUGI Shinji outlines:

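                              fr.Wikt                  en.Wikt
Unicode apostrophe (U+2019)   aujourd’hui (page)       aujourd’hui (redirect)
ASCII apostrophe (U+0027)     aujourd'hui (redirect)   aujourd'hui (page)

(Interwiki links connect the fr.Wikt and en.Wikt entries in each row; within each wiki, the redirect leads to the page.)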

And interwiki links between non-main-namespace pages could continue to be spelled out the way links between articles in different editions of Wikipedia are spelled out. -sche (talk) 06:56, 7 March 2014 (UTC)
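
The title-matching logic quoted above ("check which Wiktionaries have pages titled X, and link each one of them to all of the others") can be sketched roughly as follows. This is a minimal illustration, not the code of any actual interwiki bot: the function name interwiki_links and the sample data are assumptions, and redirects are simply treated as pages that exist.

```python
from collections import defaultdict


def interwiki_links(titles_by_wiki):
    """Map (wiki, title) to the sorted list of other wikis with the same title.

    titles_by_wiki: dict of wiki code -> set of main-namespace titles
    (redirects included, since they are linked like any other page).
    """
    wikis_by_title = defaultdict(set)
    for wiki, titles in titles_by_wiki.items():
        for title in titles:
            wikis_by_title[title].add(wiki)

    links = {}
    for title, wikis in wikis_by_title.items():
        for wiki in wikis:
            links[(wiki, title)] = sorted(wikis - {wiki})
    return links


# Hypothetical sample data, not a dump of any real Wiktionary.
sample = {
    "en": {"cat", "aujourd'hui", "aujourd\u2019hui"},
    "fr": {"cat", "aujourd'hui", "aujourd\u2019hui"},
    "de": {"Katze", "cat"},
}
links = interwiki_links(sample)
```

Titles are compared exactly, so the U+0027 and U+2019 spellings stay two separate titles, each linked to its identically spelled counterpart, as in the table.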

I definitely support this, and I don't understand why it hasn't been done by now. CodeCat (talk) 19:14, 23 March 2014 (UTC)

At least

There is no need for any magic word as an ID in Wiktionaries, since in ns0 the title is unique and required anyway. Even if a page is a redirect to another article (such as some articles written in x-systemo in eo, or the above-mentioned aujourd'hui), it needs to exist, since some Wiktionaries may have that title as their main entry while others redirect there.

(unless we are planning to dictatorially force every Wiktionary to hold a common opinion about which article names are relevant and which are not)

So the key point here is that Wikidata should include all lemmas in ns0, with an ID identical to the article's name. No magic "(lexeme) W12389" is needed. That database should at least provide the interwiki ID for all Wiktionaries that have such an article at any given moment.

Whether any of these articles becomes a base morpheme (you can name it "lexeme", "Wikidata ID" or whatever you like) and "catches" the above-mentioned characteristics is indeed irrelevant in the first phase.

And if, for example, the Polish Wiktionary does not want to use that database because of its opinion about Russian empty lemmas, then let it not use it.

Including an indication, for every lemma (article and database ID will all be the same), of whether it is a redirect in a specific Wiktionary will be the second step.

After all that is done we can continue debating which characteristics should be included, how they will be included or named, etc.

In a very special case, all Wiktionaries will, in the end, have all lemmas, regardless of what their article contains (a redirect, a base article, etc.).

Whether et:grain of salt exists while en:grain of salt redirects to something else is irrelevant.

After all, these "article names", used as IDs, can easily be related to any sophisticated magic "lexeme", "Wikidata ID" (or whatever you like) by simply adding a field afterwards. --Xoristzatziki (talk) 06:08, 13 March 2014 (UTC)
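
As a rough sketch of the two steps described above (title as ID first, per-wiki redirect indication second), one could keep an index keyed by article title. The function name build_title_index, the record layout, and the sample rows are assumptions made for illustration; the rows only mirror the examples used in this discussion and do not describe the actual state of any wiki.

```python
def build_title_index(pages):
    """Build title -> {wiki: "page" or "redirect"}.

    pages: iterable of (wiki, title, is_redirect) tuples.
    The ns0 title itself serves as the ID; no separate identifier is used.
    """
    index = {}
    for wiki, title, is_redirect in pages:
        index.setdefault(title, {})[wiki] = "redirect" if is_redirect else "page"
    return index


# Hypothetical rows modelled on the examples mentioned in the thread.
pages = [
    ("fr", "aujourd\u2019hui", False),
    ("en", "aujourd\u2019hui", True),
    ("fr", "aujourd'hui", True),
    ("en", "aujourd'hui", False),
    ("et", "grain of salt", False),
    ("en", "grain of salt", True),
]
index = build_title_index(pages)
```

A further field (for example a lexeme ID) could later be attached to each title without changing this layout.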

Thoughts on Edge Cases

Yesterday I spent some time talking to Purodha about this proposal. Purodha is a mathematician, programmer, linguist and long-time Wikimedian, and he had quite a few questions and ideas, especially about nasty edge cases. I'm happy that he seemed to like the proposal in general. I'll try to give a brief account of what we discussed:

  • It should be possible for a Lexeme to have multiple Lemmas, to cover different spellings and language variants.
    • A Lemma may correspond to a Form - for German verbs, the infinitive form would be used; English verbs may include the "to" prefix in the lemma (or not). Some languages, like Ancient Greek or Vedic, may use placeholders in the dictionary form, which then does not correspond to any concrete form of the word.
  • For some words, the "senses" may not have a definition, but rather a "functional description", e.g. "this is used by older brothers to refer to themselves in conversation with their younger sister".
  • Such functional descriptions should ideally be machine readable, that is, refer to Wikidata Items.
  • A Form's "lexical property" is really also a functional description (e.g. "second person plural")
  • Bots may generate Forms based on the lemma(s).
  • Forms as well as Senses may be qualified to mark them as restricted to a specific region, time, social group, etc.
  • It's very important to have lexeme-specific senses, and not try to make lexical senses match across languages; instead, we should link lexical senses with each other (as synonyms/translations) and/or with Q-Items (as "referring to").
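
The points above could be modelled along the following lines. This is only a sketch under stated assumptions: the class and field names are invented for illustration and are not the proposal's actual schema, and "Q-PLACEHOLDER" stands in for a real Item ID.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Form:
    representation: str
    # Functional description such as "second person plural"; ideally Item IDs.
    features: List[str] = field(default_factory=list)
    # Restrictions such as region, time or social group.
    qualifiers: Dict[str, str] = field(default_factory=dict)


@dataclass
class Sense:
    # A definition, or a functional description of how the word is used.
    description: str
    # Q-Items this sense refers to (placeholder IDs in this sketch).
    refers_to: List[str] = field(default_factory=list)
    # IDs of senses of other lexemes, linked as synonyms or translations.
    linked_senses: List[str] = field(default_factory=list)
    qualifiers: Dict[str, str] = field(default_factory=dict)


@dataclass
class Lexeme:
    language: str
    # Several lemmas cover different spellings and language variants.
    lemmas: List[str]
    forms: List[Form] = field(default_factory=list)
    senses: List[Sense] = field(default_factory=list)


colour = Lexeme(
    language="en",
    lemmas=["colour", "color"],
    forms=[
        Form("colours", ["plural"], {"region": "British English"}),
        Form("colors", ["plural"], {"region": "American English"}),
    ],
    senses=[Sense("property of light as perceived by the eye",
                  refers_to=["Q-PLACEHOLDER"])],
)
```

Senses belong to one lexeme only; cross-language correspondence is expressed through linked_senses and refers_to rather than by sharing a sense between lexemes.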

What should be treated as "one lexeme" with different forms and senses, and what should be multiple lexemes, is not always clear.

  • As a rule of thumb, words that not only share a spelling but also share morphology should be described by the same lexeme entry - e.g. "sleeper" the spy and "sleeper" the part of a rail way.
  • E.g. "well" the adjective and "well" the noun would not be the same lexeme, because they have different morphology.
  • However, whether "hair" meaning a hair style (as in "I had my hair done") should be covered by the same lexeme that covers the individual hair (with "hairs" as the plural), or whether there should be a separate lexeme for the singulare tantum case, is not entirely clear.
  • Similarly, "die See" (the sea) and "der See" (the lake) in German are both nouns, but differ in grammatical genus (and also, "die See" is singulare tantum, there is no plural).
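
The rule of thumb above (same spelling and same morphology means same lexeme) can be illustrated with a strict grouping key. This is a sketch, not a decision procedure: the key used here (language, spelling, part of speech, gender, forms) and the function name group_into_lexemes are assumptions, and the form lists are simplified. A key this strict would also mechanically split the unclear "hair" case, which the discussion leaves open.

```python
from collections import defaultdict


def group_into_lexemes(records):
    """Group word records that share spelling and morphology."""
    groups = defaultdict(list)
    for r in records:
        key = (r["language"], r["spelling"], r["pos"],
               r.get("gender"), tuple(r["forms"]))
        groups[key].append(r["gloss"])
    return dict(groups)


records = [
    {"language": "en", "spelling": "sleeper", "pos": "noun",
     "forms": ["sleeper", "sleepers"], "gloss": "spy"},
    {"language": "en", "spelling": "sleeper", "pos": "noun",
     "forms": ["sleeper", "sleepers"], "gloss": "part of a railway"},
    {"language": "en", "spelling": "well", "pos": "adjective",
     "forms": ["well", "better", "best"], "gloss": "in good health"},
    {"language": "en", "spelling": "well", "pos": "noun",
     "forms": ["well", "wells"], "gloss": "water source"},
    {"language": "de", "spelling": "See", "pos": "noun", "gender": "f",
     "forms": ["See"], "gloss": "sea"},
    {"language": "de", "spelling": "See", "pos": "noun", "gender": "m",
     "forms": ["See", "Seen"], "gloss": "lake"},
]
lexemes = group_into_lexemes(records)
```

With this data the two "sleeper" records fall into one group with two senses, while "well" and "See" each split into two groups.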

As a side note, we should try to be compatible with the LEMON model; it seems very close to what we want.

-- Duesentrieb (talk) 13:51, 14 March 2014 (UTC)

Duesentrieb: Bot-generated Forms based on the lemma(s) are exactly what we called "paradigms" back in the Jul-13 proposal; however, those automatically generated forms could be hard to manage/visualize/access. A new datatype for datasets might be of help here (Bugzilla62555), or at least the large number of tables on Wiktionaries suggests that. Overall, how the information is stored/distributed in WD is going to matter much less than simplifying the data presentation/interaction for users on Wiktionary. Has there been any progress on the VisualEditor-Wikidata-Infoboxes/templates front?--Micru (talk) 15:06, 14 March 2014 (UTC)
I agree that a tabular presentation would be nice, but I'm not sure how to best do that without hardcoding assumptions about what forms exist and how they relate to each other into the software. It would somehow have to be managed on-wiki by the community. Perhaps we can use Lua-based templates...
Integration with the VisualEditor is not around the corner, but we are working towards this goal by making it easier to generate infoboxes based on Wikidata. User:Hooman's Capiunto extension should help a great deal with that. I hope we can deploy it soon, and demo and discuss it at the Hackathon in May. -- Duesentrieb (talk) 08:49, 17 March 2014 (UTC)
I believe Extension:TemplateData could help here. I do not believe that we can or should do without a common but language-specific tabular representation of (some) grammatical forms. Users must be able to override all automatically generated forms, and believe it or not, over the years we will have to accommodate language change without losing previous information. -- Purodha Blissenbach Discussion  21:52, 25 March 2014 (UTC)
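
The requirement that users can override automatically generated forms might look roughly like this. It is a minimal sketch: the toy plural rule, the function names generate_forms and effective_forms, and the slot names are assumptions for illustration and do not describe Extension:TemplateData or any existing bot.

```python
def generate_forms(lemma):
    """Toy generator: regular English noun plural only (a deliberate simplification)."""
    if lemma.endswith(("s", "x", "z", "ch", "sh")):
        plural = lemma + "es"
    else:
        plural = lemma + "s"
    return {"singular": lemma, "plural": plural}


def effective_forms(lemma, overrides=None):
    """Start from generated forms, then let user-supplied overrides win."""
    forms = generate_forms(lemma)
    forms.update(overrides or {})
    return forms


regular = effective_forms("cat")
irregular = effective_forms("child", {"plural": "children"})
```

Keeping generated values and overrides separate means the generator can be rerun without discarding manual corrections.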