Wikidata talk:Wiktionary/Development/Proposals/2013-02

Language of item

As far as I understand, each item is an item in one specific language. If you want to translate a term (or an item), you need to create a new item. If this is true, I think the language itself should be a required property, and it should be possible to link items in different languages that describe the same term. --Faux (talk) 09:30, 18 February 2013 (UTC)

IPA

I don't think including IPA will be very useful. Different Wiktionaries have different needs concerning IPA. For example, on English Wiktionary, English words have broad phonemic transcriptions that are targeted to native speakers, and only rarely include narrow phonetic information. But other languages often have narrower transcriptions that are more suited to non-native speakers. Not everyone might agree on a transcription either. Dutch words on the Dutch Wiktionary usually do not have length marks in the IPA, but on the English Wiktionary they do. CodeCat (talk) 19:48, 18 February 2013 (UTC)

As I understand it, IPA provides two notations: a more precise one between square brackets [], and a less precise one between slashes //. If a word in a language has several pronunciations in use, we should include them all. Including IPA would not only be extremely useful for the Wiktionaries; think about all the applications it would enable: a "wikipronunciation" focusing on pronunciation usage, and of course it would be very useful for developing language recognition/synthesis. --Psychoslave (talk) 22:00, 9 March 2013 (UTC)
Phonetic transcriptions - the ones in square brackets - need to be labelled with the IPA version. Since IPA has been altered considerably over time, they're hardly usable otherwise.
Phonemic notations - the ones between slashes - are not IPA although they may make use of some IPA symbols. They are in principle completely useless without the accompanying description of the phoneme system plus notational conventions being used, which of course are both highly language specific and quite often vary from one author to another. Even if they are printed in square brackets, dictionaries usually list phonemic notations that should be between slashes.
If we want to include IPA versions or phonemic notations, we need to make sure to also include the needed references. That is the IPA Version (a year afaik) for IPA, or a link to a web page having some prose on the phonemic notation system or conventions used in this transcription.
--Purodha Blissenbach Discussion  00:45, 20 June 2013 (UTC)
Those would be pages like wikt:fr:Annexe:Prononciation/français. Darkdadaah (talk) 14:21, 21 June 2013 (UTC)
In slashes: phonemes, in brackets: phones. The pronunciation in brackets only applies to a specific dialect whereas the pronunciation in each dialect can be derived from the phoneme notation. —PοωερZtalk 02:23, 20 June 2013 (UTC)Reply
Please provide references; the previous posts disagree on the actual use of IPA. ;) Jeblad (talk) 13:46, 21 June 2013 (UTC)
Could you be more specific? IPA is a phonetic alphabet, not a phonemic. —PοωερZtalk 16:37, 22 June 2013 (UTC)Reply
IPA is designed for both uses. As a phonetic alphabet, it is used to precisely describe spoken language in a particular situation by a particular speaker. A different native speaker may say the same thing, but produce a different phonetic sequence in speech. This is easiest to understand in terms of dialects, so that an American and a British speaker will pronounce the word water very differently, but it goes beyond that. Even people who speak the same dialect may produce different phonetic forms of a word, and a single individual might use different phonetic forms in different contexts. For example, see the English Wiktionary entry on hello, where there are multiple pronunciation files because phonetic pronunciation is context dependent. Now, at the same time, all of these phonetic forms of hello can be represented phonemically by a single common sequence. That is, IPA can be (and is) used with the symbols standing in for particular common phonemes (sounds) in the language. How these phonemes sound will depend upon the particular dialect, speaker, etc., but they're nearly always the same sequence of phonemes regardless of the speaker. This is what the Oxford Pronouncing Dictionary does, and what the English Wiktionary does with IPA, but it is not what all the Wiktionaries are doing. --EncycloPetey (talk) 00:45, 24 June 2013 (UTC)

The more widely we share data, the more important specific standards are. I second Purodha’s suggestions about referring to IPA versions. Also romanization and other transliterations should refer to authorities like ISO, BGN, ALA–LC, UNGEGN, etc., and their specific versions.

Start standardizing, Wiktionaries! Michael Z. 2013-06-22 17:53 z

That's great to propose, but seriously difficult to implement. Romanization and transliteration are dependent upon the language into which they are transcribed. Russian words transcribed for an English reader will not be the same as for a German reader. There are different systems for different languages. --EncycloPetey (talk) 00:45, 24 June 2013 (UTC)

We also need to keep in mind, should this go forward, that each pronunciation will be tied to (1) a particular language, and possibly (2) a particular region and/or dialect. Castilian Spanish and Mexican Spanish do not sound the same. Neither do Australian English and Indian English, or Portuguese in Portugal and Brazil, or even the two primary dialects of Croatian, or the various forms of Sicilian or Albanian. There would need to be a clear and consistent method of marking this information. The minimum we've been doing on the English Wiktionary is (a) to mark the accent of the pronunciation (dialect is too narrow a word for this) explicitly in the Pronunciation section, and also (b) to label all ogg audio files with a standard set of prefixes, where the first prefix is the ISO language code, followed (when needed) by the country code of the speaker recording the file. So we have en-us- for English pronunciation files recorded by Americans, pt-br- for Brazilian Portuguese, and la-cls- for Latin files recorded using classical pronunciation. Yes, there are also temporal aspects to consider, and even this set of prefixes can't handle variations like Geordie English. Suffice it to say, a single pronunciation and a single audio file entail a lot of data besides simply the sound and any IPA transcription associated with the recording. --EncycloPetey (talk) 00:45, 24 June 2013 (UTC)
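
The prefix convention described above could be read back out of a file name roughly as follows. This is only a sketch: the function name `parse_audio_prefix`, the `KNOWN_QUALIFIERS` set and the sample file names are assumptions made for illustration, not an existing Wiktionary tool, and a real list of qualifiers would have to be maintained somewhere.

```python
from typing import NamedTuple, Optional


class AudioPrefix(NamedTuple):
    language: str             # first prefix, e.g. "en", "pt", "la"
    qualifier: Optional[str]  # optional second prefix, e.g. "us", "br", "cls"
    rest: str                 # whatever follows the prefixes (usually the word)


# Hypothetical registry of second-level prefixes (country codes and "cls").
KNOWN_QUALIFIERS = {"us", "uk", "br", "cls"}


def parse_audio_prefix(filename: str) -> AudioPrefix:
    """Split 'en-us-hello.ogg' into language, optional qualifier and rest."""
    stem = filename.rsplit(".", 1)[0]   # drop the file extension, if any
    parts = stem.split("-", 2)          # at most: language, qualifier, rest
    language = parts[0].lower()
    if len(parts) == 3 and parts[1].lower() in KNOWN_QUALIFIERS:
        return AudioPrefix(language, parts[1].lower(), parts[2])
    # No recognised qualifier: everything after the language is the rest.
    return AudioPrefix(language, None, "-".join(parts[1:]))


print(parse_audio_prefix("en-us-hello.ogg"))
print(parse_audio_prefix("la-cls-palma.ogg"))
```

As the comment notes, such a scheme still cannot express accents like Geordie English or the date of a recording; those would need separate fields.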

Clearly, the spelling-interwiki-links in Wiktionary are different from Wikipedia links. Essentially, all they require is a lookup in the merged, case-sensitive title list of the union of all Wiktionaries. However, and I think this is presently underestimated, many semantic Wikidata items find their only definition on a Wiktionary page. Wikipedias are averse to defining "words" having a defined meaning, but when expressing Wikidata properties, we need the corresponding semantic items. Saying about an object that it is "ovate" (2-dim. shape) or "ovoid" expresses different semantic concepts (= Wikidata items), but en.wikipedia redirects both to "Oval". The Wiktionaries define them properly. I think it is desirable to be able to make links from "empty" (no Wikipedia article linked) Wikidata items to the corresponding Wiktionary entries. Thus the Wikidata-item-to-Wiktionary link is a true link, the Wikipedia-to-Wiktionary link may be one (but will often be many-to-many), and the Wiktionary-to-Wiktionary interwiki links need not be a concern of Wikidata (unless Wikidata is used as a database for a central registry of the above-mentioned union of Wiktionary titles). --G.Hagedorn (talk) 17:36, 9 March 2013 (UTC)
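
The "lookup in the merged, case-sensitive title list" mentioned above can be illustrated with a small sketch. The function names and the sample titles are hypothetical; in practice the title sets would presumably come from the dumps of each Wiktionary.

```python
from collections import defaultdict


def build_title_index(titles_by_wiki):
    """Merge the per-wiki title lists into one case-sensitive index:
    exact title -> set of wikis that have a page with that title."""
    index = defaultdict(set)
    for wiki, titles in titles_by_wiki.items():
        for title in titles:
            index[title].add(wiki)
    return index


def spelling_interwikis(index, title, home_wiki):
    """Wikis other than home_wiki that have a page with exactly this title."""
    return sorted(index.get(title, set()) - {home_wiki})


# Hypothetical sample data.
titles_by_wiki = {
    "en": {"arm", "Arm", "ovate"},
    "de": {"arm", "Arm"},
    "fr": {"arm"},
}
index = build_title_index(titles_by_wiki)
print(spelling_interwikis(index, "arm", "en"))
print(spelling_interwikis(index, "Arm", "en"))
```

Because the index is keyed on the exact string, "arm" and "Arm" stay separate, which is the case-sensitive behaviour the comment asks for.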

Suggestions

I think Wikidata could help to improve the Wiktionaries drastically, by unifying not only interlanguage links, but also definitions and translations.

More precisely, what I mean is that currently a single wiki article usually carries several definitions for each language in which the word is used. But often, when I look up a non-French word in the French Wiktionary, the Wiktionary of the word's own language offers more definitions than the French article does.

I saw that on the English Wiktionary the interface has a "quick add" feature, which asks the user to fill in a translation for each meaning. That's great and I wish it were added in all chapters. And I think that we could add even more "hey, what about translating just this little thing" features across all dictionaries by centralizing entries, so that each "word" is associated with one or several meanings per language. Then all meanings could be redistributed to all Wiktionaries, even when no translation is available for a given meaning in the local chapter. In this case we could have an information box that would say "this word has another meaning which has not yet been translated into ${local_language}; if you know one of the languages in which a translation is available, please help us to improve the Wiktionary".

What do you think about such a project? Could it work with Wikidata? --Psychoslave (talk) 22:06, 9 March 2013 (UTC)

That sounds like [www.omegawiki.org]. I believe they are planning to add a link on their "defined meaning" pages to Wikidata pages. Filceolaire (talk) 15:29, 10 March 2013 (UTC)
Yes indeed, and there's a proposal to adopt it within Wikimedia. What about pooling resources on what seem, to my mind, overlapping subjects? --Psychoslave (talk) 10:03, 12 March 2013 (UTC)
This is probably way too advanced for now. The first step would be to have a list of entries, not a list of definitions, e.g. for the current fr.wikt article fr:trouble:
  • French, noun 1
  • French, noun 2
  • French, verb flexion 1
  • English, noun 1
  • English, verb 1
But on en.wikt :
  • French, noun 1
  • French, verb flexion 1
  • English, noun 1
  • English, verb 1
In the French Wiktionary those sections are created with templates, which could be useful (mandatory?) if we want to use Wikidata. En.wikt doesn't use such templates, but numerous other wikis use templates similar to fr.wikt's. Darkdadaah (talk) 13:10, 11 March 2013 (UTC)
I don't understand the problem: what would be the difficulty in having one entry for each meaning? And once you have one entry, can't you add as many attributes as you want? It seems to me that there are two kinds of entries we want to treat here:
  1. used orthographies, where one entry should point to one or several meaning entries for each language where it's used
  2. meaning entries, where you should find attributes like etymology, pronunciation, definition, orthographies*, synonyms, etc.
Pronunciation should be in the meaning entry, because even in one given language one orthography may be pronounced differently for different meanings; see Catégorie:Homographes non homophones en français for examples.
*A guideline should be decided on what kind of alternative orthographies we would like to include, the extremes being "any spelling you may find, even l33t for leet, etc." and "only words you can find in serious printed books and not recognized as errors". A middle way could be to accept any orthography which is broadly used (according to the number of results in a search engine, for example), with a field that records whether the word is generally accepted as a canonical form, a misspelling, etc. --Psychoslave (talk) 10:03, 12 March 2013 (UTC)
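
The two kinds of entries listed above might look something like the following. This is a rough sketch only: the class and field names are invented for illustration, and the French homograph "fils" (pronounced differently for "son" and for the plural of "fil") is an example chosen here, not one taken from the proposal.

```python
from dataclasses import dataclass, field


@dataclass
class MeaningEntry:
    """Kind 2: one meaning, carrying its own attributes."""
    meaning_id: str
    language: str
    definition: str
    pronunciation: str = ""
    etymology: str = ""
    orthographies: list = field(default_factory=list)
    synonyms: list = field(default_factory=list)


@dataclass
class OrthographyEntry:
    """Kind 1: one spelling, pointing to meaning entries per language."""
    spelling: str
    meanings_by_language: dict = field(default_factory=dict)

    def attach(self, meaning: MeaningEntry) -> None:
        # Record the meaning under its language and note the spelling on it.
        ids = self.meanings_by_language.setdefault(meaning.language, [])
        ids.append(meaning.meaning_id)
        if self.spelling not in meaning.orthographies:
            meaning.orthographies.append(self.spelling)


# One orthography, two meanings with different pronunciations.
son = MeaningEntry("fils-1", "fr", "son", pronunciation="/fis/")
threads = MeaningEntry("fils-2", "fr", "plural of fil (thread)", pronunciation="/fil/")
fils = OrthographyEntry("fils")
fils.attach(son)
fils.attach(threads)
print(fils.meanings_by_language)
```

Keeping the pronunciation on the meaning entry rather than on the orthography entry is what lets the two readings of the same spelling differ.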
Let's be pragmatic: no Wiktionary project will be using Wikidata as long as they can only rely on templates (at best) for the structure of the pages. It is useless to talk about what Wikidata can do if it can't be used in reality. Interlanguage links are easy. Word entries as I described may be possible, if every project uses templates for them. But meanings, etymologies, and everything else would require a complete overhaul of the structure in every Wiktionary, which is a completely different matter.
Ultimately Wiktionaries should use structured data, but simply saying "let's use Wikidata" is not enough. Darkdadaah (talk) 19:01, 12 March 2013 (UTC)
In my opinion (only a layman’s view of things) Wikidata has a communication problem. Even the WT castle owners seem to me not to know what it is really good for. For an old Wikipedian sitting in their comfortable monolingual WT castle (at least in mind, without confessing it!), structured data are something incomprehensible. Not only in my opinion, they are the only reasonable way out of the obvious, frequently cited WT cross-language problems. But in spite of this, we need a compromise. Single or compound data fields (single-fact fields(!)) are absolutely necessary for identifier fields (beyond the currently used lemmas). But it seems to me necessary to have structured data fields containing unstructured data like the current mark-up (of course a worldwide standardised single mark-up). In these boxes you can put all the cultural and linguistic exceptions cited against structured data. The new data structures must be able to take all current WT data into a single WP Project data pot after restructuring them.
The other thing old Wikipedians seem not to understand is the difference between data storage and data presentation (I didn’t find anything like a WT or Wikidata data presentation style guide). A good example of this difference between storage and presentation is the pop-up way in which translations can currently be added to the English WT (even if this needs some improvement).
So, having all this as an assumption in mind, my questions are:
  • Is Wikidata intended to be a kind of second tier (the first tier being all the current WTs): If you have cross-language interests, e.g. translations, go to Wikidata. If not, use the current WTs. In other words: is it a real competitor to the current WTs?
  • Is it imaginable to add to it a single WT Project VIEW, able to extract cross-language data from the one pot and the rest from the other(s), so that the user has the feeling of working with one and only one user interface (after all the preliminary data restructuring things are done)? NoX (talk) 17:58, 7 April 2013 (UTC)

As someone who (1) has been around on Wikipedia for a long time, (2) has been on Wiktionary for a long time, and (3) has worked in several museums and libraries where data retrieval and coordination are essential, I can appreciate what Wikidata has the potential to do. Centralizing the interwiki links on the Wikipedias is a huge step forward, and I no longer have to fight the bots that propagate link errors, nor repeatedly visit all the various language Wikipedias any time a link changes.

That said, the potential for assisting Wiktionaries is not as simple, nor has the strong potential one might hope. There are very few Wiktionaries of any size; the English and French are vastly larger than all the others, and some of the "large" Wiktionaries are mere superstructure with entries lacking most (or any) content. The Russian and Vietnamese Wiktionaries in particular have many entries created mindlessly by a bot based on the existence of an entry elsewhere, and lack information because of the way in which they are created. It's also true that interwiki links are extremely straightforward, and we already have bots that create and maintain them. There's no human need to check interwikis because the links connect pages with the exact title match only.

Now, the suggestion to unify definitions and translations presents a huge problem. Wikipedias typically will have a page on a particular topic, and the topic will not change. The page about "geometry" will always be about that topic, and will be about that topic only. At worst, it might be converted into a disambiguation page, and the original topic shunted to a new name, but that's easily corrected with the kind of structure Wikidata maintains. Wiktionary does not work like that. Take a look at en:wikt:white whale, for starters. There are three distinct definitions on the same page. Each is tied to particular quotations, some small number of which may appear below the associated definition, but most of which will appear in a separate Citations: namespace where all supporting quotations are housed. Each definition also has associated synonyms, may have antonyms and coordinate terms, and has its own Translations. Each of these items appears in a separate section on the same page. Wiktionary manages this with rather strict formatting that is alien to Wikipedia editors.

But it gets worse. Now look at en:wikt:palma#Latin, which is the entry on the English Wiktionary for the Latin "word" palma. There are two separate etymologies, four pronunciations, and eleven different definitions, with some of the definitions being forms of the same sense. There are associated illustrations, inflections, derived terms in Latin, and descendants of the word in other languages. Now the translations here are what are even more radically different. The translations of Latin palma appear in the place that definitions would appear in an English entry. Further, they are not of the same form that would appear on the Latin Wiktionary to translate the word into English. The translations here are more explanatory, and are geared specifically for an English reader.

You run into other sorts of difficulties when you try to translate more distant languages. Did you know that the Navajo word for "year" is a verb? Even though it's a noun in most European languages, that isn't always the case. There is no noun in Navajo for "year" because there is a fundamental difference in the concept. With colors, it becomes even more subjective because most world cultures (modern and ancient) do not divide the color spectrum the same way as Western cultures do. As a result, the definitions and translations are not one-to-one and, worse, are not symmetric. That is, if you translate "blue" into certain languages, then translate it back to English, you do not necessarily get "blue" back in translation because some languages do not have a word that exactly matches the concept of "blue". And I have yet to see anyone on Wikidata handle this problem of situations where items are either not symmetric or are not one-to-one. --EncycloPetey (talk) 01:10, 24 June 2013 (UTC)

Then we should focus on what is actually doable. We can't expect to move all information from the Wiktionaries to Wikidata; that would be impossible (and we don't need to). Moving etymologies, definitions, examples or citations is out of the question. The most promising use right now is interwikis. Next would be word ids, to match words easily within and between chapters, just like interwikis but more precise. Darkdadaah (talk) 08:33, 24 June 2013 (UTC)

EncycloPetey, the goal is to have entries for every expression and for every sense. So things like the Navajo word for year being a verb are not a problem at all. Neither is the different separation of concepts into words. Unlike OmegaWiki, we do not assume that there are concepts to which words are merely attached to describe them in every language, with everything important being said about the concepts. The proposal is completely word based: every word in every language is an entry of its own, and for each sense it has further data where you can attach something. All the examples you have mentioned should fit into this model. --Denny (talk) 15:30, 11 July 2013 (UTC)

Project page on meta

A page was created on Meta to coordinate proposals concerning the future of Wiktionary. --145.226.30.43 14:46, 13 March 2013 (UTC)

Wikidata testing

Is there a way we could test those entity types? The best way to find out how well it would work may be to have a sandbox to play with. For example, is it possible to have a local Wikidata, or a Wikidata sandbox? Darkdadaah (talk) 14:43, 19 June 2013 (UTC)

No, they are not implemented yet. We are gathering input *before* we implement it, in order to minimize spending effort on development which was wrong-headed anyway. There will be a demo running once it is implemented, but that is a few months down the line. --Denny (talk) 10:03, 20 June 2013 (UTC)
Well of course, that makes sense. I wonder if it is possible to (easily) play with Wikidata as it is now, though. Darkdadaah (talk) 15:44, 21 June 2013 (UTC)

Meaning descriptions

The proposal suggests that each "meaning" have the definition be added as the meaning's description. I don't think this makes sense. Definitions really need the full wikitext capabilities that are available in the Wiktionaries, including access to the full array of templates and such that are used locally, and the ability to link to other pages, language sections, and meanings. I would suggest using a simple gloss identifier as a description, but that hasn't been widely adopted by any Wiktionaries yet as far as I know. --Yair rand (talk) 19:01, 19 June 2013 (UTC)

I would also suggest using the gloss as the description of a meaning. The gloss is already designed to be a plain text identifier without any links, templates, etc. Glosses are already how we tie definitions to translations, synonyms, antonyms, and derived terms. WT can link directly from one sense to another using glosses (see wikt:フィールド for an example) --Haplology (talk) 02:37, 20 June 2013 (UTC)
So basically you suggest to rename description for meaning to gloss? That sounds good to me. If this understanding is correct, I can change it accordingly. --Denny (talk) 10:27, 20 June 2013 (UTC)
I would say stick with as few conceptual models as possible. Jeblad (talk) 14:40, 21 June 2013 (UTC)
Be warned. There are words that do not convey meaning. They are grammatical, and nothing else. Sometimes, you can translate them to languages having corresponding grammatical words. Grammatical articles, for instance, exist in several European languages such as Dutch, English, French, German, and more. You may thus "translate" French "le, la" to English "the", and vice versa. Yet, this kind of "translation" is highly questionable, since it is not only ambiguous but also does not convey any sense or meaning. It is purely structural. "Translating" grammatical articles to Russian or Chinese yields void. That likely means that a reverse translation needs to do something quite different to "add" articles in the target languages.
When you want to attach definitions to words of this class, you cannot use the same descriptive level as otherwise. In fact, you cannot define meaningless words at all. What you can do is describe them, describe their function. That is why a subclass of them is often called "function words". A big difference between definition and description is language use. A definition is always object-orientated, even for abstract meanings. A description necessarily uses metalanguage: it talks about the language, the word of which it is describing. As we know from logic and mathematics, mixing metalanguage and object language can lead to severe antinomies, which we should avoid.
--09:17, 20 June 2013 (UTC)
Words without meaning are perfectly covered by the proposal. A word can have any number of meanings, i.e. it can explicitly also have none. Thanks for your warning! --Denny (talk) 10:27, 20 June 2013 (UTC)

"word" interwikis

A problem with the proposal's suggestion to have one "word" item per language, and that being the basic unit in Wikidata, is that it doesn't have any room to allow interwikis. Interwikis are, in my opinion, the only thing we can be reasonably sure Wikidata will be at all useful for on Wiktionaries, so this is a pretty major gap. --Yair rand (talk) 19:04, 19 June 2013 (UTC)

As I read it, the label is not per language. The interwikis are not mentioned, but the word "a" should (and will) link to en:wikt:a, de:wikt:a etc. No gap there. As all "a"s should be linked to the word "a", it is not really necessary to have interwikis. From en:wikt:a, Lua should be able to pick up the properties from the word "a". For automating the interwikis themselves I would suggest having interwikis at the words. HenkvD (talk) 21:31, 19 June 2013 (UTC)
That is wrong, words are -- in general -- per language. So "arm" would exist as an English word and as a German word, and they would be on different pages. --Denny (talk) 11:45, 20 June 2013 (UTC)
Interwikis should be something the software does on its own. All it has to do is to just check which language versions have an entry with the exact same spelling and link it. This can't be too hard to implement, and I see no reason why this isn't done already without Wikidata. —PοωερZtalk 23:10, 19 June 2013 (UTC)Reply
There are some complications with different communities' policies:
  • using ' or ’
  • capital letter and period or no punctuation in idioms
  • very long words : truncated title or special page
There may be other cases like that. So it is not as straightforward as it seems.
That said, I would also be really interested in a word interlanguage link, instead of interwikis simply based on orthography. Darkdadaah (talk) 08:30, 20 June 2013 (UTC)
Yair is right with the original problem: due to the different breakdown into pages in Wiktionary and in Wikidata, we could not serve well the use case of interwiki links. I wonder in how many cases the suggestion made by 23PowerZ would not work. If it is infrequent enough, we could go with it, and let the rest be handled by classical interwiki links. Anyone up to do some analysis? --Denny (talk) 11:45, 20 June 2013 (UTC)
Yes, first we should focus on interwikis for pages, not words. The best way to do that would be to analyze the dumps (I tried the replicated database in the Toolserver/Tools Labs but it doesn't seem like interwiki tables are available). Darkdadaah (talk) 14:04, 20 June 2013 (UTC)
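
To make the complications listed above concrete, here is a rough sketch of a fallback match that tolerates the apostrophe and the capital-plus-period differences. Everything in it is an assumption for illustration: the function names, the rule that only multi-word titles ending in a period are treated as sentence-style idioms, and the sample titles. It does not address truncated long titles.

```python
def match_key(title: str) -> str:
    """Comparison key that smooths over two typographic policy differences."""
    key = title.replace("\u2019", "'")      # typographic -> ASCII apostrophe
    # Treat a sentence-style idiom ("Capital letter ... period.") like its
    # bare form; single words keep their case, so "Arm" and "arm" stay apart.
    if key.endswith(".") and " " in key:
        key = key[:-1]
        key = key[:1].lower() + key[1:]
    return key


def find_match(title, other_titles):
    """Exact match first; otherwise the first title with the same key."""
    if title in other_titles:
        return title
    wanted = match_key(title)
    for candidate in sorted(other_titles):
        if match_key(candidate) == wanted:
            return candidate
    return None


print(find_match("aujourd\u2019hui", {"aujourd'hui", "hui"}))
print(find_match("It's raining cats and dogs.", {"it's raining cats and dogs"}))
```

Lowercasing the first letter of an idiom would be wrong for idioms that begin with a proper noun, which is one reason this cannot be more than a heuristic.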
Here are some data for three Wiktionaries (recent dumps):
Type                   fr         de       en
Articles               2,337,895  266,965  3,422,938
Redirects              15,119     694      17,218
No interwiki           1,438,154  99,429   1,631,903
Direct interwikis      899,177    167,485  1,790,878
Apostrophe interwikis  784        63       0
Capital interwikis     130        149      54
Other interwikis       938        217      542
Note that en.wikt doesn't care about apostrophes in other Wiktionaries (it just links to redirects). The French version is probably more informative for those, as it uses a different apostrophe. Interwikis different from the title spelling account for roughly 0.2% of all interwikis. If we assume this proportion would not change much later on, then we can consider that those "indirect" interwikis are quite rare. Darkdadaah (talk) 16:04, 20 June 2013 (UTC)
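
The proportion can be rechecked from the figures in the table. A small sketch, assuming "indirect" means the apostrophe, capital and other rows taken together, and that the share is computed per wiki over direct plus indirect interwikis (the names `COUNTS` and `indirect_share` are just for illustration):

```python
# Interwiki counts copied from the table above.
COUNTS = {
    "fr": {"direct": 899_177, "apostrophe": 784, "capital": 130, "other": 938},
    "de": {"direct": 167_485, "apostrophe": 63, "capital": 149, "other": 217},
    "en": {"direct": 1_790_878, "apostrophe": 0, "capital": 54, "other": 542},
}


def indirect_share(counts):
    """Share of interwikis whose target title differs from the page title."""
    indirect = counts["apostrophe"] + counts["capital"] + counts["other"]
    return indirect / (counts["direct"] + indirect)


for wiki, counts in COUNTS.items():
    print(f"{wiki}: {indirect_share(counts):.2%}")
```

Under these assumptions the share comes out at about 0.2% for fr.wikt, slightly higher for de.wikt, and well below 0.1% for en.wikt.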
And can entirely be solved by redirects. —PοωερZtalk 16:11, 20 June 2013 (UTC)Reply
This means that 1) we would have two (or more) entities for as many spellings (I don't see a problem with that, although it would be good to link those to know they are all typographical variations of the same spelling), and 2) we would replace the interwiki bots with redirecting bots on all wikis (albeit only for a fraction of all the articles, of course, but still).
Using classical interwiki links would have the advantage of simplicity for the contributors in their local projects, but I don't know how easy it would be to mix "spelling interwiki links" and "classical interwiki links". Darkdadaah (talk) 22:57, 20 June 2013 (UTC)

Relation between word and meaning

The proposal states "A meaning belongs to one word, and one word only. But each word can have several meanings". I don't think that will work. As mentioned, the word arm has at least two meanings: 1) part of body and 2) poor (in German). The word poor also has a meaning of poor. To comply with the proposal a 3rd meaning should be added: 3) poor. Both 2) poor and 3) poor would ideally have the same statements (like Q10294 poverty), which duplicates information and results in lots of errors. Could this be like "A meaning belongs to one or more words, and each word can have several meanings"? Or is it meant: "A meaning belongs to one word per language, and one word only. But each word can have several meanings"? HenkvD (talk) 21:58, 19 June 2013 (UTC)

Meaning isn't something you can store at Wikidata. —PοωερZtalk 23:07, 19 June 2013 (UTC)Reply
Unless we are talking about some strictly defined terminology, sharing meanings would only recreate the same mistakes as OmegaWiki. Simply put, synonyms should have independent meaning descriptions, linked with a "synonym_of" relation. This way we may be able to describe all the subtle nuances that shared meanings can't take into account.
The same goes for translations: it is rare to have a direct one-to-one translation between two meanings in two different languages (again, unless we're talking about a strict terminology). Different meanings linked with a "translation" relation would work better. Darkdadaah (talk) 08:20, 20 June 2013 (UTC)
Maybe meaning has been a bad term. There is the suggestion to change it to sense. But the proposal follows what Darkdadaah says, i.e. two words never share the same meaning (or sense). This might lead to some duplication, indeed, or maybe to some smarter structures. But the proposal follows Darkdadaah's assumptions that sharing meaning does not work in many cases. One obvious example is usage examples: The sentence "Wilhelm Tell traf den Apfel." would not be an example use for the sense "a fruit" of "Apple", even though it is of the sense "a fruit" of "Apfel", and you could claim they are the same meanings / senses. --Denny (talk) 11:49, 20 June 2013 (UTC)
Sorry, this was a bit unclear - was it still understandable or shall I explain more? --Denny (talk) 11:50, 20 June 2013 (UTC)
It's true that two words should never be considered as having the same meaning (most often, they have almost the same meaning, or there are some subtle differences in their use, etc.) Lmaltier (talk) 06:06, 21 June 2013 (UTC)
I strongly disagree. Synonyms do share a common meaning, that is the sheer definition of the term "synonym". True synonyms are rare in natural languages, though. OmegaWiki editors may have made mistakes by not being clear enough about near-synonyms. Maybe we can avoid these by having a (symmetric) relation "near synonym" and encouraging its use. --Purodha Blissenbach Discussion  12:56, 21 June 2013 (UTC)
Yes, the main point is to have different relations for exact synonymy and near-synonymy, whatever name we choose for the relation. Translations would be similar. Darkdadaah (talk) 14:18, 21 June 2013 (UTC)
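
The idea running through the comments above, separate senses per word connected by typed relations, could be sketched like this. The class, the relation names other than "synonym_of", and the choice of which relations count as symmetric are assumptions for illustration, not part of the proposal.

```python
from dataclasses import dataclass, field

# Relations assumed to hold in both directions; "translation_of" is left
# one-directional because translations are not necessarily symmetric.
SYMMETRIC = {"synonym_of", "near_synonym_of"}


@dataclass
class Sense:
    """One sense of one word in one language, with its own gloss."""
    word: str
    language: str
    gloss: str
    relations: set = field(default_factory=set)  # (relation, target key)

    @property
    def key(self):
        return (self.language, self.word, self.gloss)


def relate(source: Sense, relation: str, target: Sense) -> None:
    """Link two senses; symmetric relations are stored on both sides."""
    source.relations.add((relation, target.key))
    if relation in SYMMETRIC:
        target.relations.add((relation, source.key))


head = Sense("head", "en", "school headmaster")
headmaster = Sense("headmaster", "en", "school headmaster")
hoofd = Sense("hoofd", "nl", "school headmaster")

relate(head, "near_synonym_of", headmaster)
relate(hoofd, "translation_of", head)
print(sorted(headmaster.relations))
```

Each sense keeps its own gloss and could keep its own usage examples; only the relation carries the claim that two senses are close.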
From the viewpoint of meaning, there is no difference between languages, and thus no difference between (meaning-related) relations of words of one language and (meaning-related) relations of words of different languages. We have a wide field between "can always be substituted for/translated as" and some sort of "vaguely being somewhat related", predominantly of interest to poetry makers. So as to map reality appropriately - if we wish so - we could use a similarity measure: likely 100% e.g. for geographical or other proper names, astronomical objects and the like, but decreasing as things become more foggy. I would suggest using a second measure as well, a ratio to be derived from existing translations. So, if nl:"hoofd" was translated to en:"head" in 80% of the known instances, 15% to en:"headmaster", 4% to en:"leader", and some fractional percentages to some 135 other words, you have a pretty clear picture. It may be necessary to define a lower clipping edge for percentages, so as to exclude simple errors and possible but very special cases having no practical potential of reuse. Also, many of those relations heavily depend on domains of use, or context. A teenager calling a friend "cool" is possibly to be understood differently from a meteorologist using "cool" in a weather forecast :-) That holds for word-to-word relations as well. Translations are a special subgroup of those. --Purodha Blissenbach Discussion  15:43, 21 June 2013 (UTC)
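
The second measure suggested above, a ratio derived from observed translations with a lower clipping edge, might be computed along these lines. The function name, the `min_share` parameter and the sample observations are hypothetical; "chief" merely stands in for the many rare translations.

```python
from collections import Counter


def translation_ratios(observed, min_share=0.01):
    """Share of each observed translation, dropping those below min_share."""
    counts = Counter(observed)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {
        target: n / total
        for target, n in counts.most_common()
        if n / total >= min_share
    }


# Hypothetical sample: 100 observed translations of nl:"hoofd" into English.
observed = ["head"] * 80 + ["headmaster"] * 15 + ["leader"] * 4 + ["chief"]
ratios = translation_ratios(observed, min_share=0.02)
print(ratios)
```

With a clipping edge of 2%, the single "chief" observation is dropped while the three main translations keep their shares. Context or domain of use, as raised in the comment, would need a further key alongside the word pair.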
Except that 'head (body part)', head (school headmaster), and head (leader) are clearly different meanings/senses and will link to different meanings/senses under the current proposal. (One expression can have multiple meanings/senses). The fact that 'hoofd' has the same collection of meanings/senses is a coincidence.
For me the problem is that, under this proposal, we will have to write out a separate meaning/sense for 'head' (school headmaster) and for 'headmaster' rather than referring both of them to the same meaning/sense. --Filceolaire (talk) 23:48, 21 June 2013 (UTC)Reply
What's the problem with writing separate senses for different words? That's what has been done until now. What we need to improve is the relations between those senses. Do not make the same mistake as Omegawiki. Darkdadaah (talk) 13:01, 22 June 2013 (UTC)Reply
I regard that as a feature, not as a problem. An example sentence using 'head' would be attached to the sense 'head' (school headmaster) only if it indeed uses the form 'head', whereas an example sentence for 'headmaster' (school headmaster) would do the same. The only thing repeated would be the gloss itself, no? --Denny (talk) 15:49, 27 June 2013 (UTC)Reply

Interwikis for Wiktionary project space


What about interwiki link handling for Wiktionary project space, e.g. wikt:WT:CP? Is that planned? This, that and the other (talk) 09:26, 20 June 2013 (UTC)Reply

That seems like something the current system could handle perfectly well. We will deploy it to other sister projects first, though. --Denny (talk) 11:53, 20 June 2013 (UTC)Reply
The sooner, the better. Cheers! BD2412 (talk) 21:46, 20 June 2013 (UTC)Reply

Impact on contributors?


The proposal describes technical developments, but without a word on the possible impact on how contributors work. If there is no impact, why make this proposal? But I feel that the impact would be huge, and that the idea is very similar to OmegaWiki. And OmegaWiki has proved a catastrophic experiment: very few contributors, and probably no users.

The idea is to group some data (most data as I understand it) at a central point. But how will decisions be made? By discussions in English. This is a major point. This would exclude most contributors from most wiktionaries.

Another major point is that it would make contribution less simple, as contributors would have to understand and to accept the general schema. This was the other major reason for Omegawiki failure: even with much good will, most candidate contributors were unable to understand principles. Currently, there is already a real issue as wiktionaries become much too technical, and this would be much worse.

In other words, I think that implementing this proposal would kill wiktionaries, as these same principles were fatal to OmegaWiki.

Other solutions for sharing data are possible and already used, e.g. bots for importing data (when possible), manual imports, etc. These solutions don't change anything in the contributors' work, which is why they have no negative impact (when used carefully). The objective should be to improve things, not to indulge yourself with new technical developments, and only Wiktionary contributors can help in understanding needs (e.g. an actual need is categories sorted according to their language, which differs from category to category). However, Wikidata can be used for interwiki links, as no discussions are required once the general principle (links between pages with precisely the same title) is defined. Lmaltier (talk) 17:58, 20 June 2013 (UTC)Reply

I must add that, if designed very carefully, this kind of proposal could be a good idea for a commercial dictionary, managed by a few people. But not for a wiki. Lmaltier (talk) 18:13, 20 June 2013 (UTC)Reply
I disagree with you on a lot of things and I think we will prove you wrong :) Darkdadaah (talk) 19:25, 20 June 2013 (UTC)Reply
You can read and write English easily. Not everybody. People who should participate to the discussion are those who cannot read English. But they won't. Lmaltier (talk) 20:09, 20 June 2013 (UTC)Reply
The language barrier might be overcome if we accept authority of a language wiki community on words of their own language. —PοωερZtalk 20:47, 20 June 2013 (UTC)Reply
I don't understand how this would be a solution. Wikis are built by contributors, not by developers, and most contributors would leave if they became unable to discuss decisions (where decisions are taken). And many decisions don't relate to words of one's language (e.g. which languages are accepted among constructed languages, how to deal with homophones, etc.) It's even normal in some cases that basic decisions about page structure depend on the language (because words such as homophone may have slightly different meanings for different languages). All Wiktionary contributors are concerned by languages (or they would not contribute to the project), therefore this is a major issue. Please, learn OmegaWiki lessons. Lmaltier (talk) 05:51, 21 June 2013 (UTC)Reply
Also don't forget that many contributors focus on a different language (e.g. on fr.wikt, Francophones may specialize on Indonesian, Dutch, etc.) Native languages are one thing, interests are another thing. Lmaltier (talk) 06:02, 21 June 2013 (UTC)Reply
Lmaltier, you are completely right that this is merely the technical proposal of what to implement. It does not explore the possible effects on the editing community of the Wiktionaries. But based on our growing understanding of how Wikipedia is using Wikidata, I believe that in the long run this proposal would increase the quality of many Wiktionary projects and release editor energy for more high-level tasks. I want to make it absolutely explicit that, just as for the Wikipedias, Wikidata would be merely an offer: it is not meant to replace the current Wiktionaries, it is merely available for the Wiktionaries as a structured data store that might or might not be considered helpful by the projects. My opinion is that especially smaller projects might consider Wikidata useful. Bigger projects often have sufficient editor energy and attention to do many of these tasks manually, and thus also in a higher quality. At the same time they might find that some tasks could be delegated to Wikidata for some entries, and have less bot-activity on their projects. Less bot activity, again for the smaller Wiktionaries, might lead to a stronger sense of community, which is beneficial for the project.
In short, the editors get more tools. We would like to discuss the way we make this tool with the editors who will be using them most. And this is why we have written a proposal and would like to get input on it to see if it makes sense and can be helpful to improve the projects. --Denny Vrandečić (WMDE) (talk) 10:30, 21 June 2013 (UTC)Reply
Perhaps it will be increasingly important with tools on the client wiki itself for entering data the community thinks are within their domain. Something like that will give them "domain ownership" and that is known to be important. This was discussed for Wikipedia but never implemented. Perhaps this is equally important for Wiktionary? Jeblad (talk) 16:28, 21 June 2013 (UTC)Reply
Wikidata may be used by Wikipedia for a number of data, especially numbers such as country populations, etc., as I understand it, but, nonetheless, Wikipedia pages are written by contributors, outside Wikidata. Therefore, impact on contributors is limited. From your proposal, I understand that you want to implement the core of wiktionaries in Wikidata, and above discussions confirm my understanding. You cannot do the same as Wikipedia (words have no population, there are no numbers associated to words, there are no objective data except obvious ones, such as the part of speech or gender or plural, and even these are, sometimes, very disputed and lead to serious disagreement, which shows they are not as objective as you can think...). Pronunciation recordings are already shared through Commons. Senses, etymologies, pronunciations, synonyms, translations, are not objective data about words, they are more like texts in Wikipedia pages. If you want to take Wikipedia as an example, then wiktionaries have no need for Wikidata. And OmegaWiki is the example of what should not be done.
I'm afraid that many participants to this discussion (e.g. Darkdadaah) are developers and feel excited. Inputs for this discussion should come in priority from normal contributors (not developers), with a long experience of Wiktionary and understanding it well, and preferably not speaking English (or uneasy with English). Of course, this is impossible, and this is a good example of one of the issues I raise. Lmaltier (talk) 21:46, 21 June 2013 (UTC)Reply
Lmaltier, you are way too pessimistic about this. I may be a developer but I'm also (and primarily) a contributor to the project. There are data that can easily be shared, and the most important and interesting would be relations between words, not necessarily data about words themselves (that we can write easily in the individual articles): relations such as synonym_of, exact-synonym_of, antonym_of, translation, hyperonym_of, hyponym_of, meronym_of and so on would be extremely useful to fill all the word-relation sections automatically without having to write and sync all of them by hand. If we have those relations implemented in Wikidata, the contributors should not have any problem contributing if the interface is properly translated. Darkdadaah (talk) 12:55, 22 June 2013 (UTC)Reply
Lmaltier, I think there is quite a bit of opportunity to share some common structured data between the different language editions of Wiktionary. E.g. the plural of the expression "goose" is "geese", and that is something that can be taken from Wikidata if the Wiktionaries so want. The actual entries would still be written on the respective Wiktionaries, and should be written autonomously by the existing communities. Wikidata is merely an offer to provide access to some data in a structured form, which can be shared. Also, Wikidata is equipped to deal with disagreements through sources and having several contradicting statements on the same expression.
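A minimal sketch of how word relations stored once could fill relation sections in both directions, assuming relations are kept as simple (subject, relation, object) triples. The relation names are the ones listed in the comment above; the word pairs, `expand` and `related` are illustrative and hypothetical, not an existing Wikidata interface.

```python
# Relations that hold in both directions with the same name.
SYMMETRIC = {"synonym_of", "antonym_of"}
# Relations whose reverse direction has a different name.
INVERSE = {"hyperonym_of": "hyponym_of", "hyponym_of": "hyperonym_of"}


def expand(statements):
    """Add the reverse statement for every symmetric or invertible relation."""
    expanded = set(statements)
    for subject, relation, obj in statements:
        if relation in SYMMETRIC:
            expanded.add((obj, relation, subject))
        elif relation in INVERSE:
            expanded.add((obj, INVERSE[relation], subject))
    return expanded


def related(statements, word, relation):
    """Sorted list of words that stand in `relation` to `word`."""
    return sorted(o for s, r, o in expand(statements) if s == word and r == relation)


# Each fact is entered once, by hand, in one place.
stored = {
    ("big", "synonym_of", "large"),
    ("animal", "hyperonym_of", "dog"),
}
print(related(stored, "large", "synonym_of"))
print(related(stored, "dog", "hyponym_of"))
```

Whether a relation should attach to words or to individual senses is a separate question that this sketch does not address.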
It is different from OmegaWiki in the main sense that we do not intend to replace the existing Wiktionaries in any way, but to give them more features. It is up to them how and if they use these new features, and for which parts. This is mostly interesting for smaller Wiktionaries. I fully agree that the input to the discussion should come from Wiktionary contributors, and they are more than welcome to join. --Denny (talk) 15:54, 27 June 2013 (UTC)Reply
Saying "X is the plural of Y" is easy in regular English, but isn't always that easy. The plural of octopus is octopus, octopi, octopuses, or octopodes, depending on your source, and in fact all are valid in English. In Latin, there are two patterns for declining tigris ("tiger"), but you can't just mix-and-match the plural forms as you please. If you're using tigrides for the nominative plural, then you use tigridibus for the dative plural; but if you use tigres for the nominative plural, then the dative plural is tigribus. And whichever pattern of forming the plural you use, there are five plural forms, not just one as English has. Unless the data structure can handle full inflection tables, it won't be useful, and even then users in various languages have to be able to figure out what's present and how it's organized. Take a look at en:wikt:tigris (both Hungarian and Latin) for more. --EncycloPetey (talk) 17:04, 27 June 2013 (UTC)Reply
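A minimal sketch of a structure that keeps alternative inflection patterns apart, so that forms from different patterns cannot be mixed. It only uses the four Latin forms named above; the class names, the "pattern A/B" labels and the (case, number) keys are assumptions made for illustration, not a proposed schema.

```python
from dataclasses import dataclass, field


@dataclass
class Paradigm:
    label: str
    forms: dict = field(default_factory=dict)  # (case, number) -> form


@dataclass
class Lexeme:
    lemma: str
    language: str
    paradigms: list = field(default_factory=list)

    def forms_for(self, case, number):
        """All attested forms for one grammatical slot, one per paradigm."""
        key = (case, number)
        return [p.forms[key] for p in self.paradigms if key in p.forms]

    def paradigm_containing(self, form):
        """The paradigm a given form belongs to, or None."""
        for p in self.paradigms:
            if form in p.forms.values():
                return p
        return None


tigris = Lexeme("tigris", "la", [
    Paradigm("pattern A", {("nominative", "plural"): "tigres",
                           ("dative", "plural"): "tigribus"}),
    Paradigm("pattern B", {("nominative", "plural"): "tigrides",
                           ("dative", "plural"): "tigridibus"}),
])

# Pick the dative plural that matches the chosen nominative plural.
chosen = tigris.paradigm_containing("tigrides")
print(chosen.forms[("dative", "plural")])
```

A full table would carry every case and number of each pattern; the point here is only that a single "plural" property is not enough.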
Or de:wikt:Chor for that matter. It has to be thought out very well how we are going to use properties and qualifiers, but that's something I'd postpone until it's clear what's technically possible. —PοωερZtalk 17:44, 27 June 2013 (UTC)Reply
Saying "X is the plural of Y" is not so easy, even in English. Let's take the above example: the plural of the expression "goose" is "geese". This is false for at least one sense of goose. And it's not data, it results from usage only, and usage is not always consistent. cf I hear that there are several gooses who late this afternoon got cooked by my lady judge. (mayorsam.blogspot.com). And many such examples could be given. You misunderstand what language is. You can find data in geography or physics, but linguistics is not geography or physics. See another section below (Wikidata = wiki + data). Lmaltier (talk) 19:48, 27 June 2013 (UTC)Reply
The focus should not be on the few words where this is hard or impossible. These can still be handled in the same way Wiktionary handles it. Maybe not every plural is defined trivially, but if a large set are, this kind of structured data might still lead to less maintenance overhead for the editors in the end, thus allowing them to do more if they want. And saying that there is no data in linguistics exposes a difference in our understanding of linguistics. There is plenty of data in linguistics, and I have published papers on these topics. So I fail to comprehend what you mean with that. --Denny (talk) 15:36, 11 July 2013 (UTC)Reply

I don't see your initial concerns as valid. It is up to each wiktionary to decide whether or not they want to implement Wikidata. The source code of the average wiktionary page already appears pretty mysterious to outsiders; implementing Wikidata in whatever way shouldn't change much in this regard. --Njardarlogar (talk) 09:49, 30 June 2013 (UTC)Reply

Wikidata as a local extension of all wikis


With different Wiktionaries, we have different needs. Many of these needs are purely local to each of them and work within the language for which each wiki is designed. Most of the data presented will in fact NOT be shared across languages, but where it is, it will link to articles about foreign words, translated to the local language.

In summary, interwikis (between linguistic editions of Wiktionary) are in fact not very useful: we cannot create an interwiki from a local page presenting a foreign word to another page, simply because such links are ambiguous:

  1. should it link words that are written "identically" (we've seen the problem of apostrophes). But if we do, the central Wikidata can handle this without changes.
  2. if it links words from the local definition in the local language to another term on another wiki, we run into the problem that we need to link not only to their pages, but to a stable section of the page (i.e. an anchor). Wikidata currently does not handle anchors correctly. This is general for all wikis, including Wikipedia, which does not track sections in a stable way. On Wikipedia this is complicated by the fact that articles are merged and split, but that is a smaller issue; the real issue is that one linguistic edition uses separate articles where another uses multiple sections in the same article. There are usually more articles on English Wikipedia linking to sections of fewer articles in other languages, simply because of the scale of EN.WP. For this reason, the problem of anchors between different wikis needs to be addressed, with some support for automatically referencing in Wikidata the sections modified on a local wiki, or at least for marking in Wikidata the anchors that are no longer valid or that have been renamed locally on a wiki.

Now there only remains the actual scope of this proposal: organizing the local-only dependencies and links within the wiki. I call this metadata, which will help automate searches and the creation of derived pages (like terminological lists, the thesaurus, lists of translations shown below definitions, or the automated creation of categories sorted according to different linguistic rules):

  • Lua could allow integrating, for example, the CLDR data and algorithms to derive transliterations, transcriptions, and collation keys for sorting, and could also allow users to select their preferred sorting option (for example to sort the Chinese categories in the French Wiktionary, or to find terms using transliterations).
  • e.g. in Chinese: we could search by radical/stroke, by traditional dictionary order, by Pinyin romanization, by Wade-Giles romanization, or by Bopomofo transliteration; we would no longer have to manually maintain lots of pages showing long lists of terms; we would use categories for each item instead, and lists would be generated. For all this we need metadata. An experiment in integrating metadata using custom templates (in "/metadata" subpages) has been made; it works but was not used due to lack of maintenance and understanding, and also due to lack of integration. With metadata managed locally it would work much better.

So what do we really need in Wiktionary (and probably on all wikis as well): the possibility of having a local instance of Wikidata (in its own namespace) to help organize the contents of the local wiki, to help maintain those contents, and to extend the limited possibilities of existing categories.

Wikidata was created from the observation that interwikis are metadata. But the contents of a wiki also have other metadata that we currently cannot query in a structured and formalized way, so we have a huge collection of extensions (or worse, external tools and bots) that try to fill the hole, but are maintained by a few people dictating their rules to every other user of the wiki.

What I propose :

  • a local integration of the Wikidata extension features (currently only installed in a single central wiki, queried by all other wikis) within all standard wikis (Wiktionary, Wikipedia, Wikinews, Wikiquote... and even Commons, which is interested in organizing many versions of its hosted media, as well as much metadata about them), for their local use. Each wiki will define its own ontology according to its own policy and the organization of its contents.
  • wikis will first query their local database before querying the central Wikidata database (more or less like the integration of Commons, where files are looked up first locally, then on Commons).
  • a way to override the data queries to indicate (with interwikis?) which database we'll query. For querying the central Wikidata database we already have the interwiki prefix code "d:" (e.g. "property:d:Qnnnn"). Without this interwiki prefix code (e.g. "property:Qnnn"), the query will go first to the local database.
  • We could as well query the database of another wiki. E.g.:
    • "property:Qnnn" will first query the local wiki, then the central Wikidata; you may skip the fallback to central Wikidata by adding the local wiki prefix (e.g. on the local French Wiktionary you can use "property:fr:Qnnn"). On Commons you can force it to query only the local database with "property:commons:Qnnn"; but you could also reduce it to "property::Qnnn" (using the empty prefix).
    • "property:d:Qnnn" will query only the central Wikidata database.
    • "property:en:Qnnn" to query the English wiki version of the local project.
    • "property:w:de:Qnnn" to query the database of the German Wikipedia.
    • "property:n:Qnnn" to query the database of the Wikinews in the same language as the local wiki project
    • "property:commons:Qnnn" to query the database of Wikimedia Commons (getting metadata about a mediafile, which is not present in the file itself, and unparsable from its description page)
    • "property:species:Qnnn" to query the database of Wikispecies... Note that Wikispecies is intrinsically a database and a perfect example of use: it should be managed mostly by Wikidata (with few contents actually edited in wiki code form) and be queryable directly from Wikipedia.
    • "property:voyage:Qnnn" to query Wikivoyage (e.g. transportation lines and stops, schedules, infos about transporters).
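The prefix rules in the list above can be sketched as a small resolver that returns the ordered list of data stores to query. This is only an illustration of the proposed naming convention: `resolve`, the store tuples and the prefix tables are hypothetical, "Q1" is a placeholder item id, and nothing here is an existing MediaWiki or Wikidata API.

```python
# Prefixes that select a sister project in the current (or a given) language.
PROJECT_PREFIXES = {"w": "wikipedia", "n": "wikinews", "voyage": "wikivoyage"}
# Prefixes that select a single, language-independent store.
GLOBAL_STORES = {"d": "wikidata", "commons": "commons", "species": "species"}
CENTRAL = ("wikidata", None)


def resolve(reference, local_project, local_lang):
    """Return (stores, item): stores is the ordered list of (project, lang) to query."""
    parts = reference.split(":")
    if len(parts) < 2 or parts[0] != "property":
        raise ValueError(f"not a property reference: {reference!r}")
    *prefixes, item = parts[1:]
    local = (local_project, local_lang)
    if not prefixes:            # "property:Qnnn": local first, then central
        return [local, CENTRAL], item
    if prefixes == [""]:        # "property::Qnnn": empty prefix, local only
        return [local], item
    project, lang = local_project, local_lang
    for prefix in prefixes:
        if prefix in GLOBAL_STORES:
            return [(GLOBAL_STORES[prefix], None)], item
        if prefix in PROJECT_PREFIXES:
            project = PROJECT_PREFIXES[prefix]
        else:                   # anything else is read as a language code
            lang = prefix
    return [(project, lang)], item


# As seen from the French Wiktionary:
for ref in ["property:Q1", "property::Q1", "property:d:Q1",
            "property:en:Q1", "property:w:de:Q1", "property:n:Q1"]:
    print(ref, "->", resolve(ref, "wiktionary", "fr"))
```

Treating every unknown prefix as a language code is a simplification; a real implementation would presumably validate prefixes against the interwiki table.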

In fact only interwikis really need to be centralized in the existing Wikidata (but extended to cover not just articles, but also their sections, or anchors generated by utility templates).

This way, there is no need to fill everything in the central Wikidata database. The local needs for structured data can be satisfied without depending on the central Wikidata policies (but coordination across wikis is possible, to see if some common parts of local databases can be moved to the central Wikidata). This will mean easier maintenance, and easier understanding by local users of each wiki (including Wiktionary).

Ontologies will be separated by domain of application (i.e. Wikimedia project) and/or by language. Many data tables shown in Wikipedia articles, for example, would benefit if they could be fed by a local instance allowing the creation of specific tables whose data would come from a database and be presented as a formatted and transformable table in the article, without having to edit complex HTML tables or wiki tables. In fact the Wikidata extension should simplify the creation of custom tables, using a spreadsheet-like interface. Several transforms would then be generated by local Lua modules (that would, for example, compute totals, sort and subcategorize the contents with subtotals, or let a user choose to show more detailed optional columns directly from the article).

No more need to create multiple wiki tables or specific pages for different lists. Data across articles can be shared and edited in a central point, including by bots importing free data. Lua modules would also automatically generate graphics (e.g. SVG, or timelines), without using complex HTML code that is difficult to make compatible with all browsers with HTML tricks (we could later have a way to define on Commons a graphics "file" which is in fact a template using a Lua module querying data in Wikidata or in a locally edited data table). No more need to create, install and maintain a new MediaWiki extension! This generated SVG could be cached, and refreshed if it's too old (more than 1 week), or if we press a "refresh" button (to re-execute the Lua module, which will perform the local or remote Wikidata queries).

The rendering of pages using Wikidata only for local needs will also be faster. Local wikis will frequently have more detailed ontologies on some subjects, that other wikis don't use or don't understand well (e.g. French-specific topics and classifications are better understood on the French Wikipedia, using French concepts, which have no clear equivalent in English, or just broad equivalents).

And this integration will also allow easier cooperation and exchange of data between all wikis, with fewer difficulties, while also allowing local experimentation and development on a specific wiki, which could mature and start being used in a similar way by another wiki, and later be merged by reimporting parts of the ontology to central Wikidata. In general it is not good to define a common feature designed for a few languages and wikis without first experimenting with it for a local need. Moving data from a local Wikidata instance to central Wikidata should be delayed to the point where cooperation between wikis is desired, and the data is maintained on all origin wikis in the same or a similar way: only the similar features will go to Wikidata, and there will still remain additional data specific to each wiki.

And about the current subject (using Wikidata in Wiktionary): the whole discussion does not require a global Wikidata change, but just some extensions of it, first tried out locally in a few Wiktionaries, to see how the experiments can interoperate and what will go to the central Wikidata database.

Verdy p (talk) 22:02, 20 June 2013 (UTC)Reply

Hi Verdy p. Thanks for your comment, but I am afraid that it is a bit late. Indeed my first proposal of adding structured data to the Wikimedia projects was to add it to each project on its own, back in 2005. But that didn't gain much traction over the years. It was only when we switched to a central data store for all projects and languages that sufficient interest and attention was given to the project to make it happen. So, whereas I fully understand where your argumentation comes from, I simply do not see why this would happen. Locally, the communities have come up with ways to deal with their structured data. It is the interaction between projects, and the sharing of this data, that was wholly underserved and that could hugely benefit the different projects.
The single Wiktionaries will not be told how to use the data, or what part of the data to use. They will decide completely autonomously and independently. But the plural of the Hungarian word "menza" is "menzák". This can be perfectly stored at a central place, and I fail to see the need to have that entered and maintained locally in over 170 different Wiktionaries. And for this kind of data, Wikidata could be useful. --Denny Vrandečić (WMDE) (talk) 10:42, 21 June 2013 (UTC)Reply
I don't think this is too late. The intent for creating a central store is viable when the data has to be shared between multiple wikis. But even within the same wiki, we have lots of data duplication and maintenance problems to update long lists of pages, notably those containing data.
Very frequently, this data is not portable to another wiki, or cannot be integrated the same way because it would require the development of a completely new structure for pages and templates. Local databases can simplify at least the local maintenance. And later this may prepare the way for an export to a central store, if part of the data can be shared and templates adapted to use the central store.
We could gain a lot by allowing the Wikidata extension to be used locally as well, without first having to populate the central store (local data may be exported by automated tools and local templates adapted to use the central store, where appropriate, only for the data that has been centralized).
This would also reduce the maintenance and many future conflicts between concurrent schemas that will start appearing very soon in central Wikidata, using different models and creating new data duplication. E.g. there could be conflicts about figures depending on the source referenced, and new incoherent data if sources are mixed in central Wikidata. Creating an updated data model in central Wikidata that can satisfy all wikis, and that is discussed in different languages, will take time. We should at least be able to experiment with data locally, before it goes to central Wikidata using a common data model, without breaking the existing local data shown on local wikis.
Now that the development is almost finished, I do not think that my request is infeasible: it just consists in adding a way to create interwiki references for data, using a naming convention on properties. With this interwiki prefix, the Wiki extension will know where to query the values, instead of assuming that they will always be in central Wikidata. The Wikidata DB engine itself is already written, and installed in one wiki (it could be installed on local wikis as well).
And then instead of writing complex templates (full of #switches) locally and updating them to populate the data, we would write the templates once, and data would be separately modifiable using the Wikidata interface (without having to worry about the template syntax or the difficulty of extending templates while preserving the data itself).
And for this reason, I think that the name "Wikidata" itself (for the current central wiki project) should become "Wikidata Commons" (just like we have "Wikimedia Commons" for files and media), and "Wikidata" would remain the name of the extension hosted by any wiki. No need to change logos for now, or domain names. For our Wikimedia wikis, all these instances of Wikidata installed in existing wikis should continue working the same way, with the same interface, except that there will be multiple stores using possibly distinct schemas, one for each store. No need also to change the way interwiki links between pages are shared: they can continue using the central store (which is already the best place for them).
Verdy p (talk) 22:39, 30 June 2013 (UTC)Reply
I am still confused by this request. Are you saying that the plural of the English word "city" is different in the German and French Wiktionary? --Denny (talk) 15:37, 11 July 2013 (UTC)Reply

Dictionary entries are not spellings


Arm is actually three separate English words, with three independent etymologies, respectively meaning “upper appendage,” “poor,” and “weapon.”

Print dictionaries have an independent entry for each, which are adjacent for the purpose of alphabetic indexing, but each has its own etymology, one or more senses, and other attributes.

Wiktionaries, unfortunately, use orthographic expression (spelling, punctuation, and capitalization) as the unique key (expressed in URLs and page titles). So not only does wikt:arm contain English, German, Romanian, and other sections, it also has the unsatisfactory kludge of added intermediate headings #Etymology 1, #Etymology 2, etc., under #English.

(This also has the side effect that wikt:labor, wikt:labour, wikt:Labor and wikt:Labour, also wikt:school bus, wikt:school-bus and wikt:schoolbus, are defined on different pages.)

It’s not clear whether or how the proposal accounts for this. A dictionary database’s main key should represent an etymological word, not a spelling shared by several unrelated homographs. Michael Z. 2013-06-22 18:48 z

As I understand it, we should be able to create entities for "words" easily, different from the articles titles and their interwikis. The way the various projects use this data, however, is more difficult to predict. On fr.wikt for example we create the section of each word with a template. This allows the creation of a specific anchor for each word, like wikt:fr:trouble#fr-nom-2, and it means that we could link easily to the Wikidata entity for that word. The entity could take into account what are the various spellings etc. Darkdadaah (talk) 21:25, 23 June 2013 (UTC)Reply
But what is unclear to me: would Wikidata have separate entities for English arm (“upper appendage”) and English arm (“weapon”)? Then one Wiktionary could use these to create two different sections on a page, while another could use them to create two different pages. Michael Z. 2013-06-23 23:34 z
It is also unclear what would happen when it is discovered that what was thought to be a single word/item turns out to be several. That happens with some regularity on the English Wiktionary, as missing senses/definitions are added. As a particularly nasty example, consider the word cleanly. It's an adverb meaning "in a clean manner" right? That's what most print dictionaries will tell you. But, in fact, it's both an adjective and an adverb with more than three senses of each. (See en:wikt:cleanly.) English adverbs are short-changed in print dictionaries, and it's an on-going effort on Wiktionary to flesh these items out. The result, however, is that we frequently discover that one sense is actually three, four, eight, or even more. How would Wikidata handle that kind of dynamic change? --EncycloPetey (talk) 01:21, 24 June 2013 (UTC)Reply
Just create an additional item. —PοωερZtalk 02:49, 24 June 2013 (UTC)Reply
You don't understand the problem, then. The problem is that a single item has itself become multiple items. It is not that a separate new item was discovered, and the old item is still intact. Rather, the original idea turns out to not exist, and several new separate ideas exist instead. This creates a real problem if the original entry included translations, synonyms, antonyms, etc., because they would now all be invalid. --EncycloPetey (talk) 03:11, 24 June 2013 (UTC)Reply
We already handle this issue with glosses and we manage. If anybody isn't familiar with glosses, see wikt:Help:Glosses. They provide database-like functionality without using a database. I propose that we do. --Haplology (talk) 04:26, 24 June 2013 (UTC)Reply
No, glosses don't help with splits like this. When we split a sense, all the translations go into a "to be checked" section, and are effectively removed from association with any specific entry. --EncycloPetey (talk) 07:59, 24 June 2013 (UTC)Reply
If a thought-to-be-adverb is discovered to be an adverb and an adjective, create another item for the adjective, delete all false data in the original adverb item and transfer it to the new adjective item. Where's the problem in that? —PοωερZtalk 07:24, 24 June 2013 (UTC)Reply
It's not just that; you're still missing the point. The adverb itself has several meanings, each of which will have different translations from other meanings of the adverb, and likewise different synonyms, antonyms, etc. Having a single item for the adverb isn't sufficient; there need to be several. And when we do that, how do we determine which data from the original item is "false" in that situation? Effectively, all the data associated with the adverb item up to that point must be considered suspect and deleted. --EncycloPetey (talk) 07:59, 24 June 2013 (UTC)Reply
Read the page again. That's covered by senses: "The sense is the second new entity type (Word sense). A sense is different from other entities as it is not independent, but each sense completely depends on an expression. A sense belongs to one expression, and one expression only. But each expression can have several senses." As I understand it, the meaning-specific data will be stored at these sense-items. —PοωερZtalk 08:24, 24 June 2013 (UTC)Reply
Clearly you're just not getting it. I give up trying to explain the issue. If this is what Wiktionary-Wikidata development would be like, then it would never work. There's just no communication. --EncycloPetey (talk) 17:01, 24 June 2013 (UTC)Reply
"The adverb itself has several meanings, each of which will have different translations from other meanings of the adverb, and likewise different synonyms, antonyms, etc. Having a single item for the adverb isn't sufficient; there need to be several." That's exactly what senses do, I don't see what there is more to get? —PοωερZtalk
Exactly. I've presented the problem twice, and you still can't see it. Instead you reiterated the proposal to me, which proposal I had already read, and which does not address the problem. If you're not able to grasp the problem, or if I'm unable to express it, then communication can't happen and this sort of collaboration will not work . . . and that's with both of us writing in the same language. --EncycloPetey (talk) 02:16, 25 June 2013 (UTC)Reply
Let me try again: Your problem is not a technical issue, but the amount of work to fix data after a split? —PοωερZtalk 02:34, 25 June 2013 (UTC)Reply
No, the amount of work is a complication associated with this issue, but amount of work is not the problem. --EncycloPetey (talk) 20:30, 25 June 2013 (UTC)Reply
  • Yes, it would be a great advantage if joining the database to WT dealt with spelling variants. Consider languages with unstable orthography, e.g. open any entry in AND: a short word can have dozens of variants, which can appear in nearby lines of a single manuscript. Currently, redirects have to be created in each WT section to catch all of these variants; and if spellings overlap, they must be marked in special subsections on the page (not to mention similarly spelled words of different languages). That would be a nice thing to centralize; for this purpose, the main item would be an entity roughly described as an etymology. It would have 1 word language property and N spelling properties, and relations to entries in M host languages describing it (i.e., interwikis, linking things that really have a dictionary sense). Thus, users would be able (after certain technical improvements) to get by query the dictionary extraction they really prefer, e.g. I want to get words spelt estre in Anglo-Norman, with descriptions in Russian and, if those don't exist, in English. This will require only linear, bot-possible operations with existing content. Ignatus (talk) 14:28, 24 June 2013 (UTC)Reply
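The query described above (one word language, N spellings, descriptions in M host languages, with a fallback order) can be sketched roughly as follows. This is only an illustration under stated assumptions: the `Etymon` class, the `lookup` function, and the sample entries are hypothetical and not part of any Wikidata proposal or API.

```python
from dataclasses import dataclass, field


@dataclass
class Etymon:
    """Hypothetical central item: 1 word language, N spellings, M descriptions."""
    language: str                      # language the word belongs to
    spellings: set                     # every attested spelling variant
    descriptions: dict = field(default_factory=dict)  # host language -> entry


def lookup(items, spelling, language, preferred):
    """Find items with this spelling in this language.

    For each match, return the description in the first host language
    from `preferred` that exists; matches with no such description are skipped.
    """
    results = []
    for item in items:
        if item.language != language or spelling not in item.spellings:
            continue
        for host in preferred:
            if host in item.descriptions:
                results.append((host, item.descriptions[host]))
                break
    return results


# Made-up sample data; "xno" is the ISO 639-3 code for Anglo-Norman.
items = [
    Etymon("xno", {"estre", "variant-a"}, {"en": "English entry 1"}),
    Etymon("xno", {"estre"}, {"ru": "Russian entry 2", "en": "English entry 2"}),
    Etymon("fro", {"estre"}, {"en": "English entry 3"}),
]

# Russian first, English as the fallback.
print(lookup(items, "estre", "xno", ["ru", "en"]))
```

The lookup is a single pass over the items, which matches the remark that only linear, bot-possible operations are needed.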

Discussions in other places

Here are a few other places this proposal was discussed at so we can keep these in mind too:

--Lydia Pintscher (WMDE) (talk) 12:22, 25 June 2013 (UTC)Reply

The current Wikidata already provides the basic needs for a dictionary

When you have an existing Wikidata entry like "horse" it is effectively connected to translations in many languages. It has a definition, and it serves pretty nicely for appreciating the difference from another "horse" (http://www.wikidata.org/wiki/Q869595). With some effort all the words with horse can have descriptions that serve as definitions. This is exactly what you expect from a dictionary. The question is therefore why not expand on the dictionary functionality that is already there.

Obviously, you can have synonyms in this way, and we do. We can also have all kinds of other words relating in ways like plural, diminutive etc. Why have separate technology when we already have everything that is needed? Thanks, GerardM (talk) 12:45, 25 June 2013 (UTC)Reply

Because one is information about a concept, the other is information about a word. A word can have the meaning of a certain concept, but they are fundamentally different things. —PοωερZtalk 14:09, 25 June 2013 (UTC)Reply
I second 23PowerZ : OmegaWiki focused on concepts, but those can only be used roughly in a multilingual environment where a one-to-one translation between languages is needed (something useful in Commons for example). Natural languages are not that simple : we have to focus on words first, as this is how Wiktionaries work. Darkdadaah (talk) 15:02, 25 June 2013 (UTC)Reply
That idea only works if (1) all the articles are titled using the exact equivalent in all the languages, and (2) all the articles are words. Neither of these conditions is true on the Wikipedias. Look at our own entry for the genus Amborella (a small tropical tree) to see some of the inherent pitfalls resulting from (1). There are genus names mixed with species names, native language names mixed with scientific names, etc. You can't just pull the information automatically from the Wikipedia articles. I even work sometimes with published translating dictionaries from major publishing companies and find serious errors. The only Galician-English dictionary I could find when I went looking has about a 10% error rate in translations, spellings, etc., and it doesn't distinguish between senses of words at all. What you're proposing is to take a step further away from that kind of precision by using titles of articles and hoping they mean the same thing. --EncycloPetey (talk) 20:28, 25 June 2013 (UTC)Reply
I have been reading and I am baffled. First of all, I have the strong impression that people do not understand the existing Wikidata. There is nothing that prevents multiple senses of a word. I think this is because most people consider Wikidata a Wikipedia extension. It does not need to be one. When it is, it is not as powerful as it could be/should be. GerardM (talk) 17:17, 6 July 2013 (UTC)Reply
Look at your own example for "horse" to see part of the problem. According to Wikidata, the Asturian for "horse" is "Equus ferus caballus", if we go by what you have said. The inherent problem is that the data items here list titles of Wikipedia pages, and do not necessarily list translations in various languages. Titles of encyclopedia articles do not always translate in the way one would expect. --EncycloPetey (talk) 18:49, 11 July 2013 (UTC)Reply
I fully agree with 23PowerZ here. Words are fundamentally different things than concepts, and that is also why we have this discussion in the first place: how should these differences be represented? --Denny (talk) 15:41, 11 July 2013 (UTC)Reply

Namespace

Will expressions and senses be stored in a namespace of their own, or do they just get a different prefix than Q? And for senses: Will they be consecutively numbered like everything else? Since they are affiliated to a specific expression, their ID could reflect that. Example: the expression W2808 has the three senses S2808-1, S2808-2 and S2808-3. —PοωερZtalk 17:12, 26 June 2013 (UTC)Reply
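The ID scheme suggested in the example (expression W2808 with senses S2808-1, S2808-2, S2808-3) could look roughly like this. It is a minimal sketch of that suggestion only; the function names are made up, and nothing here reflects how Wikidata actually assigns identifiers.

```python
import re

# Assumed format from the example above: "S<expression number>-<sense index>".
SENSE_ID = re.compile(r"^S(\d+)-(\d+)$")


def parse_sense_id(sense_id):
    """Split an ID such as 'S2808-3' into its expression ID and sense index."""
    match = SENSE_ID.match(sense_id)
    if match is None:
        raise ValueError(f"not a sense ID: {sense_id!r}")
    return f"W{match.group(1)}", int(match.group(2))


def sense_ids(expression_id, count):
    """List the sense IDs that an expression like 'W2808' would own."""
    if not expression_id.startswith("W") or not expression_id[1:].isdigit():
        raise ValueError(f"not an expression ID: {expression_id!r}")
    number = expression_id[1:]
    return [f"S{number}-{index}" for index in range(1, count + 1)]


print(sense_ids("W2808", 3))
print(parse_sense_id("S2808-2"))
```

With such IDs the owning expression can be read off a sense ID without a lookup, which is the point of the suggestion.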

Expressions will be in a separate namespace. Sense will be part of the expressions, though - i.e. they are just a section on the expression's page. --Denny (talk) 15:43, 11 July 2013 (UTC)Reply
But a sense can't be associated with just one specific expression in a multilingual project, or we'd have endless repetition. Whatever sense is associated with the concept of the color red (call it sense S2956), will be associated with the English expression "red" and the Spanish expression "rojo" and the Latin expression "ruber", etc., across all languages. If we didn't make that association with all those expressions for the same sense, then we'd have to endlessly replicate the same sense for each expression that bore that sense.
If I understand your thinking, you conceive the expression as the base center, to which all the senses are connected. That's backwards for a dictionary. The sense is the fundamental nexus, to which all the related expressions are connected. --EncycloPetey (talk) 22:14, 26 June 2013 (UTC)Reply
Senses don't store meanings, but additional information about a particular meaning, e.g. British English, poetic, archaic which is unique to every language. Senses themselves could be linked to regular Wikidata items, e.g. red (Q3142). —PοωερZtalk 22:47, 26 June 2013 (UTC)Reply
Senses are meanings in a dictionary. The kind of information you're describing (British English, poetic, archaic) pertains to a sense in association with a particular language, not with the sense itself in isolation. As I say, you've got the whole data structure inverted. An expression can only be the fundamental unit of a monolingual dictionary. Translations belong to a sense, not to an expression. Synonyms belong to a sense, not to an expression. The only data specific to an expression are etymology and language, and even then the etymology chosen depends in part upon the senses associated with that expression. When you expand across the whole gamut of languages to create a multilingual dictionary, it is the sense that becomes the fundamental unit. In many situations, that sense would be a "regular Wikidata item", as you call it, but in many other situations, such senses would not have a corresponding data item that exists yet.
The whole way that expressions and senses are set up in the proposal is fundamentally flawed. --EncycloPetey (talk) 23:14, 26 June 2013 (UTC)Reply
? I can't think of anything that would be duplicated across languages, other than possibly "refers to". Care to explain? --Yair rand (talk) 23:21, 26 June 2013 (UTC)Reply
That depends on how to envision the data structure. The way it's described in the proposal, each expression (Wiktionary: "term") would have several senses attached under it. OK, that's how a monolingual dictionary is structured. But in a multilingual dictionary like we have, with each Wiktionary setting a different language as the standard, this structure would be repeated across every term in every language. Think then: how would translations be indicated in the database? The translations are a relation between the senses, not relations between the expressions. In effect, you've then duplicated the same sense in every language where that concept occurs, and then have to mutually connect all the senses that pertain to a single definition across all the languages that contain that meaning. That's a huge number of interconnections to create and maintain, and a ridiculously complex data structure, when you could instead have set up everything based on a single item that contained the unifying sense/meaning, and then have attached all the various language-specific expressions and their language data to that single sense. I wish I had access to a good graphics package, as I think a structural diagram would make this much clearer. --EncycloPetey (talk) 00:15, 27 June 2013 (UTC)Reply
I don't think the database is likely to hold translation info at all, actually. Probably not definition data either. Neither of them would be at all useful to store on Wikidata, and I'm quite sure that the idea of "X translates to Y which translates to Z, thus X translates to Z" is simply wrong, and so a unified "sense" that words of all languages connect to wouldn't be useful at all. --Yair rand (talk) 00:57, 27 June 2013 (UTC)Reply
But definitions are exactly what the "senses" in the proposal refer to. Did you not read the proposal that we're discussing? The description of "Sense" links to the Wikipedia article w:Word sense where it is defined as "one of the meanings of a word". That's a definition of definition.
If you don't include definitions, then what exactly can you include besides wikilinks? With no definitions, you can't include citations, synonyms, translations, part of speech, grammatical or usage information, or anything else we've been discussing, and can't actually include any kind of data I can think of except wikilinks. Even pronunciation information is dependent upon knowing which definitions you're talking about. --EncycloPetey (talk) 01:05, 27 June 2013 (UTC)Reply
What I mean is that a sense of a word can be included as a unit on Wikidata without the actual definition, by which I mean the actual line of text or wikitext explaining/defining the meaning of the sense. "A sense [...] has no label or alias, but a description or gloss." A "gloss", as we tend to use the term on the English Wiktionary at least, means "a set of words that uniquely identify a definition"; see wikt:Help:Glosses. Being uniquely identified, the sense on Wikidata can be assigned properties such as synonyms, citations, pronunciation, "refers to" data, and so on. Additionally, the glosses allow them to be associated to their "sections on their respective expression's wiki page", though probably not by the software itself.
At least, that's my understanding of things, which might be heavily flawed for all I know. --Yair rand (talk) 05:15, 27 June 2013 (UTC)Reply
No, Yair rand, I agree with your understanding here. --Denny (talk) 15:46, 11 July 2013 (UTC)Reply
In practice, the gloss on Wiktionary is a shortened form of the definition. Hence, every sense will be equivalent to and correspond with a particular definition. So why do you say the definition can't reside here? And yes, as I said above (and you repeated), the synonyms, citations, pronunciation, etc. are tied to a particular sense, and not to an expression. Thus, my spiel about how senses are truly the organizing unit, and not the expressions with which they are matched.
Now, once you realize the glosses are actually definitions, then you see that these glosses will be repeated all over the place if we treat the expressions as the primary organizing unit. The same gloss of "inner part of the hand" (for palm [of the hand]) and "tropical tree in the Palmae / Arecaceae" (for palm [tree]) will be repeated all over, as palma in Latin will have these two glosses, as will Slovene palma, etc. Hence, the endless repetition I mentioned before. That's the way the current data model has it, with all the senses sharing a gloss then all mutually interlinked. It makes more sense to have the sense/gloss as the unit, with translations and other language-specific information attached to dependent subsidiaries of the sense. Say, have "inner part of the hand" as an organizing center, then off that would be an English expression palm, with details pertaining to that expression as applied to the particular sense. The same gloss/sense would have similar dependent items for Spanish palma, Catalan palmell, Russian ладонь; it might also have an associated image. This way we don't have to mutually interconnect all the corresponding senses; they would already be a single unit.
The last sentence in your comment has too many unspecified referents for me to follow. --EncycloPetey (talk) 05:36, 27 June 2013 (UTC)Reply
I'm sure it has already been said, but I guess it has to be repeated: we can't use concepts (or glosses or senses) as shared entities between different words. In natural languages, it is rare to have a one-to-one match between senses (except in strict terminologies, like in scientific domains): synonyms are rarely strictly equivalent, as are translations. Again, that's what Omegawiki did, and that's one of the reasons it failed. We can easily model relations between senses (is_synonym, translation_of...) without sacrificing the nuances of each one. Darkdadaah (talk) 12:30, 27 June 2013 (UTC)Reply
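The alternative described here, where each sense stays attached to its own expression and is connected to others by explicit relations, might be sketched as below. All of it is an assumption for illustration: the `Sense` and `SenseGraph` names are hypothetical, the relation is treated as symmetric for simplicity, and the three senses reuse one English gloss only to keep the example short.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sense:
    """A sense that belongs to exactly one expression in one language."""
    expression: str
    language: str
    gloss: str


class SenseGraph:
    """Stores explicit relations between senses; nothing is inferred."""

    def __init__(self):
        self._edges = set()

    def relate(self, relation, a, b):
        # Stored in both directions: an assumption made for this sketch.
        self._edges.add((relation, a, b))
        self._edges.add((relation, b, a))

    def related(self, relation, sense):
        return {b for rel, a, b in self._edges if rel == relation and a == sense}


palm_en = Sense("palm", "en", "inner part of the hand")
palma_es = Sense("palma", "es", "inner part of the hand")
ladon_ru = Sense("ладонь", "ru", "inner part of the hand")

graph = SenseGraph()
graph.relate("translation_of", palm_en, palma_es)
graph.relate("translation_of", palm_en, ladon_ru)

# Only the stated pairs are linked; the Spanish and Russian senses are
# not connected to each other unless someone adds that relation.
print(graph.related("translation_of", palma_es))
```

Because no relation is derived transitively, each pair can keep its own nuances, at the cost of more links to maintain than a single shared sense would need.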
So then, Wikidata is proposing to take the more than 5,000,000 definitions on the English Wiktionary, and assign each one an abstract code (even when they repeat each other)? Simultaneously, it has to do the same with the millions of definitions on the French Wiktionary, and figure out which correspond and are actually the same. And it has to do this again with each of the Wiktionaries? And it has to do this where, over and over, the entries on different Wiktionaries differ? E.g. fr:wikt:chaton has seven senses listed, but the same word on en:wikt:chaton has four. All this correlation would therefore have to be done manually by bilingual, or multilingual, editors. And after all that's done, then we can start using the data in some unspecified way? --EncycloPetey (talk) 16:54, 27 June 2013 (UTC)Reply
yes, this is correct. Although there is no need to wait after this is all done - I very much expect that we start using the data even though it is incomplete and still being built. --Denny (talk) 15:46, 11 July 2013 (UTC)Reply

infinitive and plural are not properties

In the proposal, "plural form" and "infinitive of" are given as example properties, but these are values rather than properties. Indicating a plural ought to be done using the property "Grammatical number", with a value of "plural". "Plural of" becomes meaningless in more highly inflected languages than English. The Spanish adjective flaco (thin) has both a masculine plural and a feminine plural. Latin albus (white) has 18 different plural forms, and 18 different singular forms. Verbs are even worse. The Spanish verb ganar (to earn) has 33 plural forms, and Latin regular verbs have upwards of 45 plural forms. And "plural" is only one attribute that a word might bear. A Latin verb form could be the "first-person plural present active subjunctive" of the verb. A French pronoun form could be the "second-person singular polite" of the pronoun. Thus, the relationship cannot be indicated by a property. Rather, each word form has a set of properties, such as grammatical number, gender, case, mood, tense, voice, etc., with each property having a set of possible values to which it can be set. The relationship to the "main" form of the word would be best indicated by a property "Grammatical form of", with no details in the relationship concerning the sort of form it is. The sort of form is then indicated by a set of statements on the data item for the form itself. --EncycloPetey (talk) 01:36, 27 June 2013 (UTC)Reply
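The structure argued for above, a generic "Grammatical form of" link plus feature statements on the form itself, might be sketched like this. The `WordForm` class, the `find_forms` function, and the feature names are illustrative assumptions, not part of the proposal.

```python
from dataclasses import dataclass, field


@dataclass
class WordForm:
    spelling: str
    form_of: str                                   # "Grammatical form of": the main form
    features: dict = field(default_factory=dict)   # property -> value


# The four forms of the Spanish adjective flaco (thin).
forms = [
    WordForm("flaco", "flaco", {"number": "singular", "gender": "masculine"}),
    WordForm("flaca", "flaco", {"number": "singular", "gender": "feminine"}),
    WordForm("flacos", "flaco", {"number": "plural", "gender": "masculine"}),
    WordForm("flacas", "flaco", {"number": "plural", "gender": "feminine"}),
]


def find_forms(forms, lemma, **wanted):
    """Return spellings of forms of `lemma` whose features match all of `wanted`."""
    return [
        form.spelling
        for form in forms
        if form.form_of == lemma
        and all(form.features.get(name) == value for name, value in wanted.items())
    ]


# "The plural of flaco" is ambiguous: there are two plural forms.
print(find_forms(forms, "flaco", number="plural"))
# Adding a second feature narrows it to one form.
print(find_forms(forms, "flaco", number="plural", gender="feminine"))
```

A single "plural of" property could not distinguish flacos from flacas, whereas combining feature statements scales to forms such as "first-person plural present active subjunctive".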

Sorry for not being clear. So the word "cities" would indeed have the property "grammatical number" with the value of "plural". But there is no need to merely link between "cities" and "city" using "grammatical form of", but it could be more specific like "singular form". Then again, this part of the discussion is already a discussion on the level of how to use the proposed data model, not about how the data model should look like -- both alternatives would be expressible, and it would be up to the community which one to use. --Denny (talk) 15:48, 11 July 2013 (UTC)Reply
So, the way relationships are coded might be different for different languages, then? How would that make it easier for a speaker of Uzbek to code Gaelic entries, if there is not consistency in the way grammatical form relationships are coded? And how would an Uzbek speaker figure any of this out if they don't read English? --EncycloPetey (talk) 18:55, 11 July 2013 (UTC)Reply
Well, it sure could be -- the software would not enforce anything. It would be up to the community to keep it consistent, just as it is currently.
Regarding the Uzbek reader - he does not need to read English, he can use Wikidata with the Uzbek interface -- Wikidata is inherently multilingual - see here for an example. Unfortunately, this does not extend to talk pages, as discussions are in the language the editor chooses. --Denny (talk) 09:34, 12 July 2013 (UTC)Reply

Wikidata = wiki + data

The objective of Wikidata is to gather objective data, and to make data accessible to all projects.

But, for words, strictly speaking, it's difficult to find data. I can think of:

  • the number of letters of the word
  • the creation date of the word (when known, which is exceptional)
  • its creator (when known, which is exceptional)
  • number of uses by year, according to a reference set of works, or according to Google
  • number of uses in each of some famous works
  • the famous writer using it most
  • its use rank among words of the language
  • and this kind of thing...

But most of what can be found in a dictionary about a word is an explanation of what the word is, what it means, how it's pronounced, where it comes from, its gender, its conjugation, etc. And all of this is disputable, at least sometimes. For example, the part of speech is something conventional (and, currently, wiktionaries sometimes choose different parts of speech for the same word). Even the question "is this a word" is, very often, a very difficult question, and different wiktionaries give different answers. In a word, there is little interest in gathering actual data, and what is interesting is not actual data. Lmaltier (talk) 19:31, 27 June 2013 (UTC)Reply

On parts of speech: although it is true that sometimes different Wiktionaries separate entries differently, I believe it is quite rare, and even in those cases, remember that we don't have to use Wikidata systematically. We just want to link what is equivalent between Wiktionaries, if there is no equivalency, then we won't use Wikidata in those cases (just like interwikis). Darkdadaah (talk) 08:36, 28 June 2013 (UTC)Reply
Note: The fact that a problem is relatively rare does not make it trivial. One area where the various Wiktionaries differ enormously on assigning parts of speech is the numerals, which is a large and very non-trivial group of words. Even on the English Wiktionary alone, we've wrestled repeatedly with how to handle these words, as most grammarians and grammars don't bother with the details.
What are the specifics of the issue? Well, it's really complicated, but to give you a sense of the problem, cardinal numerals (or cardinal numbers. . . even the terminology is a matter of continual debate) are treated as a separate part of speech on Wiktionary, but the ordinal numerals are sometimes numerals and sometimes adjectives. On the French, Italian, and Spanish Wiktionaries, the ordinals are always adjectives, and the cardinals are treated as noun/adjectives. The German Wiktionary follows the pattern of the English Wiktionary for cardinals, but the other Wiktionaries when it comes to ordinals. And that's just the differences among some of the major European languages that are written with the Latin alphabet. You get into Asian languages and there are parts of speech that have no equivalent in the West. --EncycloPetey (talk) 22:29, 10 July 2013 (UTC)Reply
Re:Lmaltier: Some of the statistics you've enumerated can't be used. You can only count the number of letters in a word if the word is spelled with letters. Some languages, such as Chinese, have no letters. The number of uses by year can't be tallied from Google unless you have the means to weed out repetitions from the same source, quotations from the same source, quotations from earlier sources, and have an automated means to distinguish homographs and identical spellings from another language. Number of uses in famous works is only possible if you can correlate all the forms of the same word (amo, amas, amat, ...) while again eliminating homographs. And you can see that none of these are very simple statistics at all, but would require a great deal of directed original research to accomplish. People write doctoral theses on things like this. --EncycloPetey (talk) 22:29, 10 July 2013 (UTC)Reply

I understand the point, but this sounds stupid. Language doesn't lend itself well to statistics, I don't see much of a point, and language probably never is objective. The meanings of words are just agreed upon by people who use the particular language and etymologies of words are just educated guesses, the etymology of a word just points to a direction that is dark and we have no way to cast light there. If people agree, reach a consensus, that's reality. An objective approach is too confining. --Hartz (talk) 11:02, 15 July 2013 (UTC)Reply

I don't know how it is in your home wiki, but please refrain from calling each other or their suggestions "stupid".
"Language doesn't lend itself well to statistics" - this is simply wrong. The most successful NLP system are built around the paradigm of statistical analysis of language. Anyway, that wouldn't even be the point here. Wikidata would not add much "statistical" data anyway, but rather a graph of expressions and their connections. And saying that "tables" is the "plural of" "table" is perfectly doable.
Wikidata is well able to handle a property value - e.g. an etymology - with the source that states that etymology. I assume that Wiktionary does not contain original research, in which case verifiability can be used as a criterion for content in Wikidata. --Denny (talk) 13:46, 19 July 2013 (UTC)Reply

-sche thinks trying to host senses or translations on Wikidata is wrongheaded

I agree with many of EncycloPetey's comments. It would be great if Wikidata hosted interwiki links. However, the idea of hosting senses or translations on Wikidata is, frankly, bad. It is naive and/or wrongheaded. Who thinks they can duplicate tens of millions of "senses" from all the Wiktionaries, sort them, link everything, etc? The task would be impractical even if everything else about it was black and white, and it isn't. (I can expand on that last point if desired.)

Besides, if one does copy all senses into Wikidata, one must translate them into all languages if the Wiktionaries are to transclude them... and if one does that, what has one accomplished? If fr.Wikt still displays "mettre (le vin) à température de la pièce" as a sense of chambrer, but anyone wanting to edit that sense must leave fr.Wikt and come here, then all one has done is spend an inordinate amount of time and effort changing no aspect of what is displayed to readers, but creating hurdles for potential contributors of new content, who would need to learn Wikidata's complicated, unintuitive structure (pages have numbers for titles, etc...), and discuss their edits, in an English-speaking environment.

EP remarked in an earlier thread that "[i]f you're not able to grasp the problem, or if I'm unable to express it, then communication can't happen and this sort of collaboration will not work . . . and that's with both of us writing in the same language." Many of the proponents of this idea are not Wiktionarians, and AFAICT all are able to speak English. Most people in the world do not speak English as their primary language. Those of us who are participating in this discussion have obviously been able to learn it, but what becomes of all the people who haven't?

Also: if senses and translations, pronunciation data, etc are copied to Wikidata, does that not reduce the Wiktionaries to glorified "skins" which merely present information in different formats? This proposal is, in its ultimate effect if not in its intention, an effort (by people who seem to have learned nothing from the failure of OmegaWiki) to kill the very successful Wiktionaries and replace them with OmegaWiki 2.0.

-sche (talk) 02:38, 28 June 2013 (UTC)Reply

PS, I know my thoughts are harshly worded: that is because of how bad an idea I feel this proposal is, and how exasperated I am to have seen it (or similar proposals) twice before. -sche (talk) 03:00, 28 June 2013 (UTC)Reply

First, I don't think anyone here thinks that storing the actual content of articles (definitions, etymologies, pronunciations...) can or should be done with Wikidata: this is a far too complicated matter. But we can still use it to store relations: although I agree that using Wikidata for senses would be extremely difficult as of now, I think Wikidata can be used for things other than just interwikis. As I said before, "words" can be quite easily and objectively differentiated: different homographs for example are distinct entries if they have different etymologies or different parts of speech (e.g. wikt:fr:trouble#fr-nom, wikt:fr:trouble#fr-flex-verb, wikt:fr:trouble#en-verb). It would be quite easy to create interwikis for "words" rather than "articles" in this manner. After all, the real elementary units of Wiktionaries are words, not articles!
Note that in this case, a lot of Wiktionaries use templates to create those words sections, so it could even be done automatically. Darkdadaah (talk) 08:52, 28 June 2013 (UTC)Reply
Re "I don't think anyone here thinks that storing the actual content of articles (definitions, etymologies, pronunciations...) can or should be done with Wikidata": that's definitely not the impression I get from reading this page, the preceding discussions, and the discussion right after this.
And if it is true that Wikidata won't host entries' contents, but will host only "words" ... what does that accomplish? What is gained if Wikidata has items for "words", other than (a) the ability for Wikidata's data to fall out of sync with the Wiktionaries' as the Wiktionaries (and perhaps Wikidata) split or merge words, and (b) cognitive dissonance, because e.g. de.Wikt + some references consider the "side of a ship" sense of "board" and the "plank" sense of "board" to be separate words, but en.Wikt + other references consider them the same word? (Wikidata hosting "words" doesn't change interwiki linking, because AFAICT the interwiki linking system can only link page to page.) -sche (talk) 22:05, 29 June 2013 (UTC)Reply
For some Wiktionary editions it might make sense to merely provide "glorified skins" for some words. For example, the Uzbek Wiktionary might decide to provide simple skins to Wikidata content for words in Gaelic. Mostly because the only other option is, due to the size of their community, to provide nothing on these words. So we can make sure that a lot of language communities can get access to much more knowledge than currently. But a large Wiktionary community like the French one, I do not expect them to become merely skins for French or English words, but rather that they will continue to provide their excellent entries on these words where they have enough editor energy to sustain and improve them. This proposal is not to replace the Wiktionaries! It is to provide those parts of Wiktionary that want it a common resource for structured data. Not more. --Denny (talk) 15:54, 11 July 2013 (UTC)Reply

Populating the database

We will have to think about how we populate the Wiktionary part of the database. If we leave it blank, there is a real danger of separate Wikidata items being generated for the same words/expressions, leaving a lot of duplicates/triplicates/etc. to resolve. If we prefill entries with expressions in a single language, it would be favouritism and people would need to know that language particularly well. For Wikipedia, we have interwiki links in which (mostly) each article connects to only one other version in a specific language. This is the exception for Wiktionary entries. It's not an easy task to automate the conversion in Wiktionary's case. If we are to do this, we're gonna have to reorganise the way the entries are registered so this can be done in an easy/easier way. -Svavar Kjarrval (talk) 10:00, 28 June 2013 (UTC)Reply

If you want to create one item per 'string' (where e.g. 'cat' is a single string stored by en.Wikt in wikt:cat, by fr.Wikt in wikt:fr:cat), the process should be extremely simple; that's why moving Wiktionary interwiki data to Wikidata is such an obvious move. You should only have to watch out for a limited number of specific kinds of duplicates, which you should furthermore be able to find systematically even if you don't identify them until after you've generated/moved the content on/to Wikidata. You also shouldn't have to worry about "know[ing] [a] language particularly well" or even at all. Things to watch out for include:
  1. strings which use different Unicode codepoints to represent certain characters, e.g.
    1. apostrophes : en.Wikt has c'est la vie, fr.Wikt has c’est la vie : the Wiktionaries use redirects to ensure that these pages have interwiki links, as outlined here; that should aid you in linking the pages
    2. palochkas : whether the proper symbol 'Ӏ' or the ersatz 'I' is used varies from Wiktionary to Wiktionary, and sometimes varies within a Wiktionary from language to language
    3. clicks vs exclamation marks, lines, etc : whether the proper click symbols (ǃ, ǁ) or ersatz symbols (!, ||) are used varies from Wiktionary to Wiktionary, and sometimes varies within a Wiktionary from language to language
    4. ligatures : some Wiktionaries may use the single-character 'ĳ', en.Wikt uses the two-character 'ij'
  2. idioms which different Wiktionaries lemmatise differently, e.g. "punch one's way out of a paper bag" vs "punch his way out of a paper bag" vs "way out of a paper bag" vs "out of a paper bag"
If, on the other hand, you want to create one item per 'word' (i.e. you want to create six different items for 'cat' as a string used in English, and a separate item for 'cat' as a string used in Irish) ... don't. That's a bad idea; only or mostly bad things will come of it. -sche (talk) 22:33, 29 June 2013 (UTC)
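The codepoint-level duplicates listed above could, in principle, be found systematically with a folding key. The following is only a minimal sketch in Python: the table `FOLD` and the functions `title_key` and `candidate_duplicates` are hypothetical names, the table covers just the examples mentioned in this thread, and it only flags candidates for review. It does not decide which spelling is correct, and it does nothing for the differently lemmatised idioms in point 2.

```python
from collections import defaultdict

# Hypothetical folding table: maps the "proper" characters mentioned above
# to their ersatz ASCII forms, purely to build a comparison key.
FOLD = str.maketrans({
    "\u2019": "'",   # typographic apostrophe -> ASCII apostrophe
    "\u04C0": "I",   # Cyrillic palochka -> Latin capital I
    "\u01C3": "!",   # click letter (retroflex) -> exclamation mark
    "\u01C1": "||",  # click letter (lateral) -> two vertical bars
    "\u0133": "ij",  # single-character ij ligature -> two letters
})

def title_key(title):
    """Return a comparison key in which the variants above coincide."""
    return title.translate(FOLD)

def candidate_duplicates(titles):
    """Group titles whose keys coincide; keep groups with more than one spelling."""
    groups = defaultdict(set)
    for title in titles:
        groups[title_key(title)].add(title)
    return {key: found for key, found in groups.items() if len(found) > 1}
```

Folding the palochka to 'I' is only safe for building a key; the stored titles themselves should stay untouched.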
To be clear, I'm not against using Wikidata as a container for Wiktionary items. What I'm concerned about is how we should populate the database, not whether we can do it. Since a page in Wiktionary is far from being one article per expression, interwiki links don't have as precise a meaning in Wiktionary as they have in Wikipedia.
The main problem we have is that it's very hard to reliably connect an expression to the same one in other languages automatically. This means there's a lot of manual work ahead, and that can be fairly hard on people in wikis with many words but only a handful of volunteers.
What I was referring to before regarding duplicates/triplicates/etc. was that if we deal with it as we do with Wikipedia interwiki links in Wikidata, we have a single item with interwiki connections to each language. In Wiktionary's case, that would be an expression connecting to or containing its version in another language. If we just load up the wiktionary with the most entries (currently the English one), we're going to have to assume everybody knows that language, because there's no automatic way to connect the other languages to the corresponding English entry. This would also not be fair to people who don't know English and/or to wiktionaries with few translations into English. If we don't automatically load a specific wiktionary, then it's a connection free-for-all. For instance, if the English word isn't there or isn't known to a user of the Japanese wiktionary, that user might create a connection between the Japanese wiktionary expression and a Chinese one. Then we'd have a duplicate for the same expression, and the same problem might be repeated many times over with other languages. People could then be confused and have to deal with the problem of joining entries. Joining entries in turn raises the problem of very-similar-but-not-exactly-the-same expressions, possibly ending up with words with different meanings being connected to each other. This could of course potentially be solved with manual review.
I suggest that wiktionaries first change the way entries are recorded into a format in which it's easy for a bot to parse and automatically load the entries into Wikidata in the future. The 'translations' section and t* templates in the English Wiktionary are close but wouldn't be reliable if the words in the other languages have many meanings as well. Sadly, I don't have a solution in mind which doesn't involve splitting up pages or a lot of manual work.
There is no uniform automatic solution to this since wiktionaries have a different way of formatting entries/pages. Doing that automatically for every wiktionary would be a headache for any one person.
I really want this to happen eventually since this would make it so much easier to maintain the wiktionaries, both regarding translations and the presentation of entries. Also, much easier to export the data and utilise it.
-Svavar Kjarrval (talk) 11:44, 30 June 2013 (UTC)
You say "there's no automatic way to connect the other languages to the corresponding English entry", but there is: look for pages with the same title, and create a Wikidata item that contains/links all of them (a table of interwiki links, basically). Thus en.wiktionary.org/wiki/cat gets linked to fr.wiktionary.org/wiki/cat, etc.
Perhaps what you mean is that you want to link en.wiktionary.org/wiki/dog#English to fr.wiktionary.org/wiki/chien#Fran.C3.A7ais: this would indeed be very difficult — it would also be a bad idea (see many of the preceding threads).
Re "I suggest that wiktionaries first change the way entries are recorded into a format in which it's easy for a bot to parse and automatically load the entries into Wikidata in the future. [...] I don't have a solution in mind which doesn't involve splitting up pages or a lot of manual work." I doubt that'll ever happen. There've been proposals to split pages on en.Wikt (such that the English word cat was on one page, e.g. cat/en, while the Irish word cat was on another, e.g. cat/ga), but they've always been rejected. -sche (talk) 20:33, 30 June 2013 (UTC)
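The same-title approach described above (one item per page title, holding a table of interwiki links) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `build_interwiki_table` is a hypothetical name, the input is assumed to be a list of (language code, page title) pairs taken from some dump, and the URLs are built naively without percent-encoding.

```python
from collections import defaultdict

def build_interwiki_table(pages):
    """Group (wiki, title) pairs by exact title: one would-be item per title."""
    table = defaultdict(dict)
    for wiki, title in pages:
        # Naive URL construction; real titles would need percent-encoding.
        table[title][wiki] = f"https://{wiki}.wiktionary.org/wiki/{title}"
    return dict(table)

pages = [("en", "cat"), ("fr", "cat"), ("fr", "chien")]
for title, links in build_interwiki_table(pages).items():
    print(title, sorted(links))
```

Note that this links page to page only: it says nothing about which language sections sit on each page, which is exactly the distinction drawn in the comment above.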

Inflection/inflection classes

One obvious potential for Wikidata when it comes to Wiktionary is the storing of which inflection class a word, e.g. a noun, belongs to. But even with this information stored, we would still need to have the inflected forms produced somewhere. In theory, this could be achieved via a module stored locally at Wikidata and attached as a property to the item page of the individual inflection class. Words with irregular/unique inflection could have their inflected forms stored directly at the word's item. --Njardarlogar (talk) 10:43, 29 June 2013 (UTC)
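A rough sketch of how such a lookup might behave, written in Python rather than as an actual Wikidata module. Everything here is hypothetical: the class identifiers, the dictionary layout of a word item, and the function `inflected_forms` are made up for illustration, and the two English noun classes are deliberately simplistic.

```python
# Hypothetical inflection classes: each maps a lemma to its inflected forms.
INFLECTION_CLASSES = {
    "en-noun-s": lambda lemma: {"singular": lemma, "plural": lemma + "s"},
    "en-noun-es": lambda lemma: {"singular": lemma, "plural": lemma + "es"},
}

def inflected_forms(item):
    """Return stored forms for irregular words, else generate them from the class."""
    if "forms" in item:
        # Irregular/unique inflection: forms are stored directly on the word's item.
        return dict(item["forms"])
    generate = INFLECTION_CLASSES[item["inflection_class"]]
    return generate(item["lemma"])

regular = {"lemma": "cat", "inflection_class": "en-noun-s"}
irregular = {"lemma": "mouse", "forms": {"singular": "mouse", "plural": "mice"}}
print(inflected_forms(regular))
print(inflected_forms(irregular))
```

The design point is that a regular word stores only a class reference, while the generation rule lives in one place per class.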

Alternative proposal

After reading the comments it is quite evident that there are two opinions that are difficult to reconcile. Either word senses have a common ground or they are too embedded in the language to be considered a separate entity. There is also the reality that, to know whether two expressions from two separate languages belong together, a person has to have a certain knowledge of both languages to establish a link. As EncycloPetey very accurately pointed out before, one of the reasons OmegaWiki failed was that it required knowledge of English to know which meaning to attach an expression to. Also, the structure was too rigid and based on the assumption that everyone would be able to know where to place the links or that the meanings exist across all languages. The reality is, however, that everyone is limited in their language skills and English need not be one of them. That is the harsh truth OmegaWiki crashed against, and it should be understood to avoid future pitfalls.

A colleague and I have prepared an alternative proposal for Wiktionary that attempts to address those issues. --Micru (talk) 15:04, 1 July 2013 (UTC)

Answered there. --Denny (talk) 13:37, 19 July 2013 (UTC)

Woah, mule!

Why are we considering an all-or-nothing proposal to replace all Wiktionaries’ core data at once? Most of us Wiktionarians don’t have any experience using Wikidata, or even understand what it is, so how can we possibly decide how to use it?

Let’s give Wiktionaries some simple supplemental data to play with, first. So editors can figure out what Wikidata is and how it works, and Wikidataists can get introduced to Wiktionary.

How about quotations? I want a database to sort our English-language and other quotations on en.Wiktionary. Each quotation is a short quote with an accompanying citation. The text and citation are in their source language, but may also have one or more translations and/or transliterations. Each citation is rich text, but could be broken down into author, year, chapter title, major work title, major work author/editor, volume, number, page(s), etc. Each quotation appears in an entry's accompanying Citations: page, and can also appear under a sense in an entry. Each has one or more occurrences of a highlighted term, or its variations.

Many quotations can be used to attest other terms, too. I want to be able to reuse a citation in another entry, and have it appear with a different term and its variations highlighted there. Michael Z. 2013-07-02 16:51 z
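To make the reuse idea concrete, here is a minimal Python sketch of a quotation stored once and rendered with a different term highlighted per entry. The `Quotation` fields are a small, assumed subset of the citation parts listed above, `highlight` is a hypothetical helper, and the sample sentence and its citation are invented placeholders, not a real attestation.

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class Quotation:
    text: str
    author: str
    year: int
    work_title: str

def highlight(quotation, variants):
    """Wrap whole-word occurrences of any variant in wiki bold markup (''')."""
    if not variants:
        return quotation.text
    # Longest variants first, so "cats" is preferred over "cat" in the alternation.
    longest_first = sorted(variants, key=len, reverse=True)
    pattern = r"\b(" + "|".join(re.escape(v) for v in longest_first) + r")\b"
    return re.sub(pattern, r"'''\1'''", quotation.text, flags=re.IGNORECASE)

# Invented placeholder data, for illustration only.
q = Quotation("The cats chased the dog.", "Example Author", 2013, "Example Work")
print(highlight(q, ["cat", "cats"]))  # as it might appear in the entry for "cat"
print(highlight(q, ["dog", "dogs"]))  # the same quotation reused for "dog"
```

The stored text stays free of markup; highlighting is applied only at display time, which is what allows one citation to serve several entries.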

The plan is to do this step-by-step at the pace each Wiktionary wants to go - just like it is happening on Wikipedia. --Lydia Pintscher (WMDE) (talk) 13:03, 3 July 2013 (UTC)
I’m not familiar with the plans for Wikipedia. What are the steps for this Wiktionary proposal? Michael Z. 2013-07-07 17:25 z
See the Project page for this talk page at Wikidata:Wiktionary. Hope this helps. Filceolaire (talk) 16:11, 10 July 2013 (UTC)

Why a new entity type for senses?

Senses seem very similar to what items are now. The difference I see is that a sense must belong to an expression, and items don't. However, having a separate namespace for senses and distinguishing them from items seems like it will lead to a lot of semantic duplication, which is not good. Why have a sense for "tree" as a "perennial woody plant" and tree (Q10884) which represents the exact same thing? Have you given any thought to integrating senses into the item namespace? Silver hr (talk) 13:43, 12 July 2013 (UTC)

Actually, senses wouldn't be in their own wiki namespace, and they also would not have their own wikipages. They would always be on the same page as their expression, like sections.
Re "Why have a sense for "tree" as a "perennial woody plant""? - because there are some statements you can make only about this sense, such as example sentences, how long it has been in use, etc. These do not belong to the item, but have to be part of the sense of an expression in a given language. --13:37, 19 July 2013 (UTC)

A more pragmatic approach

I think we would be wiser to take a more pragmatic approach. We could begin by just importing interwiki links into Wikidata, and after the duplicate interwiki links have been removed from the Wiktionary articles, something more could be considered. That's something that needs to be done, so why not begin with that? Let's start with something easy and build on that foundation later on. I think that IPA pronunciation examples and audio files can be included in Wikidata entries quite easily. Also, the "see also" links on top of some Wiktionary entries. These would be low-hanging fruit, and probably appreciated by the communities. I'm wary of including definitions, etymologies and translations in Wikidata entries. --Hartz (talk) 10:25, 15 July 2013 (UTC)

Agreed: let's focus on interwikis first. Darkdadaah (talk) 08:47, 17 July 2013 (UTC)
Disagree. I think the proposal on the project page makes sense. It seems quite pragmatic to me. Filceolaire (talk) 13:29, 17 July 2013 (UTC)
Deploying interwikis first does not exclude working on the proposal. It makes sense to focus on the easiest and most consensual part first. Darkdadaah (talk) 12:05, 18 July 2013 (UTC)
This has been discussed above. I agree that the language links situation should be cleared first. But unlike in Wikipedia, the result of that step would be completely different: what does the common topic of a language link set denote? It is not a single word. It is not a concept. It is not an item. What statements can be made about it? How would that be used? This fails to be a useful structuring measure, unlike for the Wikipedias. So we need, from the start, to think about what our units of interest are in Wiktionary. --Denny (talk) 13:40, 19 July 2013 (UTC)