Wikidata talk:Wiktionary/Development/Proposals/2015-05

Thanks

Thanks. I've been repeating for years that we should start as soon as possible with interwiki links, I hope that will be done in a matter of months. --Nemo 06:06, 7 May 2015 (UTC)Reply

I would hope so too, but it needs someone to do it. --Denny (talk) 04:20, 8 May 2015 (UTC)Reply

Location

Would this be based at www.wikidata.org or at data.wiktionary.org ? In any case, Qids would need to be unique. --- Jura 08:13, 7 May 2015 (UTC)Reply

What benefit would there be in a separate domain? I only see (huge) drawbacks. The mockup shows L23773 as example ID, allowing for an easy distinction. --Nemo 12:21, 7 May 2015 (UTC)Reply
L-IDs .. clever, I had missed that part (It's only in the mockup). --- Jura 13:25, 7 May 2015 (UTC)Reply
The same point seems to be discussed on the mailing list. --- Jura 01:22, 8 May 2015 (UTC)Reply

I put a sentence into the write up to make it explicit that Lexemes would have L-Ids. --Denny (talk) 04:21, 8 May 2015 (UTC)Reply

@Denny: There have already been some discussions about IDs in other entity types, see phab:T73996. -- Bene* talk 07:27, 8 May 2015 (UTC)
@Bene*:: That discussion talks about different instances of Wikibase. This proposal assumes that the Lexical knowledge is part of Wikidata. Also, different Entity types can easily have different IDs - properties already have P-IDs. --Denny (talk) 15:11, 8 May 2015 (UTC)Reply
Why not have the lexeme itself as the label for a Lexeme instead of an L-id? For example, See looks good for a Wiktionary element. --Infovarius (talk) 07:27, 12 May 2015 (UTC)
Because they are not unique. Is See the German noun for lake or the Luxembourgish noun for saw? --Denny (talk) 21:24, 12 May 2015 (UTC)Reply
What about using a separate namespace for Wiktionary interwiki links: e.g. wk:foo for en, et, el, fr, ko, it, mg, pl, vi, yi and zh foo?
BTW, in the Czech Wiktionary there is a proposal for splitting every page into subpages by language. JAn Dudík (talk) 20:52, 13 May 2015 (UTC)
Still not unique. Is fly the English noun or the English verb? --Denny (talk) 16:08, 16 May 2015 (UTC)Reply
I thought we are going to describe all "See"'s in one page like there are in Wiktionaries now. Or are you proposing to split different languages? or by homonymy? or by polysemy? --Infovarius (talk) 14:03, 16 May 2015 (UTC)Reply
No, the proposal suggests one Lexeme per page. But in the end, whether or not all Lexemes with the same spelling are grouped in one page in the UI does not really matter that much. It would be different from the current Wiktionaries. But the Wiktionaries would not need to change: they would pull the data from the different Lexemes together in one page if they wanted to, or keep them separate, whatever they prefer. The important thing is that the data is structured in a way that allows both use cases, so that the Wiktionaries retain maximum freedom. --Denny (talk) 16:08, 16 May 2015 (UTC)
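
The uniqueness argument in this thread can be illustrated with a minimal Python sketch. The L-IDs, field names and helper functions below are invented for illustration and are not taken from the proposal; the example words are the ones mentioned above (See in German and Luxembourgish, fly as English noun and verb).

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Lexeme:
    lexeme_id: str  # hypothetical L-ID; real IDs would be assigned by the site
    lemma: str
    language: str
    category: str


# Illustrative records only; the IDs are made up for this sketch.
LEXEMES = [
    Lexeme("L1", "See", "de", "noun"),
    Lexeme("L2", "See", "lb", "noun"),
    Lexeme("L3", "fly", "en", "noun"),
    Lexeme("L4", "fly", "en", "verb"),
]


def by_spelling(lemma):
    """A spelling can match several Lexemes, so it cannot serve as an identifier."""
    return [lex for lex in LEXEMES if lex.lemma == lemma]


def by_id(lexeme_id):
    """An L-ID identifies exactly one Lexeme."""
    matches = [lex for lex in LEXEMES if lex.lexeme_id == lexeme_id]
    if len(matches) != 1:
        raise KeyError(lexeme_id)
    return matches[0]


def group_by_spelling():
    """A client such as a Wiktionary could still group Lexemes into one page per spelling."""
    pages = defaultdict(list)
    for lex in LEXEMES:
        pages[lex.lemma].append(lex.lexeme_id)
    return dict(pages)
```

Grouping by spelling stays a presentation choice on the client side, while the stored data keeps one record per Lexeme.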

Sense entity type

Has a Description (multilingual text) and Statements, but no Label or Sitelinks.

@Denny: I'd rather call this a label as well. Having a description without a label might be confusing, while having a label without a description is mentioned at several places on the page. Just leave out the descriptions everywhere, that will make things much clearer. -- Bene* talk 10:39, 7 May 2015 (UTC)Reply

@Bene*:, I agree, it is an unfortunate name. I actually renamed it now to gloss. I am reluctant to say label, because labels sound like names instead of short texts, and I think for the senses we need short texts. I hope that gloss is better. --Denny (talk) 04:23, 8 May 2015 (UTC)Reply

Form/Sense

Why do Forms & Senses need to be entities? They include Statements, but so do References - so it is not unknown for other things to include statements. But if we look at the code, the basic property of an entity is that it is something externally identifiable - exactly what Forms and Senses lack. If there is to be refactoring anyway, maybe it should be refactoring that allows more statement-containing things like References, which aren't entities? --Smalyshev (WMF) (talk) 05:31, 8 May 2015 (UTC)

@Smalyshev (WMF):: References do not include Statements. References are a list of Snaks. Forms and Senses include Statements as per proposal. --Denny (talk) 15:10, 8 May 2015 (UTC)Reply
Point taken, but still my feeling is the difference between Reference and Sense is less than between Sense and Item. --Smalyshev (WMF) (talk) 20:58, 8 May 2015 (UTC)Reply
But if you want to say something about a Sense, you need to be able to make a full-fledged Statement, including possible references, qualifiers, etc. Think about etymology, if it rhymes, if a sense is deemed archaic, etc. -- all of these things are potentially debatable and might thus require references or qualifiers. --Denny (talk) 16:20, 9 May 2015 (UTC)Reply

Etymology

This proposal does not support etymology. This should be added.--GZWDer (talk) 09:59, 9 May 2015 (UTC)Reply

If you look at the mockups, it does show etymology with statements. Popcorndude (talk) 11:30, 9 May 2015 (UTC)Reply
As Popcorndude said, etymology and many other relations between Lexemes, Senses, Forms, and Items are supported through community created Properties, and thus, Statements, which can have their own references, etc. Just as with Wikidata for the other projects, Wikidata for Wiktionary does not claim to be able to upfront decide on all possible properties, but relegates this decision to the community. --Denny (talk) 16:18, 9 May 2015 (UTC)Reply

Sense

Several senses of different lexemes have the same or a similar meaning. How do we express that? Do we need a DefinedMeaning type like OmegaWiki? --GZWDer (talk) 15:59, 9 May 2015 (UTC)

This is left to be decided by the community. The community can create properties that connect, e.g., Senses with Items, and one of them might be meaning, and have qualifiers, etc. There could also be properties connecting Senses with each other. I think that these abilities allow us to express all use cases. --Denny (talk) 16:27, 9 May 2015 (UTC)
I suppose that a sense should be a link to Q-id. --Infovarius (talk) 07:48, 12 May 2015 (UTC)Reply
A Sense can indeed link to a Q-Id. There could be a Property refers to which connects a Sense with an Item. --Denny (talk) 21:25, 12 May 2015 (UTC)
A Sense (assuming you mean w:en:Word sense) can relate to several Items, and it is often not clear how it shifts from describing one entity to another. A Sense should also relate to other synonyms, and then it is specific word senses of a lexeme that relate, and not the lexemes. Jeblad (talk) 14:31, 14 May 2015 (UTC)
The proposal makes no assumption as to what properties will connect senses and items. This will be completely up to the community. I don't think we can get that right up-front, just as we couldn't know up-front for Wikidata which properties will be needed. This has to be given to the community. So if it will be decided that a Sense can relate to several Items, that is entirely fine. --Denny (talk) 16:11, 16 May 2015 (UTC)Reply
There is also a question of connecting Senses with Senses, as in synonyms, and then we need something SKOS-like (wider and narrower scope). I guess it can be left to the community, but seeing the present state I'm not sure I like it. Jeblad (talk) 23:27, 16 May 2015 (UTC)Reply
Of course a Sense can relate to several lexemes (from different languages, different spellings, synonyms). --Infovarius (talk) 14:05, 16 May 2015 (UTC)Reply
According to the given proposal, a Sense is part of only one single Lexeme. Senses then can be connected to an arbitrary amount of Items, other Senses, or Lexemes through properties, but each Sense functionally depends on a single Lexeme. --Denny (talk) 16:11, 16 May 2015 (UTC)Reply
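
A rough Python sketch of the relationship described here: each Sense belongs to exactly one Lexeme, while Statements connect it to any number of Items or other Senses. The class layout, the ID formats and the property names ("refers to", "synonym") are assumptions made for illustration; the proposal leaves the actual properties to the community.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Statement:
    prop: str    # hypothetical community-defined property name
    target: str  # ID of an Item, Sense or Lexeme


@dataclass
class Sense:
    sense_id: str          # hypothetical ID format
    lexeme_id: str         # exactly one owning Lexeme
    gloss: Dict[str, str]  # multilingual short text, keyed by language code
    statements: List[Statement] = field(default_factory=list)

    def targets(self, prop):
        """Return every target this Sense is connected to through the given property."""
        return [s.target for s in self.statements if s.prop == prop]


# Placeholder IDs; none of these refer to real Wikidata entities.
sense = Sense("L1-S1", "L1", {"en": "lake"})
sense.statements.append(Statement("refers to", "Q_PLACEHOLDER_LAKE"))
sense.statements.append(Statement("refers to", "Q_PLACEHOLDER_POND"))
sense.statements.append(Statement("synonym", "L9-S1"))
```

In this sketch a Sense can point at several Items without that changing which single Lexeme it depends on.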

According to its French Wikipedia article, you could use the seme concept, which is a fundamental linguistic unit of meaning. A bundle of semes is named a sememe, and provides the sense for a lexical unit (lemma). --Psychoslave (talk) 16:36, 28 December 2016 (UTC)

Word form and word sense as pairs?

I can't see how w:en:word form and w:en:word sense can be split into two entities. In my opinion these two are closely related; changing the word form might change the word sense and vice versa. In some cases different word senses can share a word form, and different word forms can share word senses. It does not hold in general that all word forms hold for all word senses in a lexeme. Perhaps there should be some way to identify the pairs that are valid? Jeblad (talk) 14:41, 14 May 2015 (UTC)

Seems like there will be entries for each real lexeme and not for each spelling: "Note that two words in two different languages who happen to be the same (e.g. arm@en and arm@de) are two different lexemes, but also two different words within a language with different grammatical properties are described in two different lexemes (e.g. walk@en as a noun or as a verb)." That makes more sense… Jeblad (talk) 22:20, 14 May 2015 (UTC)Reply
I assume the question is answered. Note that the 13-08 proposal has a better writeup on the data model. The proposal we discuss here concerns more the breakdown of work. --Denny (talk) 16:13, 16 May 2015 (UTC)Reply
Note that the answer on wikidata-l points back to my first posting in this thread and makes me more confused than ever. I am not sure I understand the proposal at all. Perhaps it needs a rewrite of the sections that describe how it is supposed to work. I'll take a look at the previous proposal. Jeblad (talk) 23:13, 16 May 2015 (UTC)
I think I understand what you describe in the examples, you have several "Form" under the "Forms" section and likewise under the "Senses" section, and while there are several listed Forms all must hold for all the listed Senses. If that constraint can't be satisfied (ie the inflection is different) then a new Lexeme must be defined with new Forms and Senses. The problem arises from the plural use of "Forms" and "Senses" and what it implies. Jeblad (talk) 23:52, 16 May 2015 (UTC)Reply
Yes, exactly. That's entirely correct. --Denny (talk) 19:32, 17 May 2015 (UTC)Reply
I agree with this concern. The English verb "to hang" has several senses but not all forms are valid for all senses. "We must, indeed, all hang together or, most assuredly, we shall all hang separately", Ben Franklin (maybe), could be restated after the fact as "We have all hung together so we were not all hanged separately" but not as "We have all hanged together so we have not all hung separately". Does this mean that there has to be two "to hang" lexemes? That seems to go against how "to hang" is specified in dictionaries, including the OED. Peter F. Patel-Schneider (talk) 17:28, 13 September 2016 (UTC)Reply
Interesting case. The OED in fact uses a qualifier ("except in Sense 2") to make this unusual exception - I don't see why we couldn't do the same? In the end, it's up to the community whether we want to have one Lexeme with a bit more qualifiers, or two Lexemes where potentially some data is repeated, but I would claim that the model is expressive enough to allow the community the choice (which is really the goal of the model here). --Denny (talk) 15:04, 14 September 2016 (UTC)Reply
It appears that the model is inexpressive here. As I read it, it requires that the meaning of hang having to do with legally imposed death HAS to be a different lexeme because every meaning of a lexeme has all the same forms, i.e., no exceptions are allowed. If the model allowed for exceptions, then the community could do whatever they wanted for hang, but as things stand now it appears to me that one view is imposed. The model DOES allow different meanings that share the same form to be collapsed into one lexeme. This appears to go against https://en.wikipedia.org/wiki/Lexeme, but the community could just stick to one sense per lexeme if that is what they want. I think that it would be useful to have comments from actual lexicographers that this is what they need instead of us trying to figure out what is needed. Peter F. Patel-Schneider (talk) 20:32, 19 September 2016 (UTC)
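
The single-Lexeme option discussed for "to hang" could look roughly like the Python sketch below, which attaches an "except in Sense N" style exclusion to individual Forms. Whether the model actually permits such a qualifier is exactly what is disputed in this thread, so the `excluded_senses` field, the sense IDs and the function name are purely hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Form:
    representation: str
    feature: str
    # Hypothetical qualifier: senses for which this Form is NOT valid.
    excluded_senses: FrozenSet[str] = frozenset()


# Invented sense IDs: S1 = "to suspend", S2 = "to execute by hanging".
HANG_FORMS = [
    Form("hang", "infinitive"),
    Form("hung", "past participle", frozenset({"S2"})),
    Form("hanged", "past participle", frozenset({"S1"})),
]


def forms_for(forms, sense_id, feature):
    """List the representations valid for one sense and one grammatical feature."""
    return [
        f.representation
        for f in forms
        if f.feature == feature and sense_id not in f.excluded_senses
    ]
```

The alternative raised in the thread, two separate Lexemes with some repeated data, would need no such qualifier.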

I am not sure that my issue is related to current topic but... One word/lexeme/lemma/sense can still have several representations for specific form (e.g. plural form of ru:"человек" can be "люди" or in some cases "человеки"). --Infovarius (talk) 16:46, 14 September 2016 (UTC)Reply

Does OED stand for Oxford English Dictionary in this section? --Psychoslave (talk) 16:43, 28 December 2016 (UTC)

Linkage of lexemes between projects (task 9)

Words seem to be added to the Wiktionary page according to the spelling used for specific word classes and grammatical categories, and those will be listed in "Lexeme" and "Forms", so the "Lexeme" in Wikidata knows the cluster it should attach to and thereby Wiktionary knows how to find all the Lexeme ids. By checking language and word class (type) in Lexeme at Wikidata the page at Wiktionary knows which section should point to which specific Lexeme at Wikidata. So yes it can be automated, but it should probably be checked manually.

Is it only when entries at Wiktionary are listed according to the wrong grammatical category that this would fail, or can it fail in other cases too? It should be possible to detect failing cases. That would probably imply that it is better to create (and fill) data from the client (Wiktionary) side to avoid failing cases, and then it would be easy to attach to the correct Lexeme at Wikidata anyhow. Jeblad (talk) 09:19, 15 May 2015 (UTC)

Yes, I agree. I think that would mostly work. --Denny (talk) 16:16, 16 May 2015 (UTC)Reply
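
The matching step described in this section can be sketched in Python as follows. It assumes each Wiktionary page section is identified by page title, language and word class, and that each Lexeme record lists its forms; the record layout, the IDs and the function name are invented for illustration.

```python
# Hypothetical Lexeme records; the IDs are placeholders.
LEXEMES = [
    {"id": "L10", "language": "en", "word_class": "noun", "forms": {"walk", "walks"}},
    {"id": "L11", "language": "en", "word_class": "verb",
     "forms": {"walk", "walks", "walked", "walking"}},
    {"id": "L12", "language": "de", "word_class": "adjective", "forms": {"arm"}},
]


def match_section(page_title, language, word_class, lexemes):
    """Return the single matching Lexeme ID, or None when the match fails.

    A result of None covers both "no candidate" and "several candidates";
    those are the failing cases that would need a manual check.
    """
    candidates = [
        lex["id"]
        for lex in lexemes
        if lex["language"] == language
        and lex["word_class"] == word_class
        and page_title in lex["forms"]
    ]
    return candidates[0] if len(candidates) == 1 else None
```

A section filed under the wrong grammatical category simply finds no candidate, which makes that failure detectable.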

Generalized (materialized) Form sections

I miss an option to say that a word is similar to some other word. Very often words from a word class use similar inflection rules, and then all entries in "Form" can be generated. I guess what I want is to generate a Form given a language and word class (type), but I would also want to say that the Form should be like a specific word of that class. When I say generate I don't want this to be a one time operation, but more like a background process. You should be able to override part of the generated form, or remove erroneous parts, but all the remaining stuff should be generated. Perhaps materialized from a rule set would be a better explanation. The rule set could be a FSA in many (most?) cases.

I'm not sure, but the statements sections in "Form" could block or make it difficult to generate the "Form" section. If a previous step flags a grammatical category as invalid, then what do you do with statements for that category? Jeblad (talk) 09:21, 15 May 2015 (UTC)Reply

I assume that this can be considerably helped with more intelligence, but I like to outsource intelligence to the community. What works well in one language might work less well in other languages. So I think that most of what you describe can be done by bots or user scripts. This will be particularly useful in languages where you can have more than 100 Forms for a single Lexeme, not just two or three like in English :) --Denny (talk) 16:17, 16 May 2015 (UTC)Reply
I simply don't think most languages (except the big ones) are doable without machine support for base forms, the community simply does not exist. Jeblad (talk) 22:56, 16 May 2015 (UTC)Reply
I agree. I am just saying that the machines don't have to be on the server side, but can be bots run by contributors. At least, for the beginning. Once the site is up and running we can think of adding more intelligence to it. --Denny (talk) 19:27, 17 May 2015 (UTC)Reply
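
As a minimal illustration of the "materialized from a rule set" idea, the Python sketch below regenerates Forms from rules on every call and then applies per-Lexeme overrides and removals. A plain suffix table stands in for the rule set here (it is not an FSA), and the rule table, feature names and function name are assumptions; a bot or user script, as suggested above, could run something similar.

```python
# Toy rule set for a word class: grammatical feature -> suffix.
# This is a stand-in for a real inflection rule set and is not language-accurate in general.
NOUN_RULES = {"singular": "", "plural": "s"}


def materialize(lemma, rules, overrides=None, removed=()):
    """Generate Forms from rules, then apply manual overrides and removals.

    Because the result is recomputed each time, a change to the rules
    propagates to every Lexeme that has no override for that feature.
    """
    forms = {feature: lemma + suffix for feature, suffix in rules.items()}
    forms.update(overrides or {})
    for feature in removed:
        forms.pop(feature, None)
    return forms
```

For example, `materialize("man", NOUN_RULES, overrides={"plural": "men"})` keeps the generated singular and replaces only the irregular plural.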

UX issue

All the spurious "Statements" in the basic example would be quite confusing for the users. The condensed example is better. This is nothing more than a UX matter and should be solvable. Is there really any difference between a lexical property and a "normal" property? Jeblad (talk) 09:22, 15 May 2015 (UTC)

Yeah, good point. One way could be to omit the 'Statements' header when it is empty, and change the [add] button to [add statement].
Regarding the Lexical properties / Grammatical markers, they are different as they are a simple Snak, not a full fledged Statement. Same for Language and Word type / Lexical category. --Denny (talk) 16:22, 16 May 2015 (UTC)Reply
A "statement" in the RDF sense and not the Wikidata sense. I don't like all the invented names in Wikidata, they create confusion. Jeblad (talk) 23:04, 16 May 2015 (UTC)Reply
I thought you were talking about the mockup. You could get rid of the confusion by embracing the Wikidata terminology :) --Denny (talk) 19:28, 17 May 2015 (UTC)Reply
I've written the glossary, but still I'm confused… Jeblad (talk) 23:02, 17 May 2015 (UTC)Reply

Edit on Wiktionary, store on Wikidata

It should be possible to write the "gloss" entries in the Wiktionary page and get them added at Wikidata in parallel. You add a gloss-tag in the subsection on Wiktionary, and when you save, it is stored to Wikidata but is also transcluded back to Wiktionary.

I would really like something like that for template parameters on Wikipedia too! Wasn't this part of the spec for Wikidata? Jeblad (talk) 09:24, 15 May 2015 (UTC)Reply

We had such a gadget in Russian Wikipedia. --Infovarius (talk) 14:14, 16 May 2015 (UTC)Reply
Yes, that would be awesome, but is rather independent of the current proposal. But one day it should be resolved - all the Wikimedia projects should have seamless flows for editing their content and data. --Denny (talk) 16:23, 16 May 2015 (UTC)Reply
Without it you will create two different Wiktionaries with this proposal. It will be a kind of OmegaWiki-light. Jeblad (talk) 23:09, 16 May 2015 (UTC)Reply
This proposal is not about creating a Wiktionary at all. We already have more than a hundred of those. It is about offering a common data backend for the current Wiktionaries. And yes, this necessarily means duplication at first, with the goal to eventually reduce the current duplication. --Denny (talk) 19:30, 17 May 2015 (UTC)

Lexeme clusters

The way the clusters for lexemes are constructed is interesting; this has a lot in common with disambiguation pages. Perhaps we should rethink how disambiguation pages are implemented? The major difference seems to be that on Wiktionary we want a cluster to span homonyms on all projects, while on Wikipedia we only want to span homonyms on a single project. Jeblad (talk) 09:29, 15 May 2015 (UTC)

I agree. I think one day, once the dust has settled, we should analyse features like the disambiguation pages, and see whether they make more sense to be rethought and reimplemented. --Denny (talk) 16:24, 16 May 2015 (UTC)Reply
Perhaps the disambiguation pages are an important testing ground as it will involve a larger community? Wiktionary has a tiny community and I'm not sure they have the capacity to handle such a huge technical change. Jeblad (talk) 23:07, 16 May 2015 (UTC)
The Wiktionaries have a thousand active editors. I wouldn't call that tiny, exactly. Also, it is more pressing than disambiguation pages. --Denny (talk) 19:31, 17 May 2015 (UTC)Reply
They are between 5-10% of the Wikipedia communities. At Norwegian Wiktionary; 6 editors with >5, 2 with >100.[1] At Swedish Wiktionary; 38 editors with >5, 8 editors with >100. [2] At Finnish Wiktionary; 17 editors with >5, 7 editors >100.[3] Jeblad (talk) 23:13, 17 May 2015 (UTC)Reply
Still more than Wikiquoters, and they did just fine with moving to Wikidata. I really don't think that this proposal brings immediate huge change to Wiktionary. It will be rather gradual, and entirely at the speed the communities are comfortable with - just as with Wikipedia. Wikidata has been around for a few years, and the Wikipedias are slowly and step by step exploring how to use Wikidata. There is no need for Wiktionary to be rushed into anything at any of the steps. --Denny (talk) 05:00, 18 May 2015 (UTC)

Question about Task 1

What is its rationale? Why "create a special case tool"? Visite fortuitement prolongée (talk) 20:24, 18 May 2015 (UTC)Reply

Because it reduces the longterm maintenance costs for the community the most. The following list can be created automatically. Any effort spent in maintaining the list, whether on Wiktionary or in Wikidata, is effort not spent on other tasks. --Denny (talk) 03:46, 19 May 2015 (UTC)Reply

[[af:dog]] [[ang:dog]] [[ar:dog]] [[roa-rup:dog]] [[az:dog]] [[zh-min-nan:dog]] [[bg:dog]] [[bs:dog]] [[br:dog]] [[ca:dog]] [[cs:dog]] [[cy:dog]] [[da:dog]] [[de:dog]] [[et:dog]] [[el:dog]] [[en:dog]] [[es:dog]] [[eo:dog]] [[eu:dog]] [[fa:dog]] [[fr:dog]] [[fy:dog]] [[ga:dog]] [[gv:dog]] [[gd:dog]] [[gl:dog]] [[ko:dog]] [[ha:dog]] [[hy:dog]] [[hi:dog]] [[hr:dog]] [[io:dog]] [[id:dog]] [[zu:dog]] [[is:dog]] [[it:dog]] [[kn:dog]] [[ka:dog]] [[kk:dog]] [[kw:dog]] [[sw:dog]] [[ku:dog]] [[ky:dog]] [[lo:dog]] [[la:dog]] [[lv:dog]] [[lb:dog]] [[lt:dog]] [[li:dog]] [[hu:dog]] [[mk:dog]] [[mg:dog]] [[ml:dog]] [[mt:dog]] [[mn:dog]] [[my:dog]] [[nah:dog]] [[na:dog]] [[fj:dog]] [[nl:dog]] [[ja:dog]] [[no:dog]] [[oc:dog]] [[om:dog]] [[uz:dog]] [[km:dog]] [[nds:dog]] [[pl:dog]] [[pt:dog]] [[ro:dog]] [[ru:dog]] [[sm:dog]] [[sa:dog]] [[sq:dog]] [[si:dog]] [[simple:dog]] [[sk:dog]] [[sl:dog]] [[sr:dog]] [[sh:dog]] [[fi:dog]] [[sv:dog]] [[tl:dog]] [[ta:dog]] [[te:dog]] [[th:dog]] [[ti:dog]] [[tg:dog]] [[chr:dog]] [[tr:dog]] [[uk:dog]] [[ug:dog]] [[vec:dog]] [[vi:dog]] [[vo:dog]] [[wo:dog]] [[ts:dog]] [[zh:dog]]
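
A list like the one above could be produced automatically along the lines of this Python sketch. It assumes the tool already knows which language editions have a page with the given title; how that is determined is not covered here, and the function name is hypothetical.

```python
def interwiki_links(title, editions_with_page):
    """Build interwiki link markup for every edition that has a page with this exact title.

    The order of the input language codes is preserved.
    """
    return " ".join(f"[[{code}:{title}]]" for code in editions_with_page)
```

For instance, `interwiki_links("dog", ["af", "ang", "ar"])` yields the first three links of the list above.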

Thank you for your reply. Visite fortuitement prolongée (talk) 20:10, 19 May 2015 (UTC)Reply

Question about lexemes

I was looking at the proposal and made some tests on a Calc file to check if I understood properly. ;) I chose ancora as a lexeme, because I knew this is a tricky word in Italian since it can be a noun, an adverb or two different inflected forms of the same verb. So, given that we'll have a "lexeme 1" for the noun and a "lexeme 2" for the adverb, my question is: will we have only a "lexeme 3" for the two possible verb forms or will we have a "lexeme 3" plus a "lexeme 4", so two separate lexemes for the two different verb forms?

Another question: let's think of a "lexeme 5", that would be ancorare, that is the verb for "lexeme 3" and "lexeme 4". How do we connect those lexemes to "lexeme 5"?

Thanks in advance. --Sannita - not just another it.wiki sysop 13:57, 19 May 2015 (UTC)Reply

@Sannita: I am not sure I follow. Correct me if I am wrong: ancorare is the verb. That should be one Lexeme, and if I understand you correctly, it would have two Forms ancora. Furthermore it would be one Lexeme for the noun and one Lexeme for the adverb ancora. So we wouldn't have five Lexemes, but three. Does this make sense? --Denny (talk) 15:52, 4 June 2015 (UTC)Reply

More questions about Task 1

@Denny: thanks for your new proposal. It doesn't seem to add much more, but it is a nice presentation. While I like the "automagical" aspects of the component of Task 1, from the usability point of view it is a disaster, because it hides too much from the user and it departs from the standard procedure used for other MediaWiki sites to add interwiki links in Wikidata. Wiktionary pages are just like disambiguation pages that have all the senses together, so why not treat them as such? For instance, take wikt:de:Gesundheit: it doesn't make sense to interlink it using health (Q12147), but it does make sense to use Gesundheit (Q5553730) as a location to store the interwiki links. The same applies to all other Wiktionary pages. Therefore it is just a matter of checking if an item is a disambiguation page, and if that is true, then displaying a new group of interwiki links for Wiktionary. Less development time and clearer to the users! OTOH, instead of so many proposals, why not go sequentially, letting the users decide what they want? --Micru (talk) 21:48, 2 June 2015 (UTC)

I don't agree that Wiktionary pages are just like disambiguation pages. Wiktionary entries directly contain content. Disambiguation pages are Wikipedia internal pages designed to redirect the user. Also, linked Wiktionary entries share titles, and disambiguation pages don't. For example, w:en:France (disambiguation) is linked to w:he:פרנס (פירושונים), but Wiktionary has separate entries for wikt:France and wikt:פרנס. --Yair rand (talk) 21:58, 2 June 2015 (UTC)Reply
@Yair rand: The idea is that Wiktionary pages aggregate content; conceptually that is very similar to a disambiguation page. If an item were created for each Wiktionary page, then w:en:France (disambiguation) would not need to be linked with w:he:פרנס (פירושונים); each one could go with its own item, or not. Who cares, that is up to each Wikipedia community to decide, and whichever way it goes I don't see that as a problem. The thing is that
  • Wiktionary could be supported right now with minor development effort
  • The community should have a say in the solution taken
  • If an automatic component were implemented, the function would be different from that of all other WM sites here in Wikidata, for no good reason
Why not launch a survey about this topic instead of having to accept any proposal just because it sounds technologically better? I don't agree that technical solutions are better than social solutions, like agreeing on using items for storing interwiki links. --Micru (talk) 21:22, 3 June 2015 (UTC)
I agree with Yair rand that Wiktionary pages (the pages about words in Wiktionary's main namespace) are not disambiguation pages, and should not be classified as Wikimedia disambiguation page (Q4167410), but as dictionary page in Wiktionary (Q20088089) (if they were linked from Wikidata). Visite fortuitement prolongée (talk) 20:33, 13 June 2015 (UTC)
@Visite fortuitement prolongée, Yair rand: I agree with both of you, dictionary page in Wiktionary (Q20088089) is even better! Denny, do you think we could start a survey to ask frequent users of wiktionary which option do they prefer?--Micru (talk) 09:10, 19 June 2015 (UTC)Reply
I'm not sure where you got the idea that I support dictionary page in Wiktionary (Q20088089). My opinion is that the current Task 1 proposal of not having items for Wiktionary entries is preferable. --Yair rand (talk) 12:21, 21 June 2015 (UTC)Reply
Wikipedia disambiguation pages might or might not be linked to other Wikipedia disambiguation pages that do not have exactly the same name. This makes them unsuitable for merging with Task 1.
As I said above, I don't see the advantage of maintaining the list manually, either in Wikidata or on the Wiktionaries. The automatic aspect of Task 1 seems to do a better job of minimizing the needed work.
With regards to your claim of it being a usability disaster: can you describe a user story or scenario where the user would be confused? --Denny (talk) 16:01, 4 June 2015 (UTC)Reply
There is a standard procedure for linking pages from different wikimedia sites in wikidata, and you want to change it for wiktionary. In my opinion it is not my task to describe if it is confusing or not, it is just different, and that already justifies to ask the community if they want this method instead of the classical one. Do you agree on that or do you prefer to push your solution just because it seems better to you?--Micru (talk) 20:22, 4 June 2015 (UTC)Reply
Oh, but that was never the question! Asking the communities before implementing this is obviously integral, just as we did when introducing the Wikidata-based method to replace the previous language links on Wikipedia. This is not the question here. The question here is to come up with the best possible proposal before going to the communities in a wider way. So you are saying the proposal is already good enough to discuss it with the communities? How does one ask the Wiktionary communities best? --Denny (talk) 20:56, 4 June 2015 (UTC)Reply
What I am saying is that there are different approaches, and some seem better to you, and other better to other people, so it is good to have diversity of opinions, but one should not get stuck into finding the *ultimate* proposal that will be perfect. There have been many proposals already and yes, I think it is time to go to the communities, explain the options, and ask for feedback in a way that can provide better clues than just a conversation with the insiders here (as interesting as it might be! :)). In my opinion the best should be for the community to gather itself or to start a similar process as with Wikisource. Since this also requires a lot of involvement from the wikidata community, what about starting two new user groups? One for wikidata and another one for wiktionary. At least that provides a central venue for the international community to gather and to start enabling some governance mechanisms that allow for community choice.--Micru (talk) 21:46, 4 June 2015 (UTC)Reply

Glossary

Wrote a glossary for Wiktionary to try to figure out if this proposal makes sense, and it seems like it somehow does. There are some places where I don't follow the proposal, where it is not quite clear or where I have misunderstood something. In some places we will need hierarchical Form and Sense, and in other places it is not clear how interlinear gloss (examples) should be handled. I would prefer to have a block with lexical statements (aka a subnamespace), and then put the examples there, instead of using one supernamespace. It is also not clear at all how word stems with complex affixes should be handled, or the affixes themselves, and how they relate to other languages. Jeblad (talk) 19:38, 6 June 2015 (UTC)

On the "gloss"; this name for the description is confusing as it has a dual meaning. It is in no way a showstopper, just use a better name for the text string in Sense. I think I would prefer "annotation".
On the "examples"; those are what is called interlinear gloss. Each one would be a statement in a container, where each one of them would express the same phrase in some alternative way. This cluster of statements would use a specific Form, as a way to express a specific Sense. There could be languages where the word root (stem) from Forms does not align well with the words used in the interlinear gloss. Jeblad (talk) 20:08, 6 June 2015 (UTC)

redundancy

Please explain how labels do not become redundant. Thanks, GerardM (talk) 20:20, 10 June 2015 (UTC)Reply

Which labels? Visite fortuitement prolongée (talk) 20:33, 10 June 2015 (UTC)Reply
It seems to me that all labels are repeated in the Wiktionary extension. I do not understand why have them in the first place.. GerardM (talk) 05:09, 11 June 2015 (UTC)Reply
Which part are you referring to? The simple automated Wiktionary interwikis for entries, according to this proposal (task 1), would not have any labels, nor items for that matter. Interwikis for other namespaces (task 2) need to have labels, since the pages are titled differently in each language, and would therefore not be redundant. Labels for lexemes (task 3, one label per lexeme as opposed to one for every language) are necessary to indicate what the word is, to link the word to the spelling. --Yair rand (talk) 05:36, 11 June 2015 (UTC)
The Wiktionary articles refer to a spelling. When a spelling is linked to an item, there are multiple spellings that fit one item. At the same time there is often a need to document the spelling according to usage and standardisation. This linking from a Wiktionary section to a Wikipedia article (i.e. a Wikidata item) is done quite often. I do not understand the need for "what the word is" because that is exactly what our items do. Hence the redundancy. Thanks, GerardM (talk) 10:46, 11 June 2015 (UTC)
I think you're misunderstanding the proposal. There will not be any Wikidata items linked to Wiktionary entries, which contain all words with a particular spelling in all languages. --Yair rand (talk) 10:52, 11 June 2015 (UTC)Reply
All words have meaning, often multiple meanings, and as such there is no difference from what we currently know in our items. All of them may be part of a dictionary, and as such there is in my opinion a big redundancy. GerardM (talk) 15:36, 11 June 2015 (UTC)Reply
There is a lot of redundancy in this model, and that was why I asked to which degree we want to duplicate Wiktionary inside Wikidata. In the first few tasks this isn't visible, but when we add Sense it is going to be very visible. That is why I think we need tools to edit once and add it twice. Jeblad (talk) 01:41, 21 June 2015 (UTC)Reply
Do you think that repeating many properties in each of 100+ Wiktionaries is less redundancy? --Infovarius (talk) 15:41, 22 June 2015 (UTC)Reply
Omegawiki was designed to do away with the redundancy that is Wiktionary. It can do most of what Wikidata can do at this time. Thanks, GerardM 00:31, 6 July 2015 (UTC)Reply

Wiktionaries on Wikipedias

There are a few cases where a language's Wiktionary is stored in its Wikipedia in its own namespace instead of being a separate project, e.g. als:Wort:Haus (formerly at wikt:als:Haus). It seems like handling these Wiktionary-in-Wikipedia pages would be better done via Wiktionary support rather than the existing Wikipedia support. Is that something that has been considered?

A particularly confusing case is Walloon. It would appear that in some cases, they have namespaced Wikipedia pages (e.g. wa:Motî:Bouzvå - wikt:wa:Bouzvå is blank) and in others they have Wiktionary pages (e.g. wikt:wa:ristitchî) and the Wikipedia pages have Wikidata items. For now they've just been marked as instance of (P31) dictionary page in Wikipedia (Q20088085) (pinging @Multichill:) but a better solution would be nice.

- Nikki (talk) 02:18, 19 June 2015 (UTC)Reply

No clue how to solve this. I just had to mark these pages with something to track them. This wasn't really intended as a permanent solution. Multichill (talk) 16:44, 19 June 2015 (UTC)Reply

The evolution of the words and their siblings

In my opinion, a section collecting the corresponding words in other languages and dialects would also be important. It matters for local versions (i.e. dialects) and also from an etymological point of view. In the future it could also be important to create a tree showing the evolution of a word: its parents and children, but also its siblings. An example would be studying the evolution of the Latin word "testa" in the Neo-Latin languages and the semantic differences that arose along the way. --Ilario (talk) 11:52, 6 July 2015 (UTC)Reply

That all seems interesting to me, as long as these evolutions are stated as theories. For example, a single lemma might have several etymological theories, some of them once popular but now largely considered most likely wrong. --Psychoslave (talk) 17:08, 28 December 2016 (UTC)Reply

Regional versions

Several languages have regional versions of the same word. An example is this language [4], which has several regional versions to consider. Pledari, which is the main dictionary of this language, works with these regional versions [5]. --Ilario (talk) 12:00, 6 July 2015 (UTC)Reply

How are collocations, idioms, phrasal verbs, etc. going to be handled?

First of all, I'm really excited about this proposal. Thank you for moving this forward. I have a couple of questions:

  • Where and how would the phrasal verb "break down", collocation "heavy drinker", and idiom "rain cats and dogs" be recorded?
  • What about proverbs?
  • How would translations of the above cases be handled?

Thanks, --Bmansurov (WMF) (talk) 10:59, 10 July 2015 (UTC)Reply

As I understand it, phrasal verbs, collocations, idioms and proverbs would all be lexemes and translations of them would be handled like translations of any other lexeme. e.g. following the example at Wikidata:Wiktionary/Development/Proposals/2013-08#Example_entry for wikt:en:rain cats and dogs:
  • (lexeme) W12350
  • (lemma) rain cats and dogs
  • (language) English
  • (lexical category) verb
  • (form) F12351
    • (representation) rained cats and dogs
    • (lexical property) simple past
  • (form) F12352
    • (representation) raining cats and dogs
    • (lexical property) present participle
  • (sense) S12353
    • (gloss) (en) to rain heavily
    • (statement) translation -> S12354 (which is a sense linked to lexeme W12356 which has lemma "in Strömen regnen", language German)
    • (statement) translation -> S12357 (which is a sense linked to lexeme W12358 which has lemma "aus allen Kannen gießen", language German)
    • (statement) translation -> S12359 (which is a sense linked to lexeme W12360 which has lemma "llover a cántaros", language Spanish)
- Nikki (talk) 12:43, 10 July 2015 (UTC)Reply
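The structure listed above can be sketched as plain data classes. This is only an illustration of the lexeme/form/sense nesting, not the actual Wikibase data model; the class names, field names and the `sense_owner` lookup table are assumptions made for the sketch, while the IDs and strings are taken from the example.

```python
from dataclasses import dataclass, field

@dataclass
class Form:
    id: str
    representation: str
    lexical_property: str

@dataclass
class Sense:
    id: str
    gloss: dict                                        # language code -> gloss text
    translations: list = field(default_factory=list)   # IDs of translated senses

@dataclass
class Lexeme:
    id: str
    lemma: str
    language: str
    lexical_category: str
    forms: list = field(default_factory=list)
    senses: list = field(default_factory=list)

# The idiom is an ordinary lexeme with its own forms and senses.
rain = Lexeme(
    id="W12350",
    lemma="rain cats and dogs",
    language="English",
    lexical_category="verb",
    forms=[
        Form("F12351", "rained cats and dogs", "simple past"),
        Form("F12352", "raining cats and dogs", "present participle"),
    ],
    senses=[Sense("S12353", {"en": "to rain heavily"}, ["S12354", "S12357", "S12359"])],
)

# Hypothetical lookup: sense ID -> (lemma, language) of the lexeme owning that sense.
sense_owner = {
    "S12354": ("in Strömen regnen", "German"),
    "S12357": ("aus allen Kannen gießen", "German"),
    "S12359": ("llover a cántaros", "Spanish"),
}

def translations(lexeme, owners):
    """Resolve every translation statement of a lexeme to (lemma, language)."""
    return [owners[target] for sense in lexeme.senses for target in sense.translations]
```

Translations are attached to senses rather than to the lexeme, so a lexeme with several senses could point each one at a different target.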

Triconsonantal roots

The proposal seems great, at least so far as it relates to languages in Standard Average European (Q471271). I'm curious how it would handle other languages. For example, how would it handle triconsonantal roots? Take the Arabic noun كتاب (kitāb, meaning "book"). Under this proposal, it would be given a specific lexeme, if I understand it correctly. Other words from the same root ك ت ب (k-t-b, relating to "writing") would have separate lexemes, right? So the verb كَتَبَ (kataba, "to write") would have a distinct lexeme. How would then the lexemes be related to each other in the database? Would that be through some property on the lexemes themselves (similar to their etymology)? A common thing for Arabic dictionaries is to order words by their triconsonantal root. How would this root be represented in Wikidata? As an item? A lexeme? How would Wikidata list (and link to) the different forms of a specific root? Gabbe (talk) 07:38, 11 July 2015 (UTC)Reply

... and to clarify, this is not merely a question about language-specific norms regarding collation. All languages have "quirks" when it comes to what is considered the proper way to order words lexicographically. For example, in Swedish "ä" and "ö" are considered separate letters at the end of the alphabet, whereas in German they are not. "W" is considered a distinct letter from "V" in English, whereas in Swedish it is not. And so on. I completely understand that such language-specific conventions will not be an issue here. Gabbe (talk) 08:01, 11 July 2015 (UTC)Reply
I think this is very similar to the question about #Etymology above. This isn't my proposal and other people might disagree with me, but the most likely way we would store the data in my opinion would be to have roots as lexemes (with their own lexical category, e.g. "triconsonantal root") and a "triconsonantal root" property which can be added to normal lexemes, so something like the following for the root:
  • (lexeme) W234
  • (lemma) ك ت ب
  • (language) Arabic
  • (lexical category) triconsonantal root
  • (sense) S235
    • (gloss) (en) related to writing
and something like the following for a word based on that root:
  • (lexeme) W236
  • (lemma) كِتَاب
  • (language) Arabic
  • (lexical category) noun
  • (sense) S237
    • (gloss) (en) book
  • (statement) triconsonantal root -> W234
and another word:
  • (lexeme) W238
  • (lemma) كَتَبَ
  • (language) Arabic
  • (lexical category) verb
  • (sense) S239
    • (gloss) (en) to write
  • (statement) triconsonantal root -> W234
I'm not familiar enough with Arabic verbs to be sure where I would put the information about which form it is (to make it possible to make a list like the one on wikt:en:ك ت ب). I can think of multiple options:
  • Have "Arabic verb form I" as a subclass of "verb" and set the lexical category to "Arabic verb form I" instead of "verb"
  • Have a statement linking from the root to the derived word
  • Have a qualifier on the word's triconsonantal root statement
If the form is only information about how it was derived from the root (i.e. interesting from an etymology point of view but otherwise not needed to be able to use the verb), the first option doesn't make much sense. If you do need to know the form of the verb to be able to use it, the last option doesn't make much sense.
- Nikki (talk) 13:08, 14 July 2015 (UTC)Reply
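As a rough sketch of how a "triconsonantal root" statement could be used to build a root-ordered list like the one in Arabic dictionaries: the dictionary layout and the `root` key below are assumptions for illustration, and only the IDs, lemmas and categories come from the example above.

```python
from collections import defaultdict

# Lexemes from the example; "root" stands in for the proposed
# "triconsonantal root" statement (None for the root lexeme itself).
lexemes = {
    "W234": {"lemma": "ك ت ب", "category": "triconsonantal root", "root": None},
    "W236": {"lemma": "كِتَاب", "category": "noun", "root": "W234"},
    "W238": {"lemma": "كَتَبَ", "category": "verb", "root": "W234"},
}

def words_by_root(lexemes):
    """Invert the root statements: root lexeme ID -> IDs of lexemes derived from it."""
    index = defaultdict(list)
    for lexeme_id, data in lexemes.items():
        if data["root"] is not None:
            index[data["root"]].append(lexeme_id)
    return dict(index)

by_root = words_by_root(lexemes)
```

The link is stored only on the derived words, and the list for a root is computed by inverting it, so no second statement on the root has to be kept in sync.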
In short, this means that "lexemes" (in Wikidata's sense of the term) would include morphemes like "anti-" and "-ify" as well? For example, a compound word like "blueberry" should obviously be able to have statements for its lexeme entry saying that it is composed of the lexemes "blue" and "berry". But in addition, a word like "indeterminableness" should be able to link to "in-", "-able" and "-ness", all of which would have their distinct lexemes in Wikidata (even if a linguist would call them morphemes rather than lexemes)? Gabbe (talk) 12:38, 15 July 2015 (UTC)Reply
I would expect so, yes. They do already have their own Wiktionary pages which are structured the same as "normal" words and even my Oxford English Dictionary includes "anti-", "in-", "-able" and "-ness" as headwords (but not "-ify" for some reason), so I don't see any reason why they wouldn't be considered valid lexemes. - Nikki (talk) 17:08, 15 July 2015 (UTC)Reply

Same lexeme, different etymology

How would this proposal handle words within one language that have the same grammatical properties and are pronounced the same, but have different etymologies? For example, the English noun ball in the sense "round object" has a different origin from the word in the sense "dancing party". Would the English noun "ball" therefore be two lexemes or one? If they were one, how would the etymology relate to the different senses? Would it be possible to have the etymology statements under the senses? Gabbe (talk) 14:39, 13 July 2015 (UTC)Reply

In the proposal, lexemes, forms and senses all have statements, so there would be no technical reason that I'm aware of that etymologies couldn't be attached to senses. I think it would make more sense to have separate lexemes though, in the same way that we would have separate lexemes for different parts of speech. Even wikt:en:ball splits by etymology before splitting by part of speech. - Nikki (talk) 17:43, 13 July 2015 (UTC)Reply
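A minimal sketch of the "separate lexemes" option, assuming hypothetical IDs (`L1`, `L2`) and placeholder etymology labels: two lexemes share lemma, language and lexical category, and are told apart only by their etymology.

```python
from collections import defaultdict

# Hypothetical entries; the IDs and the "etymology" values are placeholders.
lexemes = [
    {"id": "L1", "lemma": "ball", "language": "English", "category": "noun",
     "etymology": "etymology 1", "gloss": "round object"},
    {"id": "L2", "lemma": "ball", "language": "English", "category": "noun",
     "etymology": "etymology 2", "gloss": "dancing party"},
    {"id": "L3", "lemma": "berry", "language": "English", "category": "noun",
     "etymology": "etymology 1", "gloss": "small fruit"},
]

def homonyms(lexemes):
    """Group lexeme IDs that share lemma, language and category."""
    groups = defaultdict(list)
    for lexeme in lexemes:
        key = (lexeme["lemma"], lexeme["language"], lexeme["category"])
        groups[key].append(lexeme["id"])
    # Only keys with more than one lexeme are homonyms.
    return {key: ids for key, ids in groups.items() if len(ids) > 1}

groups = homonyms(lexemes)
```

Under this option the lemma alone is not a unique key, which is the same reason given earlier on this page for using opaque IDs instead of the word itself.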

Start?

I know there is, and will still be, a lot to discuss, but isn't it possible to start with the first point? There are no more open questions and the task is quite simple. If we just keep discussing (which is of course not a bad thing), this proposal may end like the others. But if we start implementing it, more eyes will be on this project and more people should participate. We could therefore implement the first point while discussing the following points, which will become clearer as the number of participants grows. We would "just" need a detailed plan for the extension and someone to program it. What do you think? --Impériale (talk) 21:34, 17 July 2015 (UTC)Reply

The point of the 2015-05 proposal is to present a plan where each step can be done independently (but still sequentially). Now it is up to the Wikidata development team to allow the first two steps ("Task 1" and "Task 2"). See Wikidata:Contact the development team/Archive/2015/06#Wiktionary interwiki links, Wikidata:Development plan#Access for remaining sister projects and Wikidata:Development plan#Wiktionary support. After all, the first goal of Wikidata is to centralize the interwiki links, isn't it? Visite fortuitement prolongée (talk) 19:39, 18 July 2015 (UTC)Reply
See phab:T987 for phase 0 (interwiki links) and phab:T988 for phase 1 (storing the actual data). - Nikki (talk) 21:58, 18 July 2015 (UTC)Reply
The first point can indeed be started if anyone wants to pick it up. Right now the main team still needs to focus on a few more important things like unit support, watchlist integration, the UI redesign and starting with the support for Commons. If someone wants to take it on, please let me know so we can quickly talk it through. --Lydia Pintscher (WMDE) (talk) 20:19, 20 July 2015 (UTC)Reply
@Impériale:, I agree with Lydia here. Anyone can start with the first task right away! I think it should be a reasonably easy task to do. --Denny (talk) 18:34, 21 July 2015 (UTC)Reply
Thank you all for your answers! If we can start right now, we need someone who is able and willing to program this tool. But what is the best way to find this person? Should we for example ask on pages like w:Wikipedia:Bot requests or w:Wikipedia:Requested_templates? --Impériale (talk) 18:11, 23 July 2015 (UTC)Reply

Hi all, I'm new here and I'll go through all the material you guys posted asap (Jura1 pointed me here). In the meantime, I wanted to share with you a link to my Wikimedia IEG (Individual Engagement Grant) proposal here. I am proposing "A graphical and interactive etymology dictionary based on Wiktionary". A demo is available here (note that it works best on a desktop) - see a screenshot in the picture below. I am using Dbnary as a framework to extract data from Wiktionary (the DBpedia Wiktionary extraction framework is not maintained currently). I am modifying that framework to parse etymologies and extract etymological relationships between words to create an etymological tree. The output of the framework is an RDF file (see here an extract). As regards etymologies, I'm solving the "Same lexeme, different etymology" problem by attaching etymologies to senses in the same way as Nikki suggested. I just realized my project is doing exactly what Ilario had in mind. Please share your comments on my grant page. Also, I would like to have suggestions regarding how to technically implement my visualization (a d3.js interactive visualization) in Wikimedia (using a gadget, using a collapsible window in Wiktionary, or somewhere else in Wikimedia?). I am also curious to know whether my RDF output can be easily exported to Wikidata. I am really excited about this project and I'm working full time on it. I am not sure whether it overlaps with what you guys have in mind. In any case, it would be great to get feedback/suggestions on my grant proposal page or here if you have any.

 
The etymological tree of the English word 'butter' as visualized by etytree

-- Epantaleo

Translations needed

Hi, we all deplore a certain lack of involvement from the Wiktionary communities, but I think it is mainly because everything here is in English and a lot of people do not want to or cannot read English. So, to be positive and to get this process moving again, please help to translate the different proposals into your own language! Noé (talk) 10:49, 29 June 2016 (UTC)Reply

+ @Denny: if it is possible, can you send me an editable version of the mockup figure to translate it? Thank you in advance. Noé (talk) 10:51, 29 June 2016 (UTC)Reply

It was done in a Google Doc. --Denny (talk) 17:05, 29 June 2016 (UTC)Reply

Some problems and possible solutions

This proposal has two problems:

  • Some pages (such as wikt:test) have a "see also" section; there is no place to store such information, as it is not related to a specific language.
  • Glosses are part of Senses, which functionally depend on Lexemes. So if there are words for apple in 200 languages, we would have to add 200*200=40000 Glosses, and every Gloss is duplicated 200 times. This is redundant.

I propose to add two new entity types:


The Word entity type has zero or more (usually one) Supersitelinks (language and site independent), zero or more (usually zero) ordinary Sitelinks (to Wiktionaries, Wikipedias or Incubator), and Statements, but no Label or Description. Supersitelinks are ordered (though this is usually meaningless, as there is usually only one Supersitelink). On every Wiktionary page, the interlanguage link to a given language will be the first of:

  • Page linked by ordinary Sitelink, if exists;
  • Page named Supersitelink1 in target Wiktionary, if exists;
  • Page named Supersitelink2 in target Wiktionary, if exists;
  • ...

If the page is not connected to a word item, interlanguage links will be automatically generated for existing pages with the same title in other Wiktionaries, so a word item is not required, but recommended. The first Supersitelink will be displayed as the label; if there are no Supersitelinks, the ordinary Sitelink in certain languages will be displayed as the label, and a fallback sometimes applies.
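The lookup order described above can be sketched roughly as follows. This is only an illustration of the proposed rule; the function name, the dictionary layout of a word entity, and the `existing_pages` lookup are all assumptions, not part of any existing software.

```python
def resolve_interlanguage_link(word_entity, target_wiki, existing_pages, current_title):
    """Return the page title to link to in target_wiki, or None if there is none.

    word_entity: None for an unconnected page, otherwise a dict with
                 "sitelinks" (wiki -> title) and "supersitelinks" (ordered titles).
    existing_pages: wiki -> set of page titles that exist there (hypothetical lookup).
    """
    pages = existing_pages.get(target_wiki, set())
    if word_entity is None:
        # Unconnected page: fall back to a page with the same title, if it exists.
        return current_title if current_title in pages else None
    # 1. An ordinary sitelink wins if present.
    sitelink = word_entity["sitelinks"].get(target_wiki)
    if sitelink is not None:
        return sitelink
    # 2. Otherwise take the first supersitelink that exists as a page on the target wiki.
    for title in word_entity["supersitelinks"]:
        if title in pages:
            return title
    return None

existing_pages = {"frwiktionary": {"test", "Test"}, "dewiktionary": {"Test"}}
entity = {"supersitelinks": ["test", "Test"], "sitelinks": {}}

fr_link = resolve_interlanguage_link(entity, "frwiktionary", existing_pages, "test")
de_link = resolve_interlanguage_link(entity, "dewiktionary", existing_pages, "test")
```

In this made-up data, the French Wiktionary has a page for the first Supersitelink, while the German one only has the second, so the ordering matters only where several candidate titles exist.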

Example:


The Sense group (aka Thesaurus) entity type has a Gloss (multilingual text) and Statements, but no Label or Sitelinks. It also has links to several Senses. A Sense group may have more than one link to Senses in one language, but only one is primary; one Sense belongs to only one Sense group. The label of the Lexeme of the primary Sense will be displayed as the label. If there are no links to a Sense in one language, the label of the Lexeme of the primary Sense of another Sense group linked by some specific property may be displayed as a fallback. Senses will not have a Gloss; it will be stored in the related Sense group.

Example: both "body" and "person" have one sense belonging to Sense group T111 "individual human being", and the sense of "body" has statements saying it is archaic or informal. "human" and "human being" have one sense belonging to Sense group T222 "a large sapient, bipedal primate, with notably less hair than others of that order, of the species Homo sapiens"; T111 and T222 are linked to each other by a specific property to indicate fallback.

Probably Sense group is not the best way to solve the redundancy problem; does someone have a better way?--GZWDer (talk) 14:58, 22 July 2016 (UTC)Reply
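To make the redundancy argument concrete, here is a small sketch of the gloss counts in both models, plus a lookup through a Sense group. The sense IDs `S1` and `S2` and all function names are hypothetical; `T111` and its gloss are taken from the example above.

```python
def glosses_without_groups(n_languages):
    # One sense per language, each carrying a gloss in every language.
    return n_languages * n_languages

def glosses_with_groups(n_languages):
    # One shared multilingual gloss stored on the Sense group.
    return n_languages

# Sense group from the example; "S1" (body) and "S2" (person) are made-up IDs.
sense_groups = {
    "T111": {"gloss": {"en": "individual human being"}, "senses": ["S1", "S2"]},
}

# One Sense belongs to only one Sense group, so this inversion is unambiguous.
sense_to_group = {
    sense_id: group_id
    for group_id, group in sense_groups.items()
    for sense_id in group["senses"]
}

def gloss_for(sense_id, language):
    """Look up a sense's gloss via its group; None if no gloss in that language."""
    group = sense_groups[sense_to_group[sense_id]]
    return group["gloss"].get(language)
```

For 200 languages this gives 40000 glosses in the per-sense model against 200 in the grouped one, which is the saving the proposal is after.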

Hi, GZWDer! I read through your suggestions, and I am not 100% sure I understood them, but here's to what I understood:

  • your first suggestion - where you talk about supersitelinks etc. - is about your first point, i.e. it is about how to solve the problem of the See also section, right? I completely agree that the See also section is currently not covered at all. But I would argue that's OK - the goal of Wiktionary in Wikidata is not to replace Wiktionary entirely, but merely to offer some structured data the Wiktionaries can use, if they want to. Wikidata does not have to cover all the information in Wiktionary, and also probably never will. Having said that, yes, an extension such as the one you suggest is very much an option which can be built on top of the current model - i.e. it is not precluded by the current proposal in the least. Once the current proposal is in place, I would suggest assessing the situation and then considering which further parts of the data are particularly amenable to being supported by Wikidata. If the See also section is one such part, perfect! Right now though, I don't really see it - doesn't the See also section often include manually written notes? It is not just a list of links, no?
  • regarding the second point, yes, this has been a frequent and contentious topic. I also started out thinking that Senses should be unified over the languages - or grouped as you suggest - but it was mostly Daniel Kinzler and reading a lot on the topic which convinced me otherwise. Yes, the current system has quite a bit of potential for redundancy. But this seems to be easier to handle than the potential for discussions and arguments if we had this extra entity type. Also, would the current item space fulfill the role you have in mind for sense groups? These, in turn, would indeed be linkable from the Senses.

I hope I understood your proposal sufficiently for my comments to make sense to you. Cheers! --Denny (talk) 17:08, 22 July 2016 (UTC)Reply

Triples ?

How would the new entities look in terms of triples? Looking at Wikidata:Wiktionary/Development/Proposals/2015-05#Example_entry, it seems that the main difference to standard items/property entities are "F" and "S" statement groups (or would that be statements with qualifiers?).
--- Jura 13:41, 13 September 2016 (UTC)Reply

Existing dictionaries

I have noticed on both Wikisource and Commons that PDFs and DjVus of multilingual dictionaries just linger for eternity there, not being proofread or even included in any context. Now that we are starting to move toward Wiktionary support on Wikidata, we should start taking the time to integrate these existing dictionaries, a task which, of course, ought to become considerably easier in the near future.

I have very likely missed some dictionaries in the following list, composed from browsing s:Special:IndexPages on each Wikisource and c:Category:Dictionaries by language. Also there are quite a few more on the Internet Archive, so these should also be taken into account. These are sorted first by the language of the dictionary itself (i.e. in which definitions are given), then by the language whose words are contained in the dictionary:

Bengali

Catalan

Czech

English

There are far, far more in this category.

French

There are far, far more in this category.

German

There are far, far more in this category.

Italian

Japanese (honorable mention)

Latin

There are far, far more in this category.

Persian

Portuguese

Spanish

Tamil

There are several others which may be found using this search, most of which deal with technical vocabulary.

I am certainly hoping to get cracking at the Bengali dictionaries when this Wikidata/Wiktionary proposal goes live. Mahir256 (talk) 22:36, 14 September 2016 (UTC)Reply

I agree that it's important to find ways to import public domain dictionaries. I find it hard to believe that imports would be easier with Wikidata, though. Nemo 07:31, 14 October 2016 (UTC)Reply

CJKV

As a contributor to both the English and Vietnamese Wiktionaries, I have a couple questions about how non-Western languages such as Chinese and Vietnamese would fit within this proposed schema. Apologies if I'm misunderstanding the proposal; the terms it uses are relatively unfamiliar to me.

  • Is it possible to express that a sense is translated into another language as a phrase that isn't a single lexeme in its own right? For example, "cơm" in Vietnamese means "boiled rice" in English, which would never get an entry at the English Wiktionary. Conversely, "simplicity" in English would mean "tính đơn giản" in Vietnamese; "tính đơn giản" would never get its own entry at the Vietnamese Wiktionary. Currently, we deal with these discrepancies by for instance linking to both "tính" and "đơn giản" in the "Translations" section of "simplicity". Under the proposed model, "tính đơn giản" would be a form of "đơn giản", but how would we express that this form is a translation of "simplicity"?
  • The Han character entries (Chinese, Japanese, Korean, Vietnamese, etc.) are the English Wiktionary's most rigidly structured entries, especially the glyph-centric "Translingual" section, yet it seems that the proposed model will severely limit the degree to which Han entries will be able to take advantage of Wikidata. Using "餅" as an example, "", "もち", and "bánh" are all representations of "餅" that have absolutely nothing to do with each other (and they're all lexemes in their own right). Or would "餅" get one lexeme per language, with the current "Translingual" section duplicated among them? Would the current "(written) Chinese" section be broken up too, with separate Mandarin, Cantonese, Wu, etc. lexemes that differ only in pronunciation?

    The Han entries, unified as they are today, are already a maintenance headache. (There are tens of thousands of entries in which a language-specific definition is missing, warranting Template:defn, but in most cases the reader can fall back to the "Translingual" section.) Duplicating each entry several times over will exacerbate the problem, to say nothing of the challenge of keeping these entries in sync. Maybe what's needed is a separate, top-level grapheme data type that can have a to-many relationship to lemma-less lexemes as well as lexicographical properties of its own. Just about any entry in wikt:Category:Characters by script or wikt:Category:Translingual symbols would be appropriate content for a grapheme, including "A" and "💯".

 – Minh Nguyễn 💬 07:27, 25 September 2016 (UTC)Reply

I have also wondered about how the translingual sections will be handled. They're not limited to CJKV and I don't think it would make sense to duplicate them for each language - it's information about the character itself, not about an actual word written using that character (similar to the letter "a" and the English word "a"). I think it's more likely that we would have a separate object for the translingual section (e.g. perhaps a lexeme with a special language) and a statement linking lexemes for individual languages to the characters.
I suspect we will end up with separate lexemes for each of the Chinese varieties, because while they share a lot, they also have lots of differences which we also need to be able to store in a machine readable way (storing it all as a single lexeme might be possible but it would also increase the complexity - avoiding duplication is not always ideal). To start with, I think that Wikidata's support for Wiktionary will be quite annoying to use for every language, things will be more split up than they are on Wiktionary, there will be missing features, etc. The important thing to remember: The user interface can be improved. Even if we do store them as separate lexemes, that doesn't mean that someone can't come up with tools (or even a whole interface) for editing Chinese that makes it much easier.
- Nikki (talk) 23:39, 25 September 2016 (UTC)Reply
@Mxn: Thanks Minh Nguyễn! Indeed, I do not know how well the model can deal with logograms instead of alphabet-based entries. I have to admit that my understanding of logogram based scripts and their related languages is rather murky - and it can very much be that the data model will have issues with capturing these. Suggestions on how to improve that would be very welcome.
I don't think it is an issue for the whole of CJKV - Korean lexemes in Hangul or Vietnamese entries in the Latin script should have no problems. But where the language and the script actually start diverging heavily, I am not sure how to capture that best - i.e. would there be several entries associated with the same logogram / character for each of the languages using that given character? Maybe. Maybe not.
My assumption is that the best way to move forward is to try the current model, as it is expected to work with almost all other languages, and then, once we have collected a bit of experience, to see what the pain points are for Chinese and other languages and figure out how to fix them. My hope is that, since it is already in a structured form, most of the work can be reused and transferred to a future better version of the data model that can deal better with logogram-based scripts and their languages. --Denny (talk) 22:28, 24 October 2016 (UTC)Reply

Phrases / nonsense words

In the proposal, it seems that Wiktionary entries' WD items would be entirely separate from Wikipedia entries' WD items. Would it make sense to link existing items describing phrases and other articles which aren't about concepts (e.g. Throw under the bus (Q7798526)) with their Wiktionary counterparts (e.g. wikt:throw under the bus)? Jc86035 (talk) 07:20, 29 September 2016 (UTC)Reply

Yes that will be possible with statements. --Lydia Pintscher (WMDE) (talk) 07:25, 29 September 2016 (UTC)Reply
@Lydia Pintscher (WMDE): After reading the proposal more thoroughly, would it be possible for lexemes to have interwiki links to those articles (as in, replacing the existing Wikidata Qitem)? Jc86035 (talk) 10:15, 1 October 2016 (UTC)Reply
Currently we don't plan to have sitelinks on these. And I think it is fine to leave them as items. After all they are concepts that people want to make statements about (as opposed to describing their grammar etc.) --Lydia Pintscher (WMDE) (talk) 11:51, 2 October 2016 (UTC)Reply

Unwritten languages

The model defines a representation as the actual string value realizing a given form. The IPA can provide an indication of possible pronunciations. What about links to concrete voice recordings? What about words for which we would only have such a recording, and possibly an IPA transcription, but no canonical written representation because the speakers just don't have one? As far as I know, that's the case for most languages out there, isn't it? --Psychoslave (talk) 19:03, 25 December 2016 (UTC)Reply

The lexeme structure

The lexeme structure in the proposal binds a "lexical category" to the lexeme. I'm not sure that making this mandatory can fit every language out there. Speakers/skilled grammarians of the languages concerned should be consulted, but I think that it would not fit languages where this "lexical category" depends solely on the context in which the lexeme occurs. I think that Mandarin is like that, but I am not certain. A change of "lexical category" might (or might not) change the "form" of the lexeme, but at least conceptually you might consider potential languages where it wouldn't: the distinction is also possible through syntax, or any other contextual indication. For these reasons, I think that "lexical category" should be optional.

Also, to make sure that I understand the boundaries of this model well, could you tell me how it would work in the case of an Esperanto "root" which can switch to any "lexical category" by changing the suffix? My guess is that you would have one lexeme for each supported lexical category in the language. But let's go with a concrete example, like "lok/", which carries the idea of location, so you have:

  • substantive loko (location) and its nominative/accusative and singular/plural declensions lokon/lokoj/lokojn
  • verb loki (to locate) and its various tenses loku/lokus/lokis/lokas/lokos and participles lokita/lokinta/lokata/lokanta/…
  • adjective loka (local)
  • adverb loke (locally)

--Psychoslave (talk) 19:03, 25 December 2016 (UTC)Reply
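A minimal sketch of the "one lexeme per lexical category" guess, using only the endings listed above (participles are left out, and the table and function name are assumptions for illustration):

```python
# Endings per lexical category, limited to the forms mentioned in the comment.
ENDINGS = {
    "substantive": ["o", "on", "oj", "ojn"],
    "verb": ["i", "u", "us", "is", "as", "os"],
    "adjective": ["a"],
    "adverb": ["e"],
}

def lexemes_for_root(root):
    """Return one entry per lexical category, each with its generated forms.

    The first form of each list is the citation form (lemma) of that lexeme.
    """
    return {
        category: [root + ending for ending in endings]
        for category, endings in ENDINGS.items()
    }

forms = lexemes_for_root("lok")
```

Because the forms are fully predictable from the root and category, such lexemes could in principle be generated rather than entered by hand, although the proposal itself does not say so.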

Example of matching and unmatching usages

You probably want to add example sentences using the current lexeme. If other lexemes share some forms with this one, you might also want to store, in some way, example sentences for cases where only one of them could be selected in a meaningful way, as well as ambiguous cases where at least two could be selected. --Psychoslave (talk) 19:03, 25 December 2016 (UTC)Reply

Gloss

I'm a little surprised by this term. I would rather have expected en:Seme or w:en:Sememe here. But that's just a suggestion. :) --Psychoslave (talk) 19:03, 25 December 2016 (UTC)Reply
