Wikidata talk:Lexicographical data/Archive/2017/08

Latest comment: 6 years ago by Nikki in topic Syllabification?
This page is an archive. Please do not modify it. Use the current page, even to continue an old discussion.

Dealing with linguistic variation

From the development plan:

Note that two words in two different languages who happen to be the same (e.g. arm@en and arm@de) are two different lexemes, but also two different words within a language with different grammatical properties are described in two different lexemes (e.g. walk@en as a noun or as a verb).

This approach is going to crash — not necessarily quite "crash and burn", but definitely going to run into a whole slew of difficulties — as soon as the project expands beyond a small set of widely recognized standard languages. The distinction between "language" and "dialect" is, after all, arbitrary (cf. en:wikipedia:Dialect#Dialect or language), and in many cases this means that whether a given word belongs to a given language cannot be pinned down exactly either.

Two simple examples (one synchronic, one diachronic):

  • Suppose that wiktionary 1 has opted to work with unified Serbo-Croatian (only recognizing its components as dialects), while wiktionary 2 has opted to encode Serbian, Bosnian and Croatian separately (and not consider "Serbo-Croatian" a language). Will the project be able to e.g. create interwiki links from Serbo-Croatian entries on wiktionary 1 into the corresponding Serbian, Bosnian and/or Croatian entries on wiktionary 2? Will it be able to abstain from creating unnecessary interwiki links in the case of words that do differ between S, B and C?
  • Suppose that wiktionary 3 has opted to consider Middle French a distinct language variety from both Old French and Modern French, while wiktionary 4 has opted to only recognize the last two, considering the "Middle French" data either a part of Old French or a part of Modern French (depending on some chronological cutoff). Will Wikidata create duplicate entries for the words considered Middle French by Wiktionary 3, once as Middle French and then again as either Old French or Modern French? If yes, will such duplicate entries be somehow linked with each other, or will manual editor work have to be spent on keeping them in sync?

--Tropylium (talk) 00:40, 18 September 2016 (UTC)

One thing that really helped with getting the model for Wikidata right was to ground the discussions in actual data - i.e. not to speak too much about 'suppositions', but to actually point to examples of how this is currently handled in the projects. That's a very nice feature of the project :)

So, do the different Wiktionaries indeed make different decisions about how to split a word into different languages? Indeed they do: English has a Serbo-Croatian entry for noga, Italian has one for Croatian and one for Serbian. Is that a problem? Not really. You assume that the interwiki links are being derived from the Lexemes - but this is actually not the case. The interwiki links connect the two mentioned pages anyway, no matter whether they are split into two languages or one. In Wiktionary - unlike Wikipedia - the interwiki links and the entities in Wikidata are independent. So that would not be a problem.

In fact, whether we have two, one, or even three or more languages for this case in Wikidata is a discussion that can be decided by the community. It will make some use cases in the local Wiktionaries easier if these decisions are compatible, but none of the Wiktionaries will need to follow the decisions made here, and I am hopeful that a smart decision will be made here that takes into account the requirements of the projects. There seem to be a number of options:
  • having Lexemes in Serbian, in Croatian, and in Serbo-Croatian, with each Wiktionary choosing to use the data from any of these, as per local policy;
  • having Lexemes only in the first two, using the data from one of them to display some Serbo-Croatian forms in the Wiktionaries where that fits, and not using the data from Wikidata at all in the Wiktionaries where it doesn't fit (it is actually likely that the Wiktionaries with the strongest opinions on this question would not want to simply use Wikidata for that anyway, so there is no problem in the first place);
  • having Lexemes usually only in Serbo-Croatian, and only in Serbian and Croatian where they actually differ (and then using in the local Wiktionaries the data from the Serbo-Croatian Lexeme but displaying it as either Croatian or Serbian or both, or again not using the data at all);
  • etc.
All of these have obvious advantages and disadvantages, and I expect that many of these solutions will have proponents.

Regarding the dialects, I'd like to see the examples you have in mind that might create problems with the data model. I would also love to see how often these problems occur, but I can see that it might be hard to come by those numbers.

Thank you for your questions! These kinds of questions and examples are exactly what we need in order to evaluate the data model, and to see if it needs to be improved or redone (in fact, the proposal has been receiving questions for several years now, and has been refined accordingly, which is why I am personally rather confident in it). Ideally any deal breakers should be found before the implementation is under way :) --Denny (talk) 04:56, 18 September 2016 (UTC)

Yes. en.WT collapses Serbo-Croatian into a single language, while others, for example German Wiktionary, do not. The issue is that wikt:sr:нога is also wikt:hr:noga, and I believe there's also an old church script as well. - Amgine (talk) 06:08, 18 September 2016 (UTC)
Yes, the Glagolitic script, used mostly for the Old Church Slavonic language. Currently mostly uncovered in Wiktionary - I couldn't find many Glagolitic entries :( But what is the issue with the example you cite? The two pages do not interwiki-link to each other, they do not even link to each other as translations (they are woefully incomplete!). Instead the Serbian page has an entry for the Ukrainian word, which points to its Serbian translation being wikt:sr:noga, instead of the entry on the very same page! This is actually a beautiful example of how incomplete and inconsistent the data currently is, and how it could benefit from a central repository from which the data can be accessed, if so desired. --Denny (talk) 15:02, 19 September 2016 (UTC)
  • A risk I think I am seeing with having lexemes variously in either Serbian, Croatian or Serbo-Croatian is that if information about variation is not available to whoever creates the Wikidata entry, you end up with entries that claim to be "Serbo-Croatian" while they actually should be just "Serbian". Then later on, someone may create the correct only-Serbian entry anyway (without being able to see that a Serbo-Croatian entry already exists?), producing unnecessary duplication of data.
    Worse yet, if you were to create always both Serbo-Croatian and individual standard lexemes, it seems to me that you would eventually end up with literally hundreds of thousands of "parallel" entries, which seem like they would all have to be kept in sync by manual effort with respect to any other data such as inflection? (I.e. rather more likely: that would fail to be kept in sync.)
  • In theory cases like Serbo-Croatian could be dealt with by using a slightly enriched data model: a word like noga could have both a feature "macrolanguage: Serbo-Croatian" and the features "variety: Serbian, variety: Croatian". In case of actual differences (say, the word for 'salt': Croatian sol ~ Serbian so), the variety features would be set only as applicable. This is also more or less how Wiktionaries that unify certain written language standards into one treat cases of variation: two lemmas are created, and then each marked with comments like "(Croatia)" and put in categories like Croatian Serbo-Croatian. Which is in turn no different from a Wiktionary adding a note such that tyre is a British English spelling, or that gasoline is an American English word. (Of course, since everyone agrees that these are a part of a single English language, it is most likely not necessary for Wikidata to start encoding dialect information of this kind — that's a rabbit hole that goes much too deep.)
    I do not know if this approach will run into problems with regard to the encoding of other languages. Would this require that a lexeme in, say, English be also set a macrolanguage feature and a variety feature (even if not filled in)? And would it be possible to enforce that e.g. Serbian lexemes be placed under the macrolang/variety setup, instead of simply as Language:Serbian?
  • To think out loud a bit more, something similar would incidentally also be possible for languages that have multiple writing systems. E.g. Cyrillic Serbian нога could of course be further encoded as a distinct lexeme of its own, but in the case of languages with multiple relatively stable orthographies, nothing would prevent creating a data model where this is a part of the same lexeme as the Latin-alphabet noga. (Given the Serbo-Croatian issue, perhaps the Latin orthography could be kept as the lemma, versus the Cyrillic orthography treated as an alternate feature such as "Lemma (Cyrillic)"; but I imagine people who actively work with Serbian might have more detailed opinions on this.) On the other hand, since these lemmas in fact are distinct (and though they share features like part-of-speech, they also have e.g. distinct sets of forms), perhaps Wikidata would want to keep them separate anyway.
    A lot about this depends on if you conceive of "a lexeme" as fundamentally a written entity (≈ an entry created on a Wiktionary) or a spoken entity (≈ a word actually used in a language).
  • Issues with dialects can come up e.g. when there is an agreement that a set of language varieties should be divided into 2+ languages, but there is dispute over where to draw the boundary. For example, there is no completely clear-cut boundary between High German (the main basis for standard German) and Low German dialects. Creating Wikidata entries for words particular to borderline dialects such as Central German would seem to require either a more detailed data model, or putting a foot down on whether they are to be treated as Low German or High German. Presumably much of this can be adjusted later if necessary, though? Language/dialect treatment is often under discussion anyway (cf. en:wikt:Wiktionary:Language treatment). If it turns out that a Wiktionary wants to unify some language varieties that have previously been treated separately — say, Northern Foo and Swampy Foo into a single Foo language — then surely Wikidata editors could run a semi-automatic script or bot job to turn lexemes set as Language:Northern Foo into being set as e.g. Macrolanguage:Foo, Variety:Northern Foo?
--Tropylium (talk) 02:31, 20 September 2016 (UTC)
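For concreteness, the enriched model suggested in the bullets above (a macrolanguage feature with per-variety markers, per-script lemma representations, and a semi-automatic retagging job) could be sketched roughly as follows. This is a toy illustration under stated assumptions: all class and field names here are hypothetical, not the actual Wikidata lexeme model.

```python
from dataclasses import dataclass

# Hypothetical sketch of the "macrolanguage + variety" idea discussed
# above; the structure is illustrative, not the real Wikidata model.
@dataclass
class Lexeme:
    language: str                       # macrolanguage, e.g. "Serbo-Croatian"
    representations: dict               # script code -> written lemma
    varieties: frozenset = frozenset()  # empty = valid in every variety

# noga is shared across the standards, with two written forms:
noga = Lexeme("Serbo-Croatian", {"Latn": "noga", "Cyrl": "нога"})

# sol/so differ, so each lexeme is restricted to its variety:
sol = Lexeme("Serbo-Croatian", {"Latn": "sol"}, frozenset({"Croatian"}))
so = Lexeme("Serbo-Croatian", {"Latn": "so", "Cyrl": "со"}, frozenset({"Serbian"}))

def merge_into_macrolanguage(lexeme: Lexeme, macrolanguage: str) -> Lexeme:
    """The semi-automatic retagging job described above: turn a lexeme
    tagged Language:Northern Foo into Macrolanguage:Foo, Variety:Northern Foo."""
    return Lexeme(macrolanguage,
                  dict(lexeme.representations),
                  lexeme.varieties | {lexeme.language})
```

A bot job merging Northern Foo into Foo would then just map `merge_into_macrolanguage` over the affected lexemes; whether an empty `varieties` set should mean "all varieties" or merely "unknown" is exactly the kind of modelling decision this thread is about.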
About the dialect/language discussion, I think that the best way to solve it is to manage each language/dialect separately. This is what is done on the French Wiktionary. It means that if we have a source for one High German dialect, then we should manage it as a separate language. Then, if we want to know the link with other dialects/languages, we can get it from the item (see for example Alsatian (Q8786) and its subclass of (P279) statements). Using this language tree, I think the Wiktionaries can do what they want. Pamputt (talk) 05:10, 20 September 2016 (UTC)
How about creating mockups for `noga` that describe how it's supposed to be displayed? That might bring more clarity in what interwiki links there are supposed to be. ChristianKl (talk) 15:13, 20 September 2016 (UTC)

Having read the concern expressed by @Tropylium:, I would like to point out that it is exactly what I want to address in this comment on the model. --13:54, 27 August 2017 (UTC)

Improvement for editors

Hi @Lea Lacroix (WMDE), Denny, Lydia Pintscher (WMDE):

I am now quite convinced that a lexical database can be cool for machine processing and for the plasticity of presenting information, but the improvements for editors are still very vague to me, and I feel this needs clarification to engage more Wiktionarians in this process. So, how will the process of contribution change with this evolution? Say, for a basic contributor who may like diverse aspects of contributing to the Wiktionaries, such as adding synonyms and related words, quotations and pictures to illustrate a meaning, audio recordings, new definitions, or more accurate translations with slang or a local dialectal word. If the changes for an editor are not for the better-easier-fancier, and changes are only planned to be cool for machine processing, please clarify that, and prioritize consulting linguists and electronic lexicographers rather than contributors. Noé (talk) 10:24, 9 November 2016 (UTC)

I have recently done a few edits on the German Wiktionary, in order to get a feel for how easy the current environment is. And I think there is some room for improvement. It is obvious that current editors are used to the current system, and also that the current system has some awesome features that have been fine-tuned and perfected over the years. In my understanding, there is no intention of replacing all of this with Wikidata. The workflows that currently work will remain, and can be used by the community as long as they like.
In Wikidata I expect a new community to be established. I hope there will be significant overlap with the existing Wiktionary communities, but in the end, there will be new processes, workflows, and community interactions. Each of the Wiktionary communities and subcommunities will decide by themselves how to deal with the respective Wikidata communities and their content and workflows.
To make this more explicit, I would be (very positively) surprised if the French Wiktionary community working on the French lexicon were to start basing their workflows around Wikidata any time soon. This is a community I expect to come to Wikidata rather late. Not because I think they are in any sense technophobe or neophobe or whatever, no, just because they are actually a highly successful and functioning project and community. The first communities I expect to come over and embrace data from Wikidata are communities dealing with a language with many speakers in a Wiktionary not native to that language - e.g. the German lexicon in the French Wiktionary, or the Spanish lexicon in the Dutch Wiktionary, or similar.
So my expectations are thus that a whole new community will grow on Wikidata - just as it did for the ontological content of Wikidata, the Q-Space. And for them I expect that it will be easier to create Lexemes in Wikidata than it would be to create them in the current Wiktionaries, due to the interface and data model.
Wikidata does not plan to cannibalize the existing communities, but rather to seed and create new ones. I'd define success if there is a significant number of new Wikimedia contributors for lexical data, just as there is now a significant number of new Wikimedia contributors working on the ontological part of the knowledge base. I hope this makes sense.
Obviously, this is my understanding - I am not speaking for the Wikidata team in Berlin. --Denny (talk) 19:54, 9 November 2016 (UTC)
Thank you for your detailed answer. I think this is very interesting. Are you OK with me translating and reposting this on the French Wiktionary? I think the Wiktionaries could tackle this problem internally by translating help pages and creating a friendly environment for newcomers who do not speak the main language of a given wiki. But that's time-consuming. If Wikidata wants to handle it, that's cool. I feel we are at the edge of a long path of help-page writing/translation. Kind of scary. Noé (talk) 21:03, 9 November 2016 (UTC)
Yes, I am. I am sorry that my French is not good enough to participate in the conversation directly. --Denny (talk) 01:38, 10 November 2016 (UTC)
That's OK, my English is bad, language learning is tough :) I feel I was quite harsh in my question, though. I apologize. Still, I feel your answer is that it will not change the way we contribute very much. Maybe that is a reason for the low enthusiasm for your plan. Plus, we are afraid that it makes contributions harder to manage by dividing the content between two different places of storage. Noé (talk) 09:56, 10 November 2016 (UTC)
No need to apologize. I understand that communication across the language barrier can be hard and fraught with misunderstanding. Let's continue to keep an open mind and assume good faith :)
Indeed, it will not change much about the way you currently contribute, if you don't want it to. Or you can change your workflows and rely completely on Wikidata. It is entirely up to the communities, for each of them to decide how much and when they feel comfortable using Wikidata.
Regarding the duplication, whereas it is correct that we are adding another place of storage, I still think it will reduce replication, and make things easier overall. Why? Well, you are saying that there will be two places of storage in the future. That is incorrect. There will be 174 places of storage in the future. Today there are 173 places. But in the future, we can get rid of some of them thanks to the centralized new place. Even if the French Wiktionary community won't use the data from Wikidata immediately for their French lexicon, the Croatian Wiktionary community might be using it for their French lexicon, and the Korean Wiktionary community might be using it for their French lexicon, and suddenly they cooperate in a way they could not before. So, no, it is not an increase from 1 to 2, but from 173 to 174, with the potential to decrease organically afterwards. --Denny (talk) 17:11, 10 November 2016 (UTC)
@Denny: I translated this conversation into French, and one sentence caught my attention. I quote you, highlighting mine: "Today there are 173 places. But in the future, we can get rid of some of them thanks to the centralized new place." I am not sure whether you are talking about closing some small Wiktionaries here. I am still quite confused about whether you imagine Wiktionary becoming, in the future (5 or 10 years from now), mostly or only 1. a reader interface or 2. an editing interface for data stored in Wikidata.
Well, in my understanding, you imagine people contributing lexical data to Wikidata without knowing the Wiktionary workflows. I expect debates to pop up again on lexicographical decisions already discussed for hours in the Wiktionaries and fixed locally in convention pages. If some contributions are to be made directly in Wikidata, how can Wiktionarians who became - by contributing - experts in collaborative lexicography communicate with this new community to share their knowledge of lexicography and the wise decisions taken after hours of debate? To give a specific example, I spent five months last year building a solid convention for how to deal with pronunciation in Wiktionary, providing three fields instead of two. Before, it was a merely phonological form (something wrong by definition), a specific pronunciation linked with a recording, and eventually one strict phonological form. The new convention is: a consensual non-marked pronunciation inside \\, a strict pronunciation linked with sound inside [], and possibly several phonological propositions yielded by different authors, inside //. It is much more neutral and we are no longer destroying the definition of "phonological". OK, so, if someone adds a French pronunciation in Wikidata, how can we be sure they respect this convention? And if the English Wiktionary wants to reuse French pronunciations, how can they deal with this convention? I hope this makes sense, despite the different questions raised here. Noé (talk) 10:02, 17 November 2016 (UTC)
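The three-field pronunciation convention described above could be represented as structured data roughly like this. This is only a sketch under stated assumptions: the class, field names, and the example audio filename are hypothetical illustrations, not an actual Wikidata or Wiktionary schema.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical sketch of the French Wiktionary's three pronunciation
# fields described above; names are illustrative, not a real schema.
@dataclass
class Pronunciation:
    consensual: str                    # non-marked form, displayed \...\
    strict: Optional[str] = None       # narrow form [...], tied to a recording
    audio_file: Optional[str] = None   # hypothetical filename for illustration
    phonological: dict = field(default_factory=dict)  # author -> /form/

# e.g. for the French word "prononciation":
p = Pronunciation(
    consensual="pʁɔ.nɔ̃.sja.sjɔ̃",
    strict="pʁɔnɔ̃sjasjɔ̃",
    audio_file="Fr-prononciation.ogg",
    phonological={"author A": "pʁɔnɔ̃sjasjɔ̃"},
)
```

Storing the three fields separately (rather than guessing the field type from delimiter characters) would let a consumer such as the English Wiktionary decide which of the three to display, which is the reuse question raised above.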
@Noé: Denny has made clear on this talk page and elsewhere that Wikidata is forking the Wiktionary project, intends to poach Wiktionary contributors to the Wikidata edition, and to supplant/replace Wiktionary except as a presentation layer. There is merit to his intention; mediawiki is not a good tool for a linguistics dictionary. There is a history of such projects and intentions, and imo they have harmed both Wiktionary and the forking projects, but the communities now are not the same as they were in the past and maybe Denny's effort will have a different outcome. - Amgine (talk) 15:34, 17 November 2016 (UTC)

(line break)

Hi Noé! Thanks for asking for a clarification - I definitely did not mean to close any Wiktionaries. I actually think they will thrive more thanks to Wikidata. But that is beside the point at hand. When I said "get rid of some of them" I meant that in some of those 173 places we can get rid of some data within specific Wiktionaries. Some of the Wiktionaries can decide locally to replace their local data for some entries with data from Wikidata. That is what we can get rid of.

Wiktionary is much more than just the structured data. And the Wiktionaries would still continue to fulfill these additional functions, and could indeed focus on them and thus become more effective.

Regarding lexicographical decisions being reached on a specific Wiktionary: this is multifaceted, but let's take a look at the simplest cases: the fact that you put so much effort into creating this decision speaks for a strong community. As I said before, my expectation is that Wikidata will be particularly useful for the sub-projects of Wiktionary which do not have such a strong community. These projects will often not have such a strong commitment to decisions like these. I would consider it often to be smart to follow the lead of the most active community with such decisions.

In cases where Wikidata and a local Wiktionary project have made incompatible lexicographic decisions, that would be unfortunate. I don't think these cases will happen often, considering the highly flexible data model of Wikidata that allows for a diversity of points of view, but it is possible. So if this happens, the particular community with that decision will struggle to use Wikidata as their database. But don't forget, they don't have to. And considering that they are likely a strong community, there would also be fewer advantages from Wikidata in the first place.

In the particular case of the pronunciations, it is possible to model them in Wikidata. OK. But the question that you have is whether the Wikidata community will agree with the decisions from the French Wiktionary, which is unpredictable. But that's the same problem that Wikidata already has with the Wikipedias today. And the solution is: well, Wikidata is not a panacea. It can only help with a certain number of use cases, and it doesn't aim at more than that. The problem you described will become slightly more tractable than it is today, but it won't be automatically solved. It will cost considerable social investment by all players.

Regarding Amgine's point, there is no plan to fork Wiktionary, there is to the best of my knowledge no plan to poach Wiktionarians, and there is no plan to replace Wiktionary on all but the presentation layer. If you look at the suggested work plan, none of these things are planned. If there is, please point me to it. --Denny (talk) 21:12, 17 November 2016 (UTC)

My only rebuttal to the above:
  • In my experience, there would not be concern raised about copyright license incompatibility from the Wikidata community if there were not already competing identical content. I do not believe anyone can honestly say a Wikidata representation of a dictionary does not meet the mission "to create open content dictionaries in every language." You cannot be creating a dictionary without forking the mission.
  • Although there may be no formal plan, above you describe targeting Wiktionarian communities for recruitment, including those you "…expect to come to Wikidata rather late," and the hope that there will be overlap between the Wikidata and Wiktionary contributors. This directly contradicts what you said; you do have expectations and hopes of poaching Wiktionarians, and have suggested whom to approach.
  • Again, although there is no formal plan for Wiktionary to become a presentation layer it is the logical result of your project, assuming it is successful.
I am not making value judgements about the above, since in principle I agree with the goal of structuring Wiktionary data. But legerdemain to hide what you are doing makes dupes of your volunteers; the project should be supported by informed peers. - Amgine (talk) 22:47, 17 November 2016 (UTC)

I don't describe targeting Wiktionary communities for recruitment. I describe Wiktionary communities deciding to use the data from Wikidata or not, and I expect that very active subcommunities will come later to Wikidata. I admit to wish and hope that there will be Wiktionary community members who will enjoy contributing to Wikidata in the future, but there are no plans at all for recruiting actively in these communities (to the best of my knowledge).

Let's take Wikispecies as a comparable case study. By the size of its active community, Wikispecies, if it were a Wiktionary, would rank about #5 out of 173 Wiktionaries, roughly the size of the Polish edition and almost the size of the German edition of Wiktionary. The French edition is about two or almost three times as large as Wikispecies. Wikispecies could also be considered a very data-heavy project, and thus provides a perfect use case for how Wiktionary may develop. But the impact on the size of the community - well, I don't see any. And as far as I can tell, there was no recruiting from the community, etc. I really don't see why this would be different for Wiktionary.

There is plenty of content in Wiktionary which is not amenable to Wikidata - e.g. the appendices in the English Wiktionary, pages on grammar, etc. I think the richness of Wiktionary is being underestimated in the conversation here - Wiktionary is much more than what some structured data can capture. And all of that would remain a unique feature of the Wiktionary projects. --Denny (talk) 03:31, 19 November 2016 (UTC)

First, @Denny:, I would like to tell you how much I appreciate the work that you, and all of the Wikidata team involved in this Wiktionary project, do. So, please take the following criticism with that in mind. My feedback aims to address concerns, and possibly to provide suggestions to resolve them.

These [sub-projects of Wiktionary which do not have […] a strong community] projects will often not have such a strong commitment to decisions like these. I would consider it often to be smart to follow the lead of the most active community with such decisions.

I disagree with what I understand as being stated here. When a direction is being steered, by a lead or any other force, and anyone sees a problem with the proposed direction, there should be an attempt to inform this steering force of the potential problem. Assume good faith stands well here, but no one is omniscient, and there might be some special feature of some language, unknown to the members of this steering force, which isn't tractable within the boundaries of whatever supposedly genius model it might promote. Or maybe discussion might show that the model suits after all, and it just wasn't clear how its flexibility might accommodate this special language feature while it does. Or maybe the model can't, and a discussion on a model shift should aim at providing one that can.

Sure, mimicry might be an important first step toward mastering a skill, but blindly following current practices is the surest way to reach general disaster. :)

Moreover, acting smartly is not enough for our movement. Actually, the word "smart" appears nowhere in statements related to our values, for example in the Wikimedia 2030 consultation. That doesn't mean acting smartly is undesirable, but it's not a driving force within our movement.

The motto we see everywhere is not "be smart", but "be bold".

I don't think these cases will happen often, considering the highly flexible data model of Wikidata that allows for a diversity of points of view, but it is possible.

I agree with that, as far as my knowledge of the Wikidata general model goes. However, having reviewed the Lexeme model, I gave several feedback points which I would personally judge as alarmingly reducing this flexibility. And what I have read so far on this page tends to confirm this impression.

But the question that you have is whether the Wikidata community will agree with the decisions from the French Wiktionary, which is unpredictable.

As I understand it, no one has to agree, as long as statements are used to model it. If some source (and a Wiktionary practice can be a source) endorses some speech practice, then it can be documented as a statement.

You might also include it as a form or a representation in the current model, if you put aside the written-language-centrism which seems to pervade the whole model description document.

There is plenty of content in Wiktionary which is not amenable to Wikidata - e.g. the appendices in the English Wiktionary, pages on grammar, etc. I think the richness of Wiktionary is being underestimated in the conversation here - Wiktionary is much more than what some structured data can capture. And all of that would remain a unique feature of the Wiktionary projects.

I think this is missing the point of @Amgine:. If the main namespace of the Wiktionaries is aimed to be covered – to whatever extent – by what will be integrated into Wikidata, and the expectation is that it will be filled mostly by a new community, then it's hard to understand how the two could not end up in some kind of competition, with all the drama you might expect. All that can be avoided, but with very poor probability if the only way the issue is treated is to elude it completely or suppose it can magically disappear through rhetorical spells.

We need a clear plan and clear statements about

  • how the current Wiktionaries will be integrated into this project, or whether at some point they are simply planned to be thrown into the obscurity of some historic archive, without offending their communities;
  • how the problem of making Wiktionary mainspace material redundant (to whatever extent) will be addressed, to ensure maximum inclusion and give the communities the feeling that their possibly unmet needs are being listened to.

And no, just saying that it's up to each community whether or not to include queries from Wikidata won't resolve everything. If some content already in a Wiktionary article is to yield the very same result but through Wikidata queries as much as possible, for the sake of homogeneous practice or whatever other reason, then contributors will need to move this content to Wikidata. And if the corresponding queryable repository is under CC-0, such a move will be impossible. So yes, contributors will face a choice between dropping their current material to use Wikidata features, or keeping the former and not using the latter.

Yeah, I know, it's always Cassandra who speaks the loudest, but that doesn't mean the silent majority agrees with everything and thinks everything is being wonderfully steered. --20:30, 27 August 2017 (UTC)

How I work on Wiktionary, and my important constraints

You want input on how people work on wiktionaries, here is my (long) input.

According to January statistics, I created 1,333,201 pages on fr.wikt, plus additions of language sections to existing pages. My concerns must be taken into account.

How I work:

  • to improve a page: I open the page, I modify the text and I save it (nothing special).
  • to create a page manually (42,088 pages created): I prepare the text on my computer (outside the project), and I copy/paste this text to the new page contents.
  • to create pages by bots (1,291,113 pages created): I have written many bots. These bots use the pywikibot framework. The available bots have their source code on fr.wikt. I prepare the bot input manually on my computer (in a file), and I execute the bot. If the page to be created does not exist, the bot creates the page. If the page already exists, the bot stores the text in a file (unless it can see that the work has already been done). From time to time, I open this file, and I copy/paste the text at appropriate places in the pages (sometimes this may require some changes to the prepared text or to the existing text).
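The create-or-defer bot logic described above can be sketched in miniature. This is a toy model under stated assumptions, not the actual fr.wikt bot code: a plain dict stands in for the wiki, and the function name is hypothetical.

```python
# Toy model of the bot workflow described above: `wiki` (a dict of
# title -> wikitext) stands in for the live wiki, and `deferred`
# collects prepared text for pages that already exist, so a human can
# merge it manually later. Names are illustrative only.
def create_or_defer(wiki: dict, title: str, wikitext: str, deferred: list) -> bool:
    """Create the page if missing; otherwise defer the prepared text.
    Returns True if the page was created."""
    if title not in wiki:
        wiki[title] = wikitext
        return True
    if wikitext in wiki[title]:
        # The work has already been done: nothing to defer.
        return False
    deferred.append((title, wikitext))
    return False
```

In the real workflow, the dict lookup would be a pywikibot `page.exists()` check and the assignment a `page.save()`, but the control flow is the same.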

My requirements and comments:

  • as you can guess, storing each article as text (just like a text file) is an absolute requirement. I want simplicity, and I'm not willing to convert all my bots to use smart Wikidata APIs.
  • Wikidata can propose whatever it wants, but this should not have any effect on contributors preferring to work on text (either with bots or without bots).
  • A very, very strong requirement: the list of languages accepted by a Wiktionary project should never be restricted by Wikidata. Otherwise, Wikidata could become the ideal place for unacceptable political action from activists or countries. This is an extremely sensitive issue.
  • Another strong requirement: for flexibility, the list of possible kinds of info about a word should never be restricted by Wikidata.
  • Some interfaces already exist for contributors willing to contribute, but not willing to try to understand the page format, especially for adding translations. I think that other such tools could be developed, but I don't see how Wikidata could help.
  • We recently discovered that the list of interwiki links of each page would become unavailable, not only on pages themselves, but also in dumps. This is a very serious drawback of Wikidataization of interwiki links (an apparently innocuous move).
  • I feel that many of Wikidata's basic ideas are the same as OmegaWiki's (structured storage, thus much less flexibility; a single storage project, thus excluding many contributors in practice). These ideas might seem very sound from an intellectual point of view, but they forget the human factor. OmegaWiki proved to be a failure and this should be a lesson. I think the main reasons for the failure are as follows:
    1. A wiki needs a discussion language (there are few exceptions, e.g. Commons), and it is very clear that this language was English, thus excluding many potential contributors and readers: when I access this site, I see a mix of French and English.
    2. It is close to impossible to understand its concepts.
    3. Readers were forgotten.
    4. A basic idea is that the same definition (in any language) can be shared between words of many languages. Actually, this may be more or less true in some cases but, very often, there are subtle differences in meaning or in use between words of different languages, and this makes translations (and synonyms) only approximate. This issue was not considered.
    5. There were restrictions on contributions: you had to create an account, you had to tell which languages you know (and your level)... When somebody wants to help, any restriction is likely to dissuade them.
  • The only negative effect of OmegaWiki on fr.wikt was that it took one of the major early fr.wikt contributors. But if the link between Wikidata and the Wiktionaries is stronger and has an effect on contributors, the consequences might be much more serious. An example (let's hope it never happens): if some data about an Aramaic word is updated on a wiktionary because it has been updated on Wikidata by a contributor speaking only Albanian and knowing a little Aramaic, and a contributor speaking only Kazakh and a little Aramaic disagrees, how could they discuss the issue? But if this data is available on Wikidata, and a contributor of the wiktionary project chooses to import it (thus duplicating it), there is no such issue, because the project has validated the info, and nobody from outside the project can change it. If contributors see that info is changed from outside the project and they cannot do anything about it, many will leave the project. It might even kill some successful Wiktionary projects.
  • Even without Wikidata, Wiktionary projects can be a help to each other through bots: imports of new data, but also detection of inconsistencies between projects (with manual correction).
  • I see Wikidata as a toolbox and a database made available to wiktionaries (and this database cannot be built from wiktionaries for license reasons).

Lmaltier (talk) 19:16, 15 March 2017 (UTC)

I think Lmaltier is maybe a bit too pessimistic (especially about communication; there will always be bridge people and translation tools), but there is one point that I agree with 100000%: the list of languages! When I see that something as obvious and trivial as adding fr-ca has been on hold for 4 months (T151186), I fear it could cause problems for L-items. We should definitely have the possibility to add and use whatever languages the sources give. This is, indeed, a very, very strong requirement. Cdlt, VIGNERON (talk) 21:42, 15 March 2017 (UTC)
L-items? I don't understand. My requirement was about wiktionaries: Wikidata should not be a constraint. I would personally disagree about the addition of fr-ca, but this is another issue. Lmaltier (talk) 21:49, 15 March 2017 (UTC)
Lmaltier [In French in the original] To explain the jargon: currently, on Wikidata, we have items starting with Q to describe concepts (and not words), linked by properties P. In the latest version of the draft (and it is still only a draft, which is why your comments, even if a bit pessimistic, are very important), the idea is to have items starting with L (nicknamed L-items) to describe words (more precisely lexemes, hence the L) and to meet the needs of the Wiktionaries. Why are you against having data marked as being in Canadian French? How would you go about saying that « épinette » is the "common name of a family of coniferous trees called épicéa in Europe" in Canadian French? (as on wikt:fr:épinette#Nom_commun_2) Cdlt, VIGNERON (talk) 22:35, 15 March 2017 (UTC)
[In French in the original] I am obviously not against Canadian French as a characteristic of a word. I am against it as a language in its own right (and my comment was about the list of languages). As for the abstruse vocabulary (and in English), it strongly reminds me of OmegaWiki (the true meaning of "defined expression" is impossible for a normal newcomer to understand...) Lmaltier (talk) 06:44, 16 March 2017 (UTC)
Thanks Lmaltier for translating your concerns into English. Be sure that we're aware of each of them and we're working on providing answers.
About languages: on Wikidata, we chose to rely on the Language Committee to accept or refuse inclusion of new languages. This process is not the perfect solution, still we trust this group of people who are aware of language issues to take the best decisions for the Wikimedia projects. Lea Lacroix (WMDE) (talk) 10:18, 17 March 2017 (UTC)
Wikidata may choose any list; this is not my concern. My concern is that Wikidata should not in any case restrict each Wiktionary's list of languages, and each Wiktionary should be free to add new languages to its list, just as is the case now. I'd like to be reassured on this issue (and on all the other issues listed above). I don't want Wikidata to kill a project I have spent so much time on. I issued an alert for OmegaWiki contributors in due time, without any result: they chose to bury their heads in the sand; don't do the same. About the Language Committee: is it a committee for introducing projects in new languages? Even for Wikidata, the issue here is 100% different, as all wiktionaries in each language want to describe at least 7,000 languages (probably more): our mission is all words of all languages. More than 4,000 languages are already present on fr.wikt. Nobody can imagine that there will be Wikimedia projects in 7,000 languages. Lmaltier (talk) 18:06, 17 March 2017 (UTC)
Also a personal comment (for what it's worth): in my job, I build complex web systems like the one you are trying to build, and I understand them well. A few years ago, I imagined a kind of wiki with a structured database adapted to the subject, and I worked on the database structure. My ideas seemed sound to me (it was much simpler than the one you are working on, and the subject made the idea much more feasible than for languages), but I finally realized that I was on the wrong track, and that a classical text format was much better, because it is much more flexible. Lmaltier (talk) 18:34, 17 March 2017 (UTC)

Hi, @Lmaltier:, thank you for the comments. Just to offer one very strong reassurance: Wikidata will not constrain in any way any of the Wiktionary projects. Wikidata will be an offer which the Wiktionaries can choose to use, but there is absolutely no requirement to do so. Whatever list of languages is available in Wikidata, no Wiktionary project will be constrained by it. Any adoption of anything Wikidata offers by the Wiktionaries will be decided by the local Wiktionary communities following their own local rules.

Regarding your other points, some of them are entirely irreconcilable with the goals of Wikidata - i.e. entities in Wikidata, including lexical entities, will be structured data, not wikitext - and others are not applicable - many of the points you raise about OmegaWiki do not apply to Wikidata, partly because we took the experiences of OmegaWiki into account - which would make a point-by-point reply tiresome to write and read. I'd be happy to answer specific points though, if you let me know which. --Denny (talk) 16:34, 6 April 2017 (UTC)

It's possible I was not very clear: my requirements were related to the Wiktionary/Wikidata relationship only (my final personal comment was not a requirement). Let's assume that a Wiktionary wants to take advantage of Wikidata data: it should be able to do so with my above requirements met; this is what I meant. If taking advantage of Wikidata would lead either to less flexibility for Wiktionaries, or to automatic external changes to current Wiktionary data, or to a mandatory visual interface for Wiktionary contribution, then Wikidata should never be used by wiktionaries. Lmaltier (talk) 20:09, 6 April 2017 (UTC)

Thanks for the clarification, @Lmaltier:. Even if a Wiktionary project chooses to use data from Wikidata, the pages in Wiktionary will always remain pure wikitext. Wiktionary pages will not become a mix of structured data and unstructured data - these pages remain wikitext.

Wiktionaries will never be less flexible because of Wikidata. Wiktionaries will retain all and every possibility that they currently have. Nothing will be taken away from Wiktionaries. Wikidata is about adding new possibilities, not about removing existing ones.

If a Wiktionary chooses to query data from Wikidata and display it on the Wiktionary page, then, yes, changes to Wikidata will then propagate to the Wiktionary. I assume that is why they query Wikidata, instead of copying the data manually or through a bot. They can do the latter too, by the way, if they prefer.

Wikidata data will have to be edited in the Wikidata knowledge base itself - but there are many possible UIs for that, either through the Wikidata UI, through some UI integrated into Wiktionary, or through some other UI tools (such as one of Magnus Manske's many tools for Wikidata).

From the side of the Wikidata project, there will never be a mandatory workflow requiring any Wiktionary contributor to add their data through Wikidata. We didn't have that for Wikipedia, and it won't be there for Wiktionary. It is entirely up to the Wiktionaries to decide how they want to structure their workflows, how deeply they want to integrate with Wikidata, etc. No one on Wikidata has the authority to say "you must not have the plural of mouse in your local Wiktionary". Local Wiktionaries will retain the autonomy to decide how much of their data they want to keep locally, how much of it they prefer to outsource to Wikidata, and how they like to structure the data quality story around the new tools and possibilities.

That's my point of view. I hope that helps with some of the concerns you mentioned. --Denny (talk) 20:59, 6 April 2017 (UTC)

Language should definitely not be a first-class property of the central class of the model. That is, there should be no "language" field in "Lexeme". At most, a "languages" record, which also allows adding sources and qualifiers, should be used instead. There are people out there writing papers on "Chinese English", and the lexical features exhibited in that kind of paper should be storable in Wikidata, but surely you wouldn't want a different lexeme for each possible "en-zh" entry. Therefore I think using statements to express the relations between languages and lexemes, languages and forms, and languages and grammatical features would be more appropriate. -Psychoslave (talk) 06:35, 29 August 2017 (UTC)

Wikidata:Lexicographical data/Notability

I have drafted this page. Comments welcome.--GZWDer (talk) 12:26, 3 August 2017 (UTC)

I have also drafted (though it is currently incomplete) Wikidata:Lexicographical data/Layout.--GZWDer (talk) 14:49, 3 August 2017 (UTC)
Thanks for your work here. About the model, did you have a look at the technical data model? This shows how the data will be structured on the Lexeme page, and still leaves a lot of room for the community to decide how they want to organize the information. Lea Lacroix (WMDE) (talk) 15:17, 3 August 2017 (UTC)
Of course I have read the technical data model. In my opinion, Wiktionary support should not be deployed before the layout is decided, otherwise we will run into a mess.--GZWDer (talk) 16:22, 3 August 2017 (UTC)

Thanks for your work GZWDer. I will add a translate tag on Wikidata:Lexicographical data/Notability even though it is a draft (which means the page could evolve quickly), because I think it is really useful (and needed) to get comments from everybody (even people who do not read or write English). I will translate it into French, but you can do it in other languages. Pamputt (talk) 18:09, 3 August 2017 (UTC)

I also alerted the Tremendous Wiktionary User Group, the global user group of Wiktionary users, about this discussion in order to get as much feedback as possible. Pamputt (talk) 18:56, 3 August 2017 (UTC)

Bot

Hi, does someone know the status of the category imports from the various Wiktionary projects? I know that Nikki, JAn Dudík and probably others added some interwiki links by hand. I remember seeing a bot that imported a lot of interwiki links from the English Wiktionary project, but I do not remember its name. It seems the work is not finished for en.wikt, since Category:Swahili non-lemma forms has no Wikidata item yet. So is there a central place where we can follow the progress of the Wiktionary interwiki link import? If not, we could create a sub-page Wikidata:Wiktionary/How to help/Interwiki link import. Pamputt (talk) 22:09, 7 August 2017 (UTC)

@Pamputt: My bot imported almost all categories with interwiki links from all wiktionaries, and all categories from the cs and sk wiktionaries. Categories without links are not mass imported yet. I can easily import all categories from every wiktionary, but there are also duplicates in some wiktionaries, and it would be good if some local user could say which categories to import and which to delete. JAn Dudík (talk) 07:30, 8 August 2017 (UTC)
@JAn Dudík: thanks for your work. I think your bot can import all categories from the French Wiktionary; I think most of them have interwikis if they exist. In addition, I am currently working on languages and I merge a lot of category and language pages. If all categories from the Wiktionary projects are here on Wikidata, it could help to find and merge them more easily. Pamputt (talk) 08:38, 8 August 2017 (UTC)
@Pamputt: OK, the bot is importing categories from frwikt. JAn Dudík (talk) 08:58, 8 August 2017 (UTC)
@Pamputt:   Done, all french categories except the newest and some with interwiki conflicts are imported. Which language next? JAn Dudík (talk) 08:41, 15 August 2017 (UTC)
@JAn Dudík: I have reviewed all categories on ca.wikt. Remaining ones with no Wikidata item can be mass imported. Thanks. --Vriullop (talk) 13:13, 15 August 2017 (UTC)
@Vriullop:   Done, categories on ca.wikt are imported. JAn Dudík (talk) 07:11, 16 August 2017 (UTC)

Talk at Wikimania and demo page

 
Lexicographical data on Wikidata, Lydia Pintscher, Wikimania 2017

Hello all,

During Wikimania, Lydia presented the status of the lexicographical data on Wikidata. You can find the slides here.

We're also happy to announce that there is now a demo system ready, where you can try structured lexicographical data as it will appear on Wikidata. Please note the following:

  • The system is not persistent for now; the information is not stored and will disappear if you reload the pages
  • The structure of the pages is based on the data model, but the content and the properties will be decided by the community in the future. We created a few for the demo; feel free to create others.
  • The design of the page is also expected to change; this is not the final version

Feel free to try it, give us feedback or ask questions. Thanks for your support! Lea Lacroix (WMDE) (talk) 17:31, 13 August 2017 (UTC)

This was quick!
Seems usable so far.
My first request is for a "counts per million words" property. d1g (talk) 03:25, 14 August 2017 (UTC)

We need clear boundaries

We need clear boundaries to decide what should go in Q-items and what should go in the new Wiktionary datatype items. As an example, see 雨 (Q3595028). It is currently described much like a lexeme would be, so what happens when we get the 雨 lexeme? Will there be a duplication of data? What metadata should go in which datatype? ~nmaia d 16:38, 17 August 2017 (UTC)

This is a beautiful example, thanks. I don't know whether 雨 is a lexeme. I know that characters of scripts often have items representing them, and for alphabets and syllabaries I don't expect those to show up as lexemes. D (Q9884) is not a Lexeme. But in ideographic and hybrid scripts the situation is different. The item describing 雨 you linked to is certainly not sufficient to be a lexeme. I would expect senses, pronunciations, etc., which are all not there. It describes the character, not a word. The linked Wikipedia article is also only about the character. The Wiktionary entry for 雨 covers its usage as a Lexeme, though, listing senses, pronunciations, etc.
To be honest, I really don't know how the Wiktionary data model will deal with languages written primarily in an ideographic script. When designing the data model, this was explicitly deferred - not because it is not important, but because it is too easy to get wrong. Our hope was that once we have a good and working model for languages written primarily in alphabetic scripts, we can see how the data model works out for the ideographic-script languages. I expect it to be a non-perfect match, and that improvements will be needed. The Wiktionary4Wikidata proposal has explicitly allocated a development phase for exactly that.
But until then, making hard rules and clear boundaries at the given time seems premature. I would rather encourage some experimentation, and then to see what works and what doesn't. I don't think any one of us is smart enough to figure out the best answer in advance. This will have to grow with the software, the data, the community, and I expect that part to possibly shift quite a bit. --Denny (talk) 15:16, 18 August 2017 (UTC)
@Denny: If you say, that D (Q9884) is not a Lexeme, how would we model wikt:d? Link wiktionary pages to Q9884? --Infovarius (talk) 10:24, 21 August 2017 (UTC)
@Infovarius: Good point. I retract my statement - I wasn't aware Wiktionary had entries for individual letters. Given that, I am not sure what to do about that, whether these should be Lexemes or Items, and I guess that warrants a deeper discussion. Intuitively, I would think that letters should be Items, not Lexemes, but I can see it go either way. Thanks for pointing this out. --Denny (talk) 14:48, 21 August 2017 (UTC)
I would probably model letters in wikt as Lexemes (they are described as graphical representation), but then what to do with Wikipedia articles about them? Link as senses? --Infovarius (talk) 09:50, 22 August 2017 (UTC)
Maybe. In fact, this won't be the only example where we already have Items for a word. en:Category:English_words has plenty of Wikipedia articles which are English words, and many of those really seem to be about the word, and thus Lexeme. So we will need a property to link equivalent Lexemes and Items together, anyway, it seems, and that same property should also do the trick for letters. --Denny (talk) 14:50, 22 August 2017 (UTC)
OK, let's go with a property. --Infovarius (talk) 09:52, 23 August 2017 (UTC)

@NMaia, Denny, Infovarius: 雨 is a very good example of an extreme case. It is (at least):

  • a character in at least 3 Asian languages (Japanese, Chinese, Korean) and probably others (Chinese lectal variations, and Vietnamese when it was using the Chữ Nôm writing system)
  • a word in these 3 languages (the same word « rain »)
  • a rare given name (only 32 persons so far; apparently no Q-item yet, but it probably should be created, and maybe a specific L-item too)
  • the name of several poetry texts, cf. the disambiguation page on zh.ws: s:zh:雨 (and the corresponding Q-items: Q18886487, Q17368698, Q17366417, Q18886482)
  • probably other concepts...

Clearly, I think we need (and already have) several Q-items and at least one L-item (or several?), but I'm not sure how to model, structure and link them. Cdlt, VIGNERON (talk) 16:41, 27 August 2017 (UTC)

Multilingual dictionary

Will we be able to have a multilingual dictionary such as OmegaWiki? I think it is more useful for some languages. For example, I work with First Nations in Canada, and right now I would need to add their words to both the French and English Wiktionaries, and to all the others separately for that matter, which is very time-consuming just to say that "amiskw" is a "beaver", for example. I mean, it's still a beaver in all language versions of Wiktionary. I understand that it's not that simple for a lot of words that have nuances in meaning, but for the vast majority of words it is that simple. Why is there no way to automatically have the word in a multilingual dictionary, so that somebody speaking Spanish, for example, would see that "amiskw" is a "castor" without me having to go add it manually to the Spanish Wiktionary, since the words "beaver" and "castor" should already be linked? I think this is something we should explore as a community. It would also allow us to automatically create visual dictionaries in different languages. For example, a graphic showing body part names in English could be automatically switched to any other language that has the names for those body parts linked to the English words. What do you think? Amqui (talk) 13:14, 24 August 2017 (UTC)

That's pretty much the idea here. You would create the entry for "amiskw" only once, and it would be available to the different Wiktionaries at their pleasure. There is a major difference from OmegaWiki, though, which is that Wikidata for Wiktionary is centered around Lexemes while OmegaWiki is centered around Defined Meanings - i.e. Wikidata for Wiktionary will be more semasiological, whereas OmegaWiki is more onomasiological. And there's a valid debate on which one is the right way for a dictionary in general. For a collaborative dictionary, though, my assumption is that a semasiological approach leads to more results with less potential for blocking debates: it is much easier to agree that a certain word appears in a language than that a certain meaning is expressed. It is much easier to keep words separate than it is to keep meanings separate. Again, this is not to say that the semasiological approach is always better, but for a collaborative setting with minimal oversight it seems to be the more promising route. In the end, you can, with a query, turn the thing around, so your use case will still be possible - it just won't be the primary way the data is stored. But in the end, who cares? I am pretty convinced that within two years of its launch, we will have interfaces that do that inversion for you. I hope that makes sense. --Denny (talk) 18:29, 25 August 2017 (UTC)

Amusingly, by the way, your example would totally not work for the two languages I am a native speaker of: in Croatian, "ruka" does not cleanly map to "Hand" in German ("hand" in English), neither does "noga" map to "Fuß" ("foot"). Nor does the German "Hals" really translate to anything in English, it's a combination of "throat" and "neck". The map of body parts is in fact different in the different languages - it is not just different words for the same concepts. --Denny (talk) 18:29, 25 August 2017 (UTC)

(Edit conflict) @Amqui: I wanted to say something similar. Why take English as the first source? It has, for example, no distinct words for the two parts of the arm, so how do you plan to model them with English? Translation is not always 1:1, so it is not possible to derive transitive translations from any chain of translations. --Infovarius (talk) 18:49, 25 August 2017 (UTC)
I was using English as an example... we would need to use "concepts" and not words for a multilingual dictionary. Some words in English may correspond to several "concepts" for which different languages have different words, or even no word at all. That's normal, and this isn't a problem. See how OmegaWiki handles that; it works fine. Amqui (talk) 18:53, 25 August 2017 (UTC)
@Denny: Why can't we have both a semasiological dictionary and an onomasiological dictionary? Amqui (talk) 18:56, 25 August 2017 (UTC)
@Amqui: A dictionary can be primarily only built one way or the other - one has to decide how to build the data model. With a structured electronic dictionary we have the luxury of providing either view, though. So in practice this difference will hopefully eventually become moot, and allow users to use the lexical data in their preferred way - be it centered around words or around meanings. And the proposed data model allows for that, so, yay! :) --Denny (talk) 16:06, 28 August 2017 (UTC)
@Denny: I understand that a dictionary has to be built one way or the other, but I don't see why we are limited to a dictionary... sky is the limit. Amqui (talk) 16:10, 28 August 2017 (UTC)
@Amqui: Sure, that's why we have OmegaWiki which explores the concept centered around Defined Meanings, and the Wiktionaries, which traditionally are centered around Words. --Denny (talk) 16:12, 28 August 2017 (UTC)
@Denny: I don't know who you refer to by "we", but OmegaWiki is not a Wikimedia project. So "we" do not have a project for a dictionary around "defined meanings"/concepts. Amqui (talk) 16:15, 28 August 2017 (UTC)
@Amqui: We - you and me - are both contributors to both projects, so I guess that qualifies as a pretty good "we" - you and me. But besides that, I might also have referred to the wider collaborative open knowledge movement, which is another "we". The third possible "we" is all of humanity with access to the Web. Wikimedia also doesn't have a project to create maps, and yet I would say "we" have such a project. So, feel free to choose whichever of these three levels make most sense to you :) --Denny (talk) 16:24, 28 August 2017 (UTC)
@Denny: We do have "Wikimedia Maps": [1]. As far as I know, we, as the Wikimedia movement, do not have anything for a multilingual dictionary based on "concepts" rather than "words". That being said, now with Wikidata, we could use the description of each item to create such a dictionary, I guess, but we are lacking a real interface to do so. Furthermore, as far as I know, we have no projects making use of the content of OmegaWiki, nor any real links from Wikimedia projects pointing to that website, so it is far more outside of the Wikimedia realm than OpenStreetMap is. I think we, as the Wikimedia movement, need something like OmegaWiki, or need to better include it within our existing projects. For example, from an article on Wikipedia, we can link to a word on the Wiktionary, but we cannot link to OmegaWiki. We do have a property to do so on Wikidata, but, as far as I know, it remains unused on other Wikimedia projects, while it would be really useful to at least link it from the different Wiktionaries. Amqui (talk) 16:30, 28 August 2017 (UTC)
@Amqui: That's something the other individual projects need to decide and have it discussed there. As you point out, Wikidata already provides the link. --Denny (talk) 17:51, 28 August 2017 (UTC)

Syllabification?

Are there any concrete ideas for how to model the syllabification of a lexeme? I have a few ideas, but don’t like any of them.

  1. A single string statement with some separator character between syllables, e. g. “syllabification: syl‧la‧bi‧fi‧ca‧tion”. Not nice because we’d be encoding structured information in the string (so any query would have to do string manipulation and be slow as hell), though at least there’s a dedicated Unicode character for it (U+2027 HYPHENATION POINT), so I think it wouldn’t be ambiguous.
  2. Multiple statements, one for each syllable, with series ordinal (P1545) qualifiers. Requires duplicating references on each statement, and doesn’t leave room for encoding several syllabifications (e. g. phonetic vs. written in English, or in case of disagreement between sources). I’m also not sure if the values (individual syllables) should be strings, items, or lexemes, though I suppose strings make the most sense.
  3. A single statement with one qualifier for each syllable (and e. g. list of values as qualifiers (Q23766486) as the main value). Fixes the shortcomings of #2 but loses the order of syllables (qualifiers can’t themselves have series ordinal (P1545) qualifiers) and doesn’t support duplicate syllables.

Of these ideas, #1 – the least structured one – seems the only feasible one to me. I’m not really happy with that… --TweetsFactsAndQueries (talk) 22:02, 27 August 2017 (UTC)
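To illustrate idea #1: if a dedicated separator such as U+2027 HYPHENATION POINT is agreed on, clients consuming the string value can trivially recover the ordered list of syllables (duplicates included) by splitting on it. A minimal Python sketch, with hypothetical helper names:

```python
SEP = "\u2027"  # U+2027 HYPHENATION POINT, the proposed dedicated separator

def syllables(value):
    """Split a stored syllabification string into the ordered list of
    syllables (order is significant, duplicates are preserved)."""
    return value.split(SEP)

def to_storage(parts):
    """Inverse operation: join syllables back into the string statement value."""
    return SEP.join(parts)

value = "syl\u2027la\u2027bi\u2027fi\u2027ca\u2027tion"
assert syllables(value) == ["syl", "la", "bi", "fi", "ca", "tion"]
assert to_storage(syllables(value)) == value
```

Of course, a SPARQL query would still have to do this string manipulation server-side, which is the slowness concern raised above.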

That isn't syllabification; that's hyphenation. Syllabification is the division of a word into spoken syllables, and can only be indicated using IPA or some other phonetic system. By contrast, you are showing hyphenation which indicates where a word may be divided in print when it wraps from one line or page to another. These two processes are not the same thing and do not result in the same kinds of divisions.
For example, the English word boxes has two syllables as /bɒk.səz/ but could be hyphenated here: box‧es. These do not correspond to the same place, since the syllable breaks in the middle of the sound produced by the letter x, and this cannot be represented using the English spelling of the word. Worse, the syllabification of words differs depending upon how those words are pronounced, so syllabification is tied to a specific pronunciation in every instance, and not directly to the lexeme.
Further, hyphenation is not universal within a language. The allowed locations for hyphenating a word in print vary by country and even by publisher. Look in different dictionaries (from US, UK, Canada, Australia) and you will find that hyphenation is not consistent. --EncycloPetey (talk) 17:53, 28 August 2017 (UTC)
I’m not a linguist, so the best resource I had at hand was Wikipedia, whose syllabification article seems to imply that syllabification can also be a general term for “the separation of a word into syllables, whether spoken or written” (emphasis mine). I considered mentioning the difference between phonetic syllabification and written hyphenation in my posting (since the Wikipedia article had made me aware of it), but decided against it because it’s irrelevant to the question I asked: whether the individual elements are spoken syllables, written syllables or something else doesn’t matter because the question was how to store one or more lists of such elements where the order is significant and elements can occur more than once. (I assume that whatever we end up using, there will be properties for syllabification and for hyphenation, with a similar way to model the elements.)
Your point that hyphenation is not universal within a language confirms my suspicion that supporting several syllabifications/hyphenations/whatever is required, which rules out idea #2. --TweetsFactsAndQueries (talk) 21:59, 28 August 2017 (UTC)
The WP article is in need of a serious overhaul; note that it cites no references at all. But back to the point: neither syllabification nor hyphenation is universal within a language. The use of hyphenation in the written word to represent syllabification is frequently an approximation, and in some cases misleading. Hyphenation does not follow all the same breaks, as it is usually based on morphemes rather than pronunciation. And of course some languages, especially ideographic ones, do not hyphenate words in print at all. --EncycloPetey (talk) 23:09, 28 August 2017 (UTC)
I agree that the first option is the only realistic option. Separate statements would be weird (statements are independent of each other, there's nothing saying how multiple statements should be interpreted and it would mean copying the same reference to each statement). Qualifiers, as you pointed out, would be useless if we can't order them properly and that would really be a hack to work around the lack of support for an actual array (or whatever you want to call it) of values (Help:Qualifiers even says "Note that a statement should still provide useful data even without a qualifier"). While storing separators in the string is not ideal, it is simple for users to add (if each part has to be added separately, it would be really tedious) and as long as we insist on a standard separator (which we can do using constraints), it is not really any harder for people using the data to work with. - Nikki (talk) 13:10, 29 August 2017 (UTC)