Wikidata:Lexicographical data/Development/Proposals/2013-06

Since before the start of the Wikidata project, there have been many discussions about how to support not only the Wikipedias but also the other Wikimedia projects. It became obvious very early that Wiktionary would be one of the projects that would benefit most, and at the same time one of the more complicated projects to support. The data model required to support Wiktionary differs in subtle but important ways from the current Wikidata data model. We are fully aware of the existing and long-standing discussions on how best to support or even replace Wiktionary, and today we want to offer our own proposal to the community for consideration.

To make it very explicit: this is not a plan to replace Wiktionary, but rather a first step towards supporting some of the structured data already in the Wiktionaries. The goal is to reduce the maintenance effort of the Wiktionary projects, and to enable the different language editions of Wiktionary to share some structured data if they choose to, with that decision made entry by entry. Wiktionary, as it is, provides a flexibility which goes well beyond what we aim to cover with this proposal.

In the following, technical terms are written in italics, and have a specific meaning as given by the Wikidata glossary. The following proposal is a bit technically dense, but we hope that we can clarify and explain it based on your feedback.

Wikidata currently has two entity types, and a third one is planned: items, properties, and queries. In order to support Wiktionary, we plan to introduce two new entity types: expression and sense.

Expressions, like all entities, are identified by a prefix and a number, e.g. W2808. The label of the expression is the actual word or expression. An expression can contain spaces (e.g. “to go”, “New York”, “carry coals to Newcastle”). There is only one label per expression, not one label per language (unlike items and properties in Wikidata, each of which can have a label in every language). An expression has no aliases and no description.

The sense is the second new entity type (Word sense). Unlike other entities, a sense is not independent: it depends completely on an expression. Each sense belongs to exactly one expression, but an expression can have several senses. A sense has no label or alias, only a description (or gloss), and there can be a description in every language. In the user interface, senses do not have a wiki page of their own; they appear as sections on their expression's wiki page.

Both expressions and senses can have statements, just like items do. As with items, statements can have qualifiers and references, etc.

Furthermore, we introduce the respective new datatypes for properties, the expression and the sense. So new properties can be made that point to expressions or senses, like for example “infinitive form” connecting “going” with “to go”, or “plural form” connecting “apple” with “apples”, and so on. The set of properties, just as with items, is defined by the community - infinitive form and plural form are just examples. There is no restriction on where properties can be used, i.e. as qualifiers, on senses, on items, or wherever.
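The data model described so far can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the class names (`Statement`, `Sense`, `Expression`), the `add_sense` helper, and the IDs `W1` and `W2` are hypothetical and not part of the proposal, which defines no concrete API.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Statement:
    # Hypothetical shape: a property name plus a value, with optional
    # qualifiers and references, as for statements on items.
    prop: str
    value: object
    qualifiers: List["Statement"] = field(default_factory=list)
    references: List[str] = field(default_factory=list)


@dataclass
class Sense:
    # A sense has no label or alias, only glosses keyed by language code.
    glosses: Dict[str, str]
    statements: List[Statement] = field(default_factory=list)


@dataclass
class Expression:
    id: str      # prefix plus number; the IDs used below are made up
    label: str   # exactly one label: the word or expression itself
    senses: List[Sense] = field(default_factory=list)
    statements: List[Statement] = field(default_factory=list)

    def add_sense(self, glosses: Dict[str, str]) -> Sense:
        # Senses are created through their expression, so every sense
        # belongs to exactly one expression.
        sense = Sense(glosses=dict(glosses))
        self.senses.append(sense)
        return sense


apple = Expression(id="W1", label="apple")
apples = Expression(id="W2", label="apples")
fruit = apple.add_sense({"en": "fruit of the apple tree",
                         "de": "Frucht des Apfelbaums"})
# A property with the "expression" datatype points at another expression.
apple.statements.append(Statement(prop="plural form", value=apples.id))
```

Keeping the senses inside their expression mirrors the rule that a sense has no page of its own, while statements hang off both entity types in the same way.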

The system would, just like Wikidata for items, define the framework in which the editors can define the actual properties connecting expressions and senses and items. The system would be flexible in this regard. It is slightly more complex than the current system, but we think it remains as simple as possible without sacrificing the use cases and requirements.

The Wiktionaries would be able to access the data about expressions and senses (and also items, for what it’s worth) through Lua. It would be completely up to the communities how they want to use Wikidata data in their Wiktionaries.

Examples of usage

That’s already the whole proposal. In the following, we give a few illustrative examples of the properties that could be created and how they could be used. We are not domain experts, so take the examples with a grain of salt. At this stage of the proposal, we are interested in feedback regarding the basic data model described above, not the examples given below.

  • “language” (expression->item): which language or languages is the expression used in? Can appear several times per expression (for example, for expressions that are both Serbian and Croatian). This is not meant to say that the expression “arm” is both an English word (with the sense of “part of a human or some animal’s body”) and a German word (with the sense of “poor”). It is meant for connecting expressions with the same morphology to all appropriate languages, without having to strictly define which languages are available or to duplicate expressions often.
  • “alternative spelling” (expression->expression)
  • “infinitive form” / “imperative form” / “plural form” (etc.) (expression->expression): connecting an expression with a derivative, e.g. “going” –infinitive form-> “to go”
  • “number” (expression->item): singular, plural, dual, etc. This is just an example for a grammatical property, others would be “tense”, “person”, “part of speech” or “word type” (noun, adjective, etc.)
  • “pronunciation” (expression->media): how is the expression pronounced? (also: “IPA” (expression->string))
  • “refers to” (sense->item), e.g. the sense of the expression “apple” as a fruit pointing to Q89
  • “example usage” (expression->monolingual text)
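To make the notation in the list concrete, here is a small Python sketch that records the value datatype of each example property and checks a statement against it. Everything in it is an assumption for illustration: the `VALUE_DATATYPE` table and `make_statement` function are hypothetical, and only the value side is checked, since the proposal places no restriction on where a property can be used.

```python
# Hypothetical registry: property name -> datatype of the value it points to.
# The names are the example properties from the list above, not a fixed set.
VALUE_DATATYPE = {
    "language": "item",
    "alternative spelling": "expression",
    "infinitive form": "expression",
    "number": "item",
    "pronunciation": "media",
    "IPA": "string",
    "refers to": "item",
    "example usage": "monolingual text",
}


def make_statement(subject, prop, value, datatype):
    """Return a (subject, property, value) triple if the datatype matches."""
    expected = VALUE_DATATYPE[prop]
    if datatype != expected:
        raise TypeError(f"{prop!r} expects a value of type {expected!r}, "
                        f"got {datatype!r}")
    return (subject, prop, value)


# "going" -infinitive form-> "to go": an expression pointing at an expression.
s1 = make_statement("going", "infinitive form", "to go", "expression")
# The fruit sense of "apple" pointing at the item Q89.
s2 = make_statement("apple (fruit sense)", "refers to", "Q89", "item")
```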

There are plenty of questions left that would be handled by the community. Would the expression “apples” also have the sense “fruits” or would it just be the “plural form” of “apple” which would have the sense “fruit” referring to Q89, and “apples” itself would have no senses? How to deal with etymology?

How would translation be handled? As a property pointing from one sense to another? That would lead to a large number of such relations: the number of languages times the number of languages. What about using the “refers to” property to find translations instead?
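The second option can be sketched as follows: translations are never stored directly, but derived from senses that refer to the same item. This is a minimal sketch under our own assumptions; the `translations` function, the sample data, and the placeholder ID `Q-placeholder` are hypothetical, and only Q89 comes from the examples above.

```python
from collections import defaultdict

# One entry per sense: (expression label, language, item the sense refers to).
# Sample data for illustration only; "Q-placeholder" is not a real item ID.
senses = [
    ("apple", "en", "Q89"),
    ("Apfel", "de", "Q89"),
    ("pomme", "fr", "Q89"),
    ("poor", "en", "Q-placeholder"),
    ("arm", "de", "Q-placeholder"),
]


def translations(label, lang, senses):
    """Find other expressions whose senses refer to the same item."""
    # Group all senses by the item they refer to.
    by_item = defaultdict(list)
    for lbl, lg, item in senses:
        by_item[item].append((lbl, lg))
    # Collect every other expression that shares an item with the query.
    result = []
    for lbl, lg, item in senses:
        if (lbl, lg) == (label, lang):
            result.extend(p for p in by_item[item] if p != (label, lang))
    return result


apple_translations = translations("apple", "en", senses)
```

With this approach, three senses sharing one item need three “refers to” statements, whereas linking them pairwise in both directions would need six translation statements, and the gap grows with every language added.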

The good thing is: it does not matter. The data model is flexible enough to accommodate all these approaches and many others, and thus follows the Wikidata recipe of providing an extremely flexible data model, which in turn can be used by the community in creative ways. We have checked the data models of the English and German Wiktionary, of OmegaWiki, of WordNet and EuroWordNet, and as far as we can tell the proposed data model covers many of the use cases. We tried to keep the data model as simple as possible while still being convenient for the use cases.