Wikidata:Lexicographical data/Development/Proposals/2014-10

Introduction

Word relationships

There are essentially three ways of getting at a word (lexeme):

written representation
pronunciation (articulation)
meaning (sense)

These relationships form the basis of a number of resources:

dictionary (representation–sense)
reverse dictionary (sense–representation)
thesaurus (representation–sense–representation)
rhyming dictionary (representation–articulation–representation)
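
As a rough illustration of how these resources fall out of the three relationships, the sketch below builds each one as a lookup over the same set of triples. The triples, the function names `build_index` and `related`, and the simplification of "rhyme" to "identical articulation" are all assumptions made for this example, not part of the proposal.

```python
from collections import defaultdict

# Toy triples invented for illustration: (representation, articulation, sense).
TRIPLES = [
    ("bow", "/boʊ/", "decorative knot"),
    ("beau", "/boʊ/", "male admirer"),
    ("suitor", "/ˈsuːtə/", "male admirer"),
]


def build_index(triples, key_pos, value_pos):
    """Map each value found at key_pos to the set of values at value_pos."""
    index = defaultdict(set)
    for triple in triples:
        index[triple[key_pos]].add(triple[value_pos])
    return index


def related(word, via_pos):
    """Follow representation -> (articulation or sense) -> representation."""
    forward = build_index(TRIPLES, 0, via_pos)
    backward = build_index(TRIPLES, via_pos, 0)
    found = set()
    for link in forward[word]:
        found |= backward[link]
    return found - {word}


dictionary = build_index(TRIPLES, 0, 2)          # representation -> sense
reverse_dictionary = build_index(TRIPLES, 2, 0)  # sense -> representation
thesaurus_for_beau = related("beau", 2)          # via a shared sense
rhymes_for_bow = related("bow", 1)               # via a shared articulation
```

A real rhyming dictionary would compare the endings of articulations rather than require an exact match; the exact match is used here only to keep the sketch short.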

Proposal

Based on these relationships, and drawing heavily on a proposal initiated by Denny, I propose the following data model for Wiktionary:

Data model

This data model involves the addition of five new entities and five new properties, arranged in the following hierarchy:

lexeme (L)
- language
- lexical category
- sense (S)
  - gloss
- form (F)
  - grammatical category
  - articulation (A)
  - representation (R)
    - script

The terms are defined below.

Entities

A lexeme represents the abstract concept of a word.
A sense represents the distinct abstract connotation of a word which gives the word its meaning.
A form represents the abstract concept of a word in a particular context.
An articulation represents the abstract sequence of sounds which make up a word.
A representation is a character or sequence of characters that discretely encodes a word in writing.

A lexeme has one or more forms and one or more senses. A form has one or more articulations and one or more representations.
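
The entity hierarchy and its "one or more" constraints can be sketched as plain data classes. This is only one possible encoding under stated assumptions: the class and field names mirror the terms above, but the use of strings for property values and the helper `make_bow` are hypothetical choices made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Representation:
    text: str
    script: str  # assumed to be a plain label here; could be an item ID


@dataclass
class Articulation:
    ipa: str


@dataclass
class Sense:
    glosses: Dict[str, str]  # language code -> gloss text


@dataclass
class Form:
    grammatical_category: str
    articulations: List[Articulation]
    representations: List[Representation]

    def __post_init__(self) -> None:
        # Each form carries at least one articulation and one representation.
        if not self.articulations or not self.representations:
            raise ValueError("form needs an articulation and a representation")


@dataclass
class Lexeme:
    language: str
    lexical_category: str
    senses: List[Sense]
    forms: List[Form]

    def __post_init__(self) -> None:
        # Each lexeme carries at least one sense and one form.
        if not self.senses or not self.forms:
            raise ValueError("lexeme needs a sense and a form")


def make_bow() -> Lexeme:
    """Build a small lexeme to show how the pieces nest."""
    singular = Form("singular", [Articulation("/boʊ/")],
                    [Representation("bow", "Latin")])
    sense = Sense({"de": "Ein besonderer Knoten mit zwei Bögen."})
    return Lexeme("English", "noun", [sense], [singular])
```

Constructing a `Form` with an empty list of articulations or representations raises `ValueError`, which is how this sketch enforces the cardinalities.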

Properties

A language is a complex system of communication which conveys meaning through the use of a rule-governed grammar.
A lexical category is...
A gloss is...
A grammatical category is...
A script is an organized set of characters used by a writing system to store and convey information.

Notes

Orthographic variants and transliterations in other scripts are handled simply by associating another representation with a particular form. They are treated equally and can all be indexed for search.
Translations are handled via senses. Any given sense will be associated with one or more lexemes (which each have a language), and each gloss can be freely translated into any Wikidata language.
The relationship between a lexeme and an item (Q) is orthogonal to this proposal and can be decided at a later date.
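
The note on translations can be illustrated with a small lookup: lexemes in different languages that share a sense are translations of one another. Everything in this sketch is an assumption for illustration, including the lexeme `L200`, the `lemma` field, the sense-to-lexeme table, and the function name `translations`.

```python
# Hypothetical links from a sense to the lexemes that carry it.
SENSE_TO_LEXEMES = {"S105": ["L100", "L200"]}

# Hypothetical lexeme records; "lemma" is a convenience field for display only.
LEXEMES = {
    "L100": {"language": "en", "lemma": "bow"},
    "L200": {"language": "de", "lemma": "Schleife"},
}


def translations(lexeme_id, sense_id):
    """Return {language: [lemmas]} for other-language lexemes sharing the sense."""
    source_language = LEXEMES[lexeme_id]["language"]
    result = {}
    for other_id in SENSE_TO_LEXEMES.get(sense_id, []):
        other = LEXEMES[other_id]
        if other["language"] != source_language:
            result.setdefault(other["language"], []).append(other["lemma"])
    return result
```

Because the link runs through the sense rather than directly between lexemes, adding a lexeme in a further language to the same sense would make it a translation of every existing one without any extra pairwise links.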

Example

lexeme :: L100
- language → English (Q1860)
- lexical category → noun (Q1084)
- sense → A type of knot with two loops, used to tie together two cords such as shoelaces or apron strings, and frequently used as decoration, such as in gift-wrapping. (S105)
- form (singular) → bow (F100)
- form (plural) → bows (F101)

sense :: S105
- gloss (en) : A type of knot with two loops, used to tie together two cords such as shoelaces or apron strings, and frequently used as decoration, such as in gift-wrapping.
- gloss (de) : Ein besonderer Knoten mit zwei Bögen.

form :: F100
- grammatical category → singular (Q110786)
- articulation (en-US) → /boʊ/ (A100)
- articulation (en-GB) → /bəʊ/ (A101)
- representation → bow (R100)

articulation :: A100
- IPA : /boʊ/
- X-SAMPA : /boU/
- phonetic respelling : bō

articulation :: A101
- IPA : /bəʊ/
- X-SAMPA : /b@U/
- phonetic respelling : bō

representation :: bow (R100)
- script → Latin script alphabet (Q8229)

representation :: bough (R101)
- script → Latin script alphabet (Q8229)
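
To show how the identifiers in the example refer to one another, the following sketch stores a few of the entities in a flat table keyed by ID and resolves the references from lexeme to articulation. The storage layout, the key names, and the function `ipa_for_lexeme` are assumptions for illustration; F101 is deliberately left out because the example does not spell it out.

```python
# Flat entity store keyed by the identifiers used in the example.
STORE = {
    "L100": {"language": "Q1860", "lexical category": "Q1084",
             "senses": ["S105"], "forms": ["F100", "F101"]},
    "F100": {"grammatical category": "Q110786",
             "articulations": ["A100", "A101"],
             "representations": ["R100"]},
    "A100": {"IPA": "/boʊ/", "X-SAMPA": "/boU/"},
    "A101": {"IPA": "/bəʊ/", "X-SAMPA": "/b@U/"},
    "R100": {"text": "bow", "script": "Q8229"},
}


def ipa_for_lexeme(lexeme_id, store=STORE):
    """Collect IPA strings for every form of the lexeme present in the store."""
    transcriptions = []
    for form_id in store[lexeme_id]["forms"]:
        form = store.get(form_id)
        if form is None:
            continue  # form referenced but not described, e.g. F101
        for articulation_id in form["articulations"]:
            transcriptions.append(store[articulation_id]["IPA"])
    return transcriptions
```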

Implementation

Some portions of this proposal have more consensus than others. Because the portions can be separated, they should be implemented in phases, which leaves room for further discussion of the more controversial aspects.

Phase 0: Interwiki links

The easiest and most immediate gain from moving Wiktionary to a Wikidata format comes from reusing the data model that Wiktionary already uses at its most basic level.

Each localized Wiktionary has pages whose names are based on representation. Phase 0 therefore consists only of importing the combined set of main-namespace pages from all Wiktionaries.

These can then be used to fill in any blanks in the interwiki link web, giving a single centralized location for interwiki links.
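
A minimal sketch of Phase 0, assuming each Wiktionary can supply a list of its main-namespace page titles: group the titles across wikis, then read the interwiki links for any page off the resulting table. The sample listings and the function names `build_interwiki` and `interwiki_links` are hypothetical.

```python
from collections import defaultdict

# Hypothetical page listings; real input would be each wiki's title list.
PAGES = {
    "en": ["bow", "bough"],
    "de": ["bow"],
    "fr": ["bow", "bough"],
}


def build_interwiki(pages_by_wiki):
    """Map each page title (a representation) to the wikis that have it."""
    sites = defaultdict(set)
    for wiki, titles in pages_by_wiki.items():
        for title in titles:
            sites[title].add(wiki)
    return sites


def interwiki_links(sites, title, from_wiki):
    """List the other wikis that have a page with the same title."""
    return sorted(sites.get(title, set()) - {from_wiki})


SITES = build_interwiki(PAGES)
```

With this table, a page missing from one wiki's interwiki list can be detected by comparing that list against `interwiki_links` for the same title.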

Data model

representation (R)
- script

Phase 1: Lexicons

This phase requires parsing the existing pages for some basic information. There appears to be clear consensus on the desired data model, but extracting the data will take some work.

The existing Wiktionary structure places language at the next level of the hierarchy under representation, and a lexeme is determined by splitting first on etymology and then on lexical category. For the most part, form and grammatical category are differentiated only by sense or by conjugation tables.

Data model

lexeme (L)
- language
- lexical category
- form (F)
  - grammatical category
  - representation (R)
    - script
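
The split described for Phase 1 can be sketched as a walk over an already-parsed page outline: one lexeme per combination of language, etymology, and lexical category. The outline below is invented for illustration, and parsing real wikitext into such an outline is the hard part that this sketch does not attempt; the function name `split_lexemes` and the dictionary keys are likewise assumptions.

```python
# Hypothetical parsed outline: language -> etymology -> lexical categories.
PAGE = {
    "title": "bow",
    "languages": {
        "English": {
            "Etymology 1": ["noun", "verb"],
            "Etymology 2": ["noun"],
        },
    },
}


def split_lexemes(page):
    """Yield one lexeme record per (language, etymology, lexical category)."""
    lexemes = []
    for language, etymologies in page["languages"].items():
        for etymology, categories in etymologies.items():
            for category in categories:
                lexemes.append({
                    "language": language,
                    "lexical category": category,
                    "etymology": etymology,
                    # The page title is the only representation known so far.
                    "representation": page["title"],
                })
    return lexemes


LEXEMES = split_lexemes(PAGE)
```

Forms and grammatical categories are not produced here, since the text notes that they would have to be recovered from senses or conjugation tables.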

Phase 2: Pronunciation guides

Further discussion is needed to reach consensus on articulation.

Phase 3: Dictionaries

Further discussion is needed to reach consensus on gloss.

Phase 4: Thesauruses

Further discussion is needed to reach consensus on sense.
