Wikidata:Lexicographical data/Development/Proposals/2014-10

Introduction

 
Word relationships

There are essentially three ways of getting at a word (lexeme):

  • written representation
  • pronunciation (articulation)
  • meaning (sense)

These relationships form the basis of a number of resources:

  • dictionary (representation → sense)
  • reverse dictionary (sense → representation)
  • thesaurus (representation → sense → representation)
  • rhyming dictionary (representation → articulation → representation)
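
As a rough illustration of how these four resources fall out of the same three relationships, the sketch below treats each word as a (representation, articulation, sense) triple and builds each lookup by chaining the relationships listed above. The sample triples and all function names are invented for illustration and are not part of the proposal.

```python
# Hypothetical toy data: (representation, articulation, sense) triples.
WORDS = [
    ("bow", "/boʊ/", "knot with two loops"),
    ("bow", "/baʊ/", "bend forward in respect"),
    ("bough", "/baʊ/", "large branch of a tree"),
    ("branch", "/bræntʃ/", "large branch of a tree"),
]


def dictionary(representation):
    """representation -> sense"""
    return sorted({s for r, _, s in WORDS if r == representation})


def reverse_dictionary(sense):
    """sense -> representation"""
    return sorted({r for r, _, s in WORDS if s == sense})


def thesaurus(representation):
    """representation -> sense -> representation"""
    senses = {s for r, _, s in WORDS if r == representation}
    return sorted({r for r, _, s in WORDS if s in senses and r != representation})


def rhyming_dictionary(representation):
    """representation -> articulation -> representation

    Simplification: only identical articulations are matched here, so this
    finds homophones. A real rhyming dictionary would compare word endings.
    """
    sounds = {a for r, a, _ in WORDS if r == representation}
    return sorted({r for r, a, _ in WORDS if a in sounds and r != representation})


print(dictionary("bow"))
print(thesaurus("bough"))
print(rhyming_dictionary("bough"))
```

Each lookup is just a different path through the same data, which is the point of separating the three relationships.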

Proposal

Based on these relationships, and drawing heavily on a proposal initiated by Denny, I propose the following data model for Wiktionary:

Data model

This data model involves the addition of five new entities and five new properties, arranged in the following hierarchy:

  • lexeme (L)
    • language
    • lexical category
    • sense (S)
      • gloss
    • form (F)
      • grammatical category
      • articulation (A)
      • representation (R)
        • script

The terms are defined below.

Entities

  • A lexeme represents the abstract concept of a word.
  • A sense represents the distinct abstract connotation of a word which gives the word its meaning.
  • A form represents the abstract concept of a word in a particular context.
  • An articulation represents the abstract sequence of sounds which make up a word.
  • A representation is a character or sequence of characters that discretely encodes a word in writing.

A lexeme has one or more forms and one or more senses. A form has one or more articulations and one or more representations.
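
One possible way to picture the hierarchy and its cardinality rules is as a set of nested record types. The sketch below is only an illustration: the class layout, the field names, and the `cardinality_errors` helper are assumptions made for this example, not a definition of how Wikidata would store the entities.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Representation:
    text: str
    script: str


@dataclass
class Articulation:
    ipa: str


@dataclass
class Sense:
    # Assumed shape: language code -> gloss text.
    glosses: Dict[str, str] = field(default_factory=dict)


@dataclass
class Form:
    grammatical_category: str
    articulations: List[Articulation] = field(default_factory=list)
    representations: List[Representation] = field(default_factory=list)


@dataclass
class Lexeme:
    language: str
    lexical_category: str
    senses: List[Sense] = field(default_factory=list)
    forms: List[Form] = field(default_factory=list)


def cardinality_errors(lexeme: Lexeme) -> List[str]:
    """Report every violated "one or more" rule (hypothetical helper)."""
    errors = []
    if not lexeme.forms:
        errors.append("lexeme has no forms")
    if not lexeme.senses:
        errors.append("lexeme has no senses")
    for index, form in enumerate(lexeme.forms):
        if not form.articulations:
            errors.append(f"form {index} has no articulations")
        if not form.representations:
            errors.append(f"form {index} has no representations")
    return errors
```

A lexeme with a sense and a fully populated form yields an empty error list; a lexeme whose only form has neither an articulation nor a representation, and which has no sense, yields three errors.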

Properties

  • A language is a complex system of communication which conveys meaning through the use of a rule-governed grammar.
  • A lexical category is...
  • A gloss is...
  • A grammatical category is...
  • A script is an organized set of characters used by a writing system to store and convey information.

Notes

  • Orthographic variants and transliterations in other scripts are handled simply by associating another representation with a particular form. They are treated equally and can all be indexed for search.
  • Translations are handled via senses. Any given sense will be associated with one or more lexemes (which each have a language), and each gloss can be freely translated into any Wikidata language.
  • The relationship between a lexeme and an item (Q) is orthogonal to this proposal and can be decided at a later date.
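
To make the translation note more concrete, here is a minimal sketch of finding translations by going from a lexeme to its senses and back out to other lexemes in the target language. The tables, the German lexeme L200, the `lemma` shortcut (standing in for a form's representation), and the function name are all hypothetical.

```python
# Hypothetical link table: sense ID -> IDs of lexemes that share that sense.
SENSE_LEXEMES = {
    "S105": ["L100", "L200"],
}

# Hypothetical lexeme records. "lemma" is a shortcut for this sketch only;
# in the proposed model the written word would come from a form's representation.
LEXEMES = {
    "L100": {"language": "en", "lemma": "bow"},
    "L200": {"language": "de", "lemma": "Schleife"},
}


def translations(lexeme_id, target_language):
    """Return lemmas in target_language that share a sense with lexeme_id."""
    found = set()
    for lexeme_ids in SENSE_LEXEMES.values():
        if lexeme_id not in lexeme_ids:
            continue
        for other_id in lexeme_ids:
            other = LEXEMES[other_id]
            if other_id != lexeme_id and other["language"] == target_language:
                found.add(other["lemma"])
    return sorted(found)


print(translations("L100", "de"))
```

No translation table is stored anywhere in this sketch; translations are derived entirely from shared senses.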

Example

  • lexeme :: L100
    • language → English (Q1860)
    • lexical category → noun (Q1084)
    • sense → A type of knot with two loops, used to tie together two cords such as shoelaces or apron strings, and frequently used as decoration, such as in gift-wrapping. (S105)
    • form (singular) → bow (F100)
    • form (plural) → bows (F101)
  • sense :: S105
    • gloss (en) : A type of knot with two loops, used to tie together two cords such as shoelaces or apron strings, and frequently used as decoration, such as in gift-wrapping.
    • gloss (de) : Ein besonderer Knoten mit zwei Bögen.
  • form :: F100
    • grammatical category → singular (Q110786)
    • articulation (en-US) → /boʊ/ (A100)
    • articulation (en-GB) → /bəʊ/ (A101)
    • representation → bow (R100)
  • articulation :: A100
    • IPA : /boʊ/
    • X-SAMPA : /boU/
    • phonetic respelling : bō
  • articulation :: A101
    • IPA : /bəʊ/
    • X-SAMPA : /b@U/
    • phonetic respelling : bō
  • representation :: bow (R100)
    • script → Latin script alphabet (Q8229)
  • representation :: bough (R101)
    • script → Latin script alphabet (Q8229)
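
The example can also be read as a flat store of entities keyed by ID, in which arrows become ID references. The sketch below encodes part of it that way and follows the references from a lexeme to its written and spoken forms. The dictionary layout, the `text` key, and the function name are assumptions for illustration; F101 is skipped because the example references it without spelling it out.

```python
# Part of the example above, keyed by entity ID (layout is an assumption).
ENTITIES = {
    "L100": {
        "language": "Q1860",
        "lexical category": "Q1084",
        "sense": ["S105"],
        "form": ["F100", "F101"],
    },
    "F100": {
        "grammatical category": "Q110786",
        "articulation": {"en-US": "A100", "en-GB": "A101"},
        "representation": ["R100"],
    },
    "A100": {"IPA": "/boʊ/", "X-SAMPA": "/boU/"},
    "A101": {"IPA": "/bəʊ/", "X-SAMPA": "/b@U/"},
    "R100": {"text": "bow", "script": "Q8229"},
}


def written_and_spoken(lexeme_id):
    """List (representation, variety, IPA) rows for every known form."""
    rows = []
    for form_id in ENTITIES[lexeme_id]["form"]:
        form = ENTITIES.get(form_id)
        if form is None:
            # Referenced but not defined in the example (e.g. F101).
            continue
        for representation_id in form["representation"]:
            text = ENTITIES[representation_id]["text"]
            for variety, articulation_id in form["articulation"].items():
                rows.append((text, variety, ENTITIES[articulation_id]["IPA"]))
    return rows


for row in written_and_spoken("L100"):
    print(row)
```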

Implementation

Some portions of this proposal have more consensus than others. Because the portions can be separated, they should be implemented in phases, so that discussion can continue on the more controversial aspects.

Phase 0: Interwiki links

The easiest and most immediate gain from moving Wiktionary to a Wikidata format comes from reusing the data model that Wiktionary already uses at its most basic level.

Each localized Wiktionary has pages whose names are based on representation. Phase 0 therefore consists only of importing the combined set of main-namespace pages from all Wiktionaries.

These can then be used to fill in any blanks in the interwiki link web, giving a single centralized location for interwiki links.
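
A minimal sketch of the Phase 0 idea, assuming the import produces a list of main-namespace page titles per Wiktionary: the titles are merged into one index, and the interwiki links for a page are simply the other wikis that have a page with the same title. The input shape, the sample titles, and the function names are hypothetical.

```python
from collections import defaultdict


def build_interwiki_index(pages_by_wiki):
    """Map each page title (representation) to the set of wikis that have it."""
    index = defaultdict(set)
    for wiki, titles in pages_by_wiki.items():
        for title in titles:
            index[title].add(wiki)
    return dict(index)


def interwiki_links(index, title, from_wiki):
    """Wikis other than from_wiki that have a page with the same title."""
    return sorted(index.get(title, set()) - {from_wiki})


# Hypothetical sample input: wiki language code -> main-namespace page titles.
pages = {
    "en": ["bow", "bough"],
    "de": ["bow", "Schleife"],
    "fr": ["bow"],
}

index = build_interwiki_index(pages)
print(interwiki_links(index, "bow", "en"))
print(interwiki_links(index, "bough", "en"))
```

Because the index is keyed by title, a link missing from one wiki's page is filled in automatically once any other wiki has the same title.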

Data model

  • representation (R)
    • script

Phase 1: Lexicons

This phase requires parsing the existing pages for some basic information. There appears to be clear consensus on the desired data model, but extracting the data will take some work.

Existing Wiktionary structure has language as the next level of hierarchy under representation, with lexeme being determined by a split on etymology and then lexical category. For the most part, form and grammatical category are only differentiated by sense or conjugation tables.

Data model

  • lexeme (L)
    • language
    • lexical category
    • form (F)
      • grammatical category
      • representation (R)
        • script

Phase 2: Pronunciation guides

More discussion must be had to reach consensus on articulation.

Phase 3: Dictionaries

More discussion must be had to reach consensus on gloss.

Phase 4: Thesauruses

More discussion must be had to reach consensus on sense.

See also

Wiktionary

Wikipedia