Wikidata:WikiProject Toki Pona

The goal of this WikiProject is to build a complete representation of all lexemes in the constructed language Toki Pona (Q36846), using the lexeme data structure in as exemplary a way as possible. This would probably be the first (and last) language represented in Wikidata in its entirety! :)

Toki Pona is interesting because its words can often be used in several lexical categories, have many different senses, and never have more than one form.

Looking at the data

Example lexemes: toki (L220792), pona (L220753)

You can look at the current state of the data using these sites and queries:

Simple (incomplete) noun dictionaries, built by following the item for this sense (P5137) links:

Tasks

Create lexemes for all Toki Pona words from the official book, in at least one lexical category. Done
Make sure all nouns have an item for this sense (P5137), or a translation (P5972) if there's no matching item. Doing… Query, 92% done as of 2019-11-1
Add item for this sense (P5137) or translation (P5972) to all non-noun lexemes.
Add IPA transcription (P898) to all forms. Query, 0.5% done as of 2019-10-28
Find Commons-compatible recordings of all Toki Pona words or record them ourselves. Upload them to Commons, and link them from pronunciation audio (P443).
Add at least one usage example (P5831) to all lexemes.
Identify and tag all pairs of synonym (P5973) and antonym (P5974). Doing… There's no clear indicator of completion, but as of 2019-11-03, there are 15 pairs of antonyms and 4 pairs of synonyms. You can check whether the properties are symmetric using these queries (they shouldn't return anything): antonyms, synonyms
Change glosses of all verbs' senses so that they say "talk" instead of "to talk". Done Query (shouldn't return anything)
Add derived from lexeme (P5191) to lexemes where the derivation is clear. Wiktionary has some ideas.
Link all homograph lexeme (P5402) to each other, so editors know that their forms should have the same properties. Done
Add described by source (P1343) Toki Pona: The Language of Good (Q73205450) for all lexemes from the Toki Pona book.
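
The symmetry check mentioned in the synonym/antonym task can also be run offline on an export of the links. The following is a minimal sketch, assuming the links have already been exported as (source sense, target sense) pairs; the function name and the sense identifiers are hypothetical and used only for illustration.

```python
from typing import Iterable, Set, Tuple

Pair = Tuple[str, str]


def asymmetric_links(links: Iterable[Pair]) -> Set[Pair]:
    """Return every (source, target) link whose reverse link is missing."""
    link_set = set(links)
    return {(a, b) for (a, b) in link_set if (b, a) not in link_set}


# Hypothetical sense identifiers, not real Wikidata IDs.
antonym_links = [
    ("sense_a", "sense_b"),
    ("sense_b", "sense_a"),
    ("sense_c", "sense_d"),  # the reverse link is missing
]

missing_reverse = asymmetric_links(antonym_links)
print(missing_reverse)
```

Like the linked queries, the function should return nothing (an empty set) once every synonym and antonym link has its counterpart.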

Resources

Problems

A lot of Toki Pona words are used in multiple lexical categories, and thus, are represented as separate lexemes. Because of that, all properties of their forms have to be replicated across these lexemes, leading to a lot of redundancy. When one form is edited, the others are not kept in sync. Can we solve that problem somehow?
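
Until there is a structural solution, drift between homograph lexemes could at least be detected. The following is a rough sketch, assuming the claims of each homograph's form have been exported into plain dictionaries keyed by property ID; the function name, the form labels and the sample values are hypothetical.

```python
from typing import Dict, List

# Claims on one form: property ID -> list of values.
FormClaims = Dict[str, List[str]]


def out_of_sync_properties(forms: Dict[str, FormClaims]) -> Dict[str, Dict[str, List[str]]]:
    """Map each property whose values differ between forms to its per-form values."""
    all_props = set()
    for claims in forms.values():
        all_props.update(claims)

    report = {}
    for prop in sorted(all_props):
        per_form = {fid: sorted(claims.get(prop, [])) for fid, claims in forms.items()}
        # More than one distinct value list means the forms have drifted apart.
        if len({tuple(values) for values in per_form.values()}) > 1:
            report[prop] = per_form
    return report


# Hypothetical export: the same word as a noun lexeme and as a verb lexeme.
homograph_forms = {
    "noun-form": {"P898": ["toki"], "P443": ["example.ogg"]},
    "verb-form": {"P443": ["example.ogg"]},
}

drift = out_of_sync_properties(homograph_forms)
print(drift)
```

Here the report would flag IPA transcription (P898) as present on one form but missing on the other, while pronunciation audio (P443) is in sync and is not listed.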

Questions

Can we enable the language shorthand "tok" in Wikidata? Currently, we're using "mis-x-Q36846", like it's done for ama/𒂼 (L1). Question asked here: Wikidata_talk:Lexicographical_data#Process_for_adding_a_new_language_code? Not done
- Seems to be feasible – I opened a ticket. blinry (talk) 18:14, 30 October 2019 (UTC)
  - The Language Committee already answered, but declined the request. :( blinry (talk) 09:51, 31 October 2019 (UTC)
    - Done Robin van der Vliet (talk) (contribs) 22:48, 17 January 2023 (UTC)
Should non-noun senses have an item for this sense (P5137)? Asked here: Wikidata_talk:Lexicographical_data#Best_practices
- There's currently no consensus on this. blinry (talk) 18:14, 30 October 2019 (UTC)
  - Ah, looking at the talk page for P5137, there seems to be support for this! Let's go ahead and try it. --blinry (talk) 18:23, 30 October 2019 (UTC)[reply]
How could we add sitelen pona (or other scripts) to the lexemes?
Is preverb (Q1552433) the correct lexical category for lexemes like kama (L220663)? The official Toki Pona book lists it as "pre-verb", but maybe we should create a custom category for these words?
Is adjective (Q34698) the correct lexical category for lexemes like pona (L220753)? All adjectives in Toki Pona can be used as adverbs, essentially. Should grammatical modifier (Q732699) be used instead?
Should we split transitive and intransitive meanings of verbs?
- It seems to be common practice to use the broadest possible category. blinry (talk) 18:14, 30 October 2019 (UTC)
- Intransitive verbs are nearly indistinguishable from modifiers (in the book, the author calls them both adjectives).
Because parts of speech for content words are flexible in Toki Pona, should we drop the standard lexical categories found in English and other languages (noun, adjective, verb, etc.) and replace them with a more general lexical category like content word (Q789016)? That would also help with the problem of multiple entries for one word with different lexical categories.
From what I read so far, Toki Pona words match exactly their corresponding IPA representation. Is there any risk of adding IPA transcription (P898) to all lexemes in bulk? -- Pedropaulovc (talk) 05:06, 2 November 2019 (UTC)
- I see no problems with that! Wasn't aware that words exactly denote their IPA, that's pretty cool! But this article confirms that this is in fact the "norm pronunciation". Thanks! :) blinry (talk) 11:11, 3 November 2019 (UTC)
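
A bulk run along those lines could start from a small guard like the one below. This is only a sketch under the assumption stated in the discussion above, namely that the spelling of a word is also its IPA transcription; the function name is hypothetical, stress marks are deliberately left out, and anything outside the 14 lowercase letters of the Toki Pona alphabet is rejected so it does not get transcribed blindly.

```python
# The 14 letters of the Toki Pona alphabet.
TOKI_PONA_LETTERS = set("aeijklmnopstuw")


def ipa_from_spelling(word: str) -> str:
    """Return the word unchanged as its IPA string, or raise if it contains other characters."""
    if not word or not set(word) <= TOKI_PONA_LETTERS:
        raise ValueError(f"not a plain Toki Pona spelling: {word!r}")
    # Assumption from the discussion above: spelling and IPA coincide.
    return word


for lemma in ["toki", "pona", "kama"]:
    print(lemma, "->", ipa_from_spelling(lemma))
```

Rejecting unexpected input rather than passing it through keeps a bulk edit from writing a wrong IPA transcription (P898) value for a representation that slipped in with capital letters or foreign characters.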

Potential applications

Create Anki decks from the data (later maybe even for sitelen pona?).
Render pretty dictionaries in many languages.
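
As a starting point for the Anki idea, lemma and gloss pairs can be written to a tab-separated text file, which Anki can import as notes. The following is a minimal sketch; the function name is hypothetical and the glosses are illustrative placeholders rather than values taken from Wikidata.

```python
import csv
import io
from typing import Iterable, TextIO, Tuple


def write_anki_tsv(entries: Iterable[Tuple[str, str]], out: TextIO) -> None:
    """Write one tab-separated line per entry: front (lemma), then back (gloss)."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for lemma, gloss in entries:
        writer.writerow([lemma, gloss])


# Illustrative entries only; real glosses would come from the lexeme senses.
entries = [
    ("toki", "language, speech"),
    ("pona", "good, simple"),
]

buffer = io.StringIO()
write_anki_tsv(entries, buffer)
print(buffer.getvalue())
```

Writing to any text stream keeps the sketch easy to try out; for a real deck, pass a file opened with UTF-8 encoding instead of the in-memory buffer.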

Lexical categories

These are the lexical categories currently used for Toki Pona lexemes:

Properties

These are the most important properties for lexemes:

Lexeme-level

Sense-level

Form-level

Participants

blinry (talk)
Robin van der Vliet (talk) (contribs)
Pedropaulovc (talk)
Jens Ohlig (talk)
--Tinker Bell ★ ♥