Wikidata talk:Identify problems with adding new languages into Wikidata


We would be very happy to have your feedback, in order to help us list the existing problems and find solutions.

Your previous experiences

If you have tried in the past to get a new language added to Wikidata, what worked for you, and what did not?

These are the findings that were already collected while talking to Wikidata community members:

  1. The current process for adding new languages (both for termbox and monolingual strings) is perceived as complex and lacking transparency, and it often gets stuck for a long time.
    1. The current process is to create a Phabricator task, a tool not all editors are familiar with.
    2. The request is then handled by the Language Committee, whose process lacks transparency. The requester doesn’t know the status of their request, or how and by whom the decision is made.
    3. From the outside, the LangCom doesn’t seem to have a shared opinion or guidelines about creating new languages, which makes it hard to get a clear and fast answer.
    4. Some guidelines have been built by the Wikidata community, but they are incomplete (no mention of lexicographical data) and still in draft, and the LangCom doesn’t rely on these criteria.
    5. The Language Committee is originally focused on deciding which new language versions of Wikipedias should be created, and their criteria are not necessarily adapted to Wikidata’s needs
    6. As a result of the previous points, the process sometimes gets stuck for months when the Language Committee doesn’t come to an agreement (for example, en-US).
  2. When a new language has been added for monolingual strings, it is not shown immediately in the suggestions (the user has to enter the language code and save the edit anyway). This bug makes it hard for users to understand that the language has been added and that they can use it. The problem happens for the following reason: the list of languages for monolingual strings is taken from an external database, Unicode CLDR, which collects, among other things, the names of languages in all languages. If the new language is not already in CLDR, the software has no way to know the name of the new language in the user’s interface language, and therefore displays nothing.
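To illustrate the mechanism in point 2, here is a minimal sketch of the lookup; the `CLDR_NAMES` table and function are hypothetical and not Wikibase's actual implementation, but the failure mode is the same: a code missing from CLDR yields no displayable name.

```python
# Hypothetical sketch of the CLDR-based name lookup described above.
# CLDR_NAMES[ui_language][language_code] -> localized language name
CLDR_NAMES = {
    "en": {"pt-br": "Brazilian Portuguese", "mt": "Maltese"},
    "de": {"pt-br": "Portugiesisch (Brasilien)", "mt": "Maltesisch"},
}

def suggestion_label(code, ui_language):
    """Return the name to show in the monolingual-text suggester,
    or None if CLDR has no name for this code -- the reported bug:
    the language is enabled, but nothing can be displayed for it."""
    return CLDR_NAMES.get(ui_language, {}).get(code.lower())

print(suggestion_label("pt-BR", "en"))    # Brazilian Portuguese
print(suggestion_label("xx-new", "en"))   # None -> no suggestion shown
```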

In your answer, please add details about what you were doing, which language you were requesting, and for which purpose (monolingual text, termbox, Lemma, etc.).

  • (Are we supposed to answer here in bullet point format?) There are a lot of multiple-script/dialect languages that don't have language codes for all of the variants used (for example, hak should probably have hak-Latn and hak-Hant variants, but Wikidata only has hak). I haven't proposed any "new" languages, though. Jc86035 (talk) 14:44, 4 December 2018 (UTC)
  • It's LangCom policy to forbid certain languages with valid IANA codes, like mn-Mong, because that language is only written differently, not spoken differently, than other Mongolian languages.
There's also no good reason why the language committee should have the authority to decide that we don't have en-US on Wikidata. ChristianKl 13:20, 5 December 2018 (UTC)

Your ideal process

If you’d like to add a language in the future, what would the ideal process look like for you?

In your answer, please add details about which process you're referring to and for which purpose (monolingual text, termbox, Lemma, etc.).

  • Here's my ideal process, which would allow the language for ALL purposes (monolingual text, termbox, lemma, gloss) except for language Wikipedias, which should continue to have special criteria for creation:
  1. Somebody creates an item for the language in Wikidata
  2. A special property is set for that item - maybe Wikimedia language code (P424).
  3. And then it just works...
In particular, I think the identifier for the language (behind the scenes at least) should be the Wikidata item ID. If somebody edits the item to remove the special property statement, that wouldn't invalidate old entries, but it might prevent the use of the language for new data input, or at least remove it from the drop-down list/search index of options. Maybe there should be special editing restrictions on the "special" property, as we have now for property creation, for example. But I think this approach of local management would be at least much more understandable to our users. ArthurPSmith (talk) 20:06, 4 December 2018 (UTC)
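As an illustration of the suggestion above, the list of usable languages could be derived from items carrying Wikimedia language code (P424). This sketch queries the public Wikidata SPARQL endpoint, which exists today; whether the Wikibase software could consume such a list internally is exactly the open question.

```python
# Sketch: derive a language list from items with P424 (Wikimedia
# language code) via the public Wikidata SPARQL endpoint.
import requests

QUERY = """
SELECT ?item ?itemLabel ?code WHERE {
  ?item wdt:P424 ?code .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 20
"""

resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "language-list-sketch/0.1 (example)"},
)
resp.raise_for_status()

for row in resp.json()["results"]["bindings"]:
    # e.g. http://www.wikidata.org/entity/Q9166  Maltese  mt
    print(row["item"]["value"], row["itemLabel"]["value"],
          row["code"]["value"])
```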
  • For me there isn't an ideal process because the ideal scenario would be that we already have everything.
I would like to see monolingual text and lexemes use the same list of languages. It should be easy to add valid codes to this list because in both cases, we will add the data whether we're allowed to use the right code or not.
Without making big technical changes, what I'd like to see is:
  • There's a help link when editing which tells me where to make the request (which should be somewhere on Wikidata) and which information should be included.
  • When I make a request, someone responds quickly (ideally on the same day, at least within a couple of days) and either:
    • confirms that the request is fine and will be added within the next few hours/days/weeks (it should not take longer than a few weeks),
    • or asks for more information if the request is incomplete or unclear in some way so that I can respond,
    • or explains why it can't/won't be added.
  • After the language code has been added, someone responds to my request to tell me that it's now available.
I think it's more complicated for terms because they are closely linked to interface languages. As far as I know, it's not possible to use terms in a particular language unless the UI is available in that language, and if people can't use the terms, it's hard to add and maintain them. Therefore, if we're going to allow more languages for terms, it should be possible to select those languages for the UI (even if the UI itself is completely untranslated).
- Nikki (talk) 00:12, 5 December 2018 (UTC)
  • I would favor that new languages get created analogously to how we create new properties. That process allows enough feedback before new items get created. As far as rules go, I would favor allowing languages that are registered with IANA. I see them as a good existing authority that we can reuse. ChristianKl 13:23, 5 December 2018 (UTC)
  • I am also in favour of using items for selecting languages. Or at least having a property that includes them automatically in the list of languages. There is no reason why it should be so complicated, and it goes against the wiki principle to have so many steps to add a language. --Micru (talk) 22:48, 11 December 2018 (UTC)
  • I strongly support ArthurPSmith's and Micru's arguments in favor of using items. I have also given other arguments here. Pamputt (talk) 18:08, 7 January 2019 (UTC)
  • To reduce process/bureaucracy, I wonder if it would be possible to add support for every human language with a valid BCP47 code unless it needs special attention (macrolanguages etc.). In the IANA language subtag registry for IETF BCP47, this would be every `Type: language` entry that does not carry `Deprecated:`, `Macrolanguage:` or `Scope:`; a sketch of this filtering follows at the end of this thread. — Sascha (talk) 16:18, 8 January 2019 (UTC)
    What I do not like about BCP47 is that it is yet another code, and like all codes it is only understandable by "advanced" users who know the code of the language they want to contribute in (or at least know how to find it). Using Wikidata items allows searching for a language by its name in one's mother language. This is clearly a big advantage IMHO. Pamputt (talk) 19:07, 9 January 2019 (UTC)
    Actually, Wikidata already uses IETF BCP47 for its language codes. For example, if someone enters some monolingual text in Brazilian Portuguese into today’s Wikidata, they’re entering a string that gets tagged with BCP47 code `pt-BR` (without knowing it). The proposal here is just about reducing process: instead of running each and every existing language through a committee, the suggestion is to add all the simple cases (those languages where IETF/ISO have assigned a language code) in bulk, so that users don’t have to get things approved. — Sascha (talk) 15:27, 11 January 2019 (UTC)
    I would support creating entries for all languages in the registry. I think if we use the language name in the registration (the first description), then even users who don't want to use the code will have no trouble finding the language that they mean. ChristianKl 21:21, 9 January 2019 (UTC)
    For many languages, translated names are available in the Common Locale Data Repository; if there’s interest, it should be possible to integrate them into Wikidata. — Sascha (talk) 15:27, 11 January 2019 (UTC)
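For the record, the filtering described above is mechanical enough to sketch in a few lines. The registry file is real and consists of records separated by `%%` lines, each record being `Field: value` lines; the sketch keeps only plain `Type: language` entries without `Deprecated:`, `Macrolanguage:` or `Scope:` fields, and keeps the first Description, in line with the naming suggestion above.

```python
# Sketch: extract the "simple" languages from the IANA language
# subtag registry, per the criteria discussed in this thread.
import requests

URL = ("https://www.iana.org/assignments/"
       "language-subtag-registry/language-subtag-registry")

def records(text):
    # Records are separated by lines containing only '%%'; each record
    # is a series of 'Field: value' lines (continuation lines start
    # with whitespace and are ignored here for simplicity).
    for block in text.split("\n%%\n"):
        rec = {}
        for line in block.splitlines():
            if ":" in line and not line.startswith(" "):
                field, _, value = line.partition(":")
                rec.setdefault(field.strip(), value.strip())
        yield rec

text = requests.get(URL).text
simple = [
    r for r in records(text)
    if r.get("Type") == "language"
    and not any(f in r for f in ("Deprecated", "Macrolanguage", "Scope"))
]
print(len(simple), "candidate languages")
print(simple[0].get("Subtag"), "-", simple[0].get("Description"))
```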

Other

If you have other questions or suggestions, feel free to add them here.

  • Since other people are asking for it: I don't like having to search for a language in items when adding lexemes. It's slow, since it has to query the server to find results, and the results are unpredictable and include lots of irrelevant stuff. For example, when I enter the language code "mt" for Maltese, the first three suggestions are Malta, Montana and Mato Grosso, none of which are languages. I often accidentally select things which aren't languages. Because of all that, I avoid using the add lexeme page and either create lexemes using other tools, or use links with pre-filled fields so that I only have to click the create button. I'm concerned that switching to an item search for languages in more places will make it even harder to enter things. We already have problems with using items (e.g. it's impossible to add lexemes for some languages - phab:T209282) and, if we're going to go further down that route, I think there needs to be consideration of how it's going to be protected against vandalism, of people adding invalid codes either accidentally (e.g. because they picked the wrong property) or on purpose (e.g. making up codes), of how it will avoid using withdrawn codes, and of what happens when people change or remove codes from items. - Nikki (talk) 09:39, 12 December 2018 (UTC)
  • Could it be documented how long it usually takes to process a Phabricator ticket for supporting a new language? And how to help out? I filed phab:T210293 last November, assuming it would be handled in a few days. Admittedly, the silence is a bit frustrating... To manage expectations, it would help to know the typical processing time for such tickets, and whether there’s any way to speed things up. (If somebody told me what to do, I’d gladly do the work myself; other users might be in a similar situation.) — Sascha (talk) 19:37, 11 January 2019 (UTC)

Wikimedia policy on the use of languages

The longstanding policy for the creation of new Wikimedia projects requires an ISO 639-3 code that indicates a single language explicitly associated with the project.

When codes are needed for lexicographic purposes to identify a specific language, script, orthography and/or dialect, combinations of existing codes are to be used (see the sketch after this comment). The standards are flexible enough to express explicitly what is meant by such a code.

One of the tenets of our projects is that we do not engage in original research. Consequently, when something is concocted that is not supported elsewhere, it is in violation of this basic rule. Thanks, GerardM (talk) 14:29, 7 January 2019 (UTC)

NB: there is no problem in associating a Wikidata item with a specific construction of a code. There is one basic requirement: the code, and consequently the labels used for it, need to be unique.
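As a concrete illustration of such combinations, BCP 47 composes codes from existing subtags in the order language[-script][-region]; the examples below reuse codes already mentioned on this page, and the helper function exists only for demonstration.

```python
# Illustration: BCP 47 expresses script/orthography/dialect
# distinctions by combining existing subtags, so no new codes
# need to be invented.
def bcp47(language, script=None, region=None):
    return "-".join(part for part in (language, script, region) if part)

print(bcp47("hak", script="Latn"))   # hak-Latn: Hakka, Latin script
print(bcp47("hak", script="Hant"))   # hak-Hant: Hakka, Traditional Han
print(bcp47("mn", script="Mong"))    # mn-Mong: Mongolian, Mongolian script
print(bcp47("pt", region="BR"))      # pt-BR: Brazilian Portuguese
```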
  • I wonder if it would make sense to change the policy from ISO 639-3 to IETF BCP47. BCP47 supports finer-grained language distinctions such as language variants, or variants for country subdivisions/provinces/states via BCP47 extension U. Also, all the modern internet standards (such as XML, HTML, PDF, RDF, etc.) have switched to BCP47. — Sascha (talk) 16:22, 8 January 2019 (UTC)
    • There is always the problem of languages that are not in those standards, and may never be. —Rua (mew) 13:55, 10 January 2019 (UTC)
      When Wikimedia’s language committee gets a request for a new language (which does not have an ISO code, and cannot be modeled with an IETF language tag either), do people talk to the ISO/IETF registration authorities? These registries are not set in stone; if something is missing, it can be added. — Sascha (talk) 15:42, 11 January 2019 (UTC)
  • It might be worth mentioning that GerardM might have a conflict of interest when commenting on this, as the founder of a competing dictionary project. --- Jura 13:53, 26 January 2019 (UTC)

Wrap-up

Hello all,

Thanks to everyone who added input and answers. I'll try to sum it up, and to suggest some ways of improvement.

Issues you mentioned
  • many multiple-script/dialect languages are missing (e.g. hak-Latn and hak-Hant)
  • LangCom refuses to add some languages that we would need (e.g. en-US and mn-Mong)
  • the process is very long and unclear, and sometimes no answer is provided (e.g. phab:T210293)
  • looking for items in the entity suggester is currently not working so well (e.g. typing "mt" doesn't return Maltese)


Your suggestions
  • use items with a special property (e.g. Wikimedia language code (P424)) for monolingual text, termbox, lemma and gloss
  • provide better documentation when requesting a language (help link)
  • have a more efficient process, with someone acknowledging the request within a few days
  • create a request process similar to the one we have for properties
  • allow all languages that are registered with IANA
  • allow all languages that have a valid BCP47 code
  • integrate the data from CLDR into Wikidata


Ideas to move forward
  • One of the issues seems to be that the LangCom is not applying Wikidata's rules, but at the same time, the rules described here are still marked as "work in progress and not finished policy". Maybe a first and easy step would be to clean up and discuss these rules, so the community can turn them into a real policy, which would make it easier to enforce. You could, for example, start an RfC.
  • Once the community discussion is closed and the documentation updated, you could contact the committee to inform them about the decision
  • In order to prepare a possible switch to using items, if you find any issues with the entity suggester, you can report them on Wikidata:Suggester ranking input

If I forgot something important, or if you have anything else to discuss, feel free to add comments below. Thanks, Lea Lacroix (WMDE) (talk) 14:51, 13 March 2019 (UTC)
