Wikidata:Property proposal/Unicode character (item)

Unicode character (item)

Originally proposed at Wikidata:Property proposal/Generic

Not done

Description	Unicode character representing the item
Data type	Item
Domain	instance of (P31) → subclass of (P279)* → letter (Q9788)
Example 1	A (Q9659) → See Q9659#P487, each value will be a new item
Example 2	B (Q9705) → See Q9705#P487, each value will be a new item
Example 3	C (Q9820) → See Q9820#P487, each value will be a new item
Planned use	New items will be created for the current values of Unicode character (P487) on items that are instance of (P31) letter (Q9788). The current values of Unicode character (P487) and Unicode code point (P4213) will be moved to these new items.
See also	code (P3295), Unicode character (P487) and Unicode code point (P4213)

Motivation

Currently Wikidata does not distinguish between abstract symbols such as A (Q9659) and the specific characters representing those symbols. So it may be meaningful to create new items for these Unicode characters. Unicode character (P487) and Unicode code point (P4213) will be moved to these new items, and then a single value constraint will be set.

Note that this does not affect any item about a single character, like 雨 (Q3595028). GZWDer (talk) 22:02, 11 March 2020 (UTC)[reply]
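The planned migration can be pictured with a small sketch. It is only an illustration of the data model described above, not an actual bot or the Wikidata API: the class names `LetterItem` and `CharacterItem`, the function `migrate`, and the hexadecimal format used for the code point are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CharacterItem:
    """Hypothetical new item that holds exactly one Unicode character."""
    label: str
    unicode_character: str  # the single Unicode character (P487) value
    code_point: str         # the single Unicode code point (P4213) value; hex format is assumed


@dataclass
class LetterItem:
    """Abstract letter such as A (Q9659)."""
    qid: str
    p487_values: List[str] = field(default_factory=list)
    # Values of the proposed "Unicode character (item)" property (no ID assigned).
    character_items: List[CharacterItem] = field(default_factory=list)


def migrate(letter: LetterItem) -> LetterItem:
    """Move every P487 value of a letter onto its own character item."""
    for ch in letter.p487_values:
        if len(ch) != 1:
            raise ValueError(f"expected a single code point, got {ch!r}")
        letter.character_items.append(
            CharacterItem(label=ch, unicode_character=ch, code_point=f"{ord(ch):04X}")
        )
    letter.p487_values = []  # the letter item no longer carries P487 directly
    return letter


letter_a = migrate(LetterItem(qid="Q9659", p487_values=["A", "a"]))
for item in letter_a.character_items:
    print(item.unicode_character, item.code_point)
```

After the move, each character item carries one P487 and one P4213 value, which is what would make the single value constraint possible, while the letter item can point to several character items.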

Discussion

Comment How will this be different from Unicode character (P487)? And how come the name of the property is "Unicode character (item)" - why is there "(item)" in it? Iwan.Aucamp (talk) 16:37, 12 March 2020 (UTC)[reply]
- ~~Oppose~~ I get it now. I don't think this makes much sense. For one, the name does not make sense; if this were to be done, it should rather be called "represented by character set code", so that it is broad enough to cover Morse code (Q79897) as well as The Unicode® Standard (Q8819). But I think you first have to try to get some consensus on the data model here. Iwan.Aucamp (talk) 16:43, 12 March 2020 (UTC)[reply]
  - Removed my oppose Iwan.Aucamp (talk) 13:03, 13 April 2020 (UTC)[reply]
- The disambiguator is needed as we can't have two identical labels for properties. --- Jura 21:30, 12 March 2020 (UTC)[reply]
Comment Do we have any items for individual unicode codepoints right now? I don't see that there's a real structural need for an item for each one, but some of them may be notable in themselves. On the other hand, for those notable ones (none of the "A's" count I think) what would point to them with a property like this? ArthurPSmith (talk) 17:26, 12 March 2020 (UTC)[reply]
Comment @ArthurPSmith: See Wikidata:Requests for permissions/Bot/GZWDer (flood) 3. In most cases, I do not want to reuse items like 𐓃 (Q66360724) for a Unicode character, as the item does not correspond to a single Unicode character (it corresponds to two, U+104C3 and U+104EB). This property will not have a single value or unique value constraint, while (after migration) Unicode character (P487) and Unicode code point (P4213) will. "structural need for an item for each one": Each specific character itself has various properties (Unicode character property (Q1853267)) that cannot be expressed without a dedicated item (example).--GZWDer (talk) 17:44, 12 March 2020 (UTC)[reply]
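For readers who want to check the two code points mentioned above, here is a minimal sketch using Python's standard `unicodedata` module. It assumes a Python build whose Unicode database is version 9.0 or later (Python 3.6+), since the Osage block was added in that version; the variable names are only for illustration.

```python
import unicodedata

upper = "\U000104C3"  # first code point mentioned above
lower = "\U000104EB"  # second code point mentioned above

for ch in (upper, lower):
    # Print the code point, its general category, and its character name.
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch, "<unnamed>"))

# The two code points are each other's case mappings.
print(upper.lower() == lower, lower.upper() == upper)
```

The general category and the case mapping are examples of per-character properties that differ between the two code points, even though both belong to the same letter.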
Comment @Iwan.Aucamp: This is really not a new idea, and there has been some discussion. I am only going to work on this now (after a break of 18 months) because Unicode 13.0 was released recently.--GZWDer (talk) 17:51, 12 March 2020 (UTC)[reply]
- Comment @GZWDer: I cleaned up the proposal a bit. I still think this should be more generic though; at the very least I would like to see some real example items. Presumably the subject items would be instance of (P31) Unicode character (Q29654788), which is a subclass of (P279) character (Q32483)? Could we not rather call this property "codepoint (item)" (see Code point and maybe also code (P3295)), so that it is a bit broader than just The Unicode® Standard (Q8819)? If so I would change to support; I get the need. Iwan.Aucamp (talk) 18:28, 12 March 2020 (UTC)[reply]
- Comment @GZWDer: "Character" is such an overloaded term, and the existing labels lead to confusion. Just explaining that a Unicode character is not *just* a Latin alphabetical character is hard! Please try to disambiguate the meaning upfront, with something along the lines of Unicode Code Point (item), Unicode Abstract Character (item), and Unicode Character Encoding (external-identifier | string). -indo (talk) 02:34, 5 November 2020 (UTC)[reply]
  - @Iwan.Aucamp, Indolering: In this proposal, it means "abstract characters encoded by the Unicode Standard".--GZWDer (talk) 05:05, 5 November 2020 (UTC)[reply]
  - @Iwan.Aucamp: What does that mean? Is there a list of abstract characters somewhere? I'm not being facetious, I'm learning Unicode and this is driving me crazy! Here are all of the potential definitions I could find:
    - Unicode Glossary definition for Abstract Character: "A unit of information used for the organization, control, or representation of textual data. (See definition D7 in Section 3.4, Characters and Encoding.)"
    - Definition 7 in Section 3.4 "A unit of information used for the organization, control, or representation of textual data." followed by a list of vague bullet points, including the following:
      - "The abstract characters encoded by the Unicode Standard are known as Unicode abstract characters."
      - "Abstract characters not directly encoded by the Unicode Standard can often be represented by the use of combining character sequences."
    - The example in Chapter 2 Section 4 figure 2.5 shows how 3 different combinations of code points all map to a single "abstract character", which (at least in this instance) appears to be a single common grapheme.
      - But that can't be right, because of this bullet point in Section 3.4 definition 7, "An abstract character does not necessarily correspond to what a user thinks of as a “character” and should not be confused with a grapheme."
    - This Character Encoding Model technical note mentions something close to what you are looking for, an Abstract Character Repertoire, "defined as a set of abstract characters to be encoded, normally a familiar alphabet or symbol set. The word abstract just means that these objects are defined by convention."
    - But you don't want something that is defined by convention, you want the encoded abstract characters, which sound like a Coded Character Set, "defined to be a mapping from a set of abstract characters to the set of non-negative integers. ... The Unicode concept of the Unicode scalar value (cf. D28, in chapter 3 of the Unicode Standard) is explicitly this code point, used for mapping of the Unicode repertoire."
    - But that's not it either, as a Unicode Scalar Value is just, "Any Unicode code point except high-surrogate and low-surrogate code points. In other words, the ranges of integers 0 to D7FF₁₆ and E000₁₆ to 10FFFF₁₆ inclusive."
  - Please correct me if I am wrong, but I don't think there is a set of "encoded abstract characters"; there are just code points (the Universal Coded Character Set) and transformation rules (Universal Transformation Format) which can be used to map between code points that one might consider equivalent. If that's the case, then you shouldn't have a Unicode Abstract Character item at all and should use Unicode Code Point terminology instead.
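Two of the definitions quoted above can be tried out directly. The sketch below is an illustration only: `is_scalar_value` is a hypothetical helper written from the quoted ranges, and the "Å" sequences are a commonly cited case of several code point sequences for one character, not necessarily the one shown in the figure. Canonical equivalence is checked here with the NFC normalization form from Python's standard `unicodedata` module.

```python
import unicodedata


def is_scalar_value(cp: int) -> bool:
    """True for code points in 0..D7FF or E000..10FFFF (surrogates excluded)."""
    return 0 <= cp <= 0xD7FF or 0xE000 <= cp <= 0x10FFFF


print(is_scalar_value(0x0041))    # True: LATIN CAPITAL LETTER A
print(is_scalar_value(0xD800))    # False: high-surrogate code point
print(is_scalar_value(0x10FFFF))  # True: last code point

# Three code point sequences that are canonically equivalent:
# U+00C5, U+212B, and U+0041 followed by U+030A.
sequences = ["\u00C5", "\u212B", "A\u030A"]
normalized = {unicodedata.normalize("NFC", s) for s in sequences}
print(normalized == {"\u00C5"})   # True: all three normalize to U+00C5
```

So the three sequences stay distinct as code points and only become identical once a normalization form is applied, which is one reason the choice between "code point" and "abstract character" terminology matters for the data model.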
Comment Some symbols may also have multiple characters, like dollar sign (Q11110).--GZWDer (talk) 18:31, 12 March 2020 (UTC)[reply]
- Comment @GZWDer: If we are trying to duplicate unicode maybe the lexeme namespace is better for this? It seems like in many of these cases you have a character with several different forms (uppercase vs lowercase) which is captured for example in Lexeme:L20817 (though they probably should not be all under "English" language). ArthurPSmith (talk) 18:56, 12 March 2020 (UTC)[reply]
  - Comment @ArthurPSmith: A Unicode character is not a lexeme, as it only corresponds to a specific writing system, not a specific language. For example, the letter "a" is used in more than 100 languages but has only one code point (if we restrict it to the normal lowercase letter). At a later point, new lexemes about letters actually used in a specific language may be created (e.g. A and a are the same letter and will have one lexeme), but that is out of scope of the current task.--GZWDer (talk) 19:01, 12 March 2020 (UTC)[reply]
    - It's possible to have languages such as "mul". Lexemes have the advantage that they don't require multiple labels nor description. --- Jura 21:30, 12 March 2020 (UTC)[reply]
      Sample at invalid ID (L61046) --- Jura 09:53, 14 March 2020 (UTC)[reply]
      - @Infovarius: who created invalid ID (L61046) --- Jura 14:56, 4 April 2020 (UTC)[reply]

Comment I could never really figure out the purpose of Unicode character (P487). It's being used in at least four or five different ways. The above would fix that and it could become an external-id property. --- Jura 21:30, 12 March 2020 (UTC)[reply]
Strong support I never understood why Unicode characters were mixed with the glyphs and concepts they represented. --Tinker Bell ★ ♥ 06:16, 15 March 2020 (UTC)[reply]
- @Tinker Bell: It's a little 'meta' I think, but I feel like I don't understand what is the actual subject of an item that is "about" a Unicode character. GZWDer's proposal is, I think, only to use this property where a current item has more than one Unicode character value. So for example for Chinese characters, there is only 1 Unicode character, so the item and the Unicode character are equivalent. Does that mean the "concept" of that character and the Unicode character are the same, or distinct? For the letter 'A' example, Unicode differentiates upper- and lower-case, and also those other special conditions that are sort of the letter 'A' in other contexts. So in each case where a new item would be created, that item would be "about" the conceptual context of the use of that letter, not specifically or exclusively about it as a Unicode character. Right? Or is that not the point here? ArthurPSmith (talk) 18:37, 16 March 2020 (UTC)[reply]
  - It is meaningless to split items about the character and the glyph of Chinese characters, as the Unihan database (which uses the Unicode character as primary key) is about the glyphs. Usually, different values (contexts) can be distinguished by object has role (P3831).--GZWDer (talk) 18:41, 16 March 2020 (UTC)[reply]
Question can we change this to lexeme datatype per suggestion above? --- Jura 02:06, 28 March 2020 (UTC)[reply]
- I don't think so - a lexeme may still cover multiple characters or sequences of characters. For example, ? has seven characters, but they should be in one (translingual) lexeme unless they are semantically different.--GZWDer (talk) 09:13, 29 March 2020 (UTC)[reply]
  - Well, the idea is to use lexemes like invalid ID (L61046) mentioned above. They would exactly be that. --- Jura 09:15, 29 March 2020 (UTC)[reply]
    - In addition, lexemes can not handle characters not in Unicode normalized form, like 著 (U+FA5F) (Q55726748) and 著 (U+2F99F) (Q55738328). I don't think we should have lexemes for them as they have no independent meaning.--GZWDer (talk) 09:22, 29 March 2020 (UTC)[reply]
      - It's possible that initially not all can be included. The namespace is still under development and eventually a way can be found. We didn't use items either when no lexemes were available. As each character has a definition, this can be included as S1. The problem with using items is that they require needless repetition of labels and descriptions. Lexemes have all that already included. --- Jura 09:28, 29 March 2020 (UTC)[reply]
        However still some lexemes for symbols may cover multiple characters such as X (L19342). I don't see the point for creating additional lexemes for individual characters with no additional meaning.--GZWDer (talk) 09:43, 29 March 2020 (UTC)[reply]
        I don't think existing entities in some languages should be replaced. They can use the proposed property to point to entities like invalid ID (L61046) as well. I don't think the question whether or not to create these is much different from the question of creating them as items. If you don't see the point of one, it's unclear why you would want to create the others. Given the 5 or so ways Unicode character (P487) is used, people clearly have problems with the current structure and the more formal approach of the L-namespace could help. --- Jura 09:52, 29 March 2020 (UTC)[reply]
        An item may easily be tied to a specific Abstract Character, while a lexeme is a unit of lexical meaning, comprising a set of Abstract Characters with the same semantic meaning. I don't think we should have lexemes for characters with no independent semantic meaning. For CJKV characters, I do not favor creating translingual lexemes for them - English Wiktionary deprecated translingual definitions long ago. --GZWDer (talk) 10:35, 29 March 2020 (UTC)[reply]
        I think you are mixing "lexeme" and Lexeme: --- Jura 10:53, 29 March 2020 (UTC)[reply]
        Lexemes should be created for symbols like X (L19342), and if we also create lexemes for individual characters, we will 1. unnecessarily duplicate the definitions and 2. make users confused.--GZWDer (talk) 11:17, 29 March 2020 (UTC)[reply]
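The point made earlier in this thread about characters that are not in Unicode normalized form, such as U+FA5F and U+2F99F, can be checked with a short sketch. It only uses Python's standard `unicodedata` module; which code point each character normalizes to is printed rather than assumed here.

```python
import unicodedata

for cp in (0xFA5F, 0x2F99F):
    ch = chr(cp)
    nfc = unicodedata.normalize("NFC", ch)
    target = " ".join(f"U+{ord(c):04X}" for c in nfc)
    status = "unchanged" if nfc == ch else "replaced by normalization"
    # A character that does not survive NFC cannot be stored as-is
    # wherever text is normalized on input.
    print(f"U+{cp:04X} -> {target} ({status})")
```

Both characters are replaced by a different code point under NFC, so any text field that normalizes its input would lose the distinction between them and their normalized form.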

┌────────────────────────────────────────────────────────────────────────────────────────────────────┘

Can you explain what you think would be duplicated? How users could be confused? ⋯ (L291359) explains clearly what it's about. For ⋯ (Q87524936) users would have to find the right language to read the alias to understand what it's about. Seems much more confusing to me. --- Jura 04:43, 31 March 2020 (UTC)[reply]

I have the following additional reasons:

- You cannot add sitelinks to lexemes, so items like 😂 (Q33836537) and 雨 (Q3595028) will still exist.
- Some characters have Unicode aliases (see [1] p924). Aliases cannot be added to lexemes either.
- We will in any case have lexemes for symbols like ( ), which is a matching pair, and for the individual characters ( and ). As a symbol, ( corresponds to multiple code points. Users may confuse the symbol with the individual Unicode characters if both have lexemes.
- Not every Unicode character has a meaning, and Unicode names are only names: they do not always tell the meaning of the character (as with 𗊓 (Q87589786)) and are sometimes even unrelated to the meaning. They only exist as aliases, not as definitions.

--GZWDer (talk) 19:19, 31 March 2020 (UTC)[reply]

I think it should be possible to solve these too. BTW, I'm not sure if 😂 (Q33836537) should actually have been merged with Q87581513, at least not in the logic you presented above. --- Jura 14:46, 4 April 2020 (UTC)[reply]

Each item is about an "Abstract Character", which may be encoded in multiple codesets (including Unicode). For example, "A" is a Unicode character, that is, an "Abstract Character" encoded in Unicode, and the same "Abstract Character" may also be encoded elsewhere. Most emojis are also "Abstract Characters"; some are encoded in Unicode, some are not. There will be only one item for each "Abstract Character", wherever it is encoded. I think this property should be limited to "Abstract Characters" encoded in Unicode (as unencoded "Abstract Characters" are potentially infinite - this is why we have private use area (Q11152836)).--GZWDer (talk) 20:53, 5 April 2020 (UTC)[reply]

Support Ok, I think I get the point here, yes let's do this. ArthurPSmith (talk) 19:53, 1 April 2020 (UTC)[reply]

Comment There is just too much redundancy in the suggested datatype: compare item at [2] (~900 triples) and lexeme at [3] (~20 triples) --- Jura 14:46, 4 April 2020 (UTC)[reply]
- I don't think this should be a concern, given that the number of Unicode characters is limited.--GZWDer (talk) 20:43, 5 April 2020 (UTC)[reply]
  - It's a massive redundancy due to a problem in the modeling. Besides, Query Server has problems dealing with them. Items become difficult to edit when they have a larger number of triples. --- Jura 15:20, 7 April 2020 (UTC)[reply]
Strong support per Tinker Bell. Also there should probably be an inverse statement. 1234qwer1234qwer4 (talk) 15:03, 7 April 2020 (UTC)[reply]
An alternative proposal now at Wikidata:Property proposal/Unicode character --- Jura 15:20, 7 April 2020 (UTC)[reply]
Oppose given the redundancy of the proposed datatype. --- Jura 13:20, 13 April 2020 (UTC)[reply]
Oppose in favor of Jura's proposal; if we cannot implicitly or explicitly force a single label--a single triple--to be shown for all languages in the interest of efficiency, then this only promotes more bloat. Mahir256 (talk) 00:35, 18 April 2020 (UTC)[reply]
Weak support I strongly support splitting Unicode characters from concepts. Perhaps there is an existing property that could be used. --Matěj Suchánek (talk) 11:06, 4 September 2020 (UTC)[reply]

@GZWDer, Iwan.Aucamp, Jura1, Matěj Suchánek: @Mahir256, Tinker Bell, Indolering, 1234qwer1234qwer4: Closing as Not done due to the stale and inconclusive discussion here. I think any future proposal for this (and it may well be we need this) needs to set up some clear examples perhaps in test.wikidata.org, or otherwise be much clearer on the data model being proposed. ArthurPSmith (talk) 17:06, 9 September 2021 (UTC)[reply]
- @ArthurPSmith: There is currently a discussion at Wikidata:Requests_for_deletions#Lexeme:L61046.--GZWDer (talk) 13:39, 10 September 2021 (UTC)[reply]