Wikidata:Property proposal/Unicode character (item)

Unicode character (item) edit

Originally proposed at Wikidata:Property proposal/Generic

   Not done
DescriptionUnicode character representing the item
Data typeItem
Domaininstance of (P31)subclass of (P279)* → letter (Q9788)
Example 1A (Q9659) → See Q9659#P487, each value will be a new item
Example 2B (Q9705) → See Q9705#P487, each value will be a new item
Example 3C (Q9820) → See Q9820#P487, each value will be a new item
Planned usenew items will be created for current values of Unicode character (P487) on instance of (P31) of letter (Q9788). The current values of Unicode character (P487) and Unicode code point (P4213) will be moved to these new items.
See alsocode (P3295), Unicode character (P487) and Unicode code point (P4213)

Motivation edit

Currently Wikidata does not differ abstract symbols A (Q9659) and specific characters representing the symbols. So it may be meaningful to create new items for these Unicode characters. Unicode character (P487) and Unicode code point (P4213) will be moved to these new items and then a single value constraint will be set.

Note this does not affect any item about single characters like (Q3595028).GZWDer (talk) 22:02, 11 March 2020 (UTC)[reply]

Discussion edit

  •   Comment How will this be different from Unicode character (P487)? And how come the name of the property is "Unicode character (item)" - why is there "(item)" in it? Iwan.Aucamp (talk) 16:37, 12 March 2020 (UTC)[reply]
  •   Comment Do we have any items for individual unicode codepoints right now? I don't see that there's a real structural need for an item for each one, but some of them may be notable in themselves. On the other hand, for those notable ones (none of the "A's" count I think) what would point to them with a property like this? ArthurPSmith (talk) 17:26, 12 March 2020 (UTC)[reply]
  •   Comment @ArthurPSmith: See Wikidata:Requests for permissions/Bot/GZWDer (flood) 3. In most cases, I do not want to reuse items like 𐓃 (Q66360724) for unicode character, as the character does not correspond to single Unicode character (it correspond to two, U+104C3 and U+104EB). This property will not have single value and unique value constraint, while (after migration) Unicode character (P487) and Unicode code point (P4213) will. "structural need for an item for each one": Each specific character itself has various properties (Unicode character property (Q1853267)) that can not be expressed without dedicated item (example).--GZWDer (talk) 17:44, 12 March 2020 (UTC)[reply]
  •   Comment @Iwan.Aucamp: This is really not a new idea and there's some discussion. I am only going to work on this currently (after a break of 18 months) as Unicode 13.0 is released recently.--GZWDer (talk) 17:51, 12 March 2020 (UTC)[reply]
    •   Comment @GZWDer: I cleaned up the proposal a bit. I still think this should be more generic though, at the very least I would like to see some actual real example items? Presumably the subject items would be instance of (P31) of Unicode character (Q29654788) which is subclass of (P279) of character (Q32483)? Could we not call this property "codepoint (item)" rather? (see Code point and also maybe code (P3295)) then it is a bit broader than just The Unicode® Standard (Q8819)? If so I would change to support, I get the need. Iwan.Aucamp (talk) 18:28, 12 March 2020 (UTC)[reply]
    •   Comment @GZWDer: "Character" is such an overloaded term and the existing labels lead to confusion. Just explaining that a Unicode Character is not *just* a latin alphabetical character is hard! Please try to disambiguate the meaning upfront, something along the lines of Unicode Code Point (item), Unicode Abstract Character (item), and Unicode Character Encoding (external-identifier | string). -indo (talk) 02:34, 5 November 2020 (UTC)[reply]
      • @Iwan.Aucamp, Indolering: In this proposal, it means "abstract characters encoded by the Unicode Standard".--GZWDer (talk) 05:05, 5 November 2020 (UTC)[reply]
      • @Iwan.Aucamp: What does that mean? Is there a list of abstract characters somewhere? I'm not being facetious, I'm learning Unicode and this is driving me crazy! Here are all of the potential definitions I could find:
        • Unicode Glossary definition for Abstract Character: "A unit of information used for the organization, control, or representation of textual data. (See definition D7 in Section 3.4, Characters and Encoding.)"
        • Definition 7 in Section 3.4 "A unit of information used for the organization, control, or representation of textual data." followed by a list of vague bullet points, including the following:
          • "The abstract characters encoded by the Unicode Standard are known as Unicode abstract characters."
          • "Abstract characters not directly encoded by the Unicode Standard can often be represented by the use of combining character sequences."
        • The example in Chapter 2 Section 4 figure 2.5 shows how 3 different combinations of code points that all map to a single "abstract character", which (at least in this instance) appears to be a single common grapheme.
          • But that can't be right, because of this bullet point in Section 3.4 definition 7, "An abstract character does not necessarily correspond to what a user thinks of as a “character” and should not be confused with a grapheme."
        • This Character Encoding Model technical note mentions something close to what you are looking for, an Abstract Character Repertoire, "defined as a set of abstract characters to be encoded, normally a familiar alphabet or symbol set. The word abstract just means that these object are defined by convention."
        • But you don't want something that is defined by convention, you want the encoded abstract characters, which sound like a Coded Character Set, "defined to be a mapping from a set of abstract characters to the set of non-negative integers. ... The Unicode concept of the Unicode scalar value (cf. D28, in chapter 3 of the Unicode Standard) is explicitly this code point, used for mapping of the Unicode repertoire."
        • But that's not it either, as a Unicode Scalar Value is just, "Any Unicode code point except high-surrogate and low-surrogate code points. In other words, the ranges of integers 0 to D7FF16 and E00016 to 10FFFF16 inclusive."
      • Please correct me if I am wrong, but I don't think there is a set of "encoded abstract characters", there are just code points (the Universal Coded Character Set) and transformation rules (Universal Transformation Format) which can be used to map between code points that one might consider equivalent. If that's the case, then you shouldn't have a Unicode Abstract Character item at all and use Unicode Code Point terminology
  •   Comment Some symbols may also have multiple characters, like dollar sign (Q11110).--GZWDer (talk) 18:31, 12 March 2020 (UTC)[reply]
  •   Comment I could never really figure out the purpose of Unicode character (P487). It's being used in at least four or five different ways. The above would fix that and it could become an external-id property. --- Jura 21:30, 12 March 2020 (UTC)[reply]
  •   Strong support I never understood why Unicode characters were mixed with the glyphs and concepts they represented. --Tinker Bell 06:16, 15 March 2020 (UTC)[reply]
    • @Tinker Bell: It's a little 'meta' I think, but I feel like I don't understand what is the actual subject of an item that is "about" a Unicode character. GZWDer's proposal is, I think, only to use this property where a current item has more than one Unicode character value. So for example for Chinese characters, there is only 1 Unicode character, so the item and the Unicode character are equivalent. Does that mean the "concept" of that character and the Unicode character are the same, or distinct? For the letter 'A' example, Unicode differentiates upper- and lower-case, and also those other special conditions that are sort of the letter 'A' in other contexts. So in each case where a new item would be created, that item would be "about" the conceptual context of the use of that letter, not specifically or exclusively about it as a Unicode character. Right? Or is that not the point here? ArthurPSmith (talk) 18:37, 16 March 2020 (UTC)[reply]
  •   Question can we change this to lexeme datatype per suggestion above? --- Jura 02:06, 28 March 2020 (UTC)[reply]
    • I don't think so - a lexeme may still cover multiple character or sequences of characters. For example ? have seven characters; but they should be in one (translingual) lexeme unless thay are semantically different.--GZWDer (talk) 09:13, 29 March 2020 (UTC)[reply]
      • Well, the idea is to use lexemes like invalid ID (L61046) mentioned above. They would exactly be that. --- Jura 09:15, 29 March 2020 (UTC)[reply]
        • In addition, lexemes can not handle characters not in Unicode normalized form, like 著 (U+FA5F) (Q55726748) and 著 (U+2F99F) (Q55738328). I don't think we should have lexemes for them as they have no independent meaning.--GZWDer (talk) 09:22, 29 March 2020 (UTC)[reply]
          • It's possible that initially not all can be included. The namespace is still under development and eventually a way can be found. We didn't use items either when no lexemes were available. As each character has a definition, this can be included as S1. The problem with using items is that they require needless repetition of labels and descriptions. Lexemes have all that already included. --- Jura 09:28, 29 March 2020 (UTC)[reply]
            • However still some lexemes for symbols may cover multiple characters such as X (L19342). I don't see the point for creating additional lexemes for individual characters with no additional meaning.--GZWDer (talk) 09:43, 29 March 2020 (UTC)[reply]
              • I don't think existing entities in some languages should be replaced. They can use the proposed property to point to entities like invalid ID (L61046) as well. I don't think the question whether or not to create these is much different from the question of creating them as items. If you don't see the point of one, it's unclear why you would want to create the others. Given the 5 or so ways Unicode character (P487) is used, people clearly have problems with the current structure and the more formal approach of the L-namespace could help. --- Jura 09:52, 29 March 2020 (UTC)[reply]
                • An item may easily tie to an specific Abstract Character, while a lexeme is a unit of lexical meaning, comprising a set of Abstract Characters with same semantic meaning. I don't think we should have lexemes for characters with no independent semantic meaning. For CJKV characters, I do not favor creating translingual lexemes for them - English Wiktionary deprecated translingual definitions long ago. --GZWDer (talk) 10:35, 29 March 2020 (UTC)[reply]

────────────────────────────────────────────────────────────────────────────────────────────────────

  • Can you explain what you think would be duplicated? How users could be confused? (L291359) explains clearly what it's about. For (Q87524936) users would have to find the right language to read the alias to understand what it's about. Seems much more confusing to me. --- Jura 04:43, 31 March 2020 (UTC)[reply]

I have following addition reasons:

  1. You can not add sitelinks to lexemes, so items like 😂 (Q33836537) and (Q3595028) will exist.
  2. Some characters have Unicode aliases (See [1] p924). aliases can not be added to lexemes either.
  3. We will anyway have lexemes for symbols like ( ) - this is a matching pair, and individual characters ( and ) - as a symbol, ( corresponding to multiple codepoints. Users may confuse the symbol with individual Unicode characters if both have lexeme.
  4. Not every Unicode character has meaning, and Unicode names are only names, which does not always tell the meaning of character (like 𗊓 (Q87589786)), and sometimes even unrelated to the meaning. They only exist as aliases, not as definitions.

--GZWDer (talk) 19:19, 31 March 2020 (UTC)[reply]

Each item is about an "Abstract Character" which may be encoded in multiple codesets (including Unicode). For example, "A" is a Unicode character which is (equals to) "Abstract Character" encoded in Unicode, and the same "Abstract Character" may also be encoded elsewhere. most emojis are also "Abstract Characters", some are encoded in Unicode, some are not. There will be only one item for each "Abstract Character" wherever it is encoded. I think this property should be limited to "Abstract Character" encoded in Unicode (as unencoded "Abstract Characters" are potentially infinite - this is why we have private use area (Q11152836).)--GZWDer (talk) 20:53, 5 April 2020 (UTC)[reply]