Wikidata:Property proposal/text features

Text features

number of words

Originally proposed at Wikidata:Property proposal/Sister projects

Description: number of words in text
Data type: Quantity
Domain: Wikisource texts
Allowed values: > 0
Allowed units: none
Example 1: À M. Paul Foucher (Q55867126) → 1000 (replace with actual number)
Example 2: À M. des Herbiers (Q55867160) → 800 (replace with actual number)
Example 3: À son frère (Q55867161) → 700 (replace with actual number)
Planned use: add to some Wikisource text

Discussion

  •   Comment it could be interesting, but the rules to compute the number of words should be fixed, because the exact same text can have different word counts in different systems... --Hsarrazin (talk) 08:11, 14 November 2018 (UTC)
  •   Support @Hsarrazin: determination method (P459) can be used to denote which method was used to count words. Dhx1 (talk) 10:28, 14 November 2018 (UTC)
  •   Comment Also suggest the domain be expanded to cover all texts described by Wikidata--not just Wikisource items, where a reliable source exists for the word count. Dhx1 (talk) 10:33, 14 November 2018 (UTC)
    • Yes, P459 would generally be added with a value that links to a fairly detailed explanation on how it's being done. Personally I'd start out with Wikisource and see how it goes. Eventually it could be expanded. --- Jura 12:23, 15 November 2018 (UTC)
  • tend to   Oppose, seems like a specific version of number of parts of this work (P2635), or redundant with the scheme :
    ⟨ text ⟩ has part (P527)   ⟨ word ⟩
    quantity (P1114)   ⟨ number of words ⟩
    .
    Also see type–token distinction (Q175928): there is a difference in the number of word-types used (if you use « dog » twice in your text, this counts as one word-type but two « occurrences » of the word « dog »). Actually we may be able to solve this with the pair of properties has part (P527) / has parts of the class (P2670), now that I think of it:
    ⟨ the text ⟩ has parts of the class (P2670)   ⟨ word-type ⟩
    quantity (P1114)   ⟨ the number of different word-types ⟩
    and
    ⟨ the text ⟩ has part (P527)   ⟨ word ⟩
    quantity (P1114)   ⟨ the number of different occurrences ⟩
    Indeed, « word-type » can be thought of as a metaclass of words, and « has part of the type » can cross the boundary between the class level and the metaclass level (that's what it is for, actually), while « text » and « words » can be thought of as classes of the same level: you use words to build text, and each time you copy a text you copy all of its words along with the text. author  TomT0m / talk page 13:16, 20 November 2018 (UTC)
    • Thanks for your input. number of parts of this work (P2635) could work if we were just interested in one aspect, but using units to differentiate between types of parts seems complicated, as we would need to retrieve the detailed SPARQL node each time. has parts of the class (P2670) seems a good alternative, but as we will likely have several values for the statements (depending on the calculation method), selecting the correct one is slightly easier with a separate property. Furthermore, as this property will apply to many items, I think a dedicated property is preferable. --- Jura 06:42, 23 November 2018 (UTC)
      @Jura1: The counting method is actually a case to discriminate using « has part » / « has part of the type »: « has part of the type » is appropriate, for example, if you count the number of « word-types », per the type-token distinction, and for « word-tokens » we can even use « has part ». I also note that you don't detail at all how to model the different counting methods. I think it may be far more appropriate not to use arbitrary items for obscure, undescribed methods if we can use generic concepts to model them (an item for « word type », for example, through metaclassification). author  TomT0m / talk page 13:49, 16 December 2018 (UTC)
  •   Support Good idea. I wonder if the domain could indeed be stretched beyond wikisource-entries. Lymantria (talk) 11:14, 16 December 2018 (UTC)
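
The type-token distinction raised in this discussion can be illustrated with a minimal Python sketch. It is not a method agreed on in the proposal: the function name `count_tokens_and_types`, the whitespace splitting, the punctuation stripping and the case folding are all assumptions made for illustration, and a different choice for any of them would give different counts.

```python
from collections import Counter


def count_tokens_and_types(text: str) -> tuple[int, int]:
    """Return (number of word occurrences, number of distinct word-types).

    Assumptions (hypothetical, not from the proposal): words are separated
    by whitespace, surrounding punctuation is stripped, and comparison is
    case-insensitive.
    """
    words = [w.strip(".,;:!?\"'()«»").lower() for w in text.split()]
    words = [w for w in words if w]  # drop tokens that were only punctuation
    counts = Counter(words)
    return len(words), len(counts)


tokens, types = count_tokens_and_types("The dog saw the other dog.")
print(tokens, types)  # 6 occurrences, 4 word-types
```

Here « dog » and « the » each occur twice, so the same sentence yields 6 word occurrences but only 4 word-types, which is why the value stored would depend on which of the two is being counted.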

@ديفيد عادل وهبة خليل 2, Hsarrazin, Lymantria, TomT0m, Dhx1, Jura1:   Done: number of words (P6570). − Pintoch (talk) 20:28, 6 March 2019 (UTC)

number of sentences

Originally proposed at Wikidata:Property proposal/Sister projects

Description: number of sentences in text
Data type: Quantity
Domain: Wikisource texts
Allowed values: > 0
Allowed units: none
Example 1: À M. Paul Foucher (Q55867126) → 50 (replace with actual number)
Example 2: À M. des Herbiers (Q55867160) → 40 (replace with actual number)
Example 3: À son frère (Q55867161) → 30 (replace with actual number)
Planned use: add to some Wikisource text

Motivation (both proposals)

I think it would be good to add such metadata to Wikisource texts. Maybe additional properties can be useful.

@Hsarrazin: who edits there frequently. @Dhx1: who mentioned related readability scores on Project chat --- Jura 05:40, 14 November 2018 (UTC)

Discussion

  •   Support Both David (talk) 08:02, 14 November 2018 (UTC)
  •   Support determination method (P459) should be allowed and encouraged to be used as a qualifier to denote the method used to count the number of sentences.
  •   Comment Also suggest the domain be expanded to cover all texts described by Wikidata--not just Wikisource items, where a reliable source exists for the sentence count. Dhx1 (talk) 10:34, 14 November 2018 (UTC)
  • See also comments for first proposal only above. --- Jura 12:23, 15 November 2018 (UTC)
  • @Hsarrazin, Dhx1, ديفيد عادل وهبة خليل 2: thanks for your input. For both counts, the method should probably list the separators used to identify words (e.g. " ") and sentences ("." or "?" or "!", etc.) which may vary by language. Maybe an existing property can work for that, maybe we need a new one too. --- Jura 05:09, 16 November 2018 (UTC)
  • Tend to   Oppose with the same ideas as in « number of words » : use « has part ». author  TomT0m / talk page 13:25, 20 November 2018 (UTC)
    • see comment above. --- Jura 06:42, 23 November 2018 (UTC)
  •   Comment I wonder how poems or song lyrics, which often do not have a clear sentence structure, should be dealt with? Is the combination of these two properties meant to indicate text complexity (longer sentences mean harder to read)? Lymantria (talk) 11:14, 16 December 2018 (UTC)
    • I think the "criterion used"-item needs to enumerate separators. --- Jura 14:29, 16 December 2018 (UTC)
      • This seems a rather naive approach. Is this backed by references, with a known segmentation algorithm in mind, or is this original work? author  TomT0m / talk page 14:36, 16 December 2018 (UTC)
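
The separator-based counting suggested in this thread can be sketched as follows. This is only an illustration of the idea of enumerating separators, not an established segmentation algorithm: the function names `count_words` and `count_sentences` and their default separator sets are hypothetical, chosen from the examples mentioned above (" " for words; ".", "?" and "!" for sentences).

```python
import re


def count_words(text: str, separators: str = " \n\t") -> int:
    """Count non-empty chunks between runs of the given word separators."""
    pattern = "[" + re.escape(separators) + "]+"
    return len([w for w in re.split(pattern, text) if w])


def count_sentences(text: str, terminators: str = ".?!") -> int:
    """Count non-blank chunks between runs of the given sentence terminators."""
    pattern = "[" + re.escape(terminators) + "]+"
    return len([s for s in re.split(pattern, text) if s.strip()])


sample = "Where is it? Here. M. Paul Foucher left!"
print(count_words(sample))      # 8
print(count_sentences(sample))  # 4, although a reader would see 3 sentences
```

The sample shows the weakness raised in the discussion: the abbreviation « M. » is treated as the end of a sentence, so the count depends heavily on the separator list and on the language, and text without terminal punctuation (poems, lyrics) is counted as a single sentence.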