Wikidata:Property proposal/Library of Congress Demographic Group Terms ID

Library of Congress Demographic Group Terms ID

Originally proposed at Wikidata:Property proposal/Authority control

Description: ID in the Library of Congress controlled vocabulary for demographic groups
Represents: Library of Congress (Q131454)
Data type: External identifier
Allowed values: dg\d{10}
Example: teenager (Q1492760) → dg2015060011
Source: http://id.loc.gov/authorities/demographicTerms.html
Formatter URL: http://id.loc.gov/authorities/demographicTerms/$1
See also: Library of Congress authority ID (P244)
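
As a rough illustration of how the allowed-values pattern and the formatter URL fit together, here is a minimal Python sketch. The regular expression and URL template are taken from the fields above; the function names `is_valid_lcdgt_id` and `lcdgt_url` are made up for this example and are not part of any Wikidata or Library of Congress tooling.

```python
import re

# Pattern from the proposal's "Allowed values" field.
LCDGT_ID = re.compile(r"dg\d{10}")

# Formatter URL from the proposal; "$1" stands for the identifier.
FORMATTER_URL = "http://id.loc.gov/authorities/demographicTerms/$1"


def is_valid_lcdgt_id(value: str) -> bool:
    """Return True if the whole string matches dg followed by ten digits."""
    return LCDGT_ID.fullmatch(value) is not None


def lcdgt_url(value: str) -> str:
    """Build the linked-data URL for a valid identifier (hypothetical helper)."""
    if not is_valid_lcdgt_id(value):
        raise ValueError(f"not an LCDGT identifier: {value!r}")
    return FORMATTER_URL.replace("$1", value)


if __name__ == "__main__":
    # The example value given in the proposal for teenager (Q1492760).
    print(lcdgt_url("dg2015060011"))
```

The check uses `fullmatch` so that a longer string which merely contains a valid identifier is rejected, which is how a format constraint on an external-identifier property would normally be expected to behave.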

Motivation

Controlled vocabulary of approved terms used by the Library of Congress for different groups in society. The thesaurus includes schemes for groups of people by different ages, educational level, ethnic background, gender, language, medical condition, nationality, occupation, religion, social grouping and sexuality. See User:Jheald/LoC/dg for a full schedule.

These values could be stored using Library of Congress authority ID (P244). (The values use the same URL format and endpoint.) However, the current preference at that property is to keep P244 for the Library of Congress Name Authority File (LCNAF) and Library of Congress Subject Headings (LCSH), and not to include the LCDGT vocabulary there, both to keep the LCSH well distinguished and to avoid having values from different vocabularies on the same item. Jheald (talk) 00:42, 6 March 2018 (UTC)

Discussion

@ArthurPSmith: The LoC maintains quite a lot of different indexes/vocabularies/thesauruses -- see http://id.loc.gov
A number of the smaller, more specialist ones have identifier URLs that start https://id.loc.gov/vocabulary/..., with a variety of different identifier formats, and are now covered by the property LoC and MARC vocabularies ID (P4801), apart from Code List for Cultural Heritage Organizations (P3234) which has its own property.
The most important ones have linked-data URLs that start https://id.loc.gov/authorities/..., followed by an identifier made up of one or two characters followed by up to ten digits, with the characters indicating the dataset and the digits a particular entry within that dataset.
Of the LoC datasets with identifiers of this form, the format as a regular expression (P1793) filter for the existing property Library of Congress authority ID (P244) currently only allows entries from the LoC Name Authority File (LCNAF), which covers names of persons, institutions and events, with IDs starting with 'n'; and the LoC Subject Headings list (LCSH), which covers subjects, with IDs starting with 'sh'.
A decision was taken at Property talk:P244#Genre/Form to exclude the LoC Genre/Form thesaurus (LCGFT) from P244, to avoid uniqueness constraint clashes with values in the Subject Headings list. The Genre/Form list is a set of identifiers used to describe what an object actually is rather than what it is about -- so eg an aerial photograph would have an LCGFT identifier for aerial photograph, which would be different from the LCSH identifier for books etc about aerial photographs. The LCGFT thesaurus is the subject of a parallel current property proposal, Wikidata:Property_proposal/Library_of_Congress_Genre/Form_Terms_ID.
It was additionally decided, at Property_talk:P244#Children's_Subjects, to exclude the LoC Children's Subject Headings (LCCSH), which is a parallel set of subject headings to the LCSH, developed to be more closely attuned to material aimed at children.
The driving rationale given for these exclusions -- of avoiding uniqueness constraint clashes -- can to some extent be questioned (see Property_talk:P244#Genre/Form_(revisited)), because the LCSH itself can contain multiple entries for what would correspond to a single item here -- eg when that concept is a 'main' subject, when it appears as a 'topic' applied to or modifying a different main subject, or as a 'form' (because there are still entries for forms in the general LCSH, as well as in the LCGFT devoted specifically to them). There is also a (small) amount of overlap between the LCSH and the LCNAF.
So one possibility for the LCGFT and the LCDGT (the subject of this proposal) would be to overturn the previous discussion at Talk:P244, and simply use P244 for everything, with entries perhaps qualified by part of (P361) LCDGT or subject has role (P2868) = genre (Q483394) to distinguish them from LCSH entries, using separator (P4155) with one of these qualifiers to mute the single-value constraint. And in fact this is what I originally thought to do, as expressed at Property_talk:P244#Genre/Form_(revisited).
But on reconsideration, I think proposing new separate properties for the LCDGT and LCGFT, as I have now done, probably makes more sense. It will save a lot of fiddling about with qualifiers to establish which vocabulary each value is in. And given that P244 now has over 500,000 uses for LCNAF and LCSH, tracking queries written eg to assess coverage or progress with LCDGT or LCGFT can be a lot more efficient if the query engine can go straight to the LCDGT or LCGFT set, rather than joining in the full set of P244 values only to then exclude almost all (ie over half a million) of them.
In fact, I now wonder whether there isn't a case for also splitting out the present 5400 LCSH values from Library of Congress authority ID (P244), and leaving P244 as a property purely for the LCNAF personal and institutional names. At the moment the 500,000 LCNAF names are such a big haystack for the LCSH values to be lost in that it takes almost 24 seconds for WDQS just to identify and count them: tinyurl.com/ybzhl25d. Splitting them would give a much tighter definition of what P244 is for, with tighter constraints on expected qualifiers, while allowing the LCSH entries to be more easily examined and developed as a set. In passing, I also notice that User:Pigsonthewing has just added a third-party formatter URL (P3303) for the Worldcat database, based on P244 -- a formatter that only works for the values that are LCNAF entries, not LCSH ones.
So, a slightly long answer to your question; but I hope this sets out why I think these two new properties would be a useful addition, rather than trying to force them into P244; and why it might indeed be beneficial to split a further new property out of even the current use of P244. Jheald (talk) 13:34, 7 March 2018 (UTC)
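
To make the identifier scheme described in the comment above concrete, the following Python sketch sorts id.loc.gov/authorities identifiers into datasets by their letter prefix. It is only an illustration under stated assumptions: the overall shape (one or two letters, then up to ten digits) and the prefixes 'n', 'sh' and 'dg' come from this page, while the function name `classify` and the choice to return `None` for any other prefix are hypothetical.

```python
import re

# Shape described in the discussion: one or two letters, then up to ten digits.
AUTHORITY_ID = re.compile(r"([a-z]{1,2})(\d{1,10})")

# Only prefixes named on this page; any other prefix is left unclassified.
OTHER_PREFIXES = {"sh": "LCSH", "dg": "LCDGT"}


def classify(identifier: str):
    """Return the dataset abbreviation for an identifier, or None if unknown."""
    match = AUTHORITY_ID.fullmatch(identifier)
    if match is None:
        return None
    prefix = match.group(1)
    # The discussion describes LCNAF identifiers as "starting with 'n'".
    if prefix.startswith("n"):
        return "LCNAF"
    return OTHER_PREFIXES.get(prefix)


if __name__ == "__main__":
    for value in ["dg2015060011", "sh85000001", "n79021164", "x123"]:
        print(value, "->", classify(value))
```

A filter of this kind is roughly what would be needed on every query if all vocabularies shared P244; with a dedicated property per vocabulary the property itself identifies the dataset and no such filtering is needed.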

@Jheald, ديفيد عادل وهبة خليل 2, ArthurPSmith, TimK MSI: Done: Library of Congress Demographic Group Terms ID (P4946). − Pintoch (talk) 17:50, 13 March 2018 (UTC)