Wikidata:Property proposal/publication type of scholarly article
Publication type of scholarly article
Originally proposed at Wikidata:Property proposal/Generic
Description | Publication type of scholarly article |
---|---|
Data type | Item |
Template parameter | Different from publication type as used for example in w:Template:Infobox short story |
Domain | Instances of scholarly article (Q13442814) and its subclasses. |
Allowed values | Permitted values should typically be potential subclasses of scholarly work (Q55915575). In practice, items that are instances of scholarly article (Q13442814), which number in the tens of millions, carry a diverse range of additional instance of (P31) statements, and some cleanup work is anticipated for both domain and range. |
Example 1 | Malaria and the microbiome: a systematic review (Q56383548) → systematic review (Q1504425) |
Example 2 | NIH Consensus conference. Gallstones and laparoscopic cholecystectomy (Q70552083) → NIH consensus development conference summary (Q27718083) |
Example 3 | Practice guidelines for the management of bacterial meningitis (Q33982444) → medical guideline (Q878041) |
Source | The new statements will initially be generated by rules from existing statements, as a follow-up to the WDQS split. |
Robot and gadget jobs | Bots will be used heavily to implement the migration from instance of (P31) statements. |
Wikidata project | Wikidata:WikiCite |
Motivation
Currently these publication types of articles are added as instance of (P31) statements, but better data modelling can follow from having a separate property. For example, on clinical trial (Q30612) under MeSH descriptor ID (P486) the publication type meaning is at present given preferred rank over the "clinical trials as topic" meaning. It would be better not to overload the item in this way, given the importance of clinical trials in medical research. We should have two items, one of which should only be used in "publication type of scholarly article" statements.
This idea was mentioned already several years ago. It comes up now because of the graph split treating the scholarly article items as a graph in their own right. See Wikidata talk:WikiCite#Community input into WDQS graph split: a publication type property proposal for a preliminary discussion. That thread links to a graph split page which goes into fuller details of the technical side. I've been asked by the developers working on the split to make this proposal. @Daniel Mietchen: @Bluerasberry: @Sj:
While the graph split will make SPARQL queries more complex, good can come of it if this proposed property is created, and some systematic work goes on to sort out the current overloading of dozens of items. Charles Matthews (talk) 10:35, 12 September 2024 (UTC)
Discussion
- Support Would most (all?) subclasses of scholarly article (Q13442814) be replaced by this new property then, and their instances updated to just be instances of scholarly article (Q13442814)? ArthurPSmith (talk) 13:23, 12 September 2024 (UTC)
- Not part of the original plan anyway, which was simply to create new triples from old, with new object items where, for example, "clinical trial" had become either an item for the real-life testing, or for a publication type. Charles Matthews (talk) 15:07, 12 September 2024 (UTC)
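The "new triples from old" idea can be sketched in Python as follows. This is only an illustration under stated assumptions, not the actual bot logic: the property ID P13046 is the one assigned when this proposal was closed, the `REMAP` table and the function name are hypothetical, and the rule of treating every extra P31 value on an article as a publication type is a deliberate simplification.

```python
INSTANCE_OF = "P31"
PUBLICATION_TYPE = "P13046"  # ID assigned when the proposal was closed
SCHOLARLY_ARTICLE = "Q13442814"

# Hypothetical remapping: an overloaded "real-life" item found on an article
# is replaced by its publication-type counterpart in the new triple.
REMAP = {"Q6934595": "Q91901000"}  # multicenter clinical trial -> multicenter study report


def derive_publication_type_triples(triples):
    """Create new publication-type triples from existing P31 triples.

    The input triples are not modified or removed; only new ones are returned.
    """
    articles = {s for s, p, o in triples if p == INSTANCE_OF and o == SCHOLARLY_ARTICLE}
    new_triples = []
    for s, p, o in triples:
        # Only extra P31 values on scholarly articles are considered.
        if p != INSTANCE_OF or s not in articles or o == SCHOLARLY_ARTICLE:
            continue
        new_triples.append((s, PUBLICATION_TYPE, REMAP.get(o, o)))
    return new_triples


# Illustrative input, modelled on Example 1 of the proposal.
old = [
    ("Q56383548", "P31", "Q13442814"),
    ("Q56383548", "P31", "Q1504425"),
]
new = derive_publication_type_triples(old)
```

The old P31 statements are left in place here, in line with the reply above that replacing them was not part of the original plan.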
- Comment The existing set of rules has already raised some concerns and confusions at Wikidata_talk:SPARQL_query_service/WDQS_graph_split/Rules and I think this proposal is going to help to reduce these confusions/ambiguities. From a technical standpoint what is important is that this new property will help the system to determine if an item should be part of the scholarly_articles subgraph or not. My current understanding (but please let me know if I'm wrong) is that the fact that an entity has a non-deprecated statement with this new property will be sufficient to classify it as a scholarly article (it would not even have to look at the value of this property). From a practical point of view, assuming this proposal is accepted, we should update the WDQS software with this new rule before any migration is attempted. During the migration we might have to keep both types of rules (the one based on P31 and the one based on this property). DCausse (WMF) (talk) 07:05, 13 September 2024 (UTC)
- @DCausse (WMF): So there can be a case analysis with a few cases. An example that is clear is the case of multicenter study report (Q91901000), label "class of publication", and multicenter clinical trial (Q6934595). I have checked just now, and nothing that is instance of multicenter clinical trial (Q6934595) is also instance of scholarly article (Q13442814). On Study of GLS-5700 in Dengue Virus Seropositive Adults (Q26762063) there is another P31 statement, but for another type of trial. I have worked through the 28 hits for multicenter study report (Q91901000):
SELECT ?item ?itemLabel WHERE {?item wdt:P31 wd:Q91901000; wdt:P31 wd:Q13442814. SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],mul,en". } }
- I see a case where controlled clinical trial (Q70447452) is used instead of controlled clinical trial (Q58897597). So as an example for your question, a P31 statement with controlled clinical trial (Q58897597) ought to be enough to classify as a scholarly article. I find no hits like that, so perhaps all those items have already been split out. One hit for Role of lopinavir/ritonavir in the treatment of SARS: initial virological and clinical findings. (Q35536588) which is "wrong", twice. But really this can't be discussed fully here. Charles Matthews (talk) 10:23, 13 September 2024 (UTC)
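The classification rule DCausse describes above, together with the suggestion to keep both rules during the migration, might look roughly like this in Python. It is a minimal sketch, not the WDQS implementation: the `Statement` class and function name are hypothetical, P13046 is the ID assigned when the proposal was closed, and the P31 rule is reduced to a single class, whereas the real rule set on the graph split page is more involved.

```python
from dataclasses import dataclass

INSTANCE_OF = "P31"
PUBLICATION_TYPE = "P13046"  # ID assigned when the proposal was closed
SCHOLARLY_ARTICLE = "Q13442814"


@dataclass(frozen=True)
class Statement:
    prop: str
    value: str
    rank: str = "normal"  # Wikibase ranks: "preferred", "normal", "deprecated"


def in_scholarly_subgraph(statements):
    """Return True if either migration-period rule matches the entity."""
    for st in statements:
        if st.rank == "deprecated":
            continue  # deprecated statements never count
        # New rule: any non-deprecated statement with the property is enough;
        # its value is not inspected.
        if st.prop == PUBLICATION_TYPE:
            return True
        # Old rule (simplified assumption): P31 scholarly article.
        if st.prop == INSTANCE_OF and st.value == SCHOLARLY_ARTICLE:
            return True
    return False
```

Keeping both branches in one function mirrors the idea that the P31-based rule and the property-based rule would coexist until the migration is complete.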
- Support (changed from Disagree) So if I understand correctly, rather than using instance of (P31) or subclass of (P279), you want a dedicated property for scholarly works to subclassify by the idea of a "publication type", which upon your inspection seems to be quite varied, with thousands of "publication types". You didn't want to use subclass of (P279) to subclassify them because it would make querying a bit harder and less straightforward when dealing with the migration, and scholarly works in general? So a dedicated property just for scholarly works to subclassify/subcategorize (without resorting to using subclass of (P279)) was justified, and hence this proposal. YES/NO? (after a clarifying reply, I can update my disagreement) --Thadguidry (talk) 00:09, 17 September 2024 (UTC)
- @Thadguidry: In the background here, we have the blind men and an elephant (Q1218005) issue applied to WikiCite (Q21831105). The graph split will have a major impact on the "WikiCite area", because the scholarly graph split out is the natural habitat for WikiCite on Wikidata: the big citation graph lives there. But people may talk past each other when they have different conceptions of WikiCite.
- So "quite varied" might be fair, but for me the list of publication types of interest is from Medical Subject Headings (MeSH). Those can be found with this query.
SELECT DISTINCT ?item ?itemLabel WHERE {?item wdt:P672 ?string. FILTER (STRSTARTS(?string, "V")) SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". } }
- This is the list of terms that can be used on a PubMed page in the "Publication types" section, e.g. on https://pubmed.ncbi.nlm.nih.gov/24394640/ where the types are review article (Q7318358) and a term that comments on the grant support. It would be very good to import those statements into Wikidata systematically, and "subclass of scholarly article" isn't a good fit. Some points about this: maybe we only import a subset. Maybe we want to use some other way of looking at publication types (Wikidata is agnostic about ontology and doesn't insist on limiting values), and someone has already mentioned a classification used on The Lens (Q7144471).
- But looking at MeSH with PubMed explains quite well the overloading issue we have, that is impeding a clean split. clinical trial (Q30612) is going to need to become two items, one with MeSH tree code (P672) value V03.175.250, and another with the other three values. The latter item would be fit to be used in main subject (P921) statements, for example where PubMed has a "Clinical Trials as Topic" term. The former item should equally be fit to be used in "publication type" statements. This would be a good resolution to an ambiguity issue we have here and now. Charles Matthews (talk) 08:48, 17 September 2024 (UTC)
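The proposed split of an overloaded item by MeSH tree code (P672) can be sketched in Python. This is only an illustration: the function name is hypothetical, the "V" prefix test follows the query earlier in this thread, and the three non-V codes below are made-up placeholders, not the real values on clinical trial (Q30612).

```python
def split_tree_codes(codes):
    """Partition MeSH tree codes into publication-type codes (branch V) and the rest."""
    publication_type = [c for c in codes if c.startswith("V")]
    topic = [c for c in codes if not c.startswith("V")]
    return publication_type, topic


# V03.175.250 is the value named above; the "X00..." codes are placeholders
# standing in for the other three values on the item.
codes = ["V03.175.250", "X00.000.001", "X00.000.002", "X00.000.003"]
pub_codes, topic_codes = split_tree_codes(codes)
```

The first list would stay with the item used in "publication type" statements, and the second with the item used in main subject (P921) statements.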
- It sounds like there will eventually be overlap with general "publication types". But that's ok, because they can be multi-typed (multiple subclasses added to any particular "publication type of scholarly article"). Thanks, I think I understand better. To me, this feels like a first pass at fixing things, and perhaps this property will be slightly less useful in the future, but you've made a convincing argument that it's needed for now with respect to the migration. Changed my vote to support. --Thadguidry (talk) 12:27, 17 September 2024 (UTC)
- Support Contrary to popular belief, instance of (P31) is not intended to be a data dump. Specific qualities deserve specific properties, especially when we have millions of relevant items, as in this case. --Jklamo (talk) 13:27, 17 September 2024 (UTC)
- Support Makes sense. --Prototyperspective (talk) 14:28, 18 September 2024 (UTC)
- Support as per Jklamo. Thanks for taking the time to write this proposal--So9q (talk) 16:34, 20 September 2024 (UTC)
- Oppose I don't think it makes sense to have a publication type property specific to scholarly articles. Why doesn't it apply to other publications (which currently use genre (P136) for things like this)? I also think some of the more specific instance of (P31) values currently being used are redundant and should not be used on the items at all. I don't see any reason to add academic journal article (Q18918145) to something which we already know is an article published in a journal in the same way we don't add woman (Q467) to a human (Q5) who is female (Q6581072). - Nikki (talk) 09:27, 21 September 2024 (UTC)
- Good points that need to be addressed. If your proposed solution is to instead use Genre I don't think those are genres. And if not, please specify what your solution would be. Prototyperspective (talk) 13:35, 21 September 2024 (UTC)
- I can give a concise kind of reason: if you want to have database constraints that apply in this particular context. Certainly it doesn't make very much sense to have instance of (P31) subjected to database constraints, when it is universal. When you say "should not be used on the items at all" you are arguing for constraints, and the standard way to do that is with a definite property. Charles Matthews (talk) 19:33, 21 September 2024 (UTC)
- Makes sense. One thing is that I think it would generally be best if values in properties can be constrained depending on other values/properties of the item, and I think this is already done. Moreover, could you please explain why properties like language of work or name also show values other than (in this case) languages in the autocomplete box? Prototyperspective (talk) 10:07, 22 September 2024 (UTC)
- I don't think I want to talk here about details of constraints, because it is anyway going to be a community decision what is wanted. The general principle is to have constraints based on queries, so a list of constraint violations can be generated automatically. In this case it is worth emphasising (a) that there are tens of millions of items involved, and (b) preliminary checks on the instance of (P31) statements we are starting with show a complex situation. So I don't think we should approach this business with ad hoc ideas. We may end up with a package of constraints that is effective in keeping the data clean, but that would require some effort. Charles Matthews (talk) 11:08, 22 September 2024 (UTC)
- @Nikki: genre (P136) is not similar at all; for scholarly works maybe main subject (P921) would be similar to genre (P136), though perhaps for a real "genre" you would pick a more high-level subject area (mathematics, biology, etc.). This proposed property would be much more similar to form of creative work (P7937) - but obviously we are not talking about "creative works" here I think, at least not as normally understood. Perhaps form of creative work (P7937) could be renamed/extended to support what is wanted here? But I think a separate property for this makes a lot of sense. ArthurPSmith (talk) 17:09, 23 September 2024 (UTC)
- I agree. So9q (talk) 11:03, 26 September 2024 (UTC)
- Support For several reasons:
- Agree with Jklamo, "instance of" is being used as a data dump and this property helps correct that
- Nikki's rationale for opposing would apply in most similar cases, and often taking action is debatable, but this is a very unusual case and strong action is merited. As the pie chart shows and Wikidata:Statistics further explains, Wikidata has a huge number of items which are "instance of -> scholarly article". Currently, these are sorted by properties which apply to publications generally, like genre (P136), or by adding additional items to instance of (P31). There are enough of these items to merit specific sorting through a dedicated property.
- I have contributed to the problem of adding imprecise data through Wikidata:WikiProject Clinical Trials, where I encouraged tagging scholarly articles about clinical trials as "instance of clinical trial". Although it is common to call such papers "clinical trials", the trial is actually the research experiment itself. There are probably other classes of items loaded into scholarly articles for similar reasons. Sorting this in a dedicated property enables better cleanup outside of P31.
- We can reasonably expect editor engagement with this property because WikiCite is a popular project, and everything discussed here is WikiCite-related. For additional context on the community and its projects see meta:WikiCite and Wikidata:WikiCite.
- The need to clean this up arises now because of Wikidata:SPARQL query service/WDQS graph split, which I explained in an English Wikipedia Signpost article, "Wikidata to split as sheer volume of information overloads infrastructure".
- genre (P136) is not quite a fit in some cases. Some kinds of papers that come up include clinical trial (Q30612), obituary (Q309481), product testing (Q7247798), and letter to the editor (Q651270). These are not already well developed as genres, and I do not think Wikidata should set a precedent of overloading the concept of genre with such a new application of the term. It makes sense to me to have a new term for Wikidata's needs which is not already loaded with meaning from existing disciplines.
- I do not have all the answers and I am uncertain about how all of this goes, so I encourage anyone to ask questions and critique these plans. I do agree that a problem exists, action is useful, and this proposal is the best idea I have heard for addressing it. I see no significant shortcomings to this idea except the newness and uncertainty, but I think this is the way. Bluerasberry (talk) 19:18, 23 September 2024 (UTC)
- Thank you for the detailed overview of the issues involved. <3 So9q (talk) 11:06, 26 September 2024 (UTC)
- Oppose instance of (P31) does the job well. ChristianKl ❪✉❫ 19:46, 23 September 2024 (UTC)
- Support. 慈居 (talk) 11:43, 24 September 2024 (UTC)
- Support My main concern with overloading instance of (P31) for scholarly articles is that it misses the intrinsic value that sister projects place on the quality of sourcing. As a concrete example, there is no extra value gained when looking at sub-categories like genre (P136) for a novel because one value is equivalent to any other. But for someone working with medical articles, there is a huge difference between a systematic review and a single study because the former is an acceptable source, while the latter is very unlikely to be. It is always preferable when developing tools to test the quality of sourcing to be able to simply differentiate between scholarly and non-scholarly sources, and then to drill down into the type of scholarly article in order to get an assessment of their quality. That job, whether done manually or by an automated tool, would be far easier if this proposal were accepted. --RexxS (talk) 10:46, 30 September 2024 (UTC)
- Comment I wonder if we should be thinking about this in the context of this recent RFC - Wikidata:Requests for comment/object vs design class vs functional class for manufactured objects? They are proposing a "design class" property for manufactured goods, but it seems to me quite similar to what this property proposes. Is there a broader principle applicable here (maybe a generalization of genre and related properties also)? ArthurPSmith (talk) 20:12, 30 September 2024 (UTC)
- @Charles Matthews, RexxS, 慈居, So9q, Bluerasberry, ArthurPSmith: @DCausse (WMF), ChristianKl, Prototyperspective, Thadguidry: Done as publication type of scholarly work (P13046) Regards, ZI Jony (Talk) 21:19, 3 October 2024 (UTC)