Wikidata talk:SPARQL query service/WDQS graph split/WDQS Split Refinement
property triples maybe unnecessary?
After more thought (and looking at your example) I don't see a whole lot of use for the property triples on the scholarly subgraph, I think we can leave them out. ArthurPSmith (talk) 14:41, 17 April 2024 (UTC)
Inconsistency with relation to theses. But also - what does this mean to me as a user?
I run the NZThesisProject. I naturally looked over the spreadsheet with that in mind and was surprised to find that you have included some types of thesis and not others. I presume this is an oversight as it makes no logical sense for a Doctor of Medicine thesis to be in one graph and a Doctor of Philosophy thesis in the other, for example.
Could I also make a plea for a plain-language (i.e. less technical) explanation of what this whole split means to me as a Wikidata user? (tip: I have no idea what a truthy graph is, and I'm going to need some excellent tutorials if I'm going to understand how those example federated queries work). I can understand the description of what you propose to do, more or less, but it is extremely unclear to me what it means for me as a user in practical terms, other than that every SPARQL query I have (and most of the tools I use) is going to need to be rewritten. How will I link authors, theses and their publications when they are in two different graphs? Is there anything I can do now that I will no longer be able to do under federation? Will tools like OpenRefine reconcile authors and publications the same way as now or will that change too? DrThneed (talk) 20:51, 17 April 2024 (UTC)
- Yes that was definitely unintentional, I think I can edit the list, I'll take a look.
- Only SPARQL queries that need data from both graphs would need rewriting. If you have some sample SPARQL queries that you are using now it might be helpful to share them here so that can be assessed. The hope is that most users will only need data from the "main" subgraph, and so no SPARQL rewriting needed. The ones that need rewriting are those that fetch data both about articles and about things that are not articles. Does OpenRefine use SPARQL? Any application that is just talking directly to the regular (non-SPARQL) Wikidata APIs will not require any change. ArthurPSmith (talk) 12:56, 18 April 2024 (UTC)
- Thank you that's really helpful. Re the queries, I have quite a few including Scholias and Histropedia timelines (https://www.wikidata.org/wiki/Wikidata:WikiProject_NZThesisProject/Dashboards_and_queries) ...although I think a lot of those might be alright if I've understood correctly. So a query like this:
- Theses in the project that have a main subject that is an instance of a person https://w.wiki/6HyD
- would not need rewriting, but if I wanted more information on the people items that are the subjects, e.g. their English description or sitelink, then I would need to adjust it?
- And a query like this:
- Thesis where the author is linked but not in the thesis project https://w.wiki/9pt5 would need rewriting? DrThneed (talk) 08:27, 21 April 2024 (UTC)
- Thanks for the examples - I believe unfortunately both would need rewriting since they mix in the same query items from the scholarly graph and from the main graph. @DCausse (WMF) do we have theses included in the scholarly graph yet? Or perhaps just an illustration of how one would do this with the current split? ArthurPSmith (talk) 19:03, 22 April 2024 (UTC)
- @DrThneed Thank you for your feedback. Proper documentation is indeed going to be key, and thank you for calling out that the current language is lacking clear and less technical explanations; we will try to improve this.
- @ArthurPSmith Regarding the queries, I have added them to Wikidata:SPARQL_query_service/WDQS_graph_split/Federated_Queries_Examples. The current experimental endpoint does not yet separate these publications out of the main graph so they can't really be tested, but they seem simple enough that I don't anticipate any problems with them.
- @DrThneed in the coming weeks we will put in place a page Wikidata:Request a query rewrite (in the same vein as Wikidata:Request_a_query but for requesting a rewrite). In my experience, after rewriting a dozen queries, the patterns and techniques used for rewriting are generally the same.
- On OpenRefine, I'm not very knowledgeable but this project does seem very generic and highly configurable and I would be surprised if it had any predefined SPARQL queries in its source code. I think that if it has SPARQL capabilities I am pretty sure that the SPARQL query would have to be given by the user when setting up a project. DCausse (WMF) (talk) 08:49, 23 April 2024 (UTC)
- Hi, @DrThneed kindly pointed me to her thread here. I also have a thesis project here Wikidata:WikiProject LSEThesisProject which contains multiple queries linking our thesis metadata with other entities in Wikidata, including Histropedia Timelines and use of EntiTree. I have basic SPARQL query skills, so it was complex to produce all these queries (plenty with kind help from others in the community!). Will the Request a re-write query mean that we can make a request for all our queries to be re-written for us? How will it be possible to query across both graphs going forwards? I am just embarking on a new project around research visibility which relies on use of Scholia and being able to query data about individuals, institutions, topics and research outputs - but my technical Wikidata knowledge is limited and I'm wondering how possible it's going to be to proceed with our intended project with the data split across 2 graphs. Thanks. HelsKRW (talk) 09:44, 19 June 2024 (UTC)
- @HelsKRW Sorry for the late reply. We decided that we will re-use the existing Wikidata:Request_a_query to seek help regarding rewriting queries with federation; I think the purpose of this page is mainly to help query owners learn how to rewrite a query so that they become more autonomous in doing so. SPARQL is indeed quite complicated by nature and I agree that adding federation into the mix makes it particularly challenging. However, after rewriting a couple of queries the process becomes easier and the patterns and techniques are generally the same. For instance I rewrote two queries from the NZ Thesis Project (which I believe might share similar use-cases):
- Wikidata:SPARQL_query_service/WDQS_graph_split/Federated_Queries_Examples#Publications_in_a_WikiProject_(Q16695773)_that_have_a_main_subject_that_is_an_instance_of_a_person
- Wikidata:SPARQL_query_service/WDQS_graph_split/Federated_Queries_Examples#Publications_in_a_WikiProject_(Q16695773)_but_where_the_linked_author_is_not_in_that_project
- We will soon provide the required endpoints to start migrating existing queries, I quickly glanced at the queries of your project and I believe they might be quite straightforward to migrate.
- Regarding Scholia, I would suggest getting in touch with its maintainers directly to assess the feasibility of your project.
- I hope this helps. If you have more questions I would suggest contacting us directly on Wikidata:Report_a_technical_problem/WDQS_and_Search. This particular talk page was set up to seek feedback regarding the nature of the split, and since the feedback period ended last month it might no longer be monitored actively in the future. DCausse (WMF) (talk) 19:46, 27 June 2024 (UTC)
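For readers wondering what such a rewrite looks like in practice, here is a minimal sketch of the general pattern discussed in this thread: the part of the query about publications is sent to the scholarly subgraph through a SERVICE clause, and the part about people stays in the main graph. It is not the rewrite published on the examples page. The endpoint URL is an assumption (use whatever URL is announced for the scholarly subgraph), wd:Q16695773 is only the generic "WikiProject" placeholder from the example titles above and must be replaced by the project's own item, and the use of on focus list of Wikimedia project (P5008) to mark project membership is likewise an assumption about how a project tags its items.

```sparql
# Sketch only: meant to be run against the main-graph endpoint.
# ASSUMPTION: the SERVICE URL is illustrative; replace it with the
#             announced scholarly-subgraph endpoint.
# ASSUMPTION: wd:Q16695773 is a placeholder for the project's own item,
#             and P5008 is assumed to mark project membership.
SELECT ?publication ?subject ?subjectDescription WHERE {
  # Part 1: publications and their main subjects (scholarly subgraph).
  SERVICE <https://query-scholarly.wikidata.org/sparql> {
    ?publication wdt:P5008 wd:Q16695773 ;
                 wdt:P921 ?subject .
  }
  # Part 2: keep only subjects that are humans (main graph).
  ?subject wdt:P31 wd:Q5 .
  # Extra detail about the person, also from the main graph.
  OPTIONAL {
    ?subject schema:description ?subjectDescription .
    FILTER (LANG(?subjectDescription) = "en")
  }
}
```

The design point is that only the triples about publications move inside the SERVICE block; everything about the people items is written exactly as it would be today.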
Choice of subclasses to include
My two cents, not being very knowledgeable on the technical matters behind the split: I am not quite convinced by the way scholarly articles and theses will be separated from the rest of the written production. From an academic perspective, notably the humanities, a book is still a very valid way to publish new knowledge. IMO a more robust and consistent solution, if harder to implement given the mess that are the subclasses of publication (Q732577) and article (Q191067), would be to put every manifestation (in the FRBR sense of it) of a written document in this separated graph. That would include books, encyclopedia entries, news articles, basically everything that you can read from a printed source (including "digital" print, but excluding everything handwritten such as manuscripts, inscriptions, letters, etc.). I think this would allow for more straightforward querying: should one need something published, they should include the subgraph, no matter if it's for the list of Shakespeare's editions (editions, not works mind you) or the latest papers on high energy physics. This would also allow a more consistent approach to retrieving references, because one can't know beforehand if the reference to a statement might be a book or an article. --Jahl de Vautban (talk) 08:04, 20 April 2024 (UTC)
- Thank you for the feedback. I understand where you are coming from. This proposal would move quite a lot more Items that are widely used to the second graph. We are trying to avoid impacting too many reusers of the main graph so I fear that is not a good way forward overall. Lydia Pintscher (WMDE) (talk) 15:05, 24 April 2024 (UTC)
- Given the discussion above, it looks like it impacts almost every tool I use and every query I have. So I'm not a fan! I would very much favour keeping theses and dissertations (which are regarded as unpublished items anyway, from an academic perspective) out of the split. DrThneed (talk) 23:38, 3 May 2024 (UTC)
- @DrThneed I would like to reassure you that there will be a transition period (at least 6 months) during which the impacted use-cases will have the time to adapt. We will provide as much support as we can to help the transition (mainly through the Wikidata:Request_a_query_rewrite page). DCausse (WMF) (talk) 08:16, 17 May 2024 (UTC)
Medical Scholarly Articles
While reviewing the list of types I found medical scholarly article (Q82969330), non-randomized controlled trial report (Q70471362) and multicenter study report (Q91901000), which could potentially be considered scientific publications. Should these be included as well? (cc @ArthurPSmith) DCausse (WMF) (talk)
Keep entities with sitelinks
In accordance with Wikidata:Notability point 1 ("It contains at least one valid sitelink"), I think a better split expression might be something like
?entity wdt:P31/wdt:P279* wd:Q13442814 . FILTER NOT EXISTS { ?article schema:about ?entity }
That would also start to cover point 3 ("It fulfils a structural need"), even if that point might need more fine-tuning. Maxlath (talk) 14:50, 16 May 2024 (UTC)
- @Maxlath Thanks for raising this. We briefly discussed using sitelinks to inform the nature of the split but for some reason it did not make it to the list of suggested improvements, and I'm glad that you raised it so that we can have a conversation about it. From a technical standpoint I don't have objections to this idea:
- A rule like [] schema:about ?entity can be implemented
- It would not change the size of the splits much, with roughly 43,000 papers (only 0.09% of the 44,000,000 papers) moving from the scholarly subgraph to the main subgraph.
- So I believe the discussion to have is regarding the overall usability. I'm probably not knowledgeable enough to make an informed judgement on this but here are the points I can think of:
- Use-cases relying on scientific publications will always have to UNION both subgraphs
- Use-cases relying on sitelinks (regardless of their nature) could continue to query solely the main subgraph
- DCausse (WMF) (talk) 08:04, 17 May 2024 (UTC)
- Yes this is a good point. Probably scholarly works will always need to do the UNION thing anyway just in case something has an unexpected instance of (P31) value. So I don't think this is so concerning, and if it's helpful for other use cases to keep some of them in the main graph that seems ok to me. ArthurPSmith (talk) 21:12, 17 May 2024 (UTC)
- I think having to do the UNION just because of unexpected instance of values is actually a data quality issue we need to fix. I hope that the graph split actually is a forcing function in this direction and I'd like us not to have to work around it.
- I am personally dubious about the usefulness of this distinction on the sitelink level for the following reasons:
- It makes it harder to understand what is where, especially for less-informed reusers, who are not familiar with concepts like using sitelinks as a proxy for importance. Wikidata's data is already much harder to reuse than it should be and I think we need to try to keep the complexity of the split criteria very low to not make this even worse.
- I was under the impression that scientific articles very rarely have articles on Wikipedia but more on Wikisource for example. And in those cases it might not be as good a proxy for notability as for Wikipedia because other factors come into play.
- Is there anything I am missing? Lydia Pintscher (WMDE) (talk) 15:32, 24 May 2024 (UTC)
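To make "UNION both subgraphs" concrete, here is a minimal sketch of what such a query could look like when looking up a paper by DOI (P356) without knowing which subgraph holds it. The endpoint URL and the DOI value are assumptions used purely for illustration; they do not come from this discussion.

```sparql
# Sketch only: meant to be run against the main-graph endpoint.
# ASSUMPTION: the SERVICE URL is illustrative; replace it with the
#             announced scholarly-subgraph endpoint.
# ASSUMPTION: the DOI below is a made-up placeholder value.
SELECT ?paper ?doi WHERE {
  VALUES ?doi { "10.1000/EXAMPLE" }
  {
    # Branch 1: the paper stayed in the main graph
    # (for example because it has a sitelink).
    ?paper wdt:P356 ?doi .
  }
  UNION
  {
    # Branch 2: the paper was moved to the scholarly subgraph.
    SERVICE <https://query-scholarly.wikidata.org/sparql> {
      ?paper wdt:P356 ?doi .
    }
  }
}
```

Both branches repeat the same triple pattern; the only difference is where it is evaluated, which is why a sitelink-based rule adds this step for publication-centred use-cases.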
Clinical trials
Around 390K items on Wikidata are instances of clinical trial (Q30612). Some of these are also instances of scholarly article (Q13442814) and some are not. Now, a clinical trial is not in itself a publication, but a report on it probably is: there is an ambiguity here. In some similar cases, there are two items, one of which is a publication type.
Apart from the ambiguity, splitting the clinical trial items into two parts could be awkward. I find just 284 are also scholarly article items, so the simple approach would be a fix for those. The property ClinicalTrials.gov ID (P3098) doesn't currently apply to any scholarly article items; and there are currently just two hits, perhaps anomalous, for ClinicalTrials.gov ID (P3098) and DOI (P356).
That would come down to saying that data about the clinical trial itself is not bibliographical data. That makes ontological sense. A typical and important potential application, however, of WikiCite data, would be automated compilation of a corpus that could serve for the basis of a systematic review. Science librarians get involved in that process, which tends to involve trawling in multiple databases by topic and nature of the trials. Before just saying clinical trial items stay in Wikidata, related publications are split out, it would be good to consider the requirements of federated queries in this area. Charles Matthews (talk) 08:08, 19 May 2024 (UTC)
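For anyone wanting to re-check the size of the overlap mentioned above, a count along the following lines should work on the current, unsplit query service. This is a sketch and not necessarily the query used to obtain the figure of 284; it only counts truthy (best-rank) instance of (P31) statements, and the result will change as items are edited.

```sparql
# Count items that are both a clinical trial and a scholarly article.
SELECT (COUNT(DISTINCT ?item) AS ?count) WHERE {
  ?item wdt:P31 wd:Q30612 ;      # instance of: clinical trial
        wdt:P31 wd:Q13442814 .   # instance of: scholarly article
}
```

Replacing the COUNT with a plain SELECT ?item gives the list of items that would need fixing under the simple approach.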
- @Charles Matthews thanks for pointing this out. I agree that scientific articles declaring multiple P31s might pose some challenges: the clinical trials also declared as scholarly articles will be moved to the scholarly subgraph and thus not be findable in the main one, spreading some of those across both graphs, which is not ideal. It seems that the solution is to disambiguate those by creating a separate entity. I've compiled a list of types that are used alongside scholarly articles to see if there are other instances of this issue, please see this spreadsheet; the corresponding list of items has been compiled here.
- Regarding your concern about the requirements of federated queries in this area, would you have existing use-cases in mind (example queries we could look at)? DCausse (WMF) (talk) 09:25, 22 May 2024 (UTC)
- @DCausse (WMF): I would need two answers to deal with everything there! Firstly, from the spreadsheet, I did some filtering with MeSH descriptor ID (P486) and MeSH tree code (P672), and the most important cases seem to be these 11: phase I clinical trial (Q5452194), clinical trial (Q30612), twin study (Q244775), observational study (Q818574), randomized controlled trial (Q1436668), NIH consensus development conference summary (Q27718083), phase II clinical trial (Q42824440), evaluation study (Q58898636), validation studies (Q58900694), consensus development conference proceedings (Q58900768), clinical study (Q58902670). These are characterised by having a P672 value with prefix V03, and also some other value with prefix such as E, L or N. The V03 prefix says they are suitable as an object of instance of (P31), and the other prefix indicates they can be the subject of a main subject (P921) statement, both types of statement possibly coming from a PubMed page. There are an additional 18 cases I find when I replace V03 by V02, maybe more because the P672 data may not yet be complete. Charles Matthews (talk) 11:55, 22 May 2024 (UTC)
- I can be clearer with a query that finds 29 items. (URL shortening fails :-(.)
SELECT DISTINCT ?item ?itemLabel
WHERE {
VALUES ?item { wd:Q871232 wd:Q637866 wd:Q47461344 wd:Q193842 wd:Q5690540 wd:Q5437326 wd:Q732577 wd:Q5246046 wd:Q69488 wd:Q333291 wd:Q21481766
wd:Q2352616 wd:Q3331189 wd:Q10870555 wd:Q309481 wd:Q122846871 wd:Q56478376 wd:Q1980247 wd:Q58901591 wd:Q265158 wd:Q30612
wd:Q122847646 wd:Q191067 wd:Q21112633 wd:Q58632367 wd:Q58901470 wd:Q571 wd:Q7725634 wd:Q58898636 wd:Q830588 wd:Q36774
wd:Q58900768 wd:Q54877584 wd:Q122644068 wd:Q4006 wd:Q604733 wd:Q234460 wd:Q651270 wd:Q5003624 wd:Q1711593 wd:Q878041 wd:Q1143604
wd:Q95977810 wd:Q123584446 wd:Q13433827 wd:Q1517777 wd:Q108070213 wd:Q11826511 wd:Q60712335 wd:Q591041 wd:Q95988374 wd:Q106140535
wd:Q101072613 wd:Q605175 wd:Q36279 wd:Q746654 wd:Q133492 wd:Q21680312 wd:Q109229154 wd:Q55915575 wd:Q1962297 wd:Q19389637 wd:Q1172284
wd:Q108386385 wd:Q20540385 wd:Q58900694 wd:Q1277575 wd:Q17518461 wd:Q3099732 wd:Q2376293 wd:Q5707594 wd:Q386724 wd:Q178651 wd:Q1784733
wd:Q20655472 wd:Q352858 wd:Q121763407 wd:Q1631107 wd:Q1002697 wd:Q8054 wd:Q2412849 wd:Q20747295 wd:Q550089 wd:Q47114558 wd:Q694975
wd:Q107013291 wd:Q128093 wd:Q86460068 wd:Q77253277 wd:Q17537576 wd:Q35760 wd:Q58898586 wd:Q49848 wd:Q8513 wd:Q1228945 wd:Q112983
wd:Q21668810 wd:Q155207 wd:Q737498 wd:Q187947 wd:Q1298668 wd:Q818574 wd:Q1050259 wd:Q1630279 wd:Q2217301 wd:Q7553 wd:Q106645589
wd:Q758901 wd:Q2267705 wd:Q2438528 wd:Q305178 wd:Q87167 wd:Q42240 wd:Q112193867 wd:Q108196115 wd:Q1047113 wd:Q850950 wd:Q17518557
wd:Q861911 wd:Q5 wd:Q603773 wd:Q7433672 wd:Q1436668 wd:Q88392887 wd:Q38926 wd:Q193495 wd:Q3719255 wd:Q4119870 wd:Q13136 wd:Q2146881
wd:Q4184 wd:Q59259094 wd:Q82753 wd:Q4769616 wd:Q42848 wd:Q190084 wd:Q2085381 wd:Q11396303 wd:Q3055347 wd:Q2565355 wd:Q123177031
wd:Q947859 wd:Q1667023 wd:Q670787 wd:Q170584 wd:Q21156247 wd:Q26840225 wd:Q73364223 wd:Q1348645 wd:Q2915731 wd:Q59908 wd:Q164666
wd:Q12139612 wd:Q2668072 wd:Q96416347 wd:Q1238720 wd:Q277759 wd:Q2990839 wd:Q11862829 wd:Q28948553 wd:Q1391420 wd:Q96729626 wd:Q2136117
wd:Q12310958 wd:Q62024811 wd:Q1734578 wd:Q35127 wd:Q5962346 wd:Q122636877 wd:Q60534442 wd:Q3346024 wd:Q1358138 wd:Q5633421 wd:Q42350535
wd:Q17166051 wd:Q25839930 wd:Q567303 wd:Q1778788 wd:Q18168594 wd:Q74817647 wd:Q28869365 wd:Q1006160 wd:Q1787111 wd:Q190399 wd:Q3030248
wd:Q58900805 wd:Q933348 wd:Q482 wd:Q65772760 wd:Q26944781 wd:Q428632 wd:Q5185279 wd:Q124622948 wd:Q12042160 wd:Q7432048 wd:Q55333737
wd:Q17085509 wd:Q95000087 wd:Q193955 wd:Q65589911 wd:Q317623 wd:Q4202018 wd:Q16521 wd:Q2100278 wd:Q5268834 wd:Q1572600 wd:Q83790
wd:Q244775 wd:Q1391417 wd:Q69699844 wd:Q18340514 wd:Q1294318 wd:Q1541005 wd:Q134995 wd:Q5146094 wd:Q105582462 wd:Q2020153 wd:Q80267
wd:Q111448803 wd:Q12343820 wd:Q904997 wd:Q16324495 wd:Q166142 wd:Q18536349 wd:Q384515 wd:Q1834161 wd:Q131449 wd:Q187631 wd:Q11016
wd:Q5687679 wd:Q42396623 wd:Q1714118 wd:Q28923 wd:Q5977147 wd:Q57268247 wd:Q2085515 wd:Q114834437 wd:Q165158 wd:Q118563234 wd:Q2106255
wd:Q686822 wd:Q24685869 wd:Q747288 wd:Q218682 wd:Q11707 wd:Q106963809 wd:Q223729 wd:Q1279564 wd:Q4327689 wd:Q105763243 wd:Q111448685
wd:Q8719053 wd:Q15416 wd:Q46337 wd:Q251212 wd:Q223638 wd:Q21358050 wd:Q961652 wd:Q106334491 wd:Q3239681 wd:Q54117920 wd:Q954845
wd:Q1224889 wd:Q124653107 wd:Q62662439 wd:Q1376568 wd:Q618779 wd:Q11279204 wd:Q111124 wd:Q60186 wd:Q1438033 wd:Q1302249 wd:Q472342
wd:Q116025148 wd:Q58902670 wd:Q960189 wd:Q6646911 wd:Q620615 wd:Q116235645 wd:Q70436236 wd:Q1057179 wd:Q622425 wd:Q58902427 wd:Q839954
wd:Q115528532 wd:Q1410600 wd:Q128758 wd:Q1477856 wd:Q39911916 wd:Q1414531 wd:Q21905924 wd:Q45400320 wd:Q1762591 wd:Q7225113 wd:Q1436703
wd:Q1018633 wd:Q1427116 wd:Q1200750 wd:Q108618539 wd:Q17737 wd:Q1377447 wd:Q8134 wd:Q192425 wd:Q42824440 wd:Q3691017 wd:Q7315176 wd:Q49850
wd:Q5172784 wd:Q215380 wd:Q1456936 wd:Q22908280 wd:Q4344852 wd:Q1445211 wd:Q413 wd:Q87917582 wd:Q18216009 wd:Q216526 wd:Q7189713 wd:Q26529
wd:Q4830453 wd:Q2602337 wd:Q2385804 wd:Q29063418 wd:Q220659 wd:Q17142652 wd:Q1156854 wd:Q859161 wd:Q1076968 wd:Q238354 wd:Q2940514
wd:Q56478588 wd:Q18359 wd:Q15629444 wd:Q212971 wd:Q2257880 wd:Q1164267 wd:Q7397 wd:Q836950 wd:Q39364723 wd:Q1141067 wd:Q757290
wd:Q60534428 wd:Q1004 wd:Q11028 wd:Q19692233 wd:Q1053964 wd:Q18674739 wd:Q4785459 wd:Q625994 wd:Q942582 wd:Q811097 wd:Q476068
wd:Q773668 wd:Q60920906 wd:Q104445146 wd:Q28924364 wd:Q213051 wd:Q121403963 wd:Q11633 wd:Q97012313 wd:Q123750979 wd:Q128406 wd:Q70447452
wd:Q21293489 wd:Q106473769 wd:Q14204246 wd:Q8366 wd:Q1379672 wd:Q1053916 wd:Q5173771 wd:Q10753032 wd:Q9023538 wd:Q646754 wd:Q484692
wd:Q3697781 wd:Q783521 wd:Q5452194 wd:Q959782 wd:Q10898227 wd:Q873506 wd:Q67035425 wd:Q602446 wd:Q170978 wd:Q1415275 wd:Q105422226
wd:Q694134 wd:Q21004260 wd:Q1410069 wd:Q5159954 wd:Q3918409 wd:Q134307 wd:Q42750320 wd:Q442781 wd:Q21682525 wd:Q110419944 wd:Q11538
wd:Q7936612 wd:Q8434 wd:Q90042395 wd:Q58854 wd:Q188952 wd:Q60797 wd:Q101116078 wd:Q560361 wd:Q431289 wd:Q30070590 wd:Q110156968
wd:Q3208168 wd:Q98374854 wd:Q7210349 wd:Q56383918 wd:Q1784021 wd:Q28640 wd:Q59156132 wd:Q24033349 wd:Q2894989 wd:Q27718083 wd:Q52947181
wd:Q45786140 wd:Q1447141 wd:Q56648531 }
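  # Each item must have a MeSH descriptor ID and two MeSH tree code values.
  ?item wdt:P486 ?mesh ;
        wdt:P672 ?meshcode1 ;
        wdt:P672 ?meshcode2 .
  # One tree code must have the V prefix (publication type) ...
  FILTER (STRSTARTS(?meshcode1, "V"))
  # ... and another must have a topic prefix (E, J, K, L or N).
  FILTER (STRSTARTS(?meshcode2, "E") || STRSTARTS(?meshcode2, "J") || STRSTARTS(?meshcode2, "K") || STRSTARTS(?meshcode2, "L") || STRSTARTS(?meshcode2, "N"))
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }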
}
- Charles Matthews (talk) 19:01, 22 May 2024 (UTC)
- To note that biography (Q36279) should appear as a hit of the query, but doesn't because of deprecation of the MeSH tree code (P672) statement. Charles Matthews (talk) 10:05, 23 May 2024 (UTC)
- @Charles Matthews Thanks for investigating this, I must admit that I'm out of my depth regarding data modeling of the medical domain. Getting back to the specifics of the graph split:
- Can you confirm that the issue being discussed here can be resolved by updating a relatively low number of items (<1000)?
- Do you have suggestions to help improve the rules for splitting the graph?
- Thanks! DCausse (WMF) (talk) 07:53, 30 May 2024 (UTC)
- @DCausse (WMF): As it seems to me, there will be about 50 Wikidata items that will definitely need attention, because they are currently shared between a "publication type" sense, and a "topic" sense. It is easier to think about the graph assuming the decision is made to create a new property for "publication type of a scholarly article", which we could call "P12345". It should have its own database constraint, so that the object is always a publication type. The publication types with a MeSH tree code (P672) statement with prefix V should be the important ones.
- In those terms, the splitting of the instance of (P31) statements from your list could go like (1) migrate about 50 types of statement to "P12345", following a rule like the query above; (2) treat the rest of the instance of (P31) statements as the "long tail" of the problem, which would need to be looked at further.
- Anyway, the logic of the clinical trial issue is to create new items that will be used on the split-out graph, in a systematic way. Charles Matthews (talk) 08:08, 30 May 2024 (UTC)
Case analysis
@DCausse (WMF): To help with further discussion, I'm giving a case analysis of the top 20 objects from your spreadsheet. (Some of the very common cases, such as review article (Q7318358), systematic review (Q1504425) and meta-analysis (Q815382) have been excluded.) There is an item type of publication (Q39725049) that may turn out to be useful here: at present for the listed items it is there in a P31 statement only on scientific publication (Q591041).
- MeSH V only: editorial (Q871232) abstract (Q333291) expression of concern (Q56478376) comparative study (Q58901591) review (Q265158). This is the straightforward case where the P31 triple can be split out without change of object, which is indicated by MeSH to be a publication type.
- MeSH VL: book (Q571), 155 hits. In this case the object item needs to be split into two, and in the triple the object needs to be changed so that it is the new item, a publication type. The clinical trial case (284 hits) is similar to this one.
- MeSH L only: catalogue (Q2352616). This is a "needy" case (989 hits), which would be in the previous case except that the MeSH term "Catalog" has been deprecated. A new item should be created, but "Catalog" in MeSH is only for library catalogs as a publication type.
- Special topics: geological map (Q193842) and three types of National Institute for Occupational Safety and Health (Q60346) publications: Health Hazard Evaluation (Q5690540) Fatality Assessment and Control Evaluation (Q5437326) NIOSH Workplace Survey Report (Q122846871). These need decisions on whether the publications are scholarly articles, but may not otherwise be troublesome.
- General publication types: written work (Q47461344) academic chapter (Q21481766) review (Q265158) report (Q10870555) obituary (Q309481) chapter (Q1980247). These are not included in the MeSH system. They might be instances of type of publication (Q39725049), except the first. Probably written work (Q47461344) should be excluded in the split?
- Questionable usage: publication (Q732577) academic publishing (Q5246046) are unhelpful because more general than "scholarly article". MDMA (Q69488) is just a mistake, since it is a drug. I think the split should exclude the triples.
Matters arising from meeting 28 May
@DCausse (WMF): This is about actions that should be taken, according to the discussion we had. The fundamental questions are about creation of new items. The context is some modelling of data on the split graph.
I think the first point, coming from convenient modelling, is a new item for "publication type of scholarly article". It will be a subclass of type of publication (Q39725049). The defining feature of the other new items will be that they are instances of "publication type of scholarly article". So, for example for clinical trial (Q30612), there will be a new item that is the publication type for reports on clinical trials, and the English label could be "clinical trial report". Of course the English label is not the crux, here, but that this item will (a) be an instance of "publication type of scholarly article", (b) will be used on the split-out graph for scholarly articles, (c) will link to clinical trial (Q30612) at least with different from (Q66087861) (reciprocal link) to prevent future merges.
There needs to be further discussion about the business of reciprocal links and how federated queries will work. For example said to be the same as (Q66209246) reciprocal links.
From the point of view of MeSH descriptor ID (P486) and MeSH tree code (P672), the "new item" would have the MeSH descriptor ID (P486) value D016430 and the subject named as (P1810) qualifier "Clinical Trial", and also the MeSH tree code (P672) value V03.175.250. In general, the splitting of items would move the V-prefix MeSH tree code (P672) to the new items, leaving the other values on the old items.
I can undertake to do this work.
You asked how many items in total would be affected. The case of systematic review (Q1504425) alone will be very large.
I did work over the past few days with the query above, for the VALUES list as given - at the meeting you mentioned three (I think) additions to that list. This point should be clarified. review article (Q7318358) is very much used on scholarly article items, but has no MeSH statement. On the other hand review (Q265158) has a MeSH statement, and is unproblematic - no split needed in this case.
After my data work, for MeSH tree code (P672) prefix N, there are still 29 cases found by the query that need attention for the P31 statement list as given. I can go ahead with the work of splitting the items, therefore, as the first step.
We also discussed the creation of a new "publication type" property for use on the split-out graph, and who from the community could ask for its creation. This is important for the data modelling on the split-out graph. There is a "big picture" about rectifying the P31 statements at the time of the split. To comment on what I said at the meeting: I think the rectification will only be partial, at best, leaving many P31 statements on the split graph in a state of deprecation (of some kind). But I would argue that triples (Qid of A, P31, item1), with A subject to some rule, should be changed not to (Qid of A, P31, item2) with the new item2 determined from item1 by a rule, but to (Qid of A, new_property, item2).
In other words, the changes should use automation as much as possible, to clean up the data modelling, and the database constraints set on new_property will then become part of the charter of the split-out graph.
Of course we can discuss this all in greater detail.
Charles Matthews (talk) 08:51, 5 June 2024 (UTC)
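The triple rewrite proposed above, (Qid of A, P31, item1) to (Qid of A, new_property, item2), can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: the property ID `P_publication_type`, the new item ID `Q_clinical_trial_report`, and the `in_scope` rule are hypothetical placeholders, since neither the property nor the new items exist yet. Only clinical trial (Q30612) is taken from the discussion.

```python
from typing import Callable, Dict, List, Tuple

Triple = Tuple[str, str, str]  # (subject QID, property ID, object QID)

# Hypothetical placeholder: the "publication type" property has not been created.
NEW_PROPERTY = "P_publication_type"

# Hypothetical rule mapping item1 -> item2. Only Q30612 (clinical trial)
# comes from the discussion; the target ID is an invented placeholder.
ITEM1_TO_ITEM2: Dict[str, str] = {
    "Q30612": "Q_clinical_trial_report",
}


def rewrite_triples(
    triples: List[Triple],
    in_scope: Callable[[str], bool] = lambda subject: True,
) -> List[Triple]:
    """Replace (A, P31, item1) with (A, NEW_PROPERTY, item2) where a rule applies.

    `in_scope` stands in for "A subject to some rule"; by default every
    subject is treated as in scope. All other triples pass through unchanged.
    """
    rewritten = []
    for subject, prop, obj in triples:
        if prop == "P31" and obj in ITEM1_TO_ITEM2 and in_scope(subject):
            rewritten.append((subject, NEW_PROPERTY, ITEM1_TO_ITEM2[obj]))
        else:
            rewritten.append((subject, prop, obj))
    return rewritten
```

The point of the sketch is that the P31 value is not swapped for another P31 value: the statement moves to the new property, so constraints can then be set on that property alone.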
- @Charles Matthews thanks! A small clarification regarding the three additions I made: I added medical scholarly article (Q82969330), non-randomized controlled trial report (Q70471362) and multicenter study report (Q91901000) (that we initially missed in the spreadsheet) to be considered as scholarly articles (which account for 28 publications).
- Your suggestions do make sense to me but I feel that this talk page might no longer be the best place to carry out this work. I wonder if a subpage on Wikidata:WikiCite would not make more sense to detail and discuss the plan you suggest. I started Wikidata:SPARQL_query_service/WDQS_graph_split/Rules which is very minimal and briefly describes the issue by mentioning this discussion in the Known Issues section. I believe it would make sense to expand it with more details once we have this new page.
- From a technical perspective the most important thing is to update the rules once we have this new "publication type" property. We will likely keep both the rules based on P31 and this new "publication type" property while the data is being updated. Thanks! DCausse (WMF) (talk) 07:59, 6 June 2024 (UTC)
- @DCausse (WMF): OK, we can adjourn this discussion. Charles Matthews (talk) 09:23, 6 June 2024 (UTC)
- Just to bring closure to this thread, the discussion has resumed at Wikidata_talk:WikiCite#Community_input_into_WDQS_graph_split:_a_publication_type_property_proposal. DCausse (WMF) (talk) 08:47, 11 September 2024 (UTC)
Wikidata ontology related to the split
My concern is that part of the Wikidata ontology will end up being split away from the main part of Wikidata. How is this going to be avoided?
I note that currently there are two more subclasses of article (Q191067) in the entire graph than in the main graph - Halifax Harbor Contaminants of Concern (Q117745373) and Hospitalizações evitáveis em Sergipe: Uma Análise Econométrica (Q126964203). These are both scientific articles that have mistaken subclass links, but there is also the possibility of the reverse and thus of part of the Wikidata ontology being segregated to the split part. Peter F. Patel-Schneider (talk) 18:43, 22 September 2024 (UTC)
- The intent of these rules is to not split the ontology (here the subclass hierarchy of article (Q191067)).
- I could be wrong, but I doubt there are instances of an item in this hierarchy that we consider a scientific publication and that could also be considered/used as a class. If such cases exist we might have to take a closer look at them. I suspect that these must be mistakes like the two you found.
- I think this is a broader concern about Wikidata data-quality issues in general. Could there be some existing systems in Wikidata that might help to remediate/avoid those (WikibaseQualityConstraint? EntitySchemas?).
- For WDQS I agree that this makes the extraction of the subclass hierarchy (including data quality issues) more challenging. If your concern is more about detecting such problems using WDQS, I wonder if we could craft dedicated queries for this (by learning about the most common mistakes made with subclass of (P279) on scholarly articles)?
- (Also note that this page is now archived and I'd suggest using the talk page Wikidata:SPARQL_query_service/WDQS_graph_split/Rules which I think might have more visibility than this one).
- Thanks! DCausse (WMF) (talk) 08:11, 23 September 2024 (UTC)
- There seem to be 51 subclass of (P279) statements in the scholarly subgraph (query). Looking at a few samples I see two common problems:
- P279 used instead of main subject (P921)
- P279 used instead of instance of (P31)
- DCausse (WMF) (talk) 08:23, 23 September 2024 (UTC)
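As a rough illustration of the kind of check discussed above, here is a minimal Python sketch that flags subclass of (P279) statements whose subject is itself an instance of a scholarly-article class. It runs on an in-memory list of triples rather than against WDQS, and it is not the linked query. The function name, the triple representation, and the `article_classes` parameter are assumptions made for the example. It cannot tell which of the two problems applies (P279 in place of P921, or in place of P31); it only lists candidates for manual review.

```python
from typing import Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject QID, property ID, object QID)


def suspicious_subclass_statements(
    triples: Iterable[Triple],
    article_classes: Set[str],
) -> List[Triple]:
    """Return P279 statements made on items that are instances of an article class.

    `article_classes` is whatever set of class QIDs the split rules treat as
    scholarly articles; it is supplied by the caller, not hard-coded here.
    """
    all_triples = list(triples)
    # Subjects that have a P31 value in one of the article classes.
    articles = {s for s, p, o in all_triples if p == "P31" and o in article_classes}
    # A P279 statement on such a subject is a likely modelling mistake.
    return [t for t in all_triples if t[1] == "P279" and t[0] in articles]
```

The result preserves input order, so it can be reviewed alongside the original dump.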