Wikidata talk:SPARQL query service/WDQS graph split/WDQS Split Refinement


property triples maybe unnecessary?

After more thought (and looking at your example) I don't see a whole lot of use for the property triples on the scholarly subgraph; I think we can leave them out. ArthurPSmith (talk) 14:41, 17 April 2024 (UTC)

Inconsistency with relation to theses. But also - what does this mean to me as a user?

I run the NZThesisProject. I naturally looked over the spreadsheet with that in mind and was surprised to find that you have included some types of thesis and not others. I presume this is an oversight as it makes no logical sense for a Doctor of Medicine thesis to be in one graph and a Doctor of Philosophy thesis in the other, for example.

Could I also make a plea for a plain-language (i.e. less technical) explanation of what this whole split means to me as a Wikidata user? (tip: I have no idea what a truthy graph is, and I'm going to need some excellent tutorials if I'm going to understand how those example federated queries work). I can understand the description of what you propose to do, more or less, but it is extremely unclear to me what it means for me as a user in practical terms, other than that every SPARQL query I have (and most of the tools I use) is going to need to be rewritten. How will I link authors, theses and their publications when they are in two different graphs? Is there anything I can do now that I will no longer be able to do under federation? Will tools like OpenRefine reconcile authors and publications the same way as now or will that change too? DrThneed (talk) 20:51, 17 April 2024 (UTC)

Yes that was definitely unintentional, I think I can edit the list, I'll take a look.
Only SPARQL queries that need data from both graphs would need rewriting. If you have some sample SPARQL queries that you are using now it might be helpful to share them here so that they can be assessed. The hope is that most users will only need data from the "main" subgraph, and so no SPARQL rewriting will be needed. The ones that need rewriting are those that fetch data both about articles and about things that are not articles. Does OpenRefine use SPARQL? Any application that is just talking directly to the regular (non-SPARQL) Wikidata APIs will not require any change. ArthurPSmith (talk) 12:56, 18 April 2024 (UTC)
Thank you that's really helpful. Re the queries, I have quite a few including Scholias and Histropedia timelines (https://www.wikidata.org/wiki/Wikidata:WikiProject_NZThesisProject/Dashboards_and_queries) ...although I think a lot of those might be alright if I've understood correctly. So a query like this:
Theses in the project that have a main subject that is an instance of a person https://w.wiki/6HyD
would not need rewriting, but if I wanted more information on the people items that are the subjects, e.g. their English description or sitelink, then I would need to adjust it?
And a query like this:
Thesis where the author is linked but not in the thesis project https://w.wiki/9pt5 would need rewriting? DrThneed (talk) 08:27, 21 April 2024 (UTC)
Thanks for the examples - I believe unfortunately both would need rewriting since they mix in the same query items from the scholarly graph and from the main graph. @DCausse (WMF) do we have theses included in the scholarly graph yet? Or perhaps just an illustration of how one would do this with the current split? ArthurPSmith (talk) 19:03, 22 April 2024 (UTC)
@DrThneed Thank you for your feedback. Proper documentation is indeed going to be key, and thank you for calling out that the current language lacks clear, less technical explanations; we will try to improve this.
@ArthurPSmith Regarding the queries, I have added them to Wikidata:SPARQL_query_service/WDQS_graph_split/Federated_Queries_Examples. The current experimental endpoint does not yet separate these publications out of the main graph, so they can't really be tested, but they seem simple enough that I don't anticipate any problems with them.
@DrThneed in the coming weeks we will put in place a page Wikidata:Request a query rewrite (in the same vein as Wikidata:Request_a_query but for requesting a rewrite). In my experience, after rewriting a dozen queries, the patterns and techniques used for rewriting are generally the same.
On OpenRefine, I'm not very knowledgeable, but this project does seem very generic and highly configurable, and I would be surprised if it had any predefined SPARQL queries in its source code. If it has SPARQL capabilities, I am pretty sure that the SPARQL query would have to be given by the user when setting up a project. DCausse (WMF) (talk) 08:49, 23 April 2024 (UTC)
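For illustration only, here is a rough sketch of what a rewrite might look like for a query that mixes both graphs: scholarly articles whose main subject is a person, together with the person's English description. It is not one of the queries linked above. The `wdsubgraph:scholarly_articles` service alias and its prefix are assumptions about how the scholarly endpoint would be addressed from the main one, and the sketch assumes the publication items end up in the scholarly subgraph while the person items stay in the main subgraph. The same pattern would apply to theses if they are moved.

```sparql
# Sketch only: intended to be run against the main-graph endpoint after the split.
# ASSUMPTION: wdsubgraph:scholarly_articles is the federation alias of the
# scholarly endpoint; substitute the real service URL if it differs.
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX schema: <http://schema.org/>
PREFIX wdsubgraph: <https://query.wikidata.org/subgraph/>

SELECT ?paper ?person ?personDescription WHERE {
  # Part 1: triples that would live in the scholarly subgraph.
  SERVICE wdsubgraph:scholarly_articles {
    ?paper wdt:P31 wd:Q13442814 ;   # instance of: scholarly article
           wdt:P921 ?person .       # main subject
  }
  # Part 2: triples that would stay in the main subgraph.
  ?person wdt:P31 wd:Q5 .           # instance of: human
  OPTIONAL {
    ?person schema:description ?personDescription .
    FILTER (LANG(?personDescription) = "en")
  }
}
LIMIT 100
```

A real query would narrow the SERVICE part (for example to one project's items) so that only a small result set crosses between the two endpoints.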

Choice of subclasses to include

My two cents, not being very knowledgeable on the technical matters behind the split: I am not quite convinced by the way scholarly articles and theses will be separated from the rest of the written production. From an academic perspective, notably the humanities, a book is still a very valid way to publish new knowledge. IMO a more robust and consistent solution, if harder to implement given the mess that are the subclasses of publication (Q732577) and article (Q191067), would be to put every manifestation (in the FRBR sense of it) of a written document in this separate graph. That would include books, encyclopedia entries, news articles, basically everything that you can read from a printed source (including "digital" print, but excluding everything handwritten such as manuscripts, inscriptions, letters, etc.). I think this would allow for more straightforward querying: should one need something published, they should include the subgraph, no matter if it's for the list of Shakespeare's editions (editions, not works mind you) or the latest papers on high energy physics. This would also allow a more consistent approach to retrieving references, because one can't know beforehand if the reference to a statement might be a book or an article. --Jahl de Vautban (talk) 08:04, 20 April 2024 (UTC)

Thank you for the feedback. I understand where you are coming from. This proposal would move quite a lot more Items that are widely used to the second graph. We are trying to avoid impacting too many reusers of the main graph so I fear that is not a good way forward overall. Lydia Pintscher (WMDE) (talk) 15:05, 24 April 2024 (UTC)
Given the discussion above, it looks like it impacts almost every tool I use and every query I have. So I'm not a fan! I would very much favour keeping theses and dissertations (which are regarded as unpublished items anyway, from an academic perspective) out of the split. DrThneed (talk) 23:38, 3 May 2024 (UTC)
@DrThneed I would like to reassure you that there will be a transition period (at least 6 months) during which the impacted use-cases will have time to adapt. We will provide as much support as we can to help the transition (mainly through the Wikidata:Request_a_query_rewrite page). DCausse (WMF) (talk) 08:16, 17 May 2024 (UTC)

Medical Scholarly Articles

While reviewing the list of types I found medical scholarly article (Q82969330), non-randomized controlled trial report (Q70471362) and multicenter study report (Q91901000), which could potentially be considered scientific publications. Should these be included as well? (cc @ArthurPSmith) DCausse (WMF) (talk)

Keep entities with sitelinks

In accordance with Wikidata:Notability point 1 ("It contains at least one valid sitelink"), I think a better split expression might be something like ?entity wdt:P31/wdt:P279* wd:Q13442814 . FILTER NOT EXISTS { ?article schema:about ?entity}. That would also start to cover the point 3 ("It fulfils a structural need"), even if that point might need more fine-tuning. Maxlath (talk) 14:50, 16 May 2024 (UTC)

@Maxlath Thanks for raising this. We briefly discussed using sitelinks to inform the nature of the split, but for some reason it did not make it to the list of suggested improvements, and I'm glad that you raise it so that we can have a conversation about it. From a technical standpoint I don't have objections to this idea:
  • A rule like [] schema:about ?entity can be implemented
  • It would not change the size of the splits much, with roughly 43,000 papers (only 0.09% of the 44,000,000 papers) moving from the scholarly subgraph to the main subgraph.
So I believe the discussion to have is regarding the overall usability. I'm probably not knowledgeable enough to make an informed judgement on this but here are the points I can think of:
  • Use-cases relying on scientific publications will always have to UNION both subgraphs
  • Use-cases relying on sitelinks (regardless of their nature) could continue to query solely the main subgraph
DCausse (WMF) (talk) 08:04, 17 May 2024 (UTC)
Yes this is a good point. Probably scholarly works will always need to do the UNION thing anyway just in case something has an unexpected instance of (P31) value. So I don't think this is so concerning, and if it's helpful for other use cases to keep some of them in the main graph that seems ok to me. ArthurPSmith (talk) 21:12, 17 May 2024 (UTC)
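As a rough sketch of "the UNION thing" discussed above: a query for scholarly articles on a given topic would look in the main subgraph (for papers kept there, e.g. because of a sitelink) and in the scholarly subgraph, then merge the results. The `wdsubgraph:scholarly_articles` alias and its prefix are assumptions about how the scholarly endpoint is addressed, and clinical trial (Q30612) is used as the topic only as an example.

```sparql
# Sketch only: intended to be run against the main-graph endpoint after the split.
# ASSUMPTION: wdsubgraph:scholarly_articles is the federation alias of the
# scholarly endpoint; substitute the real service URL if it differs.
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wdsubgraph: <https://query.wikidata.org/subgraph/>

SELECT DISTINCT ?paper WHERE {
  {
    # Papers that stayed in the main subgraph.
    ?paper wdt:P31 wd:Q13442814 ;   # instance of: scholarly article
           wdt:P921 wd:Q30612 .     # main subject: clinical trial
  }
  UNION
  {
    # Papers that moved to the scholarly subgraph.
    SERVICE wdsubgraph:scholarly_articles {
      ?paper wdt:P31 wd:Q13442814 ;
             wdt:P921 wd:Q30612 .
    }
  }
}
LIMIT 100
```

Both branches use the same pattern, so the rewrite is mechanical: the original pattern is duplicated and one copy is wrapped in SERVICE.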

Clinical trials

Around 390K items on Wikidata are instances of clinical trial (Q30612). Some of these are also instances of scholarly article (Q13442814) and some are not. Now, a clinical trial is not in itself a publication, but a report on it probably is: there is an ambiguity here. In some similar cases, there are two items, one of which is a publication type.

Apart from the ambiguity, splitting the clinical trial items into two parts could be awkward. I find just 284 are also scholarly article items, so the simple approach would be a fix for those. The property ClinicalTrials.gov ID (P3098) doesn't currently apply to any scholarly article items; and there are currently just two hits, perhaps anomalous, for ClinicalTrials.gov ID (P3098) and DOI (P356).

That would come down to saying that data about the clinical trial itself is not bibliographical data. That makes ontological sense. A typical and important potential application of WikiCite data, however, would be automated compilation of a corpus that could serve as the basis of a systematic review. Science librarians get involved in that process, which tends to involve trawling in multiple databases by topic and nature of the trials. Before just saying that clinical trial items stay in Wikidata and related publications are split out, it would be good to consider the requirements of federated queries in this area. Charles Matthews (talk) 08:08, 19 May 2024 (UTC)
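The two overlap figures mentioned above (items that are both clinical trials and scholarly articles, and items carrying both a ClinicalTrials.gov ID and a DOI) could be rechecked with something like the sketch below, run against the current, unsplit graph. It is not necessarily the query used to obtain those figures; the counts change as the data is edited, and the query may be slow given the number of items involved.

```sparql
# Sketch only: rechecks the two overlaps described above on the unsplit graph.
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

SELECT ?trialAndArticle ?trialIdAndDoi WHERE {
  {
    # Items that are instances of both clinical trial and scholarly article.
    SELECT (COUNT(DISTINCT ?a) AS ?trialAndArticle) WHERE {
      ?a wdt:P31 wd:Q30612 , wd:Q13442814 .
    }
  }
  {
    # Items with both a ClinicalTrials.gov ID (P3098) and a DOI (P356).
    SELECT (COUNT(DISTINCT ?b) AS ?trialIdAndDoi) WHERE {
      ?b wdt:P3098 ?nctId ;
         wdt:P356 ?doi .
    }
  }
}
```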

@Charles Matthews thanks for pointing this out. I agree that scientific articles declaring multiple P31s might pose some challenges: the clinical trials also declared as scholarly articles will be moved to the scholarly subgraph and thus not be findable in the main one, spreading clinical trial items across both graphs, which is not ideal. It seems that the solution is to disambiguate those by creating a separate entity. I've compiled a list of types that are used alongside scholarly articles to see if there are other instances of this issue; please see this spreadsheet. The corresponding list of items has been compiled here.
Regarding your concern about the requirements of federated queries in this area, would you have existing use-cases in mind (example queries we could look at)? DCausse (WMF) (talk) 09:25, 22 May 2024 (UTC)
@DCausse (WMF): I would need two answers to deal with everything there! Firstly, from the spreadsheet, I did some filtering with MeSH descriptor ID (P486) and MeSH tree code (P672), and the most important cases seem to be these 11: phase I clinical trial (Q5452194), clinical trial (Q30612), twin study (Q244775), observational study (Q818574), randomized controlled trial (Q1436668), NIH consensus development conference summary (Q27718083), phase II clinical trial (Q42824440), evaluation study (Q58898636), validation studies (Q58900694), consensus development conference proceedings (Q58900768), clinical study (Q58902670). These are characterised by having a P672 value with prefix V03, and also some other value with a prefix such as E, L or N. The V03 prefix says they are suitable as an object of instance of (P31), and the other prefix indicates they can be the object of a main subject (P921) statement, both types of statement possibly coming from a PubMed page. There are an additional 18 cases I find when I replace V03 by V02, maybe more because the P672 data may not yet be complete. Charles Matthews (talk) 11:55, 22 May 2024 (UTC)
I can be clearer with a query that finds 29 items. (URL shortening fails :-(.)
SELECT DISTINCT ?item ?itemLabel
WHERE {
     VALUES ?item {  wd:Q871232 wd:Q637866 wd:Q47461344 wd:Q193842 wd:Q5690540 wd:Q5437326 wd:Q732577 wd:Q5246046 wd:Q69488 wd:Q333291 wd:Q21481766
                                wd:Q2352616 wd:Q3331189 wd:Q10870555 wd:Q309481 wd:Q122846871 wd:Q56478376 wd:Q1980247 wd:Q58901591 wd:Q265158 wd:Q30612 
                                wd:Q122847646 wd:Q191067 wd:Q21112633 wd:Q58632367 wd:Q58901470 wd:Q571 wd:Q7725634 wd:Q58898636 wd:Q830588 wd:Q36774 
                                wd:Q58900768 wd:Q54877584 wd:Q122644068 wd:Q4006 wd:Q604733 wd:Q234460 wd:Q651270 wd:Q5003624 wd:Q1711593 wd:Q878041 wd:Q1143604 
                                wd:Q95977810 wd:Q123584446 wd:Q13433827 wd:Q1517777 wd:Q108070213 wd:Q11826511 wd:Q60712335 wd:Q591041 wd:Q95988374 wd:Q106140535 
                                wd:Q101072613 wd:Q605175 wd:Q36279 wd:Q746654 wd:Q133492 wd:Q21680312 wd:Q109229154 wd:Q55915575 wd:Q1962297 wd:Q19389637 wd:Q1172284 
                                wd:Q108386385 wd:Q20540385 wd:Q58900694 wd:Q1277575 wd:Q17518461 wd:Q3099732 wd:Q2376293 wd:Q5707594 wd:Q386724 wd:Q178651 wd:Q1784733 
                                wd:Q20655472 wd:Q352858 wd:Q121763407 wd:Q1631107 wd:Q1002697 wd:Q8054 wd:Q2412849 wd:Q20747295 wd:Q550089 wd:Q47114558 wd:Q694975 
                                wd:Q107013291 wd:Q128093 wd:Q86460068 wd:Q77253277 wd:Q17537576 wd:Q35760 wd:Q58898586 wd:Q49848 wd:Q8513 wd:Q1228945 wd:Q112983 
                                wd:Q21668810 wd:Q155207 wd:Q737498 wd:Q187947 wd:Q1298668 wd:Q818574 wd:Q1050259 wd:Q1630279 wd:Q2217301 wd:Q7553 wd:Q106645589 
                                wd:Q758901 wd:Q2267705 wd:Q2438528 wd:Q305178 wd:Q87167 wd:Q42240 wd:Q112193867 wd:Q108196115 wd:Q1047113 wd:Q850950 wd:Q17518557
                                wd:Q861911 wd:Q5 wd:Q603773 wd:Q7433672 wd:Q1436668 wd:Q88392887 wd:Q38926 wd:Q193495 wd:Q3719255 wd:Q4119870 wd:Q13136 wd:Q2146881 
                                wd:Q4184 wd:Q59259094 wd:Q82753 wd:Q4769616 wd:Q42848 wd:Q190084 wd:Q2085381 wd:Q11396303 wd:Q3055347 wd:Q2565355 wd:Q123177031 
                                wd:Q947859 wd:Q1667023 wd:Q670787 wd:Q170584 wd:Q21156247 wd:Q26840225 wd:Q73364223 wd:Q1348645 wd:Q2915731 wd:Q59908 wd:Q164666 
                                wd:Q12139612 wd:Q2668072 wd:Q96416347 wd:Q1238720 wd:Q277759 wd:Q2990839 wd:Q11862829 wd:Q28948553 wd:Q1391420 wd:Q96729626 wd:Q2136117 
                                wd:Q12310958 wd:Q62024811 wd:Q1734578 wd:Q35127 wd:Q5962346 wd:Q122636877 wd:Q60534442 wd:Q3346024 wd:Q1358138 wd:Q5633421 wd:Q42350535 
                                wd:Q17166051 wd:Q25839930 wd:Q567303 wd:Q1778788 wd:Q18168594 wd:Q74817647 wd:Q28869365 wd:Q1006160 wd:Q1787111 wd:Q190399 wd:Q3030248 
                                wd:Q58900805 wd:Q933348 wd:Q482 wd:Q65772760 wd:Q26944781 wd:Q428632 wd:Q5185279 wd:Q124622948 wd:Q12042160 wd:Q7432048 wd:Q55333737 
                                wd:Q17085509 wd:Q95000087 wd:Q193955 wd:Q65589911 wd:Q317623 wd:Q4202018 wd:Q16521 wd:Q2100278 wd:Q5268834 wd:Q1572600 wd:Q83790 
                                wd:Q244775 wd:Q1391417 wd:Q69699844 wd:Q18340514 wd:Q1294318 wd:Q1541005 wd:Q134995 wd:Q5146094 wd:Q105582462 wd:Q2020153 wd:Q80267 
                                wd:Q111448803 wd:Q12343820 wd:Q904997 wd:Q16324495 wd:Q166142 wd:Q18536349 wd:Q384515 wd:Q1834161 wd:Q131449 wd:Q187631 wd:Q11016 
                                wd:Q5687679 wd:Q42396623 wd:Q1714118 wd:Q28923 wd:Q5977147 wd:Q57268247 wd:Q2085515 wd:Q114834437 wd:Q165158 wd:Q118563234 wd:Q2106255 
                                wd:Q686822 wd:Q24685869 wd:Q747288 wd:Q218682 wd:Q11707 wd:Q106963809 wd:Q223729 wd:Q1279564 wd:Q4327689 wd:Q105763243 wd:Q111448685 
                                wd:Q8719053 wd:Q15416 wd:Q46337 wd:Q251212 wd:Q223638 wd:Q21358050 wd:Q961652 wd:Q106334491 wd:Q3239681 wd:Q54117920 wd:Q954845 
                                wd:Q1224889 wd:Q124653107 wd:Q62662439 wd:Q1376568 wd:Q618779 wd:Q11279204 wd:Q111124 wd:Q60186 wd:Q1438033 wd:Q1302249 wd:Q472342 
                                wd:Q116025148 wd:Q58902670 wd:Q960189 wd:Q6646911 wd:Q620615 wd:Q116235645 wd:Q70436236 wd:Q1057179 wd:Q622425 wd:Q58902427 wd:Q839954 
                                wd:Q115528532 wd:Q1410600 wd:Q128758 wd:Q1477856 wd:Q39911916 wd:Q1414531 wd:Q21905924 wd:Q45400320 wd:Q1762591 wd:Q7225113 wd:Q1436703 
                                wd:Q1018633 wd:Q1427116 wd:Q1200750 wd:Q108618539 wd:Q17737 wd:Q1377447 wd:Q8134 wd:Q192425 wd:Q42824440 wd:Q3691017 wd:Q7315176 wd:Q49850
                                wd:Q5172784 wd:Q215380 wd:Q1456936 wd:Q22908280 wd:Q4344852 wd:Q1445211 wd:Q413 wd:Q87917582 wd:Q18216009 wd:Q216526 wd:Q7189713 wd:Q26529
                                wd:Q4830453 wd:Q2602337 wd:Q2385804 wd:Q29063418 wd:Q220659 wd:Q17142652 wd:Q1156854 wd:Q859161 wd:Q1076968 wd:Q238354 wd:Q2940514
                                wd:Q56478588 wd:Q18359 wd:Q15629444 wd:Q212971 wd:Q2257880 wd:Q1164267 wd:Q7397 wd:Q836950 wd:Q39364723 wd:Q1141067 wd:Q757290 
                                wd:Q60534428 wd:Q1004 wd:Q11028 wd:Q19692233 wd:Q1053964 wd:Q18674739 wd:Q4785459 wd:Q625994 wd:Q942582 wd:Q811097 wd:Q476068 
                                wd:Q773668 wd:Q60920906 wd:Q104445146 wd:Q28924364 wd:Q213051 wd:Q121403963 wd:Q11633 wd:Q97012313 wd:Q123750979 wd:Q128406 wd:Q70447452 
                                wd:Q21293489 wd:Q106473769 wd:Q14204246 wd:Q8366 wd:Q1379672 wd:Q1053916 wd:Q5173771 wd:Q10753032 wd:Q9023538 wd:Q646754 wd:Q484692 
                                wd:Q3697781 wd:Q783521 wd:Q5452194 wd:Q959782 wd:Q10898227 wd:Q873506 wd:Q67035425 wd:Q602446 wd:Q170978 wd:Q1415275 wd:Q105422226 
                                wd:Q694134 wd:Q21004260 wd:Q1410069 wd:Q5159954 wd:Q3918409 wd:Q134307 wd:Q42750320 wd:Q442781 wd:Q21682525 wd:Q110419944 wd:Q11538 
                                wd:Q7936612 wd:Q8434 wd:Q90042395 wd:Q58854 wd:Q188952 wd:Q60797 wd:Q101116078 wd:Q560361 wd:Q431289 wd:Q30070590 wd:Q110156968 
                                wd:Q3208168 wd:Q98374854 wd:Q7210349 wd:Q56383918 wd:Q1784021 wd:Q28640 wd:Q59156132 wd:Q24033349 wd:Q2894989 wd:Q27718083 wd:Q52947181 
                                wd:Q45786140 wd:Q1447141 wd:Q56648531 }

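     # Keep only classes that have a MeSH descriptor ID and at least two MeSH tree codes.
     ?item wdt:P486 ?mesh ;
           wdt:P672 ?meshcode1 ;
           wdt:P672 ?meshcode2 .
     # One tree code in the V branch (covers both the V02 and V03 prefixes) ...
     FILTER (STRSTARTS(?meshcode1, "V"))
     # ... and another tree code in one of the E, J, K, L or N branches.
     FILTER (STRSTARTS(?meshcode2, "E") || STRSTARTS(?meshcode2, "J") || STRSTARTS(?meshcode2, "K") || STRSTARTS(?meshcode2, "L") || STRSTARTS(?meshcode2, "N"))

     SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }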
}
Charles Matthews (talk) 19:01, 22 May 2024 (UTC)
To note that biography (Q36279) should appear as a hit of the query, but doesn't because of deprecation of the MeSH tree code (P672) statement. Charles Matthews (talk) 10:05, 23 May 2024 (UTC)