Wikidata:Requests for permissions/Bot/OpenCitations Bot
The following discussion is closed. Please do not modify it. Subsequent comments should be made in a new section. A summary of the conclusions reached follows.
- Approved--Ymblanter (talk) 10:03, 15 September 2021 (UTC)[reply]
OpenCitations Bot edit
OpenCitations Bot (talk • contribs • new items • new lexemes • SUL • Block log • User rights log • User rights • xtools)
Operator: Csisc (talk • contribs • logs)
Task/s: Adding references and citation data for scholarly publications found in Wikidata using Wikidata tools and OpenCitations.
Code: [1]
Function details:
- This bot extracts the Wikidata IDs and DOIs of scholarly publications from the list created by James Hare. It then uses the REST API of OpenCitations to retrieve the DOIs of the references of each publication. If a reference is available in Wikidata, its DOI is converted to the corresponding Wikidata ID using the Wikidata Hub tool. Finally, the bot adds citation links between the available research papers as cites work (P2860) relations.
- The License of OpenCitations is CC0.
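The pipeline described above can be sketched roughly as follows. This is an illustrative sketch, not the bot's actual code: the COCI endpoint is the documented OpenCitations v1 REST API, DOI-to-QID resolution is shown via a WDQS SPARQL lookup rather than the Hub tool, and the helper names are invented for this example.

```python
import requests

COCI_API = "https://opencitations.net/index/coci/api/v1/references/"
WDQS = "https://query.wikidata.org/sparql"

def referenced_dois(doi):
    """Fetch the DOIs cited by a publication via the OpenCitations COCI API."""
    resp = requests.get(COCI_API + doi, timeout=30)
    resp.raise_for_status()
    return [row["cited"] for row in resp.json()]

def qid_from_uri(uri):
    """Extract the QID from a Wikidata entity URI."""
    return uri.rsplit("/", 1)[-1]

def doi_to_qid(doi):
    """Resolve a DOI (P356) to a Wikidata QID, or None if not present."""
    query = 'SELECT ?item WHERE { ?item wdt:P356 "%s" } LIMIT 1' % doi.upper()
    resp = requests.get(WDQS, params={"query": query, "format": "json"},
                        timeout=30)
    bindings = resp.json()["results"]["bindings"]
    return qid_from_uri(bindings[0]["item"]["value"]) if bindings else None
```

DOIs are uppercased before the lookup because Wikidata stores P356 values in upper case by convention.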
--Csisc (talk) 13:23, 29 July 2020 (UTC)[reply]
- We had another bot doing this work a while ago, is it no longer operational? Or was there a reason it stopped? Also, since each article usually has a dozen or more references, sometimes many times that, it would be better to add the references in a single update, rather than one at a time as would be necessary through QuickStatements. Other than that, yes it would be good to get this added to Wikidata. ArthurPSmith (talk) 17:51, 29 July 2020 (UTC)[reply]
- @ArthurPSmith: What I found out is that there are many publications in Wikidata not linked to their reference publications although reference data about them are available in OpenCitations. I can restrict the work to scholarly publications not having any reference using SPARQL. --Csisc (talk) 09:34, 30 July 2020 (UTC)[reply]
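A SPARQL query along these lines could select the subset mentioned — scholarly articles with a DOI but no cites work statement. This is only a sketch; as discussed later in this thread, such a query may well time out on the full graph without a LIMIT or a dump-based approach.

```python
# Sketch: scholarly articles (instance of Q13442814) that have a
# DOI (P356) but no "cites work" (P2860) statement yet.
NO_REFERENCES_QUERY = """
SELECT ?item ?doi WHERE {
  ?item wdt:P31 wd:Q13442814 ;
        wdt:P356 ?doi .
  FILTER NOT EXISTS { ?item wdt:P2860 ?cited . }
}
LIMIT 1000
"""
```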
- @ArthurPSmith: We had User:Citationgraph bot and User:Citationgraph bot 2 work on this. Both stopped operating in 2018, since their operator, User:Harej, had rearranged his priorities. Yes, it would make sense to add all cites work (P2860) statements for an item in one go, e.g. via Wikidata Integrator. Not sure how the bot should handle citations of things for which Wikidata does not have an entry yet — perhaps with "no value" and "stated as", so that the information can be converted later as needed. --Daniel Mietchen (talk) 09:02, 7 September 2020 (UTC)[reply]
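Adding all statements for an item in one go means building a single edit payload rather than one QuickStatements edit per reference. A sketch of what that payload could look like, using the Wikibase claim JSON shape accepted by the wbeditentity API (the helper name is invented for illustration):

```python
def build_cites_work_claims(cited_qids):
    """Build a wbeditentity payload adding one cites work (P2860)
    statement per cited item, so all references land in a single edit."""
    return {"claims": [
        {"mainsnak": {"snaktype": "value",
                      "property": "P2860",
                      "datavalue": {"type": "wikibase-entityid",
                                    "value": {"entity-type": "item",
                                              "id": qid}}},
         "type": "statement",
         "rank": "normal"}
        for qid in cited_qids]}
```

Libraries such as Wikidata Integrator wrap this payload construction, but the one-edit-per-item principle is the same either way.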
- @ArthurPSmith: @Daniel Mietchen: Just wanted to mention that Citationgraph bot seems to be back online and Citationgraph bot 2 would follow along soon. I wish we had something like Scroll To Text Fragment widely supported as a web standard, or some paragraph-based anchoring, so I could point to the exact paragraph in this long thread (look for "Harej" there instead). --Diegodlh (talk) 22:05, 15 February 2021 (UTC)[reply]
- @Csisc: Hi! I understand this was part of the Wikicite grant proposal you presented last year. I'm sorry it wasn't approved. Do you plan developing the bot anyway? Now that Elsevier has made their citations open in Crossref, I understand COCI coverage will see a dramatic increase next time it is published (last time was 07 Dec 20, before Elsevier's announcement). Thank you! --Diegodlh (talk) 04:53, 28 January 2021 (UTC)[reply]
- Diegodlh: Of course, I still intend to develop the bot. However, we need a server to host it. If the bot can be hosted, I do not mind developing it. Elsevier's agreement to include its citation data in the OpenCitations corpus will certainly allow trustworthy coverage of citation data in the Wikidata graph. --Csisc (talk) 12:55, 31 January 2021 (UTC)[reply]
- Hi, @Csisc:! Thanks for answering. Sorry, I'm relatively new at this. Could it not be hosted on Toolforge? --Diegodlh (talk) 18:49, 1 February 2021 (UTC)[reply]
- @Diegodlh: I am looking into this. The issue with Toolforge is that the cloud can easily be blocked. --Csisc (talk) 12:25, 2 February 2021 (UTC)[reply]
- I am very excited about this project. The reason my old bot shut down was, among other factors, the scaling issues. I was no longer able to get a reliable mapping of Wikidata items and DOIs from the Wikidata Query Service. The use of WDumper addresses that nicely. For data sources I also recommend PubMed Central. Harej (talk) 21:40, 9 September 2020 (UTC)[reply]
- Please develop the code and make some test edits.--Ymblanter (talk) 19:30, 10 September 2020 (UTC)[reply]
- Just as an observation, I have been trying to produce a dump of DOIs on Wikidata; the task has yet to complete after seven days and, as of writing, looks set to take months. However, I am developing an alternative strategy for producing lists of identifiers and hope to share more later. Harej (talk) 22:51, 6 October 2020 (UTC)[reply]
- I have generated a dataset of Wikidata items with DOIs as of the 20 August 2020 dump. This should definitely help you get started. Harej (talk) 21:51, 7 October 2020 (UTC)[reply]
- Ymblanter, Harej: Thank you for your answers. I will take your comments into account and develop the bot over the coming months. --Csisc (talk) 12:55, 31 January 2021 (UTC)[reply]
- Great, I will be looking forward.--Ymblanter (talk) 20:09, 31 January 2021 (UTC)[reply]
- @Diegodlh, Ymblanter, Harej, Daniel Mietchen: I am sorry for the delay. I am honoured to inform you that I have released a new edition of the source code for this bot where I solved all the critical issues with User:So9q. Please find the source code at https://github.com/csisc/OpenCitations-Bot. --Csisc (talk) 23:36, 4 August 2021 (UTC)[reply]
- Oppose While I would really like to see Wikidata getting richer and more useful for scientists and others, I oppose this bot because it could create millions of new items which the WDQS infrastructure currently cannot handle. Even if no new items are created, it will still likely add millions of statements to existing scholarly items. I recommend that the community wait for [2] to be resolved (which might take a year or more, since WMF teams are currently busy with other tasks) before this bot is approved. @multichill, Lydia_Pintscher_(WMDE):--So9q (talk) 13:25, 5 August 2021 (UTC)[reply]
- I agree we should not batch-create items by bot at this time. However, the addition of statements to existing items should not be an issue. I asked the Wikimedia Foundation's search platform engineers (who run WDQS) and they do not think bots in general need to be curtailed at this time. So a bot that adds statements without creating new items should be fine. Harej (talk) 14:27, 11 September 2021 (UTC)[reply]
- @So9q, multichill, Lydia Pintscher (WMDE), Diegodlh, Ymblanter, Harej: Based on the proposal by Egonw, I am willing to restrict the work of the OpenCitations Bot to adding citations between existing Wikidata items about scholarly publications. This would avoid the WDQS scaling problem. --Csisc (talk) 12:12, 11 September 2021 (UTC)[reply]
- I updated the function description accordingly: https://www.wikidata.org/w/index.php?title=Wikidata:Requests_for_permissions/Bot/OpenCitations_Bot&type=revision&diff=1496661429&oldid=1496639604&diffmode=source --Egon Willighagen (talk) 11:04, 12 September 2021 (UTC)[reply]
- Support Now that it no longer creates items and only enriches existing ones, I can only give my full support. Until we have full CiTO annotation, we do not know why articles are cited. But an article that is cited a lot by other notable articles sounds by definition notable to me. Independent of that, each scholarly reference in Wikipedia is part of the collection of notable knowledge, yet each cited article omits a lot of knowledge to keep itself concise. An article is not complete without the articles it derives ideas from, uses data from, and uses methods from. This is demonstrated by the citation network that this bot introduces. Without these (open) citations, we cannot fulfill our Wikipedia's Ongoing Search for the Sum of All Human Knowledge ambition. By not creating new items, it addresses the above-mentioned problem. --Egon Willighagen (talk) 10:28, 12 September 2021 (UTC)[reply]
- I will approve the bot in a few days provided no objections have been raised.--Ymblanter (talk) 14:11, 12 September 2021 (UTC)[reply]
- I just want to point out that according to @Egon Willighagen: on Telegram we have 37M scientific articles missing "cites work", and if we say that each references 10 others on average and that each statement takes a few triples, we are looking at potentially 37M × 10 × 3 = 1.11B new triples for Blazegraph to handle. @MPham (WMF): This might be the event that pushes us to enact some of the emergency measures (like removing all descriptions from WDQS) to keep the backend up and running. I also just finished a new tool to add main subject (P921) to millions of scientific articles, which will result in a similar number of new triples (although not quite as quickly as a bot).--So9q (talk) 15:20, 12 September 2021 (UTC)[reply]
- No, that's not what I said (or at least not what I meant). There are 37M scholarly articles (according to https://scholia.toolforge.org/statistics). The query to calculate the subset that has no cites work (P2860) statements times out, so I have not been able to calculate that yet. --Egon Willighagen (talk) 15:27, 12 September 2021 (UTC)[reply]
- Oh, ok, I could calculate that myself from the dump in PAWS, but it would take a while. One strategy could be to do it in Python: read the RDF looking for triples of the form x wdt:P2860 y and add each x to a set, then count the length of the set at the end and compare it to 37M.--So9q (talk) 05:27, 15 September 2021 (UTC)[reply]
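The dump-scanning strategy described could look roughly like this. A sketch only: it assumes the dump is streamed as N-Triples lines and uses the truthy-statement predicate URI for P2860 (http://www.wikidata.org/prop/direct/P2860).

```python
def count_items_with_cites_work(ntriples_lines):
    """Count distinct subjects that have at least one truthy
    cites work (P2860) triple in an N-Triples stream."""
    P2860 = "<http://www.wikidata.org/prop/direct/P2860>"
    subjects = set()
    for line in ntriples_lines:
        # N-Triples: subject, predicate, then object plus trailing " ."
        parts = line.split(" ", 2)
        if len(parts) == 3 and parts[1] == P2860:
            subjects.add(parts[0])
    return len(subjects)
```

Comparing this count against the 37M total would give the size of the subset that already has references, without touching WDQS.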
- I just proposed this in Project chat: https://www.wikidata.org/wiki/Wikidata:Project_chat#Discourage_adding_descriptions_to_scientific_articles? to make bots like this (which add millions of valuable statements) less likely to cause a catastrophic failure.--So9q (talk) 09:56, 15 September 2021 (UTC)[reply]
- Support I prefer the content added by this bot over having descriptions if we end up in a situation where we have to choose, so go ahead @Ymblanter, Csisc:!--So9q (talk) 09:59, 15 September 2021 (UTC)[reply]