Wikidata talk:SPARQL query service/WDQS backend update/Blazegraph failure playbook

Details of scholarly article deletion

Scholarly articles are used as sources, typically using stated in (P248). Would any triple involving a removed article be eliminated, too? I guess one could analyze the impact of keeping articles that are used as sources. Toni 001 (talk) 17:58, 10 December 2021 (UTC)

Perhaps the scholarly articles could be divided into two groups: those which are linked from at least one item outside of the subgraph and those which are not. Deleting only the latter from Blazegraph would still provide a substantial portion of the ~3.75 years of added time while further minimizing the number of queries impacted.
--Quesotiotyo (talk) 22:31, 10 December 2021 (UTC)
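
The two-group split suggested above can be illustrated with a minimal sketch. It assumes a toy model in which every triple directly connects two items; in the real RDF export, references and qualified statements hang off intermediate nodes that would first have to be resolved to their owning item. The `Triple` type, the `partition_articles` helper and the item IDs (`X1`, `A1`, ...) are hypothetical and used for illustration only.

```python
from typing import Iterable, NamedTuple, Set, Tuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


def partition_articles(
    triples: Iterable[Triple], articles: Set[str]
) -> Tuple[Set[str], Set[str]]:
    """Split articles into (linked from outside the subgraph, not linked).

    An article counts as externally linked when at least one triple points
    to it from a subject that is not itself a scholarly article.
    """
    externally_linked = {
        t.obj
        for t in triples
        if t.obj in articles and t.subject not in articles
    }
    return externally_linked, set(articles) - externally_linked


# Hypothetical data: X1 is a non-article item, A1..A3 are articles.
toy_triples = [
    Triple("X1", "P248", "A1"),   # outside item cites A1 as a source
    Triple("A2", "P2860", "A1"),  # article-to-article citation
    Triple("A2", "P2860", "A3"),  # article-to-article citation
]
keep, removable = partition_articles(toy_triples, {"A1", "A2", "A3"})
print(sorted(keep), sorted(removable))
```

Under this toy model only `A1` would stay in Blazegraph, since `A2` and `A3` are referenced solely from inside the scholarly subgraph.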

Questions about costs

What would it cost to develop software to fix this problem?

More like US$1 million, 10 million, 50 million, 100 million?

The Wikimedia Foundation expects a budget of US$1 billion every five years. I hope that a small amount of money is not the bottleneck. Has anyone estimated the amount of development and labor and costs to address the problem? I am curious if anyone involved here has thought about costs.

Has anyone estimated costs to community?
  1. How many community stakeholders are there in scholarly citations, astronomical objects, and Wikimedia categories?
  2. How many contributor hours will be lost by cutting WDQS for these communities?

I am curious about the extent to which anyone at the Wikimedia Foundation has measured or estimated community value for these things. We have had several meta:WikiCite conferences attended by hundreds of people, and also have about 100 university collaborations at the institution level some of which are listed at Wikidata:WikiProject_PCC_Wikidata_Pilot/Participants. Those are valuable partnerships which would be hard to replace, and which probably cost several hundred thousand dollars to establish with many volunteers acting online and locally over years. If we were to hire paid workers to replace the lost labor, then the projects which are measurable and identifiable might cost $10 million+ to replace, and that does not take into account disruption to a lot of contributors and users who will be confused.

Is there anyone associated with the failure playbook who has published consideration of community impact? Blue Rasberry (talk) 19:59, 13 December 2021 (UTC)

Thank you for bringing up these important concerns, and I'll try to address them as best as possible.
  1. Financial cost. We do not have an estimate yet on the financial cost of scaling Blazegraph, as there are lots of unknowns currently that make it hard to predict. However, let's assume that it would minimally require a team of 3 engineers alongside 1 product manager approximately 2-3 years to do the migration work -- one can estimate an order of magnitude for compensation to be in at least the single digit millions before taking into account other overhead like material costs. While the Search team, who has been working on WDQS at the expense of improving search and relevance features, does not have direct control over how WMF funds are allocated, we are advocating for a dedicated team (with dedicated funding) to address WDQS scaling issues in the long term.
  2. Community cost. Estimating community cost is a difficult task, especially as WDQS currently allows anonymous usage, which makes it hard to get a good estimate of the number of community stakeholders. At WikidataCon this year, we presented some data from the user survey showing that roughly 36% of respondents self-reported academic intents, while fewer than 1% of queries relate to the scholarly articles subgraph. These measures of community cost aren't perfect, but they seem to imply that removing scholarly articles in the event of catastrophic failure would minimally affect user queries. Meanwhile, new edits/additions to scholarly articles would continue to be stored in Wikidata itself (without needing to be loaded into Blazegraph). To rephrase: the scholarly articles data would not be lost forever, and would be reloaded/reconnected to the graph when our infrastructure is capable of handling this.
While the community impact for scholarly articles is hard to estimate, the alternative, not having a playbook, means that all community members will be negatively affected by WDQS being non-functional. It is not easy or desirable to single out anybody's work in this way, but we hope not to take these emergency measures and, at worst, see them as temporary tactics to ensure that WDQS remains as functional as possible. MPham (WMF) (talk) 10:31, 14 December 2021 (UTC)
@MPham (WMF): Thanks for the reply. I will think about all this for a while and discuss that 30-minute video with others. The cost and time estimate for fixing the problem is helpful. For the rest I need to re-read and talk it over. Thanks. Blue Rasberry (talk) 23:38, 14 December 2021 (UTC)

Cost of transitioning to a supported graph db

Hello @MPham (WMF):, thanks for the above and other replies about this lately.

In some conversations over the past months, it was implied to be certain that we would move away from Blazegraph to an alternative that has an active community and support network. What would it cost in time and focus to do that? Other than the open-ended research challenge of evaluating alternatives, how would a decision to switch away be made?

Many other decisions about optimizing or tuning performance seem dependent on this decision, so it may be worth more than it seems at first to transition promptly before doing [and then redoing] that other work. Sj (talk) 19:18, 11 January 2022 (UTC)

Thanks for the question. We do not have an exact cost estimate now other than a very rough prediction that the actual migration work from Blazegraph for a dedicated team should take at least ~2 years. As of now, it is almost certain that we will be moving away from Blazegraph due to its limitations and the fact that it is no longer being actively developed. We'll be working with a graph consultant during the first half of 2022 to try to determine the best option(s) for moving forward, including regular communication about our decision-making process and criteria for choosing an alternative. Transitioning to a more scalable graph backend is a high priority for WMF and WMDE, but there are some limits to how quickly we can do this transition, unfortunately. MPham (WMF) (talk) 20:57, 13 January 2022 (UTC)
Will the progress and the list of evaluated options be published somewhere as the evaluation of alternatives happens? Also, what about some fundamental questions, like how important the SPARQL query service is vs. some other query service in some other language/syntax? For example, TerminusDB seems to be an active open source alternative, but it seems they have some negative views on RDF and are providing an alternative. Mitar (talk) 23:48, 26 January 2022 (UTC)
@Mitar: This evaluation has since been published; perhaps it covers some of what you want to know. GreenReaper (talk) 07:15, 9 November 2022 (UTC)

Exclude properties instead of item types

Hi, has anybody checked the potential effect of excluding some properties from Blazegraph? For example, cites work (P2860). This might be preferable because it is hard to predict which SPARQL requests will be affected by excluding the scholarly articles subgraph. By contrast, excluding P2860 will affect a strictly defined set of requests. Users will know: if a request contains P2860, then they need to use some other means of data access instead of SPARQL. Also, the SPARQL endpoint could fail instead of returning wrong (incomplete or empty) results for requests that contain P2860. -- Ivan A. Krestinin (talk) 23:54, 27 December 2021 (UTC)

@Ivan A. Krestinin This seems like a much better way of going about this, even if only initially and then working up to removing the whole item if need be. According to the Scholia stats page, there are currently nearly 300 million citations; if you add up all of the references for those statements, that's got to make a dent! Personally, I'd prefer the option below (deleting duplicate descriptions), but if space still needs to be made, getting rid of the scholarly article citation graph while keeping the items would be far less disruptive overall. Good shout 👍 Aluxosm (talk) 14:47, 17 February 2022 (UTC)
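
The "fail instead of returning incomplete results" idea from this thread could look roughly like the sketch below. It is an illustration only, not how WDQS works: the names `EXCLUDED_PROPERTIES`, `excluded_properties_in_query` and `reject_if_excluded` are hypothetical, and the check is a plain text scan rather than a real SPARQL parser, so it would also flag a property ID that appears in a comment or string literal and would miss queries that reach the property through a variable predicate.

```python
import re
from typing import FrozenSet, Set

# Assumption for illustration: the set of properties no longer loaded.
EXCLUDED_PROPERTIES: FrozenSet[str] = frozenset({"P2860"})

# A property ID as a standalone token, whatever prefix precedes it
# (wdt:P2860, p:P2860, ps:P2860, or a full IRI ending in /P2860).
_PROPERTY_TOKEN = re.compile(r"\bP\d+\b")


class ExcludedPropertyError(ValueError):
    """Raised when a query mentions a property that is not loaded."""


def excluded_properties_in_query(
    query: str, excluded: FrozenSet[str] = EXCLUDED_PROPERTIES
) -> Set[str]:
    """Return the excluded property IDs mentioned in the query text."""
    return {tok for tok in _PROPERTY_TOKEN.findall(query) if tok in excluded}


def reject_if_excluded(query: str) -> None:
    """Fail loudly rather than silently returning incomplete results."""
    hits = excluded_properties_in_query(query)
    if hits:
        raise ExcludedPropertyError(
            "Query uses properties not available in the query service: "
            + ", ".join(sorted(hits))
        )


citing = "SELECT ?a ?b WHERE { ?a wdt:P2860 ?b } LIMIT 10"
other = "SELECT ?a WHERE { ?a wdt:P31 ?class } LIMIT 10"
print(excluded_properties_in_query(citing))
print(excluded_properties_in_query(other))
```

A check along these lines gives users the clear rule proposed above: a request that mentions P2860 is rejected outright, and everything else behaves as before.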

delete descriptions for categories, templates, disambiguation

An estimated 500 to 700 million triples are identical descriptions for these three types.

It's not really clear

  • why these have to be added as editable descriptions to Wikidata in the first place
  • or why they would be needed on Query Service. One could just stop exporting them to Query Service.

This is different from the suggestion in the playbook: exclude these items entirely.

See Wikidata:Project_chat#Typical_descriptions for discussion. --- Jura 14:31, 9 February 2022 (UTC)

This is perfectly reasonable, in my opinion, although I don't know whether it would relieve WDQS for long; it would be interesting to have an estimate. Currently we have nearly 4.9 M Wikimedia category (Q4167836) + nearly 1.4 M Wikimedia disambiguation page (Q4167410) + nearly 0.85 M Wikimedia template (Q11266439). --Epìdosis 16:16, 9 February 2022 (UTC)
Seems I underestimated it ..
For categories: 4,895,522 instances and 626,707,500 descriptions of maybe 684,826,700+16,965,500+? triples total. So a large percentage of triples for category items are probably descriptions.
For dab items: 1,379,094 instances and 114,675,000 descriptions.
For template items: 849,655 instances and 1,762,500 descriptions.
To reduce edits on these, maybe we should also look into this outside this scenario. --- Jura 17:13, 9 February 2022 (UTC)
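
The figures quoted above can be combined into a short worked calculation. The input numbers are taken from the comment; the variable names are illustrative. Because the category triple total is given with a trailing "+?", the sum used here is only a lower bound, so the resulting share of descriptions is an upper bound.

```python
# Instance and description counts as quoted in the comment above.
counts = {
    "category": {"instances": 4_895_522, "descriptions": 626_707_500},
    "disambiguation": {"instances": 1_379_094, "descriptions": 114_675_000},
    "template": {"instances": 849_655, "descriptions": 1_762_500},
}

# Description triples across all three item types.
total_descriptions = sum(c["descriptions"] for c in counts.values())

# Known part of the category triple total ("684,826,700+16,965,500+?").
category_triples_lower_bound = 684_826_700 + 16_965_500

# Upper bound on the share of category triples that are descriptions.
category_description_share = (
    counts["category"]["descriptions"] / category_triples_lower_bound
)


def descriptions_per_item(kind: str) -> float:
    """Average number of description triples per item of the given type."""
    return counts[kind]["descriptions"] / counts[kind]["instances"]


print(f"total descriptions: {total_descriptions:,}")
print(f"category share (upper bound): {category_description_share:.1%}")
for kind in counts:
    print(f"{kind}: {descriptions_per_item(kind):.1f} descriptions per item")
```

On these numbers the three types carry about 743 million description triples in total, above the initial 500 to 700 million estimate, and descriptions make up at most roughly 89% of the counted category triples.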

Set up a Blazegraph server with these triples deleted already?

In the event of a big failure, reloading data will take many days. And only after several reload failures will we realize that we have to delete some data. This will translate to a very long downtime for everybody. We need to prepare right now. Also, I'm not interested in scholarly articles or descriptions for disambiguation pages, so I'd be happy to use such a server instead. Midleading (talk) 05:41, 25 February 2023 (UTC)
