Wikidata:WikiProject Scholia/Robustifying/Testing
Testing whether and how Scholia performs is a key element of its development. Scholia is used thousands of times a day to visualize content that changes over scales from minutes to years, so the development of the tool has to take into account typical trajectories of such changes. Currently, such trajectories often mean that Scholia functionality breaks when it would be most useful, i.e. for highly curated content. The Robustifying Scholia project aims to address such situations.
Pilot corpora are example collections of Wikidata content which model Scholia profiles for various use cases. Scholia profiles content from Wikidata, so when Wikidata has complete content for a query, the Scholia profile is of higher quality; when Wikidata's content is incomplete, the Scholia profile will be incomplete too. The pilot corpora below are complete enough to showcase as model examples to copy, to consider in critique, or to use in demonstrating Scholia and especially its limits.
Backend testing comprises activities aimed at addressing the subset of Scholia's limits that are due to the infrastructure that Scholia runs on, i.e. a Flask web app deployed on Wikimedia Toolforge that embeds JavaScript-enhanced iframes from the Wikidata Query Service that contain the results of SPARQL queries triggered by the web app. In the context of the Robustifying Scholia project, we are reviewing all of these components and their interactions to explore room for optimization.
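The iframe-based architecture described above can be pictured with a minimal sketch. It assumes that a panel is embedded through the public Wikidata Query Service embed page, with the percent-encoded query placed after the `#`; the helper names `embed_url` and `panel_iframe` are hypothetical and are not taken from the Scholia code base.

```python
from html import escape
from urllib.parse import quote

# Embed page of the public Wikidata Query Service; the query follows the "#".
WDQS_EMBED = "https://query.wikidata.org/embed.html#"


def embed_url(sparql):
    """Return a WDQS embed URL for the given SPARQL query text."""
    # safe="" percent-encodes every reserved character, including "/".
    return WDQS_EMBED + quote(sparql, safe="")


def panel_iframe(sparql):
    """Return iframe markup for one panel (hypothetical helper)."""
    return f'<iframe src="{escape(embed_url(sparql), quote=True)}"></iframe>'
```

Because each panel is a separate iframe, a profile page with many panels sends many queries at roughly the same time, which is relevant to the rate-limiting and time-out issues listed under the testing corpora.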
About
Background
The ideal pilot corpora are complete datasets with some networking to other complete datasets. For example, to profile a topic it is not necessary to profile the leading authors publishing on that topic, but for the purpose of showcasing Scholia, having a variety of links to click through can make a better first impression on anyone learning about Scholia as a feature-rich tool. More complete profiles demonstrate the sort of content users may curate for useful applications, and can inspire others to contribute data to profile content of their own choice.
As is common in Wikimedia projects, hundreds of people continuously edit the data corpora which are the foundations of these profiles, and consequently they are always changing. In the context of this crowdsourced engagement, individuals and the collective community do wiki-style documentation of changes, develop and publish guidelines for the general Wikidata environment, and experience personal and collective insight and cultural development toward best practices for curating this content. As of November 2019, the pace of Wikidata and Scholia development is too rapid to justify regular updates on best practices for curation. Anyone wishing to do curation should instead consider the below examples as showcased models where individuals and groups have worked toward completeness. For further details it is best to ask any experienced Wikidata contributor or Scholia project participant.
Usage
Pilot corpora only make sense in the context of particular profiling aspects. For example, a profile for a university may be from the perspective of papers from that university as an organization, or about that university as a topic. As different profiles present different content, the subject of a profile may be complete in one aspect but not another.
In the context of the Robustifying Scholia project, the corpora are used to test the limits of Scholia, and to explore workarounds and alternatives.
Key resources
- Scholia corpora tickets in GitHub
- Wikidata:WikiProject Zika Corpus: the literature around Zika virus (Q202864) serves as the primary testing ground for Scholia, but it is too small to reach audiences interested in unrelated topics or to hit some of the limits that Scholia experiences for some larger corpora
Testing goals for 2020
This project requires systematic testing of Scholia performance across possible use cases or usage scenarios. On that basis, we will assemble a corpus of examples that test Scholia’s technical limits that can help us optimize the infrastructure or inform technical design decisions. We will make such decisions around the types of visualizations available through Scholia and how they are cached or preserved, the ways in which the data to visualize gets into Scholia from Wikidata, the ways in which users can configure the experience (e.g. for comparisons), and the ways in which Scholia is integrated with WikiCite curation workflows, or hardware requirements.
While these technical test sets may be of limited use or interest outside the Scholia development team, the systematic testing of Scholia’s limits can also help identify circumstances where the tool works well, and in conjunction with usage information, we can then start to build pilot datasets like the Zika corpus or the Invasion biology corpus to serve as examples that engage different user communities. Having example sets to show off creates a model and workflow for others to emulate to open, expand, integrate or clean up the datasets which are relevant to them.—Scholia team, Robustifying Scholia, 2019
Milestones
- Corpora curation
Every corpus which Scholia presents for profiling is a milestone in Scholia development. Stages of development of corpora are their ingestion into Wikidata, their curation to refine quality, then designating them as ready for use and public feedback.
A corpus is a collection of data which Scholia can visualize in a useful way. Examples are below. The Robustifying Scholia proposal explains that these are important test cases because if Scholia can profile a corpus, then profiles should also work for any similar general case. It is important to curate corpora for three reasons: they support development, profiles of corpora are a major attraction to users, and the most common community crowdsourcing behavior in Scholia participation is the curation of corpora.
Corpora curation should be familiar to Wikidata and Wikimedia editors, as content curation is community behavior focused on a dataset. More specific to Scholia development and the goals for 2020 is the development of technical infrastructure that enables Scholia to profile corpora.
- Technical developments relevant to Scholia
- Factoring out the SPARQL endpoint URL
- Wikibase testing
- https://github.com/fnielsen/scholia/issues/916
- OCLC Wikibase pilot “Project Passage”
- http://hangingtogether.org/?p=7385
- http://hangingtogether.org/?p=7398
- http://hangingtogether.org/?p=7433
- https://www.oclc.org/research/publications/2019/oclcresearch-creating-library-linked-data-with-wikibase-project-passage.html
- https://doi.org/10.25333/faq3-ax08
- Has a frontend called “Passage Explorer”
- Learning Wikibase
- Professional Wikibase hosting?
- Running Wikibase and Semantic MediaWiki on the same wiki
- Determination of the price of operating Scholia
- Dockerizing Scholia
- A specialist will be hired to work on this in January 2020
- One pull request for a Docker file: https://github.com/fnielsen/scholia/pull/691/files
- Depends on the SPARQL endpoint being configurable, see https://github.com/fnielsen/scholia/issues/809
- SPARQL visualizer tool from Potsdam
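As an illustration of what factoring out the SPARQL endpoint URL could look like, here is a minimal sketch. It assumes the endpoint is read from an environment variable named `SCHOLIA_SPARQL_ENDPOINT`, which is a hypothetical name chosen for this example, and falls back to the public Wikidata Query Service endpoint. It does not reflect how the linked issues were actually resolved.

```python
import os
from urllib.parse import urlencode

# Public Wikidata Query Service endpoint, used as the fallback.
DEFAULT_ENDPOINT = "https://query.wikidata.org/sparql"


def sparql_endpoint(environ=None):
    """Return the SPARQL endpoint URL, preferring an environment override.

    SCHOLIA_SPARQL_ENDPOINT is a hypothetical variable name chosen for
    this sketch; it is not taken from the Scholia code base.
    """
    environ = os.environ if environ is None else environ
    return environ.get("SCHOLIA_SPARQL_ENDPOINT", DEFAULT_ENDPOINT)


def query_url(query, environ=None):
    """Build a GET URL that asks the configured endpoint for JSON results."""
    params = urlencode({"query": query, "format": "json"})
    return f"{sparql_endpoint(environ)}?{params}"
```

With a single lookup point like this, a Docker container or a separate Wikibase instance could be targeted by setting one variable rather than editing every query template.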
Corpora
For the moment, corpora that indicate technical issues (testing corpora) and corpora for which Scholia works fine and thus allows community engagement (community corpora) are both listed here, since transitions between both groups are common as bugs and errors arise and get addressed. We are exploring how best to facilitate distinction between the two when it matters.
Gallery
Testing corpora
- If queries take longer than a minute, they time out. This is frequent for complex queries.
- For comparisons, the likelihood of time-outs or other errors increases with the number of items to be compared (in this case, seven authors named "Li Li").
- This error occurs when too many requests are being sent to the Wikidata Query Service in a short period from the same IP address. As Scholia profiles include multiple panels that are all requested at roughly the same time, the error is rather frequent. A simple remedy is to reload the page after a minute or so.
- Publications with hundreds of authors can cause problems with network visualizations like the co-author graph.
- Well-curated topics can cause the browser to stall.
- Sometimes, the iframe embedded from the Wikidata Query Service does not result in any visual display.
- Page number data is not readily available in a structured format, which is why this panel on an author's profile is often empty even if they have published lots of works during the indicated period.
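The "reload after a minute or so" remedy for the too-many-requests error can also be automated on the client side. The sketch below assumes that rate limiting is signalled with HTTP status 429, optionally accompanied by a wait hint in seconds; `fetch` is a hypothetical callable standing in for whatever actually sends the request, so the example runs without network access.

```python
import time


def fetch_with_retry(fetch, max_attempts=4, default_wait=60.0, sleep=time.sleep):
    """Call fetch() until it stops reporting rate limiting.

    fetch is a hypothetical callable returning (status, retry_after, body),
    where retry_after is a number of seconds or None.
    """
    for attempt in range(1, max_attempts + 1):
        status, retry_after, body = fetch()
        if status != 429:
            return status, body
        if attempt == max_attempts:
            break
        # Prefer the server's hint; otherwise wait "a minute or so".
        sleep(retry_after if retry_after is not None else default_wait)
    raise RuntimeError(f"still rate limited after {max_attempts} attempts")
```

Passing `sleep` as a parameter keeps the function testable: a test can record the requested waits instead of actually pausing.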
Organizations
The following query (source) provides a list of organizations, sorted (in descending order) by number of affiliated people known to Wikidata. For most of these 200 most curated institutions, several of the panels in Scholia's organization aspect have issues with display.
The following query uses these:
- Properties: employer (P108), member of (P463), affiliation (P1416), part of (P361), GRID ID (P2427)
SELECT ?count ?institution ?institutionLabel
WITH {
  SELECT (COUNT(DISTINCT ?researcher) AS ?count) ?institution
  WHERE {
    ?researcher ( wdt:P108 | wdt:P463 | wdt:P1416 ) / wdt:P361* ?institution .
    ?institution wdt:P2427 ?grid .
  }
  GROUP BY ?institution
} AS %result
WHERE {
  INCLUDE %result
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en,da,de,ep,fr,jp,nl,no,ru,sv,zh" . }
}
ORDER BY DESC(?count)
LIMIT 200
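When the query above is run against a SPARQL endpoint that returns the standard SPARQL JSON results format, its rows can be flattened as in the following sketch. The function name `rows_from_sparql_json` is hypothetical, and the counts in the sample payload are invented placeholders, not real query results.

```python
def rows_from_sparql_json(payload):
    """Turn a SPARQL JSON result into (count, qid, label) tuples."""
    rows = []
    for binding in payload["results"]["bindings"]:
        uri = binding["institution"]["value"]
        qid = uri.rsplit("/", 1)[-1]  # entity URI ends in the Q identifier
        # Fall back to the Q identifier when no label was returned.
        label = binding.get("institutionLabel", {}).get("value", qid)
        rows.append((int(binding["count"]["value"]), qid, label))
    return rows


# Sample payload; the counts are invented placeholders for illustration only.
sample = {
    "head": {"vars": ["count", "institution", "institutionLabel"]},
    "results": {"bindings": [
        {"count": {"type": "literal", "value": "12"},
         "institution": {"type": "uri",
                         "value": "http://www.wikidata.org/entity/Q1269766"},
         "institutionLabel": {"type": "literal",
                              "value": "Technical University of Denmark"}},
        {"count": {"type": "literal", "value": "3"},
         "institution": {"type": "uri",
                         "value": "http://www.wikidata.org/entity/Q1137652"}},
    ]},
}
rows = rows_from_sparql_json(sample)
```

The resulting Q identifiers could then be fed into a test run that visits the organization aspect for each institution and records which panels fail.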
Universities
Africa
- Cairo University (Q194445)
- Makerere University (Q261506)
- University of Namibia (Q220226)
- University of Sfax (Q540341)
- University of Dar es Salaam (Q557597)
- University of Nairobi (Q649998)
- University of Las Palmas de Gran Canaria (Q940302)
- Stellenbosch University (Q1066492)
Asia
Europe
- Technical University of Denmark (Q1269766)
- Maastricht University (Q1137652)
- Delft University of Technology (Q752663)
North America
Department or subgroup
- Cognitive Systems (Q24283660)
- Research Institute of Text Analysis and Applications (Q27639076)
- Data Science Institute (Q50386370)
Other
- United States Geological Survey (Q193755)
- Centers for Disease Control and Prevention (Q583725)
- Young Academy Movement (Q75833130)
- National Climate Change Adaptation Research Facility (Q30264311)
Topics
Sustainable development
In this section, the target corpora are bolded, whereas the other items are listed to provide some of the context of curation in this area.
- Sustainable Development Goals (Q7649586)
- Sustainable Development Goal 3 (Q50216838)
- Sustainable Development Goal 4 (Q53581209)
- Sustainable Development Goal 6 (Q48741129)
- Sustainable Development Goal 13 (Q53581236)
- Sustainable Development Goal 14 (Q53581239)
- Sustainable Development Goal 15 (Q53581245)
Other topics
- Scholia (Q45340488)
- scholia (Q1358144)
- G protein-coupled receptor (Q38173)
- beta barrel (Q310424)
- black hole (Q589)
- amyotrophic lateral sclerosis (Q206901)
- Alzheimer's disease (Q11081)
- Parkinson's disease (Q11085)
- malaria (Q12156)
- snakebite (Q68854)
Individuals
For individuals, our testing mainly revolves around those whose Wikipedia articles already link to their Scholia profiles through the Scholia template, as well as candidates for having that template added to their English Wikipedia article. However, we might list a few individuals below who do not neatly fit these two groups but who may be of interest for another reason.
- Physicians
- Other scientists
- Jo Dunkley (Q28757988)
- Karine Breckpot (Q38326061)
- Almaz A. Aldashev (Q25571999)
- Paolo Morettini (Q76757065)
- high-energy physicist, so loads of papers with loads of co-authors, which brings Scholia to its limits
- Other academics
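The note above about papers with "loads of co-authors" can be made concrete with a small calculation. The sketch below is an illustration of how quickly co-author pairs accumulate, not a description of how Scholia builds its co-author graph: a paper with n distinct authors contributes n(n-1)/2 unordered pairs.

```python
from itertools import combinations


def coauthor_pairs(authors):
    """Return every unordered pair of distinct co-authors on one paper."""
    return list(combinations(sorted(set(authors)), 2))


def pair_count(n_authors):
    """Number of co-author pairs for a paper with n_authors authors."""
    return n_authors * (n_authors - 1) // 2
```

For example, `pair_count(3)` is 3, while `pair_count(1000)` is 499500, so a single large-collaboration paper can add several hundred thousand edges to a network visualization.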
Publishing
- Journals
- Proceedings of the National Academy of Sciences of the United States of America (Q1146531)
- Journal of Virology (Q1251128)
- Science, Technology & Human Values (Q166544)
- Nature (Q180445)
- Science (Q192864)
- Antiquity (Q4775205)
- International Gambling Studies (Q15749949)
- Indoor Air (Q15756118)
- International Journal for Equity in Health (Q15749959)
- Marine Policy (Q15757263)
- Journal of Sustainable Tourism (Q15757725)
- Reading and Writing (Q15763433)
- Publishers
- Public Library of Science (Q233358)
- BioMed Central (Q463494)
- Hindawi Publishing Corporation (Q1619253)
Locations
Countries
- Denmark
- Netherlands
- India
- Tanzania
- Uganda
- Estonia
Events
Awards
- Nobel Prize in Physiology or Medicine (Q80061)
- L'Oréal-UNESCO Award For Women in Science (Q1786381)
- Fellow of the African Academy of Sciences (Q63208574)