User:Theredproject/2018 Workplan

Art+Feminism Wikidata plans

March 2018

Art+Feminism plans to use Wikidata to link a collection of items from which to generate task lists. These lists will be used to improve Wikipedia articles as part of our outreach efforts. They will also be used to improve the Wikidata items themselves.

Over the past 4 years Art+Feminism participants have created/edited over 10,000 articles, and we have created/edited over 17,000 more this March. These articles include artists, artworks, writers, filmmakers and other articles that clearly fall in the rubric of Art+Feminism, as well as a smaller subset of articles that fall outside that rubric: doctors, scientists, politicians, etc. We welcome people at our events to edit all articles about women and gender, but we want to narrow our focus for the items/articles we seek to improve over the long term. To do so, our first task is to add profession data to each item, so we can filter our items by profession.

Overview of code

We (User:Danaras and User:Theredproject) have written a set of Python scripts, provisionally called Wikidata QuickSheets, that query the Wikipedia and Wikidata APIs to return QIDs, and then query Wikidata for sex or gender (P21) and occupation (P106). The code is available on GitHub (https://github.com/danaras/wikidata-quicksheets). It turns out that only half of these items have occupation (P106) data, so we have written code to add occupation (P106) data to the items that are missing it. This script uses category data, and also pulls the first sentence of the Wikipedia page for each QID and searches it against the list of all professions. The script outputs the profession and the first sentence in a Keyword In Context (KIC) style, so that a human can verify the script’s work. Only data verified by a human will be converted by the script into Quick Statements ready tuplets.

The data produces several error states. We are still refining the script to handle these error states in trying to pair our list of Wikipedia page names with QIDs: malformed page names, pages that were moved since creation, etc.
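As a rough illustration of the lookup described above, the sketch below builds a MediaWiki API request for a page's Wikidata item and interprets the JSON that comes back. It is not taken from the QuickSheets repository: the function names and status strings are our own for this example, and the status strings do not correspond to the script's real error codes. It assumes the standard `action=query&prop=pageprops` endpoint, with `redirects=1` so that pages moved since creation still resolve.

```python
import json
from urllib.parse import urlencode


def build_lookup_url(lang, title):
    """Build a MediaWiki API URL asking for the Wikidata item of one page."""
    params = {
        "action": "query",
        "prop": "pageprops",
        "ppprop": "wikibase_item",
        "redirects": 1,  # follow redirects left behind by page moves
        "titles": title,
        "format": "json",
    }
    return f"https://{lang}.wikipedia.org/w/api.php?" + urlencode(params)


def extract_qid(response_text):
    """Return (qid, status) from the JSON text of a pageprops query.

    The status strings are illustrative only.
    """
    try:
        pages = json.loads(response_text)["query"]["pages"]
    except (ValueError, KeyError, TypeError):
        return None, "unparseable response"
    for page in pages.values():
        if "missing" in page or "invalid" in page:
            return None, "page not found"
        qid = page.get("pageprops", {}).get("wikibase_item")
        if qid:
            return qid, "ok"
        return None, "no Wikidata item"
    return None, "empty response"
```

Keeping the URL building separate from the response parsing means the parsing can be checked against saved responses without touching the network.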

Future steps

We hope to unify these items via catalog (P972), or whatever new Wikimedia Project Focus List property is agreed upon. Once we have added all profession data, we will filter out the items that have occupation (P106) values that are out of scope for the project. We will then add the A+F-specific P972/Wikidata focus list metadata to the final set of items.

At base, we will use this unified set of items to generate lists of items and articles for improvement on Wikipedia and on Wikidata via the semi-automated approach described above.

We also think this script-based approach will be of interest to others. It is a process specifically designed to be accessible to those without programming experience. It uses simple article lists (which can be generated via SPARQL) to generate spreadsheets for human evaluation. These sheets are then transformed back into QS-ready data. The script requires no special libraries or dependencies beyond what is available by default in basic Python configurations.

In this regard it is similar to the Wikidata Game, but differs in three key ways: 1) you can start with your own very focused list; 2) it does more of the work for you; 3) because the data is laid out in a spreadsheet, you can scan and approve it faster and at scale. In a way it is like taking the Wikidata Game and transforming it to pair with Quick Statements.

How Wikidata QuickSheets works

Step 0: move category data over from Wikipedia in bulk

I moved Category data from enwiki to Wikidata occupation (P106) for about 14,000 items that fell into the categories we commonly find our articles in (artists, writers, academics, etc). I did this by generating QID lists with petscan, working with as shallow a depth as was reasonably reflected in the potential occupation (P106) values on Wikidata. I generated Quick Statement tuplets based on these lists.

Step 1: generate lists to start from

The script takes a CSV of articles in the following format:

Language, name

es, Deborah Ahenkorah

en, Caroline Woolard

en, Charlotte Cotton

The script could be modified to also accept a list of QIDs as outputted from a SPARQL query.
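Reading the input format shown above needs only the standard library. The following is a minimal sketch, not the script's actual code; `read_article_list` is a name we made up for the example, and treating everything after the first comma as the article name is our assumption.

```python
import csv


def read_article_list(handle):
    """Yield (language, title) pairs from a 'Language, name' CSV."""
    reader = csv.reader(handle, skipinitialspace=True)
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or incomplete lines
        lang = row[0].strip()
        # Assumption: a title containing commas belongs together,
        # so everything after the first field is rejoined.
        name = ", ".join(part.strip() for part in row[1:])
        if lang.lower() == "language":
            continue  # skip the header row
        yield lang, name
```

With a file on disk this could be used as `with open("articles.csv", newline="", encoding="utf-8") as f: pairs = list(read_article_list(f))`, where `articles.csv` is a placeholder file name.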

The script also takes a list of all values for P106 from a SPARQL query and generates a working list of occupations, sorted by total count on Wikidata, that includes their descriptions, so as to differentiate Q3455803 “director of a creative work” from Q1162163 “in business or institutions person in charge of realizing an objective.” In our workfiles, this is “occupations-withDescriptions.csv”.
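One way such an occupation list could be built is sketched below. The SPARQL text is our guess at a suitable query, not the one we actually ran, and a count over every P106 statement may well time out on the public query service. The Python part assumes the standard SPARQL JSON results layout; the function names and the CSV column headers are hypothetical.

```python
import csv

# Hypothetical query: count uses of each P106 value, with label and description.
OCCUPATION_QUERY = """
SELECT ?occupation ?occupationLabel ?occupationDescription ?count WHERE {
  {
    SELECT ?occupation (COUNT(?item) AS ?count) WHERE {
      ?item wdt:P106 ?occupation .
    } GROUP BY ?occupation
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""


def occupations_from_sparql_json(result):
    """Turn SPARQL JSON results into (qid, label, description, uses) rows,
    sorted with the most used occupations first."""
    rows = []
    for binding in result["results"]["bindings"]:
        # Entity URIs end in the QID, e.g. .../entity/Q3455803
        qid = binding["occupation"]["value"].rsplit("/", 1)[-1]
        label = binding.get("occupationLabel", {}).get("value", "")
        description = binding.get("occupationDescription", {}).get("value", "")
        uses = int(binding["count"]["value"])
        rows.append((qid, label, description, uses))
    rows.sort(key=lambda row: row[3], reverse=True)
    return rows


def write_occupations(rows, handle):
    """Write the rows as CSV with a header line."""
    writer = csv.writer(handle)
    writer.writerow(["qid", "label", "description", "uses"])
    writer.writerows(rows)
```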

Step 2: generate work files

With the list of articles, and occupations, the script uses Wikipedia and Wikidata APIs to look up the QID, gender (P21), and occupation (P106) of each item. It outputs them to several different CSVs:

  • Has QID, P21 = female, has P106
  • Has QID, P21 = female, P106 found via Category
  • Has QID, P21 = female, P106 found via Searching first line
  • Has QID, P21 = female, no P106 found (these are mostly items that do not have en.wiki articles)
  • Has QID, P21 = male OR doesn’t exist (these are mostly non human items, like paintings)
  • No QID, Error -1 (redirects, and other things we haven’t sorted through)
  • No QID, Error -2 (malformed text)
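The sorting into output files listed above can be pictured as one small decision function. This is only a sketch under our own naming: the bucket names, the `suggestion_source` argument and the function itself are hypothetical. Any P21 value other than female is lumped together with missing P21 here, which simplifies the "male OR doesn't exist" case.

```python
FEMALE = "Q6581072"  # female (Q6581072)


def choose_bucket(qid, genders=(), occupations=(),
                  suggestion_source=None, error_code=-1):
    """Pick the output file (bucket) for one looked-up article.

    genders and occupations are the P21 and P106 values found on the item;
    suggestion_source is "category", "first_sentence" or None.
    """
    if qid is None:
        # -2 marks malformed text; everything else is treated as -1
        return "no_qid_error_2" if error_code == -2 else "no_qid_error_1"
    if FEMALE not in genders:
        return "not_female_or_no_p21"
    if occupations:
        return "has_p106"
    if suggestion_source == "category":
        return "p106_via_category"
    if suggestion_source == "first_sentence":
        return "p106_via_first_sentence"
    return "no_p106_found"
```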

For the QID items missing P106, the script tries two routes to establish P106 data:

  1. It reads the Category data from en.wiki, and compares that to an interface matrix we generated by hand, that contains correlations between Wikipedia Category and QID for ~100 common occupations.
  2. It pulls the first sentence from the Wikipedia article, and searches for P106 values from the occupations-withDescriptions.csv list. The script has a tolerance setting, so you can say that you only want to search for items with more than x uses on Wikidata (I have settled on 5 for testing, which reduces the 9207 values for P106 to 3100 values).
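The two routes can be sketched roughly as follows. This is an illustration rather than the script's own code: the function names are ours, the category matrix is assumed to be a plain dictionary from category name to occupation QID, and whole-word, case-insensitive matching is our assumption about how the first sentence is searched.

```python
import re


def occupations_from_categories(categories, category_map):
    """Route 1: map enwiki category names to occupation QIDs
    using a hand-made lookup table."""
    found = []
    for category in categories:
        qid = category_map.get(category)
        if qid and qid not in found:
            found.append(qid)
    return found


def occupations_in_sentence(sentence, occupations, min_uses=5):
    """Route 2: search the first sentence for occupation labels.

    occupations is an iterable of (qid, label, uses); only labels with
    more than min_uses uses on Wikidata are considered (the tolerance).
    """
    hits = []
    for qid, label, uses in occupations:
        if uses <= min_uses or not label:
            continue
        pattern = r"\b" + re.escape(label) + r"\b"
        if re.search(pattern, sentence, flags=re.IGNORECASE):
            hits.append((qid, label))
    return hits
```

Raising `min_uses` shrinks the list of labels to test, which should speed up the search and cut down on matches against very rare occupation values.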

Step 3: notate work file

We then take the CSV with QID items missing P106, and review it manually. We look at each lede sentence and make decisions about each entry in the sheet. We check the occupation that the script found against the first sentence. If we accept the script's suggestion, we put a Y in that column (actually any value will work). If we find that a different value needs to be inserted, we put that in the alt occupation column. The Popular column indicates the relative frequency of that particular value: an * indicates it has 1000 or more uses on Wikidata; a . indicates it has more than 100 uses; a space (which produces an empty cell) means it has fewer than 100 uses.
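The markers and the reviewer's choices described above reduce to two small rules, sketched here with hypothetical function names. Two details are our assumptions: a value with exactly 100 uses gets the blank marker, and an entry in the alt occupation column takes precedence over a Y.

```python
def popularity_marker(uses):
    """Return the marker for the Popular column."""
    if uses >= 1000:
        return "*"
    if uses > 100:
        return "."
    return " "  # shows up as an empty cell in the sheet


def review_decision(accept_cell, alt_cell):
    """Interpret one reviewed row: 'alt', 'accept' or 'skip'."""
    if alt_cell.strip():
        return "alt"  # reviewer typed a different occupation
    if accept_cell.strip():
        return "accept"  # any non-empty value counts as a Y
    return "skip"
```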

You can see a detail of a sample work set here:

 

Step 4: generate Quick Statements ready tuplets

The script reads the notated work file, and outputs Quick Statements ready tuplets for each row that is marked with a Y. Additionally, it creates a new work file for the lines where I entered an alt occupation; the code does a reverse lookup to see if it can find the correct P106 value.
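A minimal sketch of this step, assuming the reviewed sheet has been read into dictionaries with the hypothetical keys `qid`, `occupation_qid`, `accept` and `alt_occupation`. The output lines use the tab-separated item, property, value layout of the original Quick Statements format. The reverse lookup is shown as a plain case-insensitive dictionary match, which is a simplification on our part.

```python
def split_reviewed_rows(rows):
    """Return (statements, needs_lookup) from reviewed rows.

    statements are tab-separated Quick Statements lines for accepted rows;
    needs_lookup collects rows where an alt occupation was typed in.
    """
    statements, needs_lookup = [], []
    for row in rows:
        if row.get("alt_occupation", "").strip():
            needs_lookup.append(row)
        elif row.get("accept", "").strip():
            statements.append(
                "\t".join([row["qid"], "P106", row["occupation_qid"]])
            )
    return statements, needs_lookup


def reverse_lookup(label, label_to_qid):
    """Find the QID for a typed occupation label, or None if unknown.

    label_to_qid maps lower-cased labels to QIDs.
    """
    return label_to_qid.get(label.strip().lower())
```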

Proof of Concept

We have also produced a proof of concept with a reduced data set from the new articles created during the 2017 campaign. After a 10 day comment period, we inputted this data. You can find the full folder of data here: https://drive.google.com/drive/folders/1f1kyQGZ15Ry3nbs1khN3-ZaQcBVbwWbQ?usp=sharing

  • needs human review contains the files that need to be acted on
  • Category Outputs CSV contains human readable P106 data found via enwiki categories
  • Category Outputs QS contains the QS formatted version of that data found via categories

Step 5: Add Art+Feminism specific metadata

Once all items in our long list have professions, we can sort and include the ones we want to keep in our focus list. For example, we will likely remove politicians, scientists, etc. Having trimmed those, we will add the focus list property (catalog (P972) or its agreed replacement) to all of the remaining QIDs.

Step 6: generate worklists, etc

From here, we can do work like generating worklists that encourage participants to expand articles started at Art+Feminism events.

Further Development

This script has applications beyond the needs of Art+Feminism.

Migrating other non-controversial claims

I know that Women in Red face similar issues regarding missing occupation (P106) data. This approach could also be used to pull in data like:

And probably other biographical properties for areas outside of the arts.

Sourcing P172 claims

For example, some editors have proposed a Cleanup of unsourced “ethnic group (P172)” claims under BLP Privacy rules, which have been further discussed at this RFC on Privacy and Living People. At present there are over 53,000 uses of P172, nearly all of which are unsourced. See SPARQL Query here: [1] Of these, the largest number, 14,000, are for African Americans (Q49085) (likely because of the Wikidata Game). Additionally there are thousands of unsourced claims for Armenians (Q79797), Greeks (Q539051), Albanians (Q179248), Ukrainians (Q44806), Japanese people (Q161652), Serbs (Q127885), and Jewish people (Q7325). There also appear to be at least another 14,000 biographies on enwiki that are categorized as African-American, but have no ethnic group (P172) data on their corresponding Wikidata item. See this page pile [2] produced via Petscan.

During the discussion User:Multichill noted that if we were to require all instances to have a source, "Sourcing this is going to be hard," especially if we are not going to allow 'imported from a Wikipedia' as an acceptable source. The methodology and technology outlined above could be retooled and repurposed to accomplish this goal.

Modifications required to add sourcing capabilities

  • It would need to be able to work from a list of QIDs (possibly via PagePile?)
  • It would need to pull references from the Wikipedia entry, and search for the ethnic group (P172) values in the text of the reference.
    • One challenge here would be searching for all the contemporary terms associated with a specific ethnicity, e.g. "African-American," "Afro-diasporic," "Black," etc. It will likely also be necessary to incorporate historical terms, such as "colored," or "negro."
  • It would need to be abstracted one level, so that you could configure it to accept ethnic group (P172) inputs.
  • Because of the potential for complex statements, it should be migrated to output in QuickStatements2 format.
  • It would require better documentation; preferably this would include video documentation, which most wiki tools lack, creating a significant barrier to entry.
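To make the sourcing idea above more concrete, here is a sketch of two pieces it would need: checking whether a reference's text contains any of a configured set of terms, and writing a Quick Statements line that carries a reference URL. Nothing here exists in the current script. The function names are hypothetical, the term list is supplied by whoever configures the run, and a term match is only a suggestion for a human reviewer to confirm. The line layout follows the original tab-separated Quick Statements syntax, where `S854` attaches a reference URL (P854) to the statement.

```python
import re


def find_terms(text, terms):
    """Return the configured terms that occur as whole words in text."""
    return [
        term for term in terms
        if re.search(r"\b" + re.escape(term) + r"\b", text, flags=re.IGNORECASE)
    ]


def sourced_statement(item_qid, prop, value_qid, reference_url):
    """Build one Quick Statements line with a reference URL source."""
    return "\t".join(
        [item_qid, prop, value_qid, "S854", '"' + reference_url + '"']
    )
```

Because the property is passed in as an argument, the same two functions would cover the "abstracted one level" requirement for other properties as well.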

Sourcing other claims

The RFC mentioned above includes a list of items that are likely to be challenged:

My understanding is that this is a non-exhaustive list. But at the very least, this technique could be used to handle some of these. I think this technique is best suited to the top three on the list: sexual orientation (P91), ethnic group (P172), religion or worldview (P140). It would also do well for sourcing instances where sex or gender (P21) is not female (Q6581072) or male (Q6581097) (e.g. Trans*).

Summary and tabulation of all articles created during the past 5 years

We are currently tracking all 5578 articles created during the past 5 years that have Wikidata items with this SPARQL query.

These are articles that Art+Feminism tracks from previous edit-a-thons' output, which also have key article improvement alert templates. These articles give different options for the type of editing one might want to do, from adding wiki-links to orphan articles to adding citations, and more.

List is current as of May 5, 2019.

How This List Was Made

This list is a subset of the Wikidata Art+Feminism On-Focus List Art+Feminism (Q24909800), which contains all articles edited at A+F Events that are related to the project focus. Because Listeria does not allow us to cross-reference Wikipedia templates, we used Petscan to generate our list, which we converted into wiki markup. This means that our links are based on the Wikidata item names, so there will be occasional errors for English Wikipedia articles, and frequent errors for other languages (though articles about people should be accurate). As such, you'll notice some red links below where the Wikidata item did not correspond to an article on English Wikipedia.

Where are other language versions?

Spanish (ES) wiki list lives here. For the moment, we've only pulled Wikidata for Spanish and English Wikipedia but are open to other languages in the future! Contact us at info@artandfeminism.org for more information.

Articles with Flags on English Wikipedia

You'll find a list of articles with flags on English Wikipedia here, including

  1. English Wikipedia Orphan Articles
  2. Citation Problems on English Wikipedia
  3. Notability on English Wikipedia