Wikidata:WikiProject PCC Wikidata Pilot/University of Washington/Workflow, Trainings, and Resources/EAD to Wikidata Workflow

We welcome constructive feedback. If you would like to see this workflow expanded to include more EAD fields, or have other suggestions, please create an issue in our SCArchivesAgents GitHub repository.
Workflow created and openly shared by the University of Washington Libraries Linked Data Team. Python coding by Melissa Morgan, MLIS Candidate, University of Washington Information School (Wikidata user Mcm104).

Retrieve EAD finding aids as XML files

Make note of the directory where you save these files.
The files we use for this example are in EAD Version 2002.

Run XML files through reconciliation_csv.py

  • The script pulls the values of the EAD elements origination, unittitle, and bioghist and writes them to a CSV file.
  • To run reconciliation_csv.py from the command line:
$ python3.6 reconciliation_csv.py filepath
  • Replace "filepath" with the file path to the directory containing the EAD files
  • This results in the output reconciliation_csv.csv.
  • Any EAD files that do not contain an origination name will be skipped, and their respective file names will be listed in the command window.
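The extraction described above can be sketched roughly as follows. This is a minimal illustration, not the source of reconciliation_csv.py: it assumes the EAD 2002 files declare no XML namespace, the function names `extract_fields` and `write_reconciliation_csv` are hypothetical, and subfield codes such as `$b` are left untouched.

```python
import csv
import xml.etree.ElementTree as ET
from pathlib import Path


def _flatten(element):
    """Join all text nested under an element and collapse runs of whitespace."""
    if element is None:
        return ""
    return " ".join("".join(element.itertext()).split())


def extract_fields(root):
    """Return (origination_name, unittitle, bioghist), or None if no origination name."""
    name = _flatten(root.find(".//origination"))
    if not name:
        return None
    return name, _flatten(root.find(".//unittitle")), _flatten(root.find(".//bioghist"))


def write_reconciliation_csv(ead_dir, out_path="reconciliation_csv.csv"):
    """Write one CSV row per EAD file; return the names of skipped files."""
    skipped = []
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["origination_name", "unittitle", "bioghist"])
        for path in sorted(Path(ead_dir).glob("*.xml")):
            fields = extract_fields(ET.parse(str(path)).getroot())
            if fields is None:
                # Mirror the documented behavior: report the file and move on.
                skipped.append(path.name)
                print(f"Skipped (no origination name): {path.name}")
            else:
                writer.writerow(fields)
    return skipped
```

Calling `write_reconciliation_csv("path/to/ead_files")` would produce a CSV with the same three column headers shown in the example output.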

Example

From the SCArchivesAgents GitHub repository

Input

From the original XML:

<origination>
   <corpname encodinganalog="110" role="creator">AFL-CIO.$bRegion No. 9</corpname>
</origination>

<unittitle encodinganalog="245$a" type="collection">AFL-CIO Region 9 records</unittitle>
<bioghist id="a2" encodinganalog="5451_">

After the 1955 merger of the American Federation of Labor and the Congress of Industrial Organizations, the national AFL-CIO implemented a regional system to communicate and coordinate with state and local labor councils. Since 1955, the regional system has been restructured several times. For the purposes of this collection, AFL-CIO Region 9 refers to the representative of the national AFL-CIO in the Pacific Northwest from 1986 to 1996. Located in Seattle, the office is now defunct, its former jurisdiction encompassed: Alaska, Idaho, Montana, Oregon and Washington. In 1996-1997, these areas were merged within new AFL-CIO regional office structure and placed under Western Region Office, in San Francisco, CA.

From 1986 until 1994, Region 9 was directed by Edward Collins. Previously, Collins had overseen the Region 6 office in San Francisco. David Gregory succeeded Collins in 1994. After the reorganization Gregory served as Regional Political Coordinator.

Due to the restructuring of the regional system this collection contains materials from AFL-CIO Region 21 (1955-1974) and AFL-CIO Region 6 (1974-1986). The AFL-CIO Region 21 office was located in Portland, Oregon and directed by Chester C. Dunsten (1956-1964), Claude Shaffer (1964-1965), and James J. Leary (1965-1974). The AFL-CIO Region 6 office was located in San Francisco and directed by William L. Gilbert (1974-1977), Charles Hogan (1977-1981), James Baker (1981-1983), and Edward J. Collins (1983-1986).

</bioghist>

Command

$ python3.6 reconciliation_csv.py data_received/LaborArchivesEADLinkedDataProject

Output

From reconciliation_csv.csv:

origination_name | unittitle | bioghist
AFL-CIO. Region No. 9 | AFL-CIO Region 9 records | After the 1955 merger of the American Federation of Labor and the Congress of Industrial Organizations, the national AFL-CIO implemented a regional system to communicate and coordinate with state and local labor councils. Since 1955, the regional system has been restructured several times. For the purposes of this collection, AFL-CIO Region 9 refers to the representative of the national AFL-CIO in the Pacific Northwest from 1986 to 1996. Located in Seattle, the office is now defunct, its former jurisdiction encompassed: Alaska, Idaho, Montana, Oregon and Washington. In 1996-1997, these areas were merged within new AFL-CIO regional office structure and placed under Western Region Office, in San Francisco, CA. From 1986 until 1994, Region 9 was directed by Edward Collins. Previously, Collins had overseen the Region 6 office in San Francisco. David Gregory succeeded Collins in 1994. After the reorganization Gregory served as Regional Political Coordinator. Due to the restructuring of the regional system this collection contains materials from AFL-CIO Region 21 (1955-1974) and AFL-CIO Region 6 (1974-1986). The AFL-CIO Region 21 office was located in Portland, Oregon and directed by Chester C. Dunsten (1956-1964), Claude Shaffer (1964-1965), and James J. Leary (1965-1974). The AFL-CIO Region 6 office was located in San Francisco and directed by William L. Gilbert (1974-1977), Charles Hogan (1977-1981), James Baker (1981-1983), and Edward J. Collins (1983-1986).
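Comparing the input and output above, the MARC-style subfield code in "AFL-CIO.$bRegion No. 9" appears as a space in the CSV, and the bioghist paragraphs are joined into a single line. One way such cleanup could be written is sketched below. The helper names are hypothetical, and treating every `$` followed by a single letter or digit as a subfield code is an assumption based on this one example, not a description of what the script does.

```python
import re


def clean_origination_name(raw):
    """Replace MARC-style subfield codes (e.g. "$b") with a space and tidy whitespace."""
    # Assumption: a subfield code is "$" followed by one lowercase letter or digit.
    without_codes = re.sub(r"\$[a-z0-9]", " ", raw)
    return " ".join(without_codes.split())


def flatten_paragraphs(raw):
    """Collapse line breaks and repeated spaces so the text fits in one CSV cell."""
    return " ".join(raw.split())


print(clean_origination_name("AFL-CIO.$bRegion No. 9"))
```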

Load reconciliation_csv.csv into OpenRefine

  • Upload to OpenRefine as a new project. We use version 3.4.1 in this example.
  • Make sure columns are separated by commas, and click "Create Project".

Reconcile Wikidata items for agent names in OpenRefine

  • On the column containing the names of agents (in this example, origination_name), click the down arrow for options, and under Reconcile, select Start reconciling...
  • Select Wikidata.
  • OpenRefine will suggest items. Selecting one prompts the reconciliation engine to search only for items that are instances of the selected item. You can also search for a different item to use. For this example we are looking for instances of agents.
  • Select "Start Reconciling".
  • Once you are finished reconciling, create a new column based on your reconciled values.
  • Select the property "Qid", and click OK.
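For a quick spot check of a single name outside OpenRefine, candidate items can also be looked up through the Wikidata API's `wbsearchentities` action. The sketch below is not part of this workflow and does not reproduce OpenRefine's reconciliation scoring or type filtering; the function names and the User-Agent string are placeholders chosen for illustration.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://www.wikidata.org/w/api.php"


def build_search_url(name, language="en", limit=5):
    """Build a wbsearchentities request URL for an agent name."""
    params = {
        "action": "wbsearchentities",
        "search": name,
        "language": language,
        "type": "item",
        "limit": limit,
        "format": "json",
    }
    return API_URL + "?" + urllib.parse.urlencode(params)


def candidates_from_response(payload):
    """Reduce the API response to (QID, label, description) tuples."""
    return [
        (hit["id"], hit.get("label", ""), hit.get("description", ""))
        for hit in payload.get("search", [])
    ]


def search_wikidata(name):
    """Query Wikidata over the network and return candidate items."""
    request = urllib.request.Request(
        build_search_url(name),
        headers={"User-Agent": "ead-reconciliation-sketch/0.1"},  # placeholder value
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return candidates_from_response(json.load(response))
```

Calling `search_wikidata("AFL-CIO")` requires network access; any match still needs the same human review as an OpenRefine suggestion.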

Output new CSV with reconciled values

Click Export, and select Comma-separated value.
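As an illustration of how the exported file might be consumed in the next step, the sketch below reads it into a name-to-QID lookup. The column names `origination_name` and `Qid` are assumptions (the new column may be named differently in your export), and `load_qid_map` is a hypothetical helper, not a function from the repository.

```python
import csv


def load_qid_map(csv_lines, name_column="origination_name", qid_column="Qid"):
    """Map each agent name to its reconciled QID, or None when unreconciled.

    csv_lines can be an open text file or any iterable of CSV lines.
    """
    qids = {}
    for row in csv.DictReader(csv_lines):
        name = (row.get(name_column) or "").strip()
        qid = (row.get(qid_column) or "").strip()
        if name:
            # An empty QID cell means OpenRefine found no match for this name.
            qids[name] = qid or None
    return qids
```

With a file on disk this could be called as `load_qid_map(open(path, newline="", encoding="utf-8"))`.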

Run original EAD and reconciled CSV through quickstatements_csv.py

The script pulls the values of the elements origination, unittitle, and unitid, along with the url attribute of eadid (see the example input below).

  • To run quickstatements_csv.py from the command line:
$ python3.6 quickstatements_csv.py filepath1 filepath2
  • Replace "filepath1" with the file path to the directory containing the EAD files, and "filepath2" with the file path to the CSV exported from OpenRefine in the previous step.
  • This results in the output quickstatements_csv.csv.
  • Any EAD files that do not contain an origination name will be skipped, and their respective file names will be listed in the command window.
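The rows this step produces can be sketched as below, using the property and item IDs that appear in the example output further down. This is a hedged illustration rather than the logic of quickstatements_csv.py: `quickstatements_rows` is a hypothetical name, and attaching statements to an existing QID instead of CREATE/LAST when a name was reconciled is an assumption.

```python
FOCUS_LIST_QID = "Q98970039"  # value of P5008 in the example output
ARCHIVE_QID = "Q22096098"     # value of P485 in the example output


def quickstatements_rows(name, title, unit_id, url, qid=None):
    """Build spreadsheet rows: a human-readable label, then QuickStatements cells."""
    subject = qid or "LAST"
    rows = []
    if qid is None:
        # No reconciled item: create a new one and give it an English label.
        rows.append(["", "CREATE"])
        rows.append(["Label", "LAST", "Len", f'"{name}"'])
    rows.append(["on focus list of Wikimedia project", subject, "P5008", FOCUS_LIST_QID])
    rows.append([
        "archives at", subject, "P485", ARCHIVE_QID,
        "P1810", f'"{title}"',    # qualifier: title of the collection
        "P217", f'"{unit_id}"',   # qualifier: unitid value
        "P973", f'"{url}"',       # qualifier: finding aid URL
    ])
    return rows
```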

Example

From the SCArchivesAgents GitHub repository

Input

From the original XML:

<origination>
   <corpname encodinganalog="110" role="creator">AFL-CIO.$bRegion No. 9</corpname>
</origination>

<unittitle encodinganalog="245$a" type="collection">AFL-CIO Region 9 records</unittitle>
<unitid encodinganalog="9455_$a" countrycode="us" repositorycode="wau-ar" type="mss">5189</unitid>
<eadid countrycode="us" mainagencycode="wau-ar" encodinganalog="identifier" url="http://www.lib.washington.edu/specialcoll/findaids/docs/papersrecords/AFL-CIORegion9_5189.xml" identifier="80444/xv62756">AFL-CIORegion9_5189.xml</eadid>

Command

$ python3.6 quickstatements_csv.py data_received/LaborArchivesEADLinkedDataProject data_reconciliation/reconciledValuesWithQNumbers-2021-01-20-cec.csv

Output

 | QID or create new | Property | Property value | Qualifier property | Qualifier value | Qualifier property | Qualifier value | Qualifier property | Qualifier value
 | CREATE
Label | LAST | Len | "AFL-CIO. Region No. 9"
on focus list of Wikimedia project | LAST | P5008 | Q98970039
archives at | LAST | P485 | Q22096098 | P1810 | "AFL-CIO Region 9 records" | P217 | "5189" | P973 | "http://www.lib.washington.edu/specialcoll/findaids/docs/papersrecords/AFL-CIORegion9_5189"

Double-check your data

  • Open quickstatements_csv.csv in Excel or Google Sheets for easy viewing and editing.
  • From here, you can review your data before uploading it to QuickStatements. You can also add properties manually if desired.
  • Once you are satisfied with your data, load it into QuickStatements as a new batch using V1 commands. The data can be copied and pasted directly from your spreadsheet. Be sure to leave out the header row and the first column; these are included for human readability, not for QuickStatements.
  • QuickStatements Help
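If you prefer not to copy and paste by hand, the same trimming can be done in a few lines. The sketch below assumes the layout described above (one header row, a human-readable first column) and emits tab-separated V1 commands; `to_v1_commands` is a hypothetical helper, not part of the repository.

```python
import csv


def to_v1_commands(csv_lines):
    """Turn the reviewed CSV into tab-separated QuickStatements V1 command text.

    csv_lines can be an open text file or any iterable of CSV lines.
    """
    reader = csv.reader(csv_lines)
    next(reader, None)  # drop the header row
    commands = []
    for row in reader:
        cells = row[1:]  # drop the human-readable first column
        # Remove trailing empty cells left by shorter rows such as CREATE.
        while cells and not cells[-1].strip():
            cells.pop()
        if cells:
            commands.append("\t".join(cells))
    return "\n".join(commands)
```

The returned text can be pasted into a new QuickStatements batch; review it first, since this sketch does no validation of property or item IDs.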