Wikidata:WikiProject PCC Wikidata Pilot/University of Washington/Workflow, Trainings, and Resources/EAD to Wikidata Workflow
We welcome constructive feedback. If you would like to see this workflow expanded to include more EAD fields, or have other suggestions, please create an issue in our SCArchivesAgents GitHub repository.
Workflow created and openly shared by the University of Washington Libraries Linked Data Team. Python coding by Melissa Morgan (Mcm104), MLIS Candidate, University of Washington Information School.
Retrieve EAD finding aids as XML files
Make note of the directory where you save these files.
The files we use for this example are in EAD Version 2002.
Run XML files through reconciliation_csv.py
- Pulls values for the EAD elements origination, unittitle, and bioghist and outputs them in a CSV file
- To run reconciliation_csv.py from the command line:
$ python3.6 reconciliation_csv.py filepath
- Replace "filepath" with the file path to the directory containing the EAD files
- This results in the output reconciliation_csv.csv.
- Any EAD files that do not contain an origination name will be skipped, and their respective file names will be listed in the command window.
Example
From the SCArchivesAgents GitHub repository
Input
From the original XML:
<origination>
<corpname encodinganalog="110" role="creator">AFL-CIO.$bRegion No. 9</corpname>
</origination>
<unittitle encodinganalog="245$a" type="collection">AFL-CIO Region 9 records</unittitle>
<bioghist id="a2" encodinganalog="5451_">
After the 1955 merger of the American Federation of Labor and the Congress of Industrial Organizations, the national AFL-CIO implemented a regional system to communicate and coordinate with state and local labor councils. Since 1955, the regional system has been restructured several times. For the purposes of this collection, AFL-CIO Region 9 refers to the representative of the national AFL-CIO in the Pacific Northwest from 1986 to 1996. Located in Seattle, the office is now defunct, its former jurisdiction encompassed: Alaska, Idaho, Montana, Oregon and Washington. In 1996-1997, these areas were merged within new AFL-CIO regional office structure and placed under Western Region Office, in San Francisco, CA. From 1986 until 1994, Region 9 was directed by Edward Collins. Previously, Collins had overseen the Region 6 office in San Francisco. David Gregory succeeded Collins in 1994. After the reorganization Gregory served as Regional Political Coordinator. Due to the restructuring of the regional system this collection contains materials from AFL-CIO Region 21 (1955-1974) and AFL-CIO Region 6 (1974-1986). The AFL-CIO Region 21 office was located in Portland, Oregon and directed by Chester C. Dunsten (1956-1964), Claude Shaffer (1964-1965), and James J. Leary (1965-1974). The AFL-CIO Region 6 office was located in San Francisco and directed by William L. Gilbert (1974-1977), Charles Hogan (1977-1981), James Baker (1981-1983), and Edward J. Collins (1983-1986). </bioghist>
Command
$ python3.6 reconciliation_csv.py data_received/LaborArchivesEADLinkedDataProject
Output
From reconciliation_csv.csv:
origination_name | unittitle | bioghist |
---|---|---|
AFL-CIO. Region No. 9 | AFL-CIO Region 9 records | After the 1955 merger of the American Federation of Labor and the Congress of Industrial Organizations, the national AFL-CIO implemented a regional system to communicate and coordinate with state and local labor councils. Since 1955, the regional system has been restructured several times. For the purposes of this collection, AFL-CIO Region 9 refers to the representative of the national AFL-CIO in the Pacific Northwest from 1986 to 1996. Located in Seattle, the office is now defunct, its former jurisdiction encompassed: Alaska, Idaho, Montana, Oregon and Washington. In 1996-1997, these areas were merged within new AFL-CIO regional office structure and placed under Western Region Office, in San Francisco, CA. From 1986 until 1994, Region 9 was directed by Edward Collins. Previously, Collins had overseen the Region 6 office in San Francisco. David Gregory succeeded Collins in 1994. After the reorganization Gregory served as Regional Political Coordinator. Due to the restructuring of the regional system this collection contains materials from AFL-CIO Region 21 (1955-1974) and AFL-CIO Region 6 (1974-1986). The AFL-CIO Region 21 office was located in Portland, Oregon and directed by Chester C. Dunsten (1956-1964), Claude Shaffer (1964-1965), and James J. Leary (1965-1974). The AFL-CIO Region 6 office was located in San Francisco and directed by William L. Gilbert (1974-1977), Charles Hogan (1977-1981), James Baker (1981-1983), and Edward J. Collins (1983-1986). |
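The transformation from the XML input to the CSV row above can be illustrated with a minimal Python sketch. This is not the project's reconciliation_csv.py, whose internals are not reproduced here. It assumes EAD 2002 files without an XML namespace, and it assumes that `$b`-style markers in names are MARC subfield delimiters to be replaced by a space, as the example output suggests. The names `text_of`, `extract_row`, and `rows_to_csv` are hypothetical, and the bioghist text in the sample is abbreviated.

```python
import csv
import io
import re
import xml.etree.ElementTree as ET

FIELDS = ["origination_name", "unittitle", "bioghist"]

# Abbreviated sample in the shape of the example input (no XML namespace assumed).
SAMPLE_EAD = """<ead>
  <archdesc level="collection">
    <did>
      <origination>
        <corpname encodinganalog="110" role="creator">AFL-CIO.$bRegion No. 9</corpname>
      </origination>
      <unittitle encodinganalog="245$a" type="collection">AFL-CIO Region 9 records</unittitle>
    </did>
    <bioghist id="a2" encodinganalog="5451_">After the 1955 merger (abbreviated)</bioghist>
  </archdesc>
</ead>"""


def text_of(element):
    """Join all nested text of an element and collapse runs of whitespace."""
    if element is None:
        return ""
    return " ".join("".join(element.itertext()).split())


def extract_row(xml_text):
    """Return a dict of the three fields, or None if there is no origination name."""
    root = ET.fromstring(xml_text)
    name = text_of(root.find(".//origination"))
    if not name:
        return None  # the caller would skip this file and report its name
    # Assumption: "$b" and similar markers are subfield delimiters; replace with a space.
    name = " ".join(re.sub(r"\$[a-z0-9]", " ", name).split())
    return {
        "origination_name": name,
        "unittitle": text_of(root.find(".//unittitle")),
        "bioghist": text_of(root.find(".//bioghist")),
    }


def rows_to_csv(rows):
    """Serialize extracted rows as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    row = extract_row(SAMPLE_EAD)
    print(rows_to_csv([row]))
```

Returning `None` for a file with no origination name mirrors the skip behavior described above; a real script would loop over every XML file in the directory and print the names of the skipped files.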
Load reconciliation_csv.csv into OpenRefine
- Upload the file to OpenRefine as a new project. We use version 3.4.1 in this example.
- Make sure columns are separated by commas, and click "Create Project".
Reconcile Wikidata items for agent names in OpenRefine
- On the column containing the names of agents (in this example, origination_name), click the down arrow for options, and under Reconcile, select "Start reconciling...".
- Select Wikidata.
- OpenRefine will suggest items. Selecting one restricts the reconciliation engine to items that are instances of the selected item. You can also search for a different item to use. For this example we are looking for instances of agents.
- Select "Start Reconciling".
- Once you are finished reconciling, create a new column based on your reconciled values.
- Select the property "Qid", and click OK.
Output new CSV with reconciled values
Click Export, and select Comma-separated value.
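Before moving on, it can help to check which names received a QID. The following is a minimal sketch for reading the exported file, not part of the project's scripts. It assumes the export keeps the origination_name column and that the column added in the previous step is named "Qid"; adjust both names if your export differs. The function name `load_qids` and the sample rows, including the placeholder identifier, are hypothetical.

```python
import csv
import io


def load_qids(csv_file, name_column="origination_name", qid_column="Qid"):
    """Split exported rows into a name-to-QID mapping and a list of unmatched names."""
    matched = {}
    unmatched = []
    for row in csv.DictReader(csv_file):
        name = (row.get(name_column) or "").strip()
        qid = (row.get(qid_column) or "").strip()
        if not name:
            continue  # ignore rows with no agent name
        if qid:
            matched[name] = qid
        else:
            unmatched.append(name)
    return matched, unmatched


if __name__ == "__main__":
    # Hypothetical export; "Q00000000" is a placeholder, not a real Wikidata item.
    sample = io.StringIO(
        "origination_name,unittitle,Qid\n"
        "Example matched agent,Example records,Q00000000\n"
        "AFL-CIO. Region No. 9,AFL-CIO Region 9 records,\n"
    )
    matched, unmatched = load_qids(sample)
    print("matched:", matched)
    print("unmatched:", unmatched)
```

With a real export, pass an open file instead of the in-memory sample, for example `open(path, newline="", encoding="utf-8")`.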
Run original EAD and reconciled CSV through quickstatements_csv.py
- Pulls values for the EAD elements origination, unittitle, and unitid, and for the url attribute of eadid
- To run quickstatements_csv.py from the command line:
$ python3.6 quickstatements_csv.py filepath1 filepath2
- Replace "filepath1" with the file path to the directory containing the EAD files, and "filepath2" with the file path to the CSV exported from OpenRefine in the previous step.
- This results in the output quickstatements_csv.csv.
- Any EAD files that do not contain an origination name will be skipped, and their respective file names will be listed in the command window.
Example
From the SCArchivesAgents GitHub repository
Input
From the original XML:
<origination>
<corpname encodinganalog="110" role="creator">AFL-CIO.$bRegion No. 9</corpname>
</origination>
<unittitle encodinganalog="245$a" type="collection">AFL-CIO Region 9 records</unittitle>
<unitid encodinganalog="9455_$a" countrycode="us" repositorycode="wau-ar" type="mss">5189</unitid>
<eadid countrycode="us" mainagencycode="wau-ar" encodinganalog="identifier" url="http://www.lib.washington.edu/specialcoll/findaids/docs/papersrecords/AFL-CIORegion9_5189.xml" identifier="80444/xv62756">AFL-CIORegion9_5189.xml</eadid>
Command
$ python3.6 quickstatements_csv.py data_received/LaborArchivesEADLinkedDataProject data_reconciliation/reconciledValuesWithQNumbers-2021-01-20-cec.csv
Output
| | QID or create new | Property | Property value | Qualifier property | Qualifier value | Qualifier property | Qualifier value | Qualifier property | Qualifier value |
|---|---|---|---|---|---|---|---|---|---|
| | CREATE | | | | | | | | |
| Label | LAST | Len | "AFL-CIO. Region No. 9" | | | | | | |
| on focus list of Wikimedia project | LAST | P5008 | Q98970039 | | | | | | |
| archives at | LAST | P485 | Q22096098 | P1810 | "AFL-CIO Region 9 records" | P217 | "5189" | P973 | "http://www.lib.washington.edu/specialcoll/findaids/docs/papersrecords/AFL-CIORegion9_5189" |
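The rows in the example output can be assembled along the lines of the sketch below. It is an illustration rather than the code of quickstatements_csv.py. The two item identifiers and the property numbers are copied from the example output. The function name `build_statements`, the optional `qid` parameter (attaching statements to an existing item instead of creating a new one), and the removal of a trailing ".xml" from the URL (inferred from the example) are assumptions.

```python
# Values copied from the example output above.
FOCUS_LIST_PROJECT = "Q98970039"  # value of P5008 in the example
ARCHIVE_ITEM = "Q22096098"        # value of P485 in the example


def quote(value):
    """Wrap a string value in double quotes, as the example output does."""
    return '"' + value + '"'


def build_statements(label, unittitle, unitid, url, qid=None):
    """Return QuickStatements-style rows (lists of cells) for one agent.

    Assumption: when qid is None a new item is created and referred to as LAST;
    otherwise the statements are attached to the given item.
    """
    # Assumption: the example output drops the ".xml" suffix from the eadid url.
    if url.endswith(".xml"):
        url = url[: -len(".xml")]

    rows = []
    if qid is None:
        subject = "LAST"
        rows.append(["CREATE"])
        rows.append([subject, "Len", quote(label)])  # English label
    else:
        subject = qid
    rows.append([subject, "P5008", FOCUS_LIST_PROJECT])
    rows.append([
        subject, "P485", ARCHIVE_ITEM,
        "P1810", quote(unittitle),
        "P217", quote(unitid),
        "P973", quote(url),
    ])
    return rows


if __name__ == "__main__":
    for row in build_statements(
        "AFL-CIO. Region No. 9",
        "AFL-CIO Region 9 records",
        "5189",
        "http://www.lib.washington.edu/specialcoll/findaids/docs/papersrecords/AFL-CIORegion9_5189.xml",
    ):
        print(" | ".join(row))
```

The sketch omits the human-readable first column shown in the table, since that column is not part of the QuickStatements commands.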
Double-check your data
- Open quickstatements_csv.csv in Excel or Google Sheets for easy viewing and editing.
- From here, you can review your data before uploading it to QuickStatements. You can also add additional properties manually if desired.
Load quickstatements_csv.csv into QuickStatements
- Once you have determined your data is satisfactory, you can load it into QuickStatements as a new batch. Use V1 commands. Your data can be copied and pasted directly from the spreadsheet in the previous step. Be sure to omit the header row and the first column; these are included for human readability, not for QuickStatements.
- QuickStatements Help
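As an alternative to copying cells from a spreadsheet, the header row and first column can be dropped programmatically. The sketch below is a hedged illustration, not part of the project's scripts. It relies on QuickStatements V1 accepting one command per line with tab-separated parts, and it assumes the CSV layout described above (one header row, then a human-readable first column). The function name `csv_to_v1` is hypothetical.

```python
import csv
import io


def csv_to_v1(csv_text):
    """Convert the CSV text into tab-separated V1 command lines."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # skip the header row
    lines = []
    for row in reader:
        cells = row[1:]  # drop the human-readable first column
        while cells and cells[-1] == "":
            cells.pop()  # drop empty trailing cells left by shorter commands
        if cells:
            lines.append("\t".join(cells))
    return "\n".join(lines)


if __name__ == "__main__":
    # Build a small sample with csv.writer so that quoting matches a real CSV file.
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["", "QID or create new", "Property", "Property value"])
    writer.writerow(["", "CREATE", "", ""])
    writer.writerow(["Label", "LAST", "Len", '"AFL-CIO. Region No. 9"'])
    writer.writerow(["on focus list of Wikimedia project", "LAST", "P5008", "Q98970039"])
    print(csv_to_v1(buffer.getvalue()))
```

Reviewing the resulting lines before pasting them into a new batch is still advisable, since QuickStatements acts on them as written.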