Wikidata:WikiProject PCC Wikidata Pilot/San Diego State University/Comic Arts Project Workflows/OpenRefine/Works

This workflow is still under development and will depend on the type of work being batch loaded from OCLC to Wikidata. So far, the project lead has focused mostly on books and series, and the tentative workflow below is for these types of materials.

Workflow

This workflow owes a tremendous amount to the wonderful work of the University of Washington. You can see their full workflow and documentation here.

Export MARC Records

Begin by searching for all your titles in OCLC (via ISBN, ti:title pn:author, or another method).
Priority is given to records from DLC or records held by SDSU. Otherwise, pick the record you feel is best.
Add to your local save file using F4
Once all the works for a specific award have been added, open MarcEdit and then open the MarcEditor
From the MarcEditor, use the OCLC Connexion BIB File Reader
- Note, you might need to set this up the first time you use it, to ensure it's pointed toward your local save file
Load the file and then select the records you'll be using
Click edit records, and save the resulting file (I usually save it on my desktop and name it after the award, e.g. Eisner Awards Continuing Series). You do not need to compile to MARC before saving.

Convert Selected MARC Fields to a Tab-Delimited File

From the main menu of MarcEdit, go to Tools -> Export and then open the program "Export Tab Delimited Records"
Select your records (from your desktop) and select where you plan to export (typically this should also be your desktop/Wikidata folder and the title should be the same as the import)
- Note, you'll probably need to change to "all file" types unless you compiled your document to MARC
Leave default delimiter settings
Add MARC fields to extract

MARC21 Field for Export	Definition
245 $a	title
245 $b	subtitle
245 $c	statement of responsibility
100 $a	main author/creator
260 $b	publisher (non RDA record)
264 $b	publisher (RDA record)
260 $c	publication date (non RDA record)
264 $c	publication date (RDA record)
700 $a	all other creators
600 $a	people/fictitious characters featured in the story (important for nonfiction works) and works about a specific superhero
650 $a	subjects (mostly grabbed in case you want/are able to replace manually with Wikidata subjects). Also important for groups of fictitious characters, i.e. X-Men

Create a project in OpenRefine

Make sure the character encoding for the project is UTF-8
Uncheck the option to Use character " to enclose cells containing column separators
Name your project after the specific award you're working on
Create project
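
To check how the export will parse before loading it, something like the following may help. This is a minimal sketch using Python's standard csv module and is not part of the original workflow. The function name read_export and the sample rows are made up for illustration, and the header names (245$a and so on) are assumed to match the column names used in the cleanup JSON on this page. Setting quoting=csv.QUOTE_NONE plays the same role as unchecking the quote-character option above: a quotation mark inside a title is kept as ordinary text.

```python
import csv
import io


def read_export(handle):
    """Read a tab-delimited export into a list of dicts keyed by column name."""
    # QUOTE_NONE makes the " character ordinary text instead of a cell wrapper.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


# Hypothetical sample standing in for a real export file.
sample = io.StringIO(
    "245$a\t245$b\t100$a\n"
    "Maus :\ta survivor's tale /\tSpiegelman, Art.\n"
    "\"Weird\" tales /\t\tDoe, Jane,\n"
)

rows = read_export(sample)
for row in rows:
    print(row["245$a"], "|", row["100$a"])
```

For a real file, replace the StringIO sample with open(path, encoding="utf-8", newline=""), where path is wherever you saved the export.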

Clean up the data in OpenRefine

Click "Apply" in the "Undo/Redo" tab
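Paste the JSON from the "OpenRefine Cleanup: Apply JSON" column of the table below into the text box (a basic explanation of the JSON appears at the end of this page). Click "Perform Operations"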
Review your data

OpenRefine Cleanup: GUI Instructions	OpenRefine Cleanup: Apply JSON
title and subtitle Edit cells → Transform Python/Jython import re exceptions = ["and", "or", "the", "a", "of", "in", "to", "an"] title = value lowercase_words = re.split(" ", title.lower()) final_words = [lowercase_words[0].capitalize()] final_words += [word if word in exceptions else word.capitalize() for word in lowercase_words[1:]] final_title = " ".join(final_words) return (final_title) 245 $a Edit column → Rename this column → title 245 $b Edit column → Rename this column → subtitle Python/Jython: import re exceptions = ["and", "or", "the", "a", "of", "in", "to", "an"] title = value lowercase_words = re.split(" ", title.lower()) final_words = [lowercase_words[0].capitalize()] final_words += [word if word in exceptions else word.capitalize() for word in lowercase_words[1:]] final_title = " ".join(final_words) return (final_title) Edit cells → Common transforms → Trim leading and trailing whitespace Edit cells → Common transforms → Collapse consecutive whitespace Edit cells → Transform GREL value.chomp(" :") value.chomp(" /")	[ { "op": "core/column-rename", "oldColumnName": "245$a", "newColumnName": "title", "description": "Rename column 245$a to title" }, { "op": "core/column-rename", "oldColumnName": "245$b", "newColumnName": "subtitle", "description": "Rename column 245$b to subtitle" }, { "op": "core/text-transform", "engineConfig": { "facets": [], "mode": "row-based" }, "columnName": "title", "expression": "jython:import re\nexceptions = [\"and\", \"or\", \"the\", \"a\", \"of\", \"in\", \"to\", \"an\"]\ntitle = value\nlowercase_words = re.split(\" \", title.lower())\nfinal_words = [lowercase_words[0].capitalize()]\nfinal_words += [word if word in exceptions else word.capitalize() for word in lowercase_words[1:]]\nfinal_title = \" \".join(final_words)\nreturn (final_title)", "onError": "keep-original", "repeat": false, "repeatCount": 10, "description": "Text transform on cells in column title using expression jython:import re\nexceptions = [\"and\", \"or\", 
\"the\", \"a\", \"of\", \"in\", \"to\", \"an\"]\ntitle = value\nlowercase_words = re.split(\" \", title.lower())\nfinal_words = [lowercase_words[0].capitalize()]\nfinal_words += [word if word in exceptions else word.capitalize() for word in lowercase_words[1:]]\nfinal_title = \" \".join(final_words)\nreturn (final_title)" }, { "op": "core/text-transform", "engineConfig": { "facets": [], "mode": "row-based" }, "columnName": "subtitle", "expression": "jython:import re\nexceptions = [\"and\", \"or\", \"the\", \"a\", \"of\", \"in\", \"to\", \"an\"]\ntitle = value\nlowercase_words = re.split(\" \", title.lower())\nfinal_words = [lowercase_words[0].capitalize()]\nfinal_words += [word if word in exceptions else word.capitalize() for word in lowercase_words[1:]]\nfinal_title = \" \".join(final_words)\nreturn (final_title)", "onError": "keep-original", "repeat": false, "repeatCount": 10, "description": "Text transform on cells in column subtitle using expression jython:import re\nexceptions = [\"and\", \"or\", \"the\", \"a\", \"of\", \"in\", \"to\", \"an\"]\ntitle = value\nlowercase_words = re.split(\" \", title.lower())\nfinal_words = [lowercase_words[0].capitalize()]\nfinal_words += [word if word in exceptions else word.capitalize() for word in lowercase_words[1:]]\nfinal_title = \" \".join(final_words)\nreturn (final_title)" }, { "op": "core/text-transform", "engineConfig": { "facets": [], "mode": "row-based" }, "columnName": "title", "expression": "value.trim()", "onError": "keep-original", "repeat": false, "repeatCount": 10, "description": "Text transform on cells in column title using expression value.trim()" }, { "op": "core/text-transform", "engineConfig": { "facets": [], "mode": "row-based" }, "columnName": "subtitle", "expression": "value.trim()", "onError": "keep-original", "repeat": false, "repeatCount": 10, "description": "Text transform on cells in column subtitle using expression value.trim()" }, { "op": "core/text-transform", "engineConfig": { "facets": [], 
"mode": "row-based" }, "columnName": "title", "expression": "value.replace(/\\s+/,' ')", "onError": "keep-original", "repeat": false, "repeatCount": 10, "description": "Text transform on cells in column title using expression value.replace(/\\s+/,' ')" }, { "op": "core/text-transform", "engineConfig": { "facets": [], "mode": "row-based" }, "columnName": "subtitle", "expression": "value.replace(/\\s+/,' ')", "onError": "keep-original", "repeat": false, "repeatCount": 10, "description": "Text transform on cells in column subtitle using expression value.replace(/\\s+/,' ')" }, { "op": "core/text-transform", "engineConfig": { "facets": [], "mode": "row-based" }, "columnName": "title", "expression": "grel:value.chomp(\" :\")", "onError": "keep-original", "repeat": false, "repeatCount": 10, "description": "Text transform on cells in column title using expression grel:value.chomp(\" :\")" }, { "op": "core/text-transform", "engineConfig": { "facets": [], "mode": "row-based" }, "columnName": "title", "expression": "grel:value.chomp(\" /\")", "onError": "keep-original", "repeat": false, "repeatCount": 10, "description": "Text transform on cells in column title using expression grel:value.chomp(\" /\")" }, { "op": "core/text-transform", "engineConfig": { "facets": [], "mode": "row-based" }, "columnName": "subtitle", "expression": "grel:value.chomp(\" :\")", "onError": "keep-original", "repeat": false, "repeatCount": 10, "description": "Text transform on cells in column subtitle using expression grel:value.chomp(\" :\")" }, { "op": "core/text-transform", "engineConfig": { "facets": [], "mode": "row-based" }, "columnName": "subtitle", "expression": "grel:value.chomp(\" /\")", "onError": "keep-original", "repeat": false, "repeatCount": 10, "description": "Text transform on cells in column subtitle using expression grel:value.chomp(\" /\")" }, { "op": "core/column-addition", "engineConfig": { "facets": [ { "type": "list", "name": "subtitle", "expression": "isBlank(value)", 
"columnName": "subtitle", "invert": false, "omitBlank": false, "omitError": false, "selection": [ { "v": { "v": false, "l": "false" } } ], "selectBlank": false, "selectError": false } ], "mode": "row-based" }, "baseColumnName": "title", "expression": "grel:cells.title.value + \": \" + cells.subtitle.value", "onError": "set-to-blank", "newColumnName": "label", "columnInsertIndex": 1, "description": "Create column label at index 1 based on column title using expression grel:cells.title.value + \": \" + cells.subtitle.value" }, { "op": "core/column-addition", "engineConfig": { "facets": [ { "type": "list", "name": "subtitle", "expression": "isBlank(value)", "columnName": "subtitle", "invert": false, "omitBlank": false, "omitError": false, "selection": [ { "v": { "v": true, "l": "true" } } ], "selectBlank": false, "selectError": false } ], "mode": "row-based" }, "baseColumnName": "title", "expression": "grel:value", "onError": "set-to-blank", "newColumnName": "label2", "columnInsertIndex": 1, "description": "Create column label2 at index 1 based on column title using expression grel:value" }, { "op": "core/column-addition", "engineConfig": { "facets": [], "mode": "row-based" }, "baseColumnName": "label2", "expression": "join ([coalesce(cells['label2'].value,''),coalesce(cells['label'].value,'')],'')", "onError": "keep-original", "newColumnName": "label3", "columnInsertIndex": 2, "description": "Create column label3 at index 2 based on column label2 using expression join ([coalesce(cells['label2'].value,''),coalesce(cells['label'].value,'')],'')" }, { "op": "core/column-reorder", "columnNames": [ "title", "label3", "subtitle", "245$c", "100$a", "700$a", "260$b", "264$b", "260$c", "264$c", "600$a", "650$a", "300$a" ], "description": "Reorder columns" } ]
Main author Replace all of this text	[ { "op": "core/text-transform", "engineConfig": { "facets": [], "mode": "record-based" }, "columnName": "100$a", "expression": "grel:value.replace(/\\,$/,\"\")", "onError": "keep-original", "repeat": false, "repeatCount": 10, "description": "Text transform on cells in column 100$a using expression grel:value.replace(/\\,$/,\"\")" }, { "op": "core/text-transform", "engineConfig": { "facets": [], "mode": "record-based" }, "columnName": "100$a", "expression": "grel:value.replace(/\\.$/,\"\")", "onError": "keep-original", "repeat": false, "repeatCount": 10, "description": "Text transform on cells in column 100$a using expression grel:value.replace(/\\.$/,\"\")" }, { "op": "core/text-transform", "engineConfig": { "facets": [], "mode": "record-based" }, "columnName": "100$a", "expression": "grel:value.match(/(.*),(.*)/).reverse().join(\" \")", "onError": "keep-original", "repeat": false, "repeatCount": 10, "description": "Text transform on cells in column 100$a using expression grel:value.match(/(.*),(.*)/).reverse().join(\" \")" }, { "op": "core/text-transform", "engineConfig": { "facets": [], "mode": "row-based" }, "columnName": "100$a", "expression": "value.trim()", "onError": "keep-original", "repeat": false, "repeatCount": 10, "description": "Text transform on cells in column 100$a using expression value.trim()" } ]
publisher Add text	[ { "op": "core/text-transform", "engineConfig": { "facets": [], "mode": "record-based" }, "columnName": "264$b", "expression": "grel:value.replace(/\\,$/,\"\")", "onError": "keep-original", "repeat": false, "repeatCount": 10, "description": "Text transform on cells in column 264$b using expression grel:value.replace(/\\,$/,\"\")" }, { "op": "core/multivalued-cell-split", "columnName": "264$b", "keyColumnName": "100$a", "mode": "separator", "separator": ";", "regex": false, "description": "Split multi-valued cells in column 264$b" }, { "op": "core/text-transform", "engineConfig": { "facets": [], "mode": "record-based" }, "columnName": "260$b", "expression": "grel:value.replace(/\\,$/,\"\")", "onError": "keep-original", "repeat": false, "repeatCount": 10, "description": "Text transform on cells in column 260$b using expression grel:value.replace(/\\,$/,\"\")" }, { "op": "core/multivalued-cell-split", "columnName": "260$b", "keyColumnName": "100$a", "mode": "separator", "separator": ";", "regex": false, "description": "Split multi-valued cells in column 260$b" }, { "op": "core/text-transform", "engineConfig": { "facets": [], "mode": "row-based" }, "columnName": "260$b", "expression": "join ([cells['260$b'].value,cells['264$b'].value],)", "onError": "keep-original", "repeat": false, "repeatCount": 10, "description": "Text transform on cells in column 260$b using expression join ([cells['260$b'].value,cells['264$b'].value],)" }, { "op": "core/text-transform", "engineConfig": { "facets": [], "mode": "row-based" }, "columnName": "260$b", "expression": "value.trim()", "onError": "keep-original", "repeat": false, "repeatCount": 10, "description": "Text transform on cells in column 260$b using expression value.trim()" }, { "op": "core/column-addition", "engineConfig": { "facets": [], "mode": "row-based" }, "baseColumnName": "260$b", "expression": "join ([coalesce(cells['260$b'].value,''),coalesce(cells['264$b'].value,'')],'')", "onError": "keep-original", 
"newColumnName": "publisher", "columnInsertIndex": 7, "description": "Create column publisher at index 7 based on column 260$b using expression join ([coalesce(cells['260$b'].value,''),coalesce(cells['264$b'].value,'')],'')" }, { "op": "core/column-reorder", "columnNames": [ "title", "label3", "subtitle", "245$c", "100$a", "700$a", "publisher", "260$c", "264$c", "600$a", "650$a", "300$a" ], "description": "Reorder columns" } ]
publication date/copyright date New text	[ { "op": "core/text-transform", "engineConfig": { "facets": [], "mode": "row-based" }, "columnName": "264$c", "expression": "join ([cells['264$c'].value,cells['260$c'].value],)", "onError": "keep-original", "repeat": false, "repeatCount": 10, "description": "Text transform on cells in column 264$c using expression join ([cells['264$c'].value,cells['260$c'].value],)" }, { "op": "core/column-addition", "engineConfig": { "facets": [], "mode": "row-based" }, "baseColumnName": "260$c", "expression": "join ([coalesce(cells['260$c'].value,''),coalesce(cells['264$c'].value,'')],'')", "onError": "keep-original", "newColumnName": "date", "columnInsertIndex": 8, "description": "Create column date at index 8 based on column 260$c using expression join ([coalesce(cells['260$c'].value,''),coalesce(cells['264$c'].value,'')],'')" }, { "op": "core/column-reorder", "columnNames": [ "title", "label3", "subtitle", "245$c", "100$a", "700$a", "publisher", "date", "600$a", "650$a", "300$a" ], "description": "Reorder columns" }, { "op": "core/column-rename", "oldColumnName": "date", "newColumnName": "264$c", "description": "Rename column date to 264$c" }, { "op": "core/text-transform", "engineConfig": { "facets": [], "mode": "row-based" }, "columnName": "264$c", "expression": "join ([cells['264$c'].value,cells['260$c'].value],)", "onError": "keep-original", "repeat": false, "repeatCount": 10, "description": "Text transform on cells in column 264$c using expression join ([cells['264$c'].value,cells['260$c'].value],)" }, { "op": "core/text-transform", "engineConfig": { "facets": [ { "type": "list", "name": "Starred Rows", "expression": "row.starred", "columnName": "", "invert": false, "omitBlank": false, "omitError": false, "selection": [ { "v": { "v": false, "l": "false" } } ], "selectBlank": false, "selectError": false } ], "mode": "row-based" }, "columnName": "264$c", "expression": "grel:value.replace(/\\;./,)", "onError": "keep-original", 
"repeat": false, "repeatCount": 10, "description": "Text transform on cells in column 264$c using expression grel:value.replace(/\\;./,)" }, { "op": "core/text-transform", "engineConfig": { "facets": [ { "type": "list", "name": "Starred Rows", "expression": "row.starred", "columnName": "", "invert": false, "omitBlank": false, "omitError": false, "selection": [ { "v": { "v": false, "l": "false" } } ], "selectBlank": false, "selectError": false } ], "mode": "row-based" }, "columnName": "264$c", "expression": "grel:value.replace(\"[\", \"\")", "onError": "keep-original", "repeat": false, "repeatCount": 10, "description": "Text transform on cells in column 264$c using expression grel:value.replace(\"[\", \"\")" }, { "op": "core/text-transform", "engineConfig": { "facets": [ { "type": "list", "name": "Starred Rows", "expression": "row.starred", "columnName": "", "invert": false, "omitBlank": false, "omitError": false, "selection": [ { "v": { "v": false, "l": "false" } } ], "selectBlank": false, "selectError": false } ], "mode": "row-based" }, "columnName": "264$c", "expression": "grel:value.replace(\"]\", \"\")", "onError": "keep-original", "repeat": false, "repeatCount": 10, "description": "Text transform on cells in column 264$c using expression grel:value.replace(\"]\", \"\")" }, { "op": "core/text-transform", "engineConfig": { "facets": [ { "type": "list", "name": "Starred Rows", "expression": "row.starred", "columnName": "", "invert": false, "omitBlank": false, "omitError": false, "selection": [ { "v": { "v": false, "l": "false" } } ], "selectBlank": false, "selectError": false } ], "mode": "row-based" }, "columnName": "264$c", "expression": "grel:value.replace(\"©\", \"\")", "onError": "keep-original", "repeat": false, "repeatCount": 10, "description": "Text transform on cells in column 264$c using expression grel:value.replace(\"©\", \"\")" }, { "op": "core/text-transform", "engineConfig": { "facets": [ { "type": "list", "name": "Starred Rows", "expression": 
"row.starred", "columnName": "", "invert": false, "omitBlank": false, "omitError": false, "selection": [ { "v": { "v": false, "l": "false" } } ], "selectBlank": false, "selectError": false } ], "mode": "row-based" }, "columnName": "264$c", "expression": "grel:value.chomp('.')", "onError": "keep-original", "repeat": false, "repeatCount": 10, "description": "Text transform on cells in column 264$c using expression grel:value.chomp('.')" }, { "op": "core/text-transform", "engineConfig": { "facets": [], "mode": "record-based" }, "columnName": "264$c", "expression": "grel:value.replace(/\\;.+/, '')", "onError": "keep-original", "repeat": false, "repeatCount": 10, "description": "Text transform on cells in column 264$c using expression grel:value.replace(/\\;.+/, '')" }, { "op": "core/text-transform", "engineConfig": { "facets": [], "mode": "record-based" }, "columnName": "264$c", "expression": "grel:value.replace(/\\..+/, '')", "onError": "keep-original", "repeat": false, "repeatCount": 10, "description": "Text transform on cells in column 264$c using expression grel:value.replace(/\\..+/, '')" }, { "op": "core/text-transform", "engineConfig": { "facets": [], "mode": "record-based" }, "columnName": "264$c", "expression": "grel:value.chomp('.')", "onError": "keep-original", "repeat": false, "repeatCount": 10, "description": "Text transform on cells in column 264$c using expression grel:value.chomp('.')" }, { "op": "core/column-rename", "oldColumnName": "264$c", "newColumnName": "date", "description": "Rename column 264$c to date" } ]
other creators New text	[ { "op": "core/column-split", "engineConfig": { "facets": [], "mode": "row-based" }, "columnName": "700$a", "guessCellType": true, "removeOriginalColumn": true, "mode": "separator", "separator": ";", "regex": false, "maxColumns": 0, "description": "Split column 700$a by separator" }, { "op": "core/text-transform", "engineConfig": { "facets": [], "mode": "record-based" }, "columnName": "700$a 1", "expression": "grel:value.replace(/\\,$/,\"\")", "onError": "keep-original", "repeat": false, "repeatCount": 10, "description": "Text transform on cells in column 700$a 1 using expression grel:value.replace(/\\,$/,\"\")" }, { "op": "core/text-transform", "engineConfig": { "facets": [ { "type": "list", "name": "Starred Rows", "expression": "row.starred", "columnName": "", "invert": false, "omitBlank": false, "omitError": false, "selection": [ { "v": { "v": false, "l": "false" } } ], "selectBlank": false, "selectError": false } ], "mode": "row-based" }, "columnName": "700$a 1", "expression": "value.trim()", "onError": "keep-original", "repeat": false, "repeatCount": 10, "description": "Text transform on cells in column 700$a 1 using expression value.trim()" }, { "op": "core/text-transform", "engineConfig": { "facets": [], "mode": "record-based" }, "columnName": "700$a 1", "expression": "grel:value.replace(/\\,$/,\"\")", "onError": "keep-original", "repeat": false, "repeatCount": 10, "description": "Text transform on cells in column 700$a 1 using expression grel:value.replace(/\\,$/,\"\")" }, { "op": "core/text-transform", "engineConfig": { "facets": [], "mode": "record-based" }, "columnName": "700$a 1", "expression": "grel:value.match(/(.*),(.*)/).reverse().join(\" \")", "onError": "keep-original", "repeat": false, "repeatCount": 10, "description": "Text transform on cells in column 700$a 1 using expression grel:value.match(/(.*),(.*)/).reverse().join(\" \")" } ]
Create Den (description in English) new text	[ { "op": "core/column-addition", "engineConfig": { "facets": [], "mode": "row-based" }, "baseColumnName": "genre", "expression": "grel:value + \" by \" + cells.statedAs.value + \", \" + cells.inception.value", "onError": "set-to-blank", "newColumnName": "Den", "columnInsertIndex": 6, "description": "Create column Den at index 6 based on column genre using expression grel:value + \" by \" + cells.statedAs.value + \", \" + cells.inception.value" } ]
label Edit column → add column based on this column Column name: "newItem" GREL value newItem → edit column → move column to beginning	[ { "op": "core/column-addition", "engineConfig": { "facets": [], "mode": "row-based" }, "baseColumnName": "label", "expression": "grel:value", "onError": "set-to-blank", "newColumnName": "newItem", "columnInsertIndex": 2, "description": "Create column newItem at index 2 based on column label using expression grel:value" }, { "op": "core/column-move", "columnName": "newItem", "index": 0, "description": "Move column newItem to position 0" } ]
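
The title and subtitle steps in the table can be tried outside OpenRefine with plain Python. The following is a rough sketch of that logic, not the project's own script: the capitalization rule follows the Python/Jython expression in the table, while the helper names (title_case, chomp, clean, make_label) and the sample values are made up for illustration. It assumes the steps run in the same order as the JSON: capitalization, trim, collapse whitespace, then removal of a trailing " :" or " /".

```python
import re

# Short words left lowercase unless they start the title (list taken from the table).
EXCEPTIONS = {"and", "or", "the", "a", "of", "in", "to", "an"}


def title_case(value):
    """Capitalize every word except the exceptions; always capitalize the first word."""
    words = value.lower().split(" ")
    rest = [w if w in EXCEPTIONS else w.capitalize() for w in words[1:]]
    return " ".join([words[0].capitalize()] + rest)


def chomp(value, suffix):
    """Remove suffix from the end of value if present, similar to GREL's chomp."""
    return value[: -len(suffix)] if value.endswith(suffix) else value


def clean(value):
    """Apply the cleanup steps in the same order as the JSON operations."""
    value = title_case(value).strip()
    value = re.sub(r"\s+", " ", value)
    for suffix in (" :", " /"):
        value = chomp(value, suffix)
    return value


def make_label(title, subtitle):
    """Join title and subtitle with ': ', or return the title alone if there is no subtitle."""
    title = clean(title)
    subtitle = clean(subtitle) if subtitle else ""
    return f"{title}: {subtitle}" if subtitle else title


print(make_label("MAUS :", "a survivor's tale /"))  # Maus: A Survivor's Tale
print(make_label("the best of the spirit /", ""))    # The Best of the Spirit
```

Because Python's capitalize() lowercases everything after the first character of a word, a hyphenated name such as "X-Men" comes out as "X-men", so titles like that are worth checking by hand when you review your data.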

Reconciliation

You will need to reconcile the following items against Wikidata:

new item
100$a (creator)
260 $b (publisher)
700 fields (likely to be additional authors, illustrators, colorists, letterers, translators, and/or editors)

Basic Explanation of the JSON

For a basic understanding of the JSON above, here is essentially what the code is doing.

The first two pieces rename 245 $a and 245 $b to title and subtitle (essentially copying University of Washington's workflow). From there, the title and subtitle are normalized to match Wikidata standards (each word is capitalized except short words such as "and" and "of", extra white space is removed, trailing " :" and " /" punctuation is removed, etc.).
After the title/subtitle have been cleaned up, a label is created.
After the label is created, the 100 field is cleaned up: the trailing comma or period is removed and the name is inverted (so first name, then last name).
Next, the publisher is tackled. Note these steps will clean up and combine the 260 $b and 264 $b fields (so you're left with a single publisher).
The next bit of code focuses on the date of publication. Similar to the publisher steps, it cleans up and combines the 260 $c and 264 $c fields.
- Sometimes a copyright symbol slips into these fields; double check your work
And the last bit, which is the most complicated and time consuming, focuses on the 700 field. For this field, you will separate all the 700s and then clean each resulting column individually. Because comics often have multiple creators, you will likely need to repeat this step multiple times. You will also need to figure out each person's role, e.g. whether they are an inker or penciller (illustrator in Wikidata), a colorist, a letterer, or an editor. When you create your schema, all your columns need to be properly lined up so that your illustrators can be batched separately from your authors and letterers.
- More information on this will be forthcoming. For now, note that the project lead has generally starred comics with more than 7 creators to work on them separately.
The last step creates a new item.
Information/steps for the 6xx fields will be coming soon.
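
As a rough illustration of the creator, publisher, and date steps described above, here is the same logic written as plain Python. This is a sketch under assumptions rather than the project's own code: the function names (invert_name, merge_publisher, clean_date) and the sample values are made up, the name pattern assumes the intended GREL match is "everything before the comma, everything after it", and the sketch skips the step that splits multi-valued publisher cells on ";".

```python
import re


def chomp(value, suffix):
    """Remove suffix from the end of value if present, similar to GREL's chomp."""
    return value[: -len(suffix)] if value.endswith(suffix) else value


def invert_name(value):
    """Turn 'Last, First' into 'First Last' after dropping a trailing comma or period."""
    value = re.sub(r",$", "", value)
    value = re.sub(r"\.$", "", value)
    match = re.fullmatch(r"(.*),(.*)", value)
    if match is None:
        return value.strip()  # no comma, so leave the name as it is
    surname, forename = match.groups()
    return f"{forename.strip()} {surname.strip()}"


def merge_publisher(field_260b, field_264b):
    """Combine 260 $b and 264 $b, assuming only one of them is filled in per record."""
    parts = [re.sub(r",$", "", part or "") for part in (field_260b, field_264b)]
    return "".join(parts).strip()


def clean_date(value):
    """Strip brackets, the copyright symbol, and anything after a ';' or '.'."""
    for char in "[]©":
        value = value.replace(char, "")
    value = re.sub(r";.+", "", value)
    value = re.sub(r"\..+", "", value)
    return chomp(value, ".").strip()


print(invert_name("Spiegelman, Art."))            # Art Spiegelman
print(merge_publisher("Pantheon Books,", None))   # Pantheon Books
print(clean_date("[2011]"))                       # 2011
```

Any 700 column produced by the split can be run through invert_name in the same way as the 100 field. Working out each person's role still has to be done by hand.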