Wikidata:WikiProject PCC Wikidata Pilot/University of Washington/Workflow, Trainings, and Resources/MARC21 to Wikidata Workflow

University of Washington Libraries Linked Data Team
Crystal Clements (cec23@uw.edu)
Contributions by Thad Guidry
Last revised February 22, 2022

Workflow

edit

Export MARC Records (there are many right ways to do this)

edit
  • As a batch through OCLC Connexion
  • As a published set through Alma
    • You need the appropriate permissions in Alma to manage and run jobs on sets
    • Create a set using Admin --> manage sets
    • Run a job on the set to export the bibliographic records
    • University of Washington
      • Select job "Export Bibliographic Records"
      • Save to UW FTP, subdirectory "MARC21" (You need permissions to access this file)

Convert Selected MARC Fields to a Tab-Delimited File (there are many right ways to do this)

edit
  • Get the newest version of MarcEdit
  • Open the program "Export Tab Delimited"
    • Set file paths for binary files going in, and tab-delimited files going out
    • Leave default delimiter settings
    • Add MARC fields to extract (this can be saved in a .txt file so you only have to manually enter them once for each type of entity)
    • University of Washington export settings reflected in the table below (versions for MARC exported from Alma and OCLC Connexion available via GitHub)
  • Export

Create a Project in OpenRefine

edit
  • Download the latest version of OpenRefine
    • Note: OpenRefine version 3.4.1 runs with Java versions 8-15, but will not run on Java 16 or higher.
  • Select "choose files", and choose the tab-delimited file(s) you just created in MarcEdit and select "Next"
  • Name columns: title,subtitle,statedAs,creator,genre,submittedTo,inception,language,pages,degreeSupervisor,hdlURI,OCLCID,ISBN (for easy interoperability with our schema)
    • ISBN is included here for other workflows, but is not applicable to our electronic theses and dissertations. Instructions for this data point are italicized and are missing from the JSON history. Thank you to Kyla Jemison from the University of Toronto Libraries for this addition.
MARC21 Field for Export Column Name in OpenRefine
245 $a title
245 $b subtitle
245 $c statedAs
100 creator
502 $b genre
502 $c submittedTo
502 $d inception
008 language
300 $a pages
700 degreeSupervisor
856 $u hdlURI
035 (for Alma)...001 if exporting from Connexion OCLCID
020 $a ISBN
  • Instruct OpenRefine to discard the first row of data
  • Uncheck the option to 'Use character " to enclose cells containing column separators'
  • Name your project
  • "Create project"

Manipulate MARC Data in OpenRefine

edit

Either:
Click "Apply" in the "Undo/Redo" tab
Paste the extracted JSON history from this GitHub repository into the text box
Click "Perform Operations"
Review your data
Or:
Follow the steps in the table below

OpenRefine Cleanup: GUI Instructions OpenRefine Cleanup: Apply JSON
title and subtitle
  • [Broken code needs to be replaced] Exception words = and, or, the, a, of, in, to, an, for, with, by, at, if, as, from, nor, but, yet, so, through, near, across, about, on, amid
  • Edit cells → Common transforms → Trim leading and trailing whitespace
  • Edit cells → Common transforms → Collapse consecutive whitespace
  • Edit cells → Transform
    • GREL
      • value.chomp(" :")
      • value.chomp(" /")
[
  {
    "op": "core/text-transform",
    "engineConfig": {
      "facets": [],
      "mode": "row-based"
    },
    "columnName": "title",
    "expression": "jython:import re\nexceptions = [\"and\", \"or\", \"the\", \"a\", \"of\", \"in\", \"to\", \"an\", \"for\", \"with\", \"by\", \"at\", \"if\", \"as\", \"from\", \"nor\", \"but\", \"yet\", \"so\", \"through\", \"near\", \"across\", \"about\", \"on\", \"amid\"]\ntitle = value\nlowercase_words = re.split(\" \", title.lower())\nfinal_words = [lowercase_words[0].capitalize()]\nfinal_words += [word if word in exceptions else word.capitalize() for word in lowercase_words[1:]]\nfinal_title = \" \".join(final_words)\nreturn (final_title)",
    "onError": "keep-original",
    "repeat": false,
    "repeatCount": 10,
    "description": "Text transform on cells in column title using expression jython:import re\nexceptions = [\"and\", \"or\", \"the\", \"a\", \"of\", \"in\", \"to\", \"an\", \"for\", \"with\", \"by\", \"at\", \"if\", \"as\", \"from\", \"nor\", \"but\", \"yet\", \"so\", \"through\", \"near\", \"across\", \"about\", \"on\", \"amid\"]\ntitle = value\nlowercase_words = re.split(\" \", title.lower())\nfinal_words = [lowercase_words[0].capitalize()]\nfinal_words += [word if word in exceptions else word.capitalize() for word in lowercase_words[1:]]\nfinal_title = \" \".join(final_words)\nreturn (final_title)"
  },
  {
    "op": "core/text-transform",
    "engineConfig": {
      "facets": [],
      "mode": "row-based"
    },
    "columnName": "subtitle",
    "expression": "jython:import re\nexceptions = [\"and\", \"or\", \"the\", \"a\", \"of\", \"in\", \"to\", \"an\", \"for\", \"with\", \"by\", \"at\", \"if\", \"as\", \"from\", \"nor\", \"but\", \"yet\", \"so\", \"through\", \"near\", \"across\", \"about\", \"on\", \"amid\"]\ntitle = value\nlowercase_words = re.split(\" \", title.lower())\nfinal_words = [lowercase_words[0].capitalize()]\nfinal_words += [word if word in exceptions else word.capitalize() for word in lowercase_words[1:]]\nfinal_title = \" \".join(final_words)\nreturn (final_title)",
    "onError": "keep-original",
    "repeat": false,
    "repeatCount": 10,
    "description": "Text transform on cells in column subtitle using expression jython:import re\nexceptions = [\"and\", \"or\", \"the\", \"a\", \"of\", \"in\", \"to\", \"an\", \"for\", \"with\", \"by\", \"at\", \"if\", \"as\", \"from\", \"nor\", \"but\", \"yet\", \"so\", \"through\", \"near\", \"across\", \"about\", \"on\", \"amid\"]\ntitle = value\nlowercase_words = re.split(\" \", title.lower())\nfinal_words = [lowercase_words[0].capitalize()]\nfinal_words += [word if word in exceptions else word.capitalize() for word in lowercase_words[1:]]\nfinal_title = \" \".join(final_words)\nreturn (final_title)"
  },
  {
    "op": "core/text-transform",
    "engineConfig": {
      "facets": [],
      "mode": "row-based"
    },
    "columnName": "title",
    "expression": "value.trim()",
    "onError": "keep-original",
    "repeat": false,
    "repeatCount": 10,
    "description": "Text transform on cells in column title using expression value.trim()"
  },
  {
    "op": "core/text-transform",
    "engineConfig": {
      "facets": [],
      "mode": "row-based"
    },
    "columnName": "subtitle",
    "expression": "value.trim()",
    "onError": "keep-original",
    "repeat": false,
    "repeatCount": 10,
    "description": "Text transform on cells in column subtitle using expression value.trim()"
  },
  {
    "op": "core/text-transform",
    "engineConfig": {
      "facets": [],
      "mode": "row-based"
    },
    "columnName": "title",
    "expression": "value.replace(/\\s+/,' ')",
    "onError": "keep-original",
    "repeat": false,
    "repeatCount": 10,
    "description": "Text transform on cells in column title using expression value.replace(/\\s+/,' ')"
  },
  {
    "op": "core/text-transform",
    "engineConfig": {
      "facets": [],
      "mode": "row-based"
    },
    "columnName": "subtitle",
    "expression": "value.replace(/\\s+/,' ')",
    "onError": "keep-original",
    "repeat": false,
    "repeatCount": 10,
    "description": "Text transform on cells in column subtitle using expression value.replace(/\\s+/,' ')"
  },
  {
    "op": "core/text-transform",
    "engineConfig": {
      "facets": [],
      "mode": "row-based"
    },
    "columnName": "title",
    "expression": "grel:value.chomp(\" :\")",
    "onError": "keep-original",
    "repeat": false,
    "repeatCount": 10,
    "description": "Text transform on cells in column title using expression grel:value.chomp(\" :\")"
  },
  {
    "op": "core/text-transform",
    "engineConfig": {
      "facets": [],
      "mode": "row-based"
    },
    "columnName": "title",
    "expression": "grel:value.chomp(\" /\")",
    "onError": "keep-original",
    "repeat": false,
    "repeatCount": 10,
    "description": "Text transform on cells in column title using expression grel:value.chomp(\" /\")"
  },
  {
    "op": "core/text-transform",
    "engineConfig": {
      "facets": [],
      "mode": "row-based"
    },
    "columnName": "subtitle",
    "expression": "grel:value.chomp(\" :\")",
    "onError": "keep-original",
    "repeat": false,
    "repeatCount": 10,
    "description": "Text transform on cells in column subtitle using expression grel:value.chomp(\" :\")"
  },
  {
    "op": "core/text-transform",
    "engineConfig": {
      "facets": [],
      "mode": "row-based"
    },
    "columnName": "subtitle",
    "expression": "grel:value.chomp(\" /\")",
    "onError": "keep-original",
    "repeat": false,
    "repeatCount": 10,
    "description": "Text transform on cells in column subtitle using expression grel:value.chomp(\" /\")"
  }
]
Create label column
  • “subtitle” → facet → customized facet → facet by blank (null or empty string)
  • “False”
  • “title” → edit column → add column based on this column
    • New column name: “label”
    • GREL
      • cells.title.value + ": " + cells.subtitle.value
    • “True”
    • “title” → edit column → add column based on this column
      • New column name: “label2”
      • GREL
        • value
    • Remove facet
    • “label” → edit column → join columns
      • Select “label” and “label2”
      • Skip nulls
      • Write result in selected column
[
  {
    "op": "core/column-addition",
    "engineConfig": {
      "facets": [
        {
          "type": "list",
          "name": "subtitle",
          "expression": "isBlank(value)",
          "columnName": "subtitle",
          "invert": false,
          "omitBlank": false,
          "omitError": false,
          "selection": [
            {
              "v": {
                "v": false,
                "l": "false"
              }
            }
          ],
          "selectBlank": false,
          "selectError": false
        }
      ],
      "mode": "row-based"
    },
    "baseColumnName": "title",
    "expression": "grel:cells.title.value + \": \" + cells.subtitle.value",
    "onError": "set-to-blank",
    "newColumnName": "label",
    "columnInsertIndex": 1,
    "description": "Create column label at index 1 based on column title using expression grel:cells.title.value + \": \" + cells.subtitle.value"
  },
  {
    "op": "core/column-addition",
    "engineConfig": {
      "facets": [
        {
          "type": "list",
          "name": "subtitle",
          "expression": "isBlank(value)",
          "columnName": "subtitle",
          "invert": false,
          "omitBlank": false,
          "omitError": false,
          "selection": [
            {
              "v": {
                "v": true,
                "l": "true"
              }
            }
          ],
          "selectBlank": false,
          "selectError": false
        }
      ],
      "mode": "row-based"
    },
    "baseColumnName": "title",
    "expression": "grel:value",
    "onError": "set-to-blank",
    "newColumnName": "label2",
    "columnInsertIndex": 1,
    "description": "Create column label2 at index 1 based on column title using expression grel:value"
  },
  {
    "op": "core/text-transform",
    "engineConfig": {
      "facets": [],
      "mode": "row-based"
    },
    "columnName": "label",
    "expression": "join ([cells['label'].value,cells['label2'].value],'')",
    "onError": "keep-original",
    "repeat": false,
    "repeatCount": 10,
    "description": "Text transform on cells in column label using expression join ([cells['label'].value,cells['label2'].value],'')"
  },
  {
    "op": "core/column-reorder",
    "columnNames": [
      "title",
      "label",
      "subtitle",
      "statedAs",
      "creator",
      "genre",
      "submittedTo",
      "inception",
      "language",
      "pages",
      "degreeSupervisor",
      "hdlURI",
      "OCLCID"
    ],
    "description": "Reorder columns"
  }
]
statedAs
  • Edit cells → transform
    • GREL
      • value.chomp(".")
[
  {
    "op": "core/text-transform",
    "engineConfig": {
      "facets": [],
      "mode": "row-based"
    },
    "columnName": "statedAs",
    "expression": "grel:value.chomp(\".\")",
    "onError": "keep-original",
    "repeat": false,
    "repeatCount": 10,
    "description": "Text transform on cells in column statedAs using expression grel:value.chomp(\".\")"
  },
 {
    "op": "core/text-transform",
    "engineConfig": {
      "facets": [],
      "mode": "row-based"
    },
    "columnName": "statedAs",
    "expression": "value.replace(\"by \",\"\")",
    "onError": "keep-original",
    "repeat": false,
    "repeatCount": 10,
    "description": "Text transform on cells in column statedAs using expression value.replace(\"by \",\"\")"
  }
]
creator
  • Edit cells → transform
    • GREL
      • if(value.contains("http:"),

"http://" + value.rpartition("http://")[2], value.rpartition(/\-[a-z]+/)[0].rpartition(/\,author/)[0].rpartition(/\d+/)[0].chomp(",") )

  • StatedAs → Facet → Customized facet → Facet by blank
  • "False"
  • Creator → Edit cells → Transform
    • GREL
      • if(value.contains(","),

(cells.statedAs.value), (value))

[
  {
    "op": "core/text-transform",
    "engineConfig": {
      "facets": [],
      "mode": "row-based"
    },
    "columnName": "creator",
    "expression": "grel:if(value.contains(\"http:\"),\n\"http://\" + value.rpartition(\"http://\")[2],\nvalue.rpartition(/\\-[a-z]+/)[0].rpartition(/\\,author/)[0].rpartition(/\\d+/)[0].chomp(\",\")\n)",
    "onError": "keep-original",
    "repeat": false,
    "repeatCount": 10,
    "description": "Text transform on cells in column creator using expression grel:if(value.contains(\"http:\"),\n\"http://\" + value.rpartition(\"http://\")[2],\nvalue.rpartition(/\\-[a-z]+/)[0].rpartition(/\\,author/)[0].rpartition(/\\d+/)[0].chomp(\",\")\n)"
  },
  {
    "op": "core/text-transform",
    "engineConfig": {
      "facets": [],
      "mode": "row-based"
    },
    "columnName": "creator",
    "expression": "grel:if(value.contains(\",\"),\n(cells.statedAs.value),\n(value))",
    "onError": "keep-original",
    "repeat": false,
    "repeatCount": 10,
    "description": "Text transform on cells in column creator using expression grel:if(value.contains(\",\"),\n(cells.statedAs.value),\n(value))"
  }
]
genre
  • Hover over individual cells and click the blue “edit” button
  • Change values to “master’s thesis”, “doctoral thesis”, or other appropriate term
  • Selecting the option to “change all identical cells” expedites this process
  • Make sure you do not miss any values
    • Facet → custom text facet
      • GREL (replace “doctoral thesis” with your most common value(s)
        • value.contains("doctoral thesis")
[
  {
    "op": "core/mass-edit",
    "engineConfig": {
      "facets": [],
      "mode": "row-based"
    },
    "columnName": "genre",
    "expression": "value",
    "edits": [
      {
        "from": [
          "Ph. D."
        ],
        "fromBlank": false,
        "fromError": false,
        "to": "doctoral thesis"
      }
    ],
    "description": "Mass edit cells in column genre"
  },
  {
    "op": "core/mass-edit",
    "engineConfig": {
      "facets": [],
      "mode": "row-based"
    },
    "columnName": "genre",
    "expression": "value",
    "edits": [
      {
        "from": [
          "M.S."
        ],
        "fromBlank": false,
        "fromError": false,
        "to": "master's thesis"
      }
    ],
    "description": "Mass edit cells in column genre"
  }
]
inception
  • Edit cells → transform
    • GREL
      • value.chomp(".")
[
  {
    "op": "core/text-transform",
    "engineConfig": {
      "facets": [],
      "mode": "row-based"
    },
    "columnName": "inception",
    "expression": "grel:value.chomp(\".\")",
    "onError": "keep-original",
    "repeat": false,
    "repeatCount": 10,
    "description": "Text transform on cells in column inception using expression grel:value.chomp(\".\")"
  }
]
Create Den (description in English)
  • genre → edit column → add column based on this column
    • New column name: “Den”
    • GREL
      • value + " by " + cells.statedAs.value + ", " + cells.inception.value
[
  {
    "op": "core/column-addition",
    "engineConfig": {
      "facets": [],
      "mode": "row-based"
    },
    "baseColumnName": "genre",
    "expression": "grel:value + \" by \" + cells.statedAs.value + \", \" + cells.inception.value",
    "onError": "set-to-blank",
    "newColumnName": "Den",
    "columnInsertIndex": 6,
    "description": "Create column Den at index 6 based on column genre using expression grel:value + \" by \" + cells.statedAs.value + \", \" + cells.inception.value"
  }
]
language
  • Edit cells → transform
    • GREL
      • if(value.contains("eng"),

"Q1860", value)

  • Edit cells → Transform
    • GREL
      • if(value.contains("zxx"),

null, value)

[
  {
    "op": "core/text-transform",
    "engineConfig": {
      "facets": [],
      "mode": "row-based"
    },
    "columnName": "language",
    "expression": "grel:if(value.contains(\"eng\"),\n\"Q1860\",\nvalue)",
    "onError": "keep-original",
    "repeat": false,
    "repeatCount": 10,
    "description": "Text transform on cells in column language using expression grel:if(value.contains(\"eng\"),\n\"Q1860\",\nvalue)"
  },
  {
    "op": "core/text-transform",
    "engineConfig": {
      "facets": [],
      "mode": "row-based"
    },
    "columnName": "language",
    "expression": "grel:if(value.contains(\"zxx\"),\nnull,\nvalue)",
    "onError": "keep-original",
    "repeat": false,
    "repeatCount": 10,
    "description": "Text transform on cells in column language using expression grel:if(value.contains(\"zxx\"),\nnull,\nvalue)"
  }
]
pages
  • Edit cells → replace
    • Take out all the words, phrases, and punctuation in your data except for the numbers, commas between numbers, and words representing units (“page” or “leaf” for instance)
    • For each unit type represented in your data:
      • pages → facet → custom text facet
        • value.contains("[word representing unit]")
        • Example: value.contains("pages")
      • “True”
      • Edit column → Add column based on this column
        • Name based on unit
        • Example: “page”
      • page (or new column name) → Edit cells → Transform
        • GREL
          • value.rpartition(" pages")[0].rpartition(/\d+/)[1]
  • pages → Edit column → Add column based on this column
    • New column name "romanNumbered[word representing unit]"
      • Example: “romanNumberedPage”
      • GREL
        • if (value.contains (","),

value.rpartition(", ") [0], null )

  • Remove “pages” and “leaves”, any other excess columns, leaving only one column per pagination type per numbering type (page,romanNumberedPage,leaf,romanNumberedLeaf)
  • Convert roman numerals to arabic numbers
    • Edit cells → transform
      • Clojure
        • (defn ro2ar [r]

(->> (reverse (.toUpperCase r)) (map {\M 1000 \D 500 \C 100 \L 50 \X 10 \V 5 \I 1}) (partition-by identity) (map (partial apply +)) (reduce #(if (< %1 %2) (+ %1 %2) (- %1 %2))))) (ro2ar value)

[
  {
    "op": "core/text-transform",
    "engineConfig": {
      "facets": [],
      "mode": "row-based"
    },
    "columnName": "pages",
    "expression": "value.replace(\"1 online resource (\",\"\")",
    "onError": "keep-original",
    "repeat": false,
    "repeatCount": 10,
    "description": "Text transform on cells in column pages using expression value.replace(\"1 online resource (\",\"\")"
  },
  {
    "op": "core/text-transform",
    "engineConfig": {
      "facets": [],
      "mode": "row-based"
    },
    "columnName": "pages",
    "expression": "value.replace(\") :\",\"\")",
    "onError": "keep-original",
    "repeat": false,
    "repeatCount": 10,
    "description": "Text transform on cells in column pages using expression value.replace(\") :\",\"\")"
  },
  {
    "op": "core/text-transform",
    "engineConfig": {
      "facets": [],
      "mode": "row-based"
    },
    "columnName": "pages",
    "expression": "value.trim()",
    "onError": "keep-original",
    "repeat": false,
    "repeatCount": 10,
    "description": "Text transform on cells in column pages using expression value.trim()"
  },
  {
    "op": "core/text-transform",
    "engineConfig": {
      "facets": [],
      "mode": "row-based"
    },
    "columnName": "pages",
    "expression": "value.replace(/\\s+/,' ')",
    "onError": "keep-original",
    "repeat": false,
    "repeatCount": 10,
    "description": "Text transform on cells in column pages using expression value.replace(/\\s+/,' ')"
  },
  {
    "op": "core/column-addition",
    "engineConfig": {
      "facets": [
        {
          "type": "list",
          "name": "pages",
          "expression": "grel:value.contains(\"pages\")",
          "columnName": "pages",
          "invert": false,
          "omitBlank": false,
          "omitError": false,
          "selection": [
            {
              "v": {
                "v": true,
                "l": "true"
              }
            }
          ],
          "selectBlank": false,
          "selectError": false
        }
      ],
      "mode": "row-based"
    },
    "baseColumnName": "pages",
    "expression": "grel:value",
    "onError": "set-to-blank",
    "newColumnName": "page",
    "columnInsertIndex": 11,
    "description": "Create column page at index 11 based on column pages using expression grel:value"
  },
  {
    "op": "core/text-transform",
    "engineConfig": {
      "facets": [
        {
          "type": "list",
          "name": "pages",
          "expression": "grel:value.contains(\"pages\")",
          "columnName": "pages",
          "invert": false,
          "omitBlank": false,
          "omitError": false,
          "selection": [
            {
              "v": {
                "v": true,
                "l": "true"
              }
            }
          ],
          "selectBlank": false,
          "selectError": false
        }
      ],
      "mode": "row-based"
    },
    "columnName": "page",
    "expression": "grel:value.rpartition(\"pages\")[0].rpartition(/\\d+/)[1]",
    "onError": "keep-original",
    "repeat": false,
    "repeatCount": 10,
    "description": "Text transform on cells in column page using expression grel:value.rpartition(\"pages\")[0].rpartition(/\\d+/)[1]"
  },
  {
    "op": "core/column-addition",
    "engineConfig": {
      "facets": [
        {
          "type": "list",
          "name": "pages",
          "expression": "grel:value.contains(\"pages\")",
          "columnName": "pages",
          "invert": false,
          "omitBlank": false,
          "omitError": false,
          "selection": [
            {
              "v": {
                "v": true,
                "l": "true"
              }
            }
          ],
          "selectBlank": false,
          "selectError": false
        }
      ],
      "mode": "row-based"
    },
    "baseColumnName": "pages",
    "expression": "grel:if (value.contains (\",\"),\nvalue.rpartition(\", \") [0],\nnull\n)",
    "onError": "set-to-blank",
    "newColumnName": "romanNumberedPage",
    "columnInsertIndex": 11,
    "description": "Create column romanNumberedPage at index 11 based on column pages using expression grel:if (value.contains (\",\"),\nvalue.rpartition(\", \") [0],\nnull\n)"
  },
  {
    "op": "core/column-removal",
    "columnName": "pages",
    "description": "Remove column pages"
  },
  {
    "op": "core/text-transform",
    "engineConfig": {
      "facets": [
        {
          "type": "list",
          "name": "pages",
          "expression": "grel:value.contains(\"pages\")",
          "columnName": "pages",
          "invert": false,
          "omitBlank": false,
          "omitError": false,
          "selection": [
            {
              "v": {
                "v": true,
                "l": "true"
              }
            }
          ],
          "selectBlank": false,
          "selectError": false
        }
      ],
      "mode": "row-based"
    },
    "columnName": "romanNumberedPage",
    "expression": "clojure:(defn ro2ar [r]\n(->> (reverse (.toUpperCase r))\n(map {\\M 1000 \\D 500 \\C 100 \\L 50 \\X 10 \\V 5 \\I 1})\n(partition-by identity)\n(map (partial apply +))\n(reduce #(if (< %1 %2) (+ %1 %2) (- %1 %2)))))\n(ro2ar value)",
    "onError": "keep-original",
    "repeat": false,
    "repeatCount": 10,
    "description": "Text transform on cells in column romanNumberedPage using expression clojure:(defn ro2ar [r]\n(->> (reverse (.toUpperCase r))\n(map {\\M 1000 \\D 500 \\C 100 \\L 50 \\X 10 \\V 5 \\I 1})\n(partition-by identity)\n(map (partial apply +))\n(reduce #(if (< %1 %2) (+ %1 %2) (- %1 %2)))))\n(ro2ar value)"
  }
]
degreeSupervisor
  • Facet → Custom text facet
    • GREL
      • value.contains("degree supervisor")
      • If “false” is an option in facet menu, look at contributors and remove any who are not committee chairs
    • Remove facet
    • Delete any empty columns that result from this process
  • Edit column → split into multiple columns
    • Based on separator ;
  • First column
    • Edit cells → transform
      • GREL
        • if(value.contains("http:"),

"http://" + value.rpartition("http://")[2], value.rpartition(/\-[a-z]+/)[0].rpartition(/\,degree supervisor/)[0].rpartition(/\d+/)[0].chomp(",") )

  • Subsequent columns
    • Facet → customized facet → blank
    • “False”
    • GREL
      • if(value.contains("http:"),

"http://" + value.rpartition("http://")[2], value.rpartition(/\-[a-z]+/)[0].rpartition(/\,degree supervisor/)[0].rpartition(/\d+/)[0].chomp(",") )

  • All degreeSupervisor columns (one at a time):
    • Facet → custom text facet
      • GREL
        • value.contains(",")
      • “True”
    • Edit column → add column based on this column
      • Column name: inverseName
      • GREL
        • value
    • inverseName → edit column → split into several columns
      • by separator “,”
    • degreeSupervisor → edit cells → transform
      • GREL
        • cells["inverseName 2"].value + " " + cells["inverseName 1"].value
    • Remove facet
    • Delete inverseName columns
[
  {
    "op": "core/column-split",
    "engineConfig": {
      "facets": [],
      "mode": "row-based"
    },
    "columnName": "degreeSupervisor",
    "guessCellType": true,
    "removeOriginalColumn": true,
    "mode": "separator",
    "separator": ";",
    "regex": false,
    "maxColumns": 0,
    "description": "Split column degreeSupervisor by separator"
  },
  {
    "op": "core/text-transform",
    "engineConfig": {
      "facets": [],
      "mode": "row-based"
    },
    "columnName": "degreeSupervisor 1",
    "expression": "grel:if(value.contains(\"http:\"),\n\"http://\" + value.rpartition(\"http://\")[2],\nvalue.rpartition(/\\-[a-z]+/)[0].rpartition(/\\,degree supervisor/)[0].rpartition(/\\d+/)[0].chomp(\",\")\n)",
    "onError": "keep-original",
    "repeat": false,
    "repeatCount": 10,
    "description": "Text transform on cells in column degreeSupervisor 1 using expression grel:if(value.contains(\"http:\"),\n\"http://\" + value.rpartition(\"http://\")[2],\nvalue.rpartition(/\\-[a-z]+/)[0].rpartition(/\\,degree supervisor/)[0].rpartition(/\\d+/)[0].chomp(\",\")\n)"
  },
  {
    "op": "core/text-transform",
    "engineConfig": {
      "facets": [
        {
          "type": "list",
          "name": "degreeSupervisor 1",
          "expression": "isBlank(value)",
          "columnName": "degreeSupervisor 1",
          "invert": false,
          "omitBlank": false,
          "omitError": false,
          "selection": [
            {
              "v": {
                "v": false,
                "l": "false"
              }
            }
          ],
          "selectBlank": false,
          "selectError": false
        }
      ],
      "mode": "row-based"
    },
    "columnName": "degreeSupervisor 1",
    "expression": "grel:if(value.contains(\"http:\"),\n\"http://\" + value.rpartition(\"http://\")[2],\nvalue.rpartition(/\\-[a-z]+/)[0].rpartition(/\\,degree supervisor/)[0].rpartition(/\\d+/)[0].chomp(\",\")\n)",
    "onError": "keep-original",
    "repeat": false,
    "repeatCount": 10,
    "description": "Text transform on cells in column degreeSupervisor 1 using expression grel:if(value.contains(\"http:\"),\n\"http://\" + value.rpartition(\"http://\")[2],\nvalue.rpartition(/\\-[a-z]+/)[0].rpartition(/\\,degree supervisor/)[0].rpartition(/\\d+/)[0].chomp(\",\")\n)"
  },
  {
    "op": "core/column-addition",
    "engineConfig": {
      "facets": [
        {
          "type": "list",
          "name": "degreeSupervisor 1",
          "expression": "grel:value.contains(\",\")",
          "columnName": "degreeSupervisor 1",
          "invert": false,
          "omitBlank": false,
          "omitError": false,
          "selection": [
            {
              "v": {
                "v": true,
                "l": "true"
              }
            }
          ],
          "selectBlank": false,
          "selectError": false
        }
      ],
      "mode": "row-based"
    },
    "baseColumnName": "degreeSupervisor 1",
    "expression": "grel:value",
    "onError": "set-to-blank",
    "newColumnName": "inverseName",
    "columnInsertIndex": 13,
    "description": "Create column inverseName at index 13 based on column degreeSupervisor 1 using expression grel:value"
  },
  {
    "op": "core/column-split",
    "engineConfig": {
      "facets": [
        {
          "type": "list",
          "name": "degreeSupervisor 1",
          "expression": "grel:value.contains(\",\")",
          "columnName": "degreeSupervisor 1",
          "invert": false,
          "omitBlank": false,
          "omitError": false,
          "selection": [
            {
              "v": {
                "v": true,
                "l": "true"
              }
            }
          ],
          "selectBlank": false,
          "selectError": false
        }
      ],
      "mode": "row-based"
    },
    "columnName": "inverseName",
    "guessCellType": true,
    "removeOriginalColumn": true,
    "mode": "separator",
    "separator": ",",
    "regex": false,
    "maxColumns": 0,
    "description": "Split column inverseName by separator"
  },
  {
    "op": "core/text-transform",
    "engineConfig": {
      "facets": [
        {
          "type": "list",
          "name": "degreeSupervisor 1",
          "expression": "grel:value.contains(\",\")",
          "columnName": "degreeSupervisor 1",
          "invert": false,
          "omitBlank": false,
          "omitError": false,
          "selection": [
            {
              "v": {
                "v": true,
                "l": "true"
              }
            }
          ],
          "selectBlank": false,
          "selectError": false
        }
      ],
      "mode": "row-based"
    },
    "columnName": "degreeSupervisor 1",
    "expression": "grel:cells[\"inverseName 2\"].value + \" \" + cells[\"inverseName 1\"].value",
    "onError": "keep-original",
    "repeat": false,
    "repeatCount": 10,
    "description": "Text transform on cells in column degreeSupervisor 1 using expression grel:cells[\"inverseName 2\"].value + \" \" + cells[\"inverseName 1\"].value"
  },
  {
    "op": "core/text-transform",
    "engineConfig": {
      "facets": [],
      "mode": "row-based"
    },
    "columnName": "degreeSupervisor 1",
    "expression": "value.trim()",
    "onError": "keep-original",
    "repeat": false,
    "repeatCount": 10,
    "description": "Text transform on cells in column degreeSupervisor 1 using expression value.trim()"
  },
  {
    "op": "core/column-removal",
    "columnName": "inverseName 1",
    "description": "Remove column inverseName 1"
  },
  {
    "op": "core/column-removal",
    "columnName": "inverseName 2",
    "description": "Remove column inverseName 2"
  }
]
hdlURI
  • If several values exist in some cells, split into several columns and remove excess columns
  • Edit column → create new column based on this column
    • Column name hdl
    • GREL
      • value.substring(22)
[
  {
    "op": "core/column-addition",
    "engineConfig": {
      "facets": [],
      "mode": "row-based"
    },
    "baseColumnName": "hdlURI",
    "expression": "grel:value.substring(22)",
    "onError": "set-to-blank",
    "newColumnName": "hdl",
    "columnInsertIndex": 14,
    "description": "Create column hdl at index 14 based on column hdlURI using expression grel:value.substring(22)"
  }
]
OCLCID (applies to Alma exports only; Connexion exports do not need editing)
  • Split multi-valued cells using ; as delimiter
  • On each resulting column:
    • Facet → custom text facet
      • GREL
        • value.contains("(OCoLC)")
      • False
    • Edit cells → replace
      • Regex
        • .*
      • Replace with blank
    • Remove facet
  • Edit column → join columns
    • Select all OCLCID columns
    • Separate each column with ;
    • Write result in new column named OCLCID
    • Delete joined columns
  • Edit cells → replace
    • (OCoLC)
    • Replace with blank
  • Eyeball your values to make sure the column only has one numeric value of about the right length
[
  {
    "op": "core/column-split",
    "engineConfig": {
      "facets": [
        {
          "type": "list",
          "name": "OCLCID 3",
          "expression": "grel:value.contains(\"(OCoLC)\")",
          "columnName": "OCLCID 3",
          "invert": false,
          "omitBlank": false,
          "omitError": false,
          "selection": [
            {
              "v": {
                "v": false,
                "l": "false"
              }
            }
          ],
          "selectBlank": false,
          "selectError": false
        }
      ],
      "mode": "row-based"
    },
    "columnName": "OCLCID",
    "guessCellType": true,
    "removeOriginalColumn": true,
    "mode": "separator",
    "separator": ";",
    "regex": false,
    "maxColumns": 0,
    "description": "Split column OCLCID by separator"
  },
  {
    "op": "core/text-transform",
    "engineConfig": {
      "facets": [
        {
          "type": "list",
          "name": "OCLCID 3",
          "expression": "grel:value.contains(\"(OCoLC)\")",
          "columnName": "OCLCID 3",
          "invert": false,
          "omitBlank": false,
          "omitError": false,
          "selection": [
            {
              "v": {
                "v": false,
                "l": "false"
              }
            }
          ],
          "selectBlank": false,
          "selectError": false
        }
      ],
      "mode": "row-based"
    },
    "columnName": "OCLCID 3",
    "expression": "value.replace(/.*/,\"\")",
    "onError": "keep-original",
    "repeat": false,
    "repeatCount": 10,
    "description": "Text transform on cells in column OCLCID 3 using expression value.replace(/.*/,\"\")"
  },
  {
    "op": "core/column-addition",
    "engineConfig": {
      "facets": [],
      "mode": "row-based"
    },
    "baseColumnName": "OCLCID 1",
    "expression": "join ([coalesce(cells['OCLCID 1'].value,''),coalesce(cells['OCLCID 2'].value,''),coalesce(cells['OCLCID 3'].value,'')],';')",
    "onError": "keep-original",
    "newColumnName": "OCLCID",
    "columnInsertIndex": 16,
    "description": "Create column OCLCID at index 16 based on column OCLCID 1 using expression join ([coalesce(cells['OCLCID 1'].value,''),coalesce(cells['OCLCID 2'].value,''),coalesce(cells['OCLCID 3'].value,'')],';')"
  },
  {
    "op": "core/column-reorder",
    "columnNames": [
      "title",
      "label",
      "subtitle",
      "statedAs",
      "creator",
      "genre",
      "Den",
      "submittedTo",
      "inception",
      "language",
      "romanNumberedPage",
      "page",
      "degreeSupervisor 1",
      "hdlURI",
      "hdl",
      "OCLCID"
    ],
    "description": "Reorder columns"
  },
  {
    "op": "core/text-transform",
    "engineConfig": {
      "facets": [],
      "mode": "row-based"
    },
    "columnName": "OCLCID",
    "expression": "value.replace(\"(OCoLC)\",\"\")",
    "onError": "keep-original",
    "repeat": false,
    "repeatCount": 10,
    "description": "Text transform on cells in column OCLCID using expression value.replace(\"(OCoLC)\",\"\")"
  }
]
label
  • Edit column → add column based on this column
    • Column name: "newItem”
    • GREL
      • value
    • newItem → edit column → move column to beginning
[
  {
    "op": "core/column-addition",
    "engineConfig": {
      "facets": [],
      "mode": "row-based"
    },
    "baseColumnName": "label",
    "expression": "grel:value",
    "onError": "set-to-blank",
    "newColumnName": "newItem",
    "columnInsertIndex": 2,
    "description": "Create column newItem at index 2 based on column label using expression grel:value"
  },
  {
    "op": "core/column-move",
    "columnName": "newItem",
    "index": 0,
    "description": "Move column newItem to position 0"
  }
]
ISBN
  • Edit cells → split multi-valued cells
    • Delimiter ;
  • Edit column → add column based on this column
    • Column name: “ISBN13”
    • GREL
      • value.match(/.*(\d{13}).*/)[0]
  • Edit column → add column based on this column
    • Column name: “ISBN10”
    • GREL
      • value.match(/.*(\d{10}).*/)[0]

For Wikidata schema: ISBN13 and ISBN10 correspond with Wikidata properties P212 and P957

Reconcile in OpenRefine

edit

Reconcile the following columns against Wikidata:

edit
  • newItem
  • creator
  • genre
  • submittedTo
  • language
  • degreeSupervisor

Plug Your Data into a Wikidata Schema

edit

From the Extensions menu, choose "Import Wikidata Schema"

  • Our current schema is available in this Github repository, and can be copied and pasted into OpenRefine
  • You can edit this base schema or create your own by selecting "Edit Wikidata schema" from the Extensions menu

Review Schema and Data
Specifically,

  • Use the available buttons, which closely mirror the Wikidata user interface, to click and drag or type in properties and values to adjust the schema to your dataset
  • Make sure your schema makes the right statements, without duplicating statements, for items that already exist (reconciled values). Faceting and selectively including/excluding rows using flags/stars is helpful with this
  • Make sure you have included references where they are needed
  • Use “Preview” to make sure your statements are correct
  • Tip: If you’re having trouble dragging a reconciled column into your schema, go back and make sure that you have selected “create new item” or reconciled the values in each row
    • If you still have trouble, make sure you have the latest version of OpenRefine

From the Extensions menu, choose “Upload edits to Wikidata” (or, if you prefer, “Export to Quickstatements” and use Quickstatements to upload your edits. This is especially handy if you need to submit your edits for bot review prior to uploading to Wikidata)
Review new items and statements loaded to Wikidata

Useful Resources

edit

Companion GitHub Repository
OpenRefine User Manual
GREL Functions
Jython and Clojure
RegEx Cheat Sheet
University of Washington Libraries Application Profile for Electronic Theses and Dissertations
University of Washington Libraries Github Repository for this Workflow