User:Jrubashk/PodcastWorkflow

Wikidata Podcast Episode Workflow, Version 1.0

Seeing if a Wikidata page for the Podcast exists

  • Before creating episode item pages, search Wikidata to check whether a page for the podcast itself already exists.
  • If a page already exists, here are the applicable fields to look for and/or add: Instance Of (P31) - Podcast (Q24634210), Presenter(s) (P371), Distribution Format (P437) - Audio Podcast (Q24633474), Web Feed URL (RSS Feed) (P1019), Apple Podcasts podcast ID (P5842), Podchaser Podcast ID (P7998).
  • Here are additional statements that are nice to have for a podcast page: Part of (Podcast Network) (P361), Number of episodes (P1113) (don’t forget to add a point in time qualifier with the date), Official Website (P856), Spotify show ID (P5916), social media handles & info.
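
The checklist above can be sketched as a small helper that reports which statements an existing podcast page still lacks. This is only an illustration: the property IDs come from the lists above, while the function name and the sample list of existing properties are hypothetical.

```python
# Property IDs are taken from the checklist above; labels are for display only.
REQUIRED = {
    "P31": "instance of",
    "P371": "presenter",
    "P437": "distribution format",
    "P1019": "web feed URL",
    "P5842": "Apple Podcasts podcast ID",
    "P7998": "Podchaser podcast ID",
}
NICE_TO_HAVE = {
    "P361": "part of",
    "P1113": "number of episodes",
    "P856": "official website",
    "P5916": "Spotify show ID",
}


def missing_statements(existing_pids, wanted):
    """Return {pid: label} for wanted properties the page does not have yet."""
    have = set(existing_pids)
    return {pid: label for pid, label in wanted.items() if pid not in have}


# Hypothetical page that already has three of the core statements.
existing = ["P31", "P371", "P1019"]
for pid, label in missing_statements(existing, REQUIRED).items():
    print(f"still needed: {label} ({pid})")
```

The same helper works for the nice-to-have list by passing `NICE_TO_HAVE` as the second argument.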

Creating a Wikidata page for a Podcast series

  • Create a new item.
  • Here are the target statements to create for a new item: Instance Of (P31) - Podcast (Q24634210), Presenter(s) (P371), Distribution Format (P437) - Audio Podcast (Q24633474), Web Feed URL (RSS Feed) (P1019), Apple Podcasts podcast ID (P5842), Podchaser Podcast ID (P7998).
  • Create the Podcast series page before creating the individual episode pages. The episode pages will use these fields and/or cite them as sources.

Gathering the info for individual podcast episodes


Scraper Choice

  • Scooter Labs, RSS to CSV Converter
    • What data you get out: Episode Title, Link (depends on hosting service/rss feed if this populates successfully), Description, Publication Date, GUID
    • The additional work that you’ll have to do in Google Sheets:
      • Populating: Part of the Series (Podcast Name), Instance of - Podcast Episode (Q61855877), Distribution Format - Audio Podcast (Q24633474), Apple Podcast ID, Apple Podcast Episode ID, Language, Host(s), Identifier (Podchaser Creator ID (P9743)), Date Retrieved (format: YYYY-MM-DD), Talk Show Guest(s)
    • A few common transformations in OpenRefine will be necessary to get information such as the date into a format that Wikidata accepts.
    • This option is easier up front; however, if you would like links or identifiers, a lot of extra manual work is required.
      • This manual work can include getting the Apple Podcast Episode ID, which can be found at the end of each Apple Podcasts episode URL.
      • The GUID requires some problem solving if you would like to use it to generate a link for each episode. RSS feeds are distributed from various podcast hosting services, several of which come up repeatedly.
  • Joe’s PAWS Scraper
    • What data you get out: Publication Date, Episode URL (Apple Podcast Episode URL), Apple Podcast Episode Identifier, Episode Title, Episode Play Time
    • How to & walk through document
    • The additional work that you’ll have to do in Google Sheets:  
      • Populating: Part of the Series (Podcast Name), Instance of - Podcast Episode (Q61855877), Distribution Format - Audio Podcast (Q24633474), Apple Podcast ID, Language, Host(s), Identifier (Podchaser Creator ID (P9743)), Date Retrieved (format: YYYY-MM-DD), Talk Show Guest(s)
    • This option is more challenging up front and requires you to use additional tools; however, if you would like links or identifiers, it requires significantly less manual work.
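
To make the RSS fields listed above concrete, here is a minimal sketch of reading episode title, link, description, publication date, and GUID out of a feed and writing them as CSV. It is not the Scooter Labs converter or Joe’s PAWS scraper, only an illustration using the Python standard library. The sample feed is made up, and the function names are hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Tiny made-up feed standing in for a real podcast RSS feed.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example Podcast</title>
    <item>
      <title>Episode 1: Pilot</title>
      <link>https://example.org/episodes/1</link>
      <description>The first episode.</description>
      <pubDate>Mon, 03 Jan 2022 08:00:00 +0000</pubDate>
      <guid>example-guid-0001</guid>
    </item>
    <item>
      <title>Episode 2: Follow-up</title>
      <description>No link element in this one.</description>
      <pubDate>Mon, 10 Jan 2022 08:00:00 +0000</pubDate>
      <guid>example-guid-0002</guid>
    </item>
  </channel>
</rss>
"""

FIELDS = ["title", "link", "description", "pubDate", "guid"]


def feed_to_rows(feed_xml):
    """Return one dict per <item>; fields missing from the feed become ''."""
    root = ET.fromstring(feed_xml)
    rows = []
    for item in root.iter("item"):
        rows.append({f: (item.findtext(f) or "").strip() for f in FIELDS})
    return rows


def rows_to_csv(rows):
    """Serialize the rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


rows = feed_to_rows(SAMPLE_FEED)
print(rows_to_csv(rows))
```

The second sample episode has no link element, which mirrors the note above that the link only populates for some hosting services.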

Google Sheets Work


Expanding the work from the scraper’s output & preparing it for OpenRefine

  • Static Work: Part of the Series, Instance of, Distribution Format, Language, Apple Podcast ID, Podchaser Identifiers, Hosts
  • Info that will change based on the episode (that may not be included with the scraper): Apple Podcast Episode ID, Talk Show Guests, episode links
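
The split between static and per-episode columns can also be sketched in code, for anyone who prefers to prepare rows outside Google Sheets. This is an illustration only. It assumes the Apple Podcasts episode URL carries the episode ID in a trailing `i` query parameter, and the podcast name, Apple ID, URL, and column names are placeholders.

```python
from datetime import date
from urllib.parse import parse_qs, urlparse

# Static values shared by every episode row (placeholders, not real data).
STATIC = {
    "Part of the Series": "Example Podcast",
    "Instance of": "podcast episode",        # Q61855877
    "Distribution Format": "audio podcast",  # Q24633474
    "Language": "English",
    "Apple Podcast ID": "0000000000",
}


def apple_episode_id(url):
    """Return the value of the URL's 'i' query parameter, or '' if absent.

    Assumption: episode URLs end in '?i=<episode id>'.
    """
    values = parse_qs(urlparse(url).query).get("i", [])
    return values[0] if values else ""


def expand_row(row):
    """Copy a scraper row and add the static and per-episode columns."""
    out = dict(row)
    out.update(STATIC)
    out["Apple Podcast Episode ID"] = apple_episode_id(row.get("Episode URL", ""))
    out["Date Retrieved"] = date.today().isoformat()  # YYYY-MM-DD
    return out


row = {
    "Episode Title": "Episode 1: Pilot",
    "Episode URL": "https://podcasts.apple.com/us/podcast/"
                   "episode-1-pilot/id0000000000?i=1000000000001",
}
print(expand_row(row))
```

Hosts and guests are left out here because each needs its own column and is filled in per podcast or per episode.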

OpenRefine Work

  • Getting your spreadsheet data into OpenRefine
    • Export your spreadsheet.
    • Download and install OpenRefine. If you are on a managed computer, you can use PAWS to do your OpenRefine work, so that OpenRefine does not need to be installed locally on your computer.
    • Open the OpenRefine application.
    • Upload the spreadsheet file you’d like to work on.
    • Select the sheet(s) you would like to do the reconciliation, schema and upload work on.
    • Create project.
  • Reconciling Data
    • For individual podcast episode titles, reconcile against Podcast Episodes. You will need to go to the column submenu and create new pages for each item.
    • Part of the series - find the Podcast that you are creating said episodes for.
    • Instance of - Podcast Episode (Q61855877)
    • Distribution Format - Audio Podcast (Q24633474)
    • Language
    • Hosts - each host will need their own column. Reconcile the column to get the given host’s Wikidata page.
    • If the host has a Podchaser Creator ID, these codes will not be reconciled. They will be used as is and then implemented within the schema.
    • Guests - each guest will need their own column. In the schema these names will follow under Talk Show Guest (P5030).
    • For date-related columns, open the column menu and choose Edit cells > Common transforms > To date. If none of the dates turn green and take on the date format after this command, you will first need to transform the values into a shape that can be converted this way. I recommend using either the value.substring or value.slice GREL functions.
  • Schema Creation
    • When your data is reconciled, move on to Edit Wikibase Schema, found under the Extensions - Wikibase button.
    • Add item. Drag the episode title into it (the title should have a green line underneath it, denoting that it has been reconciled).
    • For terms, use Description and Alias, and choose a language for each. For the Description, enter the podcast name followed by "episode". For the Alias, drag your episode title to the field.
    • Statements:
      • Title - language - episode title. Reference: Apple Podcasts ID, Retrieved
      • Publication date - episode date. Reference: Apple Podcasts ID, Retrieved
      • Apple Podcasts Episode ID - Apple Podcast Episode ID. Reference: Episode URL (this will have been curated if Joe’s scraper was utilized), Retrieved
      • Part of the Series - Podcast Name. Reference: Apple Podcast ID, Retrieved.
      • Instance of - Instance of.
      • Distribution Format - Distribution Format
      • Language of Work or Name - Language.
      • Presenter - Host 1, Host 2. Reference for each host (if they have identifiers or the podcast URL includes their name in the visible description), Retrieved.
      • Talk Show Guest - Guest. Reference: Apple Podcast Episode ID (if this includes the guest’s name in the visible description), Retrieved.
  • Uploading
    • Check to see if the data is being presented correctly in the Preview section. Change as necessary or as you see fit.
    • Sign in with your Wikidata account.
    • Give a brief description of your edits.
    • Upload.
    • On Wikidata, check your watchlist or contributions page to review your uploaded data.
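
If the To date transform under Reconciling Data does not recognize your publication dates, one alternative to GREL is to normalize them before importing the spreadsheet. The sketch below assumes the dates are in the RFC 822 style that RSS feeds commonly use (for example "Mon, 03 Jan 2022 08:00:00 +0000"). The function name is hypothetical, and the sketch is not part of the workflow above.

```python
from email.utils import parsedate_to_datetime


def pubdate_to_iso(pub_date):
    """Convert an RFC 822 style date string to YYYY-MM-DD.

    Returns '' when the value cannot be parsed, so bad cells are easy to spot.
    The date is taken in the time zone given in the string itself.
    """
    try:
        return parsedate_to_datetime(pub_date).date().isoformat()
    except (TypeError, ValueError):
        # Older Python versions raise TypeError, newer ones ValueError.
        return ""


for raw in ["Mon, 03 Jan 2022 08:00:00 +0000", "not a date"]:
    print(repr(raw), "->", repr(pubdate_to_iso(raw)))
```

Values in YYYY-MM-DD form also match the Date Retrieved format used earlier in this workflow.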

Sample Data

  • Sample Google Sheet - download/select for OpenRefine
  • Sample OpenRefine History - copy and paste into the History apply section in the OpenRefine project