Wikidata:WikiProject Cultural heritage/Guidelines/Editing Data in Spreadsheet Mode

IntroductionEdit

Sometimes, you want to make the same type of edit on many or all the items of an entire dataset in Wikidata, e.g. in order to add translations in a given language, in order to add urls of websites, coordinates, etc. Instead of clicking yourself through all the different items, you may prefer to make your edits in a spreadsheet. This also facilitates the harmonization of entries across the entire dataset, as you can go through your dataset several times, use the sort function in order to edit similar items in one go, etc.

On this page we will explain how to edit data on Wikidata in spreadsheet mode. We will guide you through a step-by-step process that you can follow to edit the data. The three tools that are used to accomplish this are:

  • the Wikidata Query Service used to export data from Wikidata into a spreadsheet;
  • the Quick Statements tool used to batch ingest data into Wikidata; and
  • Microsoft Word / Excel used to edit the data and to generate the code to be fed into the Quick Statements Tool (they may be substituted by the corresponding programs of an office suite of your choice that have a mail merge function).  

Our example case: Let us add German labels to all museums in Switzerland (at the time of writing, almost 100 museums in the French and in the Italian speaking parts of Switzerland were still missing German labels).

Your training case: If you want to follow through the step-by-step process yourself on a data set of your own, we suggest that you add labels to a subset of the Swiss museums in the language of your choice or that you do the same for the museums in another country. This will require only minor adaptations to the example code provided for the example case.

The Basic Process in Four StepsEdit

Step 1: Define your data set and generate a corresponding SPARQL queryEdit

  Think about the data set you want to edit: Which items should it comprise? Which data fields do you need for each item? Once you are ready to answer these two questions, you can go to the Wikidata Query Service and enter the corresponding query. Note that you should be including all the data fields that you want to edit, along with the data fields that contain useful information that can be drawn upon when editing the other data fields.


 

ExampleEdit

In our example case, we want a data set that comprises all museums in Switzerland (pretty much all of them already have an item on Wikidata).

The following SPARQL query selects all Wikidata items that describe an instance of "museum" or of any of the sub-classes of "museum". It further restricts the selection to items that are located in an administrative territorial entity in Switzerland. – The relevant instructions can be found in our example code in the two lines following the "WHERE" tag, which contains the information about which items are to be selected from the entire Wikidata data base.

SELECT ?item
       ?Label_en
       ?Description_en
       (group_concat(distinct ?Alias_en;separator="; ") as ?Aliases_en)    #Concatenate the values in order not to get several rows per item.
       ?Label_de
       ?Description_de
       (GROUP_CONCAT(distinct ?Alias_de;separator="; ") as ?Aliases_de)
       ?Label_fr
       (GROUP_CONCAT(distinct ?Alias_fr;separator="; ") as ?Aliases_fr)
       ?Label_it
       (GROUP_CONCAT(distinct ?Alias_it;separator="; ") as ?Aliases_it)
       (GROUP_CONCAT(distinct ?MunicipalityLabel_de;separator="; ") as ?MunicipalityLabels_de)
       (GROUP_CONCAT(distinct ?CantonLabel_de;separator="; ") as ?CantonLabels_de)
  WHERE {
  {?item wdt:P31 ?museum . ?museum wdt:P279* wd:Q33506 } . # Select items that are instance of "museum" or of any of its sub-classes.
  {?item wdt:P131/wdt:P17 wd:Q39} .  # Select items that are located in an administrative territorial entity in Switzerland.
  OPTIONAL { ?item rdfs:label ?Label_en . FILTER (lang(?Label_en) = "en") }
  OPTIONAL { ?item rdfs:label ?Label_de . FILTER (lang(?Label_de) = "de") } 
  OPTIONAL { ?item rdfs:label ?Label_fr . FILTER (lang(?Label_fr) = "fr") }
  OPTIONAL { ?item rdfs:label ?Label_it . FILTER (lang(?Label_it) = "it") }
  OPTIONAL { ?item skos:altLabel ?Alias_en . FILTER (lang(?Alias_en) = "en") }
  OPTIONAL { ?item skos:altLabel ?Alias_de . FILTER (lang(?Alias_de) = "de") } 
  OPTIONAL { ?item skos:altLabel ?Alias_fr . FILTER (lang(?Alias_fr) = "fr") }
  OPTIONAL { ?item skos:altLabel ?Alias_it . FILTER (lang(?Alias_it) = "it") }
  OPTIONAL { ?item schema:description ?Description_en . FILTER (lang(?Description_en) = "en") }
  OPTIONAL { ?item schema:description ?Description_de . FILTER (lang(?Description_de) = "de") } 
  OPTIONAL { ?item schema:description ?Description_fr . FILTER (lang(?Description_fr) = "fr") }
  OPTIONAL { ?item schema:description ?Description_it . FILTER (lang(?Description_it) = "it") }
  OPTIONAL { ?item wdt:P131 ?Municipality . FILTER EXISTS {?Municipality wdt:P31 wd:Q70208} }  
  OPTIONAL { ?item wdt:P131/rdfs:label ?MunicipalityLabel_de . FILTER EXISTS {?MunicipalityLabel_de ^rdfs:label/wdt:P31 wd:Q70208} . FILTER (lang(?MunicipalityLabel_de) = "de")} 
  OPTIONAL { ?item wdt:P131/rdfs:label ?CantonLabel_de . FILTER EXISTS {?CantonLabel_de ^rdfs:label/wdt:P31 wd:Q23058} . FILTER (lang(?CantonLabel_de) = "de") } 
}
GROUP BY ?item                          #List all the variables for which the values are not concatenated!
         ?Label_en ?Description_en 
         ?Label_de ?Description_de 
         ?Label_fr ?Description_fr 
         ?Label_it ?Description_it
Try it!


 

When you click on "Try it", the query opens in the interface of the Wikidata Query Service, where you can either run the query as it is (click on the "Run" button in the menu at the bottom of the page) or modify it. The Wikidata Query Service is absolutely central when it comes to displaying or exporting data that already exists in Wikidata. We therefore suggest that you get acquainted with the SPARQL query language. In addition to the official W3C specification, there is also a User Manual specifically for the Wikidata Query Service, as well as an introduction for beginners.

In order to get started, you may just want to take the example query and modify it step by step according to your needs. You can also find a more complete example query covering all heritage institutions in Switzerland along with additional example queries here. By and by, you will get the idea of it and with the help of the documentation you will write your own queries.

When looking at our sample code, you can distinguish four sections:

  • The first section starts with the tag "SELECT" and lists all the variables (variable names always start with a question mark) that will be displayed in our table; in spreadsheet format, this corresponds to the list of column headers of the table.
  • The second section starts with the tag "WHERE" and contains the information about which items are to be selected from the entire Wikidata data base. On the interface of the Wikidata Query Service, you can use "moseover" on the Q-Numbers and P-Numbers to display their labels. Thus, "Q39" stands for "Switzerland". In order to write the same query for another country, it suffices to replace "Q39" with the Q-Number of another country (e.g. "Q142" for France).
  • The third section contains a series of statements starting with the tag "OPTIONAL". They define the data fields that will be available for display in our table. Thus, every variable to be used in the first section needs to be defined either in the second or third section. While the "WHERE" tag of the second section serves to select the items in our data set (ending up on the rows of our spreadsheet), the "OPTIONAL" tags of the third section allow us to select the values that can eventually be displayed for each item (corresponding to the columns of our spreadsheet).
  • The fourth section starting with the "GROUP BY" element lists all the variables for which the values are not to be concatenated (i.e. variables that can only have a single value). This fourth section is needed only because we are using the "GROUP_CONCAT" tag in the first section for the variables that may take multiple values, such as aliases or administrative territorial entities. If we were not to concatenate multiple values into one variable, our query would produce several rows per item – one for each combination of values across all variables.

You can now start playing around with the code, for example by changing the languages in which the information is displayed. Start by changing the language attribute of the last two "OPTIONAL" tags into your own language (e.g. "en" instead of "de"). Note that if you also change the variable name from "?MunicipalityLabel_de" to "?MunicipalityLabel_en", you will need to make sure that you change the variable wherever it occurs in the code (in this case: 2 instances in the first section, and 3 instances in the third section).

When you are happy with the result, use the menu item “Link” at the bottom of the page to generate a short URL of your query that can be used to re-generate your query at a later point in time. Save the string somewhere if you want to use the same query again later.

And on a final note: If you are confronted with browser performance issues when running your queries, try using the query service from within the Chrome browser. But note that there is a server time-out for longer queries in any case, with the acceptable length varying over time and across browser types.


Step 2: Download the data set, save it to your computer and open itEdit

  Download your data set by using the menu at the bottom of the page: Select “Download” / “CSV”. This allows you to download the data to your computer in spreadsheet format. Save it to your workspace on your computer and open it.
  Use LibreOffice to open your CSV file and save it to Excel format. If you want to open the file directly from Excel, follow the instructions here. Make sure that you select the right column separator. In our example, only the comma should be selected as column separator, as the semicolon is used to separate multiple values within single fields.


 

ExampleEdit

Our data set looks like this:


 


Step 3: Edit the data setEdit

  Edit the data set in the spreadsheet programme of your choice. Before you start modifying or completing your data, you may want to do strip the item URLs in your data set from their path in order to just have the Q-Number and to mark the cells in your document to which you are going to make changes.
  Concretely:
  1. Select the first column with the item URLs, and replace the path “http://www.wikidata.org/entity/” by an empty string; in order to ingest our data into Wikidata at the end, we will need the Q-number without the path.
  2. If you are going to make changes only to a fraction of the items in your data set while leaving the rest untouched, you should mark the items you are going to change so you can filter them later when preparing for the data ingestion, e.g. by adding a separate column containing the value 1 for all the items you are going to modify. Filtering the items at the ingestion stage will save you some time. If you are planning to make changes to a vast majority of the items in your data set, you do not need to bother marking them. Nothing bad will happen if you try to re-ingest the same data into Wikidata; the Quick Statements Tool will skip them.

When you are done with these preparatory steps, you can now change your data or add some missing values.


 

ExampleEdit

We first remove the path from the Q-numbers in the first column of our spreadsheet:

 


We then freeze the top row of the data set, add a filter to it and sort the “Label_de” variable alphabetically, which leaves all the empty cells at the end. As we are going to add a German label to all the items for which such a label is missing, we create an additional column “mod” next to the “Label_de” column and fill it with the value 1 for all the items that do not have a German label.

 


We now carry out our actual editing job by adding the missing German labels in the “Label_de” column.


Step 4: Generate the statements and feed them into the Quick Statements ToolEdit

 

Use the mail merge functionality of your office suite to generate the statements and feed them into the Quick Statements Tool.

 

The mail merge functionality of Microsoft Word and Excel allows you to create templates for a given class of Wikidata items with all the properties, including references you may want to add. Once you have created your template, you can link it to your Excel file with the data and generate the code needed to create new items with their properties or to add additional properties into existing Wikidata items. Once you have completed the mail merge process, you can just copy-paste the resulting document into the Quick Statements Tool. If you are not familiar with the mail merge functionality, follow the instructions here.

The Quick Statements Tool is very handy if you want to ingest larger quantities of data into Wikidata without resorting to a bot of your own. The tool is very easy to use. You can feed it lists of several thousands of triples (statements) at a time; when you interrupt your Internet connection, it will resume automatically upon re-connecting to the Internet. Make sure that you check for error messages in the tool's output (search for the string “ERROR”). Before feeding longer lists of data, start step by step, one item after another, inspecting them to make sure that you do not have any mistakes in your code. Make sure that the string values in your spreadsheet do not include any ordinary quotation marks (" "), as they are used by the Quick Statements Tool to identify string values; replace them by different quotation marks (e.g. « » or whatever quotation marks are commonly used in the given language).

Note that the Quick Statements Tool is presently undergoing a complete overhaul. – You may want to check out the test version of the new tool.


 

ExampleEdit

As we are just going to add German labels to our items, our Word template contains only one line (i.e. one statement). Note that the different elements of the statements are separated by a <tab> character as required by the Quick Statements Tool:

 

We use the mail merge functionality of Word to generate our code, using the "letters" function. Remember to filter out the rows without any modifications. The output looks like this:

 

We then paste our code into the Quick Statements Tool, click on "Do It", and wait for the tool to process our code. The tool logs every edit it makes to Wikidata; from there it is very easy to inspect the various items you have modified. Check some of them to make sure you made the correct changes. If you have not used it before, the Quick Statements Tool may ask you to log in to WiDaR before executing your code – just follow the instructions and log in with your Wikimedia user account.

Making More Complex EditsEdit

By following these guidelines, you have learned how to make simple edits to Wikidata in spreadsheet mode. By following the same basic pattern, you can now move on to more complex cases:

  • In our example case, we have added simple string values; in other cases (e.g. when adding administrative territorial units to your data set), you may want to enter Q-numbers of corresponding Wikidata items instead of string values. In this case, you will have to look up the Q-numbers and add them to your spreadsheet. This can be done either manually or by writing a SPARQL query to get all the Q-numbers and further values needed for matching purposes (e.g. Q-numbers of all Swiss municipalities along with the municipality name) and then by merging the resulting data set with the data set you are complementing. If the needed items do not already exist on Wikidata, you will need to create them beforehand (e.g. when ingesting data from countries where items about municipalities are missing).
  • In our example case, we have just added a single statement to each item. – You may want to edit several columns in your spreadsheet at once. You can do this by following the same procedure. In the Quick Statements Tool you can use the "LAST" tag to add further statements to the same item. Just adapt your Word-Template accordingly and pull your data from several columns of your spreadsheet.
  • In our example case, we did not add a source to our statements. You can do this as well following the same procedure. Refer to the instructions on the Quick Statements Tool page for information on how to add sources. If the source is the same for all the values of a given column of your spreadsheet, just add it to your Word template. If it is different for each value, you will need to add an additional column to your spreadsheet with the source information.
  • In our example case, we have added values only to items that already existed on Wikidata. You may want to add new items. This is also possible by following the same procedure. Refer to the instructions on the Quick Statements Tool page for information on how to add new items. Make sure that the items do not already exist on Wikidata; otherwise you will create duplicate items that need to be merged manually afterwards, which can be a lot of work if you add large quantities of data.
  • For many of these cases, you will have to write different SPARQL queries: Get acquainted with the SPARQL query language and/or refer to the help section of the Wikidata Query Service where you can consult many different example queries or ask for help with your queries.

You may also want to refer to the Case Report about ingesting data about Swiss heritage institutions for guidance with these more complex cases.