Wikidata:WikiProject 20th Century Press Archives/Tools & tasks








Often, the titles of PM20 folders can be matched to Wikidata items only one by one, through manual lookup or browsing. Tools like Mix-n-match do not work well for certain types of folders. This is particularly true for the folders in the subject archives (Länder-/Sacharchiv, i.e. countries/subjects archive). The corresponding manual workflow is therefore briefly described here:

  1. Search or discover the folder via the web application or via lists of folders, e.g. from the country/topics archive (sortable and filterable folder list).
    We use the folder Deutschland (bis 1945) : Enteignung von Juden, Arisierung (1933-1945) (Expropriation of Jews in Germany 1933-1945, Aryanization) as an example here.
  2. Copy the persistent link of the folder, which underlies the icon "Mappen-Zitier-Link" (folder citation link). This is normally done by right-clicking and choosing "Copy link address" (or a similarly named function) in the browser.
  3. Search for the matching Wikidata item - e.g. Aryanization (Q664017).
  4. Go to the bottom of the item page and click "add statement".
  5. Start typing "pm20 folder" in the Property input box and select "PM 20 folder ID".
  6. Paste the persistent URL copied in step 2 and shorten it to the folder ID, i.e. the trailing part of the URL (e.g., "sh/126128,208307").
  7. Click "publish".
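
The shortening in step 6 can also be done mechanically. The following is a minimal sketch, not part of the project's tooling. It assumes that a folder ID consists of a two-letter collection code (pe, co, sh or wa, the codes mentioned on this page) followed by a slash and one or two comma-separated numbers at the end of the URL. The host name in the example is a placeholder, not the real persistent link base.

```python
import re

# Assumption: folder IDs look like "sh/126128,208307" or "pe/012345" and
# always form the tail of the persistent URL.
FOLDER_ID_PATTERN = re.compile(r"\b(?:pe|co|sh|wa)/\d+(?:,\d+)?$")


def folder_id_from_url(url: str) -> str:
    """Return the trailing PM20 folder ID of a persistent folder URL."""
    # Ignore surrounding whitespace and a trailing slash left over from copying.
    cleaned = url.strip().rstrip("/")
    match = FOLDER_ID_PATTERN.search(cleaned)
    if match is None:
        raise ValueError(f"no PM20 folder ID found in {url!r}")
    return match.group(0)


# Placeholder URL; only the tail matters for this sketch.
print(folder_id_from_url("https://example.org/folder/sh/126128,208307"))
# prints: sh/126128,208307
```

The function raises an error instead of guessing when the URL does not end in something that looks like a folder ID, so a mistyped or truncated link is not silently turned into a wrong statement value.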

Sometimes, PM20 folders may be a valuable external complement to Wikipedia articles. Links to Wikipedias in different languages are displayed at the bottom, or in the right-hand column, of the Wikidata item page. Each Wikipedia has its own rules on when and how to add external links - please check them carefully.

English Wikipedia

  • See the rules on w:Wikipedia:External links.
  • In order to be able to receive feedback on your edits, log into Wikipedia as a named user.
  • If the folder contents look like a valuable addition to the corresponding WP article, edit its "External links" section (or add == External links == at the bottom of the article, but above categories and the like) - see example.
  • Use the PM20 template with the folder ID described above, e.g.
   * {{PM20|FID=sh/126128,208307}}
to add a link. By default, the WP article name is inserted into the link. If this does not fit well, you can insert an additional |NAME=... inside the curly brackets with a better-fitting description of the folder content.
  • Adding a short description of your edit in the "Summary" field helps watchers of the article.

German Wikipedia

  • See the rules at de:Wikipedia:Weblinks.
  • In order to be able to receive feedback on your edits, log into Wikipedia as a named user.
  • If the folder contents look like a valuable addition to the corresponding WP article, edit its "Weblinks" section (or add == Weblinks == at the bottom of the article, but above the section "Einzelnachweise" (individual citations), categories and the like) - see example.
  • Use the Pressemappe template with the folder ID described above, e.g.
   * {{Pressemappe|FID=sh/126128,208307}}
to add a link. By default, the WP article name is inserted into the link. If this does not fit well, you can insert an additional |NAME=... inside the curly brackets with a better-fitting description of the folder content.
  • Adding a short description of your edit in the "Zusammenfassung und Quellen" field helps watchers of the article.
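
For editors who add many such links, the template call for either Wikipedia can be assembled from the folder ID. This is a small illustrative sketch. The template names PM20 and Pressemappe and the FID and NAME parameters are taken from the sections above, while the helper function and the sample NAME text are hypothetical.

```python
# Template names as used above on English and German Wikipedia.
TEMPLATES = {"en": "PM20", "de": "Pressemappe"}


def folder_template(lang, folder_id, name=None):
    """Build the wikitext list entry linking to a PM20 folder."""
    parts = [TEMPLATES[lang], f"FID={folder_id}"]
    if name:
        # Optional override of the default link text (the article name).
        parts.append(f"NAME={name}")
    return "* {{" + "|".join(parts) + "}}"


print(folder_template("en", "sh/126128,208307"))
# prints: * {{PM20|FID=sh/126128,208307}}
print(folder_template("de", "sh/126128,208307", name="Arisierung"))
# prints: * {{Pressemappe|FID=sh/126128,208307|NAME=Arisierung}}
```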

Regular maintenance tasks


Add PM20 ID via GND ID ("pm20 via gnd")


This was run initially for 1600+ IDs. Whenever a GND ID that is known for a not-yet-linked PM20 folder has been added to a Wikidata item, the PM20 ID can be added to that item automatically:

 cd /opt/sparql-queries/bin
 perl ../wikidata/missing_pm20_id_via_gnd.rq qsStatement

The query and the script are available on GitHub.
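
The underlying idea is a join on the GND ID. The sketch below only illustrates that logic in Python. The actual work is done by the SPARQL query and the Perl script, whose output format may differ. The sketch assumes that P4293 is the PM20 folder ID property (it is referred to by that number further down this page) and emits tab-separated QuickStatements V1 lines. All identifiers in the sample data are made up.

```python
def pm20_via_gnd(pm20_by_gnd, items_by_gnd, linked_folder_ids):
    """Return QuickStatements lines adding PM20 IDs to items found via GND.

    pm20_by_gnd:       GND ID -> PM20 folder ID (known on the PM20 side)
    items_by_gnd:      GND ID -> Wikidata QID (items carrying that GND ID)
    linked_folder_ids: folder IDs that are already linked from some item
    """
    lines = []
    for gnd_id, folder_id in sorted(pm20_by_gnd.items()):
        qid = items_by_gnd.get(gnd_id)
        # Skip folders without a matching item and folders already linked.
        if qid is None or folder_id in linked_folder_ids:
            continue
        lines.append(f'{qid}\tP4293\t"{folder_id}"')
    return lines


# Hypothetical sample data, for illustration only.
pm20 = {"100000001": "pe/000001", "100000002": "pe/000002", "100000003": "pe/000003"}
wikidata = {"100000001": "Q1001", "100000002": "Q1002"}
already_linked = {"pe/000002"}

for line in pm20_via_gnd(pm20, wikidata, already_linked):
    print(line)
```

In this sample only the first folder yields a statement: the second is already linked, and the third has no item with a matching GND ID.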

Set qualifiers ("pm20 folder name" / "pm20 doc count")


QuickStatements input files for subject named as (P1810), number of works (P3740) and number of works accessible online (P5592) are generated via

 cd /opt/sparql-queries/bin
 perl ../pm20/folder_names_qs.rq qsStatement
 perl ../pm20/folder_doc_total_count.rq qsStatement
 perl ../pm20/folder_doc_online_count.rq qsStatement

Because company names are currently being cleaned up, creation of "named as" qualifiers is restricted to the sh, wa and pe folders for now.

The folder names / doc counts queries and the script are available on GitHub.
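
To show what such an input line can look like, here is a minimal sketch that attaches the three qualifiers to a PM20 folder ID statement in QuickStatements V1 syntax (tab-separated, strings in double quotes, qualifiers appended as property/value pairs). It is an illustration under assumptions, not the output of the scripts above: the helper is hypothetical, P4293 is assumed to be the PM20 folder ID property, and the counts in the example are invented.

```python
def folder_statement(qid, folder_id, name=None, total=None, online=None):
    """Build one QuickStatements V1 line with optional qualifiers."""
    fields = [qid, "P4293", f'"{folder_id}"']
    if name is not None:
        # subject named as (P1810); names containing double quotes are
        # not handled in this sketch.
        fields += ["P1810", f'"{name}"']
    if total is not None:
        # number of works (P3740)
        fields += ["P3740", str(total)]
    if online is not None:
        # number of works accessible online (P5592)
        fields += ["P5592", str(online)]
    return "\t".join(fields)


# Folder name from the example above; the counts are made up.
print(folder_statement(
    "Q664017",
    "sh/126128,208307",
    name="Deutschland (bis 1945) : Enteignung von Juden, Arisierung (1933-1945)",
    total=120,
    online=80,
))
```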

Consistency checks for the PM20 Subject Categories system


The PM20 Subject Categories system is kept as a set of interlinked items in Wikidata, insofar as the categories are linked to PM20 folder items.

Various queries for checking

Fix folder "main subject" statements (if checks reveal errors)


Remove main subject (P921) statements linking to non-PM20 Subject Category items:

 cd /opt/sparql-queries/bin
 perl ../pm20/folder_subject_remove_qs.rq qs

Add correct statements:

 perl ../pm20/folder_subject_add_qs.rq qs
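
The two steps amount to a set comparison per folder item. The following sketch illustrates this under assumptions: the function and its inputs are hypothetical, and the real selection is done by the SPARQL queries. It uses the QuickStatements V1 convention that a leading "-" removes a statement.

```python
def main_subject_fixes(folder_qid, current, category_items, expected):
    """Return QuickStatements lines that repair P921 on one folder item.

    current:        QIDs currently used as main subject (P921)
    category_items: QIDs of all PM20 Subject Category items
    expected:       QIDs the folder should have as main subject
    """
    # Remove targets that are not PM20 Subject Category items.
    removals = [
        f"-{folder_qid}\tP921\t{target}"
        for target in sorted(current - category_items)
    ]
    # Add correct targets that are not present yet.
    additions = [
        f"{folder_qid}\tP921\t{target}"
        for target in sorted(expected - current)
    ]
    return removals + additions


# Hypothetical QIDs, for illustration only.
for line in main_subject_fixes(
    "Q9000",
    current={"Q10", "Q20"},
    category_items={"Q20", "Q30"},
    expected={"Q20", "Q30"},
):
    print(line)
```

Here Q10 is removed because it is not a category item, Q20 is left untouched, and Q30 is added.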

One-time tasks


Add items for all un-linked person folders


After extended Mix-n-match work, manual lookup of heads of state and of folders with multiple documents, and some testing, items for all 346 remaining person folders were created automatically. As discussed on the talk page,

 perl pm20_pe create

(script, query) was executed and the output pasted into QuickStatements. Jneubert (talk) 15:08, 13 June 2019 (UTC)

Rather minimal example item: Albert Hopff (Q64589732)

Add person information from PM20 to WD

 perl pm20_pe enhance P106

Create Mix-n-match catalog for newspapers


DONE A Mix-n-match catalog for newspapers and journals from PM20 was created, comprising 1359 entries from the internal "publikation" database table, with the ZDB ID as key. Records without a ZDB ID were omitted, and some duplicates (e.g. the same ZDB ID for a paper and its supplement) were skipped. (input file) --Jneubert (talk) 06:52, 8 September 2019 (UTC)
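
The filtering described here can be sketched as follows. This is an illustration, not the code that produced the input file: the record field names are hypothetical, and the three-column tab-separated layout (entry ID, name, description) is an assumption about the catalog import format.

```python
def mnm_catalog_rows(records):
    """Turn publication records into tab-separated catalog rows.

    Records without a ZDB ID are omitted; for duplicate ZDB IDs only the
    first record is kept.
    """
    seen = set()
    rows = []
    for record in records:
        zdb_id = record.get("zdb_id")
        if not zdb_id or zdb_id in seen:
            continue
        seen.add(zdb_id)
        rows.append("\t".join([zdb_id, record["name"], record.get("description", "")]))
    return rows


# Hypothetical sample records, for illustration only.
sample = [
    {"zdb_id": "1-1", "name": "Example Daily", "description": "newspaper"},
    {"zdb_id": None, "name": "Paper without ZDB ID"},
    {"zdb_id": "1-1", "name": "Example Daily, supplement"},
    {"zdb_id": "2-2", "name": "Example Journal"},
]
for row in mnm_catalog_rows(sample):
    print(row)
```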


Links to (webopac|webopac0) and will become obsolete, probably by end of 2020. Therefore, all references to such links have to be replaced.


Direct links to documents or pages have to be replaced, too. Depends on the introduction of persistent addresses for documents. DONE

Add PM20 geo/subject folders

  • Add PM20 geo codes to linked items according to existing mapping
  • Upper level categories (first and second level)
    • DONE Translate subject category labels to (British) English
    • DONE Create items for PM20 subject categories (160 in total)
      perl pm20_subject_category
      perl pm20_subject_category enhance P361 (partOf hierarchy)
      Two dozen items that link to special intermediate levels not transferred to Wikidata got no partOf link and need to be fixed
    • DONE Create items for folders (3776 in total)
      perl pm20_subject_folder - temporarily interrupted because QuickStatements was creating duplicates
  • All remaining categories
    • DONE Translate subject category labels to English
    • DONE Fix hierarchy
    • DONE Create items for PM20 subject categories (exactly the 1452 categories from "klassifikator WHERE klass_code='JE' and mappen_anzahl is not null")
      perl pm20_subject_category
    • DONE Create category hierarchy
      perl pm20_subject_category enhance P361 (partOf hierarchy)
    • DONE Create category sort label
      perl pm20_subject_category enhance P8484
      perl pm20_geo_category enhance P8483
    • DONE Create items for folders
      perl pm20_subject_folder
    • DONE Set document counts
      perl ../pm20/folder_doc_total_count.rq qsStatement
      perl ../pm20/folder_doc_online_count.rq qsStatement
  • OPTIONAL (later)
    • Map subject categories to WD items (via main subject (P921))
    • Create all known geo and subject categories, even when for now without folders (for later use in film sections)
    • Create reverse has part statements (issues: meaningful order, completeness)
    • Create film sections for countries not or incompletely represented as folders, create pages and add the corresponding geo codes

Add company/institution folders

  • Retrieving and using direct links
    • DONE via GND
    • DONE via linked Wikipedia page in PM20
  • DONE for each segment
    DONE for Dutch, for English, for German, for French, for div (Mnm, search, QS, errors), ...
    • Mapping
      • Rules for inexact matches, expressed via mapping relation type (P4390):
      • Create a Mix-n-match catalog for company folders with documents, ordered by document count, matching against organization and Wikipedia for the corresponding language
        • for all entries, including already mapped
        • with synonyms (altLabel), names with GND excluded
          ./ pm20 nl
      • Map from top
        • Create openrefine in same order (matching English labels) (???)
        • only for unmapped entries (after mnm automatch)
      • Create list of QS insert statements and use in parallel for creating missing items
    • Create QS inserts for all unmapped entries (using the country code lists above)
      TODO exclude only exactly or unqualified mapped items
      ./ pm20_co
    • Update Mix-n-match (Action -> Katalog manuell synchronisieren -> Mix-n-match aktualisieren, i.e. synchronize catalog manually -> update Mix-n-match)
  • DONE Cleanup large intersected companies (e.g., Deutsche Reichsbahn)
  • DONE Add standard qualifiers for P4293 (name, counts)
  • Cleanup / extension re. inexact mappings
    • DONE set "related match" mapping relation qualifier for all co/person, co/building etc. mappings
    • DONE cleanup duplicate exact/unqualified mappings
    • DONE fix missing folders (~50, lost with change from to
    • DONE match items for missing folders
    • DONE repeat creation of QS inserts
  • DONE Fix missing French labels
  • DONE Adding inception/demolition date
    • DONE check inception before demolition date
  • DONE Adding GND
  • DONE Adding instance-of statements (if not existent)
  • DONE Interlinking with persons
    • founder: perl pm20_co enhance P112
    • board: perl pm20_co enhance P3320
    • advisory board: perl pm20_co enhance P5052
  • DONE Interlinking with companies
    • successor/predecessor
    • subsidiary/mother
  • DONE Mapping (via Geonames ID) and import of headquarter location
    • POSTPONED Add items derived from missing Geonames IDs
    • DONE Add derived country
  • DONE Mapping and import of industry sector
    • DONE Assign industry according to PM20 NACE code (add all assignments)
    • DONE Map SK values to WD industries, based on German label (sometimes very coarse-grained)
      • DONE Map to targets with NACE code if some confidence in the mapping to NACE is given
      • POSTPONED Derive a partial SK-NACE mapping at the end (could this be extended with NT relations to include inexact mappings for e.g. Metallindustrie?)
    • DONE Assign more industries derived from SK mapping
      • DONE For NACE-equivalent, do not add if the same assignment with PM20 as source already exists
      • For very broad targets, do not add if any assignment exists
    • DONE Create systematic display of industries used in SK mapping
  • POSTPONED Fix missing German and English labels (use existing or PM20 label for pre-existing entity?)
  • POSTPONED Which synonyms can be added safely?
  • POSTPONED Perhaps link or create separate items for companies identified by GND (zbwext:includesInstitutionNamed)
    • Note as part of the description
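
The "check inception before demolition date" step in the list above is a simple consistency check. A minimal sketch of it, assuming the two dates have already been extracted per item (the function and the sample data are hypothetical):

```python
from datetime import date


def inverted_dates(companies):
    """Return QIDs whose inception date lies after their end date.

    companies: QID -> (inception, end), each a datetime.date or None
    """
    return [
        qid
        for qid, (inception, end) in sorted(companies.items())
        # Items with a missing date cannot be checked and are skipped.
        if inception is not None and end is not None and inception > end
    ]


# Hypothetical sample data, for illustration only.
sample = {
    "Q1": (date(1900, 1, 1), date(1930, 1, 1)),
    "Q2": (date(1950, 1, 1), date(1920, 1, 1)),
    "Q3": (date(1910, 1, 1), None),
}
print(inverted_dates(sample))
# prints: ['Q2']
```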

Add wares folders

  • DONE Add property PM20 ware ID (P10890)
  • DONE Create OpenRefine mapping of ware names
  • DONE Add PM20 ware ID (P10890) links to ware items
  • DONE Create special categories items (w/o existing ware items)
    perl pm20_ware_category
  • DONE Create folder items
    perl pm20_ware_folder
  • DONE Add counts
    perl ../pm20/folder_doc_total_count.rq qsStatement
    perl ../pm20/folder_doc_online_count.rq qsStatement
  • DONE Add reverse WD links in PM20
  • DONE Add mapping for missing country names (countries not required for subjects)
  • DONE Add PM20 geo code (P8483) links to geo items
  • DONE Create folder items
    • DONE Remove duplicates created by QS
  • DONE Add names and counts
    perl ../pm20/folder_doc_total_count.rq qsStatement
    perl ../pm20/folder_doc_online_count.rq qsStatement
  • DONE Verify completeness of mapping
  • DONE Check completeness of wikidata extract in PM20 endpoint
  • Fix missing hierarchy levels in PM20 dataset (
  • Recreate category pages with reverse links to WD

Activity log


Rough log of PM20-related activities