Wikidata:WikiProject Finland/Saami place names to Wikidata as lexicographical data

Project


This is the documentation page for a project by AvoinGLAM at WMFI in collaboration with WMNO. See the project page at WikiProject AvoinGLAM/Saami place names to Wikidata as lexicographical data.

Scope of the project


Import the Saami place names in the National Land Survey Place Names Register to Wikidata, and produce inflected forms for them as lexicographical data.

About places


How have the National Land Survey place names been collected and verified? What other resources still exist that are not part of the register?

Sacred places and how they should be treated

Reading the data from the National Land Survey


Geographic features and place names


Choice of data

  • All places in Lapland in the NLS database
  • 74 428 places
  • Place names in one or more Saami languages
    • 700 names in Skolt Saami
    • 4 845 names in Inari Saami
    • 6 328 names in Northern Saami

Place type ontology


The National Land Survey has produced a place type ontology as part of the Place Names Register, which is widely adopted in the Finnish GLAM sector. The ontology has been mapped by the Semantic Computing Research Group to the Kotus Names Archive place type corpus, and by the Finnish National Library to the YSO places ontology.

The Place Names Register place types are difficult to map to Wikidata items, as they are very often compounds, such as: village or neighbourhood / island or islet. Mapping them directly would require creating new artificial place types and attaching them to the Wikidata "ontology". A more useful approach would be to use more precise terminology or broader terms instead.

We will coordinate mapping the concepts with Finto / YSO, and with SeCo.

Wikidata data model and properties used


Labels


Add labels in as many languages as possible. The dataset has labels in Finnish, Swedish, Northern Saami, Inari Saami, and Skolt Saami.

Based on the data about language dominance and official language status, the labels can be added in any Latin-script language. The primary international name for a place (unless there is a native/localized name) is the name in the dominant language of the municipality. This data can be retrieved from Tilastokeskus, and it is also present in the dataset.

Russian and Norwegian names, and possibly also Kven names, would be useful in the area, and they may have their own variants. Kven is often equivalent to Finnish on the Norwegian side of the border, but Kven names are not used in Finland.

Aliases


Aliases will be added in the same language as the corresponding label; there is no need to additionally add them in every language. If the existing labels differ from the new authority data, the existing labels will be added as aliases instead.

Descriptions


Descriptions for new items are constructed from the original title of the place type property and the name of the municipality, which in Finnish needs to be in the locative case. The locative forms have been constructed using a dataset from Kotus. Often these descriptions are not distinctive enough, and more detail is needed. For example, a bay and a strait are both of type "part of lake", so all bays and straits within the same municipality would be described as "part of lake in Municipality X". This produces many namesakes, including identical label–description combinations, which Wikidata does not allow. It is possible to use GIS to identify the lake which the item is part of, producing "part of lake X in Municipality Y". To identify the more precise type, names can be analyzed for best guesses: "Koivulahti" could be interpreted as "bay, part of lake X in Rovaniemi". There are cases where lake parts are named as, for example, ponds, but these can be handled manually. The descriptions can be added in Finnish, Swedish, and English. We will extend this to the Saami languages either in the first import of the places or in the second import of the place names. The affected place types include:

  • Parts of sea
  • Parts of lakes, ponds, and artificial lakes
  • Parts of watercourses
  • Islands and islets
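The description logic above can be sketched in a few lines. The locative lookup table and names below are illustrative stand-ins for the Kotus dataset, not part of the project code:

```python
# Hypothetical sample of the Kotus-derived locative lookup table.
LOCATIVE_FI = {
    "Rovaniemi": "Rovaniemellä",
    "Inari": "Inarissa",
}

def description_fi(place_type, municipality, container=None):
    """Build a Finnish description from the place type label and the
    municipality name in the locative case. If GIS has resolved the
    containing feature (e.g. the lake), include it to keep the
    label-description pair distinctive."""
    loc = LOCATIVE_FI[municipality]
    if container:
        return f"{place_type} ({container}) {loc}"
    return f"{place_type} {loc}"

description_fi("järven osa", "Rovaniemi")                # 'järven osa Rovaniemellä'
description_fi("järven osa", "Rovaniemi", "Koivujärvi")  # with the containing lake
```

The optional `container` argument reflects the GIS step described above: only when a containing lake (or sea area, watercourse, etc.) has been resolved can the more distinctive form be used.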

Place types


How should NLS place type ID (P9230) be used? Preferably as a source statement on the place type, such as village or neighbourhood (Q103910453). The instance of (P31) value can be replaced with a more granular one, while making sure that the new place type is added as a subclass of the NLS place type.

instance of
  lake or pond

Use the National Land Survey place type ontology for all the places; the place types need to be imported to Wikidata first. The granularity of the place types is rather broad, so replace them with, or add, more precise types later. TBD.

Native labels


Labels exist optionally in Finnish, Swedish, Northern Saami, Inari Saami, and Skolt Saami. The native-label import is done separately from creating the items, to allow straightforward use of the source data table.

native label
  Mutusjärvi (Finnish), 1 reference
  Mudusjävri (Inari Saami), 1 reference
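One way to run the separate native-label import is to generate QuickStatements rows for native label (P1705), which takes monolingual-text values (written in QuickStatements as lang:"text"). The QID below is a hypothetical example:

```python
def native_label_rows(qid, names):
    """Emit one tab-separated QuickStatements row per available language.
    names: mapping of Wikimedia language code -> place name (None = missing)."""
    return [f'{qid}\tP1705\t{lang}:"{name}"'
            for lang, name in names.items() if name]

rows = native_label_rows(
    "Q12345",  # hypothetical item id
    {"fi": "Mutusjärvi", "smn": "Mudusjävri", "sms": None},
)
```

The language codes fi, se, smn, and sms stand for Finnish, Northern Saami, Inari Saami, and Skolt Saami respectively; missing names are simply skipped.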

Location data

  • coordinate location (P625) Many of the existing coordinates for hills and lakes are of extremely poor quality. We must find a way to rank the imported National Land Survey coordinates as preferred and deprecate or otherwise mark the poor-quality ones. See Quality control.
country
  Finland
coordinate location
  65.0124, 27.5529

Groups


Create part–whole relations for groups of terrain features. Unfortunately, the dataset does not contain information about these relations, but GIS could be used to make best guesses.

Authority ids


It is not strictly necessary to add source statements to the ids, as the id itself already identifies the data source. However, adding the source statement adds clarity, especially if the data in the source database changes later. These source statements can be used as models for sourcing other statements. YSO ID and Finnish Lake ID are shown here for convenience; they are not part of the import.

YSO ID
  106976 (1 reference)
Finnish Lake ID
  71.241.1.001
  1 reference:
    stated in: Järviwiki
    Finnish Lake ID: 71.241.1.001
    retrieved: 2021-04-18

Reconciliation data


The data already in Wikidata for these items is highly heterogeneous. Reconciling against it requires a combination of several strategies.

Reconciliation workflow


1. Reconciled Finnish labels to find places that do not match

  • Did not auto-match, as auto-matching produces a lot of false matches. This made it possible to identify the items that have no match at all. Names may be present as labels in different languages, as property values, or as aliases; to make the reconciliation search take all of these into account, several reconciliation rounds in different languages seem to be the answer.
  • Did not constrain by type in the first reconciliation round, because many items originating from Wikipedia may lack type.
  • Created new items for places without any matches (OpenRefine).
  • Created new items for places that had only one candidate, and the type was clearly different (OpenRefine).

2. Filtered out previously reconciled items in OpenRefine

  • Previously added NLS place ids (existing1)
  • Lookup from an existing project to identify previously imported lakes (existing2).

3. Geospatial matches in QGIS

  • For lakes, identified existing Wikidata items whose coordinates fall within a lake geometry. The geometry used was the NLS Topographic Database, which first needed to be joined with the Place Names Register. This produced 240 new matches. See the description below.
  • Checked the remaining 3 000 lake items manually for existing Wikidata items whose coordinates fall outside the lake boundaries. These coordinates should be marked as incorrect, if the import tools allowed extending the statements. This produced 263 new matches and left 2 737 lakes to be created.
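The spatial join done in QGIS boils down to a point-in-polygon test for each item coordinate against each lake geometry. A minimal pure-Python sketch of that check, with a made-up rectangular "lake" and coordinates:

```python
def point_in_polygon(x, y, ring):
    """Ray-casting test: is the point (x, y) inside the polygon whose
    outer ring is the list of (x, y) vertices? Good enough for flagging
    which Wikidata points fall inside a lake geometry."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # the edge crosses the point's latitude
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

# Hypothetical lake outline (lon, lat) and two candidate item coordinates:
lake = [(27.50, 65.00), (27.60, 65.00), (27.60, 65.05), (27.50, 65.05)]
point_in_polygon(27.55, 65.02, lake)  # inside  -> match candidate
point_in_polygon(27.70, 65.02, lake)  # outside -> coordinates suspect
```

At project scale this test is run as a spatial join in QGIS (or with a GIS library) rather than by hand, but the logic is the same.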

Different possible reconciliation methods


Computational matching methods

  • Categorize remaining items based on missing properties
    • P31 (instance of, type). Properties or descriptions may not have been added to Wikidata from a Wikipedia article. The first sentence of the Wikipedia article can be auto-extracted to help add the type manually. A mass import from GeoNames has produced a lot of low-quality data, which causes a lot of reconciliation labour.
    • P131 (located in the administrative territorial entity, municipality). If coordinates exist, GIS and an external data source such as OSM or the Topographic Database from the National Land Survey can be used to find the containing municipality or area, such as a lake or an island.
    • P625 (coordinates). Without coordinates, items are not shown on the map and need to be investigated manually. Items of this type usually have Wikipedia articles.
  • Run focused rounds of reconciliation with constraints for items that had several candidates. Use municipality and type.
  • Find a way to measure distance to reconciliation candidates with coordinates. Auto-match nearby items with the same name and type.
    • There is a feature proposal to add distance as a reconciliation consideration in OpenRefine: https://github.com/wetneb/openrefine-wikibase/issues/101
    • The same can be achieved by producing a dataset with coordinates for each item and all the reconciliation candidates and measuring their distance in QGIS. As extending data is extremely slow, a simple script would work better here.
    • Looking into making this a Jupyter notebook in collaboration with the project partners.
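The distance measurement itself is simple; a sketch of the kind of script meant above, with hypothetical candidate data (the 2 km cutoff is an assumption, not project policy):

```python
from math import asin, cos, radians, sin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two WGS84 points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * asin(sqrt(a))

def nearest_same_name(place, candidates, max_km=2.0):
    """Auto-match helper: nearest candidate with the same name within
    max_km, or None if there is no such candidate."""
    near = [(haversine_km(place["lat"], place["lon"], c["lat"], c["lon"]), c)
            for c in candidates if c["label"] == place["label"]]
    near = [pair for pair in near if pair[0] <= max_km]
    return min(near, key=lambda pair: pair[0])[1] if near else None

place = {"label": "Mutusjärvi", "lat": 68.98, "lon": 26.80}
candidates = [
    {"qid": "Q1", "label": "Mutusjärvi", "lat": 68.99, "lon": 26.81},
    {"qid": "Q2", "label": "Mutusjärvi", "lat": 66.50, "lon": 25.70},
]
match = nearest_same_name(place, candidates)  # Q1, roughly a kilometre away
```

The same function applied over the full candidate table would replace the slow data-extension step: only name, type, and the two coordinate pairs are needed per row.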

Manual methods


In OpenRefine

  • Create a Wikishootme link based on the coordinates to locate the item on a map within existing Wikidata items. Example.
  • Create a Wikipedia search link based on the name to spot articles without sufficient properties in Wikidata. Example.
  • Create a Wikidata Nearby link to reveal overlapping items that cannot be clicked in Wikishootme. Example.
  • Create a Geohack link to see any map service with the coordinate. Example.
  • Create a direct link to the NLS map via a script in wmflabs. Example.
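The link bullets above are generated with simple URL templates; in OpenRefine they would be GREL expressions over the name and coordinate columns. A Python sketch, where the URL patterns are assumptions to be checked against each tool's current documentation:

```python
from urllib.parse import quote

def check_links(lat, lon, name):
    """Build helper links for manually checking a reconciliation match.
    The URL patterns here are illustrative; verify them before use."""
    return {
        "wikishootme": f"https://wikishootme.toolforge.org/#lat={lat}&lng={lon}&zoom=14",
        "geohack": f"https://geohack.toolforge.org/geohack.php?params={lat}_N_{lon}_E",
        "wikipedia_search": "https://fi.wikipedia.org/w/index.php?search=" + quote(name),
    }

links = check_links(65.0124, 27.5529, "Mutusjärvi")
```

Each generated link opens the same place in a different context (existing Wikidata items on a map, Wikipedia full-text search, a plain map view), which is what makes the manual check fast.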

Mix'n'Match

  • Create a special import that adds all the required information when an item is created or matched
  • Advantages: Many people can participate. Can be linked to other tools, like Wikishootme. Also searches Wikipedia for matches, which makes it very useful for places whose Wikidata items are poorly developed.
  • Downsides: Not all information that is provided is currently added to Wikidata (bug). The amount of data is too big for manual work only.

QGIS

  • Use lake shapes from the NLS Topographic Database to find point items already in Wikidata that fall within them. The existing POIs from GeoNames are of poor quality and often do not fall within the water body.
  • QGIS on a laptop is slow with this amount of data. Using the data via the API is slow and unstable; a downloaded local database works better. Data can also be read dynamically and saved locally for further processing.
 
This map shows an attempt to match unreconciled names from the Place Names Register to existing Wikidata items for lakes.

The pale blue lakes are shapes from the NLS Topographic Database that matched the lake items already in Wikidata. Wikidata items that did not align with any lake geometry (low-quality GeoNames import) are shown in yellow. The NLS lake geometry is split for large lakes, therefore only sections of some lakes are matched. Uniting the parts before matching with points would require another process, but that has its own challenges.

The bright red dots are place names from the NLS Place Names Register that could not be matched to lake shapes with a Wikidata correspondence.

Purple dots represent matches between place names from the NLS Place Names Register and lake shapes with Wikidata correspondence.

This process produced 240 new matches and an additional 38 false ones. The main reason for false matches was multipart lakes: since the lakes are only described with coordinate points, it is difficult to tell whether a point defines a part or the whole, especially as the coordinates already in Wikidata are strikingly off. After this process, 3 000 lakes were still left unmatched; 1 098 lakes had been matched with the help of GIS previously, out of 10 112 lakes to be matched altogether.

Producing and importing lexemes


Import log

  • 2020-04-18 Imported places of type lake or pond in Lapland that did not yet exist in Wikidata from the NLS Place Names Register: 8 510 items, edit group.
    • There was a problem with the Saami labels in the source data, and they will need to be added again.
    • 1 835 items were not added because they have identical label–description combinations with existing items. First, the descriptions need to be extended in some way; after that, the skipped items need to be added; and finally, Saami labels will be added to all.

Further work with the place names


Quality control

  • Check spelling, especially in Skolt Saami
  • Mark low-quality coordinates. The low-quality coordinates from GeoNames could be imported into QGIS along with the NLS coordinates, and the distance between them calculated. This distance could be added as a qualifier to the low-quality coordinate statements when deprecating them.
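A sketch of that distance check, using an equirectangular approximation that is adequate at flagging accuracy; the 500 m threshold is an assumption, not project policy:

```python
from math import cos, radians, sqrt

def approx_distance_m(lat1, lon1, lat2, lon2):
    """Approximate distance in metres between two WGS84 points
    (equirectangular projection; fine for flagging purposes)."""
    dlat = (lat2 - lat1) * 111_320
    dlon = (lon2 - lon1) * 111_320 * cos(radians((lat1 + lat2) / 2))
    return sqrt(dlat * dlat + dlon * dlon)

def coordinate_verdict(old, new, threshold_m=500):
    """Compare an existing (e.g. GeoNames) coordinate with the NLS one.
    Returns (distance in metres, should_deprecate); the distance could
    be recorded as a qualifier on the deprecated statement."""
    d = approx_distance_m(old[0], old[1], new[0], new[1])
    return round(d), d > threshold_m
```

Run over the joined QGIS table, this yields both the deprecation decision and the distance value to attach as a qualifier.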
Linking to OSM

Linking to OSM is especially useful for places that are areas by nature. This way they can be connected to their proper geographic representation and, in many cases, displayed as shapes. The OSM ↔ Wikidata matcher is well suited for this. The geometries in OSM may need maintenance before sensible matches can be made.

Using native labels in OSM

  • Jon Harald Søby has done this for some names. The items could first be chosen based on the CLDR collection, and then expanded to other place names.

Proposals for tools, features or add-ons for existing tools


Existing tools & technologies


Reconciliation features

  • Reconcile based on distance first.
  • A map view of the locations around the place to be matched, displaying nearby items and name matches (in OpenRefine, standalone, or Mix'n'Match); much like Wikishootme.
  • Names may be present in labels in different languages, as values in properties, or as aliases: Make sure that the reconciliation search takes all of these into account.
  • Limit reconciling to subclasses instead of instances when reconciling place types (classes).
  • Allow choosing any property-value pair for a reconciliation key. Example: Reconcile by external id.
  • Allow manually adding (auxiliary) reconciliation types/properties for the whole dataset that are not presented in the dataset. Example: country.
  • Read additional properties for the reconciliation candidates at the same time, as opposed to a separate data extension process.
  • Return Wikipedia blurbs.
  • Include Wikipedia article name matches in reconciliation.
  • Reconcile explicitly with authority ids.
  • Make ranking data transparent and structured.
  • Allow ranking locally based on configurable criteria.

OpenRefine

  • Information about sitelinks and Wikipedia blurbs would help to match items that originate from Wikipedia but are poorly described in Wikidata.
  • If properties of the reconciliation candidates were available, allow faceting and sorting based on them.
  • A configurable popup in OpenRefine to check and verify matches, for example a coordinate proximity view or a display of name variants and their languages. Configure the display of chosen properties and show missing ones (such as missing language labels, descriptions, type, article, coordinates, etc.).
  • Make single match the default: accidentally reconciling all similar names with the same value produces lots of mistakes.
  • Allow adding preferred rank to an imported statement.
  • If the tool worked on the web, it would be possible to link from external tools to the correct interfaces for actions. For example, when a match is found in QGIS, the data could include a link to an interface where the match could be verified.