Wikidata:Property proposal/GB1900 ID

GB1900 ID

Originally proposed at Wikidata:Property proposal/Authority control

   Done: GB1900 ID (P9284) (Talk and documentation)
Description: Identifier for a place in the GB1900 project transcription of the Ordnance Survey six-inch-to-one-mile map, 1888-1913
Represents: GB1900 dataset (Q105554422)
Data type: External identifier
Domain: geographic location (Q2221906)
Allowed values: [0-9a-f]{24}
Example 1: Palace of Westminster (Q62408) → 57e3dd302c66dca408000096 (as "HOUSES OF PARLIAMENT")
Example 2: Chelsea (Q743535) → 57f17b952c66dca32201b66d
Example 3: Thorney House (Q26458246) → 58e7cb752c66dcf8fa0b4ada
Expected completeness: always incomplete (Q21873886)
Applicable "stated in"-value: GB1900 dataset (Q105554422)
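
As a minimal sketch of how the "Allowed values" constraint could be checked before adding a statement, the following Python applies the pattern to the three example identifiers. The function name is hypothetical; only the regular expression and the example IDs come from the proposal.

```python
import re

# Pattern copied from the "Allowed values" field of the proposal.
GB1900_ID_PATTERN = re.compile(r"[0-9a-f]{24}")


def is_valid_gb1900_id(value: str) -> bool:
    """Return True if the whole string is 24 lowercase hexadecimal characters."""
    return GB1900_ID_PATTERN.fullmatch(value) is not None


# The three example identifiers given in the proposal.
examples = [
    "57e3dd302c66dca408000096",
    "57f17b952c66dca32201b66d",
    "58e7cb752c66dcf8fa0b4ada",
]
for example in examples:
    print(example, is_valid_gb1900_id(example))
```

Using `fullmatch` rather than `search` means that longer strings which merely contain a valid ID are rejected.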

Source information

The GB1900 project (Q105554350) was a mass-participation community transcription project that, between September 2016 and January 2018, transcribed every text string that appears on the 6-inches-to-the-mile (i.e. 1:10,560) series of maps made by the Ordnance Survey (Q548721) between 1888 and 1913. Every string was attached to a geo-coordinate, corresponding to the bottom-left of the first letter of its first word. (See [1] for more information.)

The output of the project is available from https://www.visionofbritain.org.uk/data/#tabgb1900 in three forms:

  • Full final raw dump (CC0) -- 2666342 location rows, 8043679 transcription rows
  • Complete gazetteer (CC-BY-SA) -- 2552460 location + transcription rows
  • Abridged gazetteer (CC-BY-SA) -- 1174450 location + transcription rows

Note that the files are encoded in 16-bit Unicode (UTF-16), so to use 'grep' on, e.g., a standard Cygwin install, something like the following may be needed:

grep -Pa `echo -n "58e7cb752c66dcf8fa0b4ada" | iconv -f utf-8 -t utf-16le | hexdump -e '/1 "x%02x"' | sed 's/x/\\\\x/g'` gb1900_transcriptions.csv
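
As an alternative to the grep invocation above, the same search can be sketched in Python, which decodes the file instead of re-encoding the search string. This is only a sketch: the function name is hypothetical, and the default encoding of "utf-16" assumes the file starts with a byte-order mark; if it does not, "utf-16-le" (matching the iconv call above) may be needed instead.

```python
from typing import Iterator


def find_rows(path: str, needle: str, encoding: str = "utf-16") -> Iterator[str]:
    """Yield every line of a UTF-16 text file that contains `needle`.

    The default encoding is an assumption; pass "utf-16-le" if the file
    has no byte-order mark.
    """
    with open(path, encoding=encoding) as handle:
        for line in handle:
            if needle in line:
                yield line.rstrip("\n")


# Example call (commented out so the sketch can be loaded without the data file):
# for row in find_rows("gb1900_transcriptions.csv", "58e7cb752c66dcf8fa0b4ada"):
#     print(row)
```

Because the file is read line by line, the whole transcription dump never needs to be held in memory.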

Also be aware that the coordinates in the 'locations' file in the raw dump are stored as hexadecimal strings in Well Known Binary (WKB) format; see Simple Features (Q365034). (Thanks to Nikki on the wikimaps Telegram group for identifying this format for me, with Stefano not far behind.)
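
A hex-encoded WKB point can be decoded with the Python standard library alone. The sketch below makes several assumptions that are not confirmed by the data files: that each value is a single Point geometry, and that any SRID prefix follows the PostGIS "EWKB" extension. It returns the raw (x, y) pair without assuming which coordinate reference system the values are in. The function name and the sample values are illustrative only.

```python
import struct
from typing import Tuple

EWKB_SRID_FLAG = 0x20000000  # PostGIS extension flag; assumed, not confirmed for GB1900
WKB_POINT = 1                # geometry type code for Point in the WKB format


def decode_wkb_point(hex_wkb: str) -> Tuple[float, float]:
    """Decode a hex-encoded WKB (or EWKB-with-SRID) Point into an (x, y) pair."""
    data = bytes.fromhex(hex_wkb)
    # First byte is the byte-order marker: 1 = little-endian, 0 = big-endian.
    if data[0] not in (0, 1):
        raise ValueError("unknown WKB byte-order marker")
    order = "<" if data[0] == 1 else ">"
    (geom_type,) = struct.unpack_from(order + "I", data, 1)
    offset = 5
    if geom_type & EWKB_SRID_FLAG:
        offset += 4  # skip the 4-byte SRID that follows the type field
        geom_type &= ~EWKB_SRID_FLAG
    if geom_type != WKB_POINT:
        raise ValueError("not a 2D WKB Point")
    x, y = struct.unpack_from(order + "dd", data, offset)
    return x, y


# Round-trip a made-up point (illustrative values, not taken from the dataset).
sample = struct.pack("<BIdd", 1, WKB_POINT, -0.125, 51.5).hex()
print(decode_wkb_point(sample))
```

For anything beyond points, a dedicated geometry library would be a better choice than hand-rolled unpacking.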

The "complete gazetteer" includes reconciliation of about 1.5% of the original transcriptions, where these were in disagreement, and also adds (modern) parish and (modern) local authority fields. The "abridged gazetteer" excludes a large number of frequently repeating labels, such as "F.P." for footpath, but "still contains many transcriptions which are not necessarily place-names".

Motivation

The GB1900 ID would be a slightly unusual external ID for Wikidata, for several reasons:

  1. There is at the moment no site that accepts the IDs as part of a URL, so no formatter URL is possible. (In the above I have linked directly to the coordinates given in the data files.)
  2. The coordinates do not quite represent the coordinates of the actual object, but instead those of the start of its label.
  3. The files come with no information as to what sort of thing the coordinates are for.

(This last is slightly surprising because, given the location and the character string for each label, it might not be thought too difficult a machine-learning proposition to identify the font size and style the label was in, and at least extract that. But it seems this has not been done, or has not been possible. On the first point, paper [2] (2019) describes a site that will accept the identifiers, including a pre-beta screenshot (figure 7). According to the Vision of Britain team this is still intended; however, as of 2021 it has not yet been possible to put that part of the site into production, "but the tech people are making progress".)

Nevertheless, in my view this would be a useful dataset to be able to match into, for at least two reasons. First, it may have been used in various external geotagging projects, e.g. to provide ready coordinates for places ending in the word 'Castle' or some similar word. Second, such matching would provide useful sanity-checking for our own coordinates, even though it should be kept in mind that GB1900 coordinates do not exactly represent the location of the underlying place.

Proposed Use-style

I would propose to add the property, for items that can be matched into the dataset, with qualifiers named as (P1810) to give the GB1900 preferred name for the place, and coordinate location (P625) to give the GB1900 coordinates for the place, if the name and coordinates can be found in the CC0 "raw dump" files.

I would not propose that coordinates from GB1900 be used in a main coordinate location (P625) statement, as in general they will not (quite) match the location of the actual place. Therefore IMO it is better to give them as a P625 qualifier. This will also prove a convenient way to see the places on the underlying OS 1900 maps, which are available on the National Library of Scotland website, via GeoHack.
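
The proposed shape of a statement (main value plus named as (P1810) and coordinate location (P625) qualifiers) can be sketched as a plain Python dictionary. The dictionary layout and function name here are hypothetical and purely illustrative; this is not the Wikibase JSON format. The property IDs and the example identifier and name come from this proposal, while the coordinates are placeholders rather than values read from the dataset.

```python
from typing import Any, Dict


def build_gb1900_claim(gb1900_id: str, label: str,
                       latitude: float, longitude: float) -> Dict[str, Any]:
    """Assemble one GB1900 ID statement with its two proposed qualifiers."""
    return {
        "property": "P9284",   # GB1900 ID
        "value": gb1900_id,
        "qualifiers": {
            "P1810": label,    # named as: the GB1900 transcription of the label
            "P625": {          # coordinate location of the label, kept as a qualifier
                "latitude": latitude,
                "longitude": longitude,
            },
        },
    }


# Example 1 from the proposal; the coordinates are placeholder values only.
claim = build_gb1900_claim(
    "57e3dd302c66dca408000096", "HOUSES OF PARLIAMENT", 51.5, -0.125
)
print(claim)
```

Keeping the GB1900 coordinates inside the qualifier, rather than as a main P625 value, mirrors the reasoning above: they locate the label, not the place itself.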

Despite the complications, I do think this will be a useful property to have, and a useful dataset to be able to reference into. Jheald (talk) 19:31, 1 March 2021 (UTC)

Discussion

  • Proposed. Jheald (talk) 19:31, 1 March 2021 (UTC)
  •   Support Very interesting idea. PKM (talk) 20:07, 1 March 2021 (UTC)
  • Suggest contact National Library of Scotland to see if they would be interested in applying the machine learning approach mentioned by Jheald - ColinStuartGreenstreet (talk) 20:46, 1 March 2021 (UTC)
Good suggestion. But we can probably get quite a long way matching smaller-scale items to GB1900 hits -- eg listed buildings, scheduled monuments, heritage sites, etc -- even without any 'type' information, given the rough coordinates and that the names will be reasonably unique. It may become a bit trickier for items on the scale of villages and towns, to be sure that we're distinguishing which GB1900 label corresponds to the settlement, to the parish, to the constituency etc. But my feeling would be to start matching what we can, and then talk to the NLS, once we already have some matches to show for ourselves. Jheald (talk) 21:12, 1 March 2021 (UTC)
  •   Support The use of this dataset in other projects, e.g. Viae Regiae, is going to make matching it to Wikidata items increasingly useful. DrThneed (talk) 22:46, 1 March 2021 (UTC)
  • UPDATE. I'm currently doing some work on manors and manor-houses, relating to WD:WP EMEW/Manors. I am currently comparing items for 'manor houses' here with hits for the string 'manor' in GB1900. Where matches do exist, it would be nice to be able to start recording them here. That would make it easier to query for matches that don't currently exist. One additional use-note to the above: I would also be adding URL (P2699) qualifiers to link straight to the OS1900 map at the National Library of Scotland, as per the links in the examples at the top, to make it possible to see the string on the map in a single click. Thanks! Jheald (talk) 08:40, 10 March 2021 (UTC)

  •   Support Susanna Ånäs (Susannaanas) (talk) 08:42, 10 March 2021 (UTC)