Wikidata:Property proposal/Open Civic Data Division Identifiers

Open Civic Data Division ID (alias OCD-ID)

Originally proposed at Wikidata:Property proposal/Authority control

Description: An identifier scheme for assigning globally unique identifiers to divisions used in civic datasets and tools.
Data type: External identifier
Domain: human-geographic territorial entity (Q15642541)
Example 1: Madison (Q43788) → ocd-division/country:us/state:wi/place:madison
Example 2: United States of America (Q30) → ocd-division/country:us
Example 3: Fundy Royal (Q267073) → ocd-division/country:ca/ed:13004
Example 4: Wisconsin's 2nd congressional district (Q8027089) → ocd-division/country:us/state:wi/cd:2
Source: https://github.com/opencivicdata/ocd-division-ids
Number of IDs in source: ca. 797,509 (based on all files in identifiers/country-*/*.csv)
Expected completeness: eventually complete (Q21873974)
See also: FIPS 55-3 (locations in the US) (P774)

Motivation

Open Civic Data Division Identifiers (OCD-IDs for short) are commonly used in the open data and civic technology space to catalog and identify "divisions" - states, cities, city council districts, judicial districts, etc. For example, the Google Civic Info API, which is the API behind many "who are my representatives" websites, returns information about the current elected officials of a district identified by an OCD-ID - see https://developers.google.com/civic-information/docs/v2/divisions#resource. The OpenElections Project, which is creating a central repository of all certified election results in the United States, uses OCD-IDs to identify which districts an election covered. Many other civic and open data tools also use OCD-IDs, especially tools that scrape pending-legislation sites to catalog which bills are before a state or local government.

OCD-IDs are global in scope, but the United States has the best coverage thus far.

The format is described in Open Civic Data Enhancement Proposal 2, and the governance process for edits to the set of OCD-IDs is documented in Open Civic Data Enhancement Proposal 8. The actual set of OCD-IDs is managed in a GitHub repo.

The OCD-ID standard describes an identifier scheme for assigning globally unique identifiers to divisions. It explicitly does not intend to describe any scheme for boundaries and includes minimal other information about a division.
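To illustrate the shape of these identifiers, here is a minimal Python sketch of a parser. It assumes only what the four examples in the proposal show: the literal prefix `ocd-division` followed by `/`-separated `type:value` segments. It is not a validator for the full OCDEP 2 grammar, and the function name `parse_ocd_id` is made up for this illustration.

```python
from __future__ import annotations


def parse_ocd_id(ocd_id: str) -> list[tuple[str, str]]:
    """Split an OCD division ID into (type, value) pairs.

    Assumption: the ID looks like the examples in this proposal, i.e.
    'ocd-division' followed by '/'-separated 'type:value' segments.
    This is a sketch, not a full OCDEP 2 validator.
    """
    prefix, *segments = ocd_id.split("/")
    if prefix != "ocd-division" or not segments:
        raise ValueError(f"not an OCD division ID: {ocd_id!r}")

    pairs = []
    for segment in segments:
        # partition() splits on the first ':' only.
        type_, sep, value = segment.partition(":")
        if not sep or not type_ or not value:
            raise ValueError(f"malformed segment: {segment!r}")
        pairs.append((type_, value))
    return pairs


if __name__ == "__main__":
    print(parse_ocd_id("ocd-division/country:us/state:wi/place:madison"))
```

Each segment narrows the one before it, so the Madison example reads as country `us`, then state `wi`, then place `madison`.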

The Wikidata Project includes many of the divisions named in the OCD-ID dataset, often describing them in great detail. Tools that can fetch data about an OCD-ID could use a common identifier to look into Wikidata for additional facts about that division. Other datasets are indexed by OCD-ID, so it could be possible to build a tool that uses the OCD-ID of a division to query the Google Civic Info API and automatically add or update data about the current elected officials of that division.

There are other identifiers in the OCD specs that could be of interest to Wikidata, such as people, elections, and jurisdictions (for example, OCD would also differentiate between United States of America (Q30) and Federal Government of the United States (Q48525) - the former is an OCD Division, the latter a jurisdiction), but OCD Division Identifiers are the most widely used type of identifier and would be most immediately valuable to Wikidata.

I'm not sure if this will ever be complete in Wikidata - OCD-IDs go down to divisions that aren't tracked in Wikidata (there are not many city council districts in Wikidata, for example) - but many Wikidata entities do or should have OCD-IDs.

As an example of how this is used, OCD-IDs are keys into Google's Civic Information API. You can query all statewide elected officials for Wisconsin using the API endpoint below. You need an API key from Google first, but the API is free for up to 25,000 calls a day.

curl 'https://civicinfo.googleapis.com/civicinfo/v2/representatives/ocd-division%2Fcountry%3Aus%2Fstate%3Awi?key=[YOUR_API_KEY]' --header 'Accept: application/json' --compressed

The response is in JSON and is a bit too long so I put a sample at this link
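The percent-encoding in the curl command above can be reproduced programmatically. The following Python sketch only builds the request URL and makes no network call; the endpoint is copied from the curl example, while the function name `representatives_url` and the placeholder key are assumptions made for illustration.

```python
from urllib.parse import quote

# Endpoint prefix copied from the curl example above.
BASE_URL = "https://civicinfo.googleapis.com/civicinfo/v2/representatives/"


def representatives_url(ocd_id: str, api_key: str) -> str:
    """Build the request URL for one division.

    safe="" forces '/' and ':' in the OCD-ID to be percent-encoded,
    because the whole ID is passed as a single path segment.
    """
    return f"{BASE_URL}{quote(ocd_id, safe='')}?key={quote(api_key, safe='')}"


if __name__ == "__main__":
    # "YOUR_API_KEY" is a placeholder, not a working key.
    print(representatives_url("ocd-division/country:us/state:wi", "YOUR_API_KEY"))
```

Encoding the whole identifier as one path segment is what turns `/` into `%2F` and `:` into `%3A` in the example request.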

You could imagine editors using WDQS SPARQL to query Wikidata for divisions with this property, then using that property against the Google API, and updating officeholders that might be out of date. Erik s paulson (talk) 23:20, 1 August 2020 (UTC)
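As a rough sketch of the first step in that workflow, the Python function below builds a SPARQL query listing items that carry the property. No property ID had been assigned at the time of this proposal, so `property_id` is a parameter and the `P0000` used in the usage line is purely a placeholder; the function name is also hypothetical. The `wdt:` prefix is one of the prefixes WDQS predefines.

```python
def divisions_query(property_id: str, limit: int = 100) -> str:
    """Return a SPARQL query for items that have an OCD-ID statement.

    property_id is whatever ID the property eventually receives;
    it is an input here because none exists in this proposal.
    """
    if not (property_id.startswith("P") and property_id[1:].isdigit()):
        raise ValueError(f"not a property ID: {property_id!r}")
    if limit < 1:
        raise ValueError("limit must be positive")
    return (
        "SELECT ?item ?ocdId WHERE {\n"
        f"  ?item wdt:{property_id} ?ocdId .\n"
        "}\n"
        f"LIMIT {limit}"
    )


if __name__ == "__main__":
    # P0000 is a placeholder, not the real property.
    print(divisions_query("P0000", limit=10))
```

The `?ocdId` values returned by such a query could then be fed, one at a time, into requests against the Google API.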

Discussion

  •   Support looks good, but is there a webpage for each identifier that we could link to? --Hannes Röst (talk) 21:36, 2 August 2020 (UTC)
    • Unfortunately there is not a webpage for each identifier; it's just a big static database. There is a real-world entity for each element on the list, and the idea is that if you describe that entity in the open civic data community, the OCD-ID is the name you should use for it. (Wikidata could be the page for each identifier!) Authority Control might be the wrong section for this property proposal; if it is, my apologies. Erik s paulson (talk) 02:53, 3 August 2020 (UTC)
      • I believe authority control is the correct place for this and I am happy to support this. I think Wikidata could be the right place to do this; however, I also wonder who will do the actual work and link all these identifiers (it seems there are on the order of 800k such identifiers). Do you have the technical capacity to do the matching yourself?
      • I think you could build a REST API that would direct somebody to the correct Wikidata item given an OCD identifier (or display a webpage populated with Wikidata entries when available).
      • Is there a plan to add backlinks into OCD itself?
      • Regarding your point of completeness: I think this should be eventually complete (Q21873974) and missing items should simply be added to Wikidata (there is no reason not to).
      • Q: How do you plan to handle deprecation? Districts often merge or split but may still be important for historic tracking; do they keep their identifier and get marked as historic, or is this tracked outside the database? Are the identifiers stable, and will new identifiers be introduced in such cases? It seems there is a validThrough/validFrom mechanism, but it only seems to be used in NZ, US and CA.
      • Q: Do you track completeness? It seems there are currently 32 countries, is there an estimation of how complete this is for each country? Is the plan to eventually cover all countries? --Hannes Röst (talk) 14:54, 3 August 2020 (UTC)
        • (Not sure if etiquette is to respond inline or in bulk, so I'm responding to all of your points in one big block.) First a metacomment: I am not representing the OCD project, so I am not speaking for any of their plans. But I will try to answer your questions as best I understand what the Open Civic Data project is thinking. On matching all identifiers, I think there will be interest in the civic data community in matching OCD-IDs to Wikidata entities, so I am hopeful that it would not just be me, but I am capable of using or building tools to help do bulk matching. The REST API you're describing is a good example of something that would be useful (also perhaps as a reconciler for OpenRefine). I would not expect OCD-ID to incorporate backlinks; they are focused on sticking to identifiers only, to be used as lookup keys in other databases. OCD-IDs are usually maintained for historic purposes - for example, Wisconsin's 9th congressional district (Q8027097) was dissolved in 2002 but the name still makes sense and is valid. Other times districts change boundaries - sometimes radically so - but they would keep the same identifier. I do not know how the OpenCivicData community measures completeness. The scheme is designed to be extensible and to grow independently in different parts of the namespace, and the governance process is explicitly organized so the management of OCD-IDs in a given country is driven primarily by residents of that country who know it best and not imposed on them by non-residents who may not have a proper understanding of the issues in that country. Erik s paulson (talk) 01:22, 6 August 2020 (UTC)
  •   Comment I updated the description a bit to include a sample Google API call that uses the identifier. It's not really a formatter URL, but hopefully that helps folks see how this property is useful. Erik s paulson (talk) 01:32, 24 August 2020 (UTC)

  WikiProject every politician has more than 50 participants and couldn't be pinged. Please post on the WikiProject's talk page instead. I think this would be a useful property for the EP project; hopefully this proposal can get a few more supports and be accepted/marked as 'Ready'. Erik s paulson (talk) 21:25, 30 August 2020 (UTC)

  •   Support definitely a worthwhile thing. how we populate it is a different and harder question. there are some open databases we can use I think but I think the best quality data here are behind paywalls. BrokenSegue (talk) 21:29, 30 August 2020 (UTC)
  • @Erik s paulson: Is this to be defined as being mapped specifically to the `id` field in files found at https://github.com/opencivicdata/ocd-division-ids/tree/master/identifiers/country-XX.csv files? Or can it be broader than that (e.g. files hosted outside of that project?), or perhaps narrower (files in the individual-country directories below that?) In other words, is it sufficient to only have a 'bare' ID, or would it also be required to provide the source of that ID? I'm also unclear as to which Wikidata item the OCD-ID would map to in cases where we differentiate between the different 'roles' a division plays via distinct items: for example it's good practice in Wikidata to differentiate between administrative and electoral areas even when they have the same geography: thus for example we have Queensland (Q36074) for the Australian state; but Queensland (Q56649111) for the constituency of the Australian Senate, and so any usage of `ocd-division/country:au/state:qld` could presumably map to either Wikidata item. (Australia is currently very sparsely populated in OCD, so this possibly isn't a great example there, and US "states as Senate constituencies" haven't yet been split out into separate items, but that's partially because the discussion is around whether there should be just a single item — e.g. "Alaska (US Senate Seat)" — or two per state, to reflect the two classes of senator in each state.) Similarly, in the other direction, my understanding is that a given 'place' can have multiple OCD-IDs linked with a 'sameAs': would we store all of these against the relevant Wikidata item? --Oravrattas (talk) 06:15, 31 August 2020 (UTC)
    • It would use the country-XX.csv files - that’s the stable location that OCDEP-2 defines for identifiers. The individual country directories are meant so each group maintaining a country’s data can keep data in the git repo, but paths in that directory are not stable. Country Maintainers use the data in the directories to create the top-level country-XX.csv files. It’s also not broader - the agreement is everyone who uses OCD-IDs as keys into their database uses what’s stored in that git repo, so there’s not a need to track the source of the ID. As for what item to map to, we’ll have to watch for cases where data models don’t quite line up. I don’t think it’s the end of the world if we give two Wikidata items of different types the same OCD-ID, if other datasources would have a single item to represent both (the state/senate constituency example) - people who read Wikidata can use the typing of Wikidata to figure out what they need. Or, that could be a good reason to go back and create additional identifiers in OCD. (A few years into the OCD project, for US House seats for low-population states they created “At-large districts” to represent the entire state but also differentiate between the state and House district, which made everyone’s data model a little easier. On the other hand, the Google Civic API treats US senators as simply being from the state and doesn’t have a separate constituency for them, and I think OpenElections treats it the same way.) Finally, there are a lot of items where there’s a clear OCD-ID that we can assign now and get started with, and figure out the harder cases later. The sameAs bit I think is more meant for historic identifiers, so we can either ignore anything except the current identifier, or we can add a ‘start time’/’end time’ qualifier and keep track of older OCD-IDs, but I would start with just focusing on getting the current identifiers in place so we make some progress.
Erik s paulson (talk) 01:48, 25 September 2020 (UTC)
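The "start with the current identifiers" approach in the reply above could look roughly like the Python sketch below. It assumes a country-XX.csv file has an `id` column (mentioned in the question above) and a `sameAs` column whose non-empty value marks an alias; the exact column layout is an assumption and should be checked against the repository. The sample rows, including the alias row, are invented for illustration and are not taken from the real files.

```python
import csv
import io

# Invented sample in the assumed layout; NOT real repository data.
SAMPLE_CSV = """id,name,sameAs
ocd-division/country:us,United States,
ocd-division/country:us/state:wi,Wisconsin,
ocd-division/country:us/state:wi/place:example_alias,Example alias,ocd-division/country:us/state:wi/place:madison
ocd-division/country:us/state:wi/place:madison,Madison,
"""


def primary_ids(csv_text: str) -> list:
    """Return IDs from rows that are not marked as an alias.

    Assumption: a non-empty 'sameAs' cell means the row points at
    another identifier, so it is skipped for a first matching pass.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        row["id"]
        for row in reader
        if not (row.get("sameAs") or "").strip()
    ]


if __name__ == "__main__":
    for ocd_id in primary_ids(SAMPLE_CSV):
        print(ocd_id)
```

Alias rows skipped here could be revisited later if older identifiers are to be recorded with 'start time'/'end time' qualifiers.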

  @Erik_s_paulson, Hannes_Röst, BrokenSegue: @Oravrattas:   Done Iwan.Aucamp (talk) 18:09, 26 September 2020 (UTC)