User:ElanHR/Draft:Signed Statements

This RfC aims to solicit ideas and increase visibility for an existing proposal for Signed Statements. I believe a version of Signed Statements can currently be implemented using the existing API by simply introducing new schema.

Problem/Motivation

In the Wiki community (and beyond), the presence of references is often used as a proxy for the trustworthiness of a fact. For example, many Wikidata-driven infoboxes will by default only use statements with references. This heuristic assumes that a consumer who was so inclined could follow the reference and verify for themselves that the source indeed supports the stated claim. While this is certainly a valid way to verify individual statements, it does not scale and is often stymied by underlying sources that are either not readily available or not easily parseable (by humans or machines).

Using A Bleaching Ground in a Hollow by a Cottage (Q20870755) as an example we can see that not all references are created equal:

reference URL (P854) (external database output supporting the claim) (reference for title (P1476) claim): Raw database output from a random URL. This is quite possibly an authoritative source, but for it to be deemed trustworthy the consumer must both trust the provider and verify that the Wikidata claim matches the source info.
stated in (P248) Jacob van Ruisdael catalog raisonné, 1911 (Q21004638) (reference for location of creation (P1071) claim): Points to a primary source for the data, but since the target is a physical book, a casual user can easily verify the information only if the source has been digitized (in this example it has, but this isn't always the case).
imported from Wikimedia project (P143) Wikimedia Commons (Q565) (reference for width (P2049) claim): Useful for providing provenance of where a fact comes from, but does not by itself improve trustworthiness, as it does not point to any external source to validate the claim.

As one can see, existing approaches to references give users varying ability to validate the data for themselves and, by proxy, should convey varying degrees of trustworthiness. Validating references by hand is also prohibitively time-consuming at scale. Fortunately, another common (and often reliable) heuristic is available: relying on the claims of a trusted authority. Rather than verify a fact myself, I will more often than not use the source as an indicator of trustworthiness without explicitly checking that it matches the stated claim.

Using Barack Obama (Q76) as an example I put varying amounts of trust in different claims based on the reference sources:

https://www.whitehouse.gov, https://www.nobelprize.org, https://www.nytimes.com, https://www.cnn.com : Known authorities (to me), unlikely to be vandalized/impersonated, vested interests in stating correct information.
https://d-nb.info: Not a familiar source to me (but looks reliable from a cursory glance).
http://www.surveyusa.com: Seems like an odd source of truth for a family name (P734) claim...

Even when the URL is no longer active (as in the case of [[1]]), I still generally consider the reference trustworthy based on the source alone, and think it likely that the page once supported the stated claim.

In these cases, where I am not independently verifying the referenced source, I am introducing another link into the chain of trust. Not only do I have to trust the validity of the underlying source, I also have to trust that the reference properly points to a source that supports the claim. There are a number of situations, of varying severity, in which this assumption can be broken, including:

A claim gets updated with new/more correct information - e.g. a source may support a Wikidata claim that a person was born in the 1300s and the claim gets updated to a specific year.
Qualifiers get added which cause a claim to be more specific than the reference actually implies - e.g. a referenced news article from years ago may only state a CEO's start time and the addition of an end time qualifier creates a claim no longer supported by the source.
A bad actor may add a reference which does not actually support the claim at all.
A bad actor may change a valid, referenced claim to an incorrect value, implying that the referenced source supports the new value.

The combination of all of these factors gets us to a rough estimation of trustworthiness where:

( Trustworthiness that Wikidata claim is correct and actually supported by the cited source A ) ~ ( Trustworthiness of source A ) x ( Trustworthiness of User B correctly sourcing reference for this claim ) x ( Probability that the claim has not been altered since the reference was added )

Signed statements can help simplify this for all cases to:

( Trustworthiness that Wikidata claim is correct and actually supported by the cited source A ) ~ ( Trustworthiness of source A ) x ( Trustworthiness of User B correctly sourcing reference for this claim )

and for cases where the source provides the data themselves (as is usually the case for data donations):

( Trustworthiness that Wikidata claim is correct and actually supported by the cited source A ) ~ ( Trustworthiness of source A )

What is a Signed Statement?

At a high level, a Signed Statement is a cryptographic method of endorsing that a Wikidata claim is supported by the cited source/reference and that the claim has not been altered. When the endorser and the source of the claim are the same entity (which would be the case in data donations), this reduces the number of points where error could be introduced. That shortens the chain of trust, improves verifiability, and hopefully improves trustworthiness for individual Wikidata statements.

Overall the process of generating a signed statement is as follows:

Serialize the claim+reference_to_be_endorsed in a consistent way, removing fields that do not affect the claim such as "hash" and "rank", as well as auxiliary qualifiers such as reason for deprecated rank (P2241) and reason for preferred rank (P7452).
Using the serialized claim+reference, generate a hash which is appended to the reference. This can be used to verify that the claim has not been changed since the reference was added.
The generated hash is then "signed" by encrypting it with the endorser's private key. This signature is also appended to the reference. Using the endorser's public key, anyone can then verify that the signer indeed endorsed the statement.

HashingFunction(serialized_claim, reference) → hash_of_statement: (If a claim is altered without updating references then the hash will not match and the reference cannot be assumed to support this fact)
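As a minimal sketch of what HashingFunction could look like, the snippet below serializes a claim and its reference as canonical JSON and hashes the result with SHA-256. The simplified claim layout, the choice of SHA-256, the helper names, and dropping ignored keys at every nesting level are all assumptions made for illustration; the proposal does not prescribe them. The reference passed in is the reference as it stands before the hash and signature are appended.

```python
import hashlib
import json

# Fields and rank-related qualifiers the proposal says should not affect the hash.
IGNORED_KEYS = {"hash", "rank", "P2241", "P7452"}


def strip_ignored(value):
    """Recursively drop ignored keys so they cannot change the hash."""
    if isinstance(value, dict):
        return {
            key: strip_ignored(child)
            for key, child in value.items()
            if key not in IGNORED_KEYS
        }
    if isinstance(value, list):
        return [strip_ignored(child) for child in value]
    return value


def hash_statement(claim, reference):
    """Hypothetical HashingFunction: SHA-256 over a canonical JSON form."""
    payload = {"claim": strip_ignored(claim), "reference": strip_ignored(reference)}
    # sort_keys and fixed separators make the serialization deterministic.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Simplified, hypothetical claim layout (not the real Wikidata JSON format).
claim = {"subject": "Q76", "property": "P19", "value": "Q6366688", "rank": "normal"}
reference = {"P854": "https://test-signed-statement.org"}

hash_of_statement = hash_statement(claim, reference)
# Changing only the rank leaves the hash untouched; changing the value does not.
same = hash_statement(dict(claim, rank="preferred"), reference)
different = hash_statement(dict(claim, value="Q114"), reference)
```

Here `same` equals `hash_of_statement` while `different` does not, which is the property the verification step relies on.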

EndorseStatement(hash_of_statement, endorsers_private_key) → signature_endorsing_statement

VerifyEndorsement(signature_endorsing_statement, endorsers_public_key) ≈ hash_of_statement: If decoding the signature using the endorser's public key yields a value that matches hash_of_statement, then the statement can be assumed to be endorsed.
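To make the sign/verify round trip concrete, here is a toy sketch using textbook RSA built from two well-known Mersenne primes. It is an illustration only and is not secure: there is no padding and the key is far too small. A real deployment would presumably rely on established PGP tooling, as the demo schema suggests. The function and constant names are hypothetical, and the verify function takes the hash as an explicit third argument for convenience.

```python
import hashlib

# Toy key pair for illustration only (textbook RSA, no padding, NOT secure).
P = 2**127 - 1
Q = 2**89 - 1
MODULUS = P * Q
PUBLIC_EXPONENT = 65537
# Modular inverse of the public exponent (requires Python 3.8+).
PRIVATE_EXPONENT = pow(PUBLIC_EXPONENT, -1, (P - 1) * (Q - 1))


def hash_to_int(hash_of_statement):
    """Map a hex digest into the range the toy modulus can represent."""
    return int(hash_of_statement, 16) % MODULUS


def endorse_statement(hash_of_statement, private_exponent):
    """Hypothetical EndorseStatement: sign the hash with the private key."""
    return pow(hash_to_int(hash_of_statement), private_exponent, MODULUS)


def verify_endorsement(signature, hash_of_statement, public_exponent):
    """Hypothetical VerifyEndorsement: decode with the public key and compare."""
    return pow(signature, public_exponent, MODULUS) == hash_to_int(hash_of_statement)


# Placeholder input standing in for the serialized claim+reference.
hash_of_statement = hashlib.sha256(b"serialized claim and reference").hexdigest()
signature = endorse_statement(hash_of_statement, PRIVATE_EXPONENT)

altered_hash = hashlib.sha256(b"altered claim and reference").hexdigest()
endorsed = verify_endorsement(signature, hash_of_statement, PUBLIC_EXPONENT)
altered_endorsed = verify_endorsement(signature, altered_hash, PUBLIC_EXPONENT)
```

With these inputs `endorsed` is true, while `altered_endorsed` is false because the signature was made over a different hash.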

For illustration purposes, I put together some demo schema on Wikidata-test based on the original proposal by User:Lydia_Pintscher_(WMDE).

 - [Item of Signing Authority]
 - [URL of Signing Authority]
 - [Claim hash]
 - [Endorsing signature]
 - [revision id of subject/signed item]
 - [revision id of object value]
 - [PGP public key] (should probably be available on the signing authority's website; added here for convenience/illustrative purposes)

And some example items using this schema:

 - [Example Item ] ([claims in JSON format])
 - [Example authoritative source ]

How would Signed Statements be verified?

Signed Statements would enable two forms of verification to make sure a claim is still supported by a given reference:

Simply checking that the claim matches the provided hash (e.g. the object IDs and qualifiers have not been altered). This verification would not require looking at revision history.
Checking that the identities of the subject and object have not been altered since the claim was made. This would involve comparing the current revisions to the recorded revisions and determining that the identity has not changed. How to do this in an automated way is out of scope for this proposal, but recording the revisions provides one way to flag candidates to be checked.
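The second form of verification can be sketched as a simple comparison of revision IDs. The snippet below only flags entities that have been edited since signing; deciding whether an edit actually changed an item's identity remains a manual (or future automated) step. The function name and all revision numbers are made up for illustration.

```python
def entities_needing_review(signed_revisions, current_revisions):
    """Return entity IDs whose current revision differs from the signed one."""
    return sorted(
        entity_id
        for entity_id, signed_rev in signed_revisions.items()
        if current_revisions.get(entity_id) != signed_rev
    )


# Revision IDs recorded in the reference at signing time (made-up numbers).
signed = {"Q76": 1001, "Q6366688": 2002}
# Revision IDs fetched later (also made up); the object item was edited since.
current = {"Q76": 1001, "Q6366688": 2050}

flagged = entities_needing_review(signed, current)
print(flagged)  # ['Q6366688']
```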

For example, let's say a bad actor wanted to change Wikidata to state "Barack Obama was born in Kenya" instead of Barack Obama (Q76) place of birth (P19) Kapiolani Medical Center for Women and Children (Q6366688) (which is currently supported by a reference to a birth certificate). There are a few ways they could do this:

1. Replace the object ID with Kenya (Q114). This would immediately be flagged because HashingFunction(serialized_claim, reference) would no longer match the hash_of_statement stored in the reference.
2. Change the identity of Kapiolani Medical Center for Women and Children (Q6366688) to match Kenya (Q114). In this case hash_of_statement would still match, but the change could be detected by checking whether the current identity of Kapiolani Medical Center for Women and Children (Q6366688) still matches the revision recorded in the reference.
3. Change the identity of Barack Obama (Q76) to something else. This would be detected in the same way as case #2.
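The first attack can be illustrated with a short sketch: once the object ID is swapped for Kenya (Q114), the recomputed hash no longer matches the one stored in the reference. The claim layout, the placeholder reference, and the use of SHA-256 over sorted JSON are assumptions for illustration rather than part of the proposal.

```python
import hashlib
import json


def hash_statement(claim, reference):
    """Hypothetical HashingFunction: SHA-256 over deterministic JSON."""
    canonical = json.dumps(
        {"claim": claim, "reference": reference},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def claim_matches_hash(claim, reference, stored_hash):
    """True if the claim and reference still hash to the stored value."""
    return hash_statement(claim, reference) == stored_hash


# Simplified stand-ins for the real claim and its birth certificate reference.
reference = {"description": "birth certificate (placeholder)"}
original = {"subject": "Q76", "property": "P19", "value": "Q6366688"}
stored_hash = hash_statement(original, reference)

# A vandal replaces the place of birth with Kenya (Q114).
vandalized = dict(original, value="Q114")

original_ok = claim_matches_hash(original, reference, stored_hash)
vandalized_ok = claim_matches_hash(vandalized, reference, stored_hash)
```

Here `original_ok` is true and `vandalized_ok` is false. Cases 2 and 3 leave the claim itself untouched, so this check alone would not catch them.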

While this proposal would in no way solve the problem of vandalism, it does address these specific forms and lets users have confidence in any signed facts endorsed by sources they trust.

Why is this important?

Signed Statements provide a number of benefits, all supporting the larger goal of improving Data Quality and Trust (phab:T76230):

It enables institutions donating data to ensure they will not be misattributed down the road.
It shortens the chain of trust by removing the question of whether a middleman (the reference) is trustworthy. Essentially, this would move from the current scheme of "someone (via this reference) says authoritative source X claims Y" to "authoritative source X claims Y (and you can immediately verify it)".
It allows for data audits providing an additional tool to counter these forms of vandalism.

Who is considered an authority?

Good question! For a trial run, scope should probably be limited to interested GLAM institutions doing batch uploads, but in theory claims could also be endorsed by authoritative individuals or simply Wikidata users.

Disclaimer: Data donations are also the use case that piqued my interest in this feature, as a coworker and I are planning to help a cultural institute upload/connect their database with Wikidata, and I thought it would be neat to be able to more closely attribute the data to the source.

How would a user find/verify an endorser's public key?

For an initial run, there could be a protected page listing information about the data providers and clear sourcing to the public key hosted on the providers' site.

Institution Name	Wikidata Item	Main site	Site with public key (hosted on main site)	Public key for convenience purposes
University of FakeAuthorities	[(test)Q214701]	https://test-signed-statement.org	https://test-signed-statement.org/pub_key	-----BEGIN PGP PUBLIC KEY BLOCK----- ...
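One small check a consumer of such a listing could run is that the key URL really is hosted on the provider's main site, as the table's column heading requires. The sketch below compares only scheme and hostname; the function name is hypothetical, and a real check would likely also fetch the key and compare it with the copy on the protected page.

```python
from urllib.parse import urlsplit


def key_hosted_on_main_site(main_site, key_url):
    """True if the key URL uses HTTPS and the same host as the main site."""
    main_host = urlsplit(main_site).hostname
    key = urlsplit(key_url)
    return (
        key.scheme == "https"
        and key.hostname is not None
        and key.hostname == main_host
    )


# Values taken from the example row in the table above.
listed_ok = key_hosted_on_main_site(
    "https://test-signed-statement.org",
    "https://test-signed-statement.org/pub_key",
)
# A key hosted elsewhere (hypothetical URL) should be rejected.
foreign_ok = key_hosted_on_main_site(
    "https://test-signed-statement.org",
    "https://keys.example.com/pub_key",
)
```

With these inputs `listed_ok` is true and `foreign_ok` is false.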

Do we plan to lock down signed statements?

This shouldn't be strictly necessary, since signed statements can be immediately flagged once they are changed. If an endorsed statement gets replaced by more accurate, specific, or up-to-date information, the question of whether to keep the old statement is probably one best left to the larger community.

How will we create a chain of trust?

This is an open question that I would love to hear your thoughts about!

This should hopefully be an easy sell to institutions that already want to make their data available and useful, particularly as it would be relatively simple to incorporate this scheme into existing batch upload frameworks.