Wikidata:Requests for comment/Signed Statements (T138708)

This RfC aims to solicit ideas and increase visibility for an existing proposal for Signed Statements. I believe a version of Signed Statements can currently be implemented using the existing API by simply introducing new schema.

Problem/Motivation

In the Wiki community (and beyond) the presence of references is often used as a proxy for the trustworthiness of a fact. For example many Wikidata-driven infoboxes will by default only use statements with references. This heuristic of trustworthiness makes the assumption that if a consumer were inclined they could follow the reference and verify themselves that the source indeed supports the stated claim. While this is certainly a valid approach to verify individual statements it is not at all scalable and is often stymied by the fact that the underlying sources are either not readily available or not easily parseable (for either humans or machines).

Using A Bleaching Ground in a Hollow by a Cottage (Q20870755) as an example we can see that not all references are created equal:

reference URL (P854) (external database output supporting the claim) (reference for title (P1476) claim): Raw database output from a random URL, quite possibly an authoritative source but for this to be deemed trustworthy the consumer must both trust the provider and verify that the Wikidata claim matches the source info.
stated in (P248) Jacob van Ruisdael catalog raisonné, 1911 (Q21004638) (reference for location of creation (P1071) claim): Points to a primary source for the data, but since the target is a physical book, a casual user can only verify the information themselves if the source has been digitized (in this example it has, but this isn't always the case).
imported from Wikimedia project (P143) Wikimedia Commons (Q565) (reference for width (P2049) claim): Useful for providing provenance of where a fact comes from but does not by itself improve trustworthiness, as it does not point to any external source to validate the claim.

As one can see, existing approaches to references provide varying ability for users to validate the data for themselves and by proxy should convey varying degrees of trustworthiness. Validating every reference by hand is also prohibitively time-consuming at scale, but fortunately another common (and often reliable) heuristic can be used: relying on the claims of a trusted authority. Rather than verify a fact myself, I will more often than not use the source as an indicator of trustworthiness without explicitly checking that it matches the stated claim.

Using Barack Obama (Q76) as an example I put varying amounts of trust in different claims based on the reference sources:

https://www.whitehouse.gov, https://www.nobelprize.org, https://www.nytimes.com, https://www.cnn.com : Known authorities (to me), unlikely to be vandalized/impersonated, vested interests in stating correct information.
https://d-nb.info: Not a familiar source to me (but looks reliable from a cursory glance).
http://www.surveyusa.com: Seems like an odd source of truth for a family name (P734) claim...

Even in the case that the URL is no longer active (as in the case of [[1]]) simply based on the source I still generally consider it trustworthy and will think it's likely that the page once supported the stated claim.

In these cases where I am not independently verifying the referenced source I am introducing another link into the chain of trust. Not only do I have to trust the validity of the underlying source, I also have to trust that the reference properly points to a source that supports the claim. There are a number of situations with varying degrees of severity in which this assumption can be broken including:

A claim gets updated with new/more correct information - e.g. a source may support a Wikidata claim that a person was born in the 1300s and the claim gets updated to a specific year.
Qualifiers get added which cause a claim to be more specific than the reference actually implies - e.g. a referenced news article from years ago may only state a CEO's start time and the addition of an end time qualifier creates a claim no longer supported by the source.
A bad actor may add a reference which does not actually support the claim at all.
A bad actor may change a valid claim with a reference to an incorrect value implying support by the referenced source.

The combination of all of these factors gets us to a rough estimation of trustworthiness where:

( Trustworthiness that Wikidata claim is correct and actually supported by the cited source A ) ~ ( Trustworthiness of source A ) x ( Trustworthiness of User B correctly sourcing reference for this claim ) x ( Probability that the claim has not been altered since the reference was added )

Signed statements can help simplify this for all cases to:

( Trustworthiness that Wikidata claim is correct and actually supported by the cited source A ) ~ ( Trustworthiness of source A ) x ( Trustworthiness of User B correctly sourcing reference for this claim )

and for cases where the source provides the data themselves (as is usually the case for data donations):

( Trustworthiness that Wikidata claim is correct and actually supported by the cited source A ) ~ ( Trustworthiness of source A )
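To make the effect of each factor concrete, the three estimates above can be worked through numerically. This is only an illustrative sketch: the function name `trust_estimate` and the values 0.95, 0.90 and 0.80 are made-up assumptions, not measurements of any real source or user.

```python
def trust_estimate(source, sourcing_user=1.0, unaltered=1.0):
    """Multiply the factors of the rough trustworthiness estimate."""
    return source * sourcing_user * unaltered


# Hypothetical factor values, chosen only for illustration.
SOURCE_A = 0.95    # trustworthiness of source A
USER_B = 0.90      # User B sourced the reference correctly
UNALTERED = 0.80   # claim not altered since the reference was added

unsigned = trust_estimate(SOURCE_A, USER_B, UNALTERED)
signed_by_user = trust_estimate(SOURCE_A, USER_B)
signed_by_source = trust_estimate(SOURCE_A)

print(f"unsigned reference: {unsigned:.3f}")
print(f"signed by User B:   {signed_by_user:.3f}")
print(f"signed by source A: {signed_by_source:.3f}")
```

Each factor that a signature removes from the product can only raise the estimate, since every factor is at most 1.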

What is a Signed Statement?

At a high level, a Signed Statement is a cryptographic method of endorsing that a Wikidata claim is supported by the cited source/reference and that the claim has not been altered. When the endorser and the source of the claim are the same entity (which would be the case in data donations), this reduces the number of points where error could be introduced, shortening the chain of trust, improving verifiability, and hopefully improving trustworthiness for individual Wikidata statements.

Overall the process of generating a signed statement is as follows:

Serialize the claim+reference_to_be_endorsed in a consistent way, removing fields that do not affect the claim such as "hash" and "rank", as well as auxiliary qualifiers such as reason for deprecated rank (P2241) and reason for preferred rank (P7452).
Using the serialized claim+reference, generate a hash which is appended to the reference. This can be used to verify that the claim has not been changed since the reference was added.
The generated hash is then "signed" by encrypting it with the author's private key. This signature is also appended to the reference; anyone can check it against the author's public key to verify that the signer indeed endorsed the statement.

HashingFunction(serialized_claim, reference) → hash_of_statement: (If a claim is altered without updating references then the hash will not match and the reference cannot be assumed to support this fact)

EndorseStatement(hash_of_statement, endorsers_private_key) → signature_endorsing_statement

VerifyEndorsement(signature_endorsing_statement, endorsers_public_key) ≈ hash_of_statement: If decoding the signature with the endorser's public key yields hash_of_statement, the statement can be assumed to be endorsed.
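The three functions above can be sketched in a few lines of Python. This is a minimal illustration under several assumptions, not a proposed implementation: the claim and reference are simplified dictionaries rather than real Wikibase JSON, the canonical form is key-sorted JSON hashed with SHA-256, and the signature is textbook RSA with a toy key built from two small Mersenne primes. The toy key is not secure and stands in for the PGP keys used in the demo schema; a real deployment would use an established signature scheme.

```python
import hashlib
import json

# Keys dropped before hashing: housekeeping fields and rank-reason qualifiers.
IGNORED_KEYS = {"hash", "rank", "P2241", "P7452"}

# Toy RSA key built from two Mersenne primes. NOT secure; illustration only.
P, Q = 2**127 - 1, 2**89 - 1
MODULUS = P * Q
PUBLIC_EXPONENT = 65537
PRIVATE_EXPONENT = pow(PUBLIC_EXPONENT, -1, (P - 1) * (Q - 1))  # Python 3.8+


def strip_ignored(value):
    """Recursively remove ignored keys from nested dicts and lists."""
    if isinstance(value, dict):
        return {k: strip_ignored(v) for k, v in value.items()
                if k not in IGNORED_KEYS}
    if isinstance(value, list):
        return [strip_ignored(v) for v in value]
    return value


def hashing_function(claim, reference):
    """Hash a key-sorted JSON serialization of claim + reference."""
    canonical = json.dumps(strip_ignored({"claim": claim, "reference": reference}),
                           sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def endorse_statement(hash_of_statement, private_exponent=PRIVATE_EXPONENT):
    """Sign the hash (reduced modulo the toy modulus) with the private key."""
    digest = int(hash_of_statement, 16) % MODULUS
    return pow(digest, private_exponent, MODULUS)


def verify_endorsement(signature, hash_of_statement,
                       public_exponent=PUBLIC_EXPONENT):
    """Decode the signature with the public key and compare to the hash."""
    digest = int(hash_of_statement, 16) % MODULUS
    return pow(signature, public_exponent, MODULUS) == digest


# Hypothetical, simplified statement and reference.
claim = {"subject": "Q76", "property": "P19", "value": "Q6366688",
         "rank": "normal", "qualifiers": {}}
reference = {"P854": "https://test-signed-statement.org"}

hash_of_statement = hashing_function(claim, reference)
signature = endorse_statement(hash_of_statement)
```

Changing the rank leaves the hash untouched, while changing the value produces a different hash that the stored signature no longer matches.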

For illustration purposes, I put together some demo schema on Wikidata-test based on the original proposal by User:Lydia_Pintscher_(WMDE).

 - [Item of Signing Authority]
 - [URL of Signing Authority]
 - [Claim hash]
 - [Endorsing signature]
 - [revision id of subject/signed item]
 - [revision id of object value]
 - [PGP public key] (should probably be available on the signing authorities website, added this for convenience/illustrative purposes)

And some example items using this schema:

 - [Example Item ] ([claims in JSON format])
 - [Example authoritative source ]

How would Signed Statements be verified?

Signed Statements would enable two forms of verification to make sure a claim is still supported by a given reference:

Simply checking that the claim matches the provided hash (e.g. the object IDs and qualifiers have not been altered). This verification would not require looking at revision history.
Checking that the identity of the subject/objects has not been altered since the claim was made. This would involve comparing the current revisions to the marked revision and determining that the identity has not changed. How to do this in an automated way is out of scope for this proposal, but this provides one way to flag candidates to be checked.

For example, let's say a bad actor wanted to change Wikidata to state "Barack Obama was born in Kenya" instead of Barack Obama (Q76) place of birth (P19) Kapiolani Medical Center for Women and Children (Q6366688) (which is currently supported by a reference to a birth certificate). There are a few ways they could do this:

Replacing the object ID with Kenya (Q114). This would immediately be flagged by HashingFunction(serialized_claim, reference) no longer matching the provided hash_of_statement from the reference.
Changing the identity of Kapiolani Medical Center for Women and Children (Q6366688) to match Kenya (Q114). In this case hash_of_statement would still match but could be detected by checking if the identity of Kapiolani Medical Center for Women and Children (Q6366688) still matches the identity of the flagged revision in the reference.
Changing the identity of Barack Obama (Q76) to something else. This would be detected the same way as case #2.
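The three cases above can be sketched as a single check. This is a simplified illustration: the field names (`claim_hash`, `subject_revision`, `object_revision`), the revision numbers, and the JSON-plus-SHA-256 hash are all assumptions. A changed revision only marks a candidate for review, since deciding whether an item's identity really changed is out of scope.

```python
import hashlib
import json


def statement_hash(claim, reference):
    """SHA-256 over a key-sorted JSON form of claim + reference (an assumption)."""
    canonical = json.dumps({"claim": claim, "reference": reference},
                           sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def check_statement(claim, reference, signed, current_revisions):
    """Return a list of reasons the statement should be reviewed (empty if none)."""
    # Check 1: needs only the current claim, no revision history.
    if statement_hash(claim, reference) != signed["claim_hash"]:
        return ["claim no longer matches the signed hash"]
    # Check 2: a newer revision of the subject or object flags a candidate.
    flags = []
    for role in ("subject", "object"):
        if current_revisions[claim[role]] != signed[role + "_revision"]:
            flags.append(f"{role} {claim[role]} was edited after signing")
    return flags


claim = {"subject": "Q76", "property": "P19", "object": "Q6366688"}
reference = {"P854": "https://test-signed-statement.org"}
signed = {
    "claim_hash": statement_hash(claim, reference),
    "subject_revision": 1001,  # made-up revision ids
    "object_revision": 2002,
}

untouched = check_statement(claim, reference, signed,
                            {"Q76": 1001, "Q6366688": 2002})
case_1 = check_statement(dict(claim, object="Q114"), reference, signed,
                         {"Q76": 1001, "Q114": 3003})
case_2 = check_statement(claim, reference, signed,
                         {"Q76": 1001, "Q6366688": 2050})
case_3 = check_statement(claim, reference, signed,
                         {"Q76": 1042, "Q6366688": 2002})
```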

While this proposal would in no way solve the problem of vandalism it does address these specific forms and lets users have confidence in any signed facts endorsed by sources they trust.

Why is this important?

Signed Statements provide a number of benefits all supporting the larger goal of improving Data Quality and Trust (phab:T76230):

It enables institutions donating data to ensure they will not be misattributed down the road.
It shortens the chain of trust by removing the question of whether a middleman (the reference) is trustworthy. Essentially this would move from the current scheme of "someone (via this reference) says authoritative source X claims Y" to "authoritative source X claims Y (and you can immediately verify it)".
It allows for data audits providing an additional tool to counter these forms of vandalism.

Who is considered an authority?

Good question! For a trial run, scope should probably be limited to interested GLAM institutions doing batch uploads; however, this schema also enables endorsements by authoritative individuals, Wikidata users, or anyone who generates a public/private key pair.

How would a user find/verify an endorser's public key?

For an initial run, there could be a protected page listing information about the data providers and clear sourcing to the public key hosted on the providers' site.


Institution Name	Wikidata Item	Main site	Site with public key (hosted on main site)	Public key for convenience purposes
University of FakeAuthorities	[(test)Q214701]	https://test-signed-statement.org	https://test-signed-statement.org/pub_key	-----BEGIN PGP PUBLIC KEY BLOCK----- ...

A crowd-sourced ad-hoc approach of sharing trusted endorsers' public keys using Wikipedia / Wikibase Transclusion

1. There could be a list of keys trusted by a group of users.
2. Each group of users (e.g. a WikiProject) can maintain its own trusted endorser list.
3. An end user can curate their own trusted endorser list, which transcludes the lists of the groups they trust.
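The transclusion idea above amounts to taking the union of a user's own picks and the lists of the groups they follow. A minimal sketch, in which the group names and key fingerprints are hypothetical placeholders:

```python
def resolve_trusted_keys(own_keys, followed_groups, group_lists):
    """Union a user's own trusted keys with the lists of groups they follow."""
    trusted = set(own_keys)
    for group in followed_groups:
        # Unknown groups contribute nothing rather than raising an error.
        trusted |= set(group_lists.get(group, ()))
    return trusted


# Hypothetical key fingerprints and group names, for illustration only.
group_lists = {
    "WikiProject A": {"KEY-UNIV-FAKE", "KEY-MUSEUM-1"},
    "WikiProject B": {"KEY-LIBRARY-1"},
}
my_trusted = resolve_trusted_keys(
    own_keys={"KEY-PERSONAL-PICK"},
    followed_groups=["WikiProject A"],
    group_lists=group_lists,
)


def is_trusted_endorser(key_fingerprint):
    """True if the key is in this user's resolved trust list."""
    return key_fingerprint in my_trusted
```

Because the group lists are resolved at lookup time, a key added to or removed from a followed group's list takes effect for every user who transcludes it.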

Do we plan to lock down signed statements?

This shouldn't be strictly necessary since signed statements can be immediately flagged once they are changed. When an endorsed statement gets replaced by more accurate/specific/up-to-date information, the question of whether to keep the old statement is probably one best left to the larger community.

How will we create a chain/web of trust?

This is an open question that I would love to hear your thoughts about!

This should hopefully be an easy sell to institutions that already want to make their data available and useful, particularly as it would be relatively simple to incorporate this scheme into existing batch upload frameworks.

Milestones

Create properties to support statement hashes/signatures.
Write and release a Python library to generate and check statement signatures.
Develop bot that periodically checks for and flags signature mismatches.
Cooperate with interested institution to upload first batch of signed data.
Get additional community feedback on value based on initial trial run.
(stretch) Propose/write pull request for QuickStatements that adds the option of generating signatures (client-side) using a provided private key.
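As a rough sketch of the bot milestone, one pass over the signed statements could look like the following. The record layout, statement ids and revision numbers are hypothetical. The records stand in for the results of a query for references that use the schema, with the hash already recomputed. Running the revision comparison only on every fifth pass is an assumption, reflecting the idea raised in Project Chat that this check is more expensive.

```python
def run_bot_pass(statements, pass_number, identity_every=5):
    """Return (statement_id, reason) pairs that need manual review."""
    # The costlier subject/object comparison runs only on some passes.
    check_identity = pass_number % identity_every == 0
    violations = []
    for st in statements:
        if st["recomputed_hash"] != st["signed_hash"]:
            violations.append((st["id"], "hash mismatch"))
        elif check_identity and st["current_revision"] != st["signed_revision"]:
            violations.append((st["id"], "subject/object edited since signing"))
    return violations


# Made-up records for illustration.
statements = [
    {"id": "stmt-1", "signed_hash": "aaa", "recomputed_hash": "aaa",
     "signed_revision": 10, "current_revision": 10},
    {"id": "stmt-2", "signed_hash": "bbb", "recomputed_hash": "ccc",
     "signed_revision": 20, "current_revision": 20},
    {"id": "stmt-3", "signed_hash": "ddd", "recomputed_hash": "ddd",
     "signed_revision": 30, "current_revision": 31},
]

quick_pass = run_bot_pass(statements, pass_number=1)
full_pass = run_bot_pass(statements, pass_number=5)
```

The returned list is meant for manual review. A flagged statement is not necessarily vandalism, since the change may be a valid update.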

Acknowledgements

Thank you to @Lydia_Pintscher_(WMDE): for making the original proposal as well as @Harej:, @Jura1:, @BrokenSegue:, and @Jheald: for their thoughtful questions which helped me clarify my thoughts and pointed me to additional cases I hadn't considered.

Project Chat Discussion (for context)

Edit: In order to provide more context and consolidate discussion on this topic I put together the following RfC: Wikidata:Requests_for_comment/Signed_Statements_(T138708) ElanHR (talk) 06:50, 16 March 2021 (UTC)

For anyone interested, I put together a brief proposal (https://phabricator.wikimedia.org/T138708#6911561) and some demo schema on wikidata-test to support signed statements and am interested in any thoughts/feedback people have. :)

I'm primarily interested in the use case of a batch data donation by an authoritative source (e.g. GLAM institutions) and enabling them to sign uploaded statements.

Example item with signed statements: https://test.wikidata.org/wiki/Q214700
Example item for authoritative source: https://test.wikidata.org/wiki/Q214701

Cheers, ElanHR (talk) 02:12, 15 March 2021 (UTC)

So to check if one actually has the signed version of the statement one would need to retrieve a page in an item's revision history? --- Jura 11:34, 15 March 2021 (UTC)
- Thanks for asking, how verification is done was certainly not clear. This would enable two levels of verification to make sure a claim is still supported by a given reference:

Simply checking that the claim matches the provided hash (e.g. the object IDs and qualifiers have not been altered). This verification would not require looking at revision history.
Checking that the identity of the subject/objects has not been altered since the claim was made. This would involve comparing the current revisions to the marked revision and determining that the identity has not changed.

For example, let's say a bad actor wanted to change Wikidata to state "Barack Obama was born in Kenya" instead of Barack Obama (Q76) place of birth (P19) Kapiolani Medical Center for Women and Children (Q6366688), which is currently supported by a reference to a birth certificate. There are a few ways they could do this:

Simply replace the object ID with Kenya (Q114). This would immediately be flagged by f(claim, reference) no longer matching the provided hash.
Change the identity of Kapiolani Medical Center for Women and Children (Q6366688) to match Kenya (Q114). In this case the hash would still match but could be detected by checking if the identity of Kapiolani Medical Center for Women and Children (Q6366688) still matches the identity when the reference was made.
Change the identity of Barack Obama (Q76) to something else. Same as case #2

Per @BrokenSegue:'s comment I will put together an RfC on this and make sure to distinguish these cases. ElanHR (talk) 20:46, 15 March 2021 (UTC)

Is there an RfC or document outlining the value of such a proposal? I see the top comment for the ticket but it doesn't explain how they imagine working in practice. Is there demand for this? Do we plan to lock down signed statements? How will we create a chain of trust? Etc. BrokenSegue (talk) 13:36, 15 March 2021 (UTC)
- Not that I've seen but I agree that would be a better place to hold a discussion on this. I will try to flesh this out some and put together an RfC either today or tomorrow!

To somewhat address the "chain of trust" question, I see a number of benefits of this, all supporting the larger goal of improving Data Quality and Trust (https://phabricator.wikimedia.org/T76230):

It enables institutions donating data to ensure they will not be misattributed down the road.
It shortens the chain of trust by removing the question of whether a middleman (the reference) is trustworthy. Essentially this would move from the current scheme of "someone (via this reference) says authoritative source X claims Y" to "authoritative source X claims Y (and you can immediately verify it)". That said, whether a user trusts a particular source is still up to them!
It allows for data audits providing an additional tool to counter these forms of vandalism.

These are great questions and I will definitely try to tackle them more in depth in the RfC. ElanHR (talk) 20:46, 15 March 2021 (UTC)

@ElanHR: One of the issues with signed statements is that there may be legitimate changes that can and should be made to a statement without invalidating its reference. For example, suppose a qualifier is added -- should that always invalidate the signature? (Perhaps the signature should indicate which qualifiers it does or does not encompass). Or suppose there is a legitimate merge of the value of the item. While that could be an attack, it's also often just housekeeping. Or suppose a statement gets deprecated with "reason for deprecation"; or preferred, with "reason for preferred rank". Would either of those affect the signature? Curious as to your thoughts on this. Jheald (talk) 21:25, 15 March 2021 (UTC)

I think in the case of making a claim more specific (e.g. adding qualifiers, updating a value to be more specific) we should be wary about attributing more to a source than it actually says. For instance if one source says someone "was born in the 1300s" and later information suggests that they "were born in 1317" we should be wary of updating the claim without removing the reference to avoid misattribution. In these cases I feel the proper course of action would be to leave both to accurately represent a source's claim.

In the second case preferred/deprecation reasons are a case that I definitely overlooked and you're right these should not affect the signature. In order to avoid this the serialization function should be defined to ignore these housekeeping properties. Thanks for pointing this out! ElanHR (talk) 06:50, 16 March 2021 (UTC)

@ElanHR: It's not just qualifiers like reason for preferred rank (P7452) or reason for deprecated rank (P2241) though, is it? What about subject named as (P1810), object named as (P1932), applies to part (P518), nature of statement (P5102), subject has role (P2868), object has role (P3831), statement is subject of (P805), statement disputed by (P1310), statement supported by (P3680), follows (P155), followed by (P156), replaces (P1365), replaced by (P1366), relative position within image (P2677) and all sorts of other qualifiers that we may think it is legitimate to add to a statement, although they go beyond just 'housekeeping'. Jheald (talk) 08:01, 16 March 2021 (UTC)

@Jheald: I see your point. I think the easiest way to do this would be to add a separate reference with the additional information. I feel this is a bit clunky as an approach but this provides a way of adding additional context without invalidating an endorsement. In my test item I've added an "official web site" claim to illustrate this approach.

While I don't think the addition of any of the properties you listed are likely to be particularly harmful, I am wary of a final result where someone "endorses" something they did not explicitly say.

Alternatively there is a field for "snaks-order" so it may be possible to only serialize the claim + only reference statements made before the hash/signature property but I think this approach would be brittle and unintuitive. ElanHR (talk) 04:46, 17 March 2021 (UTC)

@ElanHR: Particularly on a property like position held (P39) (which can already have a lot of values) I think it would be unfortunate to say that adding replaces (P1365) or follows (P155) or various other not-very-controversial qualifiers should require a new statement. Pinging @Andrew Gray:, who does a lot of work in this area. Slightly OT, it does seem to be an issue that surfaces from time to time, how to specify that a particular reference is supporting a particular sub-set of qualifiers on the statement -- see eg this in Project Chat last December (and independently somebody had asked me exactly the same thing at just the same time), or this 2019 property proposal, which fell principally because no examples were given. Jheald (talk) 10:46, 17 March 2021 (UTC)

I agree adding a new value isn't a particularly elegant way to show which qualifiers are supported by a reference but I'm not sure of a better way of doing it. Additionally I think in the worst case this could contribute to ambiguity.

For example: Barack Obama (Q76) position held (P39) United States senator (Q4416090) occurs twice with different qualifiers. If I have a reference that supports the qualifier electoral district (P768) Illinois Class 3 senate seat (Q101499034) which one should it support? In this case it's valid for both but I don't think that would always be the case.

At a high level I think we might be hitting a wall with what is possible to model with being able to point to individual references. Even for the "Supports qualifier" proposal I could imagine problem cases that are not possible to model because a qualifier property could potentially occur multiple times for a value. E.g. For a cast member (P161) statement, what if the actor had multiple character role (P453)? ElanHR (talk) 05:41, 19 March 2021 (UTC)

@ElanHR: I am also suspicious of the whole rationale here, which has always seemed to smack of "Ooooh, crypto. Shiny!". We don't actually need a cryptographic signature to say who added a statement -- we have an edit history for that, and it applies to everybody's edits, not just some hallowed few. What we actually need are better tools to look at the edit history of a single statement, whoever added it. Jheald (talk) 08:01, 16 March 2021 (UTC)
Well I certainly can't bemoan anyone's skepticism of crypto-hype (that is essentially my life in the tech industry lol). In this case I think it might actually be an elegant solution to trusting facts that anyone can alter.

As for the "hallowed few", the proposed schema would definitely be available for anyone who is interested in using it. The reason for highlighting data donations from GLAM institutions is that this would likely result in a higher claims/endorser ratio and make it easier to showcase the schema's value.

I totally agree that improved tools for reviewing edit history would be desirable. Unfortunately reviewing edit history to track provenance has some complexities that make signed statements appealing (keeping track of this small amount of extra data to make it computationally more efficient). I go into (perhaps too much) detail on this under [#Discussion on the RfC] ElanHR (talk) 05:06, 17 March 2021 (UTC)
This is actually something that comes up fairly frequently, but unfortunately hasn't been tackled by devs yet. Accordingly, I don't quite see why an RFC would be needed. Maybe it's a solution to a problem we have now when a statement that is supported by several references gets changed to something slightly or completely different. --- Jura 09:14, 16 March 2021 (UTC)

@Jura1: One thing that is different this time is User:ElanHR trying to see just how far we can get towards this in user-space, without dev input; and trying to get the community to think about what we would actually want, & what issues come up, rather than devs just presenting a finished package for the community to take or leave.

I have to say I do find the additional reference claims on https://test.wikidata.org/wiki/Q214700 a bit ugly, and would make the regular reference claims harder to see -- but that's a UI thing that could be easily enough fixed with an appropriate UI patch, either as a gadget or in the main code.

The one thing I think ultimately would need dev intervention is that one would probably want the system to keep track of whether the statements still meet the signed hash (and to subtly modify the statement presentation if they do), rather than that calculation having to be done by every browser for every signed statement every time the page was opened. But the latter approach is okay as a userland proof-of-concept.

The real thing the community needs to think about is the social dimension -- is this a thing we think is actually worth spending any time on (& a justifiable complexity increase) ? and secondly, as I have been trying to do above, how well does it play with how we actually use references in reality?

A worked demo should certainly help us focus on those two questions, but should not be given any presumption of inevitability that it would actually go forward. Jheald (talk) 11:29, 16 March 2021 (UTC)

I don't really care who ultimately does the development and if WMDE hasn't done it in 8 years, they are unlikely to do it. Personally, I think the problem I mentioned above needs to be addressed, semi- or fully protecting MediaWiki pages isn't going to solve it. --- Jura 11:35, 16 March 2021 (UTC)
@Jheald:/@Jura1: I totally agree that the demo endorsement scheme is uglier than I'd like - this is the minimum information necessary to make it work and all that being exposed in userspace is a bit much. :\

I was considering assigning the phab ticket to myself because I think it'd be useful for the community but wanted to wait for a response to my proposal from the community plus the devs who are already on the ticket (that and my plate is already moderately full). While I can't say I've coded a gadget before I'm fairly confident I could eke one out given some time.

As for keeping track of violations: one of my first goals for this would be to populate a list of violations via a cron job that:

runs a SPARQL query for references that use this schema
checks for a hash/signature match
verifies that the identity of Sub/Obj has not been changed (probably at a lower frequency because it's more computationally intense).

Violations could then be reviewed manually. For cases where the change is valid the endorsement could either be moved to a deprecated claim or simply removed.

I also fleshed out some of the examples and motivations on the Wikidata:Requests_for_comment/Signed_Statements_(T138708)#Discussion. ElanHR (talk) 05:30, 17 March 2021 (UTC)

I'm completely unconvinced by what I've read so far. I don't see how the supposed 'chain of trust' is shortened or strengthened by signed statements, as opposed to well-referenced statements. I have concerns that 'signed statement' = 'owned statement' ... institutions that submit data to Wikidata do not own the item nor the statement, and even a whiff of protecting items or statements is anathema for the "do not own the item nor the statement" reason. I'm not even clear whose interests are being served here. Refs work well enough for the uninvolved user. The institution should be well enough served by periodically checking their database holdings against WD holdings to see what's changed. Uninvolved users must be free to amend statements, not least since data originating from institutions often enough has errors within it. --Tagishsimon (talk) 06:16, 17 March 2021 (UTC)
@Tagishsimon: I disagree wholeheartedly with the characterization that signed statements are 'owned' statements and would argue a more accurate description is 'endorsed'. This signature in no way implies protection from edits, it is just a quick way of easily recognizing if a statement is what was said by an individual/institution. The data donation process is still the same: I have some data I would like to make freely available, I put it in a shared schema, and then put it out for anyone to use/modify/edit/remove/etc.

I think an appropriate analogy is an open letter. I can craft a letter and put it out to the world for anyone to sign on to; however, if someone changes what the letter says, I want my signature removed until I can read it (people say some crazy stuff). That said, no one is stopping you from copying/editing or writing your own conflicting letter.

Per your comment "The institution should be well enough served by periodically checking their database holdings against WD": This can be an incredibly laborious/technically difficult process depending on what has changed and I think you are overestimating the technical expertise of institutions and willingness to put forward the effort necessary for this ongoing process. My impression from attending Wiki conferences is the most common batch uploaders are small GLAM institutions (think local historical society/library) that just want to put their data into a CSV, upload it, and be done.

Re: "whose interests are being served here": People using, producing, and curating data - see my comment in the RfC. ElanHR (talk) 06:59, 17 March 2021 (UTC)
@Tagishsimon: Just to clarify I don't think you're incorrect to say that "Refs work well enough for the uninvolved user" (heck a lot of use cases don't even require references), it's just not everyone falls into this bucket and I want to make Wikidata as usable as possible. Vandalism exists and when used this feature could be one step to counter it, help avoid misattribution, and help users who can't blindly trust references without verifying. ElanHR (talk) 07:10, 17 March 2021 (UTC)
My personal opinion is that this will do little for vandalism detection. I can just add a new statement or change the label or... We would need a very high density of signed statements for someone to reasonably only consider signed statements. The answer to vandalism detection here is better ML. BrokenSegue (talk) 13:25, 17 March 2021 (UTC)
The usage of "Revision ID of signed subject/object value" properties could help somewhat with detecting this form of vandalism (see #2+3 of #How would Signed Statements be verified?) but your point stands.

From personal experience a major bottleneck in developing ML solutions is that labeled positive examples of vandalism are exceedingly few compared to good examples and labeling new examples is incredibly laborious/time consuming and often requires domain expertise. I totally agree that this would not solve the problem of vandalism on its own but it could help find new cases and hopefully reduce the workload of items being reviewed. As for the density question fortunately (or unfortunately) bots automatically importing datasets are becoming the majority of edits/contributions (http://datakolektiv.org/app/WD_HumanEdits) so it may take only a handful of sources to adopt this schema to have noticeable impact.

While I agree ML approaches will be necessary to tackle the more general problem of vandalism I believe both approaches can be used in conjunction resulting in an improvement over our current process.

PS: The following isn't a criticism specifically against ML-based vandalism detection but just a fun example that recently came up in a talk where we were discussing how confident we should be in current ML solutions: https://www.theverge.com/2021/3/8/22319173/openai-machine-vision-adversarial-typographic-attacka-clip-multimodal-neuron ElanHR (talk) 05:58, 19 March 2021 (UTC)[reply]
@Tagishsimon: ["#Second round of Wikidata updates on the mass data imported a while ago/ Wikidata history query service"] actually provides a pretty great case study of how signing statements would be beneficial by providing checks at the statement level. If the data were originally signed than finding these unedited statements would be a two step process:
1. Querying for statements signed by this institution.
2. Checking if the statement hash is still valid.
ElanHR (talk) 06:38, 19 March 2021 (UTC)[reply]
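
The two steps above could be sketched roughly as follows. This is only an illustration: the dictionary layout, the field names (`signed_by`, `statement_hash`) and the `statement_hash` helper are assumptions made for the example, not the schema from the proposal and not anything the Wikibase API returns.

```python
import hashlib
import json


def statement_hash(subject, prop, value):
    """Hash a canonical serialisation of a (subject, property, value) claim."""
    canonical = json.dumps([subject, prop, value], separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def unedited_signed_statements(statements, signer):
    """Step 1: keep statements signed by `signer`.
    Step 2: keep only those whose stored hash still matches the current value."""
    result = []
    for s in statements:
        if s.get("signed_by") != signer:
            continue
        current = statement_hash(s["subject"], s["property"], s["value"])
        if s["statement_hash"] == current:
            result.append(s)
    return result


# Hypothetical data: "item-2" was edited after it was signed.
statements = [
    {"subject": "item-1", "property": "P2049", "value": "66 cm",
     "signed_by": "ExampleMuseum",
     "statement_hash": statement_hash("item-1", "P2049", "66 cm")},
    {"subject": "item-2", "property": "P2049", "value": "70 cm",
     "signed_by": "ExampleMuseum",
     "statement_hash": statement_hash("item-2", "P2049", "60 cm")},
    {"subject": "item-3", "property": "P2049", "value": "50 cm",
     "signed_by": None, "statement_hash": None},
]

unedited = unedited_signed_statements(statements, "ExampleMuseum")
print([s["subject"] for s in unedited])
```

Statements that pass step 1 but fail step 2 are the ones that changed after the institution signed them, which is the set a reviewer would want to look at first.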

Discussion

An editor has requested the community to provide input on "Signed Statements (T138708)" via the Requests for comment (RFC) process. This is the discussion page regarding the issue.

If you have an opinion regarding this issue, feel free to comment below. Thank you!

JakobVoss (talk) ClaudiaMuellerBirn (talk) Criscod (talk) Daniel Mietchen (talk) Ettorerizza (talk) Ls1g (talk) Pasleim (talk) Hjfocs (talk) 17:24, 21 January 2019 (UTC) PKM (talk) 2le2im-bdc (talk) 20:30, 24 January 2019 (UTC) Vladimir Alexiev (talk) 16:37, 21 March 2019 (UTC) ElanHR (talk) User:Epìdosis (talk) Tris T7 ^{TT me} UJung (talk) 11:43, 24 August 2019 (UTC) Envlh (talk) SixTwoEight (talk) User:SCIdude (talk) Will (Wiki Ed) (talk) Mathieu Kappler (talk) So9q (talk) 19:33, 8 September 2021 (UTC) Zwolfz (talk) عُثمان (talk) 16:31, 5 April 2023 (UTC) M2k~dewiki (talk) 12:28, 24 September 2023 (UTC) —Ismael Olea (talk) 18:18, 2 December 2023 (UTC) Andrea Westerinen (talk) 23:33, 2 December 2023 (UTC) Peter Patel-Schneider[reply]
Notified participants of WikiProject Data Quality
1997kB 94rain Ajraddatz Ameisenigel Andreasmperu ArthurPSmith Bingobro Bovlb Bridget CrystalLemonade Csisc Dan Koehl Daniel Mietchen Dr.üsenfieber Eihel ElanHR Elliot Padfield Envlh Epìdosis Esteban16 Hasley Jarekt Joalpe KonstantinaG07 M2k~dewiki MarioGom Martin Urbanec Matěj Suchánek MisterSynergy Pasleim Prahlad balaji Putnik Pyfisch QZanden Rockpeterson Samoasambia Sintakso Sjoerddebruin Titodutta Trade Tris T7 Valdemar2018 YMS ZI Jony Tanbiruzzaman BrokenSegue (talk) Yahya Nastoshka
Notified participants of WikiProject Counter-Vandalism
Andheb (talk) ElanHR (talk) 10:05, 17 March 2019 (UTC) Jneubert (talk) 20:55, 23 July 2019 (UTC) Daniel Mietchen (talk) 20:47, 26 September 2019 (UTC) Eihel (talk) 11:21, 21 March 2020 (UTC) PAC2 (talk) Blue Rasberry (talk) 20:57, 15 February 2021 (UTC) PKM (talk) 03:29, 23 February 2021 (UTC) KAMEDA, Akihiro (talk) 12:07, 28 June 2021 (UTC) Vladimir Alexiev (talk) 16:58, 12 December 2021 (UTC) Dhx1 (talk) 12:14, 23 September 2022 (UTC) RShigapov (talk) 08:42, 12 June 2023 (UTC)[reply]
Notified participants of WikiProject Datasets

The proposal as it stands doesn't solve any problem we are having. If we have an item for John Smith and a source adds a signed claim, the signature of the claim would still be valid if the item gets later changed to be about Jane Smith so the signature will be read as authorities endorsing claims that they don't actually endorse. At the same time the proposal provides no added benefit for the use case where a single data donor is responsible for adding the data as we already store that information in the edit history. ChristianKl ❪✉❫ 23:01, 16 March 2021 (UTC)
The proposal as it stands doesn't solve any problem we are having.
I guess it really depends on who the "we" is in this statement. The reason for this proposal is to directly address problems I face both as a consumer/producer of data on the platform and as someone interested in improving anti-vandalism tools.
- Consumer: I want to be able to use Wikidata statements with confidence that they have not been vandalized. I would also like to be able to efficiently verify that claims indeed match their references. Even when the target URL is a trusted source and the data on the page is in a structured format, this is difficult and computationally expensive.
- Producer: I want the data that I help upload to be used by others and making it as trustworthy as possible is one way to promote that.
- Vandalism detection: I want tools that allow reference verification in a computationally efficient way. This new schema provides an efficient way to do this for references that choose to include it.
I should be clear, I am not proposing we require data donors (or anyone) to use this schema/endorse their data in this way during their uploads - I simply propose we have the schema to enable users who want to. I believe having the schema would allow us to showcase its benefits which hopefully would drive further usage.

If we have an item for John Smith and a source adds a signed claim, the signature of the claim would still be valid if the item gets later changed to be about Jane Smith so the signature will be read as authorities endorsing claims that they don't actually endorse.

This is actually precisely the 2nd and 3rd examples of vandalism described under "#How would Signed Statements be verified?" and why in the original proposal @Lydia_Pintscher suggested including properties to state which revisions a claim is being made about. While this form of vandalism wouldn't be immediately flagged by simply checking the hash, recording those revision IDs would enable tools to do this check offline (having a revision ID to check makes this process much, much easier). It also enables a particularly cautious user to only use the tagged revisions of the subject/object items.

At the same time the proposal provides no added benefit for the use case where a single data donor is responsible for adding the data as we already store that information in the edit history.
This may be true in some cases but there are two complications: tracing provenance through the edit history is computationally expensive, and an institution may make data donations under multiple usernames or, more likely, via shared usernames (bots and QuickStatements).
- Computation cost: This schema allows statements using these endorsements to be verified in O(1) time. Assuming that institutions each had a single WMF account (which I will show is often not the case) it may be possible to trace provenance, but this would be slow to compare and would require numerous calls to the API to check versions.
Consider this example: You have some rocks in your hand and someone hands you some more. If you keep track of the number of rocks in your hand (X) and the number given (Y) you can check the final value immediately by simply adding the two (X+Y). If you don't have this information you would have to count all the rocks, which would take significantly longer. This proposal is the equivalent of keeping track of X and Y so you don't need to go through the process of counting each time (the hash/signature are these extra bits of info to track).
- Data donors <-/-> Wiki usernames: While some data donors may upload their data under a single user this is not always the case. For institutional data there may be multiple people involved in the project who may do separate uploads, each still representing the institution. Additionally, a very common practice is to simply convert data to a common format (e.g. CSV) and then upload it via QuickStatements, in which case everything gets uploaded under a single account (https://www.wikidata.org/w/index.php?title=Special:Contributions/QuickStatementsBot&offset=&limit=500&target=QuickStatementsBot). For the latter case it could be possible to parse out individual users from the edit logs but this is definitely an added layer of complexity and a brittle approach when trying to generalize.
ElanHR (talk) 04:21, 17 March 2021 (UTC)[reply]
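
To make the cost argument a little more concrete, here is a rough sketch contrasting the two checks. The `revisions` list stands in for an item's edit history and `stored_hash` for the extra information kept in the reference; both structures and all function names are assumptions made for illustration, not existing API calls.

```python
import hashlib


def claim_hash(value):
    """Hash of the claim's current value (a stand-in for the full statement)."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def verify_with_stored_hash(current_value, stored_hash):
    """One comparison, regardless of how many edits the item has had."""
    return claim_hash(current_value) == stored_hash


def verify_by_history(revisions, donor_accounts):
    """Walk the whole history (oldest first) to find who last changed the value.
    The cost grows with the number of revisions."""
    last_editor = None
    previous = None
    for rev in revisions:
        if rev["value"] != previous:
            last_editor = rev["user"]
            previous = rev["value"]
    return last_editor in donor_accounts


# Hypothetical history: an upload, an unrelated edit, then a changed value.
revisions = [
    {"user": "QuickStatementsBot", "value": "66 cm"},
    {"user": "SomeEditor", "value": "66 cm"},
    {"user": "AnotherEditor", "value": "99 cm"},
]
stored_hash = claim_hash("66 cm")

print(verify_with_stored_hash("99 cm", stored_hash))
print(verify_by_history(revisions, {"QuickStatementsBot"}))
```

Note that even when the history walk succeeds, a shared account such as QuickStatementsBot only tells you which tool made the edit, not which institution stood behind it.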
I don't see why we should believe that it's more complex for an API to check which user made an edit than checking whether the edit is signed correctly. Even if you have to additionally make a lookup call to a list of which user accounts belong to which organization that's still O(1).

Key management is generally a hard problem in organizations. If we solve the problem based on user accounts it would be easy for an organization to declare which user accounts belong to the organization. That allows them to easily revoke a compromised user account's status as belonging to their organization.

As long as you believe that the WMF servers are trustworthy at representing who made which edits, rights management based on user names is easier than rights management based on cryptographic keys. If you don't believe that the WMF servers are trustworthy (and for example state that edits have been made from accounts that didn't actually make the edit), please explain the threat model against which you want to defend. ChristianKl ❪✉❫ 13:03, 17 March 2021 (UTC)

"I don't see why we should believe that it's more complex for an API to check which user made an edit than checking whether the edit is signed correctly."

I totally agree that one could define a data structure that internally tracked editor information at the claim/qualifier/reference level, allowing a theoretical API to do what you're describing, but my understanding is that Wikibase currently doesn't. I'd love to be shown otherwise, but as far as I know even getting a complete list of editors for an item requires O(N) calls to the API (which is totally understandable for the case of popular items).

This proposal aims to provide a mechanism to do similar to what you're describing but in userspace without requiring an extensive update to Wikibase. My impression is that WMF/WMDE's dev teams are stretched pretty thin (this requested feature is from 5+ years ago) and making an extensive change to their systems to support this functionality is nowhere on their radar.

"If you don't believe that the WMF servers are trustworthy (and for example state that edits have been made from accounts that didn't actually make the edit), please explain the threat model against which you want to defend."

This isn't a motivation I had (and I don't think it is much of a worry). The reasons for not having this proposal be based on usernames are the aforementioned difficulties with matching edits to usernames and the fact that some usernames make contributions on behalf of multiple sources/individuals (e.g. LargeDataSetBot, QuickStatementsBot).

My main motivation is that it is currently difficult to:

- Check that a reference is not outdated: claims are easily altered without references being properly updated/removed.
- Assert that the referenced source indeed supports the claim on Wikidata. In the case of a reference URL (P854) you could try regex matching/NLP but this is difficult and computationally expensive. When it is a physical source (like a book/newspaper/etc.) this approach becomes impossible.

While humans can usually do both of these tasks manually (assuming the source is publicly available in a digital format), Wikidata is growing faster and faster (bot edits have outpaced those made by humans) to the point where human inspection of every change is no longer feasible. This proposal enables us to automatically flag certain kinds of errors/vandalism, reducing the overall workload for humans. ElanHR (talk) 01:27, 19 March 2021 (UTC)

Usernames are much better for attribution than a new system of cryptographic signature. Properly managing cryptographic keys is not easy. Integrating the functions of QuickStatementBot more natively on Wikibase would make more sense than a new signing system. ChristianKl ❪✉❫ 10:18, 10 August 2022 (UTC)[reply]

What about revoking signatures? Surely someone will sign something in error and we will need a way for them to revoke it. Such a revocation would need to include the statement id I would think. BrokenSegue (talk) 13:15, 17 March 2021 (UTC)[reply]
Good question! I think it depends on what sort of endorsement revocation is desired:
- On an individual statement level I think the easiest approach would be to just delete the endorsing reference.
- In the case that a systemic error is discovered with a specific batch of uploads (e.g. someone used instance of (P31) instead of occupation (P106) for people's occupation), most batch tools have an option to revert. This could also be done by reverting contributions in a specific timeframe.
- If someone/some institution wanted to revoke their endorsements wholesale then it should be as simple as removing their public key from their domain (or from some wiki page where they are listed). This would leave the uploaded facts in place, but the endorsements would get flagged the next time they are processed for verification (at which point they would be removed). For example, assuming I wanted to stop personally endorsing facts, I would just stop hosting my public key and thus facts could not be verified against it. ElanHR (talk) 00:05, 19 March 2021 (UTC)
@ElanHR: ok but reverting the edits defeats the purpose. the signature is still floating out there. how do I know it's wrong now? if we're just going to take the "current" value of wikidata as "good" and anyone can revert a signed edit and that "unsigns" it why did we bother with signatures in the first place? just add a reference "attested to by $foo". BrokenSegue (talk) 02:16, 19 March 2021 (UTC)[reply]
Please let me know if I've misunderstood your concern but I only meant to recommend reverting (and removing any new signatures) in the case that someone uploaded data which was later determined to be incorrect, though this should be the practice regardless of whether statements are signed or not.

Additionally I should clarify that the third case (wanting to revoke all their endorsements) is not a case of a hash mismatch (which indicates the statement has changed without updating the reference and would be flagged for review) but a case where someone goes to verify a signature and the associated public key is no longer available. This case would be flagged differently, and in the case of a genuine revocation (vs a technical glitch of not being able to access the key) all signatures using this key should be removed (though the hash could technically stay as it would still be valid). The underlying facts would remain untouched.
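
A rough sketch of how a verifier might keep these two outcomes apart is below. The status names, the `published_keys` set (standing in for keys hosted on a donor's domain or listed on a wiki page) and the endorsement fields are all assumptions for the sake of the example.

```python
import hashlib
from enum import Enum


class Status(Enum):
    VALID = "valid"
    HASH_MISMATCH = "statement changed after signing: flag for review"
    KEY_UNAVAILABLE = "public key no longer published: flag as possible revocation"


def check_endorsement(endorsement, current_value, published_keys):
    """Classify an endorsement; the key check and the hash check fail differently."""
    if endorsement["key_id"] not in published_keys:
        return Status.KEY_UNAVAILABLE
    current_hash = hashlib.sha256(current_value.encode("utf-8")).hexdigest()
    if current_hash != endorsement["hash"]:
        return Status.HASH_MISMATCH
    # A real verifier would also check the signature against the published key here.
    return Status.VALID


# Hypothetical endorsement of the value "66 cm" under a key called "museum-key-1".
endorsement = {
    "key_id": "museum-key-1",
    "hash": hashlib.sha256("66 cm".encode("utf-8")).hexdigest(),
}

print(check_endorsement(endorsement, "66 cm", {"museum-key-1"}))
print(check_endorsement(endorsement, "99 cm", {"museum-key-1"}))
print(check_endorsement(endorsement, "66 cm", set()))
```

In the last call the hash would still match, but the endorsement can no longer be tied to a published key, so it is reported separately from an ordinary mismatch.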

In this scheme anyone can remove anyone else's endorsements. While this could be problematic, the same could be said about removing other types of reference without cause and should be treated as a form of vandalism. As @Tagishsimon: pointed out protecting these signatures over other references would be at odds with wiki practices concerning donated data.

The difference between the proposed approach and having an "attested to by $foo" property is that in the latter case anyone could add it. For instance I can add the claim Barack Obama (Q76) place of birth (P19) Kenya (Q114) with the reference "attested to by" Smithsonian Institution (Q131626) without any factual basis in whether the Smithsonian actually endorsed such a statement, and one will either have to trust me or verify the fact themselves at the Smithsonian. Due to the nature of how signing works one can only ever sign a statement as oneself (or as someone whose private key one holds), which avoids the problem of misattribution. ElanHR (talk) 05:15, 19 March 2021 (UTC)
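
To illustrate why a signature cannot simply be added by anyone, here is a toy sketch using textbook RSA with Python's built-in `pow`. Everything in it is an assumption for illustration: the key is built from two small, publicly known Mersenne primes, there is no padding, and the claim string format is made up. It is not secure and is not the signing scheme of the proposal; a real implementation would use a vetted cryptographic library.

```python
import hashlib

# Toy key material (NOT secure): two Mersenne primes and the usual public exponent.
P, Q = 2**61 - 1, 2**89 - 1
N = P * Q                            # public modulus
E = 65537                            # public exponent
D = pow(E, -1, (P - 1) * (Q - 1))    # private exponent, known only to the signer


def digest(claim):
    """Reduce a SHA-256 digest of the claim to an integer below N."""
    raw = hashlib.sha256(claim.encode("utf-8")).digest()
    return int.from_bytes(raw, "big") % N


def sign(claim, private_exponent):
    """Only someone holding the private exponent can produce a valid signature."""
    return pow(digest(claim), private_exponent, N)


def verify(claim, signature):
    """Anyone can check a signature using only the public values N and E."""
    return pow(signature, E, N) == digest(claim)


claim = "example-item|P19|example-place"    # hypothetical serialised claim
genuine = sign(claim, D)                     # made by the key holder
forged = sign(claim, D + 2)                  # someone guessing at the private key

print(verify(claim, genuine))
print(verify(claim, forged))
print(verify("example-item|P19|somewhere-else", genuine))
```

A plain "attested to by" reference corresponds to writing the signer's name next to the claim with no `verify` step at all, which is why it can be attached by anyone.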
Why not just have authorities host attestation file(s) on their domain (secured via SSL) that includes all statements they attest to. I would alternatively appreciate an API that gives us O(1) lookup to who added/last edited what statement which offers some of the same value. Ultimately I don't know how valuable this would be but I won't stand in your way. BrokenSegue (talk) 13:20, 17 March 2021 (UTC)[reply]
Why not just have authorities host attestation file(s) on their domain (secured via SSL) that includes all statements they attest to.

That would be absolutely great and along those lines I think federated Wikibase instances would be the bee's knees! That said, there are still some drawbacks to this approach: mainly, it puts the technical burden of hosting on the data donors, and it requires data consumers to know about their Wikibase instances.

If that project develops a way to automatically connect such databases that would be amazing but with this current proposal I am hoping to solve these problems without having to require additional development work on WMF/WMDE's side. ElanHR (talk) 00:15, 19 March 2021 (UTC)[reply]