Wikidata:Requests for comment/Updating References for External Data

An editor has requested the community to provide input on "Updating References for External Data" via the Requests for comment (RFC) process. This is the discussion page regarding the issue.

If you have an opinion regarding this issue, feel free to comment below. Thank you!

Background

Our bot keeps data from several sources up to date by regularly refreshing it from the external sources. As part of the reference structure for these statements, we add retrieved (P813) to signify when the claim was checked against the database. An example item is here or Q4117777#P279. We run the bots regularly (weekly to monthly, depending on the data source) to check this data against the primary source, and would like to update the retrieved statements to indicate that the value was checked. However, this leads to many "blank" edits, where the only thing that changes is the date. Here is an example.

Arguments for updating time stamps every time a bot checks a statement:

  • This allows a person to know when a statement was last checked without checking the external source themselves
  • This allows queries to make use of this information in data retrieval
  • If an external resource goes down, it is valuable to know when the data was last checked

Arguments against:

  • These edits obscure edit histories and make it more difficult for a person to figure out when a certain statement was created or updated. (This is a greater issue with Wikidata histories, because it is already very difficult to identify the revision in which a particular statement was last modified.)
  • They cause unnecessary notifications for those watching the pages
  • They unnecessarily strain Wikidata resources

Proposal

Currently, we only update the timestamps if a value on that item changes. We are proposing that we stay with this procedure, but additionally update all time stamps on all statements every 6 months. The "maintenance" revisions could be labelled or tagged as such.
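
The proposed rule can be sketched as a small decision function. This is only an illustration, not the bot's actual code: the name `should_update_retrieved`, its parameters, the returned reason strings, and the approximation of "6 months" as 183 days are all assumptions made for the example.

```python
from datetime import date, timedelta
from typing import Tuple

# Assumption: "every 6 months" is approximated as 183 days.
MAINTENANCE_INTERVAL = timedelta(days=183)


def should_update_retrieved(
    item_value_changed: bool, last_retrieved: date, today: date
) -> Tuple[bool, str]:
    """Decide whether to rewrite retrieved (P813) on an item's references.

    Returns (update, reason). The reason is a hypothetical label that
    could be used as an edit summary or revision tag.
    """
    if item_value_changed:
        # Current procedure: a changed value always refreshes the timestamps.
        return True, "value changed"
    if today - last_retrieved >= MAINTENANCE_INTERVAL:
        # Proposed addition: periodic refresh, labelled as maintenance.
        return True, "maintenance"
    # Otherwise skip the "blank" edit entirely.
    return False, "unchanged"
```

For instance, a statement last retrieved on 1 April 2017 and checked again on 7 May 2017 with no value change would produce no edit, while one last retrieved on 1 November 2016 would get a single "maintenance" edit.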

Discussion

Interesting question. Some thoughts:

  • Reference retrieval dates, as they are typically used in most Wikimedia projects, are not updated each time the value is re-confirmed. This is mostly because there are barely any automatic routines in place, and no human editor would waste their valuable time on such edits. This has in fact changed with Wikidata and the possibility of using bots for this task.
    • I don't see an advantage with frequent updates, since users are not used to such behavior. We assume a reference to be correct as long as nobody invalidates it after it went offline.
    • However, an annual update (or maybe one every six months) would be fine and tolerable for me. A useful, explanatory edit summary should be provided.
  • An alternative to temporally narrow down changes regarding reference availability would be to make the full crawl log available at Tool Labs.
    • Create a tool account (if you don't have one yet), and provide a mask to filter the log with different, useful parameters. This would reduce the server load on the Wikidata side.
    • Advertise the tool within suitable Wikiprojects.

Regards, MisterSynergy (talk) 19:51, 7 May 2017 (UTC)

I have definitely run into this issue with some of the data sources I've used myself. An added twist is a database that time-stamps itself, so that the reference URL (or other aspects of the citation) changes with each release. When I re-ran a bot in that case, many statements ended up with two (or more) separate references, one for each instance of the database that had the same information. I wonder if this is maybe an issue that the Wikicite project should take up? Provenance in Wikidata is important, and we want to encourage good practices here. But we don't want to break stuff that works either... ArthurPSmith (talk) 16:04, 8 May 2017 (UTC)
  • The best solution, as defined in Help:Sources, is to switch from retrieved (P813) to publication date (P577). From a consistency point of view, publication date (P577) has to be preferred because it is the most accurate way to date a statement. retrieved (P813) is just a lazy solution that avoids looking for the publication date in the database. Most databases provide the date of the last change, so bots have to look for that date and see whether this publication date differs from the one in the WD statement. If there is no difference, nothing needs to change in WD.
Some databases update their data only once every ten years. Example: population data from census offices. So updating the data in WD every month or every year is stupid. Should we wait 10 years to check the data? No.
Here we can find the difference between an update and a check. An update means something is changing, and this should be displayed in WD. A simple check just means that we verify that the data is still correct according to the original set of data.
My conclusions:
  • Bots have to follow the sourcing policy and use publication date (P577) instead of retrieved (P813) as much as possible.
  • A distinction has to be made between an update and a simple consistency check: the first means a change in the original database, and this has to be recorded in WD; the second should be done offline as far as possible (using a dump, for example) and should not lead to any automatic update.
  • Discrepancies detected during a check should not be corrected automatically, in order to avoid bot wars. Every check should generate a report, and this report should be analyzed by humans in order to understand the reasons for the changes. This is perhaps a vision of an ideal world, but it is the only smart one: just running bots without checking what they did is nonsense work.
  • Data checks and data updates should be coordinated with the change rates of the original databases. If a database performs an update every year, the update in WD should follow the same schedule. Snipre (talk) 19:37, 8 May 2017 (UTC)
And what to do with a database that is updated every month or more frequently? ArthurPSmith (talk) 20:00, 8 May 2017 (UTC)
In that case mix update and check, but follow the source guidelines by using publication date (P577) and by updating only the data that changed. For all other data do a check, but don't change anything if no change is detected. I don't think the whole dataset is changing each month. Snipre (talk) 05:27, 9 May 2017 (UTC)
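
The update/check distinction described in the comments above could be sketched roughly as follows. This is one possible reading of the suggestion, not an agreed procedure: the `Record` type, the `classify` function, and the three outcome labels are hypothetical names introduced for the example.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Record:
    """A value plus the publication date (P577) it is attributed to."""
    value: str
    publication_date: Optional[date]  # None if the source exposes no date


def classify(wikidata: Record, source: Record) -> str:
    """Return 'update', 'report', or 'no-op' for one statement."""
    if (
        source.publication_date is not None
        and source.publication_date != wikidata.publication_date
    ):
        # The source published a new release: record it in Wikidata.
        return "update"
    if source.value != wikidata.value:
        # Same release but a different value: list it in a report for
        # human review instead of correcting it automatically.
        return "report"
    # Check passed: no edit, so no "blank" revision.
    return "no-op"
```

Under this sketch, a monthly database would only trigger edits for statements whose source release date actually moved; everything else is either silent or ends up in the report.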