Wikidata:Requests for permissions/Bot/InternetArchiveBot

The following discussion is closed. Please do not modify it. Subsequent comments should be made in a new section. A summary of the conclusions reached follows.

Approved indexing but not adding the archive links to working references; we will need a separate request for that. --Ymblanter (talk) 18:05, 5 May 2019 (UTC)

InternetArchiveBot

Tracked in Phabricator
Task T143488

InternetArchiveBot
Operator: Cyberpower678

Task/s: Add archive URL references to all original URLs.

Code: GitHub

Function details: This was highly requested by Wikidata users, including at WikiConference North America 2018. InternetArchiveBot, well known for its work on Wikipedia, will search through all of the items on Wikidata and add archive URL properties to the reference URLs. IABot will also proactively save all live URLs on Wikidata into the Wayback Machine for the future.—CYBERPOWER (Chat) 18:11, 12 March 2019 (UTC)

Thank you! Please remember to do a test run of between 50 and 250 edits. --abián 19:05, 12 March 2019 (UTC)

This might take a while. IABot is scanning links, but they are all alive. Given how much demand there was for IABot, I anticipated more dead links. Maybe it would be beneficial to have IABot just add the archive URLs to all links, since they are merely properties bound to the original?—CYBERPOWER (Chat) 19:50, 12 March 2019 (UTC)

Excellent news. For now I would just fire up the indexing and in a later stage actually start doing edits. Do you agree? Where can I see what items/urls you already indexed?

Did you actually update any code? I didn't notice any relevant commits. Can you point me to the source code?

You will encounter quite a few broken links. I hacked up Wikidata:WikiProject sum of all paintings/Link rot the other day. Maybe you can recover some of the links. Multichill (talk) 21:53, 12 March 2019 (UTC)

The actual commits are on the test branch. It's where all beta commits go. IABot 2 is still in beta since I'm a one person dev team. :p—CYBERPOWER (Chat) 22:29, 12 March 2019 (UTC)

You didn't completely answer my question.

I proposed that you only index for now and not edit. I reviewed some edits like [1] and these appear to be incorrect. The URL https://viaf.org/viaf/25168560/ works, so the bot shouldn't add an archive URL. An archive URL should only be added when a link is broken. Why did the bot make this edit?

Where can I see what items/urls you already indexed? Multichill (talk) 22:02, 19 March 2019 (UTC)

That's not how InternetArchiveBot works. It will pass URLs it can't find in the Wayback Machine to the Wayback Machine for archiving. What IABot submits for archiving is not tracked. It only tracks which URLs have archive URLs. As for adding archive URLs, you are the only one I have spoken to so far who objects to adding archives to live links, or even to editing on Wikidata for that matter. Consensus so far is to add archives to all links. Wikidata is a data repository, not an encyclopedia.—CYBERPOWER (Chat) 23:01, 19 March 2019 (UTC)
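As an illustration of the lookup-then-archive flow described in this comment, here is a minimal Python sketch. It is not IABot's code. It assumes the public Wayback Machine availability endpoint (`https://archive.org/wayback/available`) and its documented JSON shape; the function names and the action labels are hypothetical, and no network request is made here.

```python
from urllib.parse import urlencode

# Public Wayback Machine availability endpoint (assumed here for illustration;
# IABot itself may use different APIs).
WAYBACK_AVAILABILITY = "https://archive.org/wayback/available"


def availability_query(url):
    """Build the availability lookup URL for one reference URL."""
    return WAYBACK_AVAILABILITY + "?" + urlencode({"url": url})


def closest_snapshot(response):
    """Return the snapshot URL from a decoded availability response, or None."""
    closest = response.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def next_step(response):
    """Hypothetical decision: record a known snapshot, else request archiving."""
    snapshot = closest_snapshot(response)
    if snapshot is None:
        return ("submit_for_archiving", None)
    return ("record_archive_url", snapshot)
```

In this sketch only the snapshot URL is kept as state, which mirrors the statement that submissions are not tracked but known archive URLs are.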

You talk about consensus, but where was this established? I doubt there is consensus for this.

In your proposal you say you're only going to add archive URLs when a link is dead. This seems to be much broader. If that is the aim, please update your proposal to reflect what you actually want to do. Multichill (talk) 20:59, 20 March 2019 (UTC)

Yes, and then I wrote a comment explaining that I expanded the functions after several users on IRC said they believed it was a great idea. The one user below has also expressed support for this. I'll amend the proposal to reflect that.—CYBERPOWER (Chat) 22:06, 20 March 2019 (UTC)

Several users on IRC is not consensus building. I would support the original proposal, but I oppose the amended version. This would add millions of redundant links and wouldn't improve our data quality. Archive links should only be added when the original reference becomes unavailable. Multichill (talk) 21:27, 21 March 2019 (UTC)

I fail to see your viewpoint. Wikidata is meant to house data. How is adding archive URLs to existing URLs not improving the data? Unlike Wikipedia, Wikidata is simply a data repository. It's supposed to have as much data as possible. We should start a discussion then to see if there is a consensus. I feel there is consensus for this.—CYBERPOWER (Chat) 22:30, 21 March 2019 (UTC)

It's a data repository, not a data dump. Not all data should be added just because we can. It's not supposed to have as much data as possible.

On Wikipedia we only add the archive URL when the original URL is no longer available. Why change? How do I know, as a downstream user of the data, whether a URL is broken if all references are full of archive URLs?

You can just break this request into two parts. Keep this one for adding archive URLs only when the originals are broken (the original proposal). Try to get consensus for expanding the scope and, if you manage to get that, open a new bot proposal. Multichill (talk) 11:00, 24 March 2019 (UTC)

You have a point there. But let me counter-argue. Wikidata is nothing but structured data, and adding archives to all URLs is not turning it into a data dump, but rather preventing link rot from the start for services relying on it. If the link ever goes down, there's already an archive ready to go for it. We could also add a property that identifies whether the original is dead or not. There's an added bonus of identifying which URLs cannot be rescued with this method. While IABot 2.0beta14 will have significant improvements to the archiving routines and utilize the Wayback Machine's newest SPN2 APIs, along with some Wikidata integration bug fixes, not all URLs can be archived. It's not done by default on Wikipedia, but it is done when users tell the bot to do it. Here is a notable recent example.—CYBERPOWER (Chat) 16:36, 24 March 2019 (UTC)

Unless I misunderstood what you meant by indexing: IABot does keep track of which URLs are found where. You can retrieve a list of items by URL. The other way around requires a DB query, or an API query to the tool.—CYBERPOWER (Chat) 23:28, 19 March 2019 (UTC)

A DB query on which database, on which server? An API query to which endpoint? Multichill (talk) 20:59, 20 March 2019 (UTC)

IABot's DB on a cloud VPS located on Labs. The API URL is https://tools.wmflabs.org/iabot/api.php. The documentation can be found at meta:InternetArchiveBot/API.—CYBERPOWER (Chat) 22:06, 20 March 2019 (UTC)

Comment I've asked around on IRC, and it seems there is consensus favoring adding archive URLs to all links, given that Wikidata is a data repository rather than a set of encyclopedic articles. With that in mind, I restarted some tests with IABot adding archive URLs to live references.—CYBERPOWER (Chat) 19:58, 13 March 2019 (UTC)
Support Adding archiving to all outgoing links. ChristianKl ❪✉❫ 11:51, 18 March 2019 (UTC)

Are we ready for approval here? --Ymblanter (talk) 19:38, 18 March 2019 (UTC)

I certainly am.—CYBERPOWER (Chat) 21:11, 18 March 2019 (UTC)

@Multichill: and you? Lymantria (talk) 06:19, 21 March 2019 (UTC)

No, see above. Multichill (talk) 21:27, 21 March 2019 (UTC)

Support Gamaliel (talk) 18:39, 25 March 2019 (UTC)
@Ymblanter, Lymantria: There seems to be a consensus favoring the bot's current proposed operation.—CYBERPOWER (Chat) 07:17, 29 April 2019 (UTC)
Well, I guess we still need to come to terms with the only dissenter. @Multichill: Maarten, could you please clearly articulate what you are currently unhappy with? I cannot work this out from the discussion. --Ymblanter (talk) 20:03, 30 April 2019 (UTC)
I don't see consensus on everything. Mass adding archive URLs to working references will cause a lot of clutter. I support everything mentioned here except the mass adding of archive links to working references. You can approve that part and move the later scope expansion into a new request. Multichill (talk) 20:42, 30 April 2019 (UTC)
Thanks, I will likely do this, but I need to re-read the whole discussion, and it might get postponed till the weekend. --Ymblanter (talk) 21:21, 2 May 2019 (UTC)
Comment I think the best approach at this point would indeed be to approve the initial request (index all URLs contained in Wikidata, and add archive URLs to defunct links) and to spin out a separate discussion about whether and how to add archival links for non-defunct links. --Daniel Mietchen (talk) 03:48, 5 May 2019 (UTC)
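The difference between the two scopes debated above can be sketched as a single decision rule. This is an illustrative Python sketch only, not the bot's implementation: the `Reference` class, its fields, and the `archive_live_links` flag are hypothetical names, and the comments assume the Wikidata properties reference URL (P854) and archive URL (P1065).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Reference:
    """Hypothetical view of one Wikidata reference."""
    reference_url: str                  # value of reference URL (P854)
    is_live: bool                       # result of a link check, assumed given
    archive_url: Optional[str] = None   # existing archive URL (P1065), if any


def should_add_archive(ref, snapshot_url, archive_live_links=False):
    """Decide whether to add an archive URL to a reference.

    With archive_live_links=False this follows the initial request
    (defunct links only); True corresponds to the amended, wider scope.
    """
    # Nothing to do if an archive URL exists already or no snapshot is known.
    if ref.archive_url is not None or snapshot_url is None:
        return False
    return archive_live_links or not ref.is_live
```

Keeping the wider scope behind an explicit flag reflects the suggestion to handle it in a separate request.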

The above discussion is preserved as an archive. Please do not modify it. Subsequent comments should be made in a new section.