Wikidata:Requests for permissions/Bot/METbot

The following discussion is closed. Please do not modify it. Subsequent comments should be made in a new section. A summary of the conclusions reached follows.

Not done: No follow-up on the request to see if this is still active. @Fuzheado: feel free to re-open this if you want to follow up on it (revert this edit, add it back to the list of bot requests). Thanks. Mike Peel (talk) 21:26, 4 February 2022 (UTC)

METbot

METbot (talk • contribs • new items • new lexemes • SUL • Block log • User rights log • User rights • xtools)
Operator: Fuzheado (talk • contribs • logs)

Task/s: This bot adds depicts (P180) statements to Wikidata items corresponding to Metropolitan Museum of Art artworks and the qualifier determination method (P459) -> Metropolitan Museum of Art Tagging Initiative (Q106429444) as well as "#metttagging" to the edit summary. It uses The Met's Open Access database and a controlled vocabulary of around 1,000 high-quality keyword tags that have already been reconciled to Wikidata Q numbers.

Code: Python code on PAWS here: https://public.paws.wmcloud.org/User:METbot/mettagger/mettagger_P180.ipynb

Function details: (copied from PAWS/Jupyter notebook)

METtagger bot helps add high-quality depiction information to existing Wikidata items that correspond to Metropolitan Museum of Art works. It does so by using the weekly CSV dump The Met publishes on GitHub.

TL;DR: This bot adds depicts (P180) statements to Wikidata items corresponding to Metropolitan Museum of Art artworks and the qualifier determination method (P459) -> Metropolitan Museum of Art Tagging Initiative (Q106429444) and adds "#metttagging" to the edit summary. It uses The Met's Open Access database and a controlled vocabulary of around 1,000 high-quality keyword tags that have already been reconciled to Wikidata Q numbers.

Since 2020, The Met has been including high-precision Wikidata Q numbers for many of their fields, which makes these bot tasks easier and more precise. These include Q numbers for:

objects/artifacts
creator/constituent
tag/depiction info

An example of the Wikidata_URL that The Met records in its database can be seen in this API call: https://collectionapi.metmuseum.org/public/collection/v1/objects/294500

The Met's GitHub repository and CSV are here: https://github.com/metmuseum/openaccess

Bot procedure

The bot works by loading the CSV dump from The Met and finding which objects have a Wikidata_URL. It then uses pywikibot to iterate through the list of Q items, comparing what The Met has as depiction keywords (tags) with what the corresponding Wikidata item has as its P180. If the depiction statement implied by a Met keyword tag is not in Wikidata, the bot adds a new P180 statement via pywikibot. A second pass over the Wikidata item's P180 statements adds a qualifier to indicate that the P180 statement is sourced to The Met (even if the P180 statement was already there).
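The comparison step described above can be sketched without pywikibot. This is a minimal illustration, not the notebook's code: the column names "Object Wikidata URL" and "Tags Wikidata URL", the pipe separator, and the helper names are assumptions, and the existing P180 values are passed in as a plain dict rather than fetched from Wikidata. The tag QIDs in the sample are for illustration only.

```python
import csv
import io
import re

# Matches a trailing Q number, e.g. ".../wiki/Q78828856" -> "Q78828856".
QID_PATTERN = re.compile(r"(Q\d+)$")


def qid_from_url(url):
    """Return the trailing Q number of a Wikidata URL, or None if absent."""
    match = QID_PATTERN.search(url.strip())
    return match.group(1) if match else None


def planned_additions(csv_text, existing_depicts):
    """Map each item QID to the tag QIDs missing from its P180 statements.

    existing_depicts is a dict of item QID -> set of QIDs already present
    as P180 (hypothetical input; a real run would read this via pywikibot).
    """
    plan = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        item = qid_from_url(row.get("Object Wikidata URL") or "")
        if item is None:
            continue  # The Met records no Wikidata item for this object
        tag_urls = (row.get("Tags Wikidata URL") or "").split("|")
        tags = [q for q in (qid_from_url(u) for u in tag_urls) if q]
        current = existing_depicts.get(item, set())
        missing = [q for q in tags if q not in current]
        if missing:
            plan[item] = missing
    return plan


# Assumed CSV layout; only the columns used above are shown.
sample = (
    "Object ID,Object Wikidata URL,Tags Wikidata URL\n"
    "1,https://www.wikidata.org/wiki/Q78828856,"
    "https://www.wikidata.org/wiki/Q5|https://www.wikidata.org/wiki/Q144\n"
    "2,,https://www.wikidata.org/wiki/Q5\n"
)
plan = planned_additions(sample, {"Q78828856": {"Q5"}})
print(plan)
```

The second row is skipped because it has no Wikidata URL, and the tag already present on the first item is not proposed again.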

We are using the depicts (P180) qualifier determination method (P459) -> Metropolitan Museum of Art Tagging Initiative (Q106429444). This keeps it consistent with what we are doing in Structured Data on Commons (SDC). Since SDC does not have "reference" statements like Wikidata, we felt it was better to use P459->Q106429444 on both Wikidata and Commons to stay consistent. We are open to discussing other ways to do this, but this seems sensible for now. An example diff: https://www.wikidata.org/w/index.php?title=Q78828856&diff=1409934952&oldid=1363169833
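For readers unfamiliar with the data model, the shape of such a statement can be sketched as a plain Python dict following the general Wikibase JSON layout. This is an illustration only: the bot itself works through pywikibot, the helper names are hypothetical, and the tag QID in the usage line is arbitrary.

```python
def item_snak(prop, qid):
    """Build a value snak pointing at an item (Wikibase JSON layout)."""
    return {
        "snaktype": "value",
        "property": prop,
        "datavalue": {
            "type": "wikibase-entityid",
            "value": {
                "entity-type": "item",
                "numeric-id": int(qid[1:]),
                "id": qid,
            },
        },
    }


def depicts_statement(tag_qid):
    """P180 statement qualified with P459 -> Q106429444."""
    return {
        "type": "statement",
        "rank": "normal",
        "mainsnak": item_snak("P180", tag_qid),
        "qualifiers": {"P459": [item_snak("P459", "Q106429444")]},
    }


stmt = depicts_statement("Q144")  # arbitrary tag QID for illustration
```

A reference-based alternative would put the source in the statement's references rather than in its qualifiers.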

Special flags and options

There are two Python dicts that bot operators can define to restrict the behavioral logic of the P180 additions.

do_not_depict - Define Wikidata QIDs you don't want the bot to add to any P180. The Met is refining its tagging so that tags don't reflect "instance of" information, such as portrait, landscape art, etc. So you can list a series of Q numbers to never add as depiction info.

compatible_met_object_names - Define object names from The Met that you want to process, excluding all others. A blank dict means process everything. To see a list of the types of names and their frequency, you can consult: https://www.wikidata.org/wiki/Wikidata:GLAM/Metropolitan_Museum_of_Art/TOAH/objectName_count
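A minimal sketch of how the two dicts might gate processing follows. The dict names come from the description above, but their contents, the value types, and the helper functions are assumptions rather than the notebook's code; the QIDs listed for portrait and landscape art should be verified before any real use.

```python
# QIDs never to add as P180 (values are labels for readability only).
# Assumed QIDs; verify against Wikidata before use.
do_not_depict = {
    "Q134307": "portrait",
    "Q191163": "landscape art",
}

# Met object names to process; an empty dict means process everything.
compatible_met_object_names = {
    "Painting": True,
    "Print": True,
}


def should_process(object_name, compatible_names):
    """True if the object name is allowed, or if no restriction is set."""
    return not compatible_names or object_name in compatible_names


def allowed_tags(tag_qids, blocked):
    """Drop any tag QID listed in the do_not_depict dict."""
    return [q for q in tag_qids if q not in blocked]


print(should_process("Painting", compatible_met_object_names))
print(allowed_tags(["Q134307", "Q144"], do_not_depict))
```

Keeping both checks as small pure functions makes it easy to test a configuration before running the bot against live items.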

Scale of work

There are roughly 600,000 works of art from The Met in their CSV file, with about 20,000 having a corresponding Wikidata item. We are working methodically through the list, starting with 2D artworks and smaller sets of object names to help evaluate our working methods.

Contact

Contact Andrew Lih (User:Fuzheado) with any issues. (April 2021) --Fuzheado (talk) 02:07, 28 April 2021 (UTC)

@Fuzheado: Seems like a good project. Couple small issues:

Code doesn't properly handle deprecated pre-existing P180 statements.
Why are you using qualifiers instead of references? I think references are a more appropriate way to store this data.
You should probably use stated in (P248) instead of determination method (P459) and maybe also add the reference URL (P854) to the specific version of the CSV file you are using on GitHub (in case the file changes later).

BrokenSegue (talk) 03:32, 28 April 2021 (UTC)

Thanks for the feedback. Here are some responses:

P180 statements in the item that are not from The Met should not be considered "deprecated." So they should stay. If you're talking about P180 statements that are attributed to The Met but are no longer in The Met database, you're right that isn't handled by this bot. That can be handled by a future "roundtripping" maintenance bot. The current role of this bot is to add. But I'll look into adding that logic in follow-up bots.
As mentioned above, SDC does not currently have references. So we would have a peculiar situation where relating Met P180 info in SDC would be via qualifiers and on Wikidata with references. That seems to be a suboptimal situation. I suppose we could use both methods on Wikidata - both the qualifier and a reference and be a bit redundant. We do that with other things for artworks like collection/inventory number and inventory number/collection. Further discussion welcome.
My understanding is that stated in (P248) is for references only, so since I went the qualifier route, I went with determination method (P459). As for pointing to an exact CSV - have we done that for other cases in Wikidata? It seems like it may be overly specific to implementation details, since the same info is in several different places from The Met, whether it's CSV or their API. I think just stating that it's part of a project is enough in this case, and the details pointed to in Metropolitan Museum of Art Tagging Initiative (Q106429444). But I'm open to seeing other solutions. - Fuzheado (talk) 04:30, 28 April 2021 (UTC)

Regarding deprecated statements, I mean that the code should not touch deprecated P180 statements already in Wikidata. Sorry, I'm not sure what "SDC" refers to here, so I can't speak to whether not using a reference is appropriate. I personally have added links to exact versions of files/datasources. It's potentially useful but not critical. If we do go with a reference and stated in but not a reference URL, I would suggest adding a retrieved timestamp. I would also suggest adding both a qualifier and a reference if for some reason a qualifier is needed. BrokenSegue (talk) 04:39, 28 April 2021 (UTC)

I guess it's tempting to not use Wikidata's datamodel because some other system doesn't support it, but we have had not-so-positive experiences with external contractors trying to place a non-Wikidata model into Wikidata. If there is a problem with the datamodel at Commons, this should be resolved there; here, use the reference section for references. --- Jura 11:19, 29 April 2021 (UTC)
- @Fuzheado: Can you change into references? Lymantria (talk) 07:35, 8 May 2021 (UTC)

Looks good in general (except the point above). --- Jura 11:19, 29 April 2021 (UTC)

Hi Lymantria and Jura1 - I'm fine with implementing reference/source statements in addition to the qualifier as this is a reasonable dual solution. The case for the qualifier approach is still there, as we can imagine a variety of methods for the addition of depicts (P180) statements that would also benefit from being in qualifiers in addition to this case, including but not limited to:
- Tools - Utilities like ISA, Wiki Art Depiction Explorer, Wikidata Image Positions all add depiction info, and tracking these additions via a qualifier statement would be reasonable since a source/reference statement wouldn't be the right approach.
- Machine learning and AI - Similarly, automated tools and techniques using machine learning and AI are being used already to help add metadata to images/items, and tracking these in the qualifier is likely the right approach, versus a reference statement.

I should note that this type of approach is not new – there are a number of fields relevant to cultural heritage and digital humanities in Wikidata that are repeated in multiple places, such as collection (P195) or inventory number (P217), in the interest of discoverability and serving multiple approaches to modeling the data set. This is a good example of another situation where a dual approach is justified. Thanks. -- Fuzheado (talk) 04:41, 11 May 2021 (UTC)

It's probably normal to think that a given reference one adds is special and should be used everywhere, otherwise one would probably not add it in the first place, but that shouldn't mean one needs to add the same data three times to WMF projects.
The comparison with catalog/catalog code is helpful: there the qualifier is needed, because the information is split between the main statement and the qualifier. This is different here.
If there are some other tools that don't work correctly with references, maybe it's time to fix them. If they are only used on Commons, it's irrelevant for this bot request (this is Wikidata, not Commons with different Wikibase features). --- Jura 06:47, 16 May 2021 (UTC)

Even on Commons, it seems to be a mere GUI issue: phab:T230315 (found this by chance when searching for something else). --- Jura 11:44, 16 May 2021 (UTC)

Comment - There seems to be a consensus that this is a good approach to add both source statements, so if there's no objection, I'll start executing more runs of this bot. -- Fuzheado (talk) 15:44, 17 June 2021 (UTC)

@Fuzheado: This seems to be stale, is this still active? Perhaps @Ymblanter, Lymantria: could comment? Thanks. Mike Peel (talk) 21:57, 18 January 2022 (UTC)

I don't see any changes to Help:Sources that would support the approach planned by Fuzheado and the Commons problem they were trying to solve if the qualifier approach is being fixed on Commons. So no need to start a wave of duplication of references for the same statement. --- Jura 14:43, 19 January 2022 (UTC)

The above discussion is preserved as an archive. Please do not modify it. Subsequent comments should be made in a new section.