Wikidata talk:Mix'n'match

Synchronisation and feedback loops


Hi! In this thread I would like to describe how the synchronisation (sync) between Mix'n'match (MnM) and Wikidata (WD) currently works, the pros and cons of the present situation, and possible improvements.

The most frequent case is the following: a MnM entry is manually matched by a user to a WD item; consequently, the external ID contained in the MnM entry is added to the WD item. In this ideal case, the same match is present on MnM and WD, without sync issues.

Note that the reverse does not hold: if I remove a match from a MnM entry, the external ID is not removed from the WD item that has been unmatched; and if I remove an external ID from a WD item, the match is not removed from the MnM entry containing the removed ID. This is problematic because it creates a feedback loop, as I will show.

However, in some cases there can be sync issues, i.e. a match being present either only on MnM or only on WD.

W) Match only on WD: possible reasons	M) Match only on MnM: possible reasons
1W) the MnM entry is manually unmatched, but the ID isn't removed in the WD item (note: rare case)	1M) the ID is removed in the WD item, but the MnM entry isn't manually unmatched
2W) the ID is manually added to the WD item, not using MnM	2M) the MnM entry is matched by a user who isn't logged in (or the user is logged in, but MnM fails to make the corresponding edit on the WD item because it is overloaded)
	3M) the MnM entry is matched by the "Automatic name/date matcher" or the "Auxiliary data matcher"

In order to deal with sync issues, each MnM catalog has a sync page, accessible through the "Action" menu (e.g. https://mix-n-match.toolforge.org/?#/sync/4990 for https://mix-n-match.toolforge.org/?#/catalog/4990); any user can use the two sync functions it offers:

the first function imports matches from WD to MnM (solving column W); note: if an external ID is matched to WD item A, but its MnM entry is matched to WD item B, the conflict has to be resolved manually by the user
the second function imports matches from MnM to WD (solving column M)
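The comparison behind the two sync functions can be sketched as a diff between two mappings. This is only an illustration, assuming each side can be reduced to a mapping from external ID to WD item; the function name `diff_matches` and the sample data are hypothetical and do not reflect how Mix'n'match is actually implemented.

```python
def diff_matches(wd, mnm):
    """Compare the matches known to WD and to MnM.

    wd, mnm: dicts mapping external ID -> WD item (QID).
    Returns (only_wd, only_mnm, conflicts).
    """
    # Column W: match present only on WD (candidates for the first function)
    only_wd = {ext: qid for ext, qid in wd.items() if ext not in mnm}
    # Column M: match present only on MnM (candidates for the second function)
    only_mnm = {ext: qid for ext, qid in mnm.items() if ext not in wd}
    # Same external ID matched to different items: left to a human
    conflicts = {
        ext: (wd[ext], mnm[ext])
        for ext in wd.keys() & mnm.keys()
        if wd[ext] != mnm[ext]
    }
    return only_wd, only_mnm, conflicts


# Hypothetical sample data
wd = {"id1": "Q1", "id2": "Q2", "id4": "Q40"}
mnm = {"id2": "Q2", "id3": "Q3", "id4": "Q41"}
only_wd, only_mnm, conflicts = diff_matches(wd, mnm)
```

In this sketch `only_mnm` cannot distinguish case 1M from cases 2M and 3M, because the diff carries no history of who removed what; that is the root of the problem discussed in this thread.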

The first function usually causes no problems. Case 1W is rare (wrong matches found through MnM are, I think, rare, and a user who finds one usually removes it on both MnM and WD), so in practice the first function only solves case 2W. Assuming that IDs added manually by users not using MnM are correctly matched, the first function usually doesn't import any wrong match from WD to MnM.

The second function is used much more than the first, and this is where problems arise: while solving cases 2M and 3M is usually unproblematic, solving case 1M is usually problematic. In other words, through the second function:

1M) I import to WD matches that have been removed from WD, usually by users who considered them wrong but haven't removed them from MnM (probably because either they don't know MnM or they think that simply removing the ID from the WD item also removes the match from MnM, which is false); this is problematic, unless the ID has been removed from WD by a vandal (rare case)
2M) I import to WD matches made by other users through MnM; assuming that these matches are good, no problem
3M) I import to WD matches made by the "Automatic name/date matcher" or the "Auxiliary data matcher"; assuming that these matches are good, no problem; however, there could be a few problems for the following reasons:
1. "Automatic name/date matcher" can be deceived by homonyms (very rare)
2. "Auxiliary data matcher", using another ID as a bridge (= if ID A links to ID B and item A links to ID B too, then ID A can be matched with item A), can be deceived if the other ID used as a bridge is wrongly present either in the entry on MnM or in the item on WD (rare, but not impossible; e.g. the case of conflated VIAF ID (P214) clusters, often used as a bridge for matches made by the "Auxiliary data matcher")
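The bridge logic of point 2 can be illustrated with a minimal sketch. It assumes that a MnM entry carries auxiliary IDs as property/value pairs and that WD items expose their external-ID claims in the same form; `bridge_candidates` and the sample data are hypothetical, and the real "Auxiliary data matcher" may work differently.

```python
def bridge_candidates(entry_aux, wd_items):
    """Return the WD items that share at least one bridge ID with a MnM entry.

    entry_aux: dict property -> value carried by the entry, e.g. {"P214": "111"}.
    wd_items:  dict QID -> dict property -> set of values.
    """
    return sorted(
        qid
        for qid, claims in wd_items.items()
        if any(value in claims.get(prop, set()) for prop, value in entry_aux.items())
    )


# Hypothetical data: VIAF value "222" sits on two different items,
# as happens when a VIAF cluster conflates two people.
wd_items = {
    "Q1": {"P214": {"111"}},
    "Q2": {"P214": {"222"}},
    "Q3": {"P214": {"222"}},
}
clean = bridge_candidates({"P214": "111"}, wd_items)
ambiguous = bridge_candidates({"P214": "222"}, wd_items)
```

A matcher that only accepts a result with exactly one candidate would skip the ambiguous case here, but it would still be deceived when the wrong bridge ID is present on a single item, which is the scenario described above.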

Of course the crucial problem is case 1M: the user who uses the second sync function of MnM to import matches from MnM to WD often reimports to WD wrong matches that other users had previously removed manually from WD (these syncs usually cause justified complaints at WD:AN). However, if no user uses the second sync function, many good matches falling into cases 2M and 3M will never be added to WD, which is also a problem.

I conclude with some possible solutions:

Possible solution	Pros	Cons
1) when an external ID is removed from a WD item, the match is automatically removed from the MnM entry containing the removed ID	avoids case 1M, solving the main problem (best solution in my opinion)	if an external ID is removed from a WD item by a vandal (rare case), it will not be reimported from MnM through the second sync function
2) split the second sync function of MnM into two subfunctions: one to import to WD matches made by the "Automatic name/date matcher" or the "Auxiliary data matcher", the other to import to WD other matches	allows solving cases 2M and 3M (usually unproblematic) without touching 1M	the problem of case 1M remains unsolved
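To make the two proposals more concrete, here is a rough sketch. It assumes that each MnM entry records who made its match (the `matched_by` field) and that MnM can be notified when an external ID is removed from a WD item; both are assumptions made for illustration, and all names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Labels assumed for matches made by the automatic matchers
AUTOMATIC = {"Automatic name/date matcher", "Auxiliary data matcher"}


@dataclass
class Entry:
    ext_id: str
    qid: Optional[str]         # matched WD item, None if unmatched
    matched_by: Optional[str]  # user name or matcher label


def on_wd_removal(entries, ext_id, qid):
    """Solution 1: an ID removed from a WD item also unmatches the MnM entry."""
    for entry in entries:
        if entry.ext_id == ext_id and entry.qid == qid:
            entry.qid = None
            entry.matched_by = None


def split_for_sync(entries):
    """Solution 2: separate automatic matches from all other matches."""
    matched = [e for e in entries if e.qid is not None]
    automatic = [e for e in matched if e.matched_by in AUTOMATIC]
    other = [e for e in matched if e.matched_by not in AUTOMATIC]
    return automatic, other


# Hypothetical entries
entries = [
    Entry("a", "Q1", "SomeUser"),
    Entry("b", "Q2", "Auxiliary data matcher"),
    Entry("c", None, None),
]
on_wd_removal(entries, "a", "Q1")  # someone removed ID "a" from Q1 on WD
automatic, other = split_for_sync(entries)
```

After the removal is propagated, entry "a" no longer appears in either list, so the second sync function has nothing to reimport for it.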

I look forward to hearing your opinions! --Epìdosis 11:06, 5 January 2022 (UTC)

Nice in-depth explanation of Mix'n'match. I like the import feature a lot since I can do proper XML or wikitext parsing as well as washing using OpenRefine. With regard to situation 1M: before the feature to import and update catalogs was added in February 2021, there were a number of catalogs that suffered from a lack of «washing». One such catalog IIRC suffers from having some IDs that are URL-encoded. The fix up until now has been to remove the claim from Wikidata, but that method will not work anymore if such an action will in the future trigger an update in the catalog. I think automatic update of Mix'n'match catalogs is a good idea, so additionally I propose that going forward it should be possible for a catalog creator to update pretty much all aspects of their catalog, such as IDs and names, and even to remove entries. Since some of these operations are inherently destructive, an undo system should be implemented that is able to undo all operations done to the catalog as well as any operations that triggered a bulk update to Wikidata. Some sane limits should be put in place as well to prevent abuse. --Infrastruktur (talk) 22:09, 5 January 2022 (UTC)

Thanks for this explanation, Epìdosis. I like the first solution too. FogueraC (talk) 08:46, 17 January 2022 (UTC)

Excellent proposal. Reasoned and detailed. I would appreciate it if it were taken care of as soon as possible. See also one of the results of this situation - User talk:Bargioni#J9U. Geagea (talk) 09:47, 8 October 2023 (UTC)

Procedure and criteria for requests of catalogues' deactivation


As of now, there is no clear procedure (or set of criteria) for establishing whether a catalogue should be deactivated. Whilst the great majority of such requests, usually sent to User talk:Magnus Manske or more rarely to the talk pages of other Mix'n'match admins, can be managed in a non-controversial way (e.g. users asking to deactivate catalogues resulting from wrong imports, or no longer useful, or just obsolete and needing to be uploaded anew etc.), some cases are borderline and cannot be solved instantly (e.g. cases in which two catalogues partially overlap, or a catalogue is of dubious quality etc.). Ideally, the most complex requests would be discussed on a dedicated page (this talk page would be a fine place IMHO) and some criteria would be established in order to make such decisions less arbitrary. Opinions are welcome, as always, primarily from @Magnus Manske: of course! --Epìdosis 20:13, 31 January 2023 (UTC)

Data addition invoked by catalogues


Sometimes (most recent cases: Topic:Xdo8it3hp3iu0ke3 and Property talk:P2372#A problem with PB IDs) Reinheitsgebot adds wrong data on the basis of auxiliary data which are present in catalogue entries (e.g. this); auxiliary data are either entered in the catalogue at the moment of its scrape/upload or added automatically through the job aux2wd (this job is automatically performed by Mix'n'match on the entries of new catalogues: it copies parts of textual descriptions into auxiliary data). I would like to list here some possible causes of error, and related solutions, for future consideration. Each solution has some issues that I will try to list; one general issue, which I will mention only here but which is of the greatest relevance, is that after massive additions of wrong data due to Mix'n'match, removing them en masse is often difficult (i.e. wrong main values can be easily removed through QuickStatements, but wrong references necessarily require a bot intervention).

Mistakes can arise in four aspects (one faulty aspect is sufficient to introduce a mistake in an item; in a few cases, several can be faulty at the same time):

Aspect	Description of the problem	Proposed solution(s)	Present issues
1) source	the source might contain wrong information	if the source is generally reliable and has a few mistakes, the best solution is surely to deprecate the wrong statements with the qualifier reason for deprecated rank (P2241): error in referenced source or sources (Q29998666); if the source is not considered reliable enough, probably the best solution is deactivating the catalogue and massively removing the wrong statements based on it	no clear procedure for deactivating catalogues representing low-quality sources (in general, no clear procedure for deactivating catalogues: see second thread above)
2) catalogue	the catalogue might reproduce wrongly what the source contains	deactivate the catalogue and massively remove the wrong statements based on it; afterwards, upload a new catalogue that correctly reproduces what the source contains
3) aux2wd	the aux2wd process might wrongly interpret what the entry description contains (e.g. inferring a wrong death date due to misunderstanding of the context, like here; or simply misunderstanding a date, like here)	massively remove auxiliary data from the catalogue entries (keeping the catalogue itself) and remove the wrong statements based on them	no easy way of massively removing auxiliary data from the catalogue entries (no available job allows doing it)
4) match	an entry might be matched to the wrong item, due to vandalism (rare), human mistake (frequent) or assumptions based on auxiliary data, i.e. the entry contains an ID and it is matched to Wikidata on the basis of this ID (so: either the ID is misplaced in the entry, or in the Wikidata item, or is a conflation)	remove from the item the data based on the wrong match (both the ID and the statement(s) based on the auxiliary data) and remove the wrong match from the entry	users often don't realise that the presence of wrong data is due to Mix'n'match and only remove the ID and the statement(s) based on the auxiliary data, but not the wrong match from the entry; in this way, the problem will come back (in general, synchronisation is a problem: see first thread above)

Mistakes due only to a wrong match affect single items, while mistakes due to at least one of the other causes usually affect a high number of items.

Despite all these possible sources of mistakes, I think most data derived from auxiliary data are indeed correct and don't fall into any of these cases; but, since the mistakes exist and might be insidious, I list some possible improvements (some of them might not be easy to enact):

create a clear procedure for proposing the deactivation of catalogues, with a list of reasons that are sufficient for requesting deactivation (this would solve the second thread above)
create a job "purge auxiliary data", mirroring "purge automatches"; of course, once "purge auxiliary data" is run, the next run of "aux2wd" must not re-add the exact same auxiliary data that have been purged, otherwise the purge is useless
in cases of massive additions of wrong data: create an easy way to massively revert such additions (the best possibility would be EditGroups, I guess)
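The requirement in the second proposal, that a purge must not be undone by the next automatic run, could be met by remembering what was purged. The following is a minimal sketch under that assumption; the functions `purge_aux` and `readd_aux` and the data layout are hypothetical and say nothing about how Mix'n'match jobs are actually built.

```python
def purge_aux(entry_aux, purged):
    """Remove all auxiliary data from an entry, remembering what was removed.

    entry_aux: set of (property, value) pairs attached to one entry.
    purged:    set of (property, value) pairs purged so far for that entry.
    """
    purged.update(entry_aux)
    entry_aux.clear()


def readd_aux(entry_aux, candidates, purged):
    """Add freshly extracted pairs, skipping exactly those purged earlier."""
    for pair in candidates:
        if pair not in purged:
            entry_aux.add(pair)


# Hypothetical entry with a wrongly inferred date of death (P570)
aux = {("P570", "1900")}
purged = set()
purge_aux(aux, purged)
# The next automatic run extracts the same wrong date plus a new date of birth (P569)
readd_aux(aux, [("P570", "1900"), ("P569", "1850")], purged)
```

Only the new pair is added back; the purged one stays out because it is an exact repeat of data that was deliberately removed.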

I'm sure other improvements are also possible, along with these three proposals; feel free to add ideas. --Epìdosis 23:29, 3 March 2023 (UTC)
