Soweego adding wrong IMDb IDs
Thank you for your action and the notification. It looks like the ID you mentioned comes from a wrong URL in an otherwise correct MusicBrainz (Q14005) record: https://musicbrainz.org/artist/2810d448-388e-4205-b64c-071416d4d518. We're in contact with the MusicBrainz team to find a strategy for URL maintenance.
Issues with Last.fm extractor
Another note: now that "%" is allowed in the extraction pattern, could you automatically convert all %20 (" ") to "%2B" ("+") in Last.fm IDs (and only in Last.fm IDs)? Both space and plus encodings work for Last.fm (even a double-encoding monstrosity like Kevin%2520Macleod works), but pluses are canonical there.
I'm very grateful for your regular feedback, that's really precious. Here are the actions taken:
- all the bad IDs resulting from the SPARQL query you pointed out are now deleted;
- all percent-encoded IDs are now replaced with decoded ones;
- all Last.fm ID (P3192) IDs now have pluses instead of whitespaces.
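For reference, the last two clean-up steps above can be sketched in a few lines of Python. This is only a minimal sketch, not the bot's actual code: the function names `decode_fully` and `normalize_lastfm_id` are hypothetical, and it assumes that no legitimate ID contains a literal "%" followed by two hex digits.

```python
from urllib.parse import unquote


def decode_fully(value: str, max_rounds: int = 5) -> str:
    """Percent-decode repeatedly until the value stops changing.

    Handles double encoding such as 'Kevin%2520Macleod'.
    max_rounds is an arbitrary safety limit (assumption).
    """
    for _ in range(max_rounds):
        decoded = unquote(value)
        if decoded == value:
            break
        value = decoded
    return value


def normalize_lastfm_id(raw_id: str) -> str:
    """Decode a Last.fm ID, then use pluses instead of spaces."""
    return decode_fully(raw_id).replace(" ", "+")


print(normalize_lastfm_id("Kevin%2520Macleod"))  # Kevin+Macleod
print(normalize_lastfm_id("Kevin%20Macleod"))    # Kevin+Macleod
```

The space-to-plus replacement is kept separate from the decoding step because, as requested, it should apply to Last.fm IDs only, while decoding applies to every catalog.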
Hi! Special:Diff/1505768662 removed a P3192 value that seems to be correct: I checked the website.
Thanks for the heads up, much appreciated. The bot deleted that identifier because it contained a percent-encoded string, as raised here. The next step is to add those identifiers back properly decoded.
I can confirm the identifier statement is now correct, see Q7766064#P3192. Cheers!
Issue with WorldCat IDs
I removed all the 320 statements now, but you should fix your bot so it doesn't add more of these. Have a great day!
Many thanks for your valuable action! The bad values you spotted come from an extraction of identifiers from MusicBrainz (Q14005) URLs, which went wrong.
I double-checked the datasets pending upload, and no statements seem to contain the flawed value you raised, so everything should be fine.
Soweego bot adding invalid fandom pages
The bot seems to be adding the value "lyrics" for Fandom article ID (P6262), which is invalid. Can some extra filtering be done to prevent this from happening in the future?
Example of this happening: https://www.wikidata.org/w/index.php?title=Q344822&diff=1485652992&oldid=1479804413
Many thanks for reporting this: I've just stopped the bot, will look into those bad IDs, and will delete those that were uploaded.
I got rid of additional bad IDs, so they shouldn't be added anymore. The bot has now restarted. While deleting the uploaded ones, I noticed you had already taken care of them, thanks again for your action! One question: how did you do it? I see you used QuickStatements, and was wondering how. Cheers!
I used a SPARQL query to find all items that linked to the bad fandom page and then a one line bash script to convert that to a quickstatement file that removes them.
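A rough Python equivalent of the conversion step described above, for anyone who wants to reuse the workflow. It is only a sketch: the function name `to_removal_batch` is hypothetical, the input rows are assumed to come from the SPARQL result, and it relies on the QuickStatements V1 convention that a leading "-" removes a statement.

```python
def to_removal_batch(rows, property_id="P6262"):
    """Turn (item ID, bad value) pairs into QuickStatements V1 removal lines.

    Each line has the form: -<item> TAB <property> TAB "<string value>"
    """
    return "\n".join(
        f'-{qid}\t{property_id}\t"{value}"' for qid, value in rows
    )


# The item and value below are the ones from the example diff in this thread.
print(to_removal_batch([("Q344822", "lyrics")]))
```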
The thing I don't get about this error by your bot is how it parsed out just "lyrics" as a fandom article link. That isn't the correct format for fandom articles but it is the correct format for Fandom wiki ID (P4073).
I like your workflow!
The bot tries its best to parse URLs into proper Wikidata IDs, but it looks like there's a jungle of URL variations, bad regexps matches, multiple matching groups, and the like, so it's not perfect. In this specific case:
- a couple of hundred fandom URLs available in MusicBrainz (the total has order of magnitude 10^5) matched the second regexp in Property:P6262#P8966;
- the regexp has 2 matching groups;
- the bot didn't consider the URL match replacement value (P8967) qualifier attached to the regexp;
- it took the first matching group as the ID value.
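To illustrate the mechanism in the list above, here is a minimal Python sketch. The pattern and the replacement string are hypothetical stand-ins, not the actual values stored in Property:P6262, and `extract_id` is not the bot's real function. It only shows how taking the first group differs from applying a replacement value.

```python
import re

# Hypothetical URL pattern with two matching groups (wiki name, page title).
URL_PATTERN = re.compile(r"^https?://([a-z0-9.-]+)\.fandom\.com/wiki/([^?#]+)")
# Hypothetical replacement value in the backreference style used by P8967.
REPLACEMENT = r"\1:\2"


def extract_id(url, pattern=URL_PATTERN, replacement=REPLACEMENT):
    """Extract an ID from a URL, honouring the replacement value if given."""
    match = pattern.match(url)
    if match is None:
        return None
    if replacement is None:
        # Behaviour described above: only the first group is kept.
        return match.group(1)
    return match.expand(replacement)


url = "https://lyrics.fandom.com/wiki/Queen"
print(extract_id(url, replacement=None))  # lyrics
print(extract_id(url))                    # lyrics:Queen
```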
ok sounds like you should add support for URL match replacement value (P8967). it wouldn't have fixed this case but in general it's important.
I have also a general request: could you validate extracted identifiers with ? It would solve the problem. Also I just know that MusicBrainz has no validators there, so it would protect Wikidata against ill-formed (accidentally or intentionally) data. --Lockal (talk) 08:02, 21 September 2021 (UTC)
Similar problem with NicoNicoPedia ID (P6900) in - extractor should remove URL encoding (same thing applies to Fandom.com) and, I suppose, any other identifier.
Thanks a lot for reporting these issues, really appreciated. I'll file them in my tracker, and take the due actions soon.
Hello! Please do not add Treccani ID (P3365) when you can use more specific properties such as Biographical Dictionary of Italian People ID (P1986). See also https://w.wiki/3$y7. Thanks,
Thank you for the heads up and for fixing that statement. I've just checked the dataset that is being uploaded by the bot, and can confirm there will be only one statement like this. I've manually updated it to use Biographical Dictionary of Italian People ID (P1986). Cheers!
Soweego bot adding invalid GND ids
VIAF contains GND entries, and the IDs listed there are usually the GND IDs themselves. However, some GND IDs contain hyphens, and those are replaced by numeric IDs usually starting with a 0 or a 9.
Those are not GND IDs. Your bot is importing those invalid ones from MusicBrainz. (e.g. see Special:Diff/1483617303) They do not match the regular expression. The easiest approach would be to ignore them and maybe list them somewhere to fix them here and in MusicBrainz.
Another more advanced and complex approach would be to check for a VIAF processed entry and extract the correct ID, that is in there.
Thanks again for your precious help on these troublesome IDs: I have stopped the bot, will fix the bad IDs, and will prevent their addition in future runs. I'll stick with the easy approach you propose, since I believe it would also be very fruitful in terms of feedback loops with the MusicBrainz maintainers. By the way, this is one of the main goals of the soweego project, see m:Grants:Project/Hjfocs/soweego_2#Goals. Cheers!
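The easy approach could look roughly like the sketch below: check every extracted ID against the property's format regular expression, keep the matches, and list the rest for later fixing. The function name `split_by_format`, the pattern, and the sample values are all made up for illustration; the pattern is not the real GND format constraint.

```python
import re

# Illustrative pattern only (assumption), NOT the actual GND format regexp.
ILLUSTRATIVE_PATTERN = r"[1-9][0-9]*(?:-[0-9X])?"


def split_by_format(ids, pattern):
    """Split IDs into (valid, rejected) using a full regexp match."""
    regex = re.compile(pattern)
    valid, rejected = [], []
    for identifier in ids:
        if regex.fullmatch(identifier):
            valid.append(identifier)
        else:
            rejected.append(identifier)
    return valid, rejected


# Made-up sample values, not real identifiers.
valid, rejected = split_by_format(
    ["123456789", "1234567-8", "012345678"], ILLUSTRATIVE_PATTERN
)
print(valid)     # ['123456789', '1234567-8']
print(rejected)  # ['012345678']
```

The rejected list is what would be reported back to the MusicBrainz maintainers.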
I have found another malformed identifier: the LoC authority file web access uses the auth URI with an .html suffix, which shouldn't be part of the ID. See Special:Diff/1483536843.
I checked the bot datasets and found 2 such malformed IDs in total. It's great that you have already fixed them, thank you so much for your work! I deleted all the problematic GND IDs, too. Cheers!
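For the .html suffix mentioned above, a small sketch of an extraction that drops it. The URL shape, the function name `extract_loc_id`, and the sample ID are assumptions made for illustration, not the bot's actual pattern.

```python
import re

# Assumed URL shape for LoC name authority records; the trailing .html is optional.
LOC_URL = re.compile(
    r"^https?://id\.loc\.gov/authorities/names/([^/?#]+?)(?:\.html)?$"
)


def extract_loc_id(url):
    """Return the LoC ID without any .html suffix, or None if no match."""
    match = LOC_URL.match(url)
    return match.group(1) if match else None


# Placeholder sample ID, used only to show the suffix handling.
print(extract_loc_id("https://id.loc.gov/authorities/names/n79021164.html"))  # n79021164
print(extract_loc_id("https://id.loc.gov/authorities/names/n79021164"))       # n79021164
```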
Discogs artist ID
[soweego 1] feedback on the linker
Hi, on 1 August 2019 the bot added the Discogs artist ID (P1953) 1943818 (described as a "Danish bassist, guitarist") on Q57409084, yet the correct one would have been 6046575 ("Danish trumpet player")
I corrected the error. Hopefully this message improves your bot further.
@TherasTaneel: thanks for your action, that really helps.
Twitter user name for Peter Sagan
[soweego 1] Twitter
Hi, I just readded petosagan as Twitter handle for Q309911. Your bot deleted that in October 2019. Was that an error, or did Sagan not have a Twitter account at that time? In the first case, maybe you want to recheck that batch of deletions. Greetings, --~~~~
Thank you anonymous user!
If you add them (Soweego_bot), would you include the numeric id? See Wikidata:Bot_requests#Add_numeric_id_to_Twitter_username_(P2002)
I think discussions should be onwiki .. I can put it on Wikidata:Project chat if you prefer. BTW, it seems that you are resolving the numeric ids to names .. so it should be simple to include them.
Yes, it's probably not a big deal, although it may require some work. That's why I asked you to file a ticket on the code repository. Not a problem anyway, I'll look into that together with the stated in (P248) thing.
I think one could add them easily with a Quickstatements batch afterwards ..