Previous discussion was archived at User talk:Hjfocs/Archive 1 on 2018-07-25.

Soweego adding wrong IMDb IDs

Summary by Hjfocs

[soweego 2] MusicBrainz (Q14005) URLs validation: bad URL in one correct record, fixed by the reporter (thanks!)

Capmo (talkcontribs)
Hjfocs (talkcontribs)

Issues with Last.fm extractor

Summary by Hjfocs

[soweego 2] MusicBrainz (Q14005) URLs validation: delete percent-encoded IDs & put back decoded ones; use pluses instead of whitespaces for Last.fm ID (P3192) values

Lockal (talkcontribs)

Hi, could you reevaluate recently imported LastFM ids, please? It was broken and now we have 34 Johns.

Lockal (talkcontribs)

Another note: now that "%" is allowed in the extraction pattern, could you automatically convert all %20 (" ") to "%2B" ("+") in Last.fm IDs (and only for Last.fm IDs)? Both space and plus encodings work for last.fm (even a double-encoding monstrosity like Kevin%2520Macleod works), but pluses are canonical there.

Hjfocs (talkcontribs)

I'm very grateful for your regular feedback, that's really precious. Here are the actions taken:

  • all the bad IDs resulting from the SPARQL query you pointed out are now deleted;
  • all percent-encoded IDs are now replaced with decoded ones;
  • all Last.fm ID (P3192) IDs now have pluses instead of whitespaces.
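
The decoding and plus-for-whitespace steps listed above can be sketched as follows. This is a minimal illustration, not the bot's actual code: the function name `normalize_lastfm_id` and the `max_rounds` cap are assumptions made for this example, and repeated decoding would also alter an ID that legitimately contains a literal percent sequence.

```python
from urllib.parse import unquote


def normalize_lastfm_id(raw_id: str, max_rounds: int = 5) -> str:
    """Hypothetical helper: undo (possibly nested) percent-encoding,
    then use pluses instead of whitespaces."""
    decoded = raw_id
    # Decode repeatedly so that double encoding (%2520) is also undone;
    # the cap on rounds is an arbitrary safety limit.
    for _ in range(max_rounds):
        step = unquote(decoded)
        if step == decoded:
            break
        decoded = step
    return decoded.replace(" ", "+")


print(normalize_lastfm_id("Kevin%2520Macleod"))  # Kevin+Macleod
```

An ID that already uses pluses passes through unchanged, since `unquote` does not touch the plus sign.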
Summary by Hjfocs

[soweego 2] MusicBrainz (Q14005) URLs validation: use pluses instead of whitespaces for Last.fm ID (P3192) values, see feedback

RZuo (talkcontribs)
Hjfocs (talkcontribs)

Thanks for the heads up, much appreciated. The bot deleted that identifier because it contained a percent-encoded string, as raised here. The next step is to add those identifiers back properly decoded.

Hjfocs (talkcontribs)

I can confirm the identifier statement is now correct, see Q7766064#P3192. Cheers!

Issue with WorldCat IDs

Summary by Hjfocs

[soweego 2] MusicBrainz (Q14005) URLs validation: bad ID extraction, fixed by the reporter (thanks!)

Belteshassar (talkcontribs)

Hey! What is this supposed to be? I currently get 320 similar values with this search.

Belteshassar (talkcontribs)

I removed all the 320 statements now, but you should fix your bot so it doesn't add more of these. Have a great day!

Hjfocs (talkcontribs)

Many thanks for your valuable action! The bad values you spotted come from an extraction of identifiers from MusicBrainz (Q14005) URLs, which went wrong.

Hjfocs (talkcontribs)

I double-checked the datasets pending upload, and no statements seem to contain the flawed value you raised, so everything should be fine.

Soweego bot adding invalid fandom pages

Summary by Hjfocs

[soweego 2] MusicBrainz (Q14005) URLs validation: bad ID extraction & percent-encoded URLs

BrokenSegue (talkcontribs)

The bot seems to be adding the value "lyrics" as a Fandom article ID (P6262), which is invalid. Can some extra filtering be done to prevent this from happening in the future?

BrokenSegue (talkcontribs)
Hjfocs (talkcontribs)

Many thanks for reporting this: I've just stopped the bot, will look into those bad IDs, and will delete those that were uploaded.

Hjfocs (talkcontribs)

I got rid of additional bad IDs, so they shouldn't be added anymore. The bot has now restarted. While deleting the uploaded ones, I noticed you had already taken care of them, thanks again for your action! One question: what was your approach? I see you used QuickStatements, and was wondering how. Cheers!

BrokenSegue (talkcontribs)

I used a SPARQL query to find all items that linked to the bad fandom page and then a one-line Bash script to convert the results into a QuickStatements file that removes them.
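
A rough Python equivalent of that conversion step might look like the sketch below. It assumes the SPARQL results have already been reduced to a list of item IDs, and that the QuickStatements version 1 text format is used, where a leading minus sign marks a statement for removal. The function name `removal_batch` and the item IDs in the example are placeholders, not the actual data from this thread.

```python
def removal_batch(qids, prop, value):
    """Hypothetical helper: build one tab-separated QuickStatements
    removal command per item for a string-valued statement."""
    lines = []
    for qid in qids:
        # Leading "-" asks QuickStatements to remove the statement;
        # string values are wrapped in double quotes.
        lines.append(f'-{qid}\t{prop}\t"{value}"')
    return "\n".join(lines)


# Placeholder item IDs standing in for the SPARQL query results.
print(removal_batch(["Q4115189", "Q13406268"], "P6262", "lyrics"))
```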

The thing I don't get about this error by your bot is how it parsed out just "lyrics" as a fandom article link. That isn't the correct format for fandom articles but it is the correct format for Fandom wiki ID (P4073).

Hjfocs (talkcontribs)

I like your workflow!

The bot tries its best to parse URLs into proper Wikidata IDs, but there's a jungle of URL variations, bad regexp matches, multiple matching groups, and the like, so it's not perfect. In this specific case:

  • a couple of hundred fandom URLs available in MusicBrainz (out of a total on the order of 10^5) matched the second regexp in Property:P6262#P8966;
  • the regexp has 2 matching groups;
  • the bot didn't consider the URL match replacement value (P8967) qualifier stated on the regexp;
  • it took the first matching group as the ID value.
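
The failure mode in the list above can be reproduced with a small sketch. The pattern and replacement below are simplified stand-ins invented for this example; they are not the actual values stored on Property:P6262, and the URL is made up.

```python
import re

# Assumed, simplified pattern with two capture groups (wiki name, article).
PATTERN = re.compile(r"^https?://([a-z0-9-]+)\.fandom\.com/wiki/([^?#]+)$")
# Assumed replacement value that combines both groups into one ID.
REPLACEMENT = r"\1:\2"


def first_group_only(url):
    """What went wrong: only the first capture group is kept."""
    match = PATTERN.match(url)
    return match.group(1) if match else None


def with_replacement(url):
    """Intended behaviour: apply the replacement template to the match."""
    match = PATTERN.match(url)
    return match.expand(REPLACEMENT) if match else None


url = "https://lyrics.fandom.com/wiki/Some_Artist"
print(first_group_only(url))   # lyrics
print(with_replacement(url))   # lyrics:Some_Artist
```

Taking only the first group yields the bare wiki name, which is how a value like "lyrics" can end up as an article ID.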
BrokenSegue (talkcontribs)
Lockal (talkcontribs)

External id builder for Fandom is still broken.

I have also a general request: could you validate extracted identifiers with ? It would solve the problem. Also I just know that MusicBrainz has no validators there, so it would protect Wikidata against ill-formed (accidentally or intentionally) data. --Lockal (talk) 08:02, 21 September 2021 (UTC)

Lockal (talkcontribs)

Similar problem with NicoNicoPedia ID (P6900) in - extractor should remove URL encoding (same thing applies to Fandom.com) and, I suppose, any other identifier.

Hjfocs (talkcontribs)

Thanks a lot for reporting these issues, really appreciated. I'll file them in my tracker, and take the due actions soon.

Summary by Hjfocs

[soweego 2] MusicBrainz (Q14005) URLs validation: extraction of a more coarse-grained ID (only happened in 2 statements)

Horcrux (talkcontribs)
Hjfocs (talkcontribs)

Soweego bot adding invalid GND ids

Summary by Hjfocs

[soweego 2] MusicBrainz (Q14005) URLs validation: bad ID extraction

CamelCaseNick (talkcontribs)

VIAF contains GND entries, and the IDs listed there are usually the GND IDs themselves. However, some GND IDs contain hyphens, and those are replaced by numeric IDs usually starting with a 0 or a 9.

Those are not GND IDs. Your bot is importing those invalid ones from MusicBrainz. (e.g. see Special:Diff/1483617303) They do not match the regular expression. The easiest approach would be to ignore them and maybe list them somewhere to fix them here and in MusicBrainz.

A more advanced and complex approach would be to check for a VIAF-processed entry and extract the correct ID, which is in there.
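
The easy approach (skip values that fail the format check and list them for later fixing) could be sketched like this. The pattern is a toy stand-in that only rejects values starting with 0 or 9; it is not the real GND format constraint, and both the function name `split_by_format` and the sample values are made up for illustration.

```python
import re

# Toy stand-in for a property's format regular expression (assumption).
TOY_FORMAT = r"[1-8][0-9X-]*"


def split_by_format(values, format_regex):
    """Hypothetical helper: separate values that fully match the format
    from those that do not, so the rejects can be reported upstream."""
    pattern = re.compile(format_regex)
    valid, rejected = [], []
    for value in values:
        # fullmatch ensures the whole value conforms, not just a prefix.
        if pattern.fullmatch(value):
            valid.append(value)
        else:
            rejected.append(value)
    return valid, rejected


valid, rejected = split_by_format(
    ["123456789", "4000000-0", "0123456789", "9876543210"], TOY_FORMAT
)
print(valid)     # ['123456789', '4000000-0']
print(rejected)  # ['0123456789', '9876543210']
```

Only the `valid` list would be uploaded; the `rejected` list is what could be shared with the MusicBrainz maintainers.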

Hjfocs (talkcontribs)

Thanks again for your precious help on these troublesome IDs: I have stopped the bot, will fix the bad IDs, and will prevent their addition in future runs. I'll stick with the easy approach you propose, since I believe it would also be very fruitful in terms of feedback loops with the MusicBrainz maintainers. By the way, this is one of the main goals of the soweego project, see m:Grants:Project/Hjfocs/soweego_2#Goals. Cheers!

CamelCaseNick (talkcontribs)

I have found another malformed identifier: the LoC authority file web access uses the auth URI with an .html suffix, which shouldn't be part of the ID. See Special:Diff/1483536843.

Hjfocs (talkcontribs)

I checked the bot datasets and found 2 such malformed IDs in total. It's great that you have already fixed them, thank you so much for your work! I deleted all the problematic GND IDs, too. Cheers!

Summary by Hjfocs

[soweego 1] feedback on the linker

TherasTaneel (talkcontribs)

Hi, on 1 August 2019 the bot added the Discogs artist ID (P1953) 1943818 (described as a "Danish bassist, guitarist") on Q57409084, yet the correct one would have been 6046575 ("Danish trumpet player")

I corrected the error. Hopefully this message improves your bot further.

Hjfocs (talkcontribs)

Twitter user name for Peter Sagan

Summary by Hjfocs

[soweego 1] Twitter

176.198.184.171 (talkcontribs)

Hi, I just re-added petosagan as the Twitter handle for Q309911. Your bot deleted that in October 2019. Was that an error, or did Sagan not have a Twitter account at that time? In the first case, maybe you want to recheck that batch of deletions. Greetings

Hjfocs (talkcontribs)

Thank you anonymous user!

Jura1 (talkcontribs)
Hjfocs (talkcontribs)
Jura1 (talkcontribs)

I think discussions should be on-wiki. I can put it on Wikidata:Project chat if you prefer. BTW, it seems that you are resolving the numeric IDs to names, so it should be simple to include them.

Hjfocs (talkcontribs)

Yes, it's probably not a big deal, although it may require some work. That's why I asked you to file a ticket on the code repository. Not a problem anyway, I'll look into that together with the stated in (P248) thing.

Jura1 (talkcontribs)

I think one could add them easily with a Quickstatements batch afterwards ..

Hjfocs (talkcontribs)

Sounds good.