
Previous discussion was archived at User talk:Hjfocs/Archive 1 on 2018-07-25.

TherasTaneel (talkcontribs)

Hi, on 1 August 2019 the bot added the Discogs artist ID (P1953) 1943818 (described as a "Danish bassist, guitarist") on Q57409084, yet the correct one would have been 6046575 ("Danish trumpet player").

I corrected the error. Hopefully this message improves your bot further.

Reply to "Discogs artist ID"
176.198.184.171 (talkcontribs)

Hi, I just re-added petosagan as the Twitter handle for Q309911. Your bot deleted it in October 2019. Was that an error, or did Sagan not have a Twitter account at that time? If it was an error, maybe you want to recheck that batch of deletions. Greetings, --~~~~

Hjfocs (talkcontribs)

Thank you anonymous user!

Reply to "Twitter user name for Peter Sagan"
Hiro (talkcontribs)

Hi Hjfocs, on 14 October, 2019, your bot made an error on the item ''New day'' (Q2372798). It added a link to the Discogs master-id 391172, which is about the album ''New day'' by the Belgian band Sweet Coffee (Q14422303). However, the correct id is 649443 which is about the album of the same name by the Belgian band Absynthe Minded (Q164298). It's easy to see how the error was made and I must say I am impressed by how close your bot came to the correct link.


I corrected the error. Perhaps this message helps you with improving the bot.

Hjfocs (talkcontribs)

@Hiro thanks a lot for your helpful work and your kind words, much appreciated! Cheers

Reply to "Bot made an error with Discogs"

Lots of error'd twitter handles

Summary by Hjfocs

[soweego version 1] in situ evaluation

BrokenSegue (talkcontribs)

Can you explain how your bot matched twitter handles to items? I'm seeing lots of very wrong matches. For example in Q45526049 it added a twitter handle to someone from ancient china. I've found hundreds of such examples.

BrokenSegue (talkcontribs)

huh, the bot has even added the same twitter handle to multiple different incorrect items e.g. Q45502815 and Q45607716 which should be impossible.

BrokenSegue (talkcontribs)
Hjfocs (talkcontribs)

Hi BrokenSegue, thanks a lot for spotting and fixing those obvious errors: that's very valuable for the bot, as it can learn from the mistakes it makes. The bot uploads Twitter identifiers that are considered confident by the underlying system, SocialLink. More specifically, we tried to filter out non-living individuals in the process, but unfortunately death dates are not always available, which is probably the main reason behind the obvious Ming dynasty errors you pointed out. On the other hand, I agree that Alex Gough (Q2114948) looks like a more reasonable error. Before reverting the whole batch of edits, could you please share more detailed information on the errors? Do you have any references to the hundreds of examples you found? That would be really useful. Thanks again for your time. Cheers!

BrokenSegue (talkcontribs)

Ah, ok. Yeah, so I may have miscounted the number of obvious errors. I've reverted at least a hundred through QuickStatements though (e.g. this batch). There are also some really difficult cases like Q51166600, where we have a Twitter username ("WeiYinChen16") that is used across 10 items. I'm not sure what kind of ML SocialLink is using, but it's clearly being too aggressive and I'm guessing it doesn't attempt any global optimizations (e.g. "well, this Twitter account matches these 5 items but it matches this one best").
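The kind of global optimization suggested above can be sketched as follows. This is only an illustration under assumptions: the function name, the (item, handle, score) tuple layout, and the scores in the usage example are made up, and nothing here reflects how SocialLink actually ranks its matches.

```python
from collections import defaultdict


def best_item_per_handle(candidates):
    """Keep, for each handle, only the item with the highest score.

    `candidates` is an iterable of (item_id, handle, score) tuples
    (a hypothetical layout). Handles whose top score is shared by
    several items are dropped as ambiguous instead of being uploaded.
    """
    by_handle = defaultdict(list)
    for item_id, handle, score in candidates:
        # Twitter handles are case-insensitive, so group on the lowercased form.
        by_handle[handle.lower()].append((score, item_id))

    resolved = {}
    for handle, scored in by_handle.items():
        scored.sort(reverse=True)  # highest score first
        top_score, top_item = scored[0]
        if len(scored) > 1 and scored[1][0] == top_score:
            continue  # tie at the top: no single best item
        resolved[handle] = top_item
    return resolved


# Usage with invented scores: only the best-scoring item keeps the handle.
matches = best_item_per_handle([
    ("Q708040", "WeiYinChen16", 0.9),
    ("Q51166600", "WeiYinChen16", 0.6),
])
```

Dropping ties rather than picking one arbitrarily trades recall for precision, which seems the safer default for an unattended bot.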


There's also a ton of Twitter account statements with the reference "stated in Twitter", which is very unhelpful. I see at one point you went back and swapped some out for a different reference, but there's still a ton with the old incorrect one (optimally the reference would mention SocialLink somehow). It would be helpful to know which entries were done using name matching / ML. Is there a reason they haven't all been changed?

Hjfocs (talkcontribs)

First, I'm really grateful that you reverted the ancient China batch.

Wei Chen (Q51166600) is probably wrong: as I can't understand Chinese, I can't judge the Twitter ID. But if we follow the Facebook link inside the Twitter one, the profile picture is clearly a baseball player, so Wei-Yin Chen (Q708040) should be the right one. I've removed the identifier from the other items.

Off the top of my head, this might be a corner case where SocialLink's confidence score is identical across the items. That said, while I agree we should avoid such 1-to-n links, Was a bee proposed an interesting alternative to keep them, see https://github.com/Wikidata/soweego/issues/374

In addition, there's an open ticket that aims at intercepting the constraint check reports, see https://github.com/Wikidata/soweego/issues/266.

Regarding the stated in (P248) reference node, it seems that the process in charge of converting into based on heuristic (P887) stopped unexpectedly. I'll have to investigate why.

BrokenSegue (talkcontribs)

Here is an example of an item that had a date of birth in the 5th century but was still assigned a twitter handle: Q3734999. Clearly something went wrong.

Hjfocs (talkcontribs)

Totally agree, of course. This is due to the lack of date of death. I acknowledge that we should apply a less trivial filter.
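One way to make the filter less trivial, sketched under assumptions: when the date of death is missing, fall back on the date of birth and a maximum plausible age. The function name, the 120-year cap, and the year-only inputs are illustrative choices, not soweego's actual logic; the only external fact used is that Twitter launched in 2006.

```python
from typing import Optional

TWITTER_LAUNCH_YEAR = 2006
MAX_PLAUSIBLE_AGE = 120  # assumption: nobody older than this holds an account


def may_have_twitter_account(birth_year: Optional[int],
                             death_year: Optional[int]) -> bool:
    """Return False when the dates rule out a Twitter account."""
    if death_year is not None:
        # Died before Twitter existed: cannot have an account.
        return death_year >= TWITTER_LAUNCH_YEAR
    if birth_year is not None:
        # No death date: use the birth date as a proxy.
        return birth_year + MAX_PLAUSIBLE_AGE >= TWITTER_LAUNCH_YEAR
    # Neither date known: be conservative and skip the item.
    return False
```

With this check, an item born in the 5th century and lacking a date of death, like the example above, would be skipped.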


Duplicate Twitter user names

Summary by Hjfocs

should be fixed

Tacsipacsi (talkcontribs)
Hjfocs (talkcontribs)

Hey @Tacsipacsi, thanks a lot for your feedback and apologies for the trouble, I didn't mean to interfere with your work at all! FYI, the edits you mention come from a QuickStatements batch upload, which seems to silently ignore potential duplicate statements differing only in case. I'm gonna fix this very soon.

Tacsipacsi (talkcontribs)

Thanks! I might have been a bit more upset than you deserve, because this is not the first such case (I mean overall, I don’t remember names from previous cases). It seems logical to me that QS doesn’t check for such semi-duplicates, as most things are case-sensitive, Twitter user names being an exception (and not the standard): external identifiers, image names, work titles and so on are usually case-sensitive.

Hjfocs (talkcontribs)

Soweego bot is on its way to delete the whole Twitter batch holding these errors. It will then put them back in a case-insensitive fashion.
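A case-insensitive duplicate check of the kind described above might look like the sketch below. The function name and the property set are hypothetical; P2002 (Twitter username) is listed only as an example of a case-insensitive identifier, and every other property is compared exactly.

```python
# Illustrative set of properties whose values ignore letter case.
CASE_INSENSITIVE_PROPERTIES = {"P2002"}  # Twitter username


def is_duplicate(prop, new_value, existing_values):
    """Tell whether `new_value` is already among `existing_values`."""
    if prop in CASE_INSENSITIVE_PROPERTIES:
        # Compare on a case-folded form so "PetoSagan" equals "petosagan".
        folded = {value.casefold() for value in existing_values}
        return new_value.casefold() in folded
    # Most identifiers are case-sensitive: compare exactly.
    return new_value in existing_values
```

Running this before each upload would skip a statement that differs from an existing one only in case, while leaving case-sensitive identifiers untouched.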

Tacsipacsi (talkcontribs)

I hope it will finish soon, as sometimes the two runs managed to remove pre-existing Twitter user names, e.g. here. Next time probably it will be better to use EditGroups’s revert feature.

Hjfocs (talkcontribs)

I didn't know that tool, so I opted for a bot deletion. I've now stopped the bot. Thanks a lot for pointing that out!

Tacsipacsi (talkcontribs)

Sorry, but the current database state is the worst, as inconsistent as possible. Some items (like the above-cited) have all statements removed without replacement, while others (e.g. Q29315) still have doubled ones. As far as I see, the only option now is to finish the removal and then re-add the correct statements.

Hjfocs (talkcontribs)

New page for catalogues

Summary by Hjfocs

stale

Adam Harangozó (talkcontribs)

Hi, I created a new page for collecting sites that could be added to Mix'n'match and I plan to expand it with the ones that already have scrapers by category. Feel free to expand, use for property creation. Best, --Adam Harangozó (talk) 19:21, 26 October 2019 (UTC)

Hjfocs (talkcontribs)

Thanks for pointing out your work, looks useful. Just curious, how did you build the list of catalogs?

Adam Harangozó (talkcontribs)

Thanks! I've Googled a lot for online encyclopedias, also found a lot on digital humanities websites and in link collections under entries of bigger databases like Deutsche Biographie.

Hjfocs (talkcontribs)
Hjfocs (talkcontribs)

(or just let me know, and I can take care of that)

Adam Harangozó (talkcontribs)

Thank you, I'll check it!

Questions in Natural Language

Summary by Hjfocs

no further discussion

Hogü-456 (talkcontribs)

Hello Hjfocs,

I read your proposal and the information in the project chat. I read on your user page that you work on Natural Language Processing. This is an interesting field of work. Do you think it is possible to use it for creating lists out of Wikidata? I think it can be used for querying the content of Wikidata. I tried something like that on my own about a year ago and it worked very well for some things. I think it is possible to query most of the content of Wikidata in that way, and most queries can be described in much the same way in a natural-language sentence. Are you interested in creating a project for that? I think it is possible to do that without artificial intelligence or neural networks. An example of how the queries could look is the tool Wikistats Version 2. There you can enter predefined questions and then see the results. It would be great if this technology could be used in a modified way in Wikidata. For example, you could enter the question "How many pages have been edited in Wikidata?" and then get the result. In this example, the word "Wikidata" is the variable part of the question: it can be changed to whichever other wiki the user entering the query is interested in. I hope you understand what I mean.
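The template idea above, with a fixed question and one variable slot, can be sketched without any machine learning. Everything in this example is hypothetical: the template list, the "edited_pages" label, and the function name are illustrations, not part of Wikistats or Wikidata.

```python
import re

# Each template pairs a fixed question pattern with a metric label.
# The named group `wiki` is the variable part of the question.
TEMPLATES = [
    (
        re.compile(
            r"how many pages have been edited in (?P<wiki>[\w\s-]+)\?",
            re.IGNORECASE,
        ),
        "edited_pages",
    ),
]


def parse_question(question):
    """Map a question to a structured request, or None if no template fits."""
    for pattern, metric in TEMPLATES:
        match = pattern.fullmatch(question.strip())
        if match:
            return {"metric": metric, "wiki": match.group("wiki").strip()}
    return None


request = parse_question("How many pages have been edited in Wikidata?")
```

The structured request could then be handed to whatever backend computes the metric; supporting a new question only means adding a template.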

Hjfocs (talkcontribs)

Hi there and thanks for reaching out!

It sounds like you are talking about a specific research area, namely question answering over knowledge bases. I'm not an expert in that particular task, but you can start from en:Question_answering to grasp the concepts. You can also have a look at Christina's, Axel's and Philipp's work for a deeper dive.

Hope this helps!

P.S.: if you like m:Grants:Project/Hjfocs/soweego_2, please consider an endorsement!

Primary source tool doesn't work

Summary by Hjfocs

stale, the user who reached out never sent an e-mail, despite my request

JLuzc (talkcontribs)

Hello!

I would like to contribute some information to Wikidata. This information consists of a list of triples like <subject, predicate, object> with existing entities, and I would like to use the Primary sources tool; however, it is not working. Will it be fixed soon, or is there another option?

Thanks for your help!

@ChristianKl

Hjfocs (talkcontribs)

Hi there, thanks for reaching out. Here's some background on the tool:

  • it is currently broken because its backend is broken. This is version 1, developed by Sebastian. You should contact him to get more insights;
  • I developed version 2, with a totally rewritten backend and a frontend as a MediaWiki extension;
  • the MediaWiki extension went into a never-ending review process, for which extra resources should be allocated. Right now I don't have resources to address the review.

The result is:

  • the deployed tool is still the old version;
  • the old backend is the bottleneck;
  • resources are needed to address the MediaWiki extension review.

The fastest fix would be to move the MediaWiki extension JavaScript modules to the frontend version 1 code.

If you are interested, let's talk in more depth. Cheers!

JLuzc (talkcontribs)

Thanks @Hjfocs! How can I help with this: "move the MediaWiki extension JavaScript modules to the frontend version 1 code"?

Hjfocs (talkcontribs)

Please send me an e-mail and we can schedule a call.

Summary by Hjfocs

Cope with conversion of identifiers into full URLs, Bandcamp case

Meisam (talkcontribs)
Hjfocs (talkcontribs)
Hjfocs (talkcontribs)
Meisam (talkcontribs)

Thanks mate! Cheers 🍺

Duplicate ID added by Soweego-bot

Summary by Hjfocs

Twitter IDs are case-insensitive, so an extra check was implemented

Jean-Frédéric (talkcontribs)

Hi,

In Special:Diff/782722068, Soweego-bot added a duplicate Facebook ID, as Facebook IDs are case-insensitive.

Hope that helps!

Hjfocs (talkcontribs)

Hello Jean-Frédéric, thanks for your feedback, really valuable. The bot currently checks exact claim values: are you aware of other IDs that behave like Facebook? Cheers!

Hjfocs (talkcontribs)
Meisam (talkcontribs)