About this board

Previous discussion was archived at User talk:Magnus Manske/Archive 9 on 2015-08-10.

MargaretRDonald (talkcontribs)
Reply to "sourcemd"
Kolja21 (talkcontribs)

Hello Magnus, for weeks a blocked user has been signing up regularly under new usernames and making thousands of faulty edits with Mix'n'match. No improvement is in sight; see User talk:Lizofon#Erroneous batch from Mix'n'Match. Could you restrict the use of Mix'n'match for newly created accounts?

Kirilloparma (talkcontribs)

 Comment: This could be block evasion by the indefinitely blocked user Matlin. See this thread for more details. I'm also wondering whether there is any way to restrict Mix'n'Match access for Matlin and his new sockpuppet accounts. Regards

Magnus Manske (talkcontribs)

replied there

Solidest (talkcontribs)

I posted there that the code changes didn't seem to work, but got no response. Here's another case of an account registered four days ago doing things on MnM: https://mix-n-match.toolforge.org/#/rc/5708 (pretty sure it's another of his sockpuppets)
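
For illustration of Kolja21's request above: account age is queryable via the MediaWiki API (list=users with usprop=registration), so a tool could in principle refuse actions from very new accounts. A minimal Python sketch; the 30-day threshold and the idea of wiring this into Mix'n'match are assumptions for illustration, not existing behavior:

```python
from datetime import datetime, timezone
import json
import urllib.parse
import urllib.request

API = "https://www.wikidata.org/w/api.php"

def fetch_registration(username: str):
    """Fetch an account's registration timestamp (ISO 8601) from the
    MediaWiki API (list=users, usprop=registration)."""
    params = urllib.parse.urlencode({
        "action": "query", "list": "users", "ususers": username,
        "usprop": "registration", "format": "json",
    })
    with urllib.request.urlopen(f"{API}?{params}") as resp:
        data = json.load(resp)
    users = data.get("query", {}).get("users", [])
    return users[0].get("registration") if users else None

def is_too_new(registration_iso: str, now: datetime, min_days: int = 30) -> bool:
    """True if the account is younger than min_days.
    The 30-day threshold is an invented example, not MnM policy."""
    registered = datetime.fromisoformat(registration_iso.replace("Z", "+00:00"))
    return (now - registered).days < min_days
```

Whether such a check could be hooked into MnM's login is for Magnus to say; this only shows that the registration date is readily queryable.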

Reply to "Mix'n'match-Missbrauch"
Solidest (talkcontribs)
Magnus Manske (talkcontribs)

Yes, Toolforge is switching from compute grid to kubernetes, and guess who has to adapt dozens of job scripts? Work in progress...

Magnus Manske (talkcontribs)

Should be running now, at least some of them

Solidest (talkcontribs)

It looks like they moved once and then got stuck again overnight :(

Magnus Manske (talkcontribs)
Solidest (talkcontribs)

The jobs are stuck again, for almost a week now :(

Magnus Manske (talkcontribs)

I was travelling. Restarted the jobs now.

Solidest (talkcontribs)

Stuck again since 14 January.

Magnus Manske (talkcontribs)

kicked it

Solidest (talkcontribs)

Jobs are stuck again after the last glitch a few days ago. For four days, some "automatch by search" jobs in RUNNING status have not progressed at all. They used to complete fully in a few hours and find ~30k preliminary matches; one has now been showing 34 preliminary matches for days.

Magnus Manske (talkcontribs)

looking into it

Magnus Manske (talkcontribs)

I changed a few things, seems to work better now

Solidest (talkcontribs)

It's stuck again: the numbers among the "running" jobs haven't progressed at all in 8+ hours.

Solidest (talkcontribs)

After 48 hours I can confirm that the jobs are running extremely unstably. Almost all of the time they make no progress (the preliminary match counts don't change), and at some point they briefly resume before freezing again. It looks like they actually worked for about an hour or two over the two days.

Magnus Manske (talkcontribs)

Deployed an update, let's see if that does the trick

Solidest (talkcontribs)

It seems it didn't work and the queue has stopped again: none of the RUNNING automatch-by-search jobs has made any progress in the last hour (judging by the preliminary match counts).

Solidest (talkcontribs)

Yeah, the situation hasn't changed. Since I wrote the previous post the numbers didn't change until yesterday morning, when there was a jump in progress; now the numbers haven't changed again in more than 24 hours and the queue is not progressing.

Jonathan Groß (talkcontribs)

The problem seems to persist. Is there anything you can do?

Solidest (talkcontribs)

The queue progressed today for the first time since 31 July (but now seems to be stuck again). Meanwhile, autoscrape and update-from-file jobs have been stuck since 14 July; none of them has completed since then.

Jonathan Groß (talkcontribs)

Something happened yesterday: the job queue moved, but now some jobs are invisible, e.g. my own new catalogue #6010. The catalogue itself still says "job from 2023-08-15: update from tabbed file", but it doesn't show up in the general job list. Instead, newer catalogues are given "High Priority" for webscraping.

Magnus Manske (talkcontribs)

Just back from holiday now, looking into it. Problem is, the jobs run well for hours/days then stop for no obvious reason. I may have to restart the jobs several times to test things.

Jonathan Groß (talkcontribs)

Welcome back! :) Nice to see you here again, but even nicer to know you took a holiday. I hope it was relaxing (even though I kept bugging you on various channels).

Solidest (talkcontribs)

I don't know how much this helps, or whether it's the cause, but I've noticed that whenever the queue hangs, the properties with enough values to cause a timeout always end up in RUNNING. Right now NUKAT has the timeout error, and IMDb also has a timeout problem due to its 975k IDs on WD. Currently MnM can't open the IMDb manual sync page (https://mix-n-match.toolforge.org/#/sync/676): the WD query service almost always times out, the MnM API response times out as well, and the MnM page is stuck loading forever. Update: after I wrote this, there appears to have been a reset and IMDb went from RUNNING back to TODO.

Solidest (talkcontribs)

Update: never mind, the queue is now stuck on fairly small catalogues, both on the MnM and the WD side.

Solidest (talkcontribs)

The queue has just been cleared of everything except autoscrape jobs. And it looks like autoscraping doesn't work any more: all such jobs are just stuck in RUNNING status. #4712 and #4708 do pretty simple ID grabbing from a single HTML or JSON file that used to complete in a few seconds, but they now sit idle.

Solidest (talkcontribs)

The queue got clogged again a couple of days ago with five failing autoscrapes. Would it be possible to disable autoscrape jobs completely, so that other jobs can go through until autoscrape is fixed?

Magnus Manske (talkcontribs)

I bumped the autoscrape task size up one level and restarted. Other jobs should run first now.

Solidest (talkcontribs)

Apparently autoscrape tasks share slots with "update from file" tasks? They are still blocking those from going through.

Magnus Manske (talkcontribs)

No, but the priority assignment is a bit complicated. I have tuned it a bit, should be better now.

Solidest (talkcontribs)

Seems to be fixed. Thanks!

Matthias M. (talkcontribs)

Hi, can you please remove/stop the scraper from Mixnmatch:5951? It will likely fail again and only waste resources. I scraped on my own computer and imported manually.

Magnus Manske (talkcontribs)

Done

Avocadobabygirl (talkcontribs)
Epìdosis (talkcontribs)
Avocadobabygirl (talkcontribs)
Framawiki (talkcontribs)

May I intrude in the conversation to ask for deletion of the outdated 503 and 2059, please?

Reply to "Retiring superseded dataset"
Jonathan Groß (talkcontribs)

Hi Magnus, I've tried to create a catalogue for Mythoskop (https://mix-n-match.toolforge.org/#/catalog/6010). My CSV file is fine, but the catalogue turns up empty. Is it possible that MnM "loses" submitted CSV files if the jobs are stuck for too long? I waited a fortnight the first time around, and I resubmitted my file last night.

As always, your help is much appreciated. In friendship, J.

Jonathan Groß (talkcontribs)

Update: the catalogue is now online, but it has only 317 entries. There should be ca. 2,000.

Magnus Manske (talkcontribs)

working on it...

Magnus Manske (talkcontribs)

There was a regression where only "autoq" and not "q" was allowed for pre-matched entities. Fixed now. I believe there might still be a few missing, but that might be a UTF-8 encoding issue in the data file.

Jonathan Groß (talkcontribs)

Is UTF-8 the desired format?

There are still a few missing, but I don't know which ones. There should be 2,080 entries; MnM says 1,926. Should I send you the CSV file for reference?

Magnus Manske (talkcontribs)

yes please

Jonathan Groß (talkcontribs)

Sent.

Jonathan Groß (talkcontribs)

Hi Magnus! I'm not sure what to do next. Should I put the descriptions in my CSV file in "" quotation marks and re-upload?

Magnus Manske (talkcontribs)

I think that might fix it. Untested, obviously.
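
For reference, Python's csv module can do this quoting automatically; a minimal sketch (the column layout and names here are invented examples, and MnM's actual expected columns and delimiter should be checked against the upload form):

```python
import csv
import io

# Example rows; the columns (id, name, description) are made up
# for illustration and are not MnM's actual upload schema.
rows = [
    ["mythoskop-1", "Achilles", "Hero of the Trojan War, son of Peleus"],
    ["mythoskop-2", "Kirke", "Enchantress of Aiaia, daughter of Helios"],
]

buf = io.StringIO()
# QUOTE_ALL wraps every field in quotes, so commas inside a
# description can no longer be mistaken for field separators.
writer = csv.writer(buf, quoting=csv.QUOTE_ALL)
writer.writerows(rows)
print(buf.getvalue())
```

Writing to a real file works the same way with open("catalog.csv", "w", encoding="utf-8", newline=""); for a tab-separated ("tabbed") file, pass delimiter="\t" to csv.writer.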

Jonathan Groß (talkcontribs)

Thank you! It seems I have a lot to learn about CSV files. I will give it a try.

Jonathan Groß (talkcontribs)

Hi Magnus,

a week ago I uploaded the CSV file again, with "" around the strings. Yesterday the MnM job list finally processed the new tabbed file. Unfortunately, the catalogue still has the same low number of entries (1,926, more than 100 short). With MANTO it's the same problem.

I don't understand what is going wrong there. Could you perhaps take another look at both files? I'll send them to you again via email.

Reply to "MnM 6010 not applying CSV data"
Summary by Jonathan Groß

Never mind, I just found the button ... sorry for the inconvenience :)

Jonathan Groß (talkcontribs)

Greetings, Magnus!

Your bot runs tirelessly, but some pages have nevertheless lain fallow for a long time (e.g. the very useful report Wikidata:WikiProject Ancient Greece/reports/Set of mythological Greek characters). Is there any intended regularity to the updates (e.g. the report is refreshed every three months)? If so, could you make that transparent?

In friendship,

Jonathan

Zenfiric (talkcontribs)
Solidest (talkcontribs)

MnM web scraper lookahead prohibition

99of9 (talkcontribs)

Hi Magnus. I've previously scraped lots of sets using lookahead like

<li>(([^<]|<(?!/li>))*)</li>

(or the equivalent for a table row) to get a regex capture for each item. I made a simple scraper at https://mix-n-match.toolforge.org/#/catalog/6013 but it failed, because a new requirement prohibits lookahead. The job log error reads:

regex parse error: <li>(([^<]|<(?!/li>))*)</li> ^^^ error: look-around, including look-ahead and look-behind, is not supported

But I can't figure out a good way of doing this without lookahead. Any suggestions? How important is prohibiting lookahead?

Magnus Manske (talkcontribs)

Hi, the reason for this is that I rewrote the background jobs in Rust, and the default Rust regex crate does not support lookahead. There is an alternative regex crate, but it's not a drop-in replacement, and I haven't had time to switch over yet.
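
For what it's worth, the lookahead in 99of9's pattern can usually be replaced by a lazy (non-greedy) quantifier, which the default Rust regex crate does support. A sketch in Python, deliberately restricted to syntax the Rust crate also accepts ((?s) and .*?); like the lookahead version, it does not handle nested <li> elements:

```python
import re

html = "<ul><li>First <b>item</b></li><li>Second\nitem</li></ul>"

# Lookahead version (rejected by Rust's regex crate):
#   <li>(([^<]|<(?!/li>))*)</li>
# Lazy equivalent: match as little as possible up to the next </li>.
# (?s) lets . match newlines, so multi-line items still work.
pattern = re.compile(r"(?s)<li>(.*?)</li>")

items = [m.group(1) for m in pattern.finditer(html)]
print(items)  # ['First <b>item</b>', 'Second\nitem']
```

Both patterns stop at the first closing </li>, so for flat lists and table rows the behavior should be the same.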

Reply to "MnM web scraper lookahead prohibition"
Kuldeepburjbhalaike (talkcontribs)

Hi, could you please check why this template doesn't work on pawiki?

This post was hidden by Solidest (history)
Reply to "Wikidata list template on pawiki"