User talk:Magnus Manske

Previous discussion was archived at User talk:Magnus Manske/Archive 9 on 2015-08-10.

Jheald (talkcontribs)

Hi Magnus,

Would it be possible to update MnM catalogue 2822 ("Gazetteer for Scotland person")? The original scrape didn't go to high enough ID numbers, so it was missing a couple of hundred entries from the end.

I've put up an MnM upload-format file at https://tools.wmflabs.org/paste/view/003b4e1d, which now also includes the first sentence of the biography for each entry.

Thanks!

Magnus Manske (talkcontribs)

Done.

Jheald (talkcontribs)

Is there a reason it didn't import the biographical summaries that were in the file, apart from for newly-added entries? These would be helpful, particularly for the unmatched entries.

Magnus Manske (talkcontribs)

Example entry where it didn't import the description? Many seem to consist of "only" the dates.

Magnus Manske (talkcontribs)

OH I see, description "updates"

Reply to "Update for MnM 2822"
Jheald (talkcontribs)

Hi Magnus,

I've been having trouble with "manual sync" for catalogue 2839 ("Art UK sculptors (people)"). There should be over a thousand of these already matched on Wikidata (as well as 20,000+ items with P1367 that aren't in this catalogue), but on each of the three times (twice last week, once just now) that I've tried "manual sync" it's only added a little over 300 of the matches to MnM.

I have just run it a fourth time, to pull in the last lot of matches, but this is probably something that ought to be looked into.

Jheald (talkcontribs)

I don't know if it's related, but the automatic matcher also seems to have missed a lot of targets in this set -- items with identical names, very similar dates, but not suggested.

Magnus Manske (talkcontribs)

Syncing/matching now. The large number of Wikidata entries might trip some scripts (SPARQL results etc.)

Deactivation of out-of-date mix-n-match catalogues

Samwilson (talkcontribs)

Hi,

Could you please deactivate mix-n-match catalogues 1667 (25,514 items) and 2845 (1,921 items)? They have been replaced by catalogue 2846 (25,796 items, and corrected metadata alignment).

Thanks!

99of9 (talkcontribs)

I created one of those, and I agree to its deactivation.

Magnus Manske (talkcontribs)

Done.

Evolution and evolvability (talkcontribs)

Hi Magnus,

The SourceMD batch queue currently has unprocessed items from last month. Would it be possible to reserve a smaller partition/queue for small batches (1-20 items), like a supermarket express aisle? It could help introduce new users to the tools, because they would be able to see their batch being processed.

Magnus Manske (talkcontribs)

SourceMD bot is currently not running, until this is fixed. But I don't have much time to work on it.

Evolution and evolvability (talkcontribs)

Ah, thank you! Sorry for pestering. Some day I'll hopefully have the technical skills to help out a bit more on such things.

Reply to "SourceMD queue backlog."
Pierrette13 (talkcontribs)
Jheald (talkcontribs)

Looking at the edit summary, it appears to have come from VIAF. Looking at the VIAF entry, it appears to have come from the BNF and/or the ISNI entry, which give the date of birth as "19..". However, this is coded in MARC field 997 as 1950 0. On a very quick look, I couldn't see a spec for the field. It may be that that final 0 indicates that the value is approximate; or there may be no way to know that it is not a detailed year-specific date.

One thing that may stop the bot re-adding it would be to leave the statement on the item, but change its rank to "deprecated", perhaps with the additional qualifier "reason for deprecation" = "incorrect value in source". If it is already in the item in this way, that may stop the bot adding it.

Pierrette13 (talkcontribs)

Hello Magnus, thank you for your explanation. I will come back to you if the birthdate reappears. Best regards

Reply to "Birthdate"
GZWDer (talkcontribs)

Can the URL of each entry be replaced with http://data.star.sports.cn/person_en.php?id=<external ID> ? The current one is wrong.
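The requested URL pattern can be sketched as a simple template substitution. This is only an illustration of the pattern quoted above, not how Mix'n'match stores or builds its URLs; the function name `entry_url` and the sample ID are hypothetical.

```python
from urllib.parse import quote

# URL pattern taken from the request above; {external_id} stands in for <external ID>.
URL_PATTERN = "http://data.star.sports.cn/person_en.php?id={external_id}"


def entry_url(external_id: str) -> str:
    """Build the entry URL for one external ID (hypothetical helper)."""
    # Percent-encode the ID so unexpected characters cannot break the query string.
    return URL_PATTERN.format(external_id=quote(str(external_id), safe=""))


if __name__ == "__main__":
    # "12345" is a made-up ID used only for illustration.
    print(entry_url("12345"))
```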

Reply to "Catalog 2843"
Gerwoman (talkcontribs)

How have you done this?

Reply to "catalog 2838"

New PetScan has become case-sensitive for the lead character

Billinghurst (talkcontribs)

With the changeover to the new PetScan, a number of my searches are now failing because of case sensitivity in the first letter of a template name. An example:

petscan:796421 has "Has none of these templates: index transcluded"

and this now fails to filter. Could the previous behaviour be restored so that I don't need to regenerate all my maintenance searches? Thanks.
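For illustration, the first-letter behaviour being asked for could look roughly like the sketch below. It assumes the wiki uses MediaWiki's default setting, under which the first character of a page title is case-insensitive; it is not PetScan's actual code, and the function names and the sample template "Header" are hypothetical.

```python
def normalize_template_name(name: str) -> str:
    """Normalize a template name the way a first-letter-insensitive wiki would."""
    # Underscores and spaces are equivalent in titles; only the first letter is folded.
    name = name.replace("_", " ").strip()
    return name[:1].upper() + name[1:]


def has_none_of(page_templates, excluded):
    """Return True if the page uses none of the excluded templates."""
    excluded_norm = {normalize_template_name(t) for t in excluded}
    return not any(normalize_template_name(t) in excluded_norm for t in page_templates)


if __name__ == "__main__":
    # A page transcluding "Index transcluded" should be filtered out even though
    # the search was saved with a lowercase first letter.
    print(has_none_of(["Index transcluded"], ["index transcluded"]))
```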

Magnus Manske (talkcontribs)

FYI, this is fixed now.

Bad QuickStatements edits with no attribution

Bovlb (talkcontribs)

This edit conflated two people, not only adding a second ORCID to an item that already had one, but one which was already on a distinct item. I would complain to the person who issued the QuickStatements batch, but I don't know how to determine who that was. Cheers!

Bovlb (talkcontribs)

Q64867660 is another case where an item was created with an ORCID already on an existing item, where the author and intention of the QuickStatements batch is not apparent.

Bovlb (talkcontribs)

Q64860817 and Q64860820 are a case of items created with the same ORCID within a few seconds of each other, again with no way to track the author.
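A check for this kind of duplicate can be sketched as grouping items by ORCID and reporting any ORCID held by more than one item. This is a minimal sketch, assuming the (item, ORCID) pairs have already been exported from Wikidata; it is not part of SourceMD or QuickStatements, and the ORCID values and the item "Q_EXAMPLE" below are placeholders, not real data.

```python
from collections import defaultdict


def find_shared_orcids(pairs):
    """Map each ORCID that appears on more than one item to its sorted item list."""
    items_by_orcid = defaultdict(set)
    for item, orcid in pairs:
        items_by_orcid[orcid].add(item)
    return {
        orcid: sorted(items)
        for orcid, items in items_by_orcid.items()
        if len(items) > 1
    }


if __name__ == "__main__":
    # Placeholder ORCIDs; the first two item IDs are the pair mentioned above.
    sample = [
        ("Q64860817", "0000-0000-0000-0001"),
        ("Q64860820", "0000-0000-0000-0001"),
        ("Q_EXAMPLE", "0000-0000-0000-0002"),
    ]
    print(find_shared_orcids(sample))
```

Running such a check before a batch creates new items would flag an ORCID that already exists on another item, although it cannot by itself identify who submitted the batch.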

Bovlb (talkcontribs)

I have consulted the RFP for this bot, and it clearly states that it will have "both batch and submitting user indicated and linked in the edit comment". Recent edits appear to violate this. I note that I brought this to your attention three months ago. I would appreciate some feedback from you, as the creation of duplicate items is disruptive, and this lack of attribution prevents an effective response.

Magnus Manske (talkcontribs)

I have deactivated SourceMD for now, until I can figure out what's wrong.

GerardM (talkcontribs)

What I have noticed while fixing authors is that these instances have largely gone away, particularly after the latest update of the software. You may argue that it is disruptive when duplicate records are created, but the loss of this service is much more disruptive. Yes, I do merge duplicate records.

Thanks,

Bovlb (talkcontribs)

GerardM: Please feel free to work on the backlog of ORCID duplicates in the meantime. https://w.wiki/5ZJ

GerardM (talkcontribs)

There are regular reports on the ORCID duplicates. Check them and you will see my handiwork. I do not run queries.

Bovlb (talkcontribs)

I'm still seeing cases within the last month of bad QuickStatements changes that have no useful edit summary, and hence no way to track down who is making the errors. I am surprised that this could possibly be intended or permitted. Looking at the most recent bot changes from today, I see many entries with the summary "#quickstatements; invoked by SourceMD:ORCIDator". Referring to Wikidata:Requests_for_permissions/Bot/QuickStatementsBot, I see an undertaking that "both batch and submitting user [are] indicated and linked in the edit comment", which suggests that this bot is in violation of its request for permissions. This appears to be a long-standing problem that has been brought to your attention repeatedly. Can we please put an end to this?

GerardM (talkcontribs)

Hoi, so what is the problem? It is the most important tool available to link people and papers in Wikidata based on information from ORCID. Its use is necessary because it provides the primary data behind the Scholia presentation, and the results tend to be good. When you point to a statement as problematic, the question is why it came up in the first place: where is the error in the data that caused it? Given the quality of our runtime environment, it is easy to see that edits get refused because of spurious timeouts, so it is not a given that processed data ends up in the database as intended. To me it is not obvious that you are not throwing the baby out with the bathwater. Thanks, but no thanks.

Bovlb (talkcontribs)

@GerardM: Specifically, the problem is that bad edits like the diff linked above cannot be traced back to find who is making the bad edits and why. This means that we lack an essential tool for improving our process. It is far better to educate editors (and fix tools) so that we introduce fewer errors in the future, rather than merely find ways to detect and fix errors after the fact. I have found (and fixed) many such cases of incorrect assignment of Scopus Author ID (P1153), but I have been unable to find out who is making these errors, how, and why.

More generally, it is not good practice for bots to fail to do the things that were promised in their request for permissions. Circumstances change, of course, and we don't need to be over-pedantic on this technical point, but in this case the promised behaviour is clearly highly desirable, and it is a mystery to me how we apparently came to drop it.

GerardM (talkcontribs)

Hoi, surprise: data can be dirty, and we are talking about data not at our end but at the end of Scopus (an organisation that does not care about us), OCLC, VIAF (including all the library authorities of this world) and ORCID. So it is not that the bots fail; it is the data that fails us.

Now here is something to consider: how can we be the place where authorities come together if we do not take the data, warts and all? The desired behaviour is a pipe dream if at the same time you want data that is meaningful and worthwhile. Given that you are a data professional (as per your user page), you should understand this well.

As to data cleaning, I merge quite a number of items. For me the key thing is that with more data merged, chances of keeping the data clean improve. The interoperability of data improves.

The notion that we should stay away from data sources is absolutely painful. We have lost years by not accepting data that is or was no better than the data we have or had. For me, the notion that we can build Wikidata and keep it clean is false.

Bovlb (talkcontribs)

"The notion that we should stay away from datasources is absolutely painful."

Could you please explain how your response is related to the issue I am raising?

GerardM (talkcontribs)

SourceMD takes info from sources like ORCID and assumes that the data is fine. Typically it is. The notion that SourceMD is banned because of errors elsewhere is, for me, absolutely painful. It is a tradition: we have scorned the data from Freebase and many others. The arguments are based on the quality of single items, not on the quality of sets and subsets.

Bovlb (talkcontribs)

Thanks for responding. I am still confused about how your point relates to this issue. I am not trying to ban SourceMD.

What I am seeking is that, when QuickStatementsBot acts on behalf of a user, that user is identified in the edit summary. Not only does this seem like a reasonable request, but it also appears to be promised in the bot's RFP. Edits should be assignable. Either the bot author is taking on responsibility for these edits, or they should be attributed to another editor. So far as I can tell from the RFP, QuickStatementsBot falls into the latter category.

GerardM (talkcontribs)

I am not the author. I use this tool, and for my purposes it is vitally important. While your arguments are reasonable to put to an author, they would force the users of the tool to abandon it. THAT is not reasonable.

Bovlb (talkcontribs)

If I understand what you're saying correctly, you are taking the position that if the SourceMD tool were to record the user responsible for each change, then you would have to stop using it. In other words, you can only use the SourceMD tool on condition of anonymity.

This is a startling claim. Could you explain your reasons?

GerardM (talkcontribs)

No, I am more than happy for the tool to register me as the user. What I am not happy to do is refrain from using the tool.

Bovlb (talkcontribs)
Reply to "Bad QuickStatements edits with no attribution"

Quickstatementsbot putting in bad aliases

ArthurPSmith (talkcontribs)

I've noted this on the User talk:Reinheitsgebot page, but Quickstatements bot seems to be doing the same thing, and more recently - example: this edit somehow mixed up "Oscar H Franco" with Franco Giulianini???

Magnus Manske (talkcontribs)

I believe that code is no longer active.

ArthurPSmith (talkcontribs)

Do you know when it was changed? This edit by Quickstatementsbot is from September 11...

ArthurPSmith (talkcontribs)
Reply to "Quickstatementsbot putting in bad aliases"