Hi! I remember you run a very efficient bot, and in the past I asked you for some fixes which were done very efficiently. Now I mostly do fixes through QuickStatements, which is a very good tool but still cannot fix references while leaving the statements unchanged. I sometimes notice big groups of items (thousands and tens of thousands) having references which are imprecise or wrong, and I don't know whom to ask for correction. Could I gradually report to you some notable cases of references to be fixed, so that we can deal with them step by step through your bot? I think exactly correct references are crucial for our data quality, and at the moment that is often not the case. Thank you very much in advance!
Topic on User talk:Ladsgroup/Flow
Hey, sure. I'll try to write something, but I want to know the exact framework so that I don't need to write similar code every time; I would rather write something general and reuse it.
Can you give me a couple of examples?
OK, great! So, here is a detailed overview of the situation. I see three main types of errors to be corrected:
- first type: correct previous imprecise bot-edits
- one known case: all references containing stated in (P248) property (Q1400881) + Internetowy Polski Słownik Biograficzny ID (P8130) ID (you can easily infer a complete list from here) should have the value of stated in (P248) corrected to Internetowy Polski Słownik Biograficzny (Q96022943) (example); these wrong references have been recently added by @Reinheitsgebot:, to whom I reported the problem without obtaining a correction
- second type: properties which changed their format, so that references are now broken
- case 1: thousands of references containing HDS ID (P902) still have IDs with 4 or 5 digits (complete list): the IDs now always require 6 digits, so 1 zero should be added before the 5 digits (example), 2 zeros before the 4 digits (example); if absent, stated in (P248) Historical Dictionary of Switzerland (Q642074) should be added before HDS ID (P902) (example)
- case 2: hundreds of references containing InPhO ID (P863) still have IDs containing only a number (complete list): now the IDs require the prefix "thinker/", which should always be added before the number (example)
- case 3: hundreds of references containing Spanish Biographical Dictionary ID (P4459) still have IDs containing only a number (complete list): the IDs now also require the following part, with "/" and the name, which should always be added after the number; if, as nearly always happens, the ID in the reference numerically coincides with the main value of P4459, the main value of P4459 should be used to complete the value of P4459 used in the reference (example)
- third type: properties which have been added twice as references to the same statement, with small differences or exactly equal; the two references should be merged keeping all the properties except reference URL (P854) (possibly obsolete and anyway not stable): stated in (P248), ID, named as (P1810) if present, most recent retrieved (P813) if present (here the range of properties involved is huge, I will give only some examples - more to follow in the next weeks) - introductory example 1, introductory example 2
- note: the following queries regard only date of birth (P569), but should be repeated at least for date of death (P570) and possibly for all properties
- Bibliothèque nationale de France ID (P268): first list (example of merged references, another one) and second list (may time out; if necessary use LIMIT) (example of coincident references, the oldest is removed, same here)
- Artsy artist ID (P2042): list (contains false positives) (example of coincident references, the one having P813 first is removed)
- Artnet artist ID (P3782): list (contains false positives) (example of merge of three references)
- GND ID (P227): list (example: a reference with P248 but without ID is markedly imprecise, so in presence of another reference having both P248 and ID should simply be removed)
If you have any questions, ask me! When you have the bot ready, please start with some test edits, so that I can have a look. Thank you very much in advance!
Thanks. I'll try to tackle it next weekend. This weekend I'm drowning in something personal.
Hi! Any updates? Obviously no urgency, as I said - just a little message so that I don't forget the issue myself :)
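The three format fixes of the second type can be sketched as plain string transformations. This is only a minimal illustration, assuming reference values are handled as strings; the function names are hypothetical and this is not the bot's actual code.

```python
import re


def _is_plain_number(value: str) -> bool:
    # True only for strings made of ASCII digits, e.g. "12345".
    return re.fullmatch(r"[0-9]+", value) is not None


def normalize_hds_id(value: str) -> str:
    """HDS ID (P902): left-pad 4- or 5-digit IDs with zeros to 6 digits."""
    if _is_plain_number(value) and len(value) < 6:
        return value.zfill(6)
    return value


def normalize_inpho_id(value: str) -> str:
    """InPhO ID (P863): add the "thinker/" prefix to bare numbers."""
    if _is_plain_number(value):
        return "thinker/" + value
    return value


def complete_dbe_id(ref_value: str, main_values: list) -> str:
    """Spanish Biographical Dictionary ID (P4459): complete a bare number
    in a reference with the matching main value ("number/name") of the item.
    If no main value has the same number, the reference value is left as-is.
    """
    if not _is_plain_number(ref_value):
        return ref_value
    for main in main_values:
        number, sep, name = main.partition("/")
        if sep and name and number == ref_value:
            return main
    return ref_value
```

For instance, `normalize_hds_id("1234")` gives `"001234"`, and `normalize_inpho_id("3724")` gives `"thinker/3724"`; values already in the new format pass through unchanged, so re-running the fix is harmless.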
Hey, sorry. I have been doing a million things and have been drowning in work but will get to it ASAP. I took some vacation for volunteer work :)
But it's on my radar, always has been. Don't worry.
Again. I have not forgotten about this. One day I will get it done. It's just there are so many things to do :(
Okay, one part is done: the bot now takes a SPARQL query and removes references that are exact duplicates. Here's an example. I will write more over the next weekends.
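Removing exact duplicates can be sketched roughly as follows. This is an illustration only: a reference is modelled here as a dict mapping a property ID to a list of string values, which is an assumption and not the bot's real data model, and `drop_exact_duplicates` is a hypothetical name.

```python
def reference_key(reference: dict) -> frozenset:
    # Order-independent fingerprint of a reference: the set of
    # (property, value) pairs it contains.
    return frozenset(
        (prop, value) for prop, values in reference.items() for value in values
    )


def drop_exact_duplicates(references: list) -> list:
    """Keep the first occurrence of each distinct reference on a statement."""
    seen = set()
    kept = []
    for reference in references:
        key = reference_key(reference)
        if key not in seen:
            seen.add(key)
            kept.append(reference)
    return kept
```

Two references count as duplicates here only if they carry exactly the same property/value pairs, regardless of the order in which the properties appear.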
Very good, thanks!
And the second type. Let me know if we want to clean up more. The first type is very similar to the second one, so consider that done as well. Let's do this then.
Very good fixes for HDS ID (P902), great work! Could you link me also examples for InPhO ID (P863) and Spanish Biographical Dictionary ID (P4459)? After these two, second type is surely OK.
I'm doing them one by one because there are so many of them; for example, P902 took a day to finish. P863 is underway.
Ok P863!
A little case related to the third type: Benezit ID (P2843) has been inserted as a reference in two different ways, the older one with reference URL (P854) and the more recent one with Benezit ID (P2843).
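The merge rule requested for the third type (keep every property except reference URL (P854), and only the most recent retrieved (P813)) can be sketched as below. This is a minimal sketch under assumptions: references are dicts mapping a property ID to a list of string values, P813 dates are ISO strings (so the latest one is also the largest string), and `merge_references` is a hypothetical name.

```python
def merge_references(references: list) -> dict:
    """Merge several references on one statement into a single reference."""
    merged = {}
    for reference in references:
        for prop, values in reference.items():
            if prop == "P854":
                # reference URL is dropped: possibly obsolete, not stable
                continue
            bucket = merged.setdefault(prop, [])
            for value in values:
                if value not in bucket:
                    bucket.append(value)
    if "P813" in merged:
        # keep only the most recent retrieval date
        merged["P813"] = [max(merged["P813"])]
    return merged
```

With an older reference holding only a URL and a retrieval date, and a newer one holding the Benezit ID and a later date, the result keeps the ID and the later date. Deciding which references on a statement actually belong together is the hard part and is not covered by this sketch.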
Very good P4459!
Done now. Gosh, it took days :))) Let me fix type one now.
Can you give me a SPARQL query for the first type? I'm not good at queries involving refs :(
Use https://w.wiki/iNn, it contains both cases of date of birth (P569) and of date of death (P570).
Very good. While waiting for part 3, which is obviously the most difficult, I have another task: all uses of described by source (P1343) in references (these thousands) should be replaced with stated in (P248), in order to avoid scope-constraint violations.
The third type is not that hard. I thought it was done. Let me double-check and clean up the mess.
I re-read what you wrote for the third type a couple of times and now I get what you want, but it's pretty complex. I'll try to see what I can do about it next weekend.
Hi! When you have time, could you have a look at these three?
- Wikidata:Bot requests#Accademia delle Scienze di Torino multiple references
- Wikidata:Bot requests#Archivio Storico Ricordi multiple references
- Wikidata:Bot requests#Library catalogs (2021-01-28)
They are probably less difficult than point 3 above, which I understand is quite difficult. See you soon!
Hey, sure. Just give me a week or two.
I wrote something that can clean up duplicates and subsets (e.g. when a reference is fully covered by another reference, and more). I already started the bot and it's cleaning. It will continue, but I don't think I can clean up more than that, as it gets really, really complicated.
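The "duplicates and subsets" cleanup can be illustrated with set comparisons. This is only a sketch under assumptions, not Dexbot's code: each reference is a dict mapping a property ID to a list of string values, and `drop_covered_references` is a hypothetical name.

```python
def as_pairs(reference: dict) -> frozenset:
    # Set of (property, value) pairs contained in a reference.
    return frozenset(
        (prop, value) for prop, values in reference.items() for value in values
    )


def drop_covered_references(references: list) -> list:
    """Remove references that are fully covered by another reference.

    A reference is dropped if its pairs are a strict subset of another
    reference's pairs, or if it is an exact duplicate of an earlier one.
    """
    pair_sets = [as_pairs(reference) for reference in references]
    kept = []
    for i, pairs in enumerate(pair_sets):
        covered = False
        for j, other in enumerate(pair_sets):
            if i == j:
                continue
            if pairs < other or (pairs == other and j < i):
                covered = True
                break
        if not covered:
            kept.append(references[i])
    return kept
```

For example, a reference with only stated in (P248) Historical Dictionary of Switzerland (Q642074) is dropped when another reference on the same statement has both that P248 value and an HDS ID (P902). References that merely overlap, without one containing the other, are left alone, which is where the more complicated merging would start.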
Perfect! When it finishes, could you schedule it as periodic maintenance (e.g. once a month)? This would assure us the stability of the quality.
It works based on SPARQL queries. Which queries do you want me to run regularly?
Maybe after the cleanup Dexbot is doing now it won't be necessary anymore; I think that these redundant references were inserted due to an error by Reinheitsgebot, so maybe the error has been solved and such cases won't arise again. However, I may give you other queries (of the third type) in the future if I find similar problems with different properties.
Just two more tasks when you have time: Wikidata:Bot requests#Accademia delle Scienze di Torino multiple references and Wikidata:Bot requests#Fix values of P248 in references (2021-06-13). Thanks!
When you have time, could you have a look at Wikidata:Bot requests#Accademia delle Scienze di Torino multiple references? Thanks as always!
Hi, you mean the Czech part? I just fixed it and am running it again. Everything else has been done for a really long time now.
No, I mean Wikidata:Bot_requests/Archive/2020/12#Accademia_delle_Scienze_di_Torino_multiple_references (don't know why it has been archived!); it would be very useful.
I would need another little bot intervention: for all the statements listed in https://w.wiki/4HP4, the qualifier statement is subject of (P805) should become field of work (P101). Unfortunately I cannot do it through QuickStatements. Thanks!
Let's sum up the missing ones:
- Wikidata:Bot requests#Accademia delle Scienze di Torino multiple references (updated)
- https://w.wiki/4HP4: move qualifier statement is subject of (P805) to qualifier field of work (P101). Done
- Wikidata:Bot requests#request to delete wrong references (2021-10-27) Done in October by MisterSynergy
- a little sequel: change stated in (P248) Southern Africa Association for the Advancement of Science (Q7569570) into stated in (P248) Biographical Database of Southern African Science (Q24276683) for these references Done
Thanks!
Hi, I started the last one. Will get to the rest slowly.
- 5: another little sequel: change stated in (P248) Flanders Heritage Agency (Q3262326) into stated in (P248) Inventory of Immovable Heritage (Q2091956) for these references Done
Thanks in advance!
The fifth one is running now. I'll check the rest a bit later.
The second one is also running now. Special:Diff/1554000636