Wikidata project chat
A place to discuss any and all aspects of Wikidata: the project itself, policy and proposals, individual data items, technical issues, etc.

Please use {{Q}} or {{P}} the first time you mention an item or property, respectively.
Other places to find help

For realtime chat rooms about Wikidata, see Wikidata:IRC.
On this page, old discussions are archived after 7 days. An overview of all archives can be found at this page's archive index. The current archive is located at 2023/12.

Concept of bot edits

There is a problem I would like to ask the community about. The description will be long; I will ask the specific questions at the end.

Vojtěch Dostál imported data on hundreds of thousands of individuals from the Czech National Library's NKC database this summer. This is a big and important project. Although the data was incomplete and sometimes wrong, I think I was not the only one who was basically happy with the project, followed the import, and corrected and completed the items.

Another editor, Frettie, with the help of his bot (Frettiebot), started to add more data to the items: occupations, birthplaces, languages spoken, etc. This also meant forcing in a lot of problematic data. I say forcing because, as things stand, if one corrects or deletes an erroneous value, Frettiebot adds the same value again; if one deletes it again, the bot adds it again, and this repeats in an endless cycle. Unfortunately, communication with Frettie is at a very low level, despite being told repeatedly and repeatedly that what he is doing is a problem, he neglects requests and usually gives a condescending answer: if you don't like the data added by the bot, change it to "deprecated". His edits have led to edit wars: between editors and Frettiebot on the one hand, and with other bots on the other (the latter has led to two bots being blocked from a page).

Let's look at the problem with occupation data: obviously all NKC-identified persons have at least one occupation, but it is common to include two or three statements for the occupation (P106) property. For an import of hundreds of thousands of persons, this means hundreds of thousands of statements. If only ten per cent of these are incorrect or redundant, that is tens of thousands; if one per cent, it is still thousands. The fundamental problem is that for data imports of this magnitude, it is the wrong methodology to build a project around correcting data 'manually' over and over again. The right thing to do would be for the bot not to overwrite data that has already been corrected.

Not a problem for me, but I note that if a source database gives this much erroneous data, the reason for deprecated rank (P2241) added to the "deprecated" flag will eventually include source known to be unreliable (Q22979588), which in turn brands the entire Czech National Library database as unreliable. But I don't think the source is that unreliable; it's just a bad concept of data distribution, and the bot operator doesn't hear the problem signal.

The conceptual question is: where do we import from, and how much do we build on the source data? The personal database of the Czech National Library is not a biographical database, just as other library catalogues are not. The intention of the database creators was simply to be able to distinguish between identical forms of names in some way. Therefore they do not, or only rarely, include detailed biographical data: they do not include exact dates and places of birth or death (perhaps only years), exact occupations or education, and they obviously cannot be used as archontological data. For example, Hrvatski biografski leksikon ID (P8581) or Vienna History Wiki ID (P7842) etc. point to a biographical database, but neither NKC, VIAF nor OSZK are biographical databases. Data imported from the latter should be treated with a certain degree of caution, rather than forcibly rewritten over and over again to the items. Here, however, it seems that despite all the feedback requesting corrections, the NKC data are treated by the bot operator as if they were dead certain.

The countless incorrect or unnecessary data added in this way will only turn the Wikidata page into a swamp. Why, for example, do you need five or ten occupations, three or five of which should be set to obsolete because they are either wrong or simply add nothing extra? Let's see what the typical mistakes are:

For example, if a person's occupation is Lutheran pastor (Q96236305), but is recorded in the NKC as priest (Q42603), parson (Q955464) or pastor (Q152002), Frettiebot will add it to the existing Lutheran pastor (Q96236305), sometimes all of them. If someone is known to have been born in the 18th arrondissement of Paris, but the NKC only records his place of birth as Paris, Frettiebot comes and adds Paris to the element, even though it already records that he was born in the 18th arrondissement of Paris. If an element describes not one person but several (e.g. a duo, twins, a married couple, etc.), then certain properties belong not on that element but on the elements of the individual people (those containing P31 Q5). One such property is P1412, which should be added to each person rather than to the group, but Frettiebot ignores this caveat.

These are just a few examples, obviously I have not brought this to the attention of the community just because of three problems, but because there are countless - in my opinion conceptually flawed, unnecessary - bot editing practices.

The specific question is: is it correct for a bot to repeatedly enter the same data into an element if that data is incorrect, redundant or out of place? Is it correct to extract specific biographical data not from a biographical database but from a non-specialised catalogue? Is it right to put the burden of correction so much on the users when it could be done by the bot operator?

Of course, I'm also waiting for Frettie's reply, because - although asked - he never described what justifies the bot having to re-enter redundant or incorrect data over and over again, i.e. why is it better for the user to set it to obsolete, rather than the bot changing the data entry? Pallor (talk) 20:18, 21 October 2023 (UTC)

A bot is not a fortune-teller; a bot cannot know what has been deleted by another user. That's the main problem. It could find this out from the history, but that would make the script run (a lot) longer. Moreover, I personally think that deprecated values are better left in place, because then we are able to detect that the source data is wrong and possibly have it fixed. That is sometimes done; the cooperation between WM CR and NK CR is mutual. By the way, I'm a man. --Frettie (talk) 21:40, 21 October 2023 (UTC)
I would have thought a bot should only be used if its edits are correct. If it adds substandard information, for example, then please no. Automated undoing of corrections? Um, Maculosae tegmine lyncis (talk) 21:47, 21 October 2023 (UTC)
Given you know there are problems with your data sources, can't you record your inputs for each tranche and only apply the differences each run? It seems very bad practice to reapply the complete dataset knowing that it will recreate errors. And deprecated values are for "often thought to be true, but actually not", not to act as a database of problems in your data sources. To create that you should do a report from WD and compare it with your sources outside the WD system. Vicarage (talk) 22:05, 21 October 2023 (UTC)
@Vicarage I don't think your definition of deprecated rank is correct. It's also used to mark sourced, but incorrect statements (i.e. information that was never correct, but was at some point thought to be). Vojtěch Dostál (talk) 18:14, 22 October 2023 (UTC)
Yes, but I don't think WD should be used as a staging area for fixing other people's data, as @Frettie merely hints they might do, particularly in this case when the approach is irritating others. Vicarage (talk) 18:59, 22 October 2023 (UTC)
Why shouldn't the bot be suspended at least until it stops edit warring? Assuming "Unfortunately, communication with Frettie is at a very low level, despite being told repeatedly and repeatedly that what he is doing is a problem, he neglects requests and usually gives a condescending answer: if you don't like the data added by the bot, change it to "deprecated". His edits have led to edit wars: between editors and Frettiebot on the one hand, and with other bots on the other." is accurate, are you unwilling or unable to resolve the problems, starting with stopping it from edit warring? I think Vicarage is right. RudolfoMD (talk) 03:52, 23 October 2023 (UTC)
@Frettie? RudolfoMD (talk) 18:13, 23 October 2023 (UTC)
I disagree with: "Unfortunately, communication with Frettie is at a very low level". In fact, the way to stop edit warring is to leave the "mistake" in place with deprecated status. That is the correct way. --Frettie (talk) 19:16, 23 October 2023 (UTC)
@Frettie, I see you continue to refuse to explain why is it better for the user to set it to obsolete, rather than the bot changing the data entry. This is unacceptable: if a person's occupation is Lutheran pastor (Q96236305), but is recorded in the NKC as priest (Q42603), parson (Q955464) or pastor (Q152002), Frettiebot will add it to the existing Lutheran pastor (Q96236305), sometimes all of them. Draceane is right. As you refuse to fix the bot, it should be blocked. It would be bad if the bot added them with the flag obsolete, but at least that would make leaving the bot running defensible. Adding them as it's doing is indefensible. RudolfoMD (talk) 19:29, 23 October 2023 (UTC)
I see that Frettiebot is still being run while complaints about its use are being discussed here. This is inexcusable. Vicarage (talk) 19:42, 23 October 2023 (UTC)
If a person is both a Lutheran pastor and a priest, that is OK, because he is pastor AND priest, not pastor OR priest; that's my point of view. So, if the bot is to be fixed: how? What is best practice? Do you have any ideas? If a value is imported and later removed, the bot doesn't have this information. The bot could save the triples "QID" + "PROPERTY" + "VALUE" from all runs and, if one of them comes up to be saved again, not save it. That is possible, but it will be slower. @Vojtěch Dostál: what do you think? It would add new values only once. --Frettie (talk) 06:36, 24 October 2023 (UTC)
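(Editorial illustration of the triple-saving idea in the comment above. This is only a minimal sketch, not Frettiebot's actual code: the class name ImportLedger, the JSON file format and the item ID "Q-example" are assumptions made up for the example. It shows a bot remembering every item/property/value triple it has ever tried to add, so that a value removed by a human is not offered again on the next run.)

```python
import json
from pathlib import Path


class ImportLedger:
    """Hypothetical on-disk record of every (QID, property, value) triple
    the bot has already attempted to add in earlier runs."""

    def __init__(self, path):
        self.path = Path(path)
        self.seen = set()
        if self.path.exists():
            # JSON stores tuples as lists, so convert them back.
            self.seen = {tuple(t) for t in json.loads(self.path.read_text())}

    def filter_new(self, candidates):
        """Keep only triples that were never attempted before (order preserved)."""
        return [t for t in candidates if t not in self.seen]

    def record(self, triples):
        """Remember triples after the bot has written them, and persist the set."""
        self.seen.update(triples)
        self.path.write_text(json.dumps(sorted(self.seen)))


# Sketch of one bot run (the actual write to Wikidata is left out on purpose):
#   ledger = ImportLedger("ledger.json")
#   todo = ledger.filter_new(candidates_from_source)
#   ... write `todo` to Wikidata ...
#   ledger.record(todo)
```

Because the lookup is a set membership test on a local file, the extra cost per run should be small compared with reading item histories; whether that holds for a dataset of this size is untested here.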
@Pallor From my point of view, Wikidata is a database aggregator. We collect data (with a bot) and then we sometimes curate them (usually by manually setting ranks). That's how I understand Wikidata's general approach. P.S. I note that your examples with Lutheran pastor (Q96236305) and Paris (Q90) aren't in fact examples of incorrect data, am I right? Vojtěch Dostál (talk) 18:18, 22 October 2023 (UTC)
Vojtěch Dostál yes, we collect data, but we are lucky that we are human beings, not machines, we can make decisions that machines cannot. We also operate the machines and we can tell them what to do and what not to do. With all this in mind, the aim obviously cannot be to put all the variations of all occupations, or all the occurrences of a settlement, on a data sheet and increase the noise to infinity, because that would turn the Wikidata database into a swamp. We can make good decisions and bad ones. The Lutheran pastor, Paris, and all the other examples not listed here show that it is possible to pour data into Wikidata that makes a piece of data - which was previously precisely defined - redundant or ambiguous. I can give a particularly bad example, where for a graphic artist/photographer's album of historical sites the descriptive data stated that the author was a historian, but your talk page also had a case of incorrect data. All my examples support the point that you should not spread data like this, you should give users a chance to correct what the source does not know well, you should not force the issue of putting up incorrect and redundant data at all costs. Pallor (talk) 18:42, 22 October 2023 (UTC)
You are obviously right about the importance of humans for Wikidata and I understand that. But I have a hard time understanding how the presence of "less precise" professions turns Wikidata into a swamp. How is the "profession:priest" statement preventing you from querying all Wikidata for all Lutheran pastors? I see how it would be a problem in Wikipedia, but isn't it a purely aesthetic problem for Wikidata? And on the contrary, if the source for "Lutheran pastor" is later deemed incorrect and the corresponding statement deprecated, because the person actually was a priest but not Lutheran, we still have a rough idea about his profession with the less precise statement... Vojtěch Dostál (talk) 18:57, 22 October 2023 (UTC)
I feel this is still our (my and Vojtěch's) ongoing dispute over data representation. IMO WD should be not only machine readable, but also human readable. For you it's just aesthetics, for many others this is a matter of usability. — Draceane talkcontrib. 14:47, 23 October 2023 (UTC)
Yes, I have the same feeling about this discussion :) It's about the desire of a part of the Wikidata community to turn it into a second Wikipedia :-). Vojtěch Dostál (talk) 06:52, 24 October 2023 (UTC)
WD is a curated database, yes we might reduce the workload by using bots for mass import, but if a human decides the information is wrong, I think they should remove it. There are clearly techniques in the AI world where a learning machine can absorb vast quantities of machine scraped information and do probabilistic assessment of which facts are most likely to be correct, but using them here would overwhelm the GUI we have, and I agree with @Pallor we'd have a swamp. Vicarage (talk) 19:07, 22 October 2023 (UTC)
We agree that we have to record certain data even if it is not true: this could be, for example, a historical error or a poorly drawn conclusion, since it is widespread, and we help to refute it by indicating it. But we usually do this on the basis of reliable sources and thus help to refute incorrect/erroneous data. But here the source itself is not perfect either, since - as I explained above - we do not take the data from a biographical database, but from a library catalog. The aim of the librarian was not to position the person between the denominations, but to distinguish him from the person of the same name, perhaps born in the same year, and for this it was sufficient to describe a more general, schematic occupation. It's like the system of tags and descriptions in Wikidata: you don't have to be extremely precise there either, but when you fill in the P106 field, you're obviously trying to create the most accurate model of reality, you're not forced to rough out the description. If someone is a high school teacher, we don't have to describe that he is an educator, an instructor, AND a high school teacher, the last one is enough, there is no need to add the other two - especially not if our source is not completely reliable in this regard. Pallor (talk) 19:33, 22 October 2023 (UTC)
I personally don't think that this is a majority view. I would be surprised if the community here really thinks that we should remove incorrect sourced statements rather than deprecating them. Can we somehow determine what the consensus really is? Let's write it down somewhere afterwards, because I feel I already had this discussion somewhere. Vojtěch Dostál (talk) 19:15, 22 October 2023 (UTC)
A bot is a machine. If some type of wrong edit is made very often, it is good to add an exception to the bot.
But it would be good to have a universal solution, not only one for this case. What about a bot which would deprecate statements that are one level higher than some other statement? When there is e.g. genre=adventure film (Q319221), the statement genre=film (Q11424) would be marked as deprecated. The same for occupation, place of birth, category combines topics etc. JAn Dudík (talk) 07:44, 23 October 2023 (UTC)
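(Editorial illustration of the "one level higher" check proposed above. A minimal sketch under stated assumptions: the hierarchy in TOY_PARENTS is invented for the example and is not claimed to match the real class tree on Wikidata, and the function names are hypothetical. The sketch only detects values that are ancestors of another value on the same property; what to do with them, whether deprecating, ranking or nothing at all, is exactly what the thread disputes.)

```python
# Assumed toy hierarchy: child class -> list of direct parent classes.
# On Wikidata this would have to come from the real class tree.
TOY_PARENTS = {
    "Q96236305": ["Q152002"],  # Lutheran pastor -> pastor (assumed for the example)
    "Q152002": ["Q42603"],     # pastor -> priest (assumed for the example)
    "Q319221": ["Q11424"],     # adventure film -> film (assumed for the example)
}


def ancestors(value, parents):
    """Return every class reachable upward from `value` (cycle-safe)."""
    seen = set()
    stack = [value]
    while stack:
        node = stack.pop()
        for parent in parents.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    seen.discard(value)  # a value is never redundant with respect to itself
    return seen


def redundant_values(values, parents):
    """Values of one property that are a superclass of another value present."""
    values = set(values)
    redundant = set()
    for v in values:
        redundant |= ancestors(v, parents) & values
    return redundant
```

With the toy hierarchy, an item carrying Lutheran pastor, pastor and priest would have the latter two reported, while an item carrying only priest would have nothing reported.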
That bot job would be against Wikidata rules. True statements should never be deprecated. Vojtěch Dostál (talk) 14:12, 23 October 2023 (UTC)
I don't generally agree. By applying this rule literally, we could add to all items instance of (P31) entity (Q35120), to all people place of birth (P19) Earth (Q2). Yeah, it's true, but um... If you added all superclasses of the statements, you would just make a WikiSwamp, incomprehensible for humans. — Draceane talkcontrib. 14:47, 23 October 2023 (UTC)
@Draceane That would be absurd, but I don't see a relevant source that collects all people born on Earth as opposed to people born on other planets :). Vojtěch Dostál (talk) 06:45, 24 October 2023 (UTC)
That's exactly Draceane's point. The examples he gives are true statements, yet as absurd as the additions the bot owner is being asked to stop making, and you are saying should not be deprecated merely because they are true. Your argument makes no sense. It seems like Frettie is trying, hard, to not understand, but AGF makes me assume it's a language barrier. (For clarity, I'm referring to the notion that "It is difficult to get a man to understand something, when his ego depends upon his not understanding it!")
Are the edits the bot is making so valuable as to outweigh the problems it's causing? I suggested an admin suspend the bot. RudolfoMD (talk) 00:45, 25 October 2023 (UTC)
It is sadly not Frettie who does not understand. Actually, I think other people find it hard to understand elementary rules of Wikidata: 1) Wrong sourced claims should not be removed but deprecated and 2) Preferred claims are marked with ranks, not by removing less precise yet true claims. These rules are essential to the way Wikidata operates and cause no significant problems at all to reuse of Wikidata, but it is sometimes difficult for Wikipedians to get a grasp of them. Vojtěch Dostál (talk) 14:19, 25 October 2023 (UTC)
After reading all this I feel a strong urge to express my agreement with Pallor and Vicarage. Not because I have new points to add in favor of their opinion, but as a counterpoint to Vojtěch Dostál’s claim that their point of view marks a misunderstanding of Wikidata's principles. I think this comes close to assaulting them and like-minded Wikidata users like me on a personal level. In my opinion, this discussion is too important to be bogged down like this. Let's try and keep the exchange productive and respectful, please.
On the point of Frettie's alleged "not understanding": My argument applies here too (mutatis mutandis). But I must confess I have a hard time understanding what you are trying to say, Frettie, because of your English phrasing. Maybe the same is true for others? Jonathan Groß (talk) 16:38, 25 October 2023 (UTC)

Discussion after bot suspension

A day ago, following our request on the administrators' noticeboard (Wikidata:Administrators'_noticeboard#Suspend_a_bot;_remove_incorrect_admin_claims?), Frettiebot was suspended until this discussion is closed. I'd like to lay down some basics (although I've already mentioned some of them).

  1. The transfer of data from the NKC database to Wikidata is fundamentally good, so it benefits Wikidata.
  2. Frettiebot has some useful edits.
  3. The goal is not to make a rule that says: a bot cannot fix or override a person's edit (see e.g. the {{Autofix}} template, which I think is useful).
  4. At the same time, we also don't want a bot to fill up Wikidata with unnecessary and/or wrong data in a way that CANNOT BE OVERWRITTEN.

If others agree with this point 4, then we respectfully ask Frettie to improve the operation of the bot, upload all data from NKC only once, and accept when this data is corrected or deleted. I am pinging a few people who have participated in the debate or have previously made a request to Frettie in a similar matter to write down if they can support point 4. Of course, VD and Frettie can also ping people who have previously commented on the question anywhere.

@Maculosae tegmine lyncis, Vicarage, RudolfoMD, Draceane, Jonathan Groß, GrandEscogriffe: @Emu, Canley, U. M. Owen, Andrew Gray, RAN, Jackie Bensberg: @Polarlys, Vanbasten 23: (I apologize to those who are no longer interested in the topic, but still had to come here) Pallor (talk) 23:00, 27 October 2023 (UTC)

  Support for point 4, although even if this were to become consensus (which it should), the assessment of what is "wrong data" will always be a point of contention. In any case, thank you for this clear and constructive comment. Jonathan Groß (talk) 05:40, 28 October 2023 (UTC)
  Oppose I find myself perfectly in accordance with Vojtěch's vision of what Wikidata is. We should aggregate first, sort (and not delete) later. I think Frettiebot is doing an important job of providing references to P106 that are too often not referenced, making them basically worthless. Frankly, I'd even wish other bots would do the same with LC or GND. Now sure, as it was said, NKC is only a library catalog, therefore it might not be the best source available, nevertheless it is a legitimate source. I think the real problem here isn't so much the bot's edits but rather how we model competing or hierarchical values for P106. The bot is only exposing the problem, but it would have come sooner or later. --Jahl de Vautban (talk) 06:38, 28 October 2023 (UTC)
Yes, you have explained the problem very well. If a person is a footballer it makes no sense to also add that he is an "athlete" or "sports people", because we would be filling Wikidata with useless data. Many Wikipedias use this data for their templates and what we are achieving is that these files are full of professions that do not inform readers of anything, on the contrary, they confuse them more. These users see it as normal and there is no room for much discussion. --Vanbasten 23 (talk) 07:54, 28 October 2023 (UTC)
  Support importing bots should only attempt to add data once. @Frettie allows his bot to do this multiple times, while not engaging, or even pausing his bot after multiple complaints. The huge differential in human time in setting a bot in action, and reviewing and flagging the results means anyone running a bot needs to be cautious, and what we have here is reckless behaviour. Vicarage (talk) 08:40, 28 October 2023 (UTC)
"importing bots should only attempt to add data once" is in practice super complicated and IMO not feasible in most situations. Use ranks to indicate which claims should be visible (and which ones not) to end users; and ask the bot operator not to import already existing values regardless of their ranks. —MisterSynergy (talk) 09:05, 28 October 2023 (UTC)
Since I have been pinged: I probably don’t understand all nuances but it seems to boil down to “do import unless an issue is raised with an edit or a type of edit, in this case resolve manually”. That’s the general idea with mass edits anyway, so yeah, no reason to act differently in this case. On a more general note: Vojtěch Dostál is right, no notes from me on that issue. --Emu (talk) 09:10, 28 October 2023 (UTC)
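(Editorial illustration of MisterSynergy's suggestion above that a bot should not import values that already exist, regardless of their rank. A minimal sketch: the dictionaries with "value" and "rank" keys are a simplified, hypothetical stand-in for an item's statements, not the real Wikibase data model, and values_to_add is an invented name.)

```python
def values_to_add(existing_statements, candidates):
    """Return the candidate values not yet present on the item.

    The rank of an existing statement is deliberately ignored: a value that
    a human has already marked as deprecated counts as present, so the bot
    does not add it a second time at normal rank.
    """
    present = {s["value"] for s in existing_statements}
    return [v for v in candidates if v not in present]


# Hypothetical place-of-birth statements on one item.
existing = [
    {"value": "Q90", "rank": "deprecated"},
    {"value": "Q-example-district", "rank": "normal"},
]
new_values = values_to_add(existing, ["Q90", "Q-example-town"])
```

Note that this only covers the deprecate-instead-of-delete workflow: if a human deletes the statement outright, the value is no longer present and a check like this would let the bot add it again.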
  Support for point 4 of course. Also I agree with Emu that there are two different issues. First, a general good practice of bot programming: bots should always accept human corrections and never get into edit wars. This should be consensual. Second, the more fundamental question of which kind of data should appear in Wikidata, which I am surprised has not been resolved earlier in the history of the project. There I am in the Pallor/Vicarage/Draceane/RudolfoMD/Jonathan Groß camp. I think that some statements are both true, sourceable, and useless because they are superseded by a more precise true statement, and that such statements should not appear in Wikidata at all. Yes, Vojtěch Dostál, my opinion is informed by Wikipedia, but what is the problem with that? Isn't supplying the Wikipedias the main original mission of Wikidata? Are there use cases of Wikidata where it is in fact useful to have large lists of redundant imprecise statements? --GrandEscogriffe (talk) 11:10, 28 October 2023 (UTC)
Sure but that doesn’t mean that we have to answer to the whims of infobox programmers from other projects, to put it bluntly. I often find it quite helpful (when researching and/or disambiguating) to have many statements of varying precision and even accuracy. This gives me a fuller picture of what is generally known about a person – whether true or not, whether precise or not. It also sometimes helps to trace how inaccuracies over time evolved into falsehoods. This is different from Wikipedia where we generally only strive for the best available version of the received opinion of the truth. --Emu (talk) 11:41, 28 October 2023 (UTC)
Can you give an example? GrandEscogriffe (talk) 12:18, 28 October 2023 (UTC)
I don’t have a good example for occupation (P106) at hand (and most of those cases would be hard to explain since there is often an element of language dependency and I mostly work with German sources) but in the past (in a very, very similar discussion) I have mentioned Q94694204#P569 as an example in that direction. --Emu (talk) 22:18, 28 October 2023 (UTC)
@Emu: I agree with you and Epìdosis below that keeping incorrect sourced statements as deprecated is useful. My problem is mostly with redundant (and therefore correct) statements. In your example, I do not see what the use can be of the correct, redundant statement date of birth (P569) 1831, unlike the deprecated 1841 values, which inform users not to add 1841 at normal rank. Every user (human or bot) who is tempted to add the imprecise 1831 should already "see" that 1831-09-09 is present. So the imprecise 1831 does not play the safeguarding role that deprecated common falsehoods do.
Also, this example has only one best-ranked value (as it should) so it does not clutter the external users*. A big problem with Frettiebot is that it puts everything at the normal rank. I would be much less bothered if it upgraded the already existing more precise statement to preferred rank every time it added a less precise statement. Although even then I would not really see the point.
*Of these users, I am familiar with Wikipedia, but I guess other external users also rely on the rank system, and I am really curious who these other users are. Perhaps Wikidata should not be at the whims of infobox programmers specifically, but it should make/keep itself useful to the people who use it. GrandEscogriffe (talk) 21:25, 3 November 2023 (UTC)
To take the example of Lina Wasserburger (Q94694204): The probably correct precise value is sourced with user-generated content and a primary source. The statement with year precision however has a secondary source, and so do the other deprecated statements. In theory, you could also query statements that are sourced by Österreichische Schriftstellerinnen 1880–1938 (Q104601081) against our best guess, thereby estimating the accuracy (and precision) of a given source, which to me is quite an interesting use case. And finally: Precise values can be deleted for all sorts of legitimate reasons – resulting in missing statements instead of other sourced statements with lower precision.
Don’t we have a bot job that periodically sets a preferred rank in those cases? Of course it would be ideal if Frettie took care but I imagine it’s not that simple. --Emu (talk) 21:53, 3 November 2023 (UTC)
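(Editorial illustration of the "set the most precise value to preferred rank" idea discussed in the comments above. A minimal sketch, not the code of any existing bot: dates are modelled as hypothetical (year, month, day) tuples with None for unknown parts, and the function names are invented. The caller is assumed to pass only the non-deprecated values. The sketch picks a preferred value only when every other value is merely a less precise form of it, and otherwise leaves the decision to a human.)

```python
def precision(date):
    """Number of known parts: 1 = year only, 3 = full day precision."""
    return sum(part is not None for part in date)


def compatible(coarse, fine):
    """True if `coarse` is the same date as `fine`, just stated less precisely."""
    return all(c is None or c == f for c, f in zip(coarse, fine))


def pick_preferred(dates):
    """Return the date that deserves preferred rank, or None if unclear.

    None is returned when the values genuinely conflict, which is the case
    where human review is needed.
    """
    if not dates:
        return None
    best = max(dates, key=precision)
    if all(compatible(d, best) for d in dates):
        return best
    return None


# The year-only value 1831 and the day-precision value 1831-09-09 agree,
# so the more precise one is selected.
agree = pick_preferred([(1831, None, None), (1831, 9, 9)])

# Two different years conflict, so no automatic choice is made.
conflict = pick_preferred([(1831, None, None), (1832, 5, 1)])
```

The same shape of check could in principle be applied to places by asking whether one place lies inside another, but that needs the administrative hierarchy and is not sketched here.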
  Oppose to point 4 (with one clarification at my point 4 below); first of all, I very much agree with @Jahl de Vautban: in the comment above: we should aggregate data from authoritative sources (among which national authority files are surely to be counted) and then rank the statements; the phrase "The bot is only exposing the problem" (of managing competing or hierarchical values for P106) perfectly summarizes the situation (BTW, since these topics are clearly of general interest, I think they would deserve an RfC, in order to involve more users; the Project chat has tens of messages each day and is very difficult to follow). However, since I understand the concerns motivating users who have expressed criticism of some aspects of the activity of Frettiebot, I would like to try to address these concerns by proposing a few solutions as alternatives to changing the present activity of the bot (points 1 and 2); I add a small comment about edit wars (point 3); finally, I would like to propose one change in the bot activity myself which, as far as I see, wasn't mentioned above (point 4). I apologize in advance because I will write a lot, but I think the importance of these themes deserves a detailed analysis.
  1. "is it correct for a bot to repeatedly enter the same data into an element if that data is incorrect, redundant or out of place?" (the initial question by @Pallor:): I think these three categories need to be considered separately (and, as it appears both from comments above, and from my personal experience, the most frequent problem is redundant data, so I will dedicate to this part more space):
    1. incorrect data can be entered repeatedly by a bot, if supported by an authoritative source (as I said above, IMHO national authority files are authoritative sources), for two reasons: 1) as a principle, "Wikidata simply provides information according to specific sources; those sources may or may not reflect contemporary thought or scientific consensus" (quotation from Help:Ranking); 2) technically, "'importing bots should only attempt to add data once' is in practice super complicated and IMO not feasible in most situations" (I'm not a bot operator, but I trust @MisterSynergy:, who is a bot operator, so I'm quoting his comment above). Given this premise, in order to avoid incorrect data being received by Wikipedia and other data reusers (a legitimate concern, which I obviously share), these incorrect data need to be set to deprecated rank (as stated by Help:Ranking#Deprecated rank), with qualifier reason for deprecated rank (P2241): error in referenced source or sources (Q29998666) (or typographical error (Q734832), useful in some specific cases). Of course keeping incorrect data as deprecated clutters the items, worsening their readability for humans (which is a legitimate concern, although I think it's rare to see more than 1 or 2 incorrect deprecated statements in the same item): this can be addressed in at least two ways, the first being collapsing not-best-ranked-values (see below point 2) and the second being data round-tripping, which I treat here at point 1.1.1.
      1. Data round-tripping (Wikidata:Data round-tripping) is crucial for Wikidata data quality because, if some authoritative database outside Wikidata contains mistakes, these mistakes risk to damage Wikidata in many ways as long as they exist (the most problematic way is e.g. a deprecated incorrect statement deriving from one import is removed on Wikidata, maybe from a user in good faith just judging it useless, and then another import readds it with normal rank, reintroducing the mistake in full power; the less problematic way, nevertheless problematic, is that deprecated incorrect statements clutter items); ideally we should have a workflow implying that a) when we notice that statement X, supported by an entry Z of the authoritative database Y, is incorrect, we are able to report this mistake to database Y; b) database Y reads our reports and solves them on a regular basis; c) once entry Z is fixed, we can remove statement X (I think that, once the supporting source is fixed, removing the statements has more advantages than keeping it as deprecated), ideally the removal should be performed by the curators of database Y at the same time as they fix entry Z. This workflow should be improved (see e.g. phab:T312718); the more efficient this workflow is, the less time incorrect statements remain on Wikidata. Of course improving this workflow is a task for Wikidata community and not for bot operators; however, if a bot operator has a longstanding collaboration with the curators of the database which they periodically import to Wikidata, they could encourage the curators of the database to improve this workflow (and to remove from Wikidata incorrect statements sourced by their entries, once they have fixed these entries).
    2. redundant data can be entered repeatedly by a bot, for the reason 1 quoted about incorrect data. Redundant data clutter the items, worsening their readability for humans (which is a legitimate concern, especially in the case of occupation (P106), and I very much share it; in fact I periodically remove unsourced redundant values of P106 to reduce a bit the issue, which is very serious): this can be addressed IMHO in one main way, i.e. collapsing not-best-ranked-values (see below point 2). Redundant data can also clutter Wikipedia and other data reusers receiving them (another legitimate concern), and this should be avoided using ranks. It needs to be noticed here that deprecated rank is designed "for statements that are known to include errors (i.e. data produced by flawed measurement processes, inaccurate statements) or that represent outdated knowledge (i.e. information that was never correct, but was at some point thought to be)" (quotation from Help:Ranking), so not for redundant statements, which aren't wrong stricto sensu. I propose two different procedures for ranking redundant values:
      1. for properties having single-best-value constraint (Q52060874) (mainly date of birth (P569), date of death (P570), place of birth (P19), place of death (P20)), if there are 2(+) values all supported by authoritative sources, the most precise one should get best rank; if the values only differ in precision (i.e. day vs year, or village vs municipality), the best rank can be motivated with qualifier reason for preferred rank (P7452)most precise value (Q71536040). I requested to do it for dates through a bot (preferably, but not necessarily operated by the same bot operator adding less precise values) a few years ago, and I think it is presently done by BorkedBot (per this task approved in 2021; @ BrokenSegue: could you confirm?); programming a bot to do the same for places, on the basis of recursive located in the administrative territorial entity (P131), should be doable and I would support it; of course, the automation has some limitations both for dates and places (see the mentioned bot task), e.g. if a birth date has values 1948, 31/10/1948 and 1949 (or a birth place has values Paris, XVIII arr. of Paris and Saint-Denis) we need a human to choose if 31/10/1948 (or XVIII arr. of Paris) deserves best rank, but in fact a bot can safely operate in most cases.
      2. for properties allowing multiple values (mainly occupation (P106)), which are more seriously affected by the issue of redundancy, two choices are possible: a) set to best rank all "good" values (with qualifier reason for preferred rank (P7452)most precise value (Q71536040)), leaving redundant values in normal rank; b) set to deprecated rank all redundant values (with qualifier reason for deprecated rank (P2241)value to be decided), leaving "good" values in normal rank. Since the number of "good" values is in most cases higher than the number of redundant values, I would probably prefer solution b) just because it would imply to change fewer ranks than option a); however, solution b) has the drawback of deprecating statements which are redundant but not wrong stricto sensu, and this contradicts the present definition of deprecated rank. I think this choice deserves further reflection and discussion. Once we choose one option, it can be mostly applied by the bot, as in the previous case: we just need a bot operating on the basis of recursive subclass of (P279), which will allow it to know which values are redundant and which aren't; of course I would support such a bot.
    3. out of place data (which I would define as values neither incorrect nor redundant, but problematic because they are placed under the wrong property) must not be entered by a bot, neither once nor multiple times. Given this principle, let's draw some practical consequences, outlining different responsibilities: 1) the community (not bot operators) should add constraints to properties, wherever possible, so that out of place data get marked as constraint violations; 2) bot operators must avoid adding data which trigger constraint violations, ideally using a mechanism which is always synchronized with constraints (which frequently are added, edited and sometimes removed); 3) if a guideline states that a certain combination of property-value is out of place and should be fixed to another one, but this guideline has not been "translated" into a constraint, bot operators are not required to know it (guidelines are scattered among various WikiProjects and it's often difficult to have in mind all of them); 4) however, if a user writes to a bot operator reporting that a certain combination of property-value is out of place according to a certain guideline and should be fixed to another one, the bot operator must comply with the mentioned guideline as soon as possible (I remember one such case, in which I had no complaint about Frettie's answer).
  2. "I feel this is still our (my and Vojtěch's) ongoing dispute over data representation. IMO WD should be not only machine readable, but also human readable. For you it's just aesthetics, for many others this is the matter of usability." (comment by @Draceane:). I very much agree with this comment of Draceane, Wikidata should be readable not only for machines but also for humans. In the points 1.1 and 1.3 I supported keeping inside items both incorrect statements (with deprecated rank) and redundant statements (either with deprecated rank, or in normal rank with most precise statements in best rank); the use of ranks I propose solve the issue of machine readability, meaning that Wikipedia and other data reusers can read only best-ranked data, thus avoiding incorrect and redundant data. In order to make items also easily readable for humans, I propose the solution of collapsing not-best-ranked values: if a property has 2(+) values and these values have 2 or 3 different ranks, a button appears near the property allowing the user to collapse (= hide) all values which haven't the best rank (i.e. if at least one value has preferred rank, all not-preferred-ranked values are collapsed; if a property has only normal and deprecated values, deprecated values are collapsed). I think a gadget like this would make items perfectly readable; the user should also be able to activate it by default (i.e. not-best-ranked values are collapsed when the item is loaded, and the user can just click the button near one or another property to show the not-best-ranked values for that property if they are interested).
  3. about edit wars between bots: of course they should not happen; I see basically two solutions: 1) the bot operators should encode in their bots some constraint like "if you make the same edit on the same item for a total of N times (with e.g. N = 3), stop editing the item" (I think we have no precise guideline about this, but it would be positive IMHO); 2) we probably need an admin bot which monitors items and, if an edit war between bots develops on one item (e.g. bots A and B adding and removing the same statement on the same item for N times, with e.g. N = 3), blocks both bots indefinitely from editing that item and sends a message to both bot operators about this. Solution 2 would make 1 not strictly necessary and I hope it's not too difficult to enact.
  4. finally [precisation], @Frettie: my request of one improvement to Frettiebot's handling of some occupation (P106) values: I have noticed that, for "composite" occupations recorded in NKC, Frettiebot sometimes duplicates them, adding both the composite occupation (correctly) and the basic occupation (incorrectly introducing a redundancy absent in NKC). To be clearer, some examples: humans being both historians and art historians, often sources support both values (e.g. Renate Kohn (Q66685235)) and so everything is fine, but in other cases (e.g. Renata Zemanová (Q95156951) before my last edit) the source NKC has only "historičky umění" as occupation but the bot added also the basic occupation "historian", which in fact is wrong because it is absent in NKC - I have seen other similar cases with "historian" wrongly added where in fact NKC has only "historian of X"; another example, humans being both professors and university professors, in nearly all these cases (e.g. Elliott R. Jacobson (Q112427327) before my last edit) NKC has only "vysokoškolští učitelé" but Frettiebot also added the basic occupation "professor". In these cases the mistake lies in how Frettiebot imports the data from NKC; I would ask to avoid such mistakes when the bot will restart its activity and possibly to try to spot existing cases like the ones outlined above and remove these values (here there is no need of deprecation, because in fact the source mentioned in the references does not contain such values). This is the only change in the bot activity I would require.
--Epìdosis 14:47, 28 October 2023 (UTC) P.S. I have added a subparagraph "Discussion after bot suspension" for better readability, feel free to edit it
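The redundancy check suggested above for properties allowing multiple values (a bot working on the basis of recursive subclass of (P279)) could be sketched roughly as follows. This is only an illustration under stated assumptions: the `SUBCLASS_OF` table is a hand-written stand-in for real P279 data, plain labels are used instead of item IDs, and `ancestors` and `redundant_values` are hypothetical helper names, not part of any existing bot.

```python
from typing import Dict, Set

# Toy stand-in for "subclass of (P279)" links; labels instead of real item IDs.
SUBCLASS_OF: Dict[str, Set[str]] = {
    "art historian": {"historian"},
    "university teacher": {"teacher"},
    "historian": {"scholar"},
}


def ancestors(value: str) -> Set[str]:
    """Return every class reachable from value by following subclass links."""
    seen: Set[str] = set()
    stack = [value]
    while stack:
        current = stack.pop()
        for parent in SUBCLASS_OF.get(current, set()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def redundant_values(values: Set[str]) -> Set[str]:
    """Values that are a (recursive) superclass of another value on the same item."""
    redundant: Set[str] = set()
    for value in values:
        redundant |= ancestors(value) & values
    return redundant


occupations = {"art historian", "historian", "poet"}
print(sorted(redundant_values(occupations)))  # ['historian']
```

Whether the values found this way are then left in normal rank or deprecated is exactly the open choice between options a) and b); the sketch only identifies them.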
Thank you for your work! I agree, just two notes:
  1. As you said, deprecating true statements should be avoided – enforcing our ranking rules is difficult enough as it is now without an ad-hoc extra rule just for a set of cases.
  2. Do we have examples where “statement clutter” is a real problem for human readability? I would imagine that our current color coding for ranks (enabled per default AFAIR) is helpful. In some cases, rearranging values by hand (best values first, followed by normal ranks, with deprecated ranks at the end) might be helpful. Collapsed values always carry the danger of overlooking important data and even adding those statements a second time. --Emu (talk) 22:36, 28 October 2023 (UTC)
@Epìdosis As for (4) - adding historian AND art historian and how it happens - we actually have a conversion table that prevents cases like this and tries to understand the whole phrase "art historian" in descriptions (see [1]). The occupation "historian" was added by me to that item two years ago, before this specific handling of occupations was possible. Vojtěch Dostál (talk) 15:31, 29 October 2023 (UTC)
I'm still preparing an answer, but it's slower because of my work (an anon archived the section) Pallor (talk) 09:37, 31 October 2023 (UTC)
Thank you for your patience.
I also thank Epidosis for the very detailed summary. Many strategic questions have now been emphasized, but I still feel that we would say yes to a data entry method that will gradually make Wikidata more difficult for both machines and humans to read in the long term. This situation is like when the waves of the sea wash over the shore, which we take for granted and do not put a stop to it. But when the water starts washing garbage ashore, we can't say again, "this is the order of nature" and let it happen. In this case, something must be done to keep the coast free from waste, we must install some kind of filter in order to save both the water and the coast from garbage.
Let's start with the most important, the source.
You write that you think the "national authority files" are the authentic data. I already wrote about this above: it is a library database, that is, it serves to record the descriptive data of books. This is complemented by a database that lists the authors of the books in just enough depth to distinguish authors of the same name. What this resource can be used for is to find out the title of each book, its author and publisher, where and when it was published, its size, weight, number of pages and binding, its subject, etc. For these data, the NKC is just as authentic as any other national library. However, this database cannot be used to find out the authors' authentic and precise(!) biographical data. Not only does the database not record who studied where, what education they obtained or what their family relationships are, it does not even record exact birth and death data. It is satisfied with knowing that someone was born and died in a certain year, but not where and on what day, simply because this is not needed in the NKC database: it fulfills the purpose it was created for without it. The same is the case with occupations: it is enough for the NKC to describe someone as a priest, teacher or athlete. This is a necessary superficiality that satisfies NKC's needs, but not Wikidata's.
This situation is like using the database of a company that trades in agricultural products as a source for SI units, on the grounds that it also uses the terms ton and meter, or processing the product range of a paint factory as a source for the values of chemical compounds, on the grounds that chemical engineering is also behind it. Both example databases can of course be used, but only in the right place. The database of the Czech National Library can also be used when it comes to books; in fact, it can be used to create elements for persons missing from Wikidata. But for precise data, a biographical source must be sought, rather than constantly rewriting superficial data just because someone somewhere on the world wide web belched them up. I would emphasize again that the problem is not that this data was found, but that, although it obviously came from an inappropriate source, it was constantly rewritten.
If, for example, we took over the birth data of persons, then in addition to all the values entered precisely in the format year, month, day, we could also include values containing only the year. Or, for all places of birth or death that are narrowed down to a specific administrative unit, we could also enter the broader unit one level above it. Would this data be wrong? No, it is just not as accurate as what is already in Wikidata. If we accept Epidosis's argument, we open the door to adding ever more superficial data from any database. In fact, we could even do it automatically ourselves, since none of it would be a lie, and sooner or later we would be sure to find a source that supports it: enter only the year of birth under each date of birth. Would Wikidata be better for that? We have to enforce practical aspects that preserve the coherence of Wikidata. And if we don't want to write 1720 next to the exact date of birth (May 8, 1720) in every element, then we have to follow a similar principle for occupations: we don't want to write pastor next to Lutheran pastor, pedagogue next to secondary school teacher, or engineer next to hydraulic engineer, because it is completely unnecessary. This will just flood Wikidata with unnecessary and meaningless data.
(I'm showing one more error in Frettiebot's editing, which someone may find correct, but I think it's grossly unnecessary. Some positions usually have an element that applies to a specific country and a specific position. For example, for the representatives of a country's parliament, using position held (P39): member of parliament (Q486839) is obviously not an error, but where a local element exists, we use it (see.
Compared to this, Frettiebot mercilessly wrote that the person was also Q486839 for the persons for whom it was mentioned in the source, even though the more accurate element was already there. This query shows the current situation, i.e. those who are members of parliament, their position element has Q486839 as a subdivision, but P39:Q486839 is also specified. There are currently 958 results, of which 459 are Czech or Slovak. Let's look at two: Q1294312 or Q895898. Both have Q486839 with five or six sources, all of which are NKC. Do we need this? No. Can we expect the NKC to describe the precise position in the given context as we would use the Wikidata table? Again no. Whether Frettiebot added this unnecessary element or it was included, it is clear that the proportion of data added unnecessarily would decrease by 50 percent if the data Q486839 were deleted from them, or if we look at the reverse, then the number of meaningless data increased by 100 percent. If we project this onto the properties of birth dates, places and occupations, we can see how much Wikidata would swell if we accepted that superficial data should also be included. I only examined this for a single position, obviously if you look at the number of presidents, finance ministers, museum directors, fire chiefs, etc. is in the database, which can be titled as president, minister, director, commander using a more superficial database as a source, essentially we could "expand" Wikidata indefinitely, without adding a single meaningful piece of data. Not to mention that if a co-editor writes it in, we can correct it, but if it's a bot, we can't?)
Of course, I understand the part of the argument that says that if a common biographical error needs to be corrected, an excellent method is to record the data, source it, make it obsolete, and indicate the correct (according to more recent research) data accurately and with sources, but I think it is quite clear that this is not the case in these cases.
I still maintain that bot editing should end where it uploads a piece of data once, and then leave it to the community members (the people) to decide whether the data is important and necessary, and to act responsibly, without having to fight a bot. Pallor (talk) 15:49, 3 November 2023 (UTC)
Both of the positions expressed throughout this thread can be sympathised with, but I think Pallor's post here is an excellent representation of the general approach to weighing which sourced statements belong in Wikidata. My own opinion of this is formed by having read help pages over the years and finding their advice to be well reasoned and appropriately opinionated.
In summary, this thread revolves around three concepts:
  • Imprecise statements (to be unprioritised). There are infinitely many true statements under an open world assumption. As such, these are unnecessary where a more precise statement is available. The exception to this rule is when their sourcing makes the imprecision somehow notable (e.g. an imprecise year of birth thought to be irrecoverable and widely sourced as such, later discovered precisely in historic records).
  • Incorrect statements (to be deprecated). There are infinitely many incorrect statements under an open world assumption. As such, these are unnecessary where their sourcing is insignificant or not authoritative.
  • Appropriate sourcing. This is the crux of the issue discussed here, because it applies to both of the above. I think Pallor has covered it well in the message I'm responding to, but we probably shouldn't be pulling biographical data from a library database unless there is no better source already present.
As for the question of the bot, I agree that not restoring removed statements seems appropriate. Adding them in the first place may generate some cruft, but that's not a huge deal - which is why removing it should also be respected. SilentSpike (talk) 09:20, 5 November 2023 (UTC)
I’m not sure why everybody seems to be so hung up on the fact that NKČR is a library database (or at least has its origins in this field). Why does this make the database less authoritative? --Emu (talk) 11:37, 5 November 2023 (UTC)
Let's look at a specific example to understand: Walt Whitman is perhaps a well-known American poet and essayist, so that we can use his data to examine whether the NKC data is suitable as a source. This is what the NKC data sheet looks like: jn19990009101
This is what some other biographical database items look like:
It can also be determined at a glance that the NKC's data are incomplete and simplified. But not because the NKC is bad, but because the NKC has enough data for its own purposes to distinguish the American writer Walt Whitman from, for example, the American actor Walt Whitman. For Wikidata, however, this is not enough data, because Wikidata strives for completeness. More is needed here.
But I would also like to add that it is not a problem that many new elements have been added based on the NKC, because each new data sheet opens a door to expand these data sheets, supplement and correct incorrect or incomplete data, and remove unnecessary data. The problem is that this data is written back again and again by the bot, you can't get rid of it. It is as if they want to convey through the bot that there is no more accurate data than the NKC data, although we can clearly see that the data is insufficient because it comes from a database that does not provide a complete biography. This makes the concept flawed. Pallor (talk) 12:20, 5 November 2023 (UTC)
  • I had some problems with the bot over the summer but those were fixed. My thoughts on the general principles -
  1. Wikidata has our own data model, and it may not view the world in exactly the same way as other databases. This is fine - we don't need to mirror the exact structure and content of every other database. For example, whether a certain thing goes in occupation (P106) vs position held (P39) was the issue I had problems with. Similarly we may not want to have a generic item for something (like member of parliament (Q486839), mentioned above) when we can have a more specific one. So if Wikidata prefers to use a different property, or something more precise, we should not worry about imports being moved or updated afterwards.
  2. A bot should not be edit-warring with people or with autofix bots. If its edits are being repeatedly undone - especially on a day-by-day basis - it should not keep making them. It might be the autofix bot is wrong - so fix that instead, don't just keep making edits that will get undone.
  3. Considering 1 and 2 above, "Only upload data once" is a good rule of thumb to aim for. Reuploading data should only be done when you are doing it intentionally and you have a reason for doing it.
  4. Deprecating "wrong information" is good but it shouldn't be done just because we imported it in the wrong way - if it's something like "this value should be in P39 instead of P106" then it's just going to confuse people to keep a deprecated value around. It implies it is incorrect / outdated when it's simply misplaced. Andrew Gray (talk) 23:23, 3 November 2023 (UTC)
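Point 2 above (a bot should stop making edits that keep getting undone) is close to the "N = 3" rule suggested earlier in this thread. A minimal sketch of such a guard might look like this; `RepeatGuard` and its `allow` method are hypothetical names chosen for illustration, the item and value in the usage lines are placeholders, and a real bot would need to persist the counts between runs.

```python
from collections import Counter
from typing import Tuple

Edit = Tuple[str, str, str]  # (item, property, value)


class RepeatGuard:
    """Refuse to make the same edit on the same item more than max_repeats times."""

    def __init__(self, max_repeats: int = 3) -> None:
        self.max_repeats = max_repeats
        self.counts: Counter = Counter()

    def allow(self, edit: Edit) -> bool:
        if self.counts[edit] >= self.max_repeats:
            return False  # probably an edit war: leave it to humans
        self.counts[edit] += 1
        return True


guard = RepeatGuard(max_repeats=3)
edit = ("Q1", "P106", "historian")  # placeholder item and value
print([guard.allow(edit) for _ in range(5)])  # [True, True, True, False, False]
```

The counter is keyed on the full (item, property, value) triple, so the same value can still be added to other items.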
Three notes to @Andrew Gray's points: 1) We *are* trying to get the bot to understand the Autofix templates and NOT editwar with the autofix bots. This is sometimes difficult for us (I think no other bot is trying the same thing as we are) and it would be better for everyone to come up with a systematic solution for all bots. Currently it is difficult for the bots to load all these autofix commands and keep them updated in our code. 2) We are not asking the community to deprecate our statements in cases where the value was just moved from property to property based on Autofix rules. 3) However, in basically all other cases, as MisterSynergy pointed out, it is virtually impossible for bots to avoid adding the statement unless we stick to our rules and deprecate wrong sourced statements. Therefore, we are asking the community to respect this rule, so that content-adding bots have a place in Wikidata. Vojtěch Dostál (talk) 09:49, 5 November 2023 (UTC)
Vojtěch Dostál: let's be careful not to read what others have written one-sidedly. It's not the problem that sometimes the bot enters incorrect or unnecessary data (although we talked a lot about choosing the right source, didn't we). Sometimes people mess up the data entry, it happened to me too. That's not the problem, because it can be fixed.
The problem is that the bot uploads incorrect and unnecessary data again and again and again, even though people delete it, which means it CANNOT BE CORRECTED. This should be changed. Pallor (talk) 10:08, 5 November 2023 (UTC)
Actually, as you know, the bot does *not* reinsert the wrong statement if it is not removed but deprecated instead. So it is not true that the wrong data entered by the bot cannot be corrected. Vojtěch Dostál (talk) 10:29, 5 November 2023 (UTC)
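The difference between deleting and deprecating can be pictured with a small sketch. It assumes a bot that skips a value whenever a statement with that value already exists at any rank; this is a guess at the kind of check being described, not Frettiebot's actual code, and `should_add` is a hypothetical name.

```python
from typing import List, Tuple

# (value, rank), where rank is "preferred", "normal" or "deprecated"
Statement = Tuple[str, str]


def should_add(existing: List[Statement], value: str) -> bool:
    """Add the value only if it is not already present at any rank."""
    return all(existing_value != value for existing_value, _rank in existing)


# A human deprecated the wrong value: the bot still sees it and stays away.
print(should_add([("historian", "deprecated")], "historian"))  # False

# A human deleted the wrong value: nothing is left to see, so the bot re-adds it.
print(should_add([], "historian"))  # True
```

Under this assumption the item itself is the bot's only memory, which is why a deleted statement comes back and a deprecated one does not.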
Then let's start over, because it seems the essence of the discussion didn't get through.
The request is that the bot does not upload the same data over and over again. The bot is a machine and cannot decide whether that data is unnecessary or incorrect. Sometimes there are data that are both incorrect and unnecessary. Part of the reason for this is that the bot spreads them based on an inappropriate source.
However, people can decide and have the ability to correct it. Either by making it obsolete or by deleting it. My proposal is to leave this decision to the people. Let the bot upload the data once and let people decide what to do with it.
I see that there is a consensus that certain erroneous data should be preserved and marked out of date (at least that's what I communicated). Perhaps there is agreement that certain unnecessary data should simply be deleted. Should there be an agreement that this bot should decide, or should we leave it to the people? I prefer the latter. Pallor (talk) 11:25, 5 November 2023 (UTC)
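For comparison, the "upload once" proposal would require the bot to remember its own past uploads independently of what is currently on the item. A rough sketch, assuming a simple in-memory ledger (`UploadOnceLedger` and `try_upload` are hypothetical names, and the item and value are placeholders; a real bot would need persistent storage, and it was noted earlier in this thread that doing this in practice is considered complicated):

```python
from typing import Set, Tuple

Claim = Tuple[str, str, str]  # (item, property, value)


class UploadOnceLedger:
    """Remember every claim the bot has ever offered, whatever happened to it later."""

    def __init__(self) -> None:
        self._uploaded: Set[Claim] = set()

    def try_upload(self, claim: Claim) -> bool:
        """Return True the first time a claim is offered, False ever after."""
        if claim in self._uploaded:
            return False
        self._uploaded.add(claim)
        return True


ledger = UploadOnceLedger()
claim = ("Q1", "P106", "historian")  # placeholder item and value
print(ledger.try_upload(claim))  # True: first import
print(ledger.try_upload(claim))  # False: a human may have removed it, do not re-add
```

With such a ledger the bot would not re-add a claim whether humans deprecated, corrected or deleted it.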
Respectfully, I know what your proposal is. However, if humans remove incorrect statements, Wikidata will be a much more difficult world for bots. I am merely suggesting that we humans agree to not remove unnecessary or incorrect data - and rather set ranks to them, as is the official Wikidata policy. I feel that we both already know what the other wants, and it's now on the community to either go my way or suggest amendments to Help:Ranking Vojtěch Dostál (talk) 17:06, 5 November 2023 (UTC)
  Oppose Generally it is possible to remove an incorrect statement, because it might have been added by mistake (even with a source). But if some bot keeps re-adding it, it is better to deprecate the statement and prevent bot revert-warring. JAn Dudík (talk) 07:02, 8 November 2023 (UTC)

Summary

This conversation will be archived soon, so I would like to summarize it.

The claim raised was: "At the same time, we also don't want a bot to UNOVERWRITABLE fill up Wikidata with unnecessary and/or wrong data." In other words, Frettiebot should upload a piece of data only once, and then entrust the judgment and fate of that data to the (human) users.

Some of the contributors to the discussion expressed their agreement or opposition by using a template, in which opinions are equal:

supported by Jonathan Groß, Vicarage, GrandEscogriffe
opposed by Jahl de Vautban, Epidosis, JAn Dudík.

The others did not use a template, but you can reconstruct from their comments whether they supported or opposed it (if I drew the wrong conclusion, please let me know):

opposed by @Emu, Vojtěch Dostál, Frettie:
supported by @Vanbasten 23, SilentSpike, Andrew Gray: and finally myself, Pallor [edit: adding myself - RudolfoMD - I also expressed support]

I judged that @MisterSynergy: suggested a third, intermediate solution.

From the summary, I came to the conclusion that several people support the idea that the bot should add a given value to Wikidata only once.

I hope this lesson can be used for future data dissemination by other bots. Questions such as choosing the right source or creating a project sheet to record a significant amount of data, for example, were not discussed, but this discussion may provide ammunition for a debate about these later. Pallor (talk) 10:11, 14 November 2023 (UTC)

I honestly don't know what the lesson from this discussion is. Some people agree with our rules at Help:Ranking, some don't, but I don't see a consensus for change. Many prolific bot operators explained why it would not work to remove wrong statements instead of deprecating them. Still, one bot is blocked as a result of this inconclusive discussion. Vojtěch Dostál (talk) 20:22, 14 November 2023 (UTC)
Also, could you please stop making the false claim that Frettiebot added 'unoverwritable' data again and again? I explained numerous times that this is not true, and Frettiebot would not add the data again if they are correctly deprecated. --Vojtěch Dostál (talk) 20:25, 14 November 2023 (UTC)
Unfortunately, from the very first moment, I feel that the communication moves at the level where you react to whatever you want, but you do not write anything to those comments to which you do not have an adequate answer, in fact, you pretend that they were not written at all. This had already been expressed in me before, but I did not want to make this discourse personal. Thus, it is naturally hopeless to reach a consensus, this expectation is only good for delaying the conclusion of the debate. In this situation, of course, there is no other option than to accept the majority opinion.
If it's really the case that you don't understand what the seven editors who agreed with my suggestion were trying to achieve, then at this point in the discussion I can't recommend anything other than re-read the conversation. If you really understand what it's about, you just want to dramatize the situation, then please find another partner, because I don't want to get involved in this play.
I will propose to remove the restriction of the bot's operation, but with the guarantee that it will only write a piece of data to Wikidata once. Pallor (talk) 22:50, 14 November 2023 (UTC)
I’m still puzzled by this discussion: What exactly seems to be the problem? I think we can all basically agree with the idea that a bot shouldn’t UNOVERWRITABLE fill up Wikidata with unnecessary and/or wrong data. I for one would support such a statement albeit not in the context it was put forward: It has been shown how to handle wrong data. As to “unnecessary”, well, there seems to be some disagreement about what constitutes necessity – but that’s not really a bot issue per se, is it? --Emu (talk) 00:19, 15 November 2023 (UTC)
Was it ever established what fraction of the bot's changes are regarded as unhelpful? Before we re-enable it we need to know how much human time will be spent clearing up after it, so we can assess whether it is of net benefit to WD. Vicarage (talk) 04:43, 15 November 2023 (UTC)
Is unnecessary the same as unhelpful? If so, the core of the problem still doesn’t seem to be the potential various misdeeds of the bot but rather different opinions about necessity and helpfulness … --Emu (talk) 06:27, 15 November 2023 (UTC)
Emu: I want the bot to publish data for an item only once. After that, whatever happens to this data - community members make it obsolete, delete it or fix it - it would no longer publish this data. This is clearly a bot operation issue. Pallor (talk) 09:45, 15 November 2023 (UTC)
Of course, I also oppose, btw. A rule that a bot is allowed to import only ONCE is a very dangerous precedent and may threaten Wikidata as an updated (and still current) database. It would de facto set a precedent where any human-edited item can never be overwritten or edited by a bot. And I see that as a huge threat.--Frettie (talk) 10:38, 15 November 2023 (UTC)
I agree. This proposal would mean that each database could only be imported by a bot once. This would eliminate one of the main advantages of bot usage: Updating statements isn’t exactly fun and we need fun to attract and keep human volunteers. Therefore, this cumbersome process should be left to bots if possible. And they can’t do that if they can only touch statements or even items once. --Emu (talk) 11:07, 15 November 2023 (UTC)
Sorry, but there is a fundamental misunderstanding here. This section is the summary; the discussion continued one level up. What you are writing here you have already partially described above, and I think it is unnecessary to repeat it in every section. If I had not opened a new section for the summary, the discussion would have been archived today.
What has not been answered at all is, for example, the inappropriate choice of sources. The same goes for the issue of mitigating the redundant data (for example, but not exclusively, what will be the fate of "National Assembly representatives") and for the low-quality communication (even now I could point to a section on Frettiebot's discussion board that is unresolved). And of course I could give other examples that could have been discussed in the above section, but were not.
For my part, I insist that Frettiebot not get its editing rights back as long as there is a danger of it uploading unnecessary data to Wikidata, because that poses a greater threat to our database than the issue of updates, and as the dispute stands it seems that this bothers several editors. Pallor (talk) 11:37, 15 November 2023 (UTC)
But those are very different things. Not responding at all is a problem, that is true. Not fixing obvious problems is a problem too. But uploading data you consider to be low quality or unnecessary isn’t a problem per se. The discussion has clearly shown that your views on those concepts aren’t exactly consensus. --Emu (talk) 11:42, 15 November 2023 (UTC)
Volunteers who find their accurate changes overridden by a bot won't stay. At least in a dispute with a person there can be discussion, and the time devoted is equal on both sides. That's not true with a bot, especially if the owner will not engage. Remember that Frettiebot continued to run well after the issues were raised. Vicarage (talk) 11:57, 15 November 2023 (UTC)
Please be aware that accurate changes, when done properly, are never overridden by Frettiebot. Vojtěch Dostál (talk) 12:28, 15 November 2023 (UTC)
Of course, I understand that, from the point of view of the storage space, uploading unnecessary data is not a problem, as it will fit. I also understand that it is not a problem from the point of view of some queries either, because whoever is looking for version "A" does not mind that there is also "A1" and "A2" and "A3". However, there is a problem when we perform maintenance and want to clarify the data entered with the A3 version and correct it to "A" or/and "B" or/and "C".
It should also be seen that the description of Wikidata data states that the data should serve to better understand the given thing. If I write "pastor" next to the occupation of an evangelical pastor, do we understand better? I do not think so. If I write "member of parliament" next to the position of a member of the Czech parliament, will it be more understandable? No. This is a method that goes against the basic principles of Wikidata. Pallor (talk) 12:37, 15 November 2023 (UTC)
I understand that you only want the most precise version of a given information to be in Wikidata. Several people including me have tried to explain why this might seem like a good idea at first glance but at the same time is a flawed concept and indeed in many cases detrimental to the project (the occupation (P106) issue is a little more nuanced, I’ll give you that, but we seem to be beyond nuances at this point). In any case, this wish can’t be grounds for blocking a bot. --Emu (talk) 16:51, 15 November 2023 (UTC)

SIMPLE: Lymantria blocked Frettiebot "Until resolution of issues on Frettiebot's editing". The consensus that issues with the bot's editing require code changes (which were not forthcoming) is what caused the block and is the reason it's still in place. Code changes to address the issues haven't been made. There are no grounds for unblocking the bot. The end.

Folks, if you don't understand what the issues are, then "re-read the conversation". If you still don't understand what it's about, leave it to those who do.

RudolfoMD (talk) 05:49, 19 November 2023 (UTC)

Nope, this is not an accurate summary of this discussion. I am afraid I see no clear consensus for change. @Lymantria, what do you think the bot owner should do in this case to qualify for unblocking? Vojtěch Dostál (talk) 08:56, 23 November 2023 (UTC)
Lymantria: we did not receive substantial answers to some of the questions that arose, so no consensus could be formed. At the same time, Vojtěch also admitted that some of the entered data was unnecessary. In the summary, I quantified that more people opposed the previous operation of the bot than supported it. Of course, the operation of the bot cannot be blocked forever, but the previous operating principle does not have adequate support. These aspects must prevail. Pallor (talk) 09:40, 23 November 2023 (UTC)
@Pallor Yes, there is indeed some opposition to how the bot operates, but the discussion should not be evaluated by the number of 'votes'. Furthermore, that opinion collides with some of our key principles outlined in our written documentation, which I've linked before. To me, the fundamental question arising from this discussion is how we should operate bots when there are clashes on these fundamental principles: should this be further discussed here, should the bot operator start an RfC on these fundamental topics, or should it be discussed via a Request for a Bot Permission? This is why I tagged @Lymantria, who I think is experienced in these matters, but of course anyone else's opinion is also appreciated. Vojtěch Dostál (talk) 10:00, 23 November 2023 (UTC)
Is there anything concrete (beyond your rather far-reaching ideas about necessity and usefulness) that the bot operator could fix? --Emu (talk) 18:13, 23 November 2023 (UTC)
Emu Yes, the data should be uploaded by the bot only once.
I realize that I am not considered an old editor, because I have only been here for 5 years. But I have never, ever seen a data import that added the same data to an item multiple times. So far, the examples I have seen were ones where the bot entered data into an item once. After that, the editors decided whether that data was appropriate and relevant or not. If Frettiebot also works this way, then I will be satisfied. Pallor (talk) 10:32, 24 November 2023 (UTC)
I think it has been sufficiently shown by MisterSynergy that this is not a reasonable thing to ask from a bot operator. --Emu (talk) 11:16, 24 November 2023 (UTC)
I don't think that a bot should judge whether sufficiently sourced data is "unnecessary" or not. I do think, however, that ranking correctly can be requested from a bot. A bot should not be asked to deprecate correct data, but it can be asked to give preferred rank to more (in fact the most) precise data, which it can determine by subclass of (P279). Is Frettie able to change his bot in order to take care of this? If data is wrong, but sourced, it should be deprecated if a (bot or a) human notices that. I noted that Frettiebot recognises deprecated data and does not change its ranking. Correct but possibly "unnecessary" data I judge as unproblematic if it comes from a source that has been shown to be useful, as is the case in this discussion. --Lymantria (talk) 19:52, 23 November 2023 (UTC)
The request to assign a preferred rank if more precise information is already available seems fair to me. --Emu (talk) 09:17, 24 November 2023 (UTC)
Lymantria, I'm sorry that you see it this way, since I wrote at length about the fact that the source used is not the best one and that there are better sources, and I supported this with examples. Much of the data entered is too imprecise, redundant or simply not Wikidata-compatible. With such a decision, we are opening the door to adding, for all parliamentarians, ministers, mayors, ambassadors, etc., the general designation next to the already existing specific position, only for it to be deprecated: Minister of Foreign Affairs in Belgium (Q1670832) = minister (Q83307) or Lord Mayor of London (Q73341) = mayor (Q30185), etc. We are sure to find a source where these common names are mentioned. And it's actually priceless to find a generic, unnecessary name for anything and fill Wikidata with it. So I still maintain that this is not a good source for the uploaded data.
If I ask you to write an RfC for this, will you do it? I have not done this before and my English is not strong enough. Pallor (talk) 10:23, 24 November 2023 (UTC)
I fixed the indentation on your comment, Pallor. Also, I think it falls to someone wanting the bot reactivated to make the case/write an RFC, rather than on Lymantria. My summary was accurate. RudolfoMD (talk) 11:39, 24 November 2023 (UTC)
Okay, let’s be more specific: Are you suggesting that the bot shouldn't import certain positions that are unsuitable for occupation (P106) usage because they belong to position held (P39) and are too unspecific? And could you come up with a list of those values? This could be a compromise that is beneficial to Wikidata. --Emu (talk) 14:31, 24 November 2023 (UTC)
The solution has been presented umpteen times. The bot should keep track of what it has added (or use Wikidata history) so that it does not override manual deletions. Again, this is not just about P106. RudolfoMD (talk) 21:55, 24 November 2023 (UTC)
I repeat: I think it has been sufficiently shown by MisterSynergy that this is not a reasonable thing to ask from a bot operator. --Emu (talk) 22:03, 24 November 2023 (UTC)
I don't. It wasn't shown. And he did NOT say it was infeasible in this situation. FS! RudolfoMD (talk) 22:24, 24 November 2023 (UTC)
To expand on this: The "09:05, 28 October 2023 (UTC)" doesn't 'show' anything. It makes a claim. And not the one you present.
Clarifying what is the most appropriate solution IS productive, IMO. RudolfoMD (talk) 22:32, 24 November 2023 (UTC)
Also, the bot doesn't have to literally keep track of what it has added or use Wikidata history; when re-run, it could add only new data by extracting only new data in the first place. RudolfoMD (talk) 23:48, 24 November 2023 (UTC)
I don't mind if we decide to run a bot which up-ranks the most precise occupations, and if this is the only issue standing in the way of unblocking the bot, I am sure Frettie would assist and we could together devise such a bot job. But can you please make this proposal clearer? Because we need to define such a job and we need the help of those who propose it. For example, we might want to up-rank all occupations when no other statement is a subclass of that occupation. We probably want to do this for all statements, not just the sourced ones. However, we might want to skip the statements which already have a non-normal rank. And we might also skip all items where no occupation statement is a subclass of another occupation statement. This is already getting quite complicated and it shows why ranking is usually left to human editors... Vojtěch Dostál (talk) 19:58, 24 November 2023 (UTC)
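The rules outlined in the comment above could be sketched roughly as follows. This is only a sketch under stated assumptions: statements are plain dicts with hypothetical `value` and `rank` keys, the subclass of (P279) relations are passed in as a prepared dict rather than fetched from Wikidata, and the occupation names are stand-ins for real QIDs. The function only plans which statements would receive preferred rank; it edits nothing.

```python
from typing import Dict, List, Set


def ancestors(qid: str, parents: Dict[str, Set[str]]) -> Set[str]:
    """Return all transitive superclasses of qid (following P279), excluding qid."""
    seen: Set[str] = set()
    stack = list(parents.get(qid, ()))
    while stack:
        cur = stack.pop()
        if cur not in seen:
            seen.add(cur)
            stack.extend(parents.get(cur, ()))
    return seen


def plan_preferred(statements: List[dict], parents: Dict[str, Set[str]]) -> List[str]:
    """Return the occupation values that would be up-ranked on one item."""
    # Statements that already have a non-normal rank are skipped entirely.
    normal = [s["value"] for s in statements if s["rank"] == "normal"]
    anc = {v: ancestors(v, parents) for v in normal}
    # A value is "generic" if another value on the item is a subclass of it.
    generic = {v for v in normal if any(v in anc[w] for w in normal if w != v)}
    if not generic:
        return []  # no subclass pair on this item: do nothing
    return [v for v in normal if v not in generic]


# Hypothetical example data; the names stand in for QIDs.
example_parents = {"evangelical pastor": {"pastor"}, "pastor": {"cleric"}}
example_statements = [
    {"value": "pastor", "rank": "normal"},
    {"value": "evangelical pastor", "rank": "normal"},
    {"value": "writer", "rank": "normal"},
]
example_plan = plan_preferred(example_statements, example_parents)
```

Note that an unrelated occupation ("writer" here) also has to be up-ranked, since a preferred rank on one statement would otherwise hide it from queries that only return best-ranked values. The sketch does not address later changes to the subclass tree or ranks set manually by users.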
I'm skeptical that Frettie is willing to make such a bot. Evidence is needed. (Also, my read is that there is much opposition to this solution, as there is a lot of support for not adding low-quality position info when there is high-quality position info; the bot should simply be modified to stop adding low-quality position info when there is high-quality position info. Many maintain it is the case that Frettiebot kept adding 'unoverwritable' data again and again because deprecation is not the correct solution; saying it is over and over doesn't make it so. And you've been chastised for pushing this over and over already, e.g. by Jonathan Groß.) There's already a ton of info on what the bot should not add for Frettie to act on, but no interest expressed in doing so that I have seen. RudolfoMD (talk) 21:17, 24 November 2023 (UTC)
Please try to be productive. --Emu (talk) 22:12, 24 November 2023 (UTC)
Please clarify. Clarifying what the situation is and what the most appropriate solution is IS productive, IMO. RudolfoMD (talk) 22:35, 24 November 2023 (UTC)
Frettie hasn't yet replied to Vicarage's comment of 22:05, 21 October 2023 (UTC), far above. There is reason for skepticism. RudolfoMD (talk) 23:52, 24 November 2023 (UTC)
Please bear in mind that we are all volunteers here and nobody is under any obligation to respond in a certain time frame or at all. --Emu (talk) 08:26, 25 November 2023 (UTC)
I find that comment inappropriate. I asked you to clarify and you are avoiding/refusing to do so. On Wikipedia, at least, there is an expectation (PAG) that admins, especially, respond to reasonable questions. Not here. Your comment that I asked you to clarify was implicitly threatening me with your tools, and tersely/harshly critical, yet you refuse to clarify. I would ask that you strike it if you won't clarify it, or at least drop the matter. A comment below supports that my skepticism about willingness is well-founded. RudolfoMD (talk) 09:44, 26 November 2023 (UTC)
Useless to fix a bot at a time when there is discussion about possibly banning all active (other than only insert once) bots. --Frettie (talk) 11:59, 25 November 2023 (UTC)
Surely it's trivial to flag pairs of occupations where one is a subclass of the other, and remove the most generic. Vicarage (talk) 22:50, 24 November 2023 (UTC)
Removal where? --Emu (talk) 23:25, 24 November 2023 (UTC)
From the person. But it equally well applies for all the military museums that are also instances of museum and tourist attraction. Vicarage (talk) 06:34, 25 November 2023 (UTC)
Valid referenced statements should never be deleted Piecesofuk (talk) 08:12, 25 November 2023 (UTC)
Exactly. --Emu (talk) 08:27, 25 November 2023 (UTC)
As so often, other sources do not match the WD ontology, so they can pollute as well as inform. WD needs to be a consistent, editable, queryable resource, not a rag-bag of others' facts. Vicarage (talk) 11:28, 25 November 2023 (UTC)
I like this. This is a good direction. Everyone should think about this. Pallor (talk) 01:39, 25 November 2023 (UTC)
This could be part of a solution. It would also need to address the other axes: locations? dates? Remove the most generic there too, yes? (As you mentioned, Paris, year of death...) I'll be pleasantly surprised if it's easier than avoiding overrides. RudolfoMD (talk) 07:20, 25 November 2023 (UTC)
I would not participate in developing a bot which *removes* sourced statements, as opposed to up-ranking. Vojtěch Dostál (talk) 07:12, 25 November 2023 (UTC)
I agree. I also see a lot of possible criticism from other users who don't want to delete everything that three users here wish.--Frettie (talk) 12:01, 25 November 2023 (UTC)
Frettie, I still see this as low level communication. On the one hand, because you know full well that it is not the wish of three users, since I have aggregated how many editors disagree with the editing principle of your bot. On the other hand, because what you read is a suggestion in the direction of compromise. You don't have to accept it, but not to discuss it is to reject the compromise. Please consider this to be the first suggestion in the debate between the two positions that points in the direction of a possible solution. (However, it is possible that RudolfoMD is right, and that it is a more complicated solution than setting the bot to edit once per data, but in a democracy sometimes the more complicated and costly solutions represent the consensus.) Pallor (talk) 12:15, 25 November 2023 (UTC)

Second Summary

To sum up again: The bot is currently blocked "[u]ntil resolution of issues on Frettiebot's editing". When questioned what specifically has to change, a few ideas emerged:

  1. Data should only be added once: MisterSynergy’s assessment of the impracticality of this request has not been substantially questioned, at least I haven’t found a rebuttal when reading the whole discussion again.
  2. The bot should keep track of what it has added and not override manual deletions: Do you have the same doubts that apply to your response to request #1, @MisterSynergy?
  3. The bot should set the most precise occupations to preferred rank: There seems to be no real opposition but it seems to be questionable if that’s really the problem.
  4. Certain values should be avoided: The interested parties haven’t come up with a list of those values.
  5. The bot should delete imprecise statements even when sourced.
  6. Low-quality communication should be improved upon.

The main problem with #5 seems to be that this goes against several Wikidata principles. The problem with #6 seems to be that it’s unclear what should change and how change would be measured. --Emu (talk) 14:42, 25 November 2023 (UTC)

What I contributed earlier to this discussion still stands. Bot editing is effectively a stateless operation; a bot does not have sufficient access to its previous edits, or to edits others have made to a given page. While revision histories and contribution lists can be accessed to read revision metadata, it is super difficult to extract useful information from it regarding the actual editorial content of an edit. It is thus reasonable to assume that by default all bots do not know anything about past activity; and that every bot operates based only on the current state of an item page, and the content of an external source (in this case).
In order to change that, a bot operator would somehow need to set up a shadow database regarding previous edits of their bot, but given the wide range of different edits a bot can make, it is unclear how this could work in a reliable way and there is no existing solution one could readily use. If a bot would be required to do this, it would effectively render its operation impossible.
In other words: #1 and #2 would kill this bot, and set a dangerous precedent for future cases. —MisterSynergy (talk) 18:33, 25 November 2023 (UTC)Reply[reply]
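For illustration only, the kind of "shadow database" discussed above might look like the following for the narrowest possible case: one property with simple values. It is a minimal sketch, not an existing solution; the class name, the method names and the item and value identifiers are all hypothetical, and the current values of the item are assumed to have been read by the caller. It does not cover the wide range of other edits a bot can make, which is the difficulty raised in the comment.

```python
import sqlite3
from typing import Set


class AddedLedger:
    """Hypothetical local record of (item, property, value) triples a bot has added."""

    def __init__(self, path: str = ":memory:") -> None:
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS added ("
            "item TEXT, prop TEXT, value TEXT, "
            "PRIMARY KEY (item, prop, value))"
        )

    def should_add(self, item: str, prop: str, value: str, current: Set[str]) -> bool:
        """Decide whether to add a value, given the values currently on the item."""
        if value in current:
            return False  # already present, nothing to do
        row = self.db.execute(
            "SELECT 1 FROM added WHERE item = ? AND prop = ? AND value = ?",
            (item, prop, value),
        ).fetchone()
        # If the bot added this before and it is gone now, someone removed it:
        # do not add it again.
        return row is None

    def record(self, item: str, prop: str, value: str) -> None:
        """Remember that the bot has added this value."""
        self.db.execute("INSERT OR IGNORE INTO added VALUES (?, ?, ?)", (item, prop, value))
        self.db.commit()


# Hypothetical identifiers, for illustration only.
ledger = AddedLedger()
first_run = ledger.should_add("item-1", "P106", "pastor", current=set())
ledger.record("item-1", "P106", "pastor")
after_manual_removal = ledger.should_add("item-1", "P106", "pastor", current=set())
```

A ledger like this only knows about the bot's own additions; it cannot tell why a value disappeared, and it would need extra handling for values that were edited rather than removed.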
I think #1 is a perfectly reasonable request in this context, which is a bot that got blocked mostly because it kept adding data even when people were removing it. I don't agree that we can't ask for it because it wouldn't work for all bots ever.
Is it impossible for this bot to make a reasonable attempt to not upload the same data? (I don't think anyone has said yes or no on this - only talked about general precedents.) I don't know how its data is generated, but it feels like this should be achievable. No need for edit-history parsing or 100% accuracy, just a reasonable good-faith attempt to avoid pushing the same data into WD over and over again. Most bots & batch uploaders seem to manage it. Andrew Gray (talk) 00:47, 26 November 2023 (UTC)
After rereading the whole thing again, I'm still not convinced that this a bot problem at all.
1 and 2 at least wouldn't arise if people deprecated validly sourced statements instead of deleting them because they think that the source is worthless. I certainly do think that some sources are worthless, but not national library authority files. I have only seen Walt Whitman (Q81438) put forward as an example of why library authority files shouldn't be used, and apart from the dates that are at year level and not day level I don't see anything wrong with the data.
3 might be a good idea theoretically, but if that means pushing unsourced values to preferred rank I don't think it's progress. As a basis we also need to be sure of the modelling quality of our subclasses.
4 as for dates, following the example I took earlier, an improvement would be not to import dates when a more precise and sourced value is available, though I'm not sure if a bot can tell that a value is more precise than another without a qualifier to say so. However, my main concern with Frettiebot is when it's edit warring with KrBot over autofixed values. This is what led to the László Szalay (Q1294312) or Ferdinand Friedensburg (Q895898) situation with member of parliament (Q486839). But Vojtěch said previously that keeping track of all autofix templates is difficult and I see no reason not to believe them. Still, that would be a really good thing.
5 is an absolute no.
6 frankly, I have seen a lot of passive-aggressive comments or outright mistrust of good faith in this thread and I think it isn't only on the bot operator side to improve their communication. --Jahl de Vautban (talk) 07:01, 26 November 2023 (UTC)
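On the date question in point 4 above: Wikidata time values carry a numeric precision (9 for year, 10 for month, 11 for day), so a comparison is possible at least in simple cases. The sketch below assumes a hypothetical `DateValue` record with a `sourced` flag, handles only year, month and day precision, and uses made-up dates. It skips a candidate date when a sourced, more precise value that agrees with it is already present.

```python
from dataclasses import dataclass
from typing import List

# Wikidata time precision codes
YEAR, MONTH, DAY = 9, 10, 11


@dataclass(frozen=True)
class DateValue:
    """Hypothetical simplified date statement; not the Wikibase data model itself."""
    year: int
    month: int
    day: int
    precision: int
    sourced: bool = True


def refines(fine: DateValue, coarse: DateValue) -> bool:
    """True if `fine` is strictly more precise than `coarse` and consistent with it."""
    if fine.precision <= coarse.precision:
        return False
    if fine.year != coarse.year:
        return False
    # A month-level coarse value must also agree on the month.
    if coarse.precision >= MONTH and fine.month != coarse.month:
        return False
    return True


def should_import(candidate: DateValue, existing: List[DateValue]) -> bool:
    """Skip the candidate if a sourced, more precise, consistent value already exists."""
    return not any(e.sourced and refines(e, candidate) for e in existing)


# Made-up example: the item already has a sourced day-level date.
on_item = [DateValue(1850, 3, 14, DAY, sourced=True)]
year_only = DateValue(1850, 1, 1, YEAR)
different_year = DateValue(1851, 1, 1, YEAR)
skip_year_only = not should_import(year_only, on_item)
keep_conflict = should_import(different_year, on_item)
```

A conflicting year is still imported in this sketch, on the reasoning that a disagreement between sources is information that ranks can express; whether that is desirable is exactly the open question in this thread.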

Correct. I agree w/ Andrew Gray. As I have proposed earlier, the bot doesn't even have to literally keep track of what it has added (though that is certainly doable) or use Wikidata history; when re-run, it could add only new data by extracting only new data in the first place. This could work in a reliable way too. #1 does NOT effectively render its operation impossible. Bots are not necessarily stateless. (Felt the need to reply as MisterSynergy had said something new - responded to explain what he saw as the hurdles. I still sense glacial progress.)
5 is a straw man. RudolfoMD (talk) 23:28, 26 November 2023 (UTC)
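The "extract only new data" idea could be sketched as a comparison of two snapshots of the source, with no reference to Wikidata's history at all. This assumes the operator keeps the previous extract of the source records, which nothing in this thread confirms; the claim tuples and record identifiers below are hypothetical.

```python
from typing import List, Set, Tuple

# (source record id, property, value); all identifiers here are hypothetical
Claim = Tuple[str, str, str]


def claims_to_upload(previous: Set[Claim], current: Set[Claim]) -> List[Claim]:
    """Return only the claims that are new in the source since the last run."""
    return sorted(current - previous)


previous_snapshot = {
    ("record-1", "P106", "pastor"),
    ("record-2", "P106", "writer"),
}
current_snapshot = {
    ("record-1", "P106", "pastor"),      # unchanged: not sent again
    ("record-2", "P106", "translator"),  # record was updated in the source
    ("record-3", "P106", "writer"),      # new record
}
to_upload = claims_to_upload(previous_snapshot, current_snapshot)
```

Because unchanged claims are never re-sent, a statement removed or corrected on Wikidata is not re-added unless the source itself changes, while genuine updates in the source still come through.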
A few notes on what @Andrew Gray and @Jahl de Vautban have written (thank you both for civil comments, appreciated). This discussion may create an impression that this bot's edits have a high revert rate. I don't think this is true - 99.99% of edits are OK. The largest share of the edit-warring is due to the aforementioned Autofix template, which develops over time and it is sometimes hard for us to keep pace with it - even though we try to do our best. This issue is relevant for all potential similar bots and I would like us to build a framework that provides easy access to Autofix commands to all bots in real time. This is something we are actively thinking about. The last concern is the "add-only-once" policy. Adding our data only once is difficult (among other reasons) because the entries improve over time, and we want to make these updates appear in Wikidata. Therefore, a more complex system as outlined by @MisterSynergy would be required. It is not impossible, but I think that a better solution - more in touch with our existing policies - is to deprecate the (very few) outright-false statements which may appear in National Authority files. Vojtěch Dostál (talk) 07:44, 26 November 2023 (UTC)
@Vojtěch Dostál I agree that updated records can be a problem, but some of the issues here were data being added and re-added five or six times in a single week - it seems likely that was a bot setup issue and not the original source being continually re-checked? Hopefully that sort of thing should be relatively easy to fix. Again, I don't think we need to aim for 100% perfection or checking the item history or anything, just setting things up so it isn't likely to happen too much.
I don't disagree with "deprecate the outright false statements", but how we define 'false' is still a bit of a question mark - is it just things that are objectively wrong (born in 1787, not 1878) or things that are correct when stated elsewhere in the data model (we use P39:mayor, so deprecate P106:mayor)? I'd not be keen on that last one, since it feels like the only reason we have the statement is to avoid bot problems - it feels like it would be better to find some way of preparing the upload so that things go into the right properties in the first place. Andrew Gray (talk) 00:19, 28 November 2023 (UTC)
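One way to prepare an upload so that values land in the right property might be a small routing table consulted before any statement is written. The sketch below is hypothetical: the three entries are only the examples named in this thread, not an agreed list, and the function name is invented for illustration.

```python
# Values mentioned in this discussion as positions rather than occupations.
# This is an illustrative list only; no such list has been agreed on.
POSITION_LIKE = {
    "Q30185": "mayor",
    "Q83307": "minister",
    "Q486839": "member of parliament",
}


def target_property(value_qid: str) -> str:
    """Pick position held (P39) for position-like values, otherwise occupation (P106)."""
    return "P39" if value_qid in POSITION_LIKE else "P106"


routed_mayor = target_property("Q30185")
```

Whether a generic position such as mayor should be written to P39 at all, or skipped when a more specific position is already present, is a separate decision that this sketch does not make.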

How long will we argue over how many angels can dance on the head of a pin? I think this conversation should be closed; at this rate, it'll be months to get anywhere close to resolution. Not worth it; seems more disruptive than productive. Unsubscribing not sufficient, so closing, based on state of discussion. No new or sound old arguments presented.

SIMPLE: Lymantria blocked Frettiebot "Until resolution of issues on Frettiebot's editing". The consensus that issues with the bot's editing require code changes (which were not forthcoming) is what caused the block and is the reason it's still in place. Code changes to address the issues haven't been made, and Frettie has expressed little interest in making any. There are no grounds for unblocking the bot. The end.

4: Misconstrues. There's a lack of consensus on those values.
How long will we argue over how many angels can dance on the head of a pin? I think this conversation should be closed; at this rate, it'll be months to get anywhere close to resolution. Not worth it; seems more disruptive than productive. Unsubscribing. RudolfoMD (talk) 12:17, 26 November 2023 (UTC)
It would be simple if you would use ranks, just as it is the norm in Wikidata and as it was suggested several times in this discussion. IMO the bot should immediately be unblocked without any requirements. —MisterSynergy (talk) 12:51, 26 November 2023 (UTC)Reply[reply]
I do not disagree with this assessment of the current state of this discussion. --Emu (talk) 20:11, 26 November 2023 (UTC)
Discussion closed. Documentation at Template:Closed is missing; template import is incomplete; 'result' argument is being ignored. Unsubscribing not sufficient, so closing, based on state of discussion. No new or sound old arguments presented. --RudolfoMD (talk) 23:11, 26 November 2023 (UTC)
What I am missing is a notification by Frettie of willingness to implement some ranking, on which IMHO there is consensus. --Lymantria (talk) 08:06, 27 November 2023 (UTC)
Hi, @Lymantria, thanks a lot for your response, we will try to implement some realistic ranking process. It is not yet clear what the process should look like. We would be happy to have the community help us with this. --Frettie (talk) 08:32, 27 November 2023 (UTC)
I have given this reaction some time. It seems to me that it is time to "release" Frettiebot. --Lymantria (talk) 19:46, 4 December 2023 (UTC)
What was the resolution of the issues that led to the block? There's been no change to Frettiebot noticed here. So no resolution of the issues that led to the block, or consensus that the block was improper. ISTM, reversing the block would be wheel warring. RudolfoMD (talk) 02:02, 5 December 2023 (UTC)
There seem to be no issues left that are still relevant after this discussion, at least not in a way that would make blocking necessary. --Emu (talk) 08:00, 5 December 2023 (UTC)
I judged the same as Emu. As blocking admin I don't see how wheel warring comes into play when I am the unblocking admin as well. --Lymantria (talk) 08:09, 5 December 2023 (UTC)
@Lymantria: this is a strange and surprising decision. Nothing has been fixed, and Frettie has admitted that he is not going to fix anything now because he does not know what to do. Frettiebot is back doing problematic edits like these redundant places of birth/death. The practical consequence of this edit was degrading the infoboxes of several languages' wikipedias with duplicate data.
The decision process should have been (could still be): stop the bot -> reach consensus on what the bot should generally aim for -> reach consensus on what exactly the bot should do -> implement it in the bot's code -> restart the bot. We are barely at step 2. GrandEscogriffe (talk) 22:09, 5 December 2023 (UTC)
Please try not to delete statements just because you consider them to be redundant. --Emu (talk) 23:27, 5 December 2023 (UTC)
Adding sourced statements, perhaps redundant, is not problematic for the aim of Wikidata. They should never be removed. The correct way to deal with them is to use ranking, if necessary. Frettie has expressed his willingness to adjust the bot accordingly and asked for help to practically do so. I suppose, GrandEscogriffe, you have offered such help to deal with cases like this one. --Lymantria (talk) 07:38, 6 December 2023 (UTC)
I and several others disagree with your position on redundant sourced statements. Even if I had the time and ability (currently I have neither, if this means bot programming expertise) I would not help you implement something that I disagree with and which is apparently not consensual. I think at this point the way forward is a formal vote (or request for comments) on the divisive questions.
In any case I am not going to have much time for Wikidata, but I had to at least support Pallor's and Rudolfo's point. GrandEscogriffe (talk) 17:49, 6 December 2023 (UTC)
Indeed. (But my mistake on the wheel warring!) RudolfoMD (talk) 01:50, 6 December 2023 (UTC)

I really regret this decision. I think it is clear from the summary of the discussion that the majority of participating users do not agree with the operation of the bot. It is also clear that no change has been achieved in this - even by offering compromise solutions. I do not consider this to be a democratic solution, I would like to indicate that I only acknowledge the decision out of necessity. Pallor (talk) 08:51, 5 December 2023 (UTC)

I challenge all interested to come up with a realistic proposal for how the profession-ranking job should be set up. I outlined some basic concepts and challenges above at 19:58, 24 November 2023, and we are serious in our promise to assist with it, but a more systematic discussion will be required to turn it into reality. Possibly start a proposal at Wikidata:Bot_requests and tag me and Frettie there... Vojtěch Dostál (talk) 10:34, 5 December 2023 (UTC)

Vojtěch Dostál It's a sympathetic gesture, but I don't operate a bot, so I can't ask relevant questions. It would be helpful if you opened that discussion and asked the questions that knowledgeable bot operators could answer. On the other hand, I feel that this is more of a practical problem than a theoretical one. As I understand it, we want existing statements not to be subclasses of the new statement (and vice versa). The main question is whether the items are properly filled out: for example, is the pastor item of each church stated to be a subclass of pastor? Pallor (talk) 09:30, 6 December 2023 (UTC)
On a related note, what will happen when the subclass arrangement changes and occupation1 ceases to be a subclass of occupation2? Would the bot then be expected to change the ranking scheme again? How does it know whether the ranking scheme is actually the outcome of a user's manual edit? The situation then becomes very complex very soon. This shows how difficult this bot job will be. I can't even see all possible repercussions and, like you, can't ask all the relevant questions. Vojtěch Dostál (talk) 10:27, 6 December 2023 (UTC)
In edge cases, the bot could just err on the side of doing nothing. GrandEscogriffe (talk) 17:50, 6 December 2023 (UTC)
If the class structure is subject to change, who is to say that the source's meaning of the word is WD's? And what if the source changes? This whatiffery needs to be balanced against clutter. A bot doesn't care about clutter, a query doesn't care about clutter (unless it times out, a common occurrence now), but for sure people eyeballing the entries do. The subclass system passing information to the highest relevant node is key to WD's brevity and usability; using external sources with coarser precision undermines that. Vicarage (talk) 18:20, 6 December 2023 (UTC)
With all due respect: Basically you (and others) are saying that you have aesthetic objections if there are too many statements. Seems like a job for a userscript or gadget – not a reason for blocking bots. --Emu (talk) 20:13, 6 December 2023 (UTC)
I have started a bot request here. I can't fail to notice that while I am among those that were fine with Frettiebot, I am the one starting a bot request to find a solution to problems others are seeing. --Jahl de Vautban (talk) 20:22, 6 December 2023 (UTC)
I think my vision for WD differs from yours. I think of it as a human-curated set of facts, with bots used to aid their collection and justification, but a single set of hierarchical answers. Some AI system can clearly collect the opinions of worldwide authorities and other AI tools can present them, but that's not what I want WD to be. My vision requires that human assessment is key, so usability for humans is key, and aesthetics are vital for that. Perhaps WD will fork as the AI tools develop, and the automated one will prevail, as Google did over Yahoo. But I think the Yahoo approach has merit. Vicarage (talk) 20:29, 6 December 2023 (UTC)Reply[reply]
Again: If human preferences are an issue, you should create a user script that hides any imprecise information that might bother you. --Emu (talk) 21:28, 6 December 2023 (UTC)Reply[reply]
Is there a user script that hides all the citation qualifiers? I rarely have any interest in any of that. Or one that hides all the deprecated information which the bot-human interaction is likely to generate. Vicarage (talk) 21:51, 6 December 2023 (UTC)Reply[reply]
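As an illustration of the rule discussed above (do not add a value that is a subclass or superclass of an existing one, and do nothing in edge cases), here is a minimal sketch. It is not Frettiebot's code: the toy subclass graph, the class names and the function names are all hypothetical, and a real bot would have to read subclass of (P279) from Wikidata instead.

```python
from typing import Dict, Set

# Hypothetical toy subclass graph: class -> direct parent classes.
# It stands in for "subclass of" links; the entries are illustrative only.
SUBCLASS_OF: Dict[str, Set[str]] = {
    "pastor": {"cleric"},
    "cleric": {"religious worker"},
    "novelist": {"writer"},
}


def ancestors(cls: str, graph: Dict[str, Set[str]]) -> Set[str]:
    """Return every class reachable from cls by following subclass links upwards."""
    seen: Set[str] = set()
    stack = [cls]
    while stack:
        for parent in graph.get(stack.pop(), set()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def should_add(new_value: str, existing: Set[str], graph: Dict[str, Set[str]]) -> bool:
    """Return True only if new_value is unrelated to every existing value."""
    if new_value in existing:
        return False
    new_ancestors = ancestors(new_value, graph)
    for old in existing:
        # New value is more specific than an existing one: leave it to humans.
        if old in new_ancestors:
            return False
        # New value is coarser than an existing one: it would be redundant.
        if new_value in ancestors(old, graph):
            return False
    return True
```

With this toy graph, `should_add("cleric", {"pastor"}, SUBCLASS_OF)` returns False, while `should_add("writer", {"pastor"}, SUBCLASS_OF)` returns True. The sketch deliberately errs on the side of doing nothing whenever the two values are related in either direction.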

how to model birthplaces?

Hi, as of (P642) is deprecated, modelling of birthplaces like Birthplace of Frédéric Chopin (Q4917089), Atatürk Museum (Q753756), Mozart's birthplace (Q3327039) does not conform. What is the scheme to model birthplaces correctly? --Herzi Pinki (talk) 09:16, 1 December 2023 (UTC)Reply[reply]

@Herzi Pinki Hello, I recently asked a similar question at Wikidata:Project_chat/Archive/2023/01#Birthhouses. Have a look if it helps. Using property P551 with a specific qualifier seemed like a good idea to me. See the resulting Czech birthhouses with this query. Vojtěch Dostál (talk) 10:18, 1 December 2023 (UTC)
@Vojtěch Dostál IMHO this does not help. I want to add a property that links from the building to the human (and not vice versa). As a replacement of of (P642) in the object describing the building. best --Herzi Pinki (talk) 00:48, 3 December 2023 (UTC)Reply[reply]
@Herzi Pinki It generally does not matter which way you join the two items, if from building to person or vice versa. One direction is generally enough. Vojtěch Dostál (talk) 09:20, 3 December 2023 (UTC)Reply[reply]
@Vojtěch Dostál ok, thanks for the sparql. The problem as always is that sparql only works if modelling is uniform and consistent. A proposal to model birthplaces is too weak for that, it needs a rule. Until then I will follow your proposal. best --Herzi Pinki (talk) 12:47, 3 December 2023 (UTC)Reply[reply]
@Herzi Pinki: I've found relative to (P2210) most useful for these types of relationships. It won't work in all cases, of course, but I think it would work fine here. Huntster (t @ c) 15:58, 3 December 2023 (UTC)Reply[reply]
@Huntster Nice! That would work for the other direction. Vojtěch Dostál (talk) 16:14, 3 December 2023 (UTC)Reply[reply]

Modifying the reference model


At the Data Modelling Days event earlier today, I led a discussion on the Wikidata reference model - see the Commons file for the slides. The concern centered on how we handle duplicated references on items - right now references attach to statements, so the same reference may be duplicated many (up to thousands of) times on the item. The etherpad notes record much of the discussion, and we conducted a poll at the end that was strongly in favor of solving this through some development work to change the storage format to a more compact form. There was also interest in improving the UI mechanisms for handling duplicated references.

Before we create Phabricator tasks for the developers, it seemed prudent to have a bit of a wider discussion with Wikidata users. I'll do a formal RFC on this shortly, but to get some initial feedback I'll ask here - have you run across issues with duplicated references, or other things related to the size of large items in Wikidata, and are you interested in seeing this improved?

I've created a draft RFC here: User:ArthurPSmith/DraftRfC References - comment on the talk page (or edit the draft directly) if you think changes may be needed right now. ArthurPSmith (talk) 18:58, 1 December 2023 (UTC)Reply[reply]

Now a real RFC - Wikidata:Requests for comment/Duplicate References Data Model and UI ArthurPSmith (talk) 20:42, 7 December 2023 (UTC)Reply[reply]

Occupation and employer

During the Data Quality Days 2021, we had a good workshop on how to model occupation/employer. That is, we discussed which property we should have as a qualifier on the other. The result of the discussion was that neither seemed really wrong, so it was mostly up to which one we preferred. However, as far as I know, we didn't come to a strong consensus on it at the time. During this week's Data Modelling Days, I made a few queries to see which ones were most common. Those queries (employer as qualifier on occupation and occupation as qualifier on employer) show that the community currently has a 20-to-1 (6,186 to 335) preference for using occupation (P106) as main statement and employer (P108) as qualifier. I also noticed that occupation is not an allowed qualifier on employer, which sort of settles the debate. My remaining question is: should we remove the property scope constraint (Q53869507) that allows employer to be used as main value (Q54828448) and thereby clearly prefer it as being only a qualifier on occupation? Ainali (talk) 21:11, 2 December 2023 (UTC)

No, because you might want to specify an employer without specifying an occupation. Or you might want to specify a series of employers, with start times and finish times, independent of the occupation. Jheald (talk) 21:26, 2 December 2023 (UTC)Reply[reply]
Same opinion as Jheald. However, we need to reduce the 335 cases of occupation as a qualifier of employer to zero. Anyway, I think we also need to discuss the same issue you report for occupation/employer for the case position held/employer. Currently, having position held (P39) qualified with employer (P108) is much more common than the opposite, although the disproportion is only 9-to-1 (46220 to 5104). I think this could be a good occasion to prohibit the less common option and possibly mass-move it to the other one. --Epìdosis 22:04, 2 December 2023 (UTC)
I was pondering that, but I can't come up with a plausible scenario when I really would want to do it. It seems more like a theoretical possibility that we still could decide to disallow in favor of getting a neater data model that is easier to query. I think that value severely outweighs the loss in flexibility. Ainali (talk) 23:11, 2 December 2023 (UTC)Reply[reply]
I just checked how many items we have with an employer set, but no occupation, through this query:
SELECT (COUNT(?item) AS ?count) WHERE {
  ?item wdt:P108 [] .
  MINUS { ?item wdt:P106 [] . }
}
However, it times out on WDQS, so use QLever to check it. It is almost 48,000. While that is a small proportion of the over 1.8 million times employer (P108) is used, it is still quite a high number. I still lean towards us unifying on one way to model this to make the data model consistent and predictable for querying/reuse, while admitting that we will struggle to retain all that data during the process. Ainali (talk) 08:22, 3 December 2023 (UTC)
I don't see how this makes sense - occupation (P106) and employer (P108) seem to me orthogonal and should both be main statements. Employer definitely needs start and end date qualifiers - certainly for academics staying in one place for an entire career is unusual. They would likely have the same "occupation" that entire time so it makes no sense to have start and end times on the occupation. position held (P39) is something that does have start and end dates and would be associated with an employer in most cases, so tying that with employer (P108) would be more justified, but I don't think it makes sense to do that with occupation (P106). ArthurPSmith (talk) 15:43, 5 December 2023 (UTC)Reply[reply]
I agree with this. Occupations like “physician” and “writer” apply for a lifetime, whereas one might work for a single employer in a variety of roles (= position held) during one’s career. - PKM (talk) 20:32, 5 December 2023 (UTC)Reply[reply]
All your questions were answered during the 2021 talk. In short, even though you are a physician for all your life, this can be modeled just as easily as for someone who is changing occupation every time they change employer. And of course it makes sense to have a start and end time for occupation: no one is born a “physician” or “writer”, and most people retire at some age (sure, there will be a few exceptions to this, but those can get an end date the same as the death date). Ainali (talk) 16:31, 6 December 2023 (UTC)
So an unemployed or retired "physician" or "researcher" no longer has that occupation? What if the employer is a company that is not in Wikidata (very common)? I'm not sure I'm understanding the exact proposal here; maybe you can point to some specific examples that are modeled the way you think they should be and we can think about it a bit more clearly then? From looking at the notes on the talk I think the discussion may have been skewed by the "museum director" example - to me that's not an occupation, the occupation would be "archivist" or something like that; "museum director" is a position, just like "full professor" etc. ArthurPSmith (talk) 20:40, 8 December 2023 (UTC)Reply[reply]
Yes, I am suggesting that someone who is retired or unemployed (or dead for that matter) should have an end date on the occupation. For general examples, see Trevor Lovett (Q56180891) or Shūichi Mizuno (Q9336177), I would consider these well modeled. (If the employer is not an item yet, the employer qualifier could just be skipped. I don't suggest we need to require it to be added, only that if they are added, this is how.) Ainali (talk) 15:30, 10 December 2023 (UTC)Reply[reply]
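For anyone who wants to repeat the count quoted earlier in this thread against an endpoint of their choice, here is a minimal sketch. The helper names `build_count_query` and `run_count` are hypothetical, the User-Agent string is a placeholder, and the endpoint URL is left to the caller. The query text follows the one posted above, with an explicit `wdt:` prefix declaration added because not every endpoint predefines it.

```python
import json
import urllib.parse
import urllib.request


def build_count_query(has_prop: str, lacks_prop: str) -> str:
    """Build a SPARQL query counting items with has_prop but without lacks_prop."""
    return (
        "PREFIX wdt: <http://www.wikidata.org/prop/direct/>\n"
        "SELECT (COUNT(?item) AS ?count) WHERE {\n"
        f"  ?item wdt:{has_prop} [] .\n"
        f"  MINUS {{ ?item wdt:{lacks_prop} [] . }}\n"
        "}"
    )


def run_count(query: str, endpoint: str) -> int:
    """Send the query to a SPARQL endpoint and return the count (needs network access)."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    request = urllib.request.Request(
        url,
        headers={
            "Accept": "application/sparql-results+json",
            "User-Agent": "count-sketch/0.1 (placeholder)",
        },
    )
    with urllib.request.urlopen(request) as response:
        data = json.load(response)
    return int(data["results"]["bindings"][0]["count"]["value"])


# Employer (P108) set, occupation (P106) missing, as in the query above.
QUERY = build_count_query("P108", "P106")
```

Calling `run_count(QUERY, endpoint)` with the URL of a SPARQL endpoint would return the count as an integer, subject to the same timeouts mentioned above. Swapping the two property IDs gives the reverse check (occupation without employer).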

Gadgets are not working

None of my gadgets (such as merging items) seem to be working. Do other people have the same problem? What might cause this? What can be done to resolve this? - Andre Engels (talk) 08:39, 3 December 2023 (UTC)Reply[reply]

@Andre Engels Do you have the gadget PrimarySources enabled? Try to disable it. Does this work for you? See MediaWiki talk:Gadgets-definition#PrimarySources for details. Raymond (talk) 09:26, 3 December 2023 (UTC)Reply[reply]
Thank you, that indeed resolved it. - Andre Engels (talk) 18:33, 3 December 2023 (UTC)Reply[reply]
Had the same problem since yesterday, works fine now. Tobibln (talk) 23:12, 3 December 2023 (UTC)

Didier of Cahors and other medieval people with toponymic surnames

See Didier of Cahors (Q999529) for instance. Is there a way we can mark all the names like this? For instance, we have all the Icelandic and Scandinavian people marked that use a patronymic. Should I create a new property called "toponym for this person=" or "toponymic surname for this person=" to match "patronym or matronym for this person"? RAN (talk) 17:21, 3 December 2023 (UTC)

Not perfect but named after (P138) may work.
Furthermore, that is not a surname. Surnames appear a lot later.--Pere prlpz (talk) 11:12, 9 December 2023 (UTC)Reply[reply]

Need help regarding a PUT request.

Hi, I am struggling with the PUT request below. It should add a third reference to the date of death statement (Property:P570) of Robert Nixon (Q7348050). Instead I receive a 404 Not Found:

{
    "code": "statement-not-found",
    "message": "Could not find a statement with the ID: Q7348050$64DF96AF-CE8D-4905-8B34-B6F0084A28B0"
}

This is peculiar since the statement id in the request is the id belonging to property P570:

OpenAPI definition:

The request:
Method: PUT
Content-type: application/json

{
    "statement": {
        "id": "Q7348050$64DF96AF-CE8D-4905-8B34-B6F0084A28B0",
        "rank": "normal",
        "references": [
            {
                "hash": "3fd58cb48138a405e8a1e34c1b51835581627962",
                "parts": [
                    {
                        "property": {
                            "id": "P248"
                        },
                        "value": {
                            "type": "value",
                            "content": "Q17299517"
                        }
                    },
                    {
                        "property": {
                            "id": "P813"
                        },
                        "value": {
                            "type": "value",
                            "content": {
                                "time": "+2017-08-23T00:00:00Z",
                                "precision": 11,
                                "calendarmodel": ""
                            }
                        }
                    },
                    {
                        "property": {
                            "id": "P854"
                        },
                        "value": {
                            "type": "value",
                            "content": ""
                        }
                    }
                ]
            },
            {
                "hash": "f357aeb56f66e7932c0a54f7469d939a11ef8a23",
                "parts": [
                    {
                        "property": {
                            "id": "P248"
                        },
                        "value": {
                            "type": "value",
                            "content": "Q51343652"
                        }
                    },
                    {
                        "property": {
                            "id": "P5035"
                        },
                        "value": {
                            "type": "value",
                            "content": "n/nixon_robert"
                        }
                    },
                    {
                        "property": {
                            "id": "P1810"
                        },
                        "value": {
                            "type": "value",
                            "content": "Robert Nixon"
                        }
                    },
                    {
                        "property": {
                            "id": "P813"
                        },
                        "value": {
                            "type": "value",
                            "content": {
                                "time": "+2017-10-09T00:00:00Z",
                                "precision": 11,
                                "calendarmodel": ""
                            }
                        }
                    }
                ]
            },
            {
                "parts": [
                    {
                        "property": {
                            "id": "P813",
                            "datatype": "wikibase-item"
                        },
                        "value": {
                            "type": "value",
                            "content": {
                                "time": "+2023-12-03T00:00:00Z",
                                "precision": 11,
                                "calendarmodel": ""
                            }
                        }
                    },
                    {
                        "property": {
                            "id": "P854",
                            "datatype": "wikibase-item"
                        },
                        "value": {
                            "type": "value",
                            "content": ""
                        }
                    },
                    {
                        "property": {
                            "id": "P248",
                            "datatype": "wikibase-item"
                        },
                        "value": {
                            "type": "value",
                            "content": "Q11148"
                        }
                    }
                ]
            }
        ],
        "property": {
            "id": "P570"
        },
        "value": {
            "type": "value",
            "content": {
                "time": "+2002-10-22T00:00:00Z",
                "precision": 11,
                "calendarmodel": ""
            }
        }
    },
    "tags": [],
    "bot": false,
    "comment": "Added reference to date of death via [[User:Mill 1]]'s edit app using Wikibase REST API 0.1 OAS3"
}

Any help is greatly appreciated. Mill 1 (talk) 22:15, 3 December 2023 (UTC)Reply[reply]

@Mill 1 It looks like you uncovered a bug. We have phab:T352644 for it now and will look into it more. Sorry for the issue. I'd also love to hear what you're building if you're willing to share. Since the REST API is new and still in development, feedback is very helpful. Lydia Pintscher (WMDE) (talk) 10:39, 4 December 2023 (UTC)Reply[reply]
Hello Lydia,
Thank you for your feedback. I noticed that this particular type of request was not working (replacing a statement) but thought it was faulty at my end.
Context: I created a .NET Core web application that implements CRUD actions on some Wikidata entities. The source can be found here:
As you will see, I focus more on the back end than the front end :) Some POST and PUT requests are actually called via a GET request via index.html.
Anyway, I created the application to automate wikidata edits regarding my main private project which is aimed at improving and standardizing the Wikipedia deaths per month lists regarding 1990 – 2005 (Read me). Mill 1 (talk) 11:08, 4 December 2023 (UTC)Reply[reply]
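For readers who want to assemble a request body of the shape shown above without writing the nesting by hand, here is a minimal sketch. It only builds the JSON and sends nothing. The helper names `make_part` and `with_extra_reference` and the pared-down `EXISTING` statement are hypothetical; the field layout simply mirrors the request posted in this thread and is not checked against the OpenAPI definition.

```python
import copy
import json
from typing import Any, Dict, List


def make_part(property_id: str, content: Any) -> Dict[str, Any]:
    """Build one reference part in the property/value layout used in the request above."""
    return {
        "property": {"id": property_id},
        "value": {"type": "value", "content": content},
    }


def with_extra_reference(
    statement: Dict[str, Any], parts: List[Dict[str, Any]], comment: str
) -> Dict[str, Any]:
    """Return a request body whose statement carries one additional reference.

    The input statement is copied, so the caller's data stays unchanged.
    """
    updated = copy.deepcopy(statement)
    updated.setdefault("references", []).append({"parts": parts})
    return {"statement": updated, "tags": [], "bot": False, "comment": comment}


# Pared-down, hypothetical stand-in for the statement as fetched from the API.
EXISTING = {
    "id": "Q7348050$64DF96AF-CE8D-4905-8B34-B6F0084A28B0",
    "rank": "normal",
    "references": [{"parts": [make_part("P248", "Q17299517")]}],
    "property": {"id": "P570"},
}

BODY = with_extra_reference(
    EXISTING, [make_part("P248", "Q11148")], "Added reference to date of death"
)
print(json.dumps(BODY, indent=4))
```

Building the new reference without a `hash` matches the third reference in the request above, which also had none. Whether the server accepts such a body depends on the bug tracked in phab:T352644.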

[Wikidata] Weekly Summary #605

Freiherr

Viktor Freiherr von Erlanger: the noble title "Freiherr" isn't an honorific prefix, since it comes after the first name in German. Do we have a word that describes honorifics that come after the first name in German? RAN (talk) 23:07, 4 December 2023 (UTC)

Yes, family name (P734). Karl Oblique (talk) 20:37, 7 December 2023 (UTC)Reply[reply]

Modelling NPO tax status

Hi everyone, I have some problems modelling non-profit organizations in Wikidata. In some countries, the non-profit status is separate from the legal form. For example, in the Netherlands you have stichting (Q19605764), and in Germany foundation under civil law (Q56242138). Both have an ELF code (P10421) and are clear to use. But not all of them are tax-exempt public benefit organisations. For that, algemeen nut beogende instelling (Q1977825) exists in the Netherlands, and charitable corporation (Q113805953)/Gemeinnützigkeit (Q66660868) in Germany. Similar concepts exist in other countries. So my question is how to model this in Wikidata so that one can select all algemeen nut beogende instelling (Q1977825) in the Netherlands, or all stichting (Q19605764) that are algemeen nut beogende instelling (Q1977825). Most of the time something like friendly society (Q1976354) is used as instance of (P31), but that's not very useful and very inconsistent. Newt713 (talk) 08:28, 5 December 2023 (UTC)

You're right. There really isn't a consistent approach to this on Wikidata. There is a similar situation with 501(c)(3) organization (Q18325436); in general, legal forms work a bit differently in American law than in other jurisdictions. From my point of view, using instance of (P31) is not a good idea; using (multiple) legal form (P1454) is better but still imperfect. What about has characteristic (P1552)? Jklamo (talk) 12:39, 5 December 2023 (UTC)
@Newt713, Jklamo: There's a start of some discussion about this on this project talk page - how should we model organization "types" of this sort? ArthurPSmith (talk) 15:49, 5 December 2023 (UTC)Reply[reply]

EnArgus Ontology

Hello Everyone,

I'm a project member of the EnArgus Project, which aims to bring more transparency to energy politics and energy research funding in Germany. Part of the project is an OWL ontology which is being built collaboratively by energy researchers and computer scientists and is linked to a wiki also being built as part of EnArgus. We are considering publishing our ontology on Wikidata. Which steps should be taken?

Thanks in advance MickDe87 (talk) 13:48, 5 December 2023 (UTC)Reply[reply]

@MickDe87: I'm assuming the logical step here would be for you to propose a property, presumably an external id that allows linking Wikidata items to your ontology id's? ArthurPSmith (talk) 15:51, 5 December 2023 (UTC)Reply[reply]

Passed tags are always invalid using the REST API


None of the tags used in requests sent to the REST API are accepted. I always receive a 400 response (Bad Request):

{
    "code": "invalid-edit-tag",
    "message": "Invalid MediaWiki tag: \"mobile web edit\""
}

I pass the tag names stated in Methods tested: POST and PUT

Example request:
Method: PUT
Content-type: application/json

{
  "description": "Amerikaans journalist en schrijfster",
  "tags": ["mobile web edit"],
  "bot": false,
  "comment": "Edited description for Dutch language"
}

If it is not a bug please let me know where I can find a list of valid tags to add to an edit.

cc Lydia Pintscher (WMDE)

Regards, Mill 1 (talk) 16:29, 5 December 2023 (UTC)Reply[reply]

Not all tags can be applied by tools or manually. It looks like "mobile web edit" is one of them that can only be set by MediaWiki since it represents edits that are done via the mobile web UI and not other mobile apps etc. I hope that helps. Lydia Pintscher (WMDE) (talk) 15:02, 6 December 2023 (UTC)Reply[reply]
None of the tag names I tested worked. I tried five of them. This could be a coincidence, of course. What would be an example of a valid tag to be used in my request? Mill 1 (talk) 09:37, 7 December 2023 (UTC)
If you look at the Source column of the table in Special:Tags, the tags that are Applied manually by users and bots can be set via the REST API, the tags that are Defined by the software can only be set by MediaWiki. Ollie Shotton (WMDE) (talk) 11:43, 8 December 2023 (UTC)Reply[reply]
Thank you Ollie, I looked straight past it. Where/how can I define custom tags? I do not have administrative rights. Mill 1 (talk) 19:37, 9 December 2023 (UTC)
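The distinction Ollie describes can also be checked programmatically. The sketch below filters a list of tag records down to those that may be applied manually. It assumes records shaped like the output of the MediaWiki Action API's `list=tags` query with `tgprop=source`, where `source` is a list that may contain `"manual"` or `"extension"`. The sample data and both example tag names are made up for illustration, and the function name `manually_applicable` is hypothetical.

```python
from typing import Any, Dict, List

# Hypothetical sample; only "mobile web edit" is a tag named in this thread.
SAMPLE_TAGS: List[Dict[str, Any]] = [
    {"name": "mobile web edit", "source": ["extension"]},
    {"name": "example-tool-tag", "source": ["manual"]},
    {"name": "example-retired-tag", "source": []},
]


def manually_applicable(tags: List[Dict[str, Any]]) -> List[str]:
    """Return the names of tags whose source list includes 'manual'."""
    return [tag["name"] for tag in tags if "manual" in tag.get("source", [])]


print(manually_applicable(SAMPLE_TAGS))
```

Only tags that pass such a filter would be candidates for the `tags` array of a REST API request. Software-defined ones such as "mobile web edit" are rejected, as seen above.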

Luminaire data transcription

I am working on describing a historic type of street light (SRS201) that used to be common in the UK and the Low Countries. I wonder what property do I need to use to describe the lamp power in watts, and also how do I present variants that result in different lengths? Source: [2], page 327. --Minoa (talk) 23:09, 5 December 2023 (UTC)Reply[reply]

Entity Graph

Eh? Has this died? For every item I look at, it doesn't finish. EOT encountered. Jim.henderson (talk) 23:38, 6 December 2023 (UTC)

@Jim.henderson: « Entity Graph » is a wide concept (older than Wikidata). I'm guessing you are talking about a specific tool but can't figure out which one; you'll need to be more specific if you hope for any help. Cheers, VIGNERON (talk) 20:45, 7 December 2023 (UTC)

adding the data of Great Immigrants (Q121359767)

I have an Excel spreadsheet of all the Great Immigrants (Q121359767): "Award issued by the Carnegie Corporation of New York to celebrate immigrant contributions to American life." The list contains 719 names from the beginning in 2006 to the present, 2023. The fields are Year Honored, First Name, Middle Name, Last Name, Occupation, Category and Subcategory. Ronald Sexton (talk) 01:54, 7 December 2023 (UTC)

I think a good first step would be to figure out how to model the categories and subcategories of the award. Then, for any recipients that already have a Wikidata item, add or update the award won property. Before you create items for recipients that don't already exist, I would spend the time establishing the person's Wikidata:Notability in addition to winning the award. Then, once you're confident the person is indeed notable, create a fully developed item for them including the award. -- William Graham (talk) 18:10, 7 December 2023 (UTC)

Re: How to handle concepts of trans people?

Sorry to revive an archived thread (Wikidata:Project chat/Archive/2023/11#How to handle concepts of trans people?; started by @Bencemac:), but I did want to reply from the perspective of the Wikimedia Foundation's legal team. I know @ChristianKl: had looked for a reply.

Redirect of email address

Regarding the redirection of privacy wikidata org to privacy wikimedia org, it is true that Foundation Legal asked for the redirect. The intention was to route privacy requests to Foundation attorneys to evaluate. We certainly did not want to sabotage anything, and we should make changes if that is the result.

In practice, individuals are treating privacy wikidata org as a living persons issues email queue more generally. The vast majority of requests sent to privacy wikidata org are not requests the Foundation can action, since we do not edit the projects, and we end up referring the requests to info wikidata org. I wrote a post on the talk page of the Living people policy a while ago about this issue.

So one point to address from this discussion is whether more sensitive requests regarding content about living persons can be directed to info wikidata org or a new VRT queue monitored by Wikidata volunteers.

Regarding gender on Wikidata items

Through the email address privacy wikidata org (which currently redirects to privacy wikimedia org), the Wikimedia Foundation recently received a request to remove gender statements for a Wikidata item about a living person (or to delete it entirely). After evaluating the request, we forwarded the request to info wikidata org.

As we understand it, the proof for the gender statement on this particular person’s Wikidata item was based on the fact that the person changed their name (see also: Wikidata:Living people#Statements that may violate privacy).

We therefore appreciate this issue being discussed here, in hopes that these sorts of issues can be discussed more broadly. BChoo (WMF) (talk) 18:09, 7 December 2023 (UTC)Reply[reply]

Request for conversation / Talking: 2024

Hi folks,

Recently, Maryana Iskander, Foundation CEO, announced a virtual learning and sharing tour, Talking: 2024. This is two years after the initial listening tour that Maryana launched before assuming her role. The aim is to talk directly with Wikimedia contributors around the world about some of the big questions facing the future of our movement. I'm writing here to warmly invite those of you interested to participate – on-wiki or by signing up for a conversation. The priorities that contributors identify in these conversations will become the driving force in the Foundation’s annual planning process, especially as our senior leadership and Trustees develop multi-year goals in 2024. Thanks for your time and attention. Looking forward to talking together. -Udehb-WMF (talk) 18:34, 7 December 2023 (UTC)Reply[reply]

please delete page

Could someone please delete page Q61149150 - Wikidata.

Thank you. Dionysus (talk) 19:13, 7 December 2023 (UTC)Reply[reply]

I removed Κατηγορία:Μυθιστορήματα του Μανουέλ Βάθκεθ Μονταλμπάν to the right page: Dionysus (talk) 19:16, 7 December 2023 (UTC)Reply[reply]

Please delete also pages,,,

Thank you very much! Dionysus (talk) 19:43, 7 December 2023 (UTC)Reply[reply]

@Dionysus Deletions should be requested at WD:RFD. Duplicates should be merged not deleted. Bovlb (talk) 20:15, 7 December 2023 (UTC)Reply[reply]
Bovlb, thank you for the explanation. I know almost nothing about Wikidata. I see that User:DeltaBot merged the pages. Does a bot always merge the pages, or can I do it myself? And how? Thank you again. Dionysus (talk) 09:29, 8 December 2023 (UTC)
Bovlb, I guess my question was stupid. Sorry. Dionysus (talk) 19:10, 8 December 2023 (UTC)Reply[reply]
Never stop asking questions. You might like to read Merge and install the merge gadget. Bovlb (talk) 20:13, 8 December 2023 (UTC)Reply[reply]
Bovlb, thank you for your answer. And thank you for your help. Dionysus (talk) 12:43, 9 December 2023 (UTC)Reply[reply]
I think that this discussion is resolved and can be archived. If you disagree, don't hesitate to replace this template with your comment. --Matěj Suchánek (talk) 17:41, 9 December 2023 (UTC)Reply[reply]

Problem with a list

In this SPARQL code, I managed to see that 705 identifiers in Wikidata have serious underlying issues. This leads us to a list in Wikipedia whose relevance is in doubt after activating the code. I don't know whether to fix them one by one or if there is another solution. The description in English is in Spanish, and the description in Spanish is poorly written because the list had issues from outside Wikidata, and I can't find where. Besides being a highly dangerous primary source, I wouldn't know how to correct grammatical errors or, on the other hand, verify the truth of whether there is a 'heritage house SN' on 'such street.' Best regards. Berposen (talk) 01:38, 8 December 2023 (UTC)Reply[reply]

The items all have Guía Digital del Patrimonio Cultural de Andalucía ID (P3318) that link to, for example, I don’t know what you find so irritating about the data except it not being in your preferred language? Yes, it may have been better to leave English labels blank and come up with descriptions. Karl Oblique (talk) 06:11, 8 December 2023 (UTC)Reply[reply]
I see. Many, many lists have been generated with those codes about "Patrimonio Andalucía". The problem is that, in a list like this, the red links now make up more than 70% of the list. In addition to the description in English, the identifier redirects to non-existent articles. In short, apart from the errors, the identifiers for 'addresses or streets' do not identify any article on the list. Berposen (talk) 06:39, 8 December 2023 (UTC)
The items are relevant for Wikidata since I see no reason to doubt that they represent actual buildings. But if it’s about the list on eswiki, you could restrict the list to just the items that have articles by modifying the query like this. Karl Oblique (talk) 06:51, 8 December 2023 (UTC)Reply[reply]
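The modified query linked above is not reproduced on this page. As a hedged sketch of one common way to express that restriction (not necessarily the query Karl Oblique linked), the snippet below builds a SPARQL query that keeps only items carrying a given identifier property and having a sitelink to a given wiki. The function name `items_with_article_query` is hypothetical.

```python
def items_with_article_query(id_property: str, wiki_url: str) -> str:
    """Build a query for items that have id_property and an article on wiki_url."""
    return (
        "PREFIX wdt: <http://www.wikidata.org/prop/direct/>\n"
        "PREFIX schema: <http://schema.org/>\n"
        "SELECT ?item ?article WHERE {\n"
        f"  ?item wdt:{id_property} [] .\n"
        "  ?article schema:about ?item ;\n"
        f"           schema:isPartOf <{wiki_url}> .\n"
        "}"
    )


# P3318 is the Guía Digital del Patrimonio Cultural de Andalucía ID named above.
ESWIKI_QUERY = items_with_article_query("P3318", "https://es.wikipedia.org/")
print(ESWIKI_QUERY)
```

Items without a Spanish Wikipedia article simply drop out of the result, which would remove the red links from a list generated from it.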

Data Modelling Days: documentation and outcomes

Hello all,

The Data Modelling Days took place last week, with an estimated 80 participants throughout the three days of the online event. Many discussions related to modelling and organizing data on Wikidata happened, together with presentations, raising issues, new ideas and suggestions on topics as diverse as living heritage, gender, EntitySchemas, conflations and duplications, Autofix, references, modelling data on a brand new Wikibase instance, and semantic web.

Most sessions are available as video recordings, you can find them in this playlist, as well as linked from the program. You will also find the collaborative notes, archived on wiki pages, and the slides if any. You can also find the slides directly in the related Commons category.

We hope that some discussions and issues raised during the event will be shared with the broad Wikidata community, for example through WikiProjects. Don't hesitate to start discussions here and there, using the presentations from the event as a support to start the discussions.

Many thanks to all of you who participated and contributed to the event!

If you have any questions or suggestions related to Wikidata data modelling or missing technical features, feel free to contact Lydia Pintscher (WMDE) and Arian Bozorg (WMDE). If you have any feedback about the event or suggestions for future events, feel free to reach out to me. Best, Lea Lacroix (WMDE) (talk) 07:21, 8 December 2023 (UTC)Reply[reply]

@Lydia Pintscher (WMDE): I think the main problem is that solving data model problems needs discussions that come to a consensus about modeling decisions. Given that many people have very many items on their watchlist, the watchlist is often not as good a tool for making people aware of discussions as it is in a project like Wikipedia.
Pinging Wikiprojects used to be a way to get that to happen. We had more policy discussions and modeling discussions back when it was working; today it doesn't really work because of the 50-person limit. Simply configuring the ping-project to have no limit would be one technical solution, but there are probably also other solutions for the problem. ChristianKl❫ 17:06, 8 December 2023 (UTC)

Proposal: add qualifier to allow for fuzzy property values to help with real-life data ambiguity

Proposal: certain values of some properties can get a qualifier which will signal that they are "fuzzy" which means that certain property's value is possible for a somewhat less-strict definition of a property or in general because our knowledge is limited. So, some Wikidata values will be strict and some others fuzzy.

This is much closer to real life than just discarding legitimate information which is a recipe for censorship-like abuse where different strictness is arbitrarily applied in different cases.

For example, Peter C. G., a scientist, stops maintaining his old webpage and creates a webpage for a two-member institution instead (with him as director and another person as deputy), which no longer counts as his webpage. It actually still is a kind of webpage of his and contains information about all of his recent articles and books, just not in a strict enough sense to reach a consensus, so it gets deleted as his official webpage. But Oprah Winfrey's website, which is run by Harper Productions, is supposedly OK, while a scientist's site, which is run by him as director of a two-person team, is not OK. Fuzzy values are the saviors of data freedom. Fabius byle (talk) 15:44, 29 November 2023 (UTC)

Hello Fabius byle, there are for example:
for imprecise, uncertain or outdated information.
Also see Wikidata:Events/Data_Modelling_Days_2023 M2k~dewiki (talk) 19:20, 29 November 2023 (UTC)
Thank you, will look into it. Fabius byle (talk) 20:04, 29 November 2023 (UTC)
When it comes to qualifiers, sourcing circumstances (P1480) and nature of statement (P5102) are quite common. That said, for Wikidata you'd want the data to be as unambiguous as possible, since computers deal in absolutes and can't reason very well about vague statements (limiting ourselves to classic computing and ignoring AI, of course). Infrastruktur (talk) 20:00, 29 November 2023 (UTC)
@Infrastruktur, sorry for being late to answer. Thank you. Sourcing circumstances (P1480) and nature of statement (P5102) seem like it! As I see it, there are two main reasons why our judgement on the values of certain properties can be imprecise: 1) the nature of reality itself, 2) the nature (or state) of our understanding of reality (which includes things on the fringe of abstract definitions, or defects in our abstract categories). But the reasons for the imprecision are not that important as long as we stick to only providing data. Thanks again. @RudolfoMD, useful qualifiers! Fabius byle (talk) 11:07, 8 December 2023 (UTC)
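
For readers who want to see how the qualifiers mentioned above surface in queries, here is a minimal sketch in Python. It only builds a SPARQL query string and does not contact any server. The helper name `qualified_statement_query` and its parameters are made up for illustration. The `p:`, `ps:` and `pq:` prefixes are the ones the Wikidata Query Service predefines for statements, statement values and qualifiers; other endpoints may need explicit PREFIX declarations.

```python
def qualified_statement_query(prop: str, qualifier: str, limit: int = 10) -> str:
    """Build a SPARQL query listing statements of `prop` that carry `qualifier`.

    Hypothetical helper for illustration only; it returns the query text
    and leaves running it to whatever client you prefer.
    """
    return f"""
SELECT ?item ?value ?qualifierValue WHERE {{
  ?item p:{prop} ?statement .
  ?statement ps:{prop} ?value ;
             pq:{qualifier} ?qualifierValue .
}}
LIMIT {limit}
""".strip()


# Example: official website (P856) statements qualified with
# sourcing circumstances (P1480).
query = qualified_statement_query("P856", "P1480")
print(query)
```

Swapping `"P1480"` for `"P5102"` would give the same query for nature of statement; the resulting text can be pasted into the query service as-is.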

Should the foundation consider funding the QLever project?

For starters, I have no skin in the game, and no economic ties to any project or person I might refer to. And I'm sorry to say I can't very well prove that, so you will just have to take my word for it.

But as you are no doubt aware, Wikidata is suffering growing pains at the moment. From a pure CS perspective the QLever engine is interesting because they seem to have realized that scaling horizontally isn't something that you could achieve very well. I mean Blazegraph does a damn good job, but in the end there is only so much you can do within the constraints you are given. And so the next logical step was, well, let's make it as efficient as we possibly can on a single computer, and they seem to have succeeded at that. Of course this can't scale into eternity but it would surely suffice for another 80 years or so.

What are the current concerns? They need a solution for SPARQL update that performs well. Once that is taken care of, the rest will work out I'm sure. I won't pretend to be knowledgeable in this area but I can at least honestly say I did read at least one paper on the design of this engine, so there is that. Of course I have to trust the people who wrote that paper but I'm fine with that.

Also, you may have heard of the Pareto principle: you can get 80 percent of the gain by spending 20 percent of the effort, or something along those lines. In a nutshell, the design is not the problem, but taking things into production is going to require man-hours, and so this is where it makes sense to contribute money to development. What this will do is drastically cut down the time it takes for QLever to go from a performant prototype to a good-quality triple store/query engine.

What would funding give us? Ideally it should directly translate into developer time, and that would benefit the Wikidata project as well, assuming they adopt QLever as their triplestore. I don't think it would be moral to expect the Freiburg team to be dictated to by the WMF, but at the very least, if the WMF donates money, it should be afforded some priority in terms of wish-list items.

What are your opinions on this? Infrastruktur (talk) 15:34, 8 December 2023 (UTC)

I've come across their prototype and it has changed my opinion of what is even possible. For queries I wouldn't have dared to run before, I am getting results with tens of thousands of rows as fast as my connection can download the file. So, yes: I wouldn't hesitate to commit to this as a replacement for Blazegraph.
As to your actual question, I don't have a good sense of what the budget constraints are and what their project status is? It's a university, and they've gotten it this far without external motivation. Adding money can have counterintuitive consequences, and turning a research department into a vendor has a tendency to make everyone unhappy.
But, yes, it would obviously be worth some cash if that is what helps. Alternatively, I would be happy with this level of performance even if it means updates only every second day (I seem to remember it currently takes them about 30h to ingest the full data). Karl Oblique (talk) 16:32, 8 December 2023 (UTC)
Not directly related to OP's funding question, but you might be interested in the evaluation of QLever at Wikidata:SPARQL query service/WDQS backend update/WDQS backend alternatives. See also phab:T339347, which relates to using QLever in federated queries. Bovlb (talk) 17:46, 8 December 2023 (UTC)
I read it when they released it, but I went back to have a look at what they said about the only other serious contender, which is Virtuoso. It has some quirks apparently but can scale to 100 billion triples. It uses a relational database with self-joins as a backing store, which is a technique that predates dedicated triplestores. Trying to do everything on their product makes it harder to focus on doing one thing well. But what makes me excited about QLever is the novel design ideas that went into it. Remember I said that monolinguals (labels etc.) were cache-hostile by design? This problem is solved in QLever. I don't think this would be realistically possible in a project as complex as Virtuoso. Infrastruktur (talk) 10:52, 9 December 2023 (UTC)
What is QLever and what does it do for us? A quick Google search is not illuminating. -- William Graham (talk) 22:22, 9 December 2023 (UTC)
Yeah, I don't think they have a marketing department. Its GitHub page describes it as a "Very fast SPARQL Engine, which can handle very large knowledge graphs like the complete Wikidata". It does the same thing as the Wikidata Query Service. But since our current one is starting to struggle with the amount of data and traffic, they have been looking at replacing it with an alternative that can cope. Infrastruktur (talk) 10:27, 10 December 2023 (UTC)
Side note: a quick Wikidata search would maybe enlighten you:
--Matěj Suchánek (talk) 11:29, 10 December 2023 (UTC)

official website (P856)

How do I add a website to official website (P856) with language of work or name (P407) if the website was bilingual under the same address, but the second language was removed later? Eurohunter (talk) 09:14, 10 December 2023 (UTC)

Interwikilinking of Wikimedia/Wikipedia user pages (and personal categories on Commons)

I found that this had been discussed in 2013 here, here as well as in a request for comment of the same year.

While I can see that user pages should maybe be handled separately from links between Wikipedia articles and other pages, it is unfortunate that no other solution has been found yet. It would be extremely useful to be able to switch between them, and also a user's Commons category (if there is one), easily and consistently on all project pages.

Experimenting with this I managed to connect de:Benutzerin:Claudia.Garad and C:Category:Claudia Garad (apparently through a loophole in Special:AbuseFilter/39, see here) so that the Commons category page now carries an interwikilink on the bottom left (not the other way around, though). However, I failed at connecting C:User:Claudia.Garad to establish a visible link on her Commons category page or her German user page through Wikidata.

This kind of interlinking seems to be especially useful for 'notable personnel' such as Claudia Garad who is a member of the Wikimedia Foundation Austria. Although I would appreciate if it could be implemented for the 'common user' as well.

Any ideas?

KaiKemmann (talk) 15:36, 10 December 2023 (UTC)