Wikidata:Requests for comment/Must 'Serious' WikiData sources be selective?

An editor has requested the community to provide input on "Must 'Serious' WikiData sources be selective?" via the Requests for comment (RFC) process. This is the discussion page regarding the issue.

If you have an opinion regarding this issue, feel free to comment below. Thank you!

Recent discussions at the Requests for Deletions board have led to an intractable disagreement between some editors.

The disagreement regards the interpretation of WD:N rule number two, which states an item is acceptable if: "It refers to an instance of a clearly identifiable conceptual or material entity that can be described using serious and publicly available references."

In the deletion discussion thread for Q117283304 User:Emu has asserted that: "We generally don’t accept state registries who have to document indiscriminately", and has pointed to a linked discussion on their user-page titled "Sources need to be selective".

My request for comment is simply this:

  • Do sources need to be 'selective' to be 'serious' sources under WD:N?

The reason that this RfC is necessary is that it seems to me that community consensus is arguably unclear at present. We have a number of editors in the Q117283304 deletion thread expressing a contrary position to that reached by a majority of editors in the deletion discussion for Q114557711.

It would be good to obtain a clear community consensus on this, so that it is clearer to editors what rule should actually be followed.

Kind regards, Jack4576 (talk) 11:57, 3 September 2023 (UTC)

I'll tag participants in Q117283304 and Q114557711 as they may have an interest in this discussion. Pinging: User:Gymnicus, User:Granpar, User:Estopedist1, User:Emu, User:Jklamo, User:Dsp13, User:Андрей Романенко. I have also posted this RfC on the Project chat. Jack4576 (talk) 12:01, 3 September 2023 (UTC)

Discussion

  • First of all, I think we have to clarify what is meant by "being described". State registries provide us, at best, with the name of the person, the dates of their life, the places where they lived and some familial relations (sometimes the profession is also mentioned). I would dare to say that this is not a description. I understand description as portraying somebody or something as significantly distinct from anybody or anything else. In this sense, it is not Wikidata that has to be selective: it is our sources that are selective, inasmuch as not many people are really described anywhere. Андрей Романенко (talk) 13:30, 3 September 2023 (UTC)
    I don't understand your view, with respect. All of those things you've mentioned here are descriptive attributes. You've said, "I would dare to say that this is not a description"... How? If you've provided a descriptive attribute about a subject, that is the same thing as describing a subject. Your statement seems to me self-evidently contradictory.
    Anyway, regardless of the above point you've made, I don't understand its relevance to this discussion.
    What I'd hoped to get consensus on is whether or not sources containing information on their subjects must be selective in their coverage in order to be regarded as serious. The alternative proposition is that it is okay to rely upon sources that indiscriminately document all of their subjects. 'Find a Grave' is an example of an indiscriminate source. Jack4576 (talk) 14:01, 3 September 2023 (UTC)
    Consider the sentence "John Smith is a human being". It is formally a sort of description, insofar as being a human being distinguishes that John Smith from a dog or a monkey. But you would probably agree that this description is too trivial, effectively void. This observation leads us to the conclusion that usually we (and probably the authors of Wikidata rules) mean by description (and not only by identifiability) something essential, something significant. That is my point: the set of data provided by indiscriminate sources is too trivial and void for what we mean by description (although formally it is a description, of course). And that is why, in my view, being listed by these indiscriminate sources does not establish compliance with WD:N. Андрей Романенко (talk) 18:58, 3 September 2023 (UTC)
    I see your point now. If an entry is ‘John Smith is a human being’, and we have a publicly available reference to identify that subject as an entity, I don’t see the issue with including that subject as an entry.
    It seems to me the core mission of this site is to store the world’s information. If FindaGrave lists a person as being ‘John Smith’, ‘born in Arizona’, there is no harm to WikiData in having an entry for a John Smith, born in Arizona, with statements supported by the FindaGrave refURL.
    Consider the case of somebody with a less generic name too. Imagine the person has a very unusual or unique last name, and nothing more. Then, we give them an entry due to their presence on FindaGrave.
    Years later, a person might search their last name on WikiData and notice that this person existed. That’s something that adds value to people’s lives.
    I don’t think we should be making value judgements about what information is or isn’t trivial to include. It’s safer to just include everything that’s verifiable, and remove everything that isn’t verifiable. Jack4576 (talk) 23:42, 3 September 2023 (UTC)
    On the one hand, I don't think FindAGrave qualifies as a reliable source. It's user-generated content, but it is subject to far less scrutiny than Wikipedia. On the other hand, I don't understand this reasoning about "some person who might find something and be glad about it": I wish all the best to this person, but they might search for their last name on Google and be sent directly to FindAGrave. There's no need to copy all possible information into the same Internet project; the World Wide Web does not work this way. Finally, I disagree about there being no harm. Listing all possible John Smiths from all possible cemeteries of the world results in a situation where it would be practically impossible to find the John Smith Q228024 among them. It is what they call noisy data. Андрей Романенко (talk) 02:27, 4 September 2023 (UTC)
  • We’ve had this kind of conversation here and here. In a nutshell:
    • Wikidata is neither a “place for self-promotion, or advertising/marketing” nor the “White or Yellow Pages” (WD:NOT). Including every company or every person is not within Wikidata’s mission statement. Yes, this page isn’t an official guideline but it has been around for over ten years now and I can’t remember anyone who disputed its basic premise.
    • Our resources are limited. We have neither the database capability nor the personpower to store information about each and everything in a tidy and unbiased way. If there is no media outlet, nobody in academia or the library systems to cover somebody or something, then it’s probably not really worth our time anyway.
    • Wikidata is a “secondary database“, it “reflects the diversity of knowledge available and supports the notion of verifiability” (WD:I). There is no diversity of knowledge in items that have to rely on one source, namely the subject itself or its members or employees. Since the government is often required to accept every application to things like the commercial register by law, it’s basically user-generated content by the company itself. It’s not organic content by an independent third-party as MisterSynergy once called this kind of data. --Emu (talk) 19:15, 3 September 2023 (UTC)[reply]
    • ‘Not including every company or person’ is also not something that is included in WikiData’s mission statement.
    • Limited resources are a matter for the WikiMedia foundation, not a matter for we editors. Further, if resources are the concern; imposing a rule that sources must be discriminate is something that brings about an unnecessary additional evaluative task. If resources are so limited, we’d be better off only policing the verifiability of subjects.
    • “ It’s not organic content by an independent third-party“ I don’t understand this argument. Information has to come from somewhere. If third parties have contributed to an otherwise serious government database, that in no way makes the government database any less serious a source.
    • “There is no diversity of knowledge in items that have to rely on one source”. Again, I don’t understand this point. Subjects that only have one supporting source are no less ‘diverse’ than subjects that have multiple sources. It seems strange to me that you’re reading a sentence claiming WikiData reflects the diversity of human knowledge, as imposing a restriction on what kinds of knowledge are allowed to be here. All data is welcome here, not only data with ‘diverse’ sources.
    • “Wikidata is neither a “place for self-promotion, or advertising/marketing” nor the “White or Yellow Pages” (WD:NOT). Including every company or every person is not within Wikidata’s mission statement. Yes, this page isn’t an official guideline but it has been around for over ten years now and I can’t remember anyone who disputed its basic premise.”
    > You’re right it’s not an official guideline
    > I dispute it. There’s no reason WikiData couldn’t accidentally and coincidentally fulfil a function akin to the Yellow Pages. If that’s a byproduct of putting all the world’s info on here, then so be it; it’s not something we should actively work to counteract. I’m sure there are a few other editors that’d dispute it too. It’s pretty baffling to me that someone would assert what’s effectively an essay as having the status of a policy adopted by consensus, merely because it hasn’t been objected to. That’s just not the way major and critical policies are adopted.
    > What’s ‘self-promotion’ is a pretty subjective call. Better to just stick to verifiability to reduce the administrative burden.
    Jack4576 (talk) 00:01, 4 September 2023 (UTC)
    • Limited resources are a matter for the WikiMedia foundation, not a matter for we editors: That’s a bold claim that has no basis in reality. The limitations are very real. The existing rule about selective sources doesn’t imply an unnecessary additional evaluative task – endless arguing by people who want their own stuff kept do. You seem to be no stranger to this problem.
    • Information has to come from somewhere. True. We rely on the filtering of other people to determine if this information is trustworthy or relevant in the first place. Secondary database.
    • All data is welcome here: Except it isn’t. Just look at Wikidata:Introduction#What_does_this_mean? and you will find several restrictions. Also there is WD:N.
    --Emu (talk) 06:53, 4 September 2023 (UTC)[reply]
    • Determining whether or not a source is ‘selective’ is an evaluative task. If you’re so concerned about administrative burden, just discard it. There will be fewer of these ‘endless’ arguments about whether things should or shouldn’t be kept. Have you considered that maybe part of this intractability stems from how difficult it is to determine what ‘selective’ actually means, or from the difficulties involved in determining that issue? Your snark is a little unbecoming, but I’ll look past it.
    • Sure, I don’t see the relevance of that to sources being ‘selective’ though. Verifiability (i.e. trustworthiness) is a completely distinct issue.
    • Regardless, none of those restrictions impose a requirement that an information source be ‘selective’. None of the prose in ‘what does this mean’ does so, and neither does WD:N.
    Jack4576 (talk) 07:56, 4 September 2023 (UTC)
    It’s generally not difficult at all. No user-generated content. That’s it. If you allow user-generated content there’s basically no concept of notability left, just create a Crunchbase/ISNI/Instagram/whatnot account or register a company with the authorities and there you go. That’s simply not what Wikidata is for and I have laid out the arguments to support my position.
    As to the snark: More than 50% of your more recent contributions are RfD related. There is no real other activity to speak of, apart from some new items with very questionable notability. You were kicked out of another project due to issues related to AfD discussions. Yes, I’m well aware that you are probably a sockpuppet of another user but that makes things even worse. That doesn’t invalidate your arguments per se but it does show that maybe your main quest here is not finding common ground. --Emu (talk) 08:19, 4 September 2023 (UTC)
    Your concern with 'user-generated content' is misplaced. Firstly, it's off-topic to this discussion, which is about indiscriminate sources, not user-generated ones. These are categorically different issues. Secondly, even the most serious of the world's databases include "user-generated" content in some sense. What matters is whether the administrator/overseer of the relevant database has any level of commitment to ensuring the integrity of the information that they're overseeing. To me that is what makes something a serious source, not the identity of the person adding the information (which isn't possible for us to know, ultimately).
    My main quest here is obviously finding common ground, as I'm here engaging in an RfC attempting to reach consensus. This is the best possible approach to finding common ground. With respect, I think it's a much better approach than your strategy of pointing people toward your own interpretation of archived discussions on your user page.
    Turning to other matters; since for some reason you cannot drop the snark, let me clarify something for you. I'm not kicked out of Wikipedia, I'm only topic banned from certain discussions. I've continued to contribute regularly over the last few months including through RfCs and other policy questions. I have many contributions to WikiData, mostly indirectly through Wikipedia article contributions that are added here via bot.
    None of my recent items have even remotely questionable notability. My data entry for 'Myriad Sun' for example is supported by multiple journalistic references. My data entries for a few books are supported by national library entries. I have no idea what you're talking about. Surely you don't have issues with those entries ... I would dearly hope.
    And no, I'm not a sock. Checkuser me if you wish. As an Admin you should be above making baseless accusations of sockpuppetry ... Throwing such a charge around is pretty concerning and inappropriate behaviour for an Admin. Additionally... it's just obviously not true. My user history on Wikipedia should demonstrate my bona fides as a non-sock user. I've just taken an interest in notability issues on WikiData after witnessing a few deletion discussions talking about 'indiscriminate sources'. I find these discussions concerning, as they seem to me to be an importation of problematic Wikipedia guidelines into WikiData guidelines without actually doing so through the proper means (which would be to obtain consensus to redraft WD:N and add those rules). Hence my recent participation at RfD. I'm just contributing to assist with the maintenance of the site, by applying the guidelines as written. Thankless task, it seems.
    Jack4576 (talk) 09:52, 4 September 2023 (UTC)
  • I think we should have more specific rules for databases in our notability policy. For technical reasons Wikidata can't handle having items for all entities that are listed in some official database. This leads us to interpret "serious" more narrowly for practical purposes.
    I don't think "selective" is a good way to solve this problem. In biology, for example, a database that links all genes of a certain taxon can be valuable to import into Wikidata even if it's not selective about which genes it includes.
    In the case of the item in question, the issue is not self-promotion. In this case, it looks like Jklamo created a lot of items via OpenRefine with jobs like https://editgroups.toolforge.org/b/OR/acc4867937a/ and did not request bot approval for them. Deleting the items one at a time makes little sense. The solution is likely either to undo the whole upload or to leave it be. Randomly deleting single items from it seems a bad strategy. ChristianKl 11:56, 4 September 2023 (UTC)
    Thanks ChristianKl. May I ask, what are those technical reasons? Jack4576 (talk) 12:54, 4 September 2023 (UTC)
    There's no software that manages a database in a way where that database can be queried with SPARQL and that easily scales as large as desirable while providing the performance that the query service currently provides. ChristianKl 19:03, 4 September 2023 (UTC)
    If that’s the case I think we need to define what subjects do/don’t belong here more clearly. ‘Serious’ sources is too wide a net if we’re going to face capacity issues with the guideline as written. Perhaps this is a discussion for another day. Thank you Christian. Jack4576 (talk) 13:21, 5 September 2023 (UTC)
    I think I understand your issue with the wording but I don’t think those genes would really face notability problems regarding “selectivity” here: I’m not a biologist but it’s my understanding that those genes are representations of some biological reality, maybe with some sort of interpretation. So in a way, life itself is selective – you can’t just make up genes the way you can make up companies, careers as Youtubers or serial entrepreneurs or similar things. Also, I imagine that those databases aren’t just data dumps provided by the general public but rather have some sort of scientific concept and curation process or at the very least require some sort of training to enter data there – so even more selectivity. --Emu (talk) 21:02, 5 September 2023 (UTC)
  • English is not my first language so I will not enter into the lexicographical details of "serious" and "selective" (and as most Wikidatians are in the same situation, the community shouldn't either) but yes, obviously "state registries" are usually not good enough (not just for notability, but also for populating the item, as an empty item is obviously of no use to anyone). Another important point for me here is that "references" is plural, so we need at least 2 references (in that case, one may be low quality but only if the others are good quality). Cheers, VIGNERON (talk) 14:15, 4 September 2023 (UTC)
    It's not obvious to me. Why aren't these sources good enough? (Assuming the item isn't empty) Jack4576 (talk) 15:03, 4 September 2023 (UTC)
  • My 2 cents: to interpret these details, we need to defer to mission statements. If Wikidata's mission is to "represent the sum of human knowledge", I reason that the source needs to represent a source of knowledge and not simply a source of data. Perhaps we should assume that the conceptual entity needs to be of reasonable encyclopedic value for other humans. Maybe some wording change on WD:N? TiagoLubiana (talk) 20:25, 5 September 2023 (UTC)
    IMO all data is a form of knowledge, I don't think we can draw out a meaningful distinction in that way, YMMV.
    On its face 'reasonable encyclopedic value for other humans' sounds good, but for many years people asserting "encyclopedic" as a threshold on Wikipedia has resulted in certain communities being marginalised. It's part of the reason Wikipedia has well-documented biases in coverage. Personally I quite like the WD:N policy as it's worded currently (which doesn't include 'selective'), and think we should stick to it. Happy to contribute to any debate in future if the site wishes to consider changes to WD:N though. Jack4576 (talk) 09:08, 6 September 2023 (UTC)
  • I agree with Emu, especially with regard to Wikidata:What Wikidata is not and user-generated content. I also think that the question in dispute is a bit misleading since the wording "serious and publicly available references" already implies a selectivity – we don't accept every reference. If there is a "source" (not really the appropriate term here I guess) that does not have criteria for inclusion/exclusion, it is not "serious" in the WD:N meaning, I think. After all, I have the impression that the majority of contributors agree which references support notability and which do not. --Dorades (talk) 09:18, 6 September 2023 (UTC)
    The issue is not whether we are selective; of course we are. The issue is whether we accept only sources that -themselves- are selective in the subjects that they cover (within their chosen topic, where the database would otherwise be regarded as serious). Jack4576 (talk) 13:21, 6 September 2023 (UTC)
  • White or Yellow Pages: I think this gets down to how many data points a source gives for a person. Appearing in a phone book or directory would give just one data point that could be added to a Wikidata entry for a person, a residence, and the people in a current phone book are living. That is not enough info to properly disambiguate people with common names. At Findagrave the people are dead, and it gives DOB and POB as well as DOD and POD and family relations. There is also no concern about living people. --RAN (talk) 03:07, 11 October 2023 (UTC)