Previous discussion was archived at User talk:IagoQnsi/Archive 1 on 2020-08-18.

Jura1 (talkcontribs)

Hi IagoQnsi

Thanks for creating that catalog in MxM. It helped interwiki prefix at Wikimedia (P6720) finally get closer to completion.

BTW, I removed the language prefixes, as they can refer to Wikipedia, but also specific Wikisource or Wiktionary languages.

Reply to "MxM 3788 for P6720"
Amadalvarez (talkcontribs)

Hi. I recently voted against the creation of some properties related to sports statistics. I proposed a more generic solution that would allow us to have the information without having to create one property for each situation.

You participated in the initial vote, and I want to share the details of my counter-proposal with you before I publish it.

It would be interesting to have your expert opinion / suggestions on whether to incorporate any changes before proposing it as a substitute property. The work page is: User:Amadalvarez/sports statistics property.

Please leave comments on its talk page. Thanks

cc:@ArthurPSmith

Reply to "Sports statistic properties"
Peteforsyth (talkcontribs)

Hi, I'm delighted to see you have added so many items for local newspapers. I've found several so far that duplicate existing items (The Bend Bulletin, the Hood River Glacier, and the Klamath Falls Herald). Do you know of a good way to search for and resolve duplicates, or should I just continue searching for them manually?

Also, you may be interested in a campaign this connects to: News On Wiki. We are trying to create a few hundred new English Wikipedia articles about small newspapers by the end of February.

IagoQnsi (talkcontribs)

Hi Pete. Apologies for those duplicates; it just wasn't feasible for me to go through and dedupe all the ones that the automatching missed, as there were some 19,000 newspapers iirc. I don't know of a good way to find and merge duplicates. I'd be happy to give you my OpenRefine project files if you think you could do something with those. I suspect the duplicates aren't too numerous, as many of the papers I imported have been defunct for decades, and many of the ones that still exist did not already have items anyway. I figured editors would gradually stumble upon the duplicates over time and whittle them away.

Peteforsyth (talkcontribs)

Hi, I've delved into this in some more depth now. I resolved most of the duplicates in the state of Washington, I've dabbled in Oregon and Wisconsin, and I think I have a pretty good sense of where things stand. It seems to me that probably well over half of the ~19k items you imported were duplicates. There are two things going on; the first is pretty straightforward, the second has more nuance and room for interpretation.

First, there appear to have been three major imports of newspaper items: Sk!dbot in 2013, 99of9 in 2018, and items for which a Wikipedia article existed (rolling). A whole lot of your items were duplicates of existing items that had come about through these processes. (An example is the Capital Times. But I've merged yours, so you'll have to look closely.) I know that the 2018 import was based on the USNPL database (and a handful of other web databases). I have no idea what Sk!dbot used as a source.

Second, there are items that some might consider duplicates, while others wouldn't. Consider a case like this:

  • The (a) Weekly Guard changed its name to the (b) Guard.
  • The Guard and the (c) Herald (which had both daily and weekly editions at various times, and went through three different owners) merged.
  • The (d) Herald-Guard has continued.

Many newspaper databases (Chronicling America, Newspapers.com, etc.) consider a, b, c, and d four or more distinct items, and may or may not do a great job of expressing the relationships among the items. In WikiProject Periodicals, we discussed it in 2018, and concluded that we should generally consider a, b, c, and d one item, and attach all four names to it (as alternate labels). See the items merged into the Peninsula Daily News for an example of how your items relate to this principle.

Peteforsyth (talkcontribs)

Unfortunately, all this adds up to a setback to the News On Wiki campaign, which has relied on reasonably tidy and stable Wikidata data (and the in-progress PaceTrack software) to track our progress (in addition to having improvement to relevant Wikidata content as a core component of our mission). There are two impacts:

  • Prior to your import, this query returned about 6,000 results. It was a little cumbersome, but it was possible to scroll and zoom around. Now, it returns about 25,000 results, and it's sluggish.
  • Prior to your import, the green items (indicating that there was a Wikipedia article) were pretty prominent. But now, the map looks mostly red, making it less useful as a visual indicator of how thoroughly English Wikipedia covers U.S. newspapers.

The second problem results in part from some stuff that predates your import, and maybe I can figure out a way to address it. If a city has 4 papers and one of them has a Wikipedia article, it would be better to have a green dot than a red dot (indicating that at least one newspaper in that city has a Wikipedia article). But unfortunately it goes the other way. I bet I can adjust the code to make that change, or maybe even find a more graceful way of handling it than that.
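The "green if at least one paper in the city has an article" rule described above could be sketched roughly as follows. This is only an illustration in Python, not the code behind the map: the `city` and `has_article` fields and the sample rows are assumptions made up for the example.

```python
from collections import defaultdict


def city_colors(newspapers):
    """Return {city: "green" | "red"}; green means at least one paper has an article."""
    covered = defaultdict(bool)
    for paper in newspapers:
        # One covered paper is enough to mark the whole city as covered.
        covered[paper["city"]] |= paper["has_article"]
    return {city: ("green" if ok else "red") for city, ok in covered.items()}


# Hypothetical sample: four papers in one city (one with an article), one in another.
papers = [
    {"label": "Paper A", "city": "City 1", "has_article": False},
    {"label": "Paper B", "city": "City 1", "has_article": True},
    {"label": "Paper C", "city": "City 1", "has_article": False},
    {"label": "Paper D", "city": "City 1", "has_article": False},
    {"label": "Paper E", "city": "City 2", "has_article": False},
]

print(city_colors(papers))
```

Aggregating per city before colouring would also reduce the number of points drawn, which might help with the sluggishness, though that depends on how the map is actually built.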

Anyway, just wanted to give you an overview of what I'd learned. I don't know whether you discussed this import at WikiProject Periodicals (or a similar venue) prior to performing it, but if not, I'd urge you to do that in the future, to at least have a chance of detecting these kinds of issues ahead of time. I know it's a learning process for all of us, so please don't take that as anything but a suggestion on how to improve future imports.

If you do have thoughts on how to address any of the issues I brought up, I'd be very interested to hear.

IagoQnsi (talkcontribs)

Hi Pete, thanks for the detailed message and the time you've put into this. My apologies for the high duplicate rate -- I had expected it to be much lower. I think the core issue is really that it's just hard to de-dupe newspapers due to how they're named; many newspapers have similar or identical names, and newspapers are often known by several name variants. My goal wasn't really to import new newspapers so much as to import newspapers.com links -- it just worked out that I wasn't able to automatically match that many of the links to existing items.

I don't know that there's an easy solution to this situation. Perhaps we could have better tooling for identifying likely duplicates, but I think this is fundamentally a problem that requires lots of manual cleanup.

IagoQnsi (talkcontribs)

I also do wonder if the rate of duplicates you've found stands up across the entire dataset, as Newspapers.com's collection isn't evenly distributed across the country. They seem to have a particular emphasis in the middle of the country -- in Kansas, Nebraska, and Oklahoma. When I was working on the import, I found that a lot of these newspapers were very obscure; perhaps they existed for two years in a town that existed for ten years but has now been abandoned for a century. I actually had to create a surprising number of new items for the towns these newspapers existed in, as they had not yet made their way into Wikidata. This is why I went forward with the import despite the volume of new items -- it seemed likely to me that a majority of them were indeed completely new.

IagoQnsi (talkcontribs)

By the way, how are you finding all these duplicates? Do you just use that map? I'd be happy to help out in the de-duping process.

Matthias Winkelmann (talkcontribs)

I found a cool 400 duplicates with this most straightforward query: identical places of publication, identical labels, and no dates of inception/dissolution that would differentiate them:

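# Pairs of newspaper items that share an English label and a place of publication.
SELECT DISTINCT ?item ?other ?itemIncept ?otherIncept ?itemPubLabel ?otherPubLabel ?np ?label ?otherLabel WHERE {
  ?item wdt:P7259 ?np.                      # item has a Newspapers.com paper ID
  ?item rdfs:label ?label.
  FILTER(LANG(?label) = 'en').
  ?other rdfs:label ?label.                 # another item with the very same English label
  ?other (wdt:P31/wdt:P279*) wd:Q11032 .    # that is a newspaper (or an instance of a subclass)
  FILTER(?other != ?item).
  OPTIONAL { ?item wdt:P291 ?itemPub.       # places of publication
             ?other wdt:P291 ?otherPub. }
  OPTIONAL { ?item wdt:P571 ?itemIncept.    # inception dates, shown for manual comparison
             ?other wdt:P571 ?otherIncept. }
  # Unbound places fail this comparison, so only pairs with matching places remain.
  FILTER(?itemPub = ?otherPub)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}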

It will probably not show much after I've merged these. But you can find thousands more with a bit of creativity, such as dropping "The" from labels etc. Example: Chadron Record (Q55667983) and The Chadron Recorder (Q100288438).
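As a rough illustration of the label clean-up suggested above, the sketch below normalizes labels (dropping a leading "The", ignoring case and punctuation) and groups items that share a normalized label and place of publication. It is written in Python rather than SPARQL, and the function names, the tuple layout, and the sample items are all hypothetical. The groups it returns are only candidates for manual review, not confirmed duplicates.

```python
import re
from collections import defaultdict


def normalize_label(label):
    """Lowercase, turn punctuation into spaces, and drop a leading 'the'."""
    text = label.casefold()
    text = re.sub(r"[^\w\s]", " ", text)    # hyphens and other punctuation become spaces
    text = re.sub(r"^\s*the\s+", "", text)  # remove a leading article
    return " ".join(text.split())           # collapse runs of whitespace


def candidate_duplicates(items):
    """Group (item_id, label, place) tuples by normalized label and place; keep groups of 2+."""
    groups = defaultdict(list)
    for item_id, label, place in items:
        groups[(normalize_label(label), place)].append(item_id)
    return [ids for ids in groups.values() if len(ids) > 1]


# Hypothetical items; the ids are placeholders, not real QIDs.
items = [
    ("id-a", "The Daily Example", "City 1"),
    ("id-b", "Daily Example", "City 1"),
    ("id-c", "Daily-Example", "City 2"),  # same title, different place: not grouped
]

print(candidate_duplicates(items))
```

Near-matches such as "Record" versus "Recorder" are not caught by this kind of normalization; they would need fuzzy matching or a manual check.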

Checking for duplicates at that level is what I would consider the bare minimum level of care before creating thousands of new items.

Duplicate items are far worse than missing data, because they create an illusion of knowledge/truth, i.e. the consumer of such data will wrangle with "unknown unknowns" instead of "known unknowns". With that in mind, it's simply unacceptable to create tens of thousands of items when the work is shoddy enough to warrant a disclaimer in the description: "Created a new Item: adding Sports-Reference college basketball players (likely contains some duplicates)" (see here for 500 out of x usages of that).

About half of the duplicates I cleaned up were instances where you created both the original and the duplicate, meaning you didn't even deduplicate within your own data. Simply sorting alphabetically would have made it easy to sort this out... Is something I would usually say. But many of these cases have (had) consecutive IDs, meaning they were sorted alphabetically. You just didn't care enough to quickly scroll through the data?

IagoQnsi (talkcontribs)

You're right, I should have caught those. I assumed that Newspapers.com would have done such basic de-duping, as their website presents each newspaper as being a distinct and complete entity. Clearly I was mistaken in this assumption. Mea culpa.

The batch of basketball players I tagged as "likely contains some duplicates" was the set of players in which OpenRefine had found a potential match, but with a very low level of confidence. I manually checked a number of these and found that the matches were completely wrong most of the time, but occasionally there was a correct match. To me the rate of duplicates seemed fairly low and so I figured it was worth having rather than leaving a gap in the SRCBB data.

Although I agree that I could and should have done more to clean up the Newspapers.com batch, I disagree that no data is better than data with duplicates. Duplicates are easily merged, and de-duping is a job that is excellently handled by the crowdsourcing of a wiki. It's very difficult for 1 person to solve 5000 potential duplicates, but it's very easy for 5000 people to stumble upon 1 duplicate each and de-dupe them.

Matthias Winkelmann (talkcontribs)

I just noticed you also added 16k+ aliases that are identical to the labels, which is a complete waste of resources, among other things. As to who should clean that (and everything else) up, I disagree with the idea that "it's very easy for 5000 people to stumble upon 1 duplicate each and de-dupe them", except in the sense that it is easier for you. I'll also cite this:

"Users doing the edits are responsible for fixing or undoing their changes if issues are found."

(from Help:QuickStatements#Best practices)
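On the aliases that merely repeat the label, a pre-upload check along these lines might have filtered them out. This is a minimal Python sketch, assuming exact (case-sensitive) string comparison after trimming whitespace; the function name is hypothetical and it is not part of OpenRefine or QuickStatements.

```python
def drop_redundant_aliases(label, aliases):
    """Return aliases that differ from the label and from each other, keeping their order."""
    seen = {label.strip()}
    kept = []
    for alias in aliases:
        candidate = alias.strip()
        # Skip empty strings, copies of the label, and repeated aliases.
        if candidate and candidate not in seen:
            seen.add(candidate)
            kept.append(candidate)
    return kept


print(drop_redundant_aliases("Herald-Guard", ["Herald-Guard", "Weekly Guard", "Guard", "Guard"]))
```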

IagoQnsi (talkcontribs)

Certainly I agree that problems should be dealt with by the original editor. I should have deduped more and I take responsibility for that. I was talking about deduping of the type that can't be done automatically -- things that require manual inspection to determine that they are the same item. I had thought when I uploaded the dataset that the duplicates would be primarily of this type. I'm sorry that my data upload was lower quality than I had thought and intended it to be.

IagoQnsi (talkcontribs)

@Matthias Winkelmann Here's an example of the kinds of complex cases that I didn't want to systematically apply merges to. The town of Cain City, Kansas has three similarly named newspapers on Newspapers.com: Cain City News, The Cain City News, and Cain-City News. I initially merged all three of these into one item, but upon further investigation, I discovered that only the 2nd and 3rd were duplicates; the 1st entry was a different newspaper. Cain City News (Q100252116) had its Vol. 1 No. 1 issue in 1889, while The Cain City News (Q100252118) had its Vol. 1 No. 1 in 1882 (and its archives end in 1886). In merging newspapers just on the basis of having the same title and the same location, you bulldoze over these sorts of cases. This is why I was so hesitant to merge seeming duplicates -- they often aren't in fact duplicates.
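The Cain City case suggests a guard of roughly this shape: treat a same-title, same-place pair as a possible duplicate only when the known first-issue years do not conflict. The Python sketch below is an assumption about how such a check might look, not a description of any tool used in the import. The field names (`title_key`, `place`, `first_issue_year`) are hypothetical, and the third entry's year is left as `None` because it is not stated in this thread.

```python
def may_be_same_paper(a, b):
    """False if title/place differ or known first-issue years conflict; True otherwise."""
    if a["title_key"] != b["title_key"] or a["place"] != b["place"]:
        return False
    years = (a.get("first_issue_year"), b.get("first_issue_year"))
    if None in years:
        return True  # no date evidence either way: still needs manual inspection
    return years[0] == years[1]


# Years for the first two entries come from the comment above.
cain_city_news = {"title_key": "cain city news", "place": "Cain City, Kansas",
                  "first_issue_year": 1889}      # Q100252116
the_cain_city_news = {"title_key": "cain city news", "place": "Cain City, Kansas",
                      "first_issue_year": 1882}  # Q100252118
cain_hyphen_city_news = {"title_key": "cain city news", "place": "Cain City, Kansas",
                         "first_issue_year": None}

print(may_be_same_paper(cain_city_news, the_cain_city_news))
print(may_be_same_paper(the_cain_city_news, cain_hyphen_city_news))
```

A `True` result here only means "not ruled out"; it does not replace looking at the archives themselves.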

Peteforsyth (talkcontribs)

Oh my, I thought I had replied here long ago, but must have failed to save.

First, I want to say, while it's true this issue has been pretty frustrating to our campaign and ability to access information about our subject matter and our progress, I understand that it came about through good faith efforts. I do think there are some important lessons for next time (and I think it's worth some further discussion, here and maybe at a more general venue, to figure out exactly what those lessons are -- as they may not be 100% clear to anybody yet.)

Specifically in response to the comment immediately above, about the Cain City News: Personally, I strongly disagree with the conclusion; I understand that such merges would be sub-optimal, but in the grand scheme of things, if the choice is:

  • Create tens of thousands of items, of which maybe 40% are duplicates, or
  • De-dupe, potentially merging some hundreds or even thousands of items that are not actually the same, and then create far fewer items

I think the second option is VASTLY superior. These are items that did not previously exist in Wikidata; to the extent they are considered important, they will be fixed up by human editors and/or automated processes over time.

Furthermore, with many smart minds on the problem, it might be possible to substantially reduce the number of false-positives in the de-duping process, so that most items like the Cain City News get caught and fixed ahead of time. (Maybe.)

Which is all to say, it's my understanding that best practice in these cases is to consult meaningfully with other Wikidata editors prior to importing thousands of items, and to allow some time for the discussion to unfold and ideas to emerge. I think this particular instance really underscores that need. Maybe others would agree with your assessment about items like Cain City, or maybe they would agree with me; but without asking the question, how do we know? We don't. It's important in a collaborative environment to assess consensus, prior to taking large-scale actions that are difficult to undo.

Anyway, I think it would be a good idea to bring some of the points in this discussion up at WikiProject Periodicals or similar, and get some other perspectives that could inform future imports.

Reply to "Newspaper items"
Matthias Winkelmann (talkcontribs)

While resolving the duplicate that is The Dallas Morning News (Q889935) vs The Dallas Morning News (Q100292555) (an issue others have already brought up, as I see), I noticed your use of archives at (P485) on all these newspapers.

That is, at the very least, redundant with newspaper archive URL (P7213). "Newspaper" is an actual word of the English language, and not just the parasitic website that is Newspapers.com. That should be obvious from the property translations: even without knowing any of those languages, you will notice that none refer to "newspaper", which they would if it were a proper noun. Capitalisation, the absence of ".com", and the parallel existence of Newspapers.com paper ID (P7259) should be further clues, as well as the discussion page and original proposal.

Further, archives at (P485) does not refer to an archive of the paper, but rather "the paper's archives". That is: it would, consistent with its use for presidents or really anything else, refer to a collection of all the institution's artefacts.

IagoQnsi (talkcontribs)

Thanks for the explanation, and for cleaning up those bad statements. I'll be more conservative about adding properties like those in the future.

IagoQnsi (talkcontribs)

I've just run a batch of university sports clubs where I accidentally added some items' entity ID as an alias. This is obviously unhelpful, so I'm going to undo it. I'm creating this talk page note to serve as a to-do item, and so that anyone else who may discover the issue knows that I'm working on reverting it.

Twitch: removal of P5797 in Wikidata property

LotsofTheories (talkcontribs)

Twitch channel ID (P5797) was removed from Wikidata property in your edit here. It is related to Twitch. Originally edited by @Pamputt. I like the idea of Wikidata property (P1687) having many values, but maybe it shouldn't have too many values? Could you explain why you removed that claim? I want to learn how I should edit items with "Wikidata property" in them in future, and this may help me clean them up.

IagoQnsi (talkcontribs)

P8761 (Sports-Reference.com college football school ID)

Arturo63 (talkcontribs)
IagoQnsi (talkcontribs)

I've added a few new claims but overall it looks good to me. Anything specific you wanted me to look at?

Arturo63 (talkcontribs)

There are two problems with Q6827329 (Miami Hurricanes football) -> Q21502404 and Q21503247. Thank you.

IagoQnsi (talkcontribs)
Arturo63 (talkcontribs)
IagoQnsi (talkcontribs)

They look good to me. I've made a few tweaks but they seem alright. If you're wondering why the links don't work, I think that's just a caching issue -- give it a day or two and the IDs will start being linked.

Tagishsimon (talkcontribs)

You might usefully have indicated the gender of basketball players in your recent upload - can presumably be calculated from the competition class. I'm adding 5k of them right now. Please do better.

IagoQnsi (talkcontribs)

I probably should have for the new players, in hindsight. I didn't include gender because I had existing players in my upload (who could potentially have other values that aren't "male"). The rudeness really isn't necessary; I've put many hours of good faith effort into preparing this upload.

Tagishsimon (talkcontribs)

You're requiring that each of these records (how many have there been so far? 50k?) be re-edited to add the gender, which is a complete waste of resources. You have the capacity to change what your bot is doing. It is not rude to ask you to do better than you are doing, when you are deliberately doing badly. Stop your bot. Make the change.

IagoQnsi (talkcontribs)

OpenRefine doesn't really handle stopping and resuming very well, so I'd rather not stop it. It might take more time to fix the mess caused by cancelling the upload than it would to just add gender after the fact.

Yupik (talkcontribs)

Your upload also seems to have added a new statement alongside existing statements on some items, instead of adding the new reference to the existing statement. As you can run a bot to fix this, please do so, so that I don't have to fix them by hand.

Reply to "basketballists"

Please do not take it back without asking?

Sezgin İbiş (talkcontribs)

The world is not just made up of the English-speaking US, UK and Australia. There are other countries besides yours, and terms that exist in your language can have different meanings in those countries. Given this, your reverts without explanation can create meaningless statements. Therefore, please contact me before changing statements that concern my language.

IagoQnsi (talkcontribs)

I'm sorry my edit referenced English instead of Turkish, but my point still stands. Although football club (Q17270000) and association football club (Q476028) have very similar or even identical names in many languages (Turkish and English included), they refer to different concepts, and thus, should not be linked with a said to be the same as (P460) statement. To indicate that two items have similar names in one or more languages, the property to use is different from (P1889), not said to be the same as (P460).

GZWDer (talkcontribs)

In many cases you conflated multiple people in one item.

GZWDer (talkcontribs)

You can see more examples here - currently 221 cases.

IagoQnsi (talkcontribs)

Hmm, I thought I was pretty careful about that. I think the number should be relatively small, as I'm mostly creating new items rather than modifying existing ones. After my edits finish uploading, I'll do a new batch to clean up items with multiple SRCBB values. Thanks for the heads up.

IagoQnsi (talkcontribs)

Oh, that's a lot. Crap, I'll stop it.

IagoQnsi (talkcontribs)

I think I know what I screwed up; I'll clean up my mess. Blugh. Thanks for the heads up -- I checked a few early items but I should have kept a closer eye on it.