Wikidata:Requests for permissions/Bot/VorontsovIEbot
The following discussion is closed. Please do not modify it. Subsequent comments should be made in a new section. A summary of the conclusions reached follows.
- Approved --Lymantria (talk) 10:47, 11 April 2018 (UTC)[reply]
VorontsovIEbot
VorontsovIEbot (talk • contribs • new items • new lexemes • SUL • Block log • User rights log • User rights • xtools)
Operator: VorontsovIE (talk • contribs • logs)
Task/s: Parse English Wikipedia for infoboxes with dates, transform them into Wikidata statements (if they are missing), manually check for errors and import them to Wikidata via QuickStatements.
Code: the code is part of a geo-history initiative to create an interactive map of historical events. The code lives in a repository. The most relevant files are date_extractor.py, parsing_dates.py and make_quick_statements.rb. One can get an idea of the bot's capabilities from test_parsing_dates.py, which has examples of parsed dates.
One can also take a look at the curation log infos_to_check.tsv. Records marked with ok were recognized correctly; this was verified manually. But this information is based solely on Wikipedia articles, as there are no other bulk sources of such information and many infoboxes don't have references.
Function details: The bot looks for infoboxes with date fields (for now I work on military conflict infoboxes) and collects the names of pages where the dates are filled in. Then the bot tries to read a date or date range: it removes most of the HTML/wikitext markup, then reads Start/End date templates, and then parses plain text. It transforms the date or date range into a human-readable form so that an operator can manually check whether the original text in the infobox date field matches the parsed one; for the most common cases the bot checks that it can write the parsed date back in the same way it was written in the infobox. This helps to process simple cases (but the manual verification step is not skipped, for reliability).
For the first stage I also skip complex dates (like "Late April") and dates with references (I'm going to add references manually). The calendar is usually not specified in the infobox, so I don't set it either (Wikidata then applies its default treatment of the calendar). In cases where it's directly specified I will set the calendar manually.
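The write-back check described above can be illustrated with a minimal sketch. This is not the bot's actual code (that lives in parsing_dates.py); the names `parse_date`, `render` and `round_trips` are hypothetical, and only two plain-text styles are assumed. A date is accepted automatically only if rendering the parsed value reproduces the infobox text exactly; everything else goes to manual review.

```python
import re
from typing import NamedTuple, Optional, Tuple

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

class ParsedDate(NamedTuple):
    year: int
    month: int
    day: int

# Two assumed plain-text styles: "6 June 1944" and "June 6, 1944".
DMY = re.compile(r"^(\d{1,2}) ([A-Za-z]+) (\d{1,4})$")
MDY = re.compile(r"^([A-Za-z]+) (\d{1,2}), (\d{1,4})$")

def parse_date(text: str) -> Optional[Tuple[ParsedDate, str]]:
    """Return (date, style) for a simple date, or None if it is not recognized."""
    m = DMY.match(text)
    if m and m.group(2) in MONTHS:
        date = ParsedDate(int(m.group(3)), MONTHS.index(m.group(2)) + 1, int(m.group(1)))
        return date, "dmy"
    m = MDY.match(text)
    if m and m.group(1) in MONTHS:
        date = ParsedDate(int(m.group(3)), MONTHS.index(m.group(1)) + 1, int(m.group(2)))
        return date, "mdy"
    return None

def render(date: ParsedDate, style: str) -> str:
    """Write the parsed date back in the style it was read in."""
    month = MONTHS[date.month - 1]
    if style == "dmy":
        return f"{date.day} {month} {date.year}"
    return f"{month} {date.day}, {date.year}"

def round_trips(text: str) -> bool:
    """True only if the rendered date is identical to the original text."""
    parsed = parse_date(text)
    return parsed is not None and render(*parsed) == text
```

With this sketch, `round_trips("6 June 1944")` holds, while "Late April 1944" is not recognized at all and "06 June 1944" parses but does not reproduce the original text, so both would be left for the operator.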
After the verification step the bot generates a QuickStatements query for the records that passed, with imported from Wikimedia project (P143) and retrieved (P813) statements in the sources. The properties used are point in time (P585) for dates and start time (P580) plus end time (P582) for date ranges.
The number of statements for military conflicts will be 10-20 thousand.
upd: In some cases military conflict infoboxes are placed on pages about an involved person or place rather than on a page about the conflict itself. Thus I additionally filter such pages (the page title and the infobox title must match). Non-obvious cases are verified manually.
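As an illustration of what such a query could look like, here is a minimal sketch that builds tab-separated QuickStatements (version 1 syntax) lines. It is not taken from make_quick_statements.rb; the function names are hypothetical, only positive years are handled, and unknown month/day parts are filled with 01 as an assumption. In QuickStatements, source properties are written with an S prefix (S143, S813) and the number after the slash is the precision (9 = year, 10 = month, 11 = day). Q328 is English Wikipedia and Q4115189 is the Wikidata Sandbox item, used here only as a placeholder.

```python
from typing import List, Optional

ENGLISH_WIKIPEDIA = "Q328"

def qs_time(year: int, month: Optional[int] = None, day: Optional[int] = None) -> str:
    """Format a positive-year date as a QuickStatements time value with precision."""
    precision = 11 if day else 10 if month else 9
    return f"+{year:04d}-{month or 1:02d}-{day or 1:02d}T00:00:00Z/{precision}"

def qs_line(item: str, prop: str, value: str, retrieved: str) -> str:
    """One statement with 'imported from' and 'retrieved' in the source."""
    return "\t".join([item, prop, value, "S143", ENGLISH_WIKIPEDIA, "S813", retrieved])

def qs_range(item: str, start: str, end: str, retrieved: str) -> List[str]:
    """A date range becomes a start time (P580) and an end time (P582) statement."""
    return [qs_line(item, "P580", start, retrieved),
            qs_line(item, "P582", end, retrieved)]

retrieved = qs_time(2018, 3, 26)
single = qs_line("Q4115189", "P585", qs_time(1944, 6, 6), retrieved)
ranged = qs_range("Q4115189", qs_time(1944, 6), qs_time(1944, 8), retrieved)
```

Here `single` is one P585 line with day precision, and `ranged` holds two lines with month precision.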
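The title filter could be sketched roughly as follows. This is an assumption about how such a comparison might work, not the bot's actual rule: the helper names are hypothetical, and the normalization (dropping bold/italic quotes, unwrapping wikilinks, ignoring a trailing parenthetical disambiguator, case-insensitive comparison) is only one plausible choice.

```python
import re

def normalize_title(title: str) -> str:
    """Reduce a page title or infobox title to a comparable form."""
    # [[target|label]] -> label, [[target]] -> target
    title = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", title)
    # Drop bold and italic wiki markup.
    title = title.replace("'''", "").replace("''", "")
    # Drop a trailing disambiguator such as "(1700)".
    title = re.sub(r"\s*\([^)]*\)\s*$", "", title)
    return " ".join(title.split()).casefold()

def titles_match(page_title: str, infobox_title: str) -> bool:
    """True if the infobox appears to describe the page's own subject."""
    return normalize_title(page_title) == normalize_title(infobox_title)
```

A page such as "William the Conqueror" carrying an infobox titled "Battle of Hastings" would be rejected by `titles_match`, while "Battle of Narva (1700)" with an infobox titled "Battle of Narva" would pass. Near-misses would still need the manual check mentioned above.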
--VorontsovIE (talk) 09:53, 26 March 2018 (UTC)[reply]
- Oppose if the only sources for this will be imported from Wikimedia project (P143). Wostr (talk) 23:49, 27 March 2018 (UTC)[reply]
- Unfortunately there are no other large sources of data on historical events to match the data against, so it's not realistic to fill Wikidata with well-referenced data item by item. Only 5.6 thousand of more than 15 thousand conflicts have Wikidata statements, and that number hasn't changed significantly in the 1.5 years since I first started to parse dates on Wikipedia. About 95% of the articles are older than 1.5 years, so the corresponding pages are more or less polished.
- Also, the very nature of historical knowledge is uncertain, so different sources assign slightly different dates (because the start and end of an event can be defined by different markers, e.g. the end of military action or the formal signing of an agreement). It'd be really hard to model such qualifiers in Wikidata. Dates even without clarifying statements and sources will at least reduce the uncertainty about the date (an inexact date is better than no info at all); it's hard to believe that an unsourced date in an infobox differs much from the actual date.
- At the same time, the DBpedia date parser is really bad at historical dates, so people have no alternative way to get dates of historical events programmatically. VorontsovIE (talk) 08:11, 28 March 2018 (UTC)[reply]
- Support Wikipedia-quality data is much better than no data. Data and references will improve over time here. --Magnus Manske (talk) 12:06, 28 March 2018 (UTC)[reply]
- No data is better than not sourced data. Wostr (talk) 19:10, 28 March 2018 (UTC)[reply]
- Sources are just two clicks away, since you can find them on the wiki page. If data is not referenced on the wiki, that's a bigger problem (but if data is doubtful, a 'citation needed' template solves the problem). I can hardly imagine someone who relies on Wikidata as the last and only source of truth. VorontsovIE (talk) 21:30, 28 March 2018 (UTC)[reply]
- Right now there are already too many complaints about further integration between Wikipedia and Wikidata on my home wiki. One of the problems is that WD data is usually unsourced and copied from Wikimedia projects. I'm not as optimistic as Magnus above – imho unsourced data will remain in such a state for many years. Wostr (talk) 22:39, 28 March 2018 (UTC)[reply]
- Unfortunately, exactly as you said, lots of data will be unsourced for a very long time, because referencing is rather hard and unpleasant work. I'm not sure that Wikipedia itself would be possible if every article was rigorously checked for referencing its content (and I guess most references were never checked to find whether they actually state the mentioned facts). --VorontsovIE (talk) 23:21, 28 March 2018 (UTC)[reply]
- Support As per Magnus Manske's comment above. It's very easy for consumers of Wikidata data to reject statements without proper references if they want to, so best they have the choice to use the best we have to offer at present. NavinoEvans (talk) 15:51, 29 March 2018 (UTC)[reply]
- Comment There might be a problem with BC dates [1], possibly solved since. Not sure if the retrieved date is of much help. --- Jura 22:32, 30 March 2018 (UTC)[reply]
- Yes, there was an off-by-one error (I didn't notice the difference between the RDF and JSON formats at first). That single BC-date example was there exactly to find out whether the script generates correct dates for such a case, so I immediately fixed the date formatting. Now everything should be ok. Here is an example --VorontsovIE (talk) 00:35, 31 March 2018 (UTC)[reply]
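An off-by-one of this kind is what one would expect from two different year-numbering conventions for BC dates. The sketch below rests on an assumption that is not stated in the discussion: that the JSON form counts years without a year zero (44 BC is year -44), while the RDF form uses astronomical numbering, in which 1 BC is year 0 and 44 BC is year -43. The function names are hypothetical.

```python
def bc_to_json_year(year_bc: int) -> int:
    """Year without a year zero (assumed JSON convention): 44 BC -> -44."""
    if year_bc < 1:
        raise ValueError("BC years start at 1")
    return -year_bc

def bc_to_astronomical_year(year_bc: int) -> int:
    """Astronomical numbering (assumed RDF convention): 1 BC -> 0, 44 BC -> -43."""
    if year_bc < 1:
        raise ValueError("BC years start at 1")
    return 1 - year_bc

def astronomical_to_json_year(year: int) -> int:
    """Convert astronomical numbering back to numbering without a year zero."""
    return year if year > 0 else year - 1
```

Mixing the two conventions shifts every BC date by exactly one year, while AD dates are unaffected, which is why a single BC test case is enough to expose the problem.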
- With majority support as it now is, I am ready to approve this request in a couple of days unless new objections are raised. Lymantria (talk) 17:41, 4 April 2018 (UTC)[reply]
- Comment It took 340 years between the adoption of the Gregorian calendar in the Papal States (1583, the date Wikidata uses to decide if a date is Julian or Gregorian by default in the user interface) and the adoption in Greece in 1923 (the last country to switch from Julian to Gregorian). The English Wikipedia uses the calendar in force where the conflict took place (or it's supposed to) or should explain if the two sides were using different calendars. I don't know if QuickStatements has a default, or what it is, but clearly just using the default is wrong. You will have to examine each article to determine which calendar was used. Jc3s5h (talk) 18:24, 4 April 2018 (UTC)[reply]
- Can you please give me a link to the Wikipedia policy on how dates should be given? I didn't find explicit rules that say whether editors should use Gregorian dates or the dates that were in force at the time. In any case, it's questionable whether editors actually stick to these rules: for example, dates of the pre-Julian period are converted to the proleptic Julian or Gregorian calendar (but usually there are no cues as to which of these calendars was used). Unfortunately, as far as I know, Wikidata doesn't allow leaving the calendar unset and doesn't allow "either-or" statements. For example, there are lots of dates for which the day or even the month is not specified, and thus we can tell the month/year with an accuracy of ±1 even if we don't know the calendar used. But Wikidata only allows changing the accuracy to the next level (year/decade). -- VorontsovIE (talk) 14:33, 5 April 2018 (UTC)[reply]
- The link to the English Wikipedia guideline is w:Wikipedia:Manual of Style/Dates and numbers#Julian and Gregorian calendars. Your concern about conversion of pre-Julian calendars is valid. I suggest if you can't figure it out in a particular case, you don't import the data. Instead, put an appropriate tag in the article, such as a template for "Disputed" or "Citation needed", and wait for someone to fix the article. Jc3s5h (talk) 19:42, 5 April 2018 (UTC)[reply]
- Thank you for the link. BTW, I found that the Wikidata dates tutorial has a solution for such cases: "If it can't be determined if the date is in one or the other calendar, that date should be entered in the default calendar with the qualifier sourcing circumstances (P1480) = unspecified calendar (Q18195782)". I guess it's a simple way to start, and then dates with this qualifier can be refined one by one. -- VorontsovIE (talk) 06:37, 6 April 2018 (UTC)[reply]
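In QuickStatements (version 1 syntax) a qualifier is written as an extra property/value pair after the main value and before the source fields. A minimal sketch of a statement carrying the qualifier quoted above might look like this; the function name is hypothetical, Q4115189 (the Wikidata Sandbox item) is only a placeholder, and Q328 stands for English Wikipedia.

```python
UNSPECIFIED_CALENDAR = ("P1480", "Q18195782")  # sourcing circumstances = unspecified calendar

def qs_line_unspecified_calendar(item: str, prop: str, value: str, retrieved: str) -> str:
    """Statement with the 'unspecified calendar' qualifier and Wikipedia as the source."""
    fields = [item, prop, value, *UNSPECIFIED_CALENDAR,
              "S143", "Q328", "S813", retrieved]
    return "\t".join(fields)

line = qs_line_unspecified_calendar(
    "Q4115189", "P585",
    "+1410-07-15T00:00:00Z/11",
    "+2018-04-06T00:00:00Z/11",
)
```

Statements created this way can later be found by their qualifier and refined one by one once the calendar has been established.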
- As no further comment is given, I consider the last objection resolved. --Lymantria (talk) 10:47, 11 April 2018 (UTC)[reply]