How to load the dataset of Italian Schools?

Hello, I'm wondering what the correct process is for loading a dataset. A few weeks ago, I tried to open a new proposal here, but now I do not know how to proceed. Floatingpurr (talk) 09:47, 12 April 2018 (UTC)

Seems to be plenty of supply on that page, but hardly any demand. Multichill (talk) 19:28, 12 April 2018 (UTC)
I see. So what can I do if I want to contribute by loading datasets? Floatingpurr (talk) 11:24, 13 April 2018 (UTC)
There are probably three or four main tasks associated with this proposed upload:
  1. Working out whether we already have records for any of the schools.
  2. Working out the coding so we can map e.g. Region or Province to our values for Italian regions & provinces.
  3. Actually uploading the data, probably using QuickStatements 2 (Q29032512).
  4. Possibly creating a new property for CODICESCUOLA (and, who knows, thinking about CODICEISTITUTORIFERIMENTO and whether we need to support that ID).
So it's a non-trivial task, I'm afraid, one which will take a considerable amount of effort to achieve. There is no automagical solution. So the question is whether you have the time and aptitude to do the work. I can probably give you pointers. --Tagishsimon (talk) 14:28, 13 April 2018 (UTC)
Thanks for your kind reply. I'd like to contribute and try loading datasets to become confident with this process. To go over your points: 1. There are definitely records of such schools, for example Liceo Classico Massimo D'Azeglio (Q3268994). I guess the only way to get them all is string matching, right? How do I merge existing data? 2. Again, in this case string matching is the only way too, isn't it? 3. Aren't there APIs for loading data (e.g., the ones the bots harness)? Do I need a bot to load them? 4. I may start by loading just basic info. I understand it's not a trivial task and I do not know how much time I can dedicate to it. Anyway, I still cannot find a clear way to contribute huge data loads like this one. Floatingpurr (talk) 23:52, 13 April 2018 (UTC)


1. Yes to string matching, but probably based on some useful queries first, such as this one, which picks out all schools described in it.wiki which have wikidata items. I think the idea here would be to add a column to the SCUANAGRAFESTAT20171820170901 spreadsheet giving Q ids for schools that have wikidata items (and leaving that column blank where a school cannot be found).

The string matching will not be easy - the exact form of the strings used on wikipedia or wikidata will vary from those in the spreadsheet, so it may be a painstaking process.
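As a rough illustration of what that painstaking matching might look like, here is a hedged Python sketch using stdlib fuzzy matching; the normalization, the threshold, and the example label dictionary are illustrative only, not part of the actual workflow:

```python
# Sketch: fuzzy-match school names from the ministry spreadsheet
# against labels harvested from Wikidata. Names/threshold are guesses.
from difflib import SequenceMatcher

def normalize(name):
    # Lower-case and drop punctuation so formatting differences
    # (case, apostrophes) do not block a match.
    return "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace()).strip()

def best_match(school, wikidata_labels, threshold=0.85):
    """Return (qid, score) of the closest Wikidata label, or None."""
    target = normalize(school)
    best = (None, 0.0)
    for qid, label in wikidata_labels.items():
        score = SequenceMatcher(None, target, normalize(label)).ratio()
        if score > best[1]:
            best = (qid, score)
    return best if best[1] >= threshold else None

# Tiny illustrative label set, keyed by Qid.
labels = {"Q3268994": "Liceo Classico Massimo D'Azeglio"}
print(best_match("LICEO CLASSICO MASSIMO D'AZEGLIO", labels))
```

Anything scoring below the threshold would still need manual review, which is where the painstaking part comes in.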

2. Much the same approach can be taken for provinces; the idea is to add another column to the spreadsheet giving the Q id for each province. (We can probably forget about Region, since wikidata will have a Province to Region mapping in place already.)

3. The input method is not so important ... the key thing is that we need data in a format that wikidata can cope with, so it is 1 & 2 that are most important. [https://www.wikidata.org/wiki/Help:QuickStatements QuickStatements] is the tool I would use - it can create items, and add properties and qualifiers to items.

4. It will be useful to have the School ID - CODICESCUOLA - I guess. As noted, a property proposal would have to be made for this; not something I have ever done. That process might take some time, so school IDs could be added later.

So, if I were you, I would be adding columns to the spreadsheet and finding Q values for existing school and province items (1 & 2), and also 5. running some reports to understand what data is already associated with existing school items, so that for each school I can work out whether we have an item and what information we have on it. Once I have all that information, I would generate from the spreadsheet tab-separated data in the format that Quickstatements needs in order to add to or modify items.

There are other columns in the spreadsheet ... again, we would need to evaluate what use, if any, will be made of each column of data, and whether there is an existing wikidata encoding for the value (e.g. we may have suitable items for values of DESCRIZIONETIPOLOGIAGRADOISTRUZIONESCUOLA) or whether we need to create a new property. --Tagishsimon (talk) 00:23, 14 April 2018 (UTC)
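For what it's worth, the "generate tab-separated data for Quickstatements" step could be sketched like this in Python. This is a hedged sketch only: the column names (qid, name, commune_qid), the helper name, and the choice of P31=Q3914 are illustrative, not the real spreadsheet layout:

```python
# Sketch: turn one (hypothetical) spreadsheet row into QuickStatements
# tab-separated commands. Existing items are addressed by Qid; new
# items start with CREATE and then use LAST.
def qs_rows(school: dict) -> list:
    rows = []
    if school.get("qid"):        # existing item: address it directly
        subject = school["qid"]
    else:                        # new item: CREATE, then refer via LAST
        rows.append("CREATE")
        subject = "LAST"
    rows.append(f'{subject}\tLit\t"{school["name"]}"')   # Italian label
    rows.append(f"{subject}\tP31\tQ3914")                # instance of: school
    rows.append(f"{subject}\tP131\t{school['commune_qid']}")
    return rows

print("\n".join(qs_rows({"qid": "", "name": "Liceo Classico Massimo D'Azeglio",
                         "commune_qid": "Q495"})))
```

The same function naturally gives the two approaches needed later for existing vs. new schools.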

Great! Thank you very much for that explanation. I'll try to start analyzing this scenario. I hope I can get back to you with good news :) Thank you once again! Floatingpurr (talk) 23:01, 15 April 2018 (UTC)
@Tagishsimon: Hey, I've just completed tasks #1 and #2. Further details are in the import hub. Does the data look fine now? :) Floatingpurr (talk) 15:13, 17 April 2018 (UTC)
That looks very good. Exactly what was needed. As is, you're at a point where you can start an import - we have a handle on the school name, its commune, whether it already exists in wikidata, and its website. Now let's critically examine all the other columns and work out what, if anything, we're going to do with them. Do we want the school ID? Can we say what sort of a school it is (and does wikidata have a code for this, so that instead of saying P31=School, we can say P31=This_type_of_school)? It is not, of course, absolutely necessary to do this step; it's just that this is the best time to do it. I suggest maybe working through this list in this pattern (and change anything I got wrong / come up with better P suggestions).
  • AREAGEOGRAFICA = Area Name -> not required, region maps to area
  • REGIONE = Region Name -> not required; province maps to region
  • PROVINCIA = Province Name -> not required; commune maps to province
  • CODICEISTITUTORIFERIMENTO
  • DENOMINAZIONEISTITUTORIFERIMENTO
  • CODICESCUOLA
  • DENOMINAZIONESCUOLA = School name -> Label
  • INDIRIZZOSCUOLA = address (?) -> P969 (P969)
  • Wikidata School
  • CAPSCUOLA
  • CODICECOMUNESCUOLA
  • DESCRIZIONECOMUNE = Commune -> located in the administrative territorial entity (P131) ?
  • Wikidata City (Comune)
  • DESCRIZIONECARATTERISTICASCUOLA = ??? -> Possible instance of (P31) source?
  • DESCRIZIONETIPOLOGIAGRADOISTRUZIONESCUOLA = ??? -> Possible instance of (P31) source?
  • INDICAZIONESEDEDIRETTIVO
  • INDICAZIONESEDEOMNICOMPRENSIVO
  • INDIRIZZOEMAILSCUOLA = was this email? I don't have the spreadsheet open right now -> if so, then email address (P968)
  • INDIRIZZOPECSCUOLA
  • SITOWEBSCUOLA = official website -> official website (P856)
  • SEDESCOLASTICA
Loading the data will be no more complicated than extracting sets of values from the spreadsheet ... a 5 minute job. Then sit back and watch quickstatements do its thing. So: let me know about this step, and then we can talk more about loading. (And let's think about the school ID ... do we want to hold that? If so, we need to make a new property proposal, or find an existing property that can house the info.) --Tagishsimon (talk) 16:12, 17 April 2018 (UTC)
And a final point for now ... again, without opening the spreadsheet: if the school name is in UPPER CASE, we'll need to make it into Title Case ... and we need to generate descriptions in, say, EN and IT. (And, I guess, work out if the EN and IT labels are the same ... it's not strictly necessary for us to have an EN Label & Description, but we might as well. Ditto any other languages you care about.) --Tagishsimon (talk) 16:45, 17 April 2018 (UTC)
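The UPPER CASE to Title Case step is slightly fiddly because Italian particles ("di", "della", ...) should stay lower-case. A rough sketch, with an illustrative and certainly incomplete stop-word list:

```python
# Sketch: UPPER CASE school names -> Title Case, keeping common
# Italian particles lower-cased. The LOWER set is illustrative only.
LOWER = {"di", "del", "della", "dei", "da", "e", "ed", "in"}

def title_case(name: str) -> str:
    words = name.title().split()   # title() also handles "D'AZEGLIO" -> "D'Azeglio"
    if not words:
        return ""
    rest = [w.lower() if w.lower() in LOWER else w for w in words[1:]]
    return " ".join([words[0]] + rest)

print(title_case("LICEO CLASSICO STATALE DI ROMA"))  # Liceo Classico Statale di Roma
```

Whatever rule is used, it should be applied consistently, since the IT label will later be the match key back to the spreadsheet.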
First of all, thank you once again for your help. Let's start from the list, highlighting what is clear ✅, what to load in Wikidata 📥 and warnings ⚠️:
  • DATASET = Just the name of the file of origin (original data were split into 4 files, according to the Italian organization of schools. For example, some schools are managed by private or local institutions) ✅
  • AREAGEOGRAFICA = Area Name -> not required, region maps to area ✅
  • REGIONE = Region Name -> not required; province maps to region ✅
  • PROVINCIA = Province Name -> not required; commune maps to province ✅
  • CODICEISTITUTORIFERIMENTO (this field is not available for a subset of the dataset but it could be useful (see final notes)) ⚠️ 📥
  • DENOMINAZIONEISTITUTORIFERIMENTO (this field is not available for a subset of the dataset) ⚠️
  • CODICESCUOLA = School ID (new property!) ⚠️ 📥
  • DENOMINAZIONESCUOLA = School name -> Label ✅ 📥
  • INDIRIZZOSCUOLA = address (?) -> P969 (P969). Or better, located on street (P669), since this is just the street name. If we wanted the complete address we should merge INDIRIZZOSCUOLA, CAPSCUOLA and DESCRIZIONECOMUNE. ✅ 📥 ⚠️
  • Wikidata School = mapping to existing Wikidata item ✅ 📥
  • CAPSCUOLA = Italian zip code ✅
  • CODICECOMUNESCUOLA = not required (it's just the commune code, which is already present in Wikidata) ✅
  • DESCRIZIONECOMUNE = Commune name ✅
  • Wikidata City (Comune) = Item of Commune in Wikidata -> located in the administrative territorial entity (P131) ✅ 📥
  • DESCRIZIONECARATTERISTICASCUOLA = ??? -> Possible instance of (P31) source? ⚠️ (see this gist for values distribution)
  • DESCRIZIONETIPOLOGIAGRADOISTRUZIONESCUOLA = ??? -> Possible instance of (P31) source? (see this gist for values distribution). I guess this is the best field for instance of (P31), even if we need a mapping to avoid creating plenty of very specific categories ⚠️ 📥
  • INDICAZIONESEDEDIRETTIVO not important ✅
  • INDICAZIONESEDEOMNICOMPRENSIVO not important ✅
  • INDIRIZZOEMAILSCUOLA = was this email? I don't have the spreadsheet open right now -> if so, then email address (P968) ✅ 📥
  • INDIRIZZOPECSCUOLA = Certified email address ✅
  • SITOWEBSCUOLA = official website -> official website (P856) ✅ 📥
  • SEDESCOLASTICA = not important ✅
  • [] = Italy -> country (P17)
Final notes.
  • This dataset is really fine-grained. We have literally all possible flavors of courses offered by each school. For example, TOPS105001, TO1M056001 and TOPC075007 (yes, those are CODICESCUOLA values :) ) are physically the "same" school. We have 3 distinct entries because that school offers 3 different "flavors" (I guess: scientific high school, classical high school and school for children). There are other cases similar to this one that we cannot group so easily. Anyway, is it acceptable to have distinct records for such cases? If so, sometimes we may insert a property for CODICEISTITUTORIFERIMENTO as a pointer to the "main" school. Unfortunately, this field is not available for all the data.
  • I believe it's important to define/use properties for:
  • CODICESCUOLA: range > ID of the Italian school code (alphanumeric).
  • CODICEISTITUTORIFERIMENTO: range > (school entity) the "main" school of a school. If it exists, it could also be equal to the same school, e.g. (schoolA CODICEISTITUTORIFERIMENTO schoolA).
A lot of caveats, as you have seen. Should I try a basic data import anyway? Floatingpurr (talk) 23:41, 17 April 2018 (UTC)
It's too late right now for me to give much input & reaction to your update. Some things to think through, like CODICEISTITUTORIFERIMENTO (where, if they are classed as distinct schools, albeit co-located and possibly under the leadership of a central school, then we should, as you suggest, have an item for each and point to the parent school. Though note we can use the parent school QID to do that pointing (and I'm not sure without looking which property we would use)).
If you want to rush ahead, yes, you could start importing. I think I would be creating, as new columns in your spreadsheet, the data required to drive Quickstatements, as described at Help:QuickStatements. So in particular, https://www.wikidata.org/wiki/Help:QuickStatements#Item_creation and https://www.wikidata.org/wiki/Help:QuickStatements#Add_statement_with_sources (we should ideally provide as refs stated in (P248), reference URL (P854) and retrieved (P813)). You'll need two slightly different approaches, for schools which do not exist in wikidata and schools which do exist ... clearly, you do not create the latter, but use their Qid to add data.
If you do the data load now, and want to deal with other columns later, then you will have to extract the Qids of the items you created and marry them back up to the spreadsheet for your next round of updates. For this you'll need a simple-enough SPARQL report covering the set of data you're looking at. I can do one if you need.
So, yes, your choice. Me, I would do any additional analysis which is required, e.g. on school type - in other words, I would try to get all of the data sorted out before doing any load, so that I only have to do one load. Then create in the spreadsheet columns of data that can be cut & pasted into quickstatements; then test a single row through quickstatements, and if happy with that, run the whole thing in.


This, for instance, is a row from a Quickstatements thing I did some time ago for Malawi politicians, for your interest.
Q46622996 P39 Q46624811 P580 +2014-06-19T00:00:00Z/11 S854 "http://www.statehouse.mw/malawi-government-cabinet/" S854 "https://www.nyasatimes.com/mutharika-reshuffles-malawi-cabinet-kumpalume-fired-massi-elevated-kachikho-back-ghambi-dropped-kasaila-demoted/"
That is the sort of thing you will need to get in new spreadsheet columns.
Note, as well, that whilst we have one row per school, quickstatements will need several rows per school, such as:
CREATE
LAST LEN "Rome Central High School"
LAST DEN "High School in Rome"
LAST LIT "Roma Liceo Centrale"
LAST DIT "High School di Roma"
LAST P31 Q3914 S248 Q26437540 S854 "http://dati.istruzione.it/opendata/opendata/catalogo/elements1/?area=Scuole" S813 +2018-04-18T00:00:00Z/11
LAST P856 "http://some_web_site.it"
etc
etc
etc
So, maybe lay data out so that it can be put through a text editor which can insert the necessary line breaks. Hope that makes some sense.
So here's an example from some Zambian politician stuff I was doing ... I generated this row in a spreadsheet, exported it to a text editor, and replaced ZZZ with a line break.
CREATE ZZZ LAST Len "Harry Kamboni" ZZZ LAST Den "Zambian politician" ZZZ LAST P21 Q6581097 ZZZ LAST P102 Q1781632 ZZZ LAST P31 Q5 ZZZ LAST P106 Q82955 ZZZ -Q45384522 P39 Q18607856 P580 +2016-08-11T00:00:00Z/11 P2715 Q19428934 P2937 Q45380990 P768 Q45391297 P4100 Q1781632 S854 "http://www.parliament.gov.zm/members-of-parliament" Q45384522 P39 Q18607856 P580 +2016-08-11T00:00:00Z/11 P2715 Q19428934 P2937 Q45380990 P768 Q45391297 P4100 Q1781632 S854 "http://www.parliament.gov.zm/node/5459"
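The ZZZ-to-line-break substitution doesn't need a text editor; e.g. in Python (the row string here is shortened for the example):

```python
# Sketch: replace the ZZZ placeholders exported from the spreadsheet
# with line breaks, yielding one QuickStatements command per line.
row = 'CREATE ZZZ LAST Len "Harry Kamboni" ZZZ LAST Den "Zambian politician"'
print(row.replace(" ZZZ ", "\n"))
```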
Alright, just to recap. Step 1: completing the mapping for DESCRIZIONETIPOLOGIAGRADOISTRUZIONESCUOLA. Step 2: creating properties for CODICESCUOLA and CODICEISTITUTORIFERIMENTO. Step 3: loading data. You are right, CODICEISTITUTORIFERIMENTO ranges over new QIDs of schools in this dataset, so it could be tricky. Is QuickStatements the only way of loading data? What about APIs or bots? Bye, Floatingpurr (talk) 10:48, 18 April 2018 (UTC)

There may be other methods; I don't have experience of them. Putting together a pointer from one school to another will take two passes of quickstatements:

  • Load data using CREATE LAST
  • Run a report of e.g. P31=school, Country=Italy (we should include a country= property) .. i.e. a report which has all our schools, even if it picks up other items. Include the Label in the report.
  • Add Qids to spreadsheet using a match on Label=school_name as the key, so now we have a primary-key Qid column
  • Use CODICEISTITUTORIFERIMENTO in the spreadsheet, e.g. with vlookup, to add a foreign-key Qid column to substitute for CODICEISTITUTORIFERIMENTO
  • Load another quickstatement, this time using the harvested primary-key Qid and the CODICEISTITUTORIFERIMENTO foreign-key Qid for whichever property we use for the relation ... parent organization (P749) or part of (P361) maybe.

And, of course, if we do that, it's no problem if it takes some time to get new properties for CODICESCUOLA and CODICEISTITUTORIFERIMENTO, because we have the Qids and can load that data later. Step 1 is: finish any & all outstanding data mappings ... e.g. where are we with type of school?
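The two-pass join sketched in the bullets above amounts to a vlookup; here is a minimal Python equivalent, with invented codes and Qids, and P749 (parent organization) assumed as the relation pending the property decision:

```python
# Sketch of pass two: substitute CODICEISTITUTORIFERIMENTO codes with
# the Qids harvested from the post-load report (a vlookup stand-in).
# All codes/Qids below are invented examples.
qid_by_code = {"TOPS105001": "Q100", "TO1M056001": "Q101"}  # from SPARQL report

schools = [
    {"code": "TO1M056001", "qid": "Q101", "parent_code": "TOPS105001"},
    {"code": "TOPS105001", "qid": "Q100", "parent_code": "TOPS105001"},  # self-reference
]

statements = []
for s in schools:
    parent_qid = qid_by_code.get(s["parent_code"])
    # Skip self-references and parents we could not resolve.
    if parent_qid and parent_qid != s["qid"]:
        statements.append(f'{s["qid"]}\tP749\t{parent_qid}')
print(statements)
```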

Ok! So, for the moment I'll start completing the mapping for DESCRIZIONETIPOLOGIAGRADOISTRUZIONESCUOLA in the spreadsheet and preparing the first import without caring about CODICESCUOLA and CODICEISTITUTORIFERIMENTO. Floatingpurr (talk) 14:51, 18 April 2018 (UTC)
Alright, we are almost done! As you can observe here, ItalianSchoolsBot has been loading data for some hours. I'll need a 2nd run to load CODICEISTITUTORIFERIMENTO, rendered by parent organization (P749) and has subsidiary (P355). My question is: what should I do in cases where CODICEISTITUTORIFERIMENTO refers to the same item? Should I state it explicitly, e.g. ?item wdt:P749 ?item (?item wdt:P355 ?item), or should I omit those properties? Floatingpurr (talk) 18:23, 7 May 2018 (UTC)
If I understand correctly, omit. To check: CODICEISTITUTORIFERIMENTO is the reference number listed against a subsidiary, which points to the parent. You are saying some schools have a CODICEISTITUTORIFERIMENTO pointing at themselves? If so, omit.
Yes, I just looked in on your bot, very hard at work. Well done, Floatingpurr, very well done. I take my hat off to you. --Tagishsimon (talk) 19:30, 7 May 2018 (UTC)
Yes, your check is right. Sometimes, in the official dataset we have CODICESCUOLA = CODICEISTITUTORIFERIMENTO if a school has no other "parent" schools. I asked this question because, you know, I should load what the original dataset says. But in that case, it's probably better to omit the property to avoid redundancy. As always, thanks for your precious support. We are really close to the goal! :) Floatingpurr (talk) 19:45, 7 May 2018 (UTC)
I see you're on round two - joining children to parents. Good to see :) --Tagishsimon (talk) 00:11, 11 May 2018 (UTC)
Yes, I hope to come back with good news soon :) PS: I've also created the bot home page!

Tagishsimon we are done!!!!!!!!!!!!!!!!!!!!! :)))) Floatingpurr (talk) 23:41, 12 May 2018 (UTC)

Good job, Floatingpurr. Excellent to see it come to a conclusion, and great to see a well-formed and well-referenced dataset. Now you need to turn detective and find the next dataset that needs importing. I'm sure the Italian government must have other treasures :). I might, sometime, see if I can knock some sense into the URLs in the schools source files. I had a look and can see why you dismissed them from this round - all sorts of issues. --Tagishsimon (talk) 23:55, 12 May 2018 (UTC)


Ready to go

@Tagishsimon:, I think I'm almost ready to go. Here (https://drive.google.com/open?id=1gPhxQcLvL9AKT-g9VTxvNcW0A-qPCGYE) you can find the file ready for QuickStatements. As you'll see, I used the syntax for version 2 (with "pipes"). It should be fine. I used item creation for new schools; otherwise I added statements to existing items. For example:

New item:

CREATE||LAST|LIT|"Lombardi" Airola||LAST|DIT|"Istituto Tecnico Industriale di Airola in provinica di Benevento (Italia)"||LAST|LEN|""Lombardi" Airola"||LAST|DEN|"Technical Institute in Airola in the province of Benevento (Italy)"||LAST|P31|Q3914|S248|Q3858490|S854|"http://dati.istruzione.it/opendata/opendata/catalogo/elements1/?area=Scuole"|S813|+2018-04-20T00:00:00Z/11||LAST|P31|Q3803834|S248|Q3858490|S854|"http://dati.istruzione.it/opendata/opendata/catalogo/elements1/?area=Scuole"|S813|+2018-04-20T00:00:00Z/11||LAST|P969|"LARGO ANGELO RAFFAELE CAPONE, 82011 AIROLA"|S248|Q3858490|S854|"http://dati.istruzione.it/opendata/opendata/catalogo/elements1/?area=Scuole"|S813|+2018-04-20T00:00:00Z/11||LAST|P281|"82011"|S248|Q3858490|S854|"http://dati.istruzione.it/opendata/opendata/catalogo/elements1/?area=Scuole"|S813|+2018-04-20T00:00:00Z/11||LAST|P131|"Q55798"|S248|Q3858490|S854|"http://dati.istruzione.it/opendata/opendata/catalogo/elements1/?area=Scuole"|S813|+2018-04-20T00:00:00Z/11||LAST|P968|"BNIS00800R@istruzione.it"|S248|Q3858490|S854|"http://dati.istruzione.it/opendata/opendata/catalogo/elements1/?area=Scuole"|S813|+2018-04-20T00:00:00Z/11

Existing Item:

Q3831889|LIT|L.Classico "Morgagni"||Q3831889|DIT|"Liceo Classico di Forli' in provinica di Forli-Cesena (Italia)"||Q3831889|LEN|"L.Classico "Morgagni""||Q3831889|DEN|"Classical Lyceum in Forli' in the province of Forli-Cesena (Italy)"||Q3831889|P31|Q3914|S248|Q3858490|S854|"http://dati.istruzione.it/opendata/opendata/catalogo/elements1/?area=Scuole"|S813|+2018-04-20T00:00:00Z/11||Q3831889|P31|Q5518|S248|Q3858490|S854|"http://dati.istruzione.it/opendata/opendata/catalogo/elements1/?area=Scuole"|S813|+2018-04-20T00:00:00Z/11||Q3831889|P969|"VIALE ROMA 1/3, 47122 FORLI'"|S248|Q3858490|S854|"http://dati.istruzione.it/opendata/opendata/catalogo/elements1/?area=Scuole"|S813|+2018-04-20T00:00:00Z/11||Q3831889|P281|"47122"|S248|Q3858490|S854|"http://dati.istruzione.it/opendata/opendata/catalogo/elements1/?area=Scuole"|S813|+2018-04-20T00:00:00Z/11||Q3831889|P131|"Q13367"|S248|Q3858490|S854|"http://dati.istruzione.it/opendata/opendata/catalogo/elements1/?area=Scuole"|S813|+2018-04-20T00:00:00Z/11||Q3831889|P968|"FOPC04000V@istruzione.it"|S248|Q3858490|S854|"http://dati.istruzione.it/opendata/opendata/catalogo/elements1/?area=Scuole"|S813|+2018-04-20T00:00:00Z/11||Q3831889|P856|"www.liceoclassicoforli.gov"|S248|Q3858490|S854|"http://dati.istruzione.it/opendata/opendata/catalogo/elements1/?area=Scuole"|S813|+2018-04-20T00:00:00Z/11"

Am I right? 2 questions:

  • How do I get Wikidata QIDs of all the schools in my dataset for adding the CODICESCUOLA and CODICEISTITUTORIFERIMENTO properties in the future?
  • What happens if a statement already exists for an existing item? I don't think I have the effort left to manage those cases one by one. :D

Floatingpurr (talk) 14:58, 20 April 2018 (UTC)

The URL to the spreadsheet is kaput - gives a 404 error. Getting Qids later - search this page for vlookup ... there is advice there, but if you need, I can do a reconciliation between the spreadsheet and the Qid values for you. I /think/ quickstatements ignores an instruction to add a property that's already there, if the value you're trying to add is the same as the current value. Don't worry very much about it; either I'll do some work on your spreadsheet, or else we'll look later with a report. Given the ratio of new to existing schools is about 160:1, I don't think it's a big problem. Might be late tonight - say 8 hours' time - before I can dig into this. --Tagishsimon (talk) 15:06, 20 April 2018 (UTC)
But very quickly, I see (having swapped pipes for line breaks):
  • Probably do not need quotes around the Qid, as in LAST|P131|"Q55798" - not sure if quickstatements minds or not.
  • Address seems to be in upper case: "LARGO ANGELO RAFFAELE CAPONE, 82011 AIROLA". That's suboptimal.
  • You're setting two P31 values, Q3803834 and Q3914. But Q3803834 is a subclass of Q3914, so you need only have Q3803834 ... it is redundant to have Q3914 as well.
  • There does not seem to be a P17 value (country) being set.
  • Quotes on LIT are wrong: "Lombardi" Airola
  • Quotes on LEN are wrong: ""Lombardi" Airola"
  • We're missing the website
All that said, you can go ahead: put one row into quickstatements and see what happens ... look at the record once done and see if you're happy with it.
Ok, I'm gonna fix the problems and share a new link. Floatingpurr (talk) 17:24, 20 April 2018 (UTC)
Well, here is the new link: https://drive.google.com/open?id=1X2Hi5CwCI2VtA9AUYKPHX9NGQfNMkxiA. I fixed the aforementioned problems. I'm not sure all categories subclass the School class, so for the moment I've left the double statements about P31. Regarding websites, they are only available in a subset of the data (unluckily, not in the examples above). Let me know and, as always: thanks! :) Floatingpurr (talk) 17:50, 20 April 2018 (UTC)
On a quick check, all I can see wrong right now is that where the website does not start with http, I think the load will fail. I appreciate the websites in the source are a mix of with & without http ... I think I would be inclined to calculate new website values, along the lines of =IF(LOWER(LEFT(A1,4))<>"http","http://"&A1,A1) ... presuming the source url is in cell A1.
And on P31: it really is better not to put two P31s in. Better by far to get a list of the distinct items you're using for P31, and we can check that they all have instance of school. So the rule would be: if you have a P31 that is not Q3914, then do not also have a P31 of Q3914. I don't, quickly, see any other issues, but maybe it's time to run one through quickstatements and see what happens. --Tagishsimon (talk) 18:12, 20 April 2018 (UTC)
Ok, so I had to remove the web site statements. The problem is wider than we thought. We have some urls like:
  • http//ffff.foo
  • https//fff.foo
  • w.w.w.foo.fo
and so on and so forth. I removed them since I thought we cannot rely on this information. Regarding P31, I removed the basic School statements. Some records are still classified as School since there was no more suitable category for them. Here we go: https://drive.google.com/open?id=1WQEi_T7eHnuYX3KDdx2mtwsAUfNOTLGV Floatingpurr (talk) 22:28, 20 April 2018 (UTC)
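If the websites are ever revisited, the malformed patterns listed above could be repaired mechanically. A hedged sketch: the repairs beyond the patterns shown are guesses, and anything unrecognized is dropped for manual review rather than loaded:

```python
# Sketch: repair the malformed website values seen in the source
# (http//x, https//x, w.w.w.x). Unrecognized values return None.
import re

def fix_url(raw):
    url = raw.strip()
    url = re.sub(r"^(https?)//", r"\1://", url)   # http//x  -> http://x
    url = re.sub(r"^w\.w\.w\.", "www.", url)      # w.w.w.x  -> www.x
    if not url.lower().startswith(("http://", "https://")):
        url = "http://" + url                     # QS needs a scheme
    # Very loose sanity check; everything else goes to manual review.
    return url if re.match(r"^https?://[^\s/]+\.[^\s/]+", url) else None

print(fix_url("http//ffff.foo"))   # http://ffff.foo
print(fix_url("w.w.w.foo.fo"))     # http://www.foo.fo
```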
Okay, well, we can come back to those later. I'm having issues getting into your spreadsheet ... google drive won't let me see a preview, but does give me the option to download a CSV. However the CSV expands to about 56k rows, and then has some rubbish at the end. So ... please check the length of your spreadsheet before you do anything ... compare it with the source spreadsheet's count of rows, lest you now have more rows than you should.
Looking at what has been loaded, I'm seeing only one thing ... the Quickstatement is split across two cells or two columns (which might be a result of the CSV import ... the split happens in the data of the address field).
e.g. for my row 8, I see in Column L
  • CREATE||LAST|LIT|"Monreale Ii-Pioppo"||LAST|DIT|"Scuola Primo Grado di Monreale in provinica di Palermo (Italia)"||LAST|LEN|"Monreale Ii-Pioppo"||LAST|DEN|"Primary School in Monreale in the province of Palermo (Italy)"||LAST|P31|Q9842|S248|Q3858490|S854|"http://dati.istruzione.it/opendata/opendata/catalogo/elements1/?area=Scuole"|S813|+2018-04-20T00:00:00Z/11||LAST|P969|"Via Papa Giovanni Paolo Ii Snc
and in column M,
If you're happy with the number of rows in your spreadsheet, and if this column L & M split is just local to me, I guess you might start throwing data at quickstatements ... do 1 row first, then 10 ... when you're satisfied, let rip! --Tagishsimon (talk) 22:52, 20 April 2018 (UTC)
The number of lines is ok; probably the problem you are experiencing is due to the way your editor splits the rows. I tried one row (without pressing "run") and it was properly parsed! Very good!! Anyway, it seems impossible to copy 65k+ rows into the quickstatements modal (web pop-up). Are there methods for bulk loading (again: 65k+ rows)? Floatingpurr (talk) 00:44, 21 April 2018 (UTC)
Quickstatements is the only method I'm familiar with. I hear what you say - very many rows. On that question, still some concern from me. The file I downloaded from http://dati.istruzione.it/opendata/opendata/catalogo/elements1/?area=Scuole was called SCUANAGRAFESTAT20171820170901 and has 51216 rows of data. Where do your 65k rows come from? Meanwhile I guess your options are to feed the data, in batches, to quickstatements. Or investigate in the direction of 3._Work_with_the_Wikidata_community_to_import_the_data. A tool called mix'n'match might be the solution, but I have no experience of uploading a catalogue to it. User:John Cummings may be able to help here - I think he's familiar with it. I'd be interested to see you run at least one row through QS, just so I can see some of the data as an Item. --Tagishsimon (talk) 01:24, 21 April 2018 (UTC)

Quickstatements and the problem with huge data cardinality

User:Tagishsimon, as you noticed, the number of rows in the dataset is 65k+ (see this file). As mentioned in the import hub, I merged the 4 datasets on this page (http://dati.istruzione.it/opendata/opendata/catalogo/elements1/?area=Scuole). Indeed, the data are partitioned into 4 different files according to the Italian "organization" of schools. In the aforementioned spreadsheet, I inserted the dataset of origin in the first column. I decided not to include the dataset of origin in the wikidata statements for Quickstatements, since I thought it was just confusing information.

As you requested, I've just loaded this statement:

CREATE||LAST|LIT|"D. Buzzati Limana"||LAST|DIT|"Scuola Primo Grado di Limana in provinica di Belluno (Italia)"||LAST|LEN|"D. Buzzati Limana"||LAST|DEN|"Primary School in Limana in the province of Belluno (Italy)"||LAST|P31|Q9842|S248|Q3858490|S854|"http://dati.istruzione.it/opendata/opendata/catalogo/elements1/?area=Scuole"|S813|+2018-04-20T00:00:00Z/11||LAST|P969|"Via Tofane 1, 32020 Limana"|S248|Q3858490|S854|"http://dati.istruzione.it/opendata/opendata/catalogo/elements1/?area=Scuole"|S813|+2018-04-20T00:00:00Z/11||LAST|P281|"32020"|S248|Q3858490|S854|"http://dati.istruzione.it/opendata/opendata/catalogo/elements1/?area=Scuole"|S813|+2018-04-20T00:00:00Z/11||LAST|P17|Q38|S248|Q3858490|S854|"http://dati.istruzione.it/opendata/opendata/catalogo/elements1/?area=Scuole"|S813|+2018-04-20T00:00:00Z/11||LAST|P131|Q40323|S248|Q3858490|S854|"http://dati.istruzione.it/opendata/opendata/catalogo/elements1/?area=Scuole"|S813|+2018-04-20T00:00:00Z/11||LAST|P968|"mailto:BLIC816001@istruzione.it"|S248|Q3858490|S854|"http://dati.istruzione.it/opendata/opendata/catalogo/elements1/?area=Scuole"|S813|+2018-04-20T00:00:00Z/11

Here is the result: https://www.wikidata.org/wiki/Q52083858

Regarding the cardinality problem, the batch mode of quickstatements also seems not to work. It cannot parse 65k+ statements via the browser and, if I am not wrong, there are no alternative options for the upload. I think I need something different (e.g., a bot). Floatingpurr (talk) 11:51, 21 April 2018 (UTC)

Number of rows: okay. Yes, there were other files; I blame my (lack of) Italian. Q52083858 looks lovely. The 'Stated in' qualifier is complaining about a Value Constraint issue, but in my view that is as much to do with bad constraints.
As I said, I'm not familiar with other ways. You may just have to bite the bullet and put them in x,000 by x,000. Or chase around looking for another way, which might take just as long to do.
I'm assuming that although you have changed the case of the school name, you have not changed any of the characters. We will be relying on a lower(string) match against the lower(IT Label) later in the process, to get QIds for the spreadsheet so we can, for instance, put the school code or web address in later.
I know the size problem is frustrating, especially when you are so near. Sorry about that. --Tagishsimon (talk) 13:13, 21 April 2018 (UTC)
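Splitting the 65k+ rows into browser-sized batches can at least be automated; a minimal sketch, with an arbitrary batch size:

```python
# Sketch: split a big list of QuickStatements rows into batches that
# fit the web form. The 5000 batch size is an arbitrary assumption.
def batches(rows, size=5000):
    for i in range(0, len(rows), size):
        yield rows[i:i + size]

rows = [f"row{i}" for i in range(65000)]
print(sum(1 for _ in batches(rows)))  # 13 batches of up to 5000 rows
```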
Don't worry, you did a great job, and your Italian isn't so bad, since you dealt with the semantics of those datasets. Really! :) Yep, I noticed the 'Stated in' issue. If necessary, I can create a new item representing the dataset batch (release 201718), so we can use that entity as the object of 'Stated in'. What do you think about this hypothesis?
That bullet is too hard to bite, I think I'll look for other ways that can be useful also for other bulk loadings.
Yes, the school name is the same. Just changed the case.
I'll keep trying to get this work across the finish line in other ways. :) Floatingpurr (talk) 12:39, 22 April 2018 (UTC)
Yes, I guess creating a new item for the website of the Ministry of Public Education avoids that problem. I did so for a single reference on a single item yesterday, and your 300k+ pointers to that source probably also deserve the extra effort. Disappointed though: I thought you were about to push us over the 47 million (valid) items mark. We're only 40k away. :) --Tagishsimon (talk) 13:01, 22 April 2018 (UTC)
I hope I can effectively contribute soon :) In the meantime, I updated the "stated in" reference in our example: Q52083858. Floatingpurr (talk) 15:16, 22 April 2018 (UTC)

Bot

Bot permission requested. Floatingpurr (talk) 10:04, 23 April 2018 (UTC)

Reports

Properties of IT schools listed in the spreadsheet

Here, for interest, is a report which will produce the properties and values associated with IT schools already in Wikidata. It might be useful somewhere along the way, either as a prompt for the sorts of properties we should use, or so that we can add claims to existing school items that are missing them.

SELECT ?item ?itemLabel ?property ?propertyLabel ?value ?valueLabel WHERE {
  VALUES ?item {wd:Q13392468 wd:Q13551426 wd:Q16165640 wd:Q16542945 wd:Q16566632 wd:Q16566646 wd:Q16566669 wd:Q16566673 wd:Q16572443 wd:Q16572448 wd:Q16572452 wd:Q16572455 wd:Q16572458 wd:Q16572464 wd:Q1657777 wd:Q16601110 wd:Q16601117 wd:Q17462244 wd:Q1761894 wd:Q17633472 wd:Q17633486 wd:Q18086316 wd:Q18288170 wd:Q18288472 wd:Q18288550 wd:Q18289085 wd:Q18785474 wd:Q20008304 wd:Q20008317 wd:Q20009723 wd:Q21236495 wd:Q24935097 wd:Q24935262 wd:Q24935394 wd:Q24940073 wd:Q24940282 wd:Q24940783 wd:Q24942345 wd:Q24942506 wd:Q28668695 wd:Q28669891 wd:Q28670170 wd:Q28670495 wd:Q28670829 wd:Q28671453 wd:Q29838860 wd:Q30880416 wd:Q30880516 wd:Q30880518 wd:Q30880808 wd:Q30880864 wd:Q30888196 wd:Q30888791 wd:Q30888898 wd:Q30889045 wd:Q30889048 wd:Q30889474 wd:Q30889478 wd:Q30890020 wd:Q30890024 wd:Q3155749 wd:Q3268994 wd:Q3625124 wd:Q3682938 wd:Q3682943 wd:Q3682961 wd:Q3682962 wd:Q3682963 wd:Q3682969 wd:Q3689750 wd:Q3689752 wd:Q3689757 wd:Q3689758 wd:Q3719870 wd:Q3735815 wd:Q3747159 wd:Q3803580 wd:Q3803625 wd:Q3803626 wd:Q3803659 wd:Q3803663 wd:Q3803674 wd:Q3803675 wd:Q3803689 wd:Q3803690 wd:Q3803696 wd:Q3803702 wd:Q3803715 wd:Q3803809 wd:Q3803810 wd:Q3803819 wd:Q3803821 wd:Q3803822 wd:Q3803835 wd:Q3803837 wd:Q3803838 wd:Q3803839 wd:Q3803840 wd:Q3803842 wd:Q3803843 wd:Q3803844 wd:Q3803845 wd:Q38297984 wd:Q3831879 wd:Q3831880 wd:Q3831881 wd:Q3831883 wd:Q3831884 wd:Q3831885 wd:Q3831886 wd:Q3831887 wd:Q3831888 wd:Q3831889 wd:Q3831890 wd:Q3831891 wd:Q3831892 wd:Q3831894 wd:Q3831897 wd:Q3831898 wd:Q3831899 wd:Q3831900 wd:Q3831901 wd:Q3831902 wd:Q3831903 wd:Q3831904 wd:Q3831905 wd:Q3831906 wd:Q3831907 wd:Q3831909 wd:Q3831910 wd:Q3831911 wd:Q3831912 wd:Q3831916 wd:Q3831917 wd:Q3831918 wd:Q3831920 wd:Q3831921 wd:Q3831922 wd:Q3831923 wd:Q3831924 wd:Q3831925 wd:Q3831926 wd:Q3831927 wd:Q3831928 wd:Q3831929 wd:Q3831931 wd:Q3831932 wd:Q3831933 wd:Q3831934 wd:Q3831935 wd:Q3831936 wd:Q3831939 wd:Q3831940 wd:Q3831941 wd:Q3831942 wd:Q3831943 wd:Q3831944 wd:Q3831945 wd:Q3831946 wd:Q3831947 
wd:Q3831948 wd:Q3831949 wd:Q3831950 wd:Q3831951 wd:Q3831952 wd:Q3831955 wd:Q3831956 wd:Q3831957 wd:Q3831958 wd:Q3831960 wd:Q3831964 wd:Q3831967 wd:Q3831969 wd:Q3831970 wd:Q3831971 wd:Q3831972 wd:Q3831973 wd:Q3831974 wd:Q3831975 wd:Q3831976 wd:Q3831978 wd:Q3831980 wd:Q3831981 wd:Q3908284 wd:Q3953200 wd:Q3953211 wd:Q3953219 wd:Q3953255 wd:Q3953318 wd:Q3953319 wd:Q3953322 wd:Q3953323 wd:Q3953324 wd:Q3953377 wd:Q47472652 wd:Q48804614 wd:Q48804825 wd:Q48805925 wd:Q48810529 wd:Q5515 wd:Q766265 wd:Q944748 }
  ?property wikibase:directClaim ?wdt .
  ?item ?wdt ?value . 
  SERVICE wikibase:label { bd:serviceParam wikibase:language 'en' }
}
order by ?item ?property
Try it!

Change language 'en' to 'it' if you want Italian language labels.

Here are the distinct P31 values used by existing schools:

and here are the distinct properties used by existing schools:

Schools in Italy with Commune where this exists

This is the skeleton for a report on IT schools ... it can be expanded to include other properties. I've not checked, but there's every possibility that some of the schools you found do not have a P17 of Italy, or a P31 which is school (Q3914) or a subclass of Q3914, and so will not appear in the report.

SELECT ?item ?itemLabel ?commune ?communeLabel WHERE {
  ?item wdt:P31/wdt:P279* wd:Q3914.
  ?item wdt:P17 wd:Q38.
  OPTIONAL {?item wdt:P131 ?commune}.
  SERVICE wikibase:label { bd:serviceParam wikibase:language 'it' }
}
order by ?itemLabel
Try it!

"Istituto d'Arte Statale" (ISA) recorded as "Accademia di belle arti"

I see you recorded ISAs, which are high schools, as "Accademia di Belle Arti"s, which are universities. Can you correct this? --Ogoorcs (talk) 20:05, 21 August 2018 (UTC)

Hello Ogoorcs! Thank you for your kind suggestion. The bot mapped the Italian ISTITUTO D'ARTE to art school (Q383092). Although that may not be the perfect class, as you noticed, it was the only way I found to also include Italian art institutes in queries looking for such school types. I will try to design a clearer model in (possible) future bot runs. Thanks :) Floatingpurr (talk) 15:02, 22 August 2018 (UTC)
I just noticed that someone was importing MIUR databases into Wikidata, otherwise I would (probably) have offered you a hand with the imports :D
Anyway, I understand your difficulty; right now an item for ISAs does not even exist. Do you have direct control over the items edited by your bot? -- Ogoorcs (talk) 20:33, 22 August 2018 (UTC)
No problem :) What do you mean by "direct control"? Anyone can write/edit/delete/merge items :) The ones you mentioned definitely need some polish. I'll take that into account in case of a further massive bot update, unless anyone else fixes those problems beforehand. Floatingpurr (talk) 09:45, 24 August 2018 (UTC)
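The type mapping discussed above can be sketched as a simple lookup table. Only the ISTITUTO D'ARTE → art school (Q383092) entry comes from this discussion; the table name, function, and normalization are illustrative assumptions:

```python
# Illustrative sketch of a school-type mapping table. Only the
# ISTITUTO D'ARTE entry is taken from the discussion above; the rest
# of the structure is a hypothetical shape for such a mapping.

SCHOOL_TYPE_TO_QID = {
    "ISTITUTO D'ARTE": "Q383092",  # art school, as stated above
}

def map_school_type(raw_type):
    """Return the mapped QId, or None when no mapping exists yet."""
    return SCHOOL_TYPE_TO_QID.get(raw_type.strip().upper())
```

Unmapped types come back as None, so they can be collected for review instead of being imported with a guessed class.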

School names

Hi, why did you import the Italian schools without putting a comprehensible name in the Label field, like "Scuola elementare Collodi"? Bye, Susanna Giaccai (talk) 11:29, 30 August 2018 (UTC)

Hi Susanna, thanks for the observation. I chose not to include the school type (e.g., "Scuola elementare") in rdfs:label because it seemed redundant with what is already asserted via instance of (P31) primary school (Q9842). Bye, Floatingpurr (talk) 10:04, 31 August 2018 (UTC)

Scattered discussions

FYI [1] (is that you?), Wikidata_talk:WikiProject_Italy#Item_di_scuole_inseriti_in_automatico. --Nemo 17:04, 29 September 2018 (UTC)

Wow, what did I miss? I didn't know WikiProject_Italy existed! I'll write over there. Thanks! Floatingpurr (talk) 23:22, 29 September 2018 (UTC)
Hi Nemo, I posted a month ago but it's all quiet :) Floatingpurr (talk) 12:25, 31 October 2018 (UTC)
Take advantage of the calm if you want to do some cleanup! It's normal for the initial interest to fade; at the moment I think nobody has a clear idea of how to contribute to that data. --Nemo 18:18, 31 October 2018 (UTC)
Got it! In the meantime I have described in WikiProject_Italy how I got there. As for further edits, given the dust that has been kicked up about the release licenses of public open data, I have stopped while waiting for clarity on the real possibilities of using this information. (see here) Floatingpurr (talk) 18:32, 31 October 2018 (UTC)

high school (Q9826)

Hi! Could you replace the instances of high school (Q9826) with upper secondary school (Q57775519) where P17=Q38? We don't have high schools in Italy ;-) --Horcrux (talk) 10:46, 24 November 2018 (UTC)

Non-uniform abbreviations

It would also be essential to align the abbreviations of the various institutes to a single, predefined standard and to add suitable aliases to make searching for the items easier. --Horcrux (talk) 11:48, 25 November 2018 (UTC)

Hi Horcrux! I'm answering both remarks here. I have paused my contributions to this dataset in light of these issues, which are still under discussion. I'm waiting for some clarity before proceeding. For the moment, thanks for the suggestions :) Floatingpurr (talk) 15:10, 25 November 2018 (UTC)

Non Disponibile

Hi! You've created a ton of statements with postal code (P281) = "Non Disponibile", see the constraint report here. Could you please clean them up? (There are some other constraint-violating Italian schools as well; I'd be happy if you could manage them too, as you should know the data sources and the Italian postal code system better than me.) —Tacsipacsi (talk) 12:58, 19 April 2019 (UTC)

Hey Tacsipacsi! Thanks for your remarks. I'll try to look into this as soon as possible. Cheers. Floatingpurr (talk) 13:35, 19 April 2019 (UTC)
Hi Tacsipacsi, I've just fixed all ~800 postal code (P281) = "Non Disponibile" statements on Italian schools. Bye! Floatingpurr (talk) 16:08, 20 June 2019 (UTC)

Istituti comprensivi

Wouldn't it be better to use https://www.wikidata.org/entity/Q60977885 for the istituti comprensivi? I'm thinking of updating the data I find with the regex ^[a-z]{2}ic. Any objections? Thanks, Francians (talk) 07:20, 14 March 2021 (UTC)
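The regex selection described above can be sketched as follows. Treating the pattern as a prefix of the school code (two-letter province prefix followed by "ic") is an assumption, and the sample codes used below are invented for illustration:

```python
import re

# Sketch of the selection described above: match school codes whose
# first two letters (assumed to be a province prefix) are followed by
# "ic", taken here to mark an istituto comprensivo. This reading of
# the pattern is an assumption based on the regex given above.

IC_PATTERN = re.compile(r"^[a-z]{2}ic")

def is_istituto_comprensivo(code):
    """True when the (lowercased) school code matches the IC pattern."""
    return bool(IC_PATTERN.match(code.lower()))
```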