Wikidata:WikiProject Cultural heritage/Reports/WLM on WD (Italy)
This report describes the process of ingesting data about Italian heritage and cultural properties from Wiki Loves Monuments project lists. The project was initially carried out in the summer/fall 2016 by User:Nvitucci in coordination with User:Cristian Cenci (WMIT) and Yiyi (volunteer), and later continued by AlessioMela, Nemo_bis (volunteer) and Laurentius (volunteer) in 2018.
Data model
To be eligible for Wiki loves monuments Italia, Wikidata items must have the following properties:
- located in the administrative territorial entity (P131), usually pointing to the municipality;
- instance of (P31) as a subclass of geographical feature (Q618123);
- country (P17) with value Italy (Q38).
Whenever possible, they should also have the following properties to properly identify the object:
- street address (P6375)
- coordinate location (P625)
- heritage designation (P1435) set to Italian national heritage (Q26971668), if the monument is under restrictions of the Italian Cultural heritage law (this does not apply to all monuments in Wiki loves monuments).
All objects which actually participate in Wiki Loves Monuments also have:
- Wiki Loves Monuments ID (P2186): the Wiki loves monuments identifier, an alphanumeric code composed of ISTAT ID (P635) for the region + Italian cadastre code (municipality) (P806) + number assigned to the item by whoever processed the authorization (see more below). May have as qualifiers:
- time restrictions for participation in the contest:
- point in time (P585) for single-year authorizations,
- start time (P580) for multi-year authorizations,
- end time (P582) for multi-year but not perpetual authorizations;
- who has authorized the participation of the monument (note that an authorization is not always needed):
- to specify which part of the monument can be depicted:
- applies to part (P518) with the appropriate value (e.g. exterior (Q1385033), façade (Q183061), interior (Q28939874), defensive wall (Q57346), garden (Q1107656), building (Q41176), cloister (Q1430154));
- object named as (P1932) could be necessary in some cases to carry the description used by the source
- source imported from Wikimedia project (P143) Wiki Loves Monuments Italia (Q19960422) for all the statements that have been created from Wiki loves monuments data.
Additionally we strive to add statements which help improve coverage and illustration of the objects, such as:
- image (P18) (see #Images, categories and links for tips on how to expand coverage);
- panoramic view (P4291) (especially for cities)
- Commons category (P373) (infrequently also Commons gallery (P935) or topic's main category (P910)).
Finally, many items use some of the following properties (see a list of examples):
- part of (P361), has part(s) (P527), service retirement (P730);
- other geographical properties, including:
- information on related entities and contacts, including phone number (P1329), fax number (P2900), email address (P968), official website (P856).
Less used, but potentially useful, are many other properties such as Wikidata:List of properties/art and Wikidata property for Wikivoyage.
Additionally, the OpenStreetMap element about the monument should link to Wikidata, using the wikidata: key.
WLM identifiers
Since 2015, Italian WLM identifiers are 10-character strings like 00A0000000, where:
- 00: ISTAT ID (P635) of the region (2 digits);
- A000: Italian cadastre code (municipality) (P806) of the municipality (4 alphanumeric characters);
- 0000: counter, starting from 0001 (4 digits).
For instance, the first monument registered in the municipality of Abbadia San Salvatore (Q91096) has been assigned the identifier 09A0060001: 09 stands for Tuscany, A006 for Abbadia San Salvatore, and 0001 is used because it is the first monument.
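The identifier layout described above can be checked mechanically. The following is a minimal sketch, not the tooling used by the project: the function names are hypothetical, and the pattern assumes that cadastre codes are one uppercase letter followed by three digits (stricter than "4 alphanumeric characters"), which also lets it reject the all-digit identifiers defined before 2015.

```python
import re
from typing import NamedTuple

# Assumed layout: 2-digit region + cadastre code (letter + 3 digits) + 4-digit counter.
WLM_ID_PATTERN = re.compile(r"^(\d{2})([A-Z]\d{3})(\d{4})$")


class WlmId(NamedTuple):
    region: str        # ISTAT ID (P635) of the region
    municipality: str  # Italian cadastre code (P806)
    counter: int       # progressive number, starting from 1


def parse_wlm_id(identifier: str) -> WlmId:
    """Split a post-2015 identifier into its three components."""
    match = WLM_ID_PATTERN.match(identifier)
    if match is None:
        raise ValueError(f"not a post-2015 WLM identifier: {identifier!r}")
    region, municipality, counter = match.groups()
    if int(counter) == 0:
        raise ValueError("the counter starts from 0001")
    return WlmId(region, municipality, int(counter))


def build_wlm_id(region: str, municipality: str, counter: int) -> str:
    """Compose an identifier, zero-padding the counter to 4 digits."""
    candidate = f"{region}{municipality}{counter:04d}"
    parse_wlm_id(candidate)  # raises ValueError if any part is malformed
    return candidate
```

For example, `parse_wlm_id("09A0060001")` yields region `09`, municipality `A006` and counter `1`, matching the Abbadia San Salvatore example.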
A code conversion table is available.
With a SPARQL query you can list the IDs that have already been assigned in a municipality, or the IDs that have a certain prefix (the query may be slow).
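The query linked from this page is not reproduced here. As an illustration only, a prefix search could be shaped like the query built by this small Python helper; the function name is hypothetical, and the query assumes the `wdt:` prefix predefined by the Wikidata Query Service.

```python
def ids_with_prefix_query(prefix: str) -> str:
    """Return a SPARQL query listing items whose WLM ID starts with `prefix`."""
    if not prefix.isalnum():
        # Refuse quotes and other characters that would break the string literal.
        raise ValueError("prefix must be alphanumeric")
    return f"""
SELECT ?item ?wlmId WHERE {{
  ?item wdt:P2186 ?wlmId .
  FILTER(STRSTARTS(?wlmId, "{prefix}"))
}}
ORDER BY ?wlmId
""".strip()
```

Passing `"09A006"` would list the identifiers already assigned in Abbadia San Salvatore, so the next free counter can be read off the last row.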
A different format is used for some specific categories of monuments and for identifiers defined before 2015 (which were just the 6-digit ISTAT ID (P635) of the municipality + a 4-digit counter, supposed to be unique across all monuments).
For veteran trees, a unique identifier was produced by prefixing 6 alphanumeric characters to the catalog code without slashes. A "bis" has been added in case of duplicate catalog codes.
Activities in 2016
Objectives of the project
The main objectives of this project were:
- to make the generation of WLM lists on Wikipedia easier;
- to move as much valid information as possible to Wikidata, for easier management and interoperability with other data sources.
Source data
Our first step was to build a stable “internal” database starting from the lists built over previous editions of the Wiki Loves Monuments contest, in order to make the data management easier. In order to do so, we had to:
- deal with monuments that have been added for some editions and whose authorization was revoked (or not renewed) for later editions;
- deal with a legacy ID system, which could possibly cause the duplication of items;
- clean the identifiers: some IDs had a -MIBAC suffix that was added in the past to support a custom template (see commons:Template:Italy-MiBAC-disclaimer), but had otherwise no special meaning.
The database was initially kept as an OpenOffice spreadsheet file in order not to disrupt the existing processes of (new) data insertion while retaining machine readability; it was later converted to a TSV file for easier processing and transformation. This database was (and is) our central source of data related to WLM, and was used to create WLM lists as well as create/update Wikidata items.
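The identifier clean-up mentioned above is simple to express in code. This is a minimal sketch under stated assumptions: the column name `wlm_id` and both function names are hypothetical, since the actual layout of the TSV database is not documented here.

```python
import csv
import io


def clean_wlm_id(raw: str) -> str:
    """Trim whitespace and drop a trailing -MIBAC suffix, if present."""
    value = raw.strip()
    suffix = "-MIBAC"
    if value.upper().endswith(suffix):
        value = value[: -len(suffix)]
    return value


def load_monuments(tsv_text: str, id_column: str = "wlm_id") -> list[dict]:
    """Read TSV rows into dictionaries, normalising the identifier column."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rows = []
    for row in reader:
        row[id_column] = clean_wlm_id(row[id_column])
        rows.append(row)
    return rows
```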
List creation
We built a tool to generate (or update) Wikitext in order to help the creation of WLM 2016 lists from the newly created database; the tool made it easier to add custom information to selected monuments sets (e.g. to add local prize fields to monuments from selected cities) without resorting to bots. Later on we extended the tool to create (or update) Wikidata elements related to the monuments.
Wikidata items update/creation
The update and creation of Wikidata elements was carried out in phases.
- First of all we searched for existing Wikidata items, with a combination of automated search (e.g. by matching the monument Wikipage and/or name) and manual check. This was needed because in some cases the Wikipage, although already existing, was not related to the monument directly but rather to a set of monuments or even to the city where the monument is located; in other cases, the monument name was not found on Wikidata because it was not included in the label, or because an alias was used instead.
- We then updated the existing elements via the QuickStatements tool; we created the statements to be loaded (see the #QuickStatements details section) using our extended tool.
- We identified the elements that could be created automatically, again via the QuickStatements tool: the best showcase for this process is the list of Pompeii items (see #QuickStatements details), discussed extensively here.
- We created a template (still on a user page only) meant to fetch a single item’s information from Wikidata, with the intention of using it formally for the 2017 edition; we decided not to use it for the 2016 edition because it would have required several changes to the lists. The template can be found here: it "blends" the existing WLM template, where all the information regarding a monument is inserted manually, with another template (wrapping a Lua module) meant to extract such information from Wikidata when a Wikidata identifier is available.
- We also developed a Web tool to make the direct insertion of monument data in Wikidata easier, but the tool is still experimental.
QuickStatements details
The QuickStatements tool is used to batch-insert content into Wikidata. QuickStatements is “unsupervised” (i.e. the content is just inserted with no formal verification process except for the format of what is inserted); since it is possible to insert unverified data (even in large amounts), care is required when adding content.
Schema
We mostly used classes and properties that already existed in Wikidata when we started (summer 2016), especially the existing WLM ID property (which comes with constraints on the use of other properties). We needed to create the Italian national heritage (Q26971668) item to use as the value of the heritage designation (P1435) property, although we found that there might be some shortcomings (e.g. is it ok to use it also for natural heritage? Can it be used even if there is no official heritage list?).
Updating existing items
The “safest” route is to only update existing elements with further information, in our case with a WLM ID, labels and (sometimes) addresses. Example:
Qxxxxxx Lit "name"
Qxxxxxx P2186 "0123456789" S143 Q19960422
Qxxxxxx P17 Q38 S143 Q19960422
Qxxxxxx P131 Qyyyyyyyy S143 Q19960422
Qxxxxxx P1435 Q26971668 S143 Q19960422
These statements would update the element with Q number Qxxxxxx by:
- giving it the Italian label name;
- assigning it the Wiki Loves Monuments ID (P2186) 0123456789;
- assigning Italy (Q38) as the country where it is located;
- assigning the element with Q number Qyyyyyyyy as the municipality where it is located;
- assigning it the heritage designation (P1435) of Italian national heritage (Q26971668) .
All the statements carry an imported from Wikimedia project (P143) reference (the S143 columns), stating that this information has been imported from Wiki Loves Monuments Italia (Q19960422).
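Update commands of this kind can be generated from a database row. The sketch below is illustrative rather than the project's actual tool: the function name is hypothetical, and it assumes the tab-separated column layout of the classic QuickStatements input format.

```python
WLM_ITALIA = "Q19960422"  # Wiki Loves Monuments Italia, used as the P143 source


def update_statements(qid: str, label_it: str, wlm_id: str, municipality_qid: str) -> str:
    """Return QuickStatements commands that enrich an existing item."""
    source = ["S143", WLM_ITALIA]
    rows = [
        [qid, "Lit", f'"{label_it}"'],                # Italian label
        [qid, "P2186", f'"{wlm_id}"', *source],       # Wiki Loves Monuments ID
        [qid, "P17", "Q38", *source],                 # country: Italy
        [qid, "P131", municipality_qid, *source],     # municipality
        [qid, "P1435", "Q26971668", *source],         # heritage designation
    ]
    # One command per line, columns separated by tabs (assumed input format).
    return "\n".join("\t".join(row) for row in rows)
```

Labels containing double quotes would need escaping or manual handling before being passed to this function.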
Creating new items
The somewhat "riskier" route is to create a new Wikidata element as we did for Pompeii buildings; one should first make sure that the element does not already exist, since the risk is to create duplicates. That said, here is an example:
CREATE
LAST Lit "{nome} ({regio}.{insula}.{pos})"
LAST Ait "Pompei {regio}.{insula}.{pos}"
LAST Aen "Pompeii {regio}.{insula}.{pos}"
LAST P17 Q38 S143 Q19960422
LAST P131 Q36471 S143 Q19960422
LAST P31 Q109607
LAST P1435 Q26971668
LAST P2186 "0123456789" S143 Q19960422
LAST P276 Q43332 S143 Q19960422
LAST P361 Qxxxxxx S143 Q19960422
LAST P528 "{regio}.{insula}.{pos}" P972 Q27055447
These statements would create an item with:
- both a label and an alias (in Italian);
- an English alias (see #Challenges);
- Italy (Q38) as its country location;
- Pompei (Q36471) as its administrative location;
- ruins (Q109607) as its type;
- a Wiki Loves Monuments ID (P2186) ;
- Italian national heritage (Q26971668) as its heritage designation (P1435) ;
plus some information specific for Pompeii items:
- Pompeii (Q43332) as its "conceptual" location;
- Qxxxxxx as the insula (Q26960982) it is part of (P361) ;
- catalog code (P528) {regio}.{insula}.{pos}, qualified with catalog (P972) set to catalogue of Pompeii buildings (Q27055447).
For Pompeii, the insertion of all the items (~2000) took several minutes. In order not to use up too many resources, we loaded such statements in chunks depending on the item's regio (Q26912005) (there are 9).
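The chunked loading described above can be sketched as follows. This is an assumption-laden illustration, not the original script: the function names and the dictionary keys (`nome`, `regio`, `insula`, `pos`, `wlm_id`, `insula_qid`) are hypothetical, and tab-separated columns are assumed as the QuickStatements input format.

```python
from collections import defaultdict

SOURCE = ["S143", "Q19960422"]  # imported from Wiki Loves Monuments Italia


def create_commands(nome: str, regio: str, insula: str, pos: str,
                    wlm_id: str, insula_qid: str) -> str:
    """Return the CREATE block for one Pompeii building."""
    code = f"{regio}.{insula}.{pos}"
    rows = [
        ["CREATE"],
        ["LAST", "Lit", f'"{nome} ({code})"'],
        ["LAST", "Ait", f'"Pompei {code}"'],
        ["LAST", "Aen", f'"Pompeii {code}"'],
        ["LAST", "P17", "Q38", *SOURCE],
        ["LAST", "P131", "Q36471", *SOURCE],
        ["LAST", "P31", "Q109607"],
        ["LAST", "P1435", "Q26971668"],
        ["LAST", "P2186", f'"{wlm_id}"', *SOURCE],
        ["LAST", "P276", "Q43332", *SOURCE],
        ["LAST", "P361", insula_qid, *SOURCE],
        ["LAST", "P528", f'"{code}"', "P972", "Q27055447"],
    ]
    return "\n".join("\t".join(row) for row in rows)


def batches_by_regio(buildings: list[dict]) -> dict[str, str]:
    """Group CREATE blocks by regio so that each batch can be loaded separately."""
    grouped = defaultdict(list)
    for building in buildings:
        grouped[building["regio"]].append(create_commands(**building))
    return {regio: "\n".join(blocks) for regio, blocks in grouped.items()}
```

Each value of the returned dictionary is one batch, so at most nine batches would be submitted, one per regio.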
Results
There are now ca. 4,500 WLM items on Wikidata spanning more than 100 types, the five most represented being:
- ruins (Q109607) (2,000+ items);
- church building (Q16970) (400+ items);
- palazzo (Q2651004) (300+ items);
- monument (Q4989906) (200+ items);
- museum (Q33506) (100+ items).
Most of these types were extracted directly from monument names (possibly with some text manipulation, e.g. character substitution or the use of synonyms such as "chiesa" and "chiesetta" or "palazzo comunale" and "palazzo municipale"). All the monuments belonging to each class were manually checked before insertion in order to avoid duplicates (with the exception of most of the Pompeii monuments, which were created from scratch).
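The kind of name-based type extraction described above can be sketched as follows. The synonym pairs and Q-ids come from this report; the Italian keywords `museo` and `monumento`, the choice to match only at the start of the name, and the function names are assumptions made for illustration, and every guess would still need the manual check mentioned above.

```python
import re
from typing import Optional

# Variant spelling -> canonical form (pairs mentioned in the report).
SYNONYMS = {
    "chiesetta": "chiesa",
    "palazzo municipale": "palazzo comunale",
}

# Leading keyword -> Wikidata class (keywords are assumed, Q-ids are from the report).
TYPE_BY_KEYWORD = {
    "chiesa": "Q16970",      # church building
    "palazzo": "Q2651004",   # palazzo
    "museo": "Q33506",       # museum
    "monumento": "Q4989906", # monument
}


def normalise(name: str) -> str:
    """Lowercase, collapse whitespace and replace known synonyms."""
    text = re.sub(r"\s+", " ", name.strip().lower())
    for variant, canonical in SYNONYMS.items():
        text = re.sub(rf"\b{re.escape(variant)}\b", canonical, text)
    return text


def guess_type(name: str) -> Optional[str]:
    """Return the Q-id suggested by the first word of the name, or None."""
    text = normalise(name)
    for keyword, qid in TYPE_BY_KEYWORD.items():
        if re.match(rf"{re.escape(keyword)}\b", text):
            return qid
    return None
```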
Challenges
During the migration project we faced a number of challenges.
- The creation of stable IDs: since 2012 monument IDs have been created within the WLM project, because there is no comprehensive, unified, national monument database (although some effort is being made now);
- "Noisy" data: since any municipality can propose its own monuments for the list, we had (and still have) some items that are not actually "monuments" but rather points of interest or cultural properties (even though the “monument” definition is vague sometimes).
- Relevance (as a consequence): is every item from WLM lists a "notable" Wikidata item? After some discussions we opted for the “Yes, because it supports the WLM project” route, although we still had doubts about hamlets, main squares, and other elements to be treated as "monuments".
- Manual work (especially for verification) was often needed.
- Time needed to agree on data structure.
- Discussions about the creation of new items (especially for Pompeii):
- When several monuments are grouped together (e.g. arcades, buildings of a hamlet, or Pompeii buildings), is every single building a “monument” on its own or is it just a part of a larger monument (i.e. "arcades", or the Pompeii site)?
- Since there is no official list of codes for all the buildings in Pompeii, is it possible to use a non-official (but well documented) source?
- How should types (and English labels/description) be obtained for a building? In some cases we could extract the type from a monument name, otherwise this should be added afterwards (but the addition of items without at least a type is discouraged).
- Question about dates: should we use Julian or Gregorian dates?
- Address validation: we planned to add them after some further verification, to possibly make use of some geolocation.
- What is the "address" of a natural resource (e.g. a park or a river)? Can the address provided by a municipality (and inserted in the WLM lists) be always considered "valid"?
Conclusions
Our conclusion is that this migration process is not technically difficult per se, but it brings many decisions and questions to be answered from a more general point of view. Our main results were:
- cleaner data (especially data from Emilia Romagna region, thanks to their monument database);
- cleaner and stable IDs;
- a way to (semi-)automate creation and update of WLM-related Wikidata items from a database, possibly with custom rules;
- a way to make the creation and update of WLM lists easier, so that information does not have to be scattered and repeated.
We already planned to make some fixes and updates for the 2017 edition of WLM.
2018 continuation
Luoghi della cultura
Bot codebase
See: https://github.com/synapta/wikidata-mibact-luoghi-cultura
All the code of the bot is published on GitHub, with documentation (in Italian) covering both the high-level data flow and how to run it. The bot can be launched again in the future to upload to Wikidata any new data that MIBACT publishes. This is a realistic scenario, given that during the activity the number of cultural places considered went from 26,899 to 27,513. (Some of them proved to be particularly poor in data or duplicated and were discarded by the bot.)
Before the activity on Wikidata there were 13,310 Italian monuments according to the query:
After the activity the number has increased to 34,495. Of the 27,290 edits made by the bot, 20,085 were creations and 7,205 were updates to existing items.
Frontend
At the address above, we published an interface that takes a more detailed version of the previous query as its data input and offers a search engine over the candidate monuments for WLM. The data is updated automatically, with a delay of a few minutes compared to Wikidata, so future automatic or manual entries will also appear in that table.
In the search box you can enter a municipality, a province or a region to see the monuments of that place.
Wikipedia integration
See: w:it:Progetto:Wiki_Loves_Monuments_2018/Monumenti/Piemonte/Città_metropolitana_di_Torino.
In this example page on Wikipedia we used Template:Wikidata_list to automatically create lists that were previously compiled by hand, such as Progetto:Wiki Loves Monuments 2017/Monumenti/Piemonte/Città metropolitana di Torino.
Export
Use a Wikidata SPARQL query to export items and their WLM-related data to a spreadsheet.
Another query can be used to export a smaller dataset with the "codice catasto" of the municipality where the object is located.
Further tweaks to the data
Data is continuously being improved. Next steps include:
- adding the end time (P582) qualifier to the Wiki Loves Monuments ID (P2186) statements (when authorizations have an expiry; by September) e.g. [1]
- adding maintained by (P126) to a greater number of entities (where most relevant) e.g. [2]
Tuscany
Tuscany has had a parallel organization since 2018, summarized here.
The region produces a considerable share of the national image uploads (25-40%). As already noted in the past, also within WMI, it is therefore statistically misleading to analyze the Italian data as a whole: where possible, Tuscany should be compared with Italy without Tuscany, because the processes are different.
One reason for the separate organization is the need for a more "wiki" approach; another is the reduction of mistakes. The original concern was the clean-up of massive imports, but more generally the process needs constant checking. According to data from local volunteers, about 10-15% of the WLM information provided by local authorities is wrong (mix-up of different concepts, wrong properties of the places, minor mistakes in addresses). The network of volunteers carefully checks this information and verifies it with the offices; this is considered a necessary step to avoid more time-consuming corrections later.
Tuscan items in general have more properties, more links to Commons categories and more IDs. IDs related to cultural heritage on Wikidata are often produced as a result of the Tuscan Wikidata activity in the framework of WLM (Art Bonus ID (P8564), Visit Tuscany ID (P8083), Arachne building ID (P6787), Pietre della Memoria ID (P5726), BeWeb church ID (P5611), TCI destination ID (P5601)).
The system was discussed abroad, for example at the WikiData Days 2019 in Portugal.
Since 2020 the Tuscan volunteer network has been improving the OTRS system used to store the permissions, uploading them to Commons. Currently 80% of the permissions can be handled entirely by the community (with the exception of I.D. information), with a great reduction of costs and an increase in efficiency. It is also easier now to monitor the evolution of the competition and to use such information as a reliable source for statements.
Statistics and reports
Tips
SPARQL queries can help with various tasks:
- search the Q-ids corresponding to a list of WLM-IDs, e.g. for comparison with the original source of the data
- search WLM-ID statements for a list of municipalities
- list IDs used more than once (they need to be made unique) and search the items with prefix haswbstatement:P2186=
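Once the (item, identifier) pairs have been exported, duplicates can also be found offline. This is a minimal sketch with hypothetical function names; it assumes the pairs come from one of the queries above and builds the search string using the haswbstatement: prefix mentioned in the list.

```python
from collections import defaultdict
from typing import Iterable


def duplicated_ids(pairs: Iterable[tuple[str, str]]) -> dict[str, list[str]]:
    """Map each WLM ID used on two or more items to the sorted list of those items."""
    items_by_id = defaultdict(set)
    for qid, wlm_id in pairs:
        items_by_id[wlm_id].add(qid)
    return {
        wlm_id: sorted(qids)
        for wlm_id, qids in items_by_id.items()
        if len(qids) > 1
    }


def search_string(wlm_id: str) -> str:
    """Build the Wikidata search string that finds the items carrying an ID."""
    return f"haswbstatement:P2186={wlm_id}"
```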
See also the recent changes connected to the WLM lists.
Statistics
- All monuments in Wiki loves monuments Italia
- All monuments of WLM-IT without an image (P18) (about 8200 as of August 2019, vs. about 5000 with an image)
- List of municipalities that granted authorization on the basis of records created by AlessioBot
- Italian provinces by number of monuments
- Italian regions by number of monuments
For various uses:
- Titles of it.wiki articles connected to WLM-IT monuments
- Usage of applies to part (P518): complete list and number of occurrences
- Usage of approved by (P790): plain list and grouped by authorizer
- Most used qualifiers of Wiki Loves Monuments ID (P2186): [3]
Reports for cleanup and data improvement
See #Data_model for additional information on why these properties are important for WLM. See /Property coverage for the coverage of main properties.
Issues that are likely to create problems for Wiki loves monuments:
- Issues with located in the administrative territorial entity (P131):
- Elements without located in the administrative territorial entity (P131): [4]
- Elements whose located in the administrative territorial entity (P131) is not an Italian municipality: [5]
- Elements located in Italy whose located in the administrative territorial entity (P131) is not part of an Italian province [6]
- Issues with dates:
- Identifiers with both a point in time (P585) and a start time (P580) or end time (P582): [7]
- Identifiers whose start time (P580) is a later date than end time (P582): [8]
- Identifiers whose start time (P580), end time (P582) or point in time (P585) are less precise than a year [9]
- Identifiers whose start date or end date are earlier than 2012 or later than 2020
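The date checks listed above can also be run locally before uploading qualifiers. The sketch below is an illustration with a hypothetical function name; it assumes dates have already been reduced to plain years, so the precision check is not covered.

```python
from typing import Optional

MIN_YEAR, MAX_YEAR = 2012, 2020  # bounds used by the report above


def date_issues(point_in_time: Optional[int] = None,
                start_time: Optional[int] = None,
                end_time: Optional[int] = None) -> list[str]:
    """Return the problems found in the date qualifiers of one WLM ID statement."""
    issues = []
    # A single-year authorization should not also carry a multi-year range.
    if point_in_time is not None and (start_time is not None or end_time is not None):
        issues.append("P585 combined with P580/P582")
    if start_time is not None and end_time is not None and start_time > end_time:
        issues.append("P580 later than P582")
    for label, year in (("P580", start_time), ("P582", end_time)):
        if year is not None and not (MIN_YEAR <= year <= MAX_YEAR):
            issues.append(f"{label} outside {MIN_YEAR}-{MAX_YEAR}")
    return issues
```

An empty list means that none of the checked problems apply to the statement.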
Other issues:
- Elements without country (P17) Italy (Q38): [10]
- Elements without instance of (P31): [11]
- Elements whose instance of (P31) is not a subclass of geographical feature (Q618123): [12]
- Elements where location (P276) and located in the administrative territorial entity (P131) coincide: [13]
- In Friuli-Venezia Giulia, intermunicipal territorial unions (Q27961023) have replaced provinces. At some point in the future the data will probably need to be updated accordingly.
To support improving the data:
- Monuments in Mantova province linked and not linked by OpenStreetMap
Images, categories and links
You can use existing data to semi-automatically add image links (P18):
- add P18 with WDfist, query takes ~10 min; see also narrower query for WLM monuments only
Several reports help find items ripe for improvement: