Wikidata:Property proposal/local to a language context

local to a language context


Originally proposed at Wikidata:Property proposal/Generic

   Not done
Description: property to identify the concepts related to the group of territories where a language is spoken
Represents: language territory (Q8561610)
Data type: Item
Domain: language territory (Q8561610), the group of territories where a language is spoken. This is the language context. A concept of a language context is related to both the territorial entities (political territorial entity (Q1048835) and country (Q6256)) and the culture (Q11042).
Allowed values: any language Q-item.
Example 1: Naples (Q2634) → Italian (Q652), Neapolitan (Q33845)
Example 2: pesto (Q9896) → Italian (Q652)
Example 3: Juventus FC (Q1422) → Italian (Q652)
Example 4: Vasco Rossi (Q17171) → Italian (Q652)
Example 5: Pep Guardiola (Q164038) → Catalan (Q7026)
Example 6: Sagrada Família (Q48435) → Catalan (Q7026)
Example 7: Andorra la Vella (Q1863) → Catalan (Q7026)
Example 8: crema catalana (Q842566) → Catalan (Q7026)
Example 9: Inca civilization (Q3404008) → Quechua (Q5218)
Example 10: Huascarán (Q200935) → Quechua (Q5218)
Example 11: Quechua (Q134936) → Quechua (Q5218)
Example 12: Chavin de Huantar (Q732554) → Quechua (Q5218)
Planned use: creating selections of prioritized content to bridge the gaps and help language editions reach a higher cultural diversity in their contents.
See also
  • language used (P2936): language widely used (spoken or written) in this place or at this event or organisation
  • languages spoken, written or signed (P1412): language(s) that a person or a people speaks, writes or signs, including the native language(s)
  • language of work or name (P407): language associated with this creative work (such as books, shows, songs, broadcasts or websites) or a name (for persons use "native language" (P103) and "languages spoken, written or signed" (P1412))
  • official language (P37): language designated as official by this item
  • native language (P103): language or languages a person has learned from early childhood
  • native label (P1705): label for the items in their official language (P37) or their original language (P364)
  • on focus list of Wikimedia project (P5008): property to indicate that an item is of particular interest for a Wikimedia project. This property does not add notability. Items should not be created with this property if they are not notable for Wikidata. See also P6104, P972, P2354.

Motivation


This property is needed to identify the most important items related to the context in which a language is spoken, whether that context comprises one country or region or several. These items can be places, but also traditions, language, politics, agriculture, biographies, events, etc. Together they form a collection local to a language context (e.g. Banbury cake is local to the English context, as are Times Square and the comedian David Mitchell).

We suggest this property because, in order to bridge the content gaps between language editions, it is essential to identify which articles are “local” (see Cultural Context Content or related papers[1][2]). Local articles tend to be more developed because editors have more direct knowledge of the topics and better access to sources. By identifying which articles are local to each language context, it is possible to create lists of essential or vital articles that guarantee a minimum of cultural diversity in content.

As noted above, a language context contains all kinds of topics. According to the most recent results from the method proposed by the Wikipedia Cultural Diversity Observatory (WCDO) project, content related to a language context accounts for around 25% of the articles in the 40 largest Wikipedias. In smaller language editions the percentage is much lower, as they have not yet devoted enough attention to representing their own context. This is clearly a barrier to the Wikipedia project as a whole achieving the sum of human knowledge.

This property requires a language item as its value. Based on the data provided by the WCDO project, we will create triples marking the articles (100 to 500 per language) that relate to the language context of each of the three hundred language editions. These are called Top CCC articles.

References

  1. Miquel-Ribé, M., & Laniado, D. (2018). Wikipedia Culture Gap: Quantifying Content Imbalances Across 40 Language Editions. Frontiers in Physics.
  2. Miquel-Ribé, M., & Laniado, D. (2019). Wikipedia Cultural Diversity Dataset: A Complete Cartography for 300 Language Editions. Proceedings of the 13th International AAAI Conference on Web and Social Media. ICWSM. ACM. 2334-0770

--Marcmiquel (talk) 15:41, 13 January 2020 (UTC)

Discussion

@Marcmiquel: Isn't this use case taken care of now by the combination of location properties and language used (P2936) on the geographic items? For example Sagrada Família (Q48435) located in the administrative territorial entity (P131) ... Catalonia (Q5705) which has language used (P2936) Catalan (Q7026) (and others). Or are you trying to accomplish something else here? ArthurPSmith (talk) 20:38, 13 January 2020 (UTC)
@ArthurPSmith, you are right that some cases are covered by location and language properties. The creation of the dataset of articles that belong to a language context for each language edition uses these two properties (and many other aspects, as you can see in the links/papers). The final selection of articles local to a language context is richer, though: it includes items that range from traditions, people, places, language traits, etc. Having these collections is essential to later select the most relevant part of each language context, which should be prioritized in translation to other language editions. This is what the "Top CCC lists" are doing. They are algorithm-generated lists of the 100-500 most essential articles of each language context that every other Wikipedia should have in order to ensure a minimum coverage of the existing Wikipedia cultural diversity. The purpose of this property is to be able to search for the gaps using Wikidata queries. --Marcmiquel (talk) 11:02, 14 January 2020 (UTC)
@Marcmiquel: So it's not just location + language, but also some notion of importance? How would you prevent this property from automatically being applied to, say, every village in Italy, or every type of pasta, to add "Italian" as the context? Or would that be ok? If there's a significance criterion involved then I don't think the label quite matches what you are trying to do here. ArthurPSmith (talk) 18:28, 14 January 2020 (UTC)
It occurred to me that you are perhaps trying to duplicate this property: on focus list of Wikimedia project (P5008)? Would that existing property meet your needs? ArthurPSmith (talk) 18:30, 14 January 2020 (UTC)
@ArthurPSmith: it is not just location + language, it's everything related to the context where the language is spoken. The notion of importance lies in using the property for just 100 or 500 items. However, the cultural context content of a language is more extensive (in the Italian Wikipedia it is 17.17% of its content, in the English 44.91%, in the Catalan 16.62%). With this property as presented, editors could apply it to the entire extent of cultural context content. I think that would be ok, but then there should be another way to mark the first 100 or 500. Or this property should have a criterion of importance built into it. The on focus list of Wikimedia project (P5008) is good for specific projects, but here we are talking about a property with a possible 300 values (languages). If we apply it to 100 items per language, then we have 30,000 items which are the "most relevant content for cultural diversity in Wikipedia". --Marcmiquel (talk) 08:53, 16 January 2020 (UTC)
Thanks, I fixed it. --Marcmiquel (talk) 10:51, 14 January 2020 (UTC)
@ChristianKl: Thanks for your answer. Yes, a cultural context is richer than location and language; it is what the speakers of that language create in it. Could you explain how you imagine this new property? Isn't it the one I am suggesting? --Marcmiquel (talk) 10:21, 16 January 2020 (UTC)
I think that language used (P2936) should be used on Naples (Q2634) and that language used (P2936) might be a subproperty of (P1647) the newly created property. I don't think the newly created property should be used directly for Naples (Q2634). ChristianKl 15:38, 16 January 2020 (UTC)
Sorry @ChristianKl:, I'm not sure I understood the scheme you propose. Could you explain it a bit more? Likewise, could you please guide me a bit on how we should proceed? I'm new to Wikidata property proposals. Thanks. --Marcmiquel (talk) 12:05, 19 January 2020 (UTC)
Remove from the list of examples all the cases where the use case is already covered by language used (P2936) or located in the administrative territorial entity (P131). ChristianKl 08:34, 20 January 2020 (UTC)
If I remove these other cases covered by these properties, I might also need to remove many other properties. Isn't it possible to have a certain redundancy? --Marcmiquel (talk) 19:19, 28 January 2020 (UTC)
In general, people doing gap analysis with Wikidata tend to get the completeness of Wikidata or the ground truth wrong. I haven't gone through your papers though.
  1. Can you explain the reasoning (ideally with references) that links "Pesto" to Italian language (second sample above)?
  2. Why has Naples (Q2634) Italian (Q652), but not Neapolitan (Q33845), Latin or Ancient Greek (Q35497)?
  3. Why does the "domain" above mention territory, but the samples use other (people, food, etc.)?
  4. Please fill in the "description" field.
  5. I added a few related properties as "see also" above.
If it's merely an idiosyncratic approach, I think one would want to go with on focus list of Wikimedia project (P5008). --- Jura 09:10, 20 January 2020 (UTC)
@Jura1: Presenting the idea here is part of the exploration of Wikidata's potential to bridge the gaps. I'm very open to checking other possibilities. I'll try to answer your questions. 1. The reasoning behind linking Pesto to Italian is that Pesto is a concept related to the Italian language territory (territories where Italian is spoken) since it was created there. I collect Cultural Context Content (CCC), which is usually called local content, and it contains all the concepts related to the territories where the language is spoken. This includes people, places, things, recipes, etc. I suggest you check the papers or the project Wikipedia Cultural Diversity Observatory I posted in this section. 2. Yes, indeed, it could be any of these languages (not Latin, as it is not in current use, just in the Vatican). 3. Just did it. 4. Thanks, they make sense. I appreciate your feedback. --Marcmiquel (talk) 19:13, 28 January 2020 (UTC)
  • (1) For "Pesto", that relationship is already indicated by "country of origin" = "Italy".
(2) I added Neapolitan (Q33845) to the sample. If you exclude Latin, you'd also exclude Ancient Greek I suppose.
(3) "domain" in the proposal would be the class of items that could hold it. Apparently it's any (populated) geographic location, not just "language territory (Q8561610)", + a few other classes of items.
I tend to agree with ChristianKl that if the relationship is already covered by another property (or a combination as shown), there isn't much use of adding another one. It would be good to see use cases that aren't covered.
BTW Another gap analysis done for itwiki: it:Progetto:Coordinamento/Wikidata/Italiani_senza_voce. --- Jura 14:28, 30 January 2020 (UTC)
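
For reference, the combination raised earlier in this discussion (an item located in the administrative territorial entity (P131) of a place that has language used (P2936) set to a given language) could be expressed as a query roughly like the one built below. This is only a minimal sketch: the helper name `build_language_context_query` and the `limit` parameter are made up for illustration, the `wdt:` and `wd:` prefixes are assumed to be those predefined by the Wikidata Query Service, and the query has not been checked against the live endpoint, where the transitive P131 path may be slow or time out.

```python
def build_language_context_query(language_qid: str, limit: int = 100) -> str:
    """Build a SPARQL query string (hypothetical helper, not an existing tool).

    The query selects items that are located, directly or through a chain of
    P131 statements, in a place whose "language used" (P2936) is the given
    language item.
    """
    # Accept only identifiers shaped like Wikidata items, e.g. "Q7026".
    if not (language_qid.startswith("Q") and language_qid[1:].isdigit()):
        raise ValueError(f"not a Wikidata item ID: {language_qid!r}")
    return (
        "SELECT DISTINCT ?item WHERE {\n"
        # P131* follows zero or more "located in" steps, so the place
        # itself is matched as well as everything located inside it.
        "  ?item wdt:P131* ?place .\n"
        f"  ?place wdt:P2936 wd:{language_qid} .\n"
        "}\n"
        f"LIMIT {limit}"
    )


# Example: items reachable from places where Catalan (Q7026) is used.
query = build_language_context_query("Q7026")
print(query)
```

Such a query only reaches items that have a location chain, so it would find Sagrada Família (Q48435) via Catalonia (Q5705) if those statements are present, but not items like pesto (Q9896) or Pep Guardiola (Q164038), which is the kind of uncovered use case being asked about above.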