This page details how datasets are described on Wikidata.

Model datasets

Wikidata requires examples for developing data models. In some WikiProjects, the selected examples are more visible to public examination than in others. Here at WikiProject Datasets, the models we choose are very likely to be discussed by students who use datasets and by anyone discussing Wikidata itself. Because of this extra visibility, be mindful when choosing examples to present as models. Some popular datasets are encumbered with usage restrictions, serve as branding for corporations whose values conflict with those of the wiki community, or lack the quality and relevance to merit presenting to large numbers of learners.

Ideally, the datasets which the Wikidata community models here are:

  1. free and open
  2. large and diverse enough for use in student exercises including machine learning, while also usable in subsets for smaller projects
  3. the subject of journalism or academic publication in the humanities which considers the data for social significance
  4. relevant or meaningful across language and cultural barriers
  5. free of close ties to any corporate brand
  6. beneficial to the public
  7. easy to understand and appreciate even among people who know nothing of datasets

In addition to the datasets themselves, we also need models here for dataset creators, tools which use the datasets, and other related concepts. Not every example will be perfect, but please aim for the best. Keep Wikidata a community-managed project which promotes ethics and values for public benefit.

Proposed model datasets

Please edit this section freely and discuss the datasets on the talk page. Perhaps we could seek to identify about 10 datasets to model. The following datasets may meet the above criteria.

  1. Wikidata?
  2. OpenStreetMap
  3. Something from Internet Archive?
  4. Sloan Digital Sky Survey (Q840332)
  5. Laser Interferometer Gravitational Wave Observatory (Q255371)
  6. Protein Data Bank (Q766195)
  7. something from National Institute of Standards and Technology (Q176691), perhaps from their AI project
  8. something from the World Bank?
    1. https://data.worldbank.org/
  9. CKAN
  10. Human Genome Project
  11. Global Biodiversity Information Facility (Q1531570)
  12. The General Index (Q108864972)
  13. ?? your suggestion

Unacceptable or not preferred

  1. Internet Movie Database (Q37312), because its terms allow only personal, noncommercial use
  2. CiteSeerX (Q2715061), because of its noncommercial license
  3. AGROVOC (Q292649), a United Nations multilingual agriculture vocabulary, because its CC-BY license is incompatible with Wikidata's CC0 licensing
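
Since licensing decides several of the cases above, it can help to look up the copyright license (P275) statements of candidate items before proposing them. The following is a minimal sketch that only builds a SPARQL query string for the Wikidata Query Service; the function name `build_license_query` is hypothetical, and the Q-ids in the usage lines are the ones listed on this page. It does not send the query, and an item without a P275 statement simply comes back with an empty license column.

```python
import re

# A Wikidata item ID is "Q" followed by a positive integer.
QID_PATTERN = re.compile(r"Q[1-9][0-9]*")


def build_license_query(qids):
    """Build a SPARQL query listing the copyright license (P275) of each item.

    Hypothetical helper: it returns the query text only and performs no
    network request.
    """
    for qid in qids:
        if not QID_PATTERN.fullmatch(qid):
            raise ValueError(f"not a Wikidata Q-id: {qid!r}")
    values = " ".join(f"wd:{qid}" for qid in qids)
    return (
        "SELECT ?item ?itemLabel ?licenseLabel WHERE {\n"
        f"  VALUES ?item {{ {values} }}\n"
        # OPTIONAL keeps items that have no license statement at all.
        "  OPTIONAL { ?item wdt:P275 ?license . }\n"
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }\n'
        "}"
    )


if __name__ == "__main__":
    # Sloan Digital Sky Survey and Protein Data Bank, as listed on this page.
    print(build_license_query(["Q840332", "Q766195"]))
```

The printed query can be pasted into the Wikidata Query Service. A missing license statement is not evidence that a dataset is free, so the result is a starting point for discussion rather than a verdict.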

Properties

Dataset

See also E189, a ShEx schema for clinical trials.

Title | ID | Data type | Description | Examples | Inverse
country | P17 | Item | country: sovereign state that this item is in (not to be used for human beings) | Wusung Radio Tower <country> People's Republic of China | -
instance of | P31 | Item | instance of: that class of which this subject is a particular example and member; different from P279 (subclass of); for example: K2 is an instance of mountain; volcano is a subclass of mountain (and an instance of volcanic landform) | PARAMOUNT trial <instance of> clinical trial | -
copyright license | P275 | Item | license: license under which this copyrighted work is released | Inkscape <copyright license> GNU General Public License, version 2.0 | -
DOI | P356 | External identifier | serial code used to uniquely identify digital objects like academic papers (use upper case letters only) | Ecological guild evolution and the discovery of the world's smallest vertebrate <DOI> 10.1371/JOURNAL.PONE.0029797 | -
language of work or name | P407 | Item | language: language associated with this creative work (such as books, shows, songs, broadcasts or websites) or a name (for persons use "native language" (P103) and "languages spoken, written or signed" (P1412)) | Autobiografia di Alice Toklas (translation) <language of work or name> Italian | -
inception | P571 | Point in time | date of establishment: time when an entity begins to exist; for date of official opening use P1619 | Society of Jesus <inception> | -
start time | P580 | Point in time | start time: time an entity begins to exist or a statement starts being valid | Japan <member of> League of Nations <start time> 10 January 1920 | -
end time | P582 | Point in time | end time: moment when an entity ceases to exist or a statement stops being valid | Japan <member of> League of Nations <end time> 27 March 1933 | -
sponsor | P859 | Item | sponsor, Tabula gratulatoria, research sponsor and patron of the arts: organization or individual that sponsors this item | Kilian Jornet <sponsor> Salomon | -
main subject | P921 | Item | topic, matter and subject: primary topic of a work (see also P180: depicts) | Marina <main subject> Rocco Granata | statement is subject of
applies to jurisdiction | P1001 | Item | jurisdiction: the item (institution, law, public office, public register...) or statement belongs to or has power over or applies to the value (a territorial jurisdiction: a country, state, municipality, ...) | European Central Bank <applies to jurisdiction> Eurozone | -
number of participants | P1132 | Quantity | number of participants: number of participants of an event, e.g. people or groups of people that take part in the event (NO units) | 2008 Summer Olympics <number of participants> 32 | -
title | P1476 | Monolingual text | original title and title: published name of a work, such as a newspaper article, a literary work, piece of music, a website, or a performance work | Nature <title> Nature | -
short name | P1813 | Monolingual text | abbreviation: short name of a place, organisation, person, journal, wikidata property, etc. Used by some Wikipedia templates. | Orange County <short name> Orange | -
file format | P2701 | Item | file format: file format, compression type, or ontology used in a file | Art & Architecture Thesaurus LOD Dataset <file format> ZIP and N-Triples | -
funder | P8324 | Item | funder: entity that gives money to a person, organization, or project for a specific purpose | Comparing the Benefits and Harms of Three Types of Weight Loss Surgery -- The PCORnet® Bariatric Study <funder> Patient-Centered Outcomes Research Institute | -
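
To make the table easier to apply, here is a minimal sketch of how a subset of it could be held in code and used to sanity-check a draft dataset description before editing Wikidata. The property IDs, labels and data types are copied from the table; the `Statement` class and `check_statements` function are hypothetical names, and the two checks (Item values must look like Q-ids, DOIs must be upper case) are simplifying assumptions, not Wikidata's own constraint system.

```python
from dataclasses import dataclass

# Property ID -> (label, data type), copied from the table above (subset).
DATASET_PROPERTIES = {
    "P17": ("country", "Item"),
    "P31": ("instance of", "Item"),
    "P275": ("copyright license", "Item"),
    "P356": ("DOI", "External identifier"),
    "P407": ("language of work or name", "Item"),
    "P571": ("inception", "Point in time"),
    "P1476": ("title", "Monolingual text"),
    "P2701": ("file format", "Item"),
    "P8324": ("funder", "Item"),
}


@dataclass(frozen=True)
class Statement:
    """One draft claim: a property ID and its value as plain text."""
    property_id: str
    value: str


def check_statements(statements):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    for s in statements:
        if s.property_id not in DATASET_PROPERTIES:
            problems.append(f"{s.property_id}: not in the dataset property table")
            continue
        label, data_type = DATASET_PROPERTIES[s.property_id]
        # Item-typed properties take another Wikidata item, written as a Q-id.
        looks_like_qid = s.value.startswith("Q") and s.value[1:].isdigit()
        if data_type == "Item" and not looks_like_qid:
            problems.append(f"{s.property_id} ({label}): expected a Q-id, got {s.value!r}")
        # The table asks for upper case letters only in DOIs.
        if s.property_id == "P356" and s.value != s.value.upper():
            problems.append("P356 (DOI): use upper case letters only")
    return problems


if __name__ == "__main__":
    draft = [
        Statement("P356", "10.1371/journal.pone.0029797"),  # lower case DOI
        Statement("P2701", "ZIP"),                           # label instead of a Q-id
    ]
    for problem in check_statements(draft):
        print(problem)
```

A check like this only catches formatting slips; whether a property is appropriate for a given dataset still needs discussion on the talk page.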

Dataset creator

  • organization
  • individuals
  • ambiguous cases - how to model Wikidata's creators, for example?

Publication featuring a dataset

  • Intended for modeling academic articles which feature a dataset
  • possibly also for books, journalism, or other popular media featuring a dataset

Tool using a dataset

  • Intended for modeling digital tools which rely on a dataset
  • Examples: Wikipedia uses Wikidata; weather and mapping services use public data shared by governments

Set profiled in a dataset

  • intended for either special Wikidata items for the profiled collection, or perhaps just the general concept Wikidata item
  • Titanic passengers --> Titanic passengers dataset

Other issues

Modeling Wikipedia infoboxes

In addition to modeling datasets in Wikidata, we eventually need to model datasets for infoboxes across Wikipedias. As of 2021, the French-language Wikipedia is one of a small but growing number of Wikipedias that link Wikidata to Wikipedia for infoboxes. Here is a set of examples from French Wikipedia for modeling datasets in infoboxes.

Modeling with intent to import to Wikidata

Modeling for FAIR compatibility

The following page documents a possible mapping between DCAT or schema.org and Wikidata:

DCAT - Wikidata - Schema.org mapping
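
As a rough illustration of what such a mapping could look like in practice, the sketch below translates a few Wikidata claims into DCAT or schema.org terms. The four pairings are assumptions chosen for illustration and are not taken from the linked mapping page; `ILLUSTRATIVE_MAPPING` and `translate` are hypothetical names. The vocabulary terms themselves (`dct:title`, `dct:license`, `dct:language`, `dct:format`, and schema.org `name`, `license`, `inLanguage`, `encodingFormat`) are existing terms in those vocabularies.

```python
# Assumed, illustrative pairings only; consult the mapping page for the
# community's actual choices.
ILLUSTRATIVE_MAPPING = {
    "P1476": {"dcat": "dct:title", "schema_org": "name"},
    "P275": {"dcat": "dct:license", "schema_org": "license"},
    "P407": {"dcat": "dct:language", "schema_org": "inLanguage"},
    "P2701": {"dcat": "dct:format", "schema_org": "encodingFormat"},
}


def translate(claims, vocabulary):
    """Rename Wikidata property IDs to terms of the chosen vocabulary.

    `claims` maps property IDs to values; `vocabulary` is "dcat" or
    "schema_org". Properties without an entry in the mapping are dropped.
    """
    if vocabulary not in ("dcat", "schema_org"):
        raise ValueError(f"unknown vocabulary: {vocabulary!r}")
    translated = {}
    for property_id, value in claims.items():
        entry = ILLUSTRATIVE_MAPPING.get(property_id)
        if entry is not None:
            translated[entry[vocabulary]] = value
    return translated


if __name__ == "__main__":
    # "Nature <title> Nature" is the title example from the property table.
    print(translate({"P1476": "Nature", "P17": "unmapped"}, "schema_org"))
```

Dropping unmapped properties silently keeps the sketch short; a real converter would more likely report them so that gaps in the mapping become visible.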