Wikidata:WikiProject Datasets

Aim and Scope

This project aims to coordinate how datasets that serve as sources for data ingestion are described in Wikidata.

More specifically, the project coordinates the following activities:

the identification of relevant vocabularies for the description of datasets (such as DCAT or schema.org);
the implementation of such vocabularies on Wikidata;
the definition of additional data fields needed for the preparation, execution and monitoring of data ingestion processes on Wikidata;
the ingestion of metadata about relevant datasets into Wikidata;
the enhancement and completion of such data.
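
As an illustration of what a vocabulary-based dataset description might look like, the following minimal sketch builds a schema.org `Dataset` record as JSON-LD. The helper name `describe_dataset` and all example values (name, URLs) are hypothetical; only the schema.org terms (`Dataset`, `DataDownload`, `name`, `description`, `url`, `license`, `distribution`, `contentUrl`, `encodingFormat`) come from the vocabulary itself. How such fields would map to Wikidata properties is not defined here.

```python
import json


def describe_dataset(name, description, url, license_url, download_url, media_type):
    """Build a schema.org Dataset description as a JSON-LD dictionary.

    Hypothetical helper for illustration; not part of any Wikidata tooling.
    """
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "license": license_url,
        "distribution": [
            {
                "@type": "DataDownload",
                "contentUrl": download_url,
                "encodingFormat": media_type,
            }
        ],
    }


# All values below are placeholders, not a real dataset.
record = describe_dataset(
    name="Example survey results",
    description="Placeholder description of an example dataset.",
    url="https://example.org/datasets/survey",
    license_url="https://creativecommons.org/publicdomain/zero/1.0/",
    download_url="https://example.org/datasets/survey.csv",
    media_type="text/csv",
)

print(json.dumps(record, indent=2))
```

A record of this shape could serve as an intermediate format when collecting dataset metadata before it is entered into Wikidata.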

Items

Generic

data set (Q1172284)
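
To see which items are already described as datasets, one option is to query the Wikidata Query Service for instances of data set (Q1172284). The sketch below only builds the SPARQL query and the request URL and sends nothing over the network. The function names are hypothetical, and it assumes that dataset items are linked to Q1172284 directly via "instance of" (P31); items typed with a subclass would need a broader pattern.

```python
from urllib.parse import urlencode

# Public SPARQL endpoint of the Wikidata Query Service.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def build_dataset_query(limit=10):
    """Return a SPARQL query listing items that are instances of data set (Q1172284)."""
    return (
        "SELECT ?item ?itemLabel WHERE {\n"
        "  ?item wdt:P31 wd:Q1172284 .\n"
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }\n'
        "}\n"
        f"LIMIT {int(limit)}"
    )


def build_request_url(query):
    """Encode the query into a GET URL that asks for JSON results."""
    return WDQS_ENDPOINT + "?" + urlencode({"query": query, "format": "json"})


query = build_dataset_query(limit=5)
url = build_request_url(query)

print(query)
print(url)
```

The printed URL can be opened in a browser or passed to any HTTP client. Replacing `wdt:P31` with `wdt:P31/wdt:P279*` would also match instances of subclasses, at the cost of a slower query.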

Specific

Properties

Ways to find datasets

(1) datasets already easily accessible via the programming languages and libraries that are used in teaching, e.g. the datasets package in R or the example datasets in Python modules like sklearn, seaborn or pydataset

(2) datasets deposited in machine-learning-focused repositories like the UCI Machine Learning Repository or Kaggle

(3) datasets described in Wikipedia pages like https://en.wikipedia.org/wiki/List_of_datasets_for_machine-learning_research or their Wikidata equivalents

(4) any of the datasets originating from the Wikimedia ecosystem, e.g. full or incremental data dumps, recent changes streams or pageview data

(5) datasets deposited in disciplinary repositories, e.g. those that can be found via the cross-disciplinary search in Re3data

(6) datasets in cross-disciplinary repositories like Zenodo, figshare or Open Science Framework

(7) datasets in UVA-hosted databases or repositories like LibraData or Collective Biographies of Women

(8) datasets available via Google Dataset Search or similar

(9) datasets accessible through dataset managers like Quilt

(10) datasets associated with specific publications (e.g. textbooks, technical reports or research papers)

(11) subsets of any of the above that are in the public domain or available under an open license

Starting at the teaching end, we could review

(a) what datasets are already being used in university courses (e.g. via Open Syllabus)

(b) the aspects (or lenses, to use your wording) of data science for which they are currently used in teaching (e.g., clustering, classification, regression, prediction, ethics)

(c) whether they could be used in other courses and/or to teach other aspects of a given course topic

(d) whether they are public and openly licensed

(e) aspects / lenses currently taught at a given institution with toy (or no) datasets where real datasets would be beneficial

(f) datasets used for teaching data science elsewhere, e.g. in The Carpentries lessons or in Berkeley's Data 8 course

Ways to Contribute

There are several ways you can contribute to the project:

Help identify and implement relevant vocabularies for the description of datasets.
Create Wikidata items for datasets that have been or are to be ingested into Wikidata.
Add links to related projects on Wikipedia; draw their contributors' attention to this project page.