Help:About data

Wikidata is a free knowledge base that can be read and edited by both humans and machines. It is one of many wiki-based projects supported and maintained by the Wikimedia Foundation, a non-profit, free-content organization perhaps best known for Wikipedia. Each Wikimedia Foundation project has its own focus: for example, Wikipedia is for encyclopedic content, Wikimedia Commons supports images and other media files, and Wiktionary provides lexical information about words, such as definitions and synonyms. Wikidata's focus is structured data.

This page is intended as a general overview of structured data. If you are already familiar with structured data but would like to learn more about its specific use in Wikidata, how to use and access the data in Wikidata, or how to contribute your own project's data to Wikidata, please proceed to the section on linking data.

Understanding Wikidata

Structured data refers to data that has been organized and stored in a defined way, often with the intention of encoding meaning and preserving the relationships between different data points within a dataset.

But what is data, anyway? And why should you care about structured data in particular?

Defining data

Big data, experimental data, open data, metadata: you may have read or heard some or even all of these terms before.

Each term means something slightly different, but all of them build on a common understanding of data and its potential for describing and improving our understanding of the world around us.

As a basic concept, data can be thought of as a precursor to information, meaning that information can be inferred or derived from data.

This is because data, when boiled down to its essence, is simply a set of values about things. These values can be numeric, such as a measurement or a quantity. They can also be qualitative, such as a description or a comparison. For example, we can say that "8,848 m (29,029 ft)" is a data value about the height of Mount Everest and "red" is a data value about the colour of a car.

As previously mentioned, information is not the same as data but is rather a product of the collection and analysis of data. For example, "8,848" (data) is a fairly meaningless number on its own, even if we know that it is the height of a mountain; we can only say that "Mount Everest is the highest mountain in the world at 8,848 m" (information) if we are aware of the common units of height and know the heights of other mountains. Making such inferences, gaining new insights and knowledge, and establishing facts all become easier when data is structured; we will return to this idea later.
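The step from data to information can be sketched in a few lines of Python. This is only an illustration: the heights in metres are the ones quoted on this page, while the dictionary layout and the function name `highest` are assumptions made for the example.

```python
# Heights in metres, as quoted on this page. A bare number such as 8848
# only becomes informative once it can be compared with other values.
heights_m = {
    "Mount Everest": 8848,
    "K2": 8611,
    "Kanchenjunga": 8586,
}


def highest(heights):
    """Return the (name, height) pair with the largest height."""
    return max(heights.items(), key=lambda pair: pair[1])


name, height = highest(heights_m)
statement = f"{name} is the highest of these mountains at {height:,} m"
print(statement)
```

The comparison is only possible because every value uses the same unit and is stored under a known name, which is the point the rest of this page develops.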

Where is data?

Data is all around us. There are many kinds of data sources, including financial, biological, and social data. Even this page has data! For example, it has a total word count, the dates on which it was created and last modified, a subject and topic, a number of page views, and other languages in which its content is available.

However, while everything is a potential source of data, data that has not been recorded and organized might as well not exist. Without an underlying structure, data appears meaningless and fails to provide useful information.

By organized, we mean categorized in a standard and unambiguous way. Organized and categorized data is what we mean when we say structured data.

 
[Figure: Wikidata features form-based input for adding data to items]

Where is structure?

On the web, structure reigns. Most websites are created using HTML, a markup language which provides the basic scaffolding, or structure, of a web page.

Markup languages are also used for tagging and describing page content so that search engines, bots, and applications like RSS feeds can easily process and "understand" it. For example, <title> tags tell machines what the name of a website is.
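As a rough illustration of how a program can read such markup, the following sketch uses Python's standard `html.parser` module to pull the text out of a `<title>` element. The class name `TitleParser`, the helper `extract_title`, and the sample page are hypothetical and exist only for this example.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text found inside the <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Only keep text that appears between <title> and </title>.
        if self._in_title:
            self.title += data


def extract_title(html):
    """Return the stripped title text of an HTML document."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title.strip()


# Hypothetical page used only to exercise the parser.
sample_page = (
    "<html><head><title>Mount Everest</title></head>"
    "<body><p>Some text</p></body></html>"
)
print(extract_title(sample_page))
```

The program never has to guess which words are the page name, because the tag states it explicitly.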

Instead of supporting the structure and common elements of a web page, Wikidata provides structure for all the information stored in Wikipedia and on the other Wikimedia projects. Like any other Wikimedia project, Wikidata is based on the MediaWiki software, extended by Wikibase, the software which powers Wikidata and is designed to manage large amounts of structured data. Structure is not added directly to the content of Wikipedia or other Wikimedia site pages, as it would be in tables or lists, nor do Wikidata users need any knowledge of markup languages, data schemas, object notation, or other special syntax; instead, data is added to and edited in Wikidata through user-friendly input forms.

All data stored on Wikidata can be used to generate all kinds of automated, up-to-date lists, tables, or other structured pages on any Wikimedia site or elsewhere.

Table 1: Data for Mountains

Mountain        Property   Value
Mount Everest   height     8,848 m
K2              hauteur    8,611 m
Kanchenjunga    height     8,586 m
Lhotse          height     27940 ft

Structuring data

As an example of the importance of structure, let's look at Table 1. This table shows data for the four highest mountains on Earth. If we want a particular piece of information, such as the height of the second highest mountain in the world, we should be able to find the correct value in the data provided. However, only three of the four mountains have their data categorized as a height value, and only two of those three have values in metres. While we know that height and hauteur (French for height) mean the same thing, and how to convert between metres and feet, a machine, such as a bot or a computer program, may not.

It would be much easier for both humans and machines to process the information and answer the original question about the second highest mountain when all underlying data is recorded in a similar way even if the presentation differs.
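To make the problem concrete, here is a small sketch of what a simple program might do with Table 1 as recorded. The tuple layout and the function name `naive_ranking` are assumptions for this example; the rows themselves are copied from the table.

```python
# Rows copied from Table 1 exactly as recorded: (mountain, property, value).
table_1 = [
    ("Mount Everest", "height", "8,848 m"),
    ("K2", "hauteur", "8,611 m"),
    ("Kanchenjunga", "height", "8,586 m"),
    ("Lhotse", "height", "27940 ft"),
]


def naive_ranking(rows):
    """Rank mountains by the bare number in 'height' rows, ignoring units."""
    ranked = []
    for mountain, prop, value in rows:
        if prop != "height":
            # A row labelled "hauteur" is silently skipped.
            continue
        number = float(value.split()[0].replace(",", ""))
        ranked.append((number, mountain))
    return [mountain for _, mountain in sorted(ranked, reverse=True)]


ranking = naive_ranking(table_1)
print(ranking)
```

Under these assumptions the program drops K2 entirely and ranks Lhotse first, because it compares 27940 (feet) directly with values in metres. It would therefore name Mount Everest as the second highest mountain, which is wrong.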

Modeling data

Collections of structured data, like Wikidata, are organized according to a data model. Data models are machine-readable, meaning they can be understood by a computer. While computers are powerful, they are often not as smart as we are when it comes to simple reasoning. For instance, in the example above, a machine would not know that height and hauteur are the same unless it were explicitly told so.

Table 2: Data for Mountains

Mountain        Property    Value
Mount Everest   continent   Asia
K2              continent   Asia
Kanchenjunga    continent   Asia
Lhotse          continent   Asia

Data models vary based on the analysis needs, scope and conceptual framework of the dataset, and the technical requirements of a system. However, all data models typically will specify what kind of data can be supported by a system and what relationships between values can be understood and represented. For example, a data model could specify that height and hauteur be mapped to each other so that both terms represent one concept, or that measurements in feet be automatically converted into metres. The Wikidata data model shapes the way that data can be edited and added to the system by users. It is also a work in progress, with new data types being added to the model over time.
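The two rules mentioned above, mapping hauteur to height and converting feet to metres, can be sketched as follows. This is not how Wikidata or Wikibase implements its data model; `PROPERTY_ALIASES`, `to_metres`, and `normalise` are hypothetical names, and the rows are those of Table 1.

```python
FEET_TO_METRES = 0.3048  # exact by definition of the international foot

# Hypothetical mapping of property labels onto one canonical concept.
PROPERTY_ALIASES = {"height": "height", "hauteur": "height"}

# Rows copied from Table 1: (mountain, property, value).
table_1 = [
    ("Mount Everest", "height", "8,848 m"),
    ("K2", "hauteur", "8,611 m"),
    ("Kanchenjunga", "height", "8,586 m"),
    ("Lhotse", "height", "27940 ft"),
]


def to_metres(value):
    """Parse a string such as '8,848 m' or '27940 ft' into metres."""
    number_text, unit = value.split()
    number = float(number_text.replace(",", ""))
    if unit == "m":
        return number
    if unit == "ft":
        return number * FEET_TO_METRES
    raise ValueError(f"unsupported unit: {unit}")


def normalise(rows):
    """Map every row to (mountain, canonical property, value in metres)."""
    return [
        (mountain, PROPERTY_ALIASES[prop], to_metres(value))
        for mountain, prop, value in rows
    ]


normalised = normalise(table_1)
by_height = sorted(normalised, key=lambda row: row[2], reverse=True)
second_highest = by_height[1][0]
print(second_highest)
```

Once every row uses the same property and the same unit, the question about the second highest mountain can be answered by a plain sort.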

The data model also essentially translates human natural language patterns into something that can be processed by machines. For example, in English we might say:

"Mount Everest is the highest mountain in the world"

This is also the raw, unstructured format of content currently on Wikipedia and all other Wikimedia sites.

On Wikidata, this would be represented by a statement, which consists of a property-value pair about an item, in this case Earth:

Earth (Q2) (item), highest point (P610) (property), Mount Everest (Q513) (value)

Additionally, Wikidata would also hold a statement about the item for Mount Everest (indicating it is a mountain):

Mount Everest (Q513) (item), instance of (P31) (property), mountain (Q8502) (value)
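One way to picture these two statements in code is as records with three fields. The sketch below is an assumption made for illustration, not Wikidata's internal representation; the `Statement` type, the `LABELS` lookup, and `describe` are hypothetical, while the identifiers are the ones shown above.

```python
from typing import NamedTuple


class Statement(NamedTuple):
    """A property-value pair about an item, all given as identifiers."""
    item: str
    prop: str
    value: str


# English labels for the identifiers used on this page.
LABELS = {
    "Q2": "Earth",
    "P610": "highest point",
    "Q513": "Mount Everest",
    "P31": "instance of",
    "Q8502": "mountain",
}

statements = [
    Statement("Q2", "P610", "Q513"),
    Statement("Q513", "P31", "Q8502"),
]


def describe(statement):
    """Render a statement with labels and identifiers side by side."""
    return " ".join(f"{LABELS[part]} ({part})" for part in statement)


for statement in statements:
    print(describe(statement))
```

Storing identifiers rather than words means the same statement can be shown with labels in any language.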

Note that because other items can be used as the values for statements, and all items have their own unique page on Wikidata, this means that all items in the system can be linked together through a series of statements. Because Wikidata uses a machine-readable format, this interlinking of data allows new relationships and connections to be discovered and processed by machines. For example, in Table 2 we see new data for our mountains, this time about their geographical location by continent but nothing about their heights. Assuming this continent data was linked to the mountain height data, we would feel more confident making predictions or drawing certain conclusions about it, like saying that Asia is home to the world's highest mountains.
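The idea of linking items through a series of statements can be sketched with a small in-memory list. The statements below are taken from this page (the two examples above and Table 2); the helper names `values` and `follow` are hypothetical, and labels are used instead of identifiers to keep the example short.

```python
# (item, property, value) statements taken from this page.
statements = [
    ("Earth", "highest point", "Mount Everest"),
    ("Mount Everest", "instance of", "mountain"),
    ("Mount Everest", "continent", "Asia"),
    ("K2", "continent", "Asia"),
    ("Kanchenjunga", "continent", "Asia"),
    ("Lhotse", "continent", "Asia"),
]


def values(item, prop):
    """Return every value recorded for the given item and property."""
    return [v for i, p, v in statements if i == item and p == prop]


def follow(item, *props):
    """Follow a chain of properties, using each value as the next item."""
    current = [item]
    for prop in props:
        current = [v for c in current for v in values(c, prop)]
    return current


# Neither statement alone says on which continent Earth's highest point
# lies, but chaining two of them does.
print(follow("Earth", "highest point", "continent"))
```

Because the value of one statement is itself an item with its own statements, a program can reach facts that no single statement records.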

Linking data

Besides being a collection of structured data, Wikidata also supports linked data. Linked data refers to the practice of publishing structured data so that it can be interlinked.

For Wikidata this means that volunteer-contributed data can also be linked to other datasets, databases, and data sources from all around the web and from diverse initiatives outside of the Wikimedia family. For example, Wikidata currently allows interlinking with datasets and databases as diverse as Google Books, Canmore (one of the Historic Environment Scotland databases), the Vatican Library, OmegaWiki, and MusicBrainz.

 
[Figure: Example of a simple statement consisting of one property-value pair]

[Figure: Example of a more complicated statement consisting of one property-value pair, qualifiers, and a reference]

By following linked data principles and practices, Wikidata is also able to support and be used by other projects.

Linked data principles

Wikidata uses unique identifiers, or uniform resource identifiers (URIs), for all its items as per linked data standards.

While Wikidata uses a unique data model, its content can be exported in RDF, a widely used and standard format for linked data. In Wikidata terms, a statement is composed of an item and a property-value pair. For those familiar with linked data concepts, an item can be viewed as the subject of a triple, the property represents the triple's predicate, and the value expresses the triple's object.
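As a rough sketch of this mapping, the statement that Mount Everest (Q513) is an instance of (P31) mountain (Q8502) can be written as a single line in the N-Triples syntax. The two namespace strings below are, to the best of our understanding, the ones the Wikidata RDF export uses for entities and for simple "direct" claims; treat them as an assumption and check the RDF export documentation before relying on them. The function name `to_ntriple` is hypothetical, and the sketch only covers statements whose value is another item.

```python
# Assumed namespaces of the Wikidata RDF export (see the lead-in above).
ENTITY_NS = "http://www.wikidata.org/entity/"
DIRECT_PROPERTY_NS = "http://www.wikidata.org/prop/direct/"


def to_ntriple(item_id, property_id, value_id):
    """Serialize an item-valued statement as one N-Triples line."""
    subject = f"<{ENTITY_NS}{item_id}>"
    predicate = f"<{DIRECT_PROPERTY_NS}{property_id}>"
    obj = f"<{ENTITY_NS}{value_id}>"
    return f"{subject} {predicate} {obj} ."


print(to_ntriple("Q513", "P31", "Q8502"))
```

Each of the three parts is a URI, so the subject, predicate, and object can all be looked up on the web.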

However, Wikidata statements may also contain elements beyond the subject-predicate-object, such as references and qualifiers (for more information, see Help:Statements). This makes it complicated to fully represent Wikidata's content using the language of RDF—more information on these challenges can be found in the document "Introducing Wikidata to the Linked Data Web".

Contributing data

If you have datasets you would like to contribute to Wikidata, please see Wikidata:Data donation.

Accessing data

The data in Wikidata is published under the Creative Commons CC0 1.0 Universal Public Domain Dedication, allowing free reuse of the data. You can copy, modify, distribute and perform the data, even for commercial purposes, all without asking permission.

See Data access for details about the different ways to programmatically access Wikidata's data.
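As one hedged illustration of programmatic access, the sketch below builds the address of the JSON document for an item through the Special:EntityData page and reads the English label out of a response. The helper names, the `User-Agent` string, and the trimmed `sample` document are assumptions for this example; the Data access page remains the reference for the supported methods and their usage policies.

```python
import json
from urllib.request import Request, urlopen


def entity_data_url(entity_id):
    """Build the Special:EntityData address for one entity in JSON form."""
    return f"https://www.wikidata.org/wiki/Special:EntityData/{entity_id}.json"


def fetch_entity(entity_id):
    """Download and parse the JSON document for one entity (needs network)."""
    request = Request(
        entity_data_url(entity_id),
        headers={"User-Agent": "about-data-example/0.1"},  # hypothetical name
    )
    with urlopen(request, timeout=30) as response:
        return json.load(response)


def english_label(document, entity_id):
    """Read the English label of an entity from a parsed JSON document."""
    return document["entities"][entity_id]["labels"]["en"]["value"]


# Trimmed stand-in for a real response, so the parsing step works offline.
sample = {
    "entities": {
        "Q513": {"labels": {"en": {"language": "en", "value": "Mount Everest"}}}
    }
}

print(entity_data_url("Q513"))
print(english_label(sample, "Q513"))
```

With network access, `english_label(fetch_entity("Q513"), "Q513")` would apply the same parsing step to the live document.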

See also

For additional information and guidance, see:

  • Project chat, for discussing any and all aspects of Wikidata
  • Wikidata:Glossary, the glossary of terms used in this and other Help pages
  • Help:FAQ, frequently asked questions answered by the Wikidata community
  • Help:Contents, the Help portal featuring all the documentation available for Wikidata