Help:Data modelling

In Wikidata, the community and WikiProjects have created guidelines as to what types of things should have items and what properties should be on those items. These guidelines are known as data models. Data models make sure data is structured in a predictable way so that it can be reliably queried.

This page describes some guidelines for creating such data models so that data in Wikidata is efficient to access, consistently represented, and easy to understand.

Determining the class instances should use

Should all instances of motor car (Q1420) be instance of (P31) motor car (Q1420) or instance of (P31) motor vehicle (Q752870)? "Well, clearly motor car (Q1420), of course!", you might say. But what if we wanted to find everything that is an instance of (P31) motor vehicle (Q752870)? That query would then take longer, because it would also have to follow the subclass of (P279) statements from motor car (Q1420) up to motor vehicle (Q752870). This dilemma illustrates the first consideration when designing a data model: what do we want to query the most? Choose your classes so that the queries you expect to run most often are the simplest and fastest.

Making every motor car (Q1420) an instance of (P31) motor vehicle (Q752870) has another advantage: the instances of motor vehicle (Q752870) can then be differentiated using properties, and those properties can also describe objects outside the domain of motor vehicles. For example, if we had a "number of wheels" property, it could also be used on items that are not motor vehicles. This would allow us to query for the number of wheels on things that are not necessarily motor vehicles.

As you can see, this is a balancing act. It is essential to decide which class instances should use with community input and agreement, because such a decision will likely affect the structure of thousands of items and would be a large mess to clean up if it ever needed to be changed.
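The trade-off described above can be sketched with a toy in-memory model. This is only an illustration under stated assumptions, not how the Wikidata Query Service works internally: the item names (`car_a`, `car_b`, `truck_c`) and the function names are hypothetical, and only the two class IDs come from this page.

```python
# Toy model: each item maps to the class named in its instance of (P31) statement.
# The item names are hypothetical placeholders, not real Wikidata items.
INSTANCE_OF = {
    "car_a": "Q1420",      # motor car
    "car_b": "Q1420",      # motor car
    "truck_c": "Q752870",  # motor vehicle
}

# subclass of (P279): motor car is a subclass of motor vehicle.
SUBCLASS_OF = {"Q1420": {"Q752870"}}


def direct_instances(target):
    """Items whose P31 value is exactly `target` (no hierarchy walk needed)."""
    return sorted(item for item, cls in INSTANCE_OF.items() if cls == target)


def is_subclass_or_same(cls, target):
    """Walk subclass of (P279) upwards from `cls`, looking for `target`."""
    seen, stack = set(), [cls]
    while stack:
        current = stack.pop()
        if current == target:
            return True
        if current not in seen:
            seen.add(current)
            stack.extend(SUBCLASS_OF.get(current, ()))
    return False


def all_instances(target):
    """Items whose P31 value is `target` or any (transitive) subclass of it."""
    return sorted(
        item for item, cls in INSTANCE_OF.items() if is_subclass_or_same(cls, target)
    )
```

In this sketch, `direct_instances("Q752870")` misses the two cars, while `all_instances("Q752870")` finds them at the cost of walking the class hierarchy for every item. Which of the two queries should be the cheap one is exactly the decision the data model has to make.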

Items should have one instance of (P31) statement

Instances should aim to have one instance of (P31) statement. Why? Take for example 2020 Summer Olympics (Q181278). At this revision it has these instance of (P31) statements:

instance of (P31)
  Summer Olympic Games (Q159821)
  postponed sports event due to the coronavirus pandemic (Q89033277)
  organization (Q43229)

Looking at these, you would agree that the item should definitely have the Summer Olympic Games (Q159821) value. However, postponed sports event due to the coronavirus pandemic (Q89033277) and organization (Q43229) seem extra and unnecessary. And they are! Why? Because the postponement can be expressed using properties, and organization (Q43229) is a statement that conflates the item. You can see the conflation article to learn more about why conflation is bad, but why is expressing something using a class rather than properties bad? Well, further down in the item's statements we see:

The fact that this sports event was delayed because of the coronavirus pandemic is expressed here too!

Because the same fact is expressed in both places, we also run into the issue of data duplication. Data and relationships should not be duplicated in Wikidata, so that an update made in one place does not also have to be made everywhere else the same fact is expressed. For example, if someone adds a reference to one statement, the reference needs to be added to the other statement as well. Most of the time it is not, so avoiding data duplication altogether is the best way to prevent this issue.

So which data structure should document the fact that the Olympic Games were postponed: the postponed sports event due to the coronavirus pandemic (Q89033277) class or the start time (P580) property? Whenever possible, the property should!

The primary advantage of properties is that they usually express traits that are not specific to one type of item, and can thus be used on other types of items as well! This lets us easily query for items that carry the property, regardless of their type.

For example, suppose we wanted to query for all events that had ever been postponed for any reason. If the postponement of 2020 Summer Olympics (Q181278) were recorded only as instance of (P31) postponed sports event due to the coronavirus pandemic (Q89033277), then a SPARQL query that finds all postponed events, including 2020 Summer Olympics (Q181278), would be much harder to write, because it would have to know about every such class!

For this reason, and to avoid data duplication, properties are preferred over classes.
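A minimal sketch of the difference, using plain Python dictionaries rather than real Wikidata data. The key `postponement_cause` is a hypothetical stand-in for whichever property or qualifier records the postponement, and `concert_x`, `festival_y`, and the class labels on them are invented for illustration; only Q181278, Q159821, and Q89033277 come from this page.

```python
# The same facts modelled with a property. "postponement_cause" is a
# hypothetical stand-in, not a real Wikidata property ID.
BY_PROPERTY = {
    "Q181278": {"P31": ["Q159821"], "postponement_cause": ["pandemic"]},
    "concert_x": {"P31": ["concert"], "postponement_cause": ["bad weather"]},
    "festival_y": {"P31": ["festival"]},  # never postponed
}

# The same facts modelled with overly specific classes instead.
BY_CLASS = {
    "Q181278": {"P31": ["Q89033277"]},
    "concert_x": {"P31": ["concert postponed due to bad weather"]},
    "festival_y": {"P31": ["festival"]},
}


def postponed_by_property(items):
    """One generic test works for every type of event."""
    return sorted(
        item_id for item_id, claims in items.items()
        if claims.get("postponement_cause")
    )


def postponed_by_class(items, known_postponed_classes):
    """Only finds events whose 'postponed ...' class the caller already knows."""
    return sorted(
        item_id for item_id, claims in items.items()
        if known_postponed_classes.intersection(claims.get("P31", []))
    )
```

With the property model, `postponed_by_property(BY_PROPERTY)` returns both postponed events. With the class model, `postponed_by_class(BY_CLASS, {"Q89033277"})` returns only the Olympics, because the query has to list every "postponed ..." class in advance.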

This brings us back to why items should have only one instance of (P31) statement. Items that have more than one usually:

  1. Are conflated or
  2. Are instances of classes that could be better described using properties
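As a rough sketch, items that may need review can be found by counting their instance of (P31) values. The data structure and the function name `items_with_multiple_p31` are assumptions made for this example, and `event_x` is a hypothetical item; the three values on Q181278 are the ones listed earlier on this page.

```python
# Each item maps property IDs to a list of values.
ITEMS = {
    "Q181278": {"P31": ["Q159821", "Q89033277", "Q43229"]},  # 2020 Summer Olympics
    "event_x": {"P31": ["Q159821"]},                         # hypothetical item
}


def items_with_multiple_p31(items):
    """Return the items that carry more than one instance of (P31) value."""
    return {
        item_id: claims["P31"]
        for item_id, claims in items.items()
        if len(claims.get("P31", [])) > 1
    }
```

An item returned by this check is not necessarily wrong. It is a candidate to inspect for conflation or for classes that would be better expressed as properties.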

Avoid creating and using metaclasses

Metaclasses are essential in some ontological situations. For regular usage, however, they should be avoided, as they can lead to data duplication.

Every class that is an instance of a metaclass should already state, through subclass of (P279), that it is a subclass of the class its metaclass is a metaclass of. The first-level class hierarchy already exists as a system for this, so you know the superclasses of an item without having to make a metaclass. If we had a metaclass for every class, we would be duplicating every subclass of (P279) statement in Wikidata for no reason. For this reason, it is best not to create metaclasses for regular use.

A situation in which creating a metaclass is warranted is, for example, music genre (Q188451). It is a metaclass that is naturally present and defined in our society, so it makes sense that it should have an item. By contrast, "type of fruit from Indonesia" would not be an appropriate metaclass.

Don't create overly-specific classes

The preceding sections lead us to the next recommendation: don't create overly-specific classes!

From all the reasons listed above (preferring general yet optimized classes, making sure items have only one instance of (P31) statement, and avoiding specific and unnecessary metaclasses), you can probably infer that we should not be creating overly-specific classes!

Drawing from our previous example, "fruit from Indonesia" would be an overly-specific class that could better be described using subclass of (P279) fruit (Q3314483) and country of origin (P495) Indonesia (Q252).
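The following sketch shows how the general class plus a property can answer the same question that a "fruit from Indonesia" class would. The item names (`fruit_a`, `fruit_b`, `dish_c`), the placeholder values `country_z` and `dish`, and the function names are hypothetical; the property and class IDs are the ones named above.

```python
# Hypothetical items described with subclass of (P279) and country of origin (P495).
ITEMS = {
    "fruit_a": {"P279": ["Q3314483"], "P495": ["Q252"]},       # fruit, Indonesia
    "fruit_b": {"P279": ["Q3314483"], "P495": ["country_z"]},  # fruit, elsewhere
    "dish_c": {"P279": ["dish"], "P495": ["Q252"]},            # not a fruit, Indonesia
}


def subclasses_with_origin(items, superclass, country):
    """Items that are a subclass of `superclass` and originate in `country`."""
    return sorted(
        item_id for item_id, claims in items.items()
        if superclass in claims.get("P279", []) and country in claims.get("P495", [])
    )


def anything_with_origin(items, country):
    """The same property also answers questions that ignore the class entirely."""
    return sorted(
        item_id for item_id, claims in items.items()
        if country in claims.get("P495", [])
    )
```

Here `subclasses_with_origin(ITEMS, "Q3314483", "Q252")` plays the role of the overly-specific class, while `anything_with_origin(ITEMS, "Q252")` reuses the same property for items of any type, which a dedicated class could not do.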

Additionally, it should probably be merged in order to prevent its accidental usage. For example, users creating a new item for a "fruit from Indonesia" might see the class offered as a possible option when adding an instance of (P31) statement and pick it instead of fruit (Q3314483). As we saw in the section "Items should have one instance of (P31) statement", this could ruin someone's queries!

However, it should not be merged if it is contextually necessary for use in another item. For example, if a scholarly article were about "type of fruit from Indonesia", then the item would need to exist.

To prevent users from using such a necessary "type of fruit from Indonesia" class in instance of (P31) statements, constraints like a "none-of value-type constraint" have been proposed. Users have also discouraged the creation and use of overly-specific classes by creating properties like form of creative work (P7937), which acts as an extension of instance of (P31): it allows only specific subtypes of literary and musical works to be used on music items, and those subtypes are in turn not allowed in instance of (P31), which restricts and enforces their usage.

Relating similar, but conceptually different entities

Let's say you want to find all the music albums by an artist. There are two different concepts we might have in mind when we think of an album: the album as the overall release, which groups all of its individual releases, and the individual releases in their different formats. To elaborate, when you say, "Did you hear Taylor Swift's new album?", you are referring to the overall release. However, when you say, "Taylor Swift released her album on Spotify and vinyl", you are referring to the individual releases: the one on Spotify and the other one on vinyl.

On Wikidata we create items for both the "overall release" (release group (Q108346082)) and the "individual release" (audio release (Q115669410)).

So how do we relate these two? Well, you might have noticed that release group (Q108346082) is a group (Q16887380). So yes, you should create groups to relate similar concepts.

But why? Couldn't we make the audio release (Q115669410) an instance of (P31) the "overall release"? Well yes, we could, but as you saw in the example, we talk about "overall releases" a lot in our lives, so they are clearly something we would like to query for as well. If we made all "overall releases" subclass of (P279) audio release (Q115669410) and then made "releases" instance of (P31) those "overall releases", we would run into a problem!

How do we distinguish "overall releases" from actual subtypes of "release"?

For example, audio release (Q115669410) has the subtypes album release (Q108352648), extended play release (Q108346556), and single release (Q108352496). They all have subclass of (P279) audio release (Q115669410) too! We could enforce that those three subclasses are the only subtypes of audio release (Q115669410), but what if we had a release that did not fall under one of those classes? Then, when querying for direct subclasses of audio release (Q115669410), we would not be able to distinguish the "overall releases" from its actual subtypes. We could create another subtype and make the "overall release" that type, but then a new subtype might have to be created every time we have an outlier. As you can see, we become "locked" to this particular class level when we use this system.

Because release group (Q108346082) and audio release (Q115669410) are both something we want to query for and they have this type of parent-child relationship that can't be expressed using the class system, the group system comes to the rescue! The group system allows us to track "overall releases" by making them instance of (P31) release group (Q108346082) and any other subclass of release group (Q108346082) we want! And the same applies to audio release (Q115669410)!

So how do we relate these? We use release of (P9831), which is a subproperty of (P1647) part of (P361)!
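A minimal sketch of the group system, assuming a simple dictionary model. The item names (`album_group`, `album_vinyl`, and so on) and the function names are hypothetical; the class and property IDs are the ones used in this section.

```python
# Hypothetical items: two release groups and their individual releases.
ITEMS = {
    "album_group": {"P31": ["Q108346082"]},                               # release group
    "album_vinyl": {"P31": ["Q108352648"], "P9831": ["album_group"]},     # album release
    "album_streaming": {"P31": ["Q108352648"], "P9831": ["album_group"]}, # album release
    "single_group": {"P31": ["Q108346082"]},                              # release group
    "single_cd": {"P31": ["Q108352496"], "P9831": ["single_group"]},      # single release
}


def release_groups(items):
    """The "overall releases": every item that is instance of release group."""
    return sorted(
        item_id for item_id, claims in items.items()
        if "Q108346082" in claims.get("P31", [])
    )


def releases_of(items, group_id):
    """The individual releases linked to `group_id` through release of (P9831)."""
    return sorted(
        item_id for item_id, claims in items.items()
        if group_id in claims.get("P9831", [])
    )
```

Both kinds of entity stay directly queryable in this model: `release_groups(ITEMS)` lists the overall releases without touching the subclass tree of audio release (Q115669410), and `releases_of(ITEMS, "album_group")` lists the individual releases of one album.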