Wikidata:Grouping statements

This page on grouping of statements is an attempt to describe some of the problems associated with ordering statements into sensible groups. It is a writeup of some observations from a subproject and may not reflect consensus on this project. (You have been warned!)

There are other attempts at creating groups; see for example Wikidata:Requests for comment/Organizing statements, sitelinks, and external identifiers and mw:Requests for comment/Statement group ordering. Those are about static ordering and are more or less broken by design.

Basically there are two fundamentally different ways to group statements: they can be grouped statically once and for all, or they can be grouped dynamically each time they are needed. The former is the simplest to implement but will often give a strange result, especially if a group has few or no members at all. That often makes it preferable to have a dynamic grouping that can adapt to the number of available members.

There are also different purposes for the grouping: statements going into the text, into the infoboxes, and into "AboutTopic" fragments. Grouping for the purpose of generating natural language is a bit premature now, but this must be the ultimate goal if we want to stop creating unmaintained stubs with bots.

So there are at least six different types of groupings and ways of creating them (two ways of grouping for each of the three purposes). In general we don't want to create a system where the whole or large parts of the system must be updated at once; we want incremental updates. That too puts some constraints on the available methods.

Basic cluster analysis

Assume we only want to manage something associated with the properties themselves. In this case we can add a property describing the property itself. This meta property is used to list the other properties that can go in the same group as this one. We then have a meta association with those other properties.

When we use such a property in a statement for an item we can find the meta associations, and if the mentioned properties exist on the item, then we can start building a cluster for group G with those statements. If the statements only use properties that name other properties from the same group, then nothing much of interest happens. Things get interesting when there are properties from other groups, or when properties name conflicting groups.
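
As a minimal sketch of this lookup, assume the meta associations are available as a plain mapping from a property to the named groups it points at. The mapping `PROPERTY_GROUPS`, the function `candidate_clusters`, and the entries "burial place" and "place data" are hypothetical and only serve to show a property that names two groups.

```python
from collections import defaultdict

# Hypothetical meta associations: each property names the groups it can join.
# "burial place" and "place data" are invented to show a conflicting property.
PROPERTY_GROUPS = {
    "birth date": {"birth data"},
    "birth place": {"birth data"},
    "death date": {"death data"},
    "death place": {"death data"},
    "burial place": {"death data", "place data"},
}


def candidate_clusters(item_properties):
    """Map each named group to the properties on the item that point at it."""
    clusters = defaultdict(set)
    for prop in item_properties:
        for group in PROPERTY_GROUPS.get(prop, ()):
            clusters[group].add(prop)
    return dict(clusters)


clusters = candidate_clusters(["birth date", "birth place", "burial place"])
for group in sorted(clusters):
    print(group, sorted(clusters[group]))
```

Here "burial place" shows up in two candidate clusters, which is the interesting case that the rest of this section tries to resolve.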

A group can be named, but strictly speaking that is not necessary. It is, however, a bit simpler to model in Wikidata if we say that a group has a name. Build a table holding all the named groups.

Each group is given a position at random in a space. We can instead initialize this position to a known value, which will often make the algorithm converge somewhat faster. Each statement is likewise given a position. If the property for the statement names only a single group, then reuse the position of that group; otherwise pick a position at random between the groups. (One solution is to use the mean with a little noise.)

Pick one of the statements from the item and move the position of the group a little towards the position of the statement, and the statement a bit more towards the position of the group. Repeat this for all the statements in the item. It is slightly better to visit the statements in random order, as this breaks cycles. After each run, normalize the span to keep the whole set from lumping together. Slowly the statements will move towards their preferred groups.
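
The initialization and update steps could be sketched roughly as follows. This is not a reference implementation: the function names, the two rates, the number of runs, and the choice to let a statement interact with the nearest of the groups it names are all assumptions made for the example, and the normalization shown (recentre, then rescale to a unit span) is only one way to keep the set from collapsing.

```python
import math
import random


def normalize(*tables):
    """Recentre all positions and rescale so the largest coordinate is 1."""
    points = [p for table in tables for p in table.values()]
    dims = len(points[0])
    centre = [sum(p[d] for p in points) / len(points) for d in range(dims)]
    span = max(abs(p[d] - centre[d]) for p in points for d in range(dims))
    if span == 0:
        return  # everything coincides; nothing to rescale
    for p in points:
        for d in range(dims):
            p[d] = (p[d] - centre[d]) / span


def cluster(named_groups, dims=2, runs=100, group_rate=0.05,
            statement_rate=0.2, seed=0):
    """Return (group positions, statement positions) after `runs` passes."""
    rng = random.Random(seed)
    groups = sorted({g for gs in named_groups.values() for g in gs})
    gpos = {g: [rng.uniform(-1, 1) for _ in range(dims)] for g in groups}
    spos = {}
    for s, gs in named_groups.items():
        gs = sorted(gs)
        if len(gs) == 1:
            spos[s] = list(gpos[gs[0]])  # reuse the group's position
        else:  # mean of the named groups plus a little noise
            spos[s] = [sum(gpos[g][d] for g in gs) / len(gs)
                       + rng.gauss(0, 0.01) for d in range(dims)]
    order = list(named_groups)
    for _ in range(runs):
        rng.shuffle(order)  # random visiting order breaks cycles
        for s in order:
            # Assumption: a statement interacts with its nearest named group.
            g = min(sorted(named_groups[s]),
                    key=lambda name: math.dist(spos[s], gpos[name]))
            for d in range(dims):
                delta = spos[s][d] - gpos[g][d]
                gpos[g][d] += group_rate * delta      # group moves a little
                spos[s][d] -= statement_rate * delta  # statement moves more
        normalize(gpos, spos)
    return gpos, spos


named_groups = {
    "birth date": {"birth data"}, "birth place": {"birth data"},
    "death date": {"death data"}, "death place": {"death data"},
    "burial place": {"death data", "place data"},  # invented conflict
}
gpos, spos = cluster(named_groups)
```

The statement rate is deliberately larger than the group rate, so that a single stray statement cannot drag a whole group away from its other members.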

Use a distance metric to measure which statements belong to which groups, and note the groups that have fewer than a given limit of members. If the statements in such a group also belong to some other group, they can be released from the inferior group. After more iterations those will move towards other groups, ending up in the most important one given some measure of importance.
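
One possible reading of this step, using Euclidean distance as the metric and a hypothetical `min_members` limit, is sketched below. The function name, the positions, and the `allowed` mapping (the groups each statement names) are made up for the illustration.

```python
import math


def release_small_groups(spos, gpos, allowed, min_members=2):
    """Assign statements to their nearest allowed group, then drop
    undersized groups from the allowed sets of statements that have
    another group to go to."""
    members = {g: [] for g in gpos}
    for s, pos in spos.items():
        nearest = min(sorted(allowed[s]),
                      key=lambda g: math.dist(pos, gpos[g]))
        members[nearest].append(s)
    small = {g for g, ms in members.items() if len(ms) < min_members}
    new_allowed = {}
    for s, groups in allowed.items():
        kept = set(groups) - small
        # A statement is never left without any group at all.
        new_allowed[s] = kept if kept else set(groups)
    return members, new_allowed


gpos = {"birth data": (0.0, 0.0), "death data": (1.0, 0.0),
        "place data": (0.0, 1.0)}
spos = {"birth date": (0.0, 0.1), "birth place": (0.1, 0.0),
        "death date": (1.0, 0.1), "death place": (0.9, 0.0),
        "burial place": (0.2, 0.8)}
allowed = {"birth date": {"birth data"}, "birth place": {"birth data"},
           "death date": {"death data"}, "death place": {"death data"},
           "burial place": {"death data", "place data"}}
members, new_allowed = release_small_groups(spos, gpos, allowed)
```

In this toy data "burial place" sits closest to "place data", but that group would have a single member, so the statement is released and only "death data" remains for it in later iterations.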

By increasing the number of dimensions the groups are more likely to bypass each other. Fewer dimensions make them more likely to stick together.

This is really just basic cluster analysis.

Fuzzy cluster analysis

Now assume that the distance we move is given by some closeness measure between the fixed statement and some other tested statement belonging to the group. Such a closeness measure can be calculated from a metric like the path length from each subproperty to a common ancestor property. This is like the small-world distance in the type hierarchy.
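
A small sketch of such a path-length measure follows. The subproperty hierarchy in `PARENT` is invented for the example, and turning the path length into a closeness with 1 / (1 + length) is just one convenient assumption; any decreasing function would do.

```python
# Hypothetical subproperty hierarchy: child -> parent.
PARENT = {
    "birth date": "date",
    "death date": "date",
    "date": "point in time",
    "birth place": "place",
    "place": "location",
}


def ancestors(prop):
    """Return {ancestor: steps up}, including prop itself at 0 steps."""
    steps, found = 0, {}
    while prop is not None and prop not in found:  # also guards against loops
        found[prop] = steps
        prop = PARENT.get(prop)
        steps += 1
    return found


def path_length(a, b):
    """Shortest path between a and b through a common ancestor, or None."""
    up_a, up_b = ancestors(a), ancestors(b)
    common = set(up_a) & set(up_b)
    if not common:
        return None
    return min(up_a[c] + up_b[c] for c in common)


def closeness(a, b):
    """1.0 for identical properties, falling towards 0.0 with path length."""
    length = path_length(a, b)
    return 0.0 if length is None else 1.0 / (1.0 + length)


print(path_length("birth date", "death date"))
print(closeness("birth date", "birth place"))
```

With this toy hierarchy the two dates meet at "date" after one step each, while "birth date" and "birth place" share no ancestor and get a closeness of zero.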

Such distance metrics can also be calculated in other, more involved ways. It is for example possible to group external links by the language used on the site, noting that some languages are close enough to be intelligible for most users while others are less usable.

In general the weights are calculated between a pair of statements by a method identified by the specific pair of properties.
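
One way to picture this is a lookup table keyed by an unordered pair of properties, where each entry is the method that computes the weight for that pair. The table `WEIGHT_METHODS`, the statement dictionaries, and the list of mutually intelligible languages are assumptions for the sketch, not a description of any existing code.

```python
# Illustrative only: pairs of language codes treated as mutually intelligible.
INTELLIGIBLE = {frozenset({"nb", "nn"}), frozenset({"nb", "da"})}


def language_closeness(a, b):
    """Closeness of two external links, judged by the language of the site."""
    if a["language"] == b["language"]:
        return 1.0
    pair = frozenset({a["language"], b["language"]})
    return 0.5 if pair in INTELLIGIBLE else 0.0


def same_value(a, b):
    """Full weight when both statements have the same value, else none."""
    return 1.0 if a["value"] == b["value"] else 0.0


# The method is identified by the (unordered) pair of properties.
WEIGHT_METHODS = {
    frozenset({"official website"}): language_closeness,
    frozenset({"official website", "described at URL"}): language_closeness,
    frozenset({"birth place", "death place"}): same_value,
}


def weight(a, b, default=0.0):
    """Weight between two statements, or `default` for unknown pairs."""
    key = frozenset({a["property"], b["property"]})
    method = WEIGHT_METHODS.get(key)
    return default if method is None else method(a, b)


site_nb = {"property": "official website", "language": "nb"}
site_nn = {"property": "described at URL", "language": "nn"}
site_fi = {"property": "official website", "language": "fi"}
born = {"property": "birth place", "value": "Oslo"}
died = {"property": "death place", "value": "Oslo"}
print(weight(site_nb, site_nn), weight(site_nb, site_fi), weight(born, died))
```

Using a `frozenset` as the key means the order of the two statements does not matter, and a pair of statements with the same property is simply a one-element key.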

Use of a closeness measure will make it possible to express that some statements have rather weak connections, thereby making it possible to extend the tested set.

It is possible to use hard-coded weights, but it could mean less maintenance to calculate them on the fly. If the weights are produced as a result of some heavy analysis, then it is obviously better to store the weights somehow.

This is more or less fuzzy clustering.

Grouping vs ordering

Sometimes we don't only want to know which specific group a statement belongs to, but also want to organize the groups in a specific order. One way to do that is to mark specific meta associations so we know if they are before or after some other named meta association. If an association A should be before some other association B, then we penalize B if that is not the case and drag it "down" (whatever that should be). Likewise we can do the opposite and pull A "up" if it lags behind B.
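
A minimal sketch of one such correction pass, assuming the order is represented as a position on a single axis where a smaller value means earlier. The function name, the fixed step size, and the starting positions are assumptions for the example.

```python
def ordering_pass(pos, before, step=0.1):
    """One pass over the constraints; returns the number of violations.

    pos maps a group name to a position on one axis (smaller is earlier),
    and before is a list of (a, b) pairs meaning a must precede b.
    """
    violations = 0
    for a, b in before:
        if pos[a] >= pos[b]:
            violations += 1
            pos[a] -= step  # pull a "up"
            pos[b] += step  # drag b "down"
    return violations


pos = {"birth data": 0.5, "death data": 0.25}
before = [("birth data", "death data")]
passes = 0
while ordering_pass(pos, before, step=0.125):
    passes += 1
print(passes, pos)
```

Each pass only nudges the two groups, so the ordering emerges over several iterations and can be mixed with the clustering moves rather than imposed in one go.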

This kind of positional grouping can be necessary to maintain a given script during natural language generation, but also to maintain a natural information flow in an infobox. For example, birth information should come before death information.

Note that the actual "sort-order" is not necessary if the purpose is only to create a grouping. Only if we want a specific sort order do we need to set what comes before and after other groups.

Often we only have an ordering between some of the groups or properties, and the remaining groups or statements flow to wherever they fit in. If there are constraints on ordering, the iterations should run until all are satisfied or a time limit is reached.
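
The stopping rule could look something like the sketch below, where `step` is any callable that performs one iteration and returns the number of constraints still violated. The names `run_until_satisfied`, `time_limit`, and `max_iterations` are hypothetical, and the "occupation" group is invented to show a group without any ordering constraint.

```python
import time


def run_until_satisfied(step, time_limit=1.0, max_iterations=10_000):
    """Call step() until it reports zero violations, or until the time
    limit or the iteration cap is hit. Returns (satisfied, iterations)."""
    deadline = time.monotonic() + time_limit
    iteration = 0
    for iteration in range(1, max_iterations + 1):
        if step() == 0:
            return True, iteration
        if time.monotonic() >= deadline:
            break
    return False, iteration


# Toy state: only birth/death are constrained; "occupation" floats freely.
pos = {"birth data": 2.0, "death data": 1.0, "occupation": 1.5}
before = [("birth data", "death data")]


def step():
    violations = 0
    for a, b in before:
        if pos[a] >= pos[b]:
            violations += 1
            pos[a] -= 0.25
            pos[b] += 0.25
    return violations


satisfied, iterations = run_until_satisfied(step, time_limit=5.0)
print(satisfied, iterations, pos)
```

When the function returns `False`, the caller knows the constraints were not all met and can decide what to do with the offending statements.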

More often than not the ordering is a hard requirement when used for natural language generation. If we can't fulfill the ordering, the statement must be removed. The reason is that we must establish a theme before we can go on to a rheme, i.e. the text must say who the father is before making claims about the father, and so forth.

Example

 <p:birth date> group <q:birth data> .
 <p:birth place> group <q:birth data> .
 <p:death date> group <q:death data> .
 <p:death place> group <q:death data> .
 <q:death data>
   title "Death" ;
   sort-order 800 ;
   after <q:birth data> .
 <q:birth data>
   title "Birth" ;
   sort-order 900 ;
   before <q:death data> .

Note that we can write <death data> after <birth data> . or <birth data> before <death data> . since the two forms say the same thing.
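
Before and after declarations like these can contradict each other, for instance when two groups are each declared to come after the other. A hedged sketch of a consistency check, using `graphlib.TopologicalSorter` from the Python standard library (3.9 or later), is shown below. The `after` mapping is an assumed in-memory form of the declarations, with each group pointing at the groups it must follow.

```python
from graphlib import CycleError, TopologicalSorter


def group_order(after):
    """Return the groups in an order that respects every 'after' constraint,
    or None when the constraints contradict each other."""
    try:
        return list(TopologicalSorter(after).static_order())
    except CycleError:
        return None


consistent = {"death data": {"birth data"}}
contradictory = {"death data": {"birth data"}, "birth data": {"death data"}}
print(group_order(consistent))
print(group_order(contradictory))
```

A check like this is static and only tells us whether a complete order exists; it does not replace the iterative approach described above, which also has to cope with groups that have no ordering at all.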

See also