Help:Dataset sizing

Purpose

This page lists and defines a few standard metrics that can be computed for a subset of Wikidata items.

For metrics used elsewhere, it tries to provide queries that can be run on the Wikidata Query Service.

Version

This is the version as of 20200823075556. Please use the "permanent link" on the left side when quoting this page.

Introduction

Sample queries that select the subset of items to be measured:

  • sleds: SELECT ?item WHERE { ?item wdt:P279* wd:Q181388 }
  • tennis: SELECT ?item WHERE { ?item wdt:P641 wd:Q847 }


Knowledge Graphs on the Web -- an Overview (Q86997852) proposes a few metrics:

  • a. # instances
  • b. # assertions
  • c. average linking degree
  • d. median ingoing edges
  • e. median outgoing edges
  • f. # classes
  • g. # relations
  • h. average depth of class tree
  • i. average branching factor of class tree (average width of class tree)
  • j. ontological complexity

They are described in section 3, "Comparison of Knowledge Graphs", of the paper.

See the discussion at Wikidata:Request a query#Dataset sizing.

The queries below are mostly based on truthy main statements (wdt:), not on qualifiers (pq:), references (pr:), sitelinks, or labels/descriptions/aliases. Please help expand them or add alternative ways to calculate the metrics.

A few other metrics are included as well.

Basic metrics

number of instances

definition
number of distinct items
#  a. # instances
SELECT (COUNT(DISTINCT ?item) as ?nb_instance)
WHERE
{
     ?item wdt:P279* wd:Q181388 .
     # ?item wdt:P641 wd:Q847 .
}

number of assertions

#  b. # assertions: sum of the statement counts (wikibase:statements) of the selected items
# TBD: include sitelinks?
SELECT (SUM(?st) as ?nb_assertions) 
WITH 
{
    SELECT DISTINCT ?item ?st 
    WHERE
    {
        ?item wdt:P279* wd:Q181388 .
        # ?item wdt:P641 wd:Q847 .
        ?item wikibase:statements ?st . 
    }      
} as %a
{
  INCLUDE %a 
}
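The open question in the query above (whether to include sitelinks) could be explored with a variant like the following. This is only a sketch: it assumes that each sitelink counts as one assertion, which the paper may or may not intend, and it relies on the per-item sitelink count that the query service exposes as wikibase:sitelinks.

```sparql
#  b. # assertions, alternative: statements plus sitelinks
# Assumption: each sitelink counts as one assertion
SELECT (SUM(?st + ?sl) as ?nb_assertions)
WITH
{
    SELECT DISTINCT ?item ?st ?sl
    WHERE
    {
        ?item wdt:P279* wd:Q181388 .
        # ?item wdt:P641 wd:Q847 .
        ?item wikibase:statements ?st ;
              wikibase:sitelinks ?sl .
    }
} as %a
{
  INCLUDE %a
}
```

Comparing this result with the statements-only total shows how much of the subset's size comes from sitelinks.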


average linking degree

#  c. average linking degree: approximated here by the average number of statements per item
# TBD: include incoming links?
SELECT (AVG(?st) as ?avg_linking_degree)
WITH 
{
    SELECT DISTINCT ?item ?st 
    WHERE
    {
        ?item wdt:P279* wd:Q181388 .
        # ?item wdt:P641 wd:Q847 .
        ?item wikibase:statements ?st . 
    }      
} as %a
{
  INCLUDE %a 
}

median ingoing edges

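#  d. median ingoing edges: number of ingoing edges per item
# After running the query below, calculate the median of ?nb_ingoing_edges.
# Note: items without any ingoing edge are missing from the result; count them as 0.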
SELECT ?item (COUNT(?wdt) as ?nb_ingoing_edges) 
WITH 
{
    SELECT DISTINCT ?item
    WHERE
    {
        ?item wdt:P279* wd:Q181388 .
     # ?item wdt:P641 wd:Q847 .
    }      
} as %a
{
  INCLUDE %a 
  ?p wikibase:directClaim ?wdt ; wikibase:propertyType wikibase:WikibaseItem .
  [] ?wdt ?item 
}
GROUP BY ?item


median outgoing edges

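#  e. median outgoing edges: number of outgoing edges per item
# After running the query below, calculate the median of ?nb_outgoing_edges.
# Note: items without any outgoing item-valued statement are missing from the result; count them as 0.
# Alternative method: also include external-id properties.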
SELECT ?item (COUNT(?wdt) as ?nb_outgoing_edges) 
WITH 
{
    SELECT DISTINCT ?item
    WHERE
    {
        ?item wdt:P279* wd:Q181388 .
     # ?item wdt:P641 wd:Q847 .
    }      
} as %a
{
  INCLUDE %a 
  ?p wikibase:directClaim ?wdt ; wikibase:propertyType wikibase:WikibaseItem .
  ?item ?wdt []
}
GROUP BY ?item


number of relations

#  g. # relations
# Currently counts only properties used in truthy statements to or from the selected items.
# Could be expanded to other kinds of relations.

SELECT (COUNT(DISTINCT ?wdt) as ?nb_relations) 
WITH 
{
    SELECT DISTINCT ?item
    WHERE
    {
        ?item wdt:P279* wd:Q181388 .
     #  ?item wdt:P641 wd:Q847 .

    }      
} as %a
{
  INCLUDE %a 
  ?p wikibase:directClaim ?wdt .
  { ?item ?wdt [] } UNION { [] ?wdt ?item }
}

number of classes (types)

definition
number of distinct values used with instance of (P31) or subclass of (P279)
query
#  f. # classes
SELECT (COUNT(DISTINCT ?class) as ?nb_classes) 
WITH 
{
    SELECT DISTINCT ?item
    WHERE
    {
        ?item wdt:P279* wd:Q181388 .
     # ?item wdt:P641 wd:Q847 .
    }      
} as %a
{
  INCLUDE %a 
  ?item (wdt:P31|wdt:P279) ?class       
}

Most frequent

most frequently used properties

definition
properties most frequently used as main values (truthy values)
query
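No query is given for this metric yet. A minimal sketch, following the same pattern as the queries under "Basic metrics", might look like the following. The variable name ?nb_uses and the LIMIT of 20 are illustrative choices, not part of any definition.

```sparql
# most frequently used properties (truthy main values) on the selected items
SELECT ?p (COUNT(?wdt) as ?nb_uses)
WITH
{
    SELECT DISTINCT ?item
    WHERE
    {
        ?item wdt:P279* wd:Q181388 .
        # ?item wdt:P641 wd:Q847 .
    }
} as %a
{
  INCLUDE %a
  # map each property entity to its truthy (wdt:) predicate
  ?p wikibase:directClaim ?wdt .
  # one solution per truthy statement of the item
  ?item ?wdt []
}
GROUP BY ?p
ORDER BY DESC(?nb_uses)
LIMIT 20
```

Replacing COUNT(?wdt) with COUNT(DISTINCT ?item) would instead rank properties by the number of items that use them at least once.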

most frequent sitelinks

definition
most frequently linked WMF sites (Wikipedia, Commons, Wikisource, etc.)
query
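No query is given for this metric yet. One possible sketch, assuming the usual sitelink modelling on the query service (schema:about pointing from the page to the item, schema:isPartOf pointing from the page to the site), is shown below. The variable name ?nb_sitelinks is an illustrative choice.

```sparql
# most frequently linked WMF sites for the selected items
SELECT ?site (COUNT(?sitelink) as ?nb_sitelinks)
WITH
{
    SELECT DISTINCT ?item
    WHERE
    {
        ?item wdt:P279* wd:Q181388 .
        # ?item wdt:P641 wd:Q847 .
    }
} as %a
{
  INCLUDE %a
  # each sitelink is a page that is about the item and part of one site
  ?sitelink schema:about ?item ;
            schema:isPartOf ?site .
}
GROUP BY ?site
ORDER BY DESC(?nb_sitelinks)
```

Since an item has at most one sitelink per site, the count per site is also the number of selected items with a page on that site.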

most frequently used classes (types)

definition
most frequent values used with instance of (P31) or subclass of (P279). Sometimes limited to P31.
query
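No query is given for this metric yet. A minimal sketch that reuses the property path from the "number of classes (types)" query could look like this. The variable name ?nb_uses and the LIMIT of 20 are illustrative choices.

```sparql
# most frequently used classes (values of P31 or P279) on the selected items
SELECT ?class (COUNT(?item) as ?nb_uses)
WITH
{
    SELECT DISTINCT ?item
    WHERE
    {
        ?item wdt:P279* wd:Q181388 .
        # ?item wdt:P641 wd:Q847 .
    }
} as %a
{
  INCLUDE %a
  # to limit the ranking to instance of (P31), replace the path with wdt:P31
  ?item (wdt:P31|wdt:P279) ?class
}
GROUP BY ?class
ORDER BY DESC(?nb_uses)
LIMIT 20
```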