Notes on SPARQL Aggregates

Aggregates functions like SUM, MIN, MAX, AVG and GROUP_CONCAT provide an easy way to aggregate over a complete dataset, or groups within a dataset in combination with a GROUP BY clause. This provides an easy way to summarise information, but often we want to know more about specific solutions within a group.

For demo purposes, consider a dataset consisting of information about books, an excerpt of which is shown below:

base <http://example.com/id/>
prefix schema: <http://schema.org/>

<book/1> a schema:Book ;
  schema:name "The Hobbit" ;
  schema:genre "Fantasy" ;
  schema:pages 367 ;
  schema:author <person/1> .

<book/2> a schema:Book ;
  schema:name "The Lord Of The Rings: The Fellowship Of The Ring" ;
  schema:genre "Fantasy" ;
  schema:pages 404 ;
  schema:author <person/1> .

<book/2> a schema:Book ;
  schema:name "The Lord Of The Rings: The Two Towers" ;
  schema:genre "Fantasy" ;
  schema:pages 450 ;
  schema:author <person/1> .

<book/2> a schema:Book ;
  schema:name "The Lord Of The Rings: The Return Of The King" ;
  schema:genre "Fantasy" ;
  schema:pages 496 ;
  schema:author <person/1> .

<person/1> a schema:Person ;
  schema:name "J.R.R. Tolkien" .

Problem: finding solutions within an aggregate grouping

Use case: finding the longest book in each genre.

Solutions:

Concatenate the date + filename and use MAX aggregate, then split the during projection to give filename

prefix schema: <http://schema.org/>
select ?genre (strafter(max(concat(?Book_pages, "//", ?Book_name)), "//") as ?Book_pages_name)
where  {
  [] a schema:Book ;
    schema:name ?name ;
    schema:pages ?pages ;
    schema:genre ?genre .
}
group by ?genre

This quick approach can satisfy many simpler use cases, but starts to fall apart when you want to return more information about the solution within a group (especially if that information consists of RDF terms that are not strings or plain literals). Also this approach will return once solution from each group, in this case the book with most pages and the alphabetically last name. In some use cases, we may want to know all books with the maximum number of pages within each group.

Use sub-select and join the results.

prefix schema: <http://schema.org/>
select *
where {
  {
    # find max pages per genre
    select ?genre (max(?pages) as ?max_pages) {
      [] a schema:Book ;
        schema:genre ?genre ;
        schema:pages ?pages .
    }
    group by ?genre
  }
  ?Book a schema:Book ;
    schema:name ?name ;
    schema:genre ?genre ;
    schema:pages ?max_pages .
}

Problem: calculating the statistics of aggregates

Use case: How many authors have X number of books?

Solutions:

  • Use sub-select with aggregation to calculate the number of children per family and project this
  • In outer query aggregate by the projected (aggregated) value to calculate count of families with that number of children

As demonstrated in the query:

prefix schema: <http://schema.org/>
select ?count_books (count(*) as ?count_authors)
where {
  select ?author (count(*) as ?count_books) {
    [] a schema:Book ;
      schema:author ?author .
  }
  group by ?author
}
group by ?count_books