How to Group By “Nothing” in SQL

The SQL standard includes a lesser-known feature called GROUPING SETS. One particular side-effect of that feature is that we can group by “nothing” in SQL, e.g. when querying the Sakila database:

SELECT count(*)
FROM film
GROUP BY ()

This will yield:

count |
------|
1000  |

What’s the point, you’re asking? Can’t we just omit the GROUP BY clause? Of course, this will yield the same result:

SELECT count(*)
FROM film

Yet, the two versions of the query are subtly different. The latter will always return exactly one row. The former will perform grouping and return all the groups. How is this different? Just add a predicate!

SELECT count(*)
FROM film
WHERE 1 = 0
GROUP BY ();

SELECT count(*)
FROM film
WHERE 1 = 0;

Now, the first query will produce nothing!

count |
------|

Whereas the second one produces:

count |
------|
0     |

Subtle, eh? Note that DB2, Oracle, and SQL Server expose the above behaviour, whereas PostgreSQL does not produce the above result: it seems to implement the SQL standard (and thus always produces a row), as shown by Markus Winand:

In SQL:1999 (when it was introduced), the <empty grouping set> was called <grand total>, akin to a grand total that can be calculated in a Microsoft Excel Pivot Table. It does make more sense for grand totals to always be present in the result, despite the absence of any input data.

Standards…
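
As an aside, here’s what the feature looks like when the empty grouping set is combined with a real one. A sketch against the Sakila film table (rating is an actual column there):

SELECT rating, count(*)
FROM film
GROUP BY GROUPING SETS ((rating), ())

This returns one row per rating plus the grand total row (with rating set to NULL), all in a single query.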

What if your database doesn’t support grouping sets?

Not all databases support the awesome GROUPING SETS feature. Among the ones supported by jOOQ, these do:

  • DB2 LUW
  • HANA
  • Oracle
  • PostgreSQL 9.5+
  • SQL Server
  • Sybase SQL Anywhere
  • Teradata

Note that the following databases support a vendor-specific syntax for ROLLUP, which doesn’t help with the empty grouping set (see the MySQL sketch after the list):

  • CUBRID
  • MariaDB
  • MySQL
  • Vertica
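
For illustration, here’s what that vendor-specific syntax looks like in MySQL. A sketch against Sakila’s film table:

SELECT rating, count(*)
FROM film
GROUP BY rating WITH ROLLUP;

The extra row at the end (with rating set to NULL) is the grand total, but there’s no way to ask for only that row.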

So, can we emulate it for the other databases?

Of course. There are two ways to emulate the empty grouping set:

By using a constant

You could try using a constant literal:

SELECT count(*)
FROM film
WHERE 1 = 0
GROUP BY 'a';

Some databases won’t accept a constant literal in the GROUP BY clause, so you’ll have to tweak them into thinking it is not a constant literal:

SELECT count(*)
FROM film
WHERE 1 = 0
GROUP BY 'a' || 'b';

And if that’s also not supported, try wrapping the literal in a subquery:

SELECT count(*)
FROM film
WHERE 1 = 0
GROUP BY (SELECT 1);

One of the above three syntaxes is usually accepted by these databases:

  • Firebird
  • HSQLDB
  • MariaDB
  • MySQL
  • PostgreSQL
  • Redshift
  • SQLite
  • Vertica

By using a dummy table

In rare cases, none of the above works as the database’s SQL parser tries to be “clever” and rejects my silly attempts to fool it. But no one can fool me!

I’ll just cross join whatever is in the FROM clause with a dummy table (akin to an emulation of table dee) and then group by the dummy table’s column:

SELECT count(*)
FROM film, (SELECT 1 x) dummy
WHERE 1 = 0
GROUP BY dummy.x;

This is guaranteed to work, including on these databases:

  • Access
  • Informix
  • Ingres
  • SQL Data Warehouse
  • Sybase ASE

Q.E.D. 👏

(Needless to say, jOOQ supports this emulation. You can play around with it here: https://www.jooq.org/translate)

Correlated Subqueries are Evil and Slow. Or are They?

A common myth in SQL is the idea that correlated subqueries are evil and slow. For example, this query here:

SELECT 
  first_name, last_name,
  (SELECT count(*) 
   FROM film_actor fa 
   WHERE fa.actor_id = a.actor_id)
FROM actor a

It “forces” the database engine to run a nested loop of the form (in pseudo code):

for (Actor a : actor) {
  output(
    a.first_name,
    a.last_name,
    film_actor.where(fa -> fa.actor_id = a.actor_id)
              .size()
  );
}

So, for every actor, collect all the corresponding film_actor rows and count them. This will produce the number of films each actor has played in.

It appears that it would be much better to run this query in “bulk”, i.e. to run:

SELECT 
  first_name, last_name, count(fa.actor_id)
FROM actor a
LEFT JOIN film_actor fa USING (actor_id)
GROUP BY actor_id, first_name, last_name

But is it really faster? And if so, why would you expect that?

Bulk aggregation vs nested loops

Bulk aggregation in this case really just means that we’re collecting all actors and all film_actors and we then merge them in memory for the group by operation. The execution plan (in Oracle) looks like this:

-------------------------------------------------------------------
| Id  | Operation              | Name                    | A-Rows |
-------------------------------------------------------------------
|   0 | SELECT STATEMENT       |                         |    200 |
|   1 |  HASH GROUP BY         |                         |    200 |
|*  2 |   HASH JOIN (OUTER)    |                         |   5462 |
|   3 |    TABLE ACCESS FULL   | ACTOR                   |    200 |
|   4 |    INDEX FAST FULL SCAN| IDX_FK_FILM_ACTOR_ACTOR |   5462 |
-------------------------------------------------------------------

There are 5462 rows in our film_actor table, and for each actor, we join and group and aggregate them all to get to the results. Let’s compare this to the nested loop’s plan:

-----------------------------------------------------------------------
| Id  | Operation         | Name                    | Starts | A-Rows |
-----------------------------------------------------------------------
|   0 | SELECT STATEMENT  |                         |      1 |    200 |
|   1 |  SORT AGGREGATE   |                         |    200 |    200 |
|*  2 |   INDEX RANGE SCAN| IDX_FK_FILM_ACTOR_ACTOR |    200 |   5462 |
|   3 |  TABLE ACCESS FULL| ACTOR                   |      1 |    200 |
-----------------------------------------------------------------------

I’ve now included the “Starts” column to illustrate that after having collected all 200 actors, we start the subquery 200 times to get the count for each actor, but this time with an index range scan.

From the plan only, can we tell which one is faster? Bulk aggregation will need more memory (to load all the data), but has a lower algorithmic complexity (linear). Nested looping will need less memory (all the required info is available from the index directly) but seems to have a higher algorithmic complexity (quadratic).

Fact is, this isn’t exactly true.

The bulk aggregation is indeed linear, but according to O(M + N) where M = number of actors and N = number of film_actors, whereas nested looping isn’t quadratic, it’s O(M log N). We don’t need to traverse the entire index to get the count.
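
As a back-of-the-envelope illustration with our Sakila numbers (M = 200 actors, N = 5462 film_actor rows), naively treating the constant factors as equal (which they never are in practice):

M + N       = 200 + 5462 = 5662
M · log₂(N) ≈ 200 · 12.4 ≈ 2480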

At some point, the higher complexity is worse, but with this little amount of data, it’s not:

On the x-axis is the size of N, and on the y-axis is the “complexity value”, i.e. how much time (or memory) is used by an algorithm.

[Chart: Effects of algorithmic complexity for large N]

[Chart: Effects of algorithmic complexity for “small” N]

Here’s a disclaimer about the above:

Algorithmic complexity for “small N” = “works on my machine” ;)

There’s nothing better than proving things with measurements. Let’s run the following PL/SQL program:

SET SERVEROUTPUT ON
DECLARE
  v_ts TIMESTAMP;
  v_repeat CONSTANT NUMBER := 1000;
BEGIN
  v_ts := SYSTIMESTAMP;
  
  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT 
        first_name, last_name,
        (SELECT count(*) 
         FROM film_actor fa 
         WHERE fa.actor_id = a.actor_id)
      FROM actor a
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;
  
  dbms_output.put_line('Nested select: ' || (SYSTIMESTAMP - v_ts));
  v_ts := SYSTIMESTAMP;
  
  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT 
        first_name, last_name, count(film_id)
      FROM actor a
      LEFT JOIN film_actor fa USING (actor_id)
      GROUP BY actor_id, first_name, last_name
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;
  
  dbms_output.put_line('Group by     : ' || (SYSTIMESTAMP - v_ts));
END;
/

After three runs, and against our standard Sakila database (get it here: https://www.jooq.org/sakila) with 200 actors and 5462 film_actors, we can see that the nested select consistently outperforms the bulk group by:

Nested select: +000000000 00:00:01.122000000
Group by     : +000000000 00:00:03.191000000

Nested select: +000000000 00:00:01.116000000
Group by     : +000000000 00:00:03.104000000

Nested select: +000000000 00:00:01.122000000
Group by     : +000000000 00:00:03.228000000

Helping the optimiser

Some interesting feedback by Markus Winand (author of http://sql-performance-explained.com) was given on Twitter:

There is a third option: Nesting the GROUP BY operation in a derived table:

SELECT
  first_name, last_name, c
FROM actor a
JOIN (
  SELECT actor_id, count(*) c
  FROM film_actor
  GROUP BY actor_id
) USING (actor_id)

which produces a slightly better plan than the “ordinary” group by query:

--------------------------------------------------------------------
| Id  | Operation               | Name                    | A-Rows |
--------------------------------------------------------------------
|   0 | SELECT STATEMENT        |                         |    200 |
|*  1 |  HASH JOIN              |                         |    200 |
|   2 |   TABLE ACCESS FULL     | ACTOR                   |    200 |
|   3 |   VIEW                  |                         |    200 |
|   4 |    HASH GROUP BY        |                         |    200 |
|   5 |     INDEX FAST FULL SCAN| IDX_FK_FILM_ACTOR_ACTOR |   5462 |
--------------------------------------------------------------------

Adding it to the benchmark as such:

SET SERVEROUTPUT ON
DECLARE
  v_ts TIMESTAMP;
  v_repeat CONSTANT NUMBER := 1000;
BEGIN
  v_ts := SYSTIMESTAMP;
   
  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT
        first_name, last_name,
        (SELECT count(*) 
         FROM film_actor fa 
         WHERE fa.actor_id = a.actor_id)
      FROM actor a
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;
   
  dbms_output.put_line('Nested select            : ' || (SYSTIMESTAMP - v_ts));
  v_ts := SYSTIMESTAMP;
   
  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT
        first_name, last_name, count(film_id)
      FROM actor a
      LEFT JOIN film_actor fa USING (actor_id)
      GROUP BY actor_id, first_name, last_name
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;
   
  dbms_output.put_line('Group by                 : ' || (SYSTIMESTAMP - v_ts));
  v_ts := SYSTIMESTAMP;
  
  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT 
        first_name, last_name, c
      FROM actor a
      JOIN (
        SELECT actor_id, count(*) c
        FROM film_actor
        GROUP BY actor_id
      ) USING (actor_id)
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;

  dbms_output.put_line('Group by in derived table: ' || (SYSTIMESTAMP - v_ts));
END;
/

… shows that it’s already very close to the nested select option:

Nested select            : +000000000 00:00:01.371000000
Group by                 : +000000000 00:00:03.417000000
Group by in derived table: +000000000 00:00:01.592000000

Nested select            : +000000000 00:00:01.236000000
Group by                 : +000000000 00:00:03.329000000
Group by in derived table: +000000000 00:00:01.595000000

Conclusion

We have shown that under some circumstances, correlated subqueries can be better than bulk aggregation. In Oracle. With small to medium-sized data sets. In other cases, that’s not true: as the sizes of M and N, our two algorithmic complexity variables, increase, O(M log N) will be much worse than O(M + N).

The conclusion here is: Don’t trust any initial judgement. Measure. When you run such a query a lot of times, 3x slower can make a big difference. But don’t go replace all your bulk aggregations either. :)

Liked this article? The content is part of our Data Geekery SQL performance training

How SQL GROUP BY Should Have Been Designed – Like Neo4j’s Implicit GROUP BY

In the recent past, we’ve explained the syntactic implications of the SQL GROUP BY clause. If you haven’t already, you should read our article “Do You Really Understand SQL’s GROUP BY and HAVING clauses?“.

In essence, adding a GROUP BY clause to your query transforms it in rather implicit ways. The following reminder summarises the previous article:

  • Only column expressions referenced in the GROUP BY clause, or aggregations of other column expressions, may appear in the SELECT clause
  • Aggregations without an explicit GROUP BY clause imply the “grand total” GROUP BY () clause
  • Some databases (e.g. MySQL, and to some extent: the SQL standard) don’t follow these rules and allow for arbitrary column expressions (or at least functionally dependent column expressions) in the SELECT clause, as shown in the example below
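
For example, the following query is accepted by MySQL, and also by PostgreSQL 9.1+ following the standard’s functional dependency rules, because first_name and last_name are functionally dependent on the primary key actor_id (a sketch; support varies by database and version):

SELECT actor_id, first_name, last_name
FROM actor
GROUP BY actor_id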

How SQL GROUP BY should have been designed

There is another way of looking at GROUP BY, and it has been implemented in the equally fascinating, beautiful, and weird Cypher query language (those are good attributes) as supported by the Neo4j graph database. This alternative (yet SQL inspired) query language probably deserves a whole blog post series on its own, but let’s focus on aggregation. Because aggregation is the primary use case for grouping.

(for the record, check out the Neo4j docs about aggregation for details)

A quick wrap-up to understand Cypher:

Consider this simple Cypher query:

MATCH (me:Person)-->(friend:Person)
                 -->(friend_of_friend:Person)
WHERE me.name = 'A'
RETURN count(DISTINCT friend_of_friend), 
       count(friend_of_friend)

In the above query:

  • Cypher
    (me:Person)-->(friend:Person)
               -->(friend_of_friend:Person)
    

    corresponds roughly to SQL

         Person AS me 
    JOIN Person AS friend 
      ON [ implicit equi-join predicate ]
    JOIN Person as friend_of_friend
      ON [ implicit equi-join predicate ]
    

Cypher’s way of writing JOIN is actually extremely useful and could also be applied to SQL. It is only a matter of time until someone writes a Cypher-to-SQL transformer that implements the syntax, at least as syntactic sugar for the equivalent ANSI equi-join notation.

Let’s investigate aggregation in Cypher

Here’s the query again:

MATCH (me:Person)-->(friend:Person)
                 -->(friend_of_friend:Person)
WHERE me.name = 'A'
RETURN count(DISTINCT friend_of_friend), 
       count(friend_of_friend)

So, in SQL terms, this is exactly the same as:

SELECT count(DISTINCT friend_of_friend), 
       count(friend_of_friend)
FROM   [ Persons ... ]
WHERE  me.name = 'A'

In other words, the same implicit grand total GROUP BY () is implied and all values are aggregated into a single row.

The next example from the Neo4j docs is more intriguing. This will count the number of nodes connected to a node n with name = 'A':

MATCH (n { name: 'A' })-->(x)
RETURN n, count(*)

Which is a shorter form for writing:

MATCH (n)-->(x)
WHERE n.name = 'A'
RETURN n, count(*)

This example will also perform aggregation, but this time with an implicit GROUP BY n clause. In SQL, you’d write something like:

SELECT   n.id, count(*)
FROM     n
JOIN     x
  ON     [ implicit equi-join predicate ]
WHERE    n.name = 'A'
GROUP BY n.id

The nice thing in Cypher is that the obvious GROUP BY clause (it can only be GROUP BY n.id) is implied. It doesn’t have to be written explicitly.

Takeaway for SQL

We’ve seen a couple of nice Cypher language features, especially the incredibly nice way to write “JOIN” (or rather: graph traversal in Neo4j). But a much more obvious, low-hanging fruit, one with an actual chance of making it into the SQL standard, would be to make the SQL GROUP BY clause optional and dependent on the SELECT clause, using the following rules:

  • If SELECT contains no aggregation functions, there shall be no implied GROUP BY clause
  • If SELECT contains 1-N aggregation functions, there shall be an implied GROUP BY clause formed from the remaining columns
  • If SELECT contains only aggregation functions, the “grand total” GROUP BY () shall apply
  • An explicit GROUP BY clause will always be preferred to any implied GROUP BY clause
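
For example, under the second rule, the following hypothetical query (today, standard SQL would reject it without an explicit GROUP BY clause) would imply GROUP BY first_name, last_name:

SELECT first_name, last_name, count(*)
FROM actor
JOIN film_actor USING (actor_id)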

If any of you ISO / IEC committee members are reading this, this is on my wish list for a future SQL standard. And please, PostgreSQL. Implement this right away.


How to Translate SQL GROUP BY and Aggregations to Java 8

I couldn’t resist. I have read this question by Hugo Prudente on Stack Overflow. And I knew there had to be a better way than what the JDK has to offer.

The question reads:

I’m looking for a lambda to refine the data already retrieved. I have a raw resultset, if the user do not change the date I want use java’s lambda to group by the results for then. And I’m new to lambdas with java.

The lambda I’m looking for works simliar to this query.

SELECT
    z, w, 
    MIN(x), MAX(x), AVG(x), 
    MIN(y), MAX(y), AVG(y) 
FROM table 
GROUP BY z, w;

SQL is declarative. Functional programming is not.

Before we go on with this discussion, let’s establish a very important fact. SQL is a completely declarative language. Functional (or “functional-ish”, to keep the Haskell-aficionados at peace) programming languages like Java 8 are not declarative. While expressing data transformation algorithms using functions is much more concise than expressing them using objects, or worse, using imperative instructions, you’re still explicitly expressing the algorithm.

When you write SQL, you don’t write any algorithm. You just describe the result you want to have. The SQL engine’s optimiser will figure out the algorithm for you – e.g. based on the fact that you may have an index on Z but not on W or on (Z, W).
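
For instance, with the table from the question (renamed to t here, because TABLE is a reserved word), an index like the following might allow the engine to satisfy GROUP BY z, w with an index scan instead of a sort. A sketch, of course; whether it’s used is entirely up to the optimiser:

CREATE INDEX idx_t_z_w ON t (z, w);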

While simple examples like these can easily be implemented using Java 8, you will quickly run into Java’s limitations, once you need to do more complex reporting.

Of course, as we’ve blogged before, the optimum is reached when you combine SQL and functional programming.

How can this be written in Java 8?

There are a variety of ways to do it. The essence is to understand all the participants in such a transformation. And no matter if you find this easy or hard, suitable for Java 8 or inadequate, thinking about the different, lesser-known parts of the new Stream API is certainly worth the exercise.

The main participants here are:

  • Stream: If you’re using JDK 8 libraries, then the new java.util.stream.Stream type will be your first choice.
  • Collector: The JDK provides us with a rather low-level and thus very powerful new API for data aggregation (also known as “reduction”). This API is summarised by the new java.util.stream.Collector type, a type we have heard rather little about so far in the blogosphere.

Disclaimer

Some of the code displayed here might not work in your favourite IDE. Unfortunately, even as Java 7 reaches its end of life, all major IDEs (Eclipse, IntelliJ, NetBeans), and even the javac compiler, still have quite a few bugs related to the combination of generic type inference and lambda expressions. Stay tuned until those bugs are fixed! And report any bug you discover. We’ll all thank you for it!

Let’s go!

Let’s review our SQL statement:

SELECT
    z, w, 
    MIN(x), MAX(x), AVG(x), 
    MIN(y), MAX(y), AVG(y) 
FROM table 
GROUP BY z, w;

In terms of the Stream API, the table itself is the Stream. Let’s just assume that we have a “table type” A as such:

class A {
    final int w;
    final int x;
    final int y;
    final int z;

    A(int w, int x, int y, int z) {
        this.w = w;
        this.x = x;
        this.y = y;
        this.z = z;
    }

    @Override
    public String toString() {
        return "A{" +
                "w=" + w +
                ", x=" + x +
                ", y=" + y +
                ", z=" + z +
                '}';
    }
}

You can also add equals() and hashCode() if you must.

We can now easily compose the Stream using Stream.of(), and some sample data:

Stream<A> stream =
Stream.of(
    new A(1, 1, 1, 1),
    new A(1, 2, 3, 1),
    new A(9, 8, 6, 4),
    new A(9, 9, 7, 4),
    new A(2, 3, 4, 5),
    new A(2, 4, 4, 5),
    new A(2, 5, 5, 5));

Now, the next step is to GROUP BY z, w. The Stream API itself, unfortunately, doesn’t contain such a convenience method. We have to resort to more low-level operations by specifying the more general Stream.collect() operation, and passing a Collector to it that does the grouping. Luckily, a variety of different grouping Collectors are already made available from the Collectors helper class.

So we add that to our stream:

Stream.of(
    new A(1, 1, 1, 1),
    new A(1, 2, 3, 1),
    new A(9, 8, 6, 4),
    new A(9, 9, 7, 4),
    new A(2, 3, 4, 5),
    new A(2, 4, 4, 5),
    new A(2, 5, 5, 5))
.collect(Collectors.groupingBy(...));

Now the interesting part starts. How do we specify that we want to group by both A.z and A.w? We need to provide this groupingBy method with a function that can extract something like a SQL tuple from the A type. We could write our own tuple, or we simply use that of jOOλ, a library that we have created and open-sourced to improve our jOOQ integration tests.

The Tuple2 type roughly looks like this:

public class Tuple2<T1, T2> {

    public final T1 v1;
    public final T2 v2;

    public T1 v1() {
        return v1;
    }

    public T2 v2() {
        return v2;
    }

    public Tuple2(T1 v1, T2 v2) {
        this.v1 = v1;
        this.v2 = v2;
    }
}

public interface Tuple {
    static <T1, T2> Tuple2<T1, T2> tuple(T1 v1, T2 v2) {
        return new Tuple2<>(v1, v2);
    }
}

It has many more useful features, but these ones will be sufficient for this article.

On a side-note

Why the JDK doesn’t ship with built-in tuples like C#’s or Scala’s escapes me.

Functional programming without tuples is like coffee without sugar: A bitter punch in your face.

Anyway… back on track

So we’re grouping by the (A.z, A.w) tuple, as we would in SQL:

Map<Tuple2<Integer, Integer>, List<A>> map =
Stream.of(
    new A(1, 1, 1, 1),
    new A(1, 2, 3, 1),
    new A(9, 8, 6, 4),
    new A(9, 9, 7, 4),
    new A(2, 3, 4, 5),
    new A(2, 4, 4, 5),
    new A(2, 5, 5, 5))
.collect(Collectors.groupingBy(
    a -> tuple(a.z, a.w)
));

As you can see, this produces a verbose but very descriptive type, a map containing our grouping tuple as its key, and a list of collected table records as its value.

Running the following statement

map.entrySet().forEach(System.out::println);

will yield:

(1, 1)=[A{w=1, x=1, y=1, z=1}, A{w=1, x=2, y=3, z=1}]
(4, 9)=[A{w=9, x=8, y=6, z=4}, A{w=9, x=9, y=7, z=4}]
(5, 2)=[A{w=2, x=3, y=4, z=5}, A{w=2, x=4, y=4, z=5}, A{w=2, x=5, y=5, z=5}]

That’s already quite awesome! In fact, this behaves like the SQL:2011 standard COLLECT() aggregate function, which is also available in Oracle 10g+.
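
For the record, here’s a sketch of what that might look like in Oracle, using the predefined SYS.ODCINUMBERLIST collection type and a stand-in table name t (in real code, you’d probably define your own nested table type):

SELECT
  z, w,
  CAST(COLLECT(x) AS SYS.ODCINUMBERLIST)
FROM t
GROUP BY z, w;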

Now, instead of actually collecting the A records, we prefer to aggregate the individual values of x and y. The JDK provides us with a couple of interesting new types, e.g. the java.util.IntSummaryStatistics, which is available for convenience again from the Collectors type via Collectors.summarizingInt().

On a side note

For my taste, this sledge-hammer data aggregation technique is a bit quirky. The JDK libraries have been left intentionally low level and verbose, perhaps to keep the library footprint small, or to prevent “horrible” consequences when in 5-10 years (after the release of JDK 9 and 10), it becomes obvious that some features may have been added prematurely.

At the same time, there is this all-or-nothing IntSummaryStatistics, which blindly aggregates these popular aggregation values for your collection:

  • COUNT(*)
  • SUM()
  • MIN()
  • MAX()

and obviously, once you have SUM() and COUNT(*), you also have AVG() = SUM() / COUNT(*). So that’s going to be the Java way. IntSummaryStatistics.

In case you were wondering, the SQL:2011 standard specifies these aggregate functions:

AVG, MAX, MIN, SUM, EVERY, ANY, SOME, COUNT, STDDEV_POP, STDDEV_SAMP, VAR_SAMP, VAR_POP, COLLECT, FUSION, INTERSECTION, COVAR_POP, COVAR_SAMP, CORR, REGR_SLOPE, REGR_INTERCEPT, REGR_COUNT, REGR_R2, REGR_AVGX, REGR_AVGY, REGR_SXX, REGR_SYY, REGR_SXY, PERCENTILE_CONT, PERCENTILE_DISC, ARRAY_AGG

And obviously, there are many other, vendor-specific aggregate and window functions in SQL. We’ve blogged about them all before.

True, MIN, MAX, SUM, COUNT, AVG are certainly the most popular ones. But it would’ve been nicer if they hadn’t been included in these default aggregation types, but made available in a much more composable way.

Anyway… back on track

If you want to stay low-level and use mostly JDK API, you can use the following technique to implement aggregation over two columns:

Map<
    Tuple2<Integer, Integer>, 
    Tuple2<IntSummaryStatistics, IntSummaryStatistics>
> map = Stream.of(
    new A(1, 1, 1, 1),
    new A(1, 2, 3, 1),
    new A(9, 8, 6, 4),
    new A(9, 9, 7, 4),
    new A(2, 3, 4, 5),
    new A(2, 4, 4, 5),
    new A(2, 5, 5, 5))
.collect(Collectors.groupingBy(
    a -> tuple(a.z, a.w),
    Collector.of(

        // When collecting, we'll aggregate data
        // into two IntSummaryStatistics for x and y
        () -> tuple(new IntSummaryStatistics(), 
                    new IntSummaryStatistics()),

        // The accumulator will simply take
        // new t = (x, y) values
        (r, t) -> {
            r.v1.accept(t.x);
            r.v2.accept(t.y);
        },

        // The combiner will merge two partial
        // aggregations, in case this is executed
        // in parallel
        (r1, r2) -> {
            r1.v1.combine(r2.v1);
            r1.v2.combine(r2.v2);

            return r1;
        }
    )
));

map.entrySet().forEach(System.out::println);

The above would now print:

(1, 1)=(IntSummaryStatistics{count=2, sum=3, min=1, average=1.500000, max=2}, 
        IntSummaryStatistics{count=2, sum=4, min=1, average=2.000000, max=3})
(4, 9)=(IntSummaryStatistics{count=2, sum=17, min=8, average=8.500000, max=9}, 
        IntSummaryStatistics{count=2, sum=13, min=6, average=6.500000, max=7})
(5, 2)=(IntSummaryStatistics{count=3, sum=12, min=3, average=4.000000, max=5}, 
        IntSummaryStatistics{count=3, sum=13, min=4, average=4.333333, max=5})

But obviously, no one will want to write that much code. The same thing can be achieved with jOOλ with much less code:

Map<
    Tuple2<Integer, Integer>, 
    Tuple2<IntSummaryStatistics, IntSummaryStatistics>
> map =

// Seq is like a Stream, but sequential only,
// and with more features
Seq.of(
    new A(1, 1, 1, 1),
    new A(1, 2, 3, 1),
    new A(9, 8, 6, 4),
    new A(9, 9, 7, 4),
    new A(2, 3, 4, 5),
    new A(2, 4, 4, 5),
    new A(2, 5, 5, 5))

// Seq.groupBy() is just short for 
// Stream.collect(Collectors.groupingBy(...))
.groupBy(
    a -> tuple(a.z, a.w),

    // ... because once you have tuples, 
    // why not add tuple-collectors?
    Tuple.collectors(
        Collectors.summarizingInt(a -> a.x),
        Collectors.summarizingInt(a -> a.y)
    )
);

What you see above is probably as close as it gets to the original, very simple SQL statement:

SELECT
    z, w, 
    MIN(x), MAX(x), AVG(x), 
    MIN(y), MAX(y), AVG(y) 
FROM table 
GROUP BY z, w;

The interesting part here is the fact that we have what we call “tuple-collectors”, a Collector that collects data into tuples of aggregated results for any degree of the tuple (up to 8). Here’s the code for Tuple.collectors:

// All of these generics... sheesh!
static <T, A1, A2, D1, D2> 
       Collector<T, Tuple2<A1, A2>, Tuple2<D1, D2>> 
collectors(
    Collector<T, A1, D1> collector1
  , Collector<T, A2, D2> collector2
) {
    return Collector.of(
        () -> tuple(
            collector1.supplier().get()
          , collector2.supplier().get()
        ),
        (a, t) -> {
            collector1.accumulator().accept(a.v1, t);
            collector2.accumulator().accept(a.v2, t);
        },
        (a1, a2) -> tuple(
            collector1.combiner().apply(a1.v1, a2.v1)
          , collector2.combiner().apply(a1.v2, a2.v2)
        ),
        a -> tuple(
            collector1.finisher().apply(a.v1)
          , collector2.finisher().apply(a.v2)
        )
    );
}

Where the Tuple2<D1, D2> is the aggregation result type that we derive from collector1 (which provides D1) and from collector2 (which provides D2).

That’s it. We’re done!

Conclusion

Java 8 is a first step towards functional programming in Java. Using Streams and lambda expressions, we can already achieve quite a bit. The JDK APIs, however, are extremely low level, and the experience when using IDEs like Eclipse, IntelliJ, or NetBeans can still be a bit frustrating. While writing this article (and adding the Tuple.collectors() method), I have reported around 10 bugs to the different IDEs. Some javac compiler bugs were not yet fixed prior to JDK 1.8.0_40 ea. In other words:

I just keep throwing generic type parameters at the darn thing until the compiler stops bitching at me

But we’re on a good path. I trust that more useful API will ship with JDK 9 and especially with JDK 10, when all of the above will hopefully profit from the new value types and generic type specialization.

And, of course, if you haven’t already, download and contribute to jOOλ here!

We have created jOOλ to add the missing pieces to the JDK libraries. If you want to go all in on functional programming, i.e. when your vocabulary includes hipster terms (couldn’t resist) like monads, monoids, functors, and all that, we suggest you skip the JDK’s Streams and jOOλ entirely, and go download functionaljava by Mark Perry or javaslang by Daniel Dietrich.

MongoDB “Lightning Fast Aggregation” Challenged with Oracle

What does “Scale” even mean in the context of databases? When talking about scaling, people have jumped to the vendor-induced conclusion that:

  • SQL doesn’t scale
  • NoSQL scales

It is very obvious that NoSQL vendors make such claims. It has also been interesting that many NoSQL consumers made such claims, even if they probably confused SQL in general with MySQL in particular. They then go on about comparing MongoDB with MySQL scalability, which totally makes sense as MySQL is to SQL what MongoDB is to NoSQL.

Let’s get back down to earth…

… because over the last decades, no database has beaten Oracle on the Transaction Processing Performance Council’s benchmarks. We trust that the CERN people made an informed decision when they opted for Oracle Exadata and other Oracle products to manage their immensely huge data, even before huge data was called Big Data.

So let’s make a quick comparison. Recently, Vlad Mihalcea blogged about “lightning speed aggregation” (with MongoDB). He took a data set of 50 million records of the form:

created_on           | value
-------------------------------------------
2012-05-02T06:08:47Z | 0.9270193106494844
2012-09-06T22:40:25Z | 0.005334891844540834
2012-06-15T05:58:22Z | 0.05611344985663891
...                  | ...

These are random timestamps and random floats. He then aggregated this data with the following query:

var dataSet = db.randomData.aggregate([
    {
        $group: {
                "_id": {
                    "year" : {
                        $year : "$created_on"
                    },
                    "dayOfYear" : {
                        $dayOfYear : "$created_on"
                    }
                },
                "count": {
                    $sum: 1
                },
                "avg": {
                    $avg: "$value"
                },
                "min": {
                    $min: "$value"
                },
                "max": {
                    $max: "$value"
                }
            }
    },
    {
        $sort: {
            "_id.year" : 1,
            "_id.dayOfYear" : 1
        }
    }
]);

And got a “mind-numbing”, lightning speed aggregation result:

Aggregation took:129.052s

129s for a medium-sized table with 50M records is lightning speed for MongoDB? Alright, we thought. Let’s try this with Oracle. Vlad had the courtesy to provide us with the sample data, which we imported into the following trivial Oracle table:

CREATE TABLESPACE aggregation_test
  DATAFILE 'aggregation_test.dbf'
  SIZE 2000M ONLINE;

CREATE TABLE aggregation_test (
  created_on TIMESTAMP NOT NULL,
  value NUMBER(22, 20) NOT NULL
)
TABLESPACE aggregation_test;

Now, let’s load that data with sqlldr into my single-core licenced Oracle XE 11gR2 instance:

OPTIONS(skip=1)
LOAD DATA
  INFILE randomData.csv
  APPEND
INTO TABLE aggregation_test
  FIELDS TERMINATED BY ','
(
  created_on DATE
    "YYYY-MM-DD\"T\"HH24:MI:SS\"Z\"",
  value TERMINATED BY WHITESPACE
    "to_number(ltrim(rtrim(replace(:value,'.',','))))"
)

And then:

C:\oraclexe\app\oracle\product\11.2.0\server\bin\sqlldr.exe 
  userid=TEST/TEST 
  control=randomData.txt 
  log=randomData.log 
  parallel=true 
  silent=feedback
  bindsize=512000 
  direct=true

Loading the records took a while on my computer with an old disk:

Elapsed time was:     00:03:07.70
CPU time was:         00:01:09.82

I might be able to tune this one way or another, as SQL*Loader isn’t the fastest tool for the job. But let’s challenge the 129s for the aggregation. So far, we haven’t specified any index, but that isn’t necessary, as we’re aggregating the whole table with all 50M records. Let’s do this with the following query, equivalent to Vlad’s:

SELECT
  EXTRACT(YEAR FROM created_on),
  TO_CHAR(created_on, 'DDD'),
  COUNT(*),
  AVG(value),
  MIN(value),
  MAX(value)
FROM aggregation_test
GROUP BY
  EXTRACT(YEAR FROM created_on),
  TO_CHAR(created_on, 'DDD')
ORDER BY
  EXTRACT(YEAR FROM created_on),
  TO_CHAR(created_on, 'DDD')

This took 32s on my machine with an obvious full table scan. Not impressive. Let’s try Vlad’s other query, filtering on a single hour, which executed in 209ms in his benchmark (which, we believe, is really not fast at all):

var dataSet = db.randomData.aggregate([
    {
        $match: {
            "created_on" : {
                $gte: fromDate,
                $lt : toDate
            }
        }
    },
    {
        $group: {
                "_id": {
                    "year" : {
                        $year : "$created_on"
                    },
                    "dayOfYear" : {
                        $dayOfYear : "$created_on"
                    },
                    "hour" : {
                        $hour : "$created_on"
                    }
                },
                "count": {
                    $sum: 1
                },
                "avg": {
                    $avg: "$value"
                },
                "min": {
                    $min: "$value"
                },
                "max": {
                    $max: "$value"
                }
            }
    },
    {
        $sort: {
            "_id.year" : 1,
            "_id.dayOfYear" : 1,
            "_id.hour" : 1
        }
    }
]);

Vlad generated a random date, which he logged as

Aggregating 
from Mon Jul 16 2012 00:00:00 GMT+0300
  to Mon Jul 16 2012 01:00:00 GMT+0300

So let’s use exactly the same dates. But first, we should create an index on AGGREGATION_TEST(CREATED_ON):

CREATE TABLESPACE aggregation_test_index
  DATAFILE 'aggregation_test_index1.dbf'
  SIZE 2000M ONLINE;

CREATE INDEX idx_created_on
ON aggregation_test(created_on)
TABLESPACE aggregation_test_index;

OK. Now let’s run the Oracle equivalent of Vlad’s query (note, there’s a new column in the GROUP BY, SELECT, ORDER BY clauses):

SELECT
  EXTRACT(YEAR FROM created_on),
  TO_CHAR(created_on, 'DDD'),
  EXTRACT(HOUR FROM created_on),
  COUNT(*),
  AVG(value),
  MIN(value),
  MAX(value)
FROM aggregation_test
WHERE created_on
  BETWEEN TIMESTAMP '2012-07-16 00:00:00.0'
  AND     TIMESTAMP '2012-07-16 01:00:00.0'
GROUP BY
  EXTRACT(YEAR FROM created_on),
  TO_CHAR(created_on, 'DDD'),
  EXTRACT(HOUR FROM created_on)
ORDER BY
  EXTRACT(YEAR FROM created_on),
  TO_CHAR(created_on, 'DDD'),
  EXTRACT(HOUR FROM created_on);

This took 20s in a first run. To be fair, before running this statement, I cleared some caches. I believe that Vlad’s MongoDB was already “warmed up” for the second query:

alter system flush shared_pool;
alter system flush buffer_cache;

Because when I ran the same query again, it only took 0.02 seconds, as Oracle’s buffer cache kicked in, preventing actual disk access. I might have tuned my Oracle instance in a way that keeps the whole table in memory from the first query, in which case Oracle would have beaten MongoDB by an order of magnitude.
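
Such tuning might be as simple as assigning the table to the KEEP buffer pool. A sketch; whether this pays off depends on the pool’s size and your workload:

ALTER TABLE aggregation_test STORAGE (BUFFER_POOL KEEP);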

Another option is to tune indexing. Let’s remove our existing index and replace it with a “covering” index like this:

DROP INDEX idx_created_on;

CREATE INDEX idx_created_on_value
ON aggregation_test(created_on, value)
TABLESPACE aggregation_test_index;

Let’s again flush caches:

alter system flush shared_pool;
alter system flush buffer_cache;

And run the query…

  • First execution: 0.5s
  • Second execution: 0.005s

From the execution plan, I can now tell that the query didn’t need to access the table any longer. All relevant data was contained in the index:

[Screenshot: execution plan]

Now that starts getting impressive, right?

So, what conclusion can we draw from this?

Here are a couple of findings:

Don’t jump to conclusions

First off, we should be careful with jumping to any conclusion at all. We are comparing apples with oranges. We’d even be comparing apples with oranges when comparing Oracle with SQL Server. Both benchmarks were quickly written, hacked-up benchmarks that do not represent any productive situations. What we can say already now is that there is no clear winner when aggregating data in a 50M-record table, but Oracle seems to catch up very quickly, in a benchmark that pitted MongoDB on an SSD against Oracle on an HDD!

To be fair, my computer has a better CPU than Vlad’s:

[Screenshot: My Computer’s system specs]

[Screenshot: Vlad’s Computer’s system specs]

Both SQL and NoSQL can scale

Much of the NoSQL scaling debate is FUD. People always want to believe vendors who tend to say “their product doesn’t do this and that”. Don’t buy it immediately. Run a smallish benchmark like the above and see for yourself: scaling up is not a problem for either SQL or NoSQL databases. Don’t be blinded by Oracle’s initial “slow” query execution. Oracle is a very sophisticated database that can tune itself according to actual need, thanks to its cost-based optimiser and its statistics.

And nothing keeps you (and your DBA) from using all of your tool’s features. In this example, we have only scratched the surface. For instance, we might apply IOT (index-organised tables) or table partitioning, if you’re using Oracle Enterprise Edition.
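
For instance, a range-partitioned variant of the test table might look like this (a sketch; partitioning requires Oracle Enterprise Edition with the Partitioning option):

CREATE TABLE aggregation_test (
  created_on TIMESTAMP NOT NULL,
  value      NUMBER(22, 20) NOT NULL
)
PARTITION BY RANGE (created_on) (
  PARTITION p2012 VALUES LESS THAN (TIMESTAMP '2013-01-01 00:00:00'),
  PARTITION pmax  VALUES LESS THAN (MAXVALUE)
);

Queries filtering on created_on could then prune irrelevant partitions up front (with finer-grained partitions, e.g. monthly ones, this becomes more interesting).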

50M is not Big Data

CERN has Big Data. Google does. Facebook does. You don’t. 50M is not “Big Data”. It’s just your average database table.

200ms is not fast

Obviously, you might not care too much how long it takes to prepare your offline report on a dedicated batch reporting server. But if you have a real user on a real UI waiting for it, 200ms is not fast, because they might be running 20 of these queries. And there might be 10’000 of these users. In a real OLTP system, having such reporting queries around means that even Oracle’s 0.005s isn’t fast!
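
A quick back-of-the-envelope calculation with those (made-up) figures:

20 queries × 0.2s         = 4s of query time per user action
10'000 users × 20 queries = 200'000 query executions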

So, in other words…

… don’t jump the SQL ship just yet!