10 Nice Examples of Writing SQL in Kotlin With jOOQ

Kotlin is the next big thing. With Google announcing official support for Kotlin on Android, we’ll see a lot more traction for this lovely language.

We’ve already blogged about the Kotlin language recently: 10 Features I Wish Java Would Steal From the Kotlin Language.

Should you migrate your application to Kotlin? Perhaps – there are many questions. The most important question is (of course): What are you going to do with your awesome SQL code?

The answer is: Use jOOQ! Need convincing? Here are 10 nice examples of writing SQL in Kotlin with jOOQ:

(Note: All the code is available on GitHub)

1. Using Kotlin’s Map Access Literals With jOOQ Records

This is a free feature that you’re getting from Kotlin thanks to their smart move of supporting certain operators on custom types by convention. Check out how easy it is to access a value from a record using a column reference:

val a = AUTHOR
val b = BOOK

ctx.select(a.FIRST_NAME, a.LAST_NAME, b.TITLE)
   .from(a)
   .join(b).on(a.ID.eq(b.AUTHOR_ID))
   .orderBy(1, 2, 3)
   .forEach {
       println("${it[b.TITLE]} by "
             + "${it[a.FIRST_NAME]} ${it[a.LAST_NAME]}")
   }

The output from our sample code is this:

1984 by George Orwell
Animal Farm by George Orwell
Brida by Paulo Coelho
O Alquimista by Paulo Coelho

We’re already using a couple of useful language features here. First off:

val for local variable type inference

We’re renaming AUTHOR to a and BOOK to b in our code to shorten table names a little bit. Using val we don’t even need to explicitly refer to the type of those tables.

As it looks right now, we’re going to get that feature in the Java language as well, so stay tuned.

(More about val in item #3)

More info about this feature here.

String interpolation

As you can see, inside of the String literal, we’re using the ${dollar.notation} to access variables from the context.
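
For the record, a minimal standalone example of the syntax:

val language = "Kotlin"

// $variable references a variable, ${...} embeds an arbitrary expression
println("Hello $language, 1 + 2 = ${1 + 2}")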

More info about this feature here.

(More about string interpolation in item #5)

Implicit lambda parameter “it”

When writing single-parameter lambda expressions, which is typical for loops, mapping, and many other standard library features, we don’t need to actually name that parameter in Kotlin. The default name is “it”. Groovy developers will rejoice, as they can profit from the same feature.
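
For instance, these two loops are equivalent:

// Explicitly named lambda parameter
listOf(1, 2, 3).forEach { i -> println(i) }

// Implicit parameter name "it"
listOf(1, 2, 3).forEach { println(it) }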

More info about this feature here.

Finally: Operator overloading

Kotlin supports a nice and pragmatic approach to operator overloading. We’re making use of the fact that the following two things mean exactly the same in Kotlin:

val x1 = anything[key]
val x2 = anything.get(key)

anything[key] = value
anything.set(key, value)

Yeah, why not? In jOOQ, we saw this coming as early as version 3.9, when we added get() and set() methods to our ubiquitous Record type, so you can access record values by:

  • Field reference (with type safety)
  • Column name (no type safety)
  • Index (no type safety)
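
For illustration, here’s a minimal sketch of the three access styles, assuming a record r fetched from the query in the first example:

// r is a jOOQ Record from select(a.FIRST_NAME, a.LAST_NAME, b.TITLE)
val byField = r[b.TITLE]   // Field reference, type safe: String
val byName  = r["TITLE"]   // Column name, no type safety
val byIndex = r[2]         // Zero-based index, no type safety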

Note that operator overloading can be used as well with a couple of arithmetic operators:

ctx.select(TRANSACTION.AMOUNT + 10) // Translates to Field.plus(10)
   .from(TRANSACTION)
   .fetch()

More info about this feature here.

2. Loop Over Record Contents

Another really nice feature of the language is destructuring of “tuples” into several local variables. Quite a few languages support this now, and so does Kotlin, even without first-level language support for real tuples. But often, we don’t actually need tuples, we just need their syntax. This can be done with any Map, for instance. Consider this code:

for (r in ctx.fetch(b))
    for ((k, v) in r.intoMap())
        println("${r[b.ID]}: ${k.padEnd(20)} = $v")

The output from our sample code is this:

1: ID                   = 1
1: AUTHOR_ID            = 1
1: TITLE                = 1984
1: PUBLISHED_IN         = 1948
2: ID                   = 2
2: AUTHOR_ID            = 1
2: TITLE                = Animal Farm
2: PUBLISHED_IN         = 1945
3: ID                   = 3
3: AUTHOR_ID            = 2
3: TITLE                = O Alquimista
3: PUBLISHED_IN         = 1988
4: ID                   = 4
4: AUTHOR_ID            = 2
4: TITLE                = Brida
4: PUBLISHED_IN         = 1990

This is really nice! Every jOOQ record can be represented as a Map using Record.intoMap(), which can be destructured automatically in loops as can be seen above.

More info about this feature here.

3. Local Variable Type Inference

An API that makes use of generics as heavily as jOOQ can occasionally put quite some burden on the user. Type safety is great for the DSL; e.g. in jOOQ, the Java / Kotlin compilers can type-check things like:

// Type checks: Both sides are of type string
AUTHOR.FIRST_NAME.in("Alice", "Bob");

// Type checks: 
// - Both sides are of type string
// - Both sides are of degree one
AUTHOR.FIRST_NAME.in(
  select(CUSTOMER.FIRST_NAME).from(CUSTOMER)
);

// Doesn't type check: Wrong degree
AUTHOR.FIRST_NAME.in(
  select(CUSTOMER.FIRST_NAME, CUSTOMER.LAST_NAME).from(CUSTOMER)
)

This works through fancy types like:

Select<Record2<String, String>> subselect =
  select(CUSTOMER.FIRST_NAME, CUSTOMER.LAST_NAME).from(CUSTOMER);

These types can get quite hairy, as the supported degree goes up to 22 (higher degrees are supported without type safety). So, in Kotlin, we can simply write:

val subselect =
  select(CUSTOMER.FIRST_NAME, CUSTOMER.LAST_NAME).from(CUSTOMER)

// Still doesn't type check:
AUTHOR.FIRST_NAME.in(subselect)

// This works:
row(AUTHOR.FIRST_NAME, AUTHOR.LAST_NAME).in(subselect)

More info about this feature here.

4. Inline Functions for New jOOQ Features

Are you missing a feature in jOOQ, for instance the PostgreSQL ilike operator? Well, it is supported through the Field.likeIgnoreCase() method, but you might prefer the more native looking ilike operator. No problem with Kotlin. Just write an extension function (marked inline here; the inlining is a separate concern from it being an extension function):

// An extension function on any Field<String>, using jOOQ's plain SQL templating
inline fun <F : Field<String>> F.ilike(pattern: String): Condition {
    return condition("{0} ilike {1}", this, pattern)
}

This extension function makes use of jOOQ’s plain SQL API, which is heavily used by Java developers as well. The inline keyword simply instructs the compiler to inline the function body at each call site. We can now write a query like this:

println("${ctx.select(b.TITLE)
              .from(b)
              .where(b.TITLE.ilike("%animal%"))
              .fetchOne(b.TITLE)}")

The result being:

Animal Farm

Notice how we can simply write b.TITLE.ilike("..") even if the jOOQ API doesn’t have such a method!

More info about this feature here.

5. String Interpolation and Multiline Strings

How many times have we wished for multiline strings in Java? So many other languages have it, even SQL and PL/SQL. When working with string-based embedded SQL, this is just a killer feature. Good for Kotlin! Not only does Kotlin support multiline strings, but also string interpolation.

Here’s how to use the jOOQ plain SQL API with Kotlin:

ctx.resultQuery("""
    SELECT *
    FROM (
      VALUES (1, 'a'),
             (2, 'b')
    ) AS t
    """)
    .fetch()
    .forEach {
        println("${it.intoMap()}")
    }

Just copy-paste any arbitrary SQL statement right from your SQL console (in this case: H2 syntax) and you’re done. The output of the above is:

{C1=1, C2=a}
{C1=2, C2=b}

Bonus feature if you’re using the upcoming jOOQ 3.10 parser: You can standardise on SQL features that are not available from your database:

val colX = field("x")
val colY = field("y")
ctx.parser()
   .parseResultQuery("""
    SELECT *
    FROM (
      VALUES (1, 'a'),
             (2, 'b')
      -- This feature (derived column lists) 
      -- isn't really available in H2 yet it works!
    ) AS t(${colX.name}, ${colY.name})
    """)
   .fetch()
   .forEach {
       println("${it[colX]}, ${it[colY]}")
   }

The above statement makes use of derived column lists (renaming tables and columns in one go). Unfortunately, H2 doesn’t support this feature but jOOQ can emulate it, transparently. Note that we’re again using string interpolation, this time in order to reuse column names.
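
In plain SQL, a derived column list looks like this (a standalone sketch; PostgreSQL, for instance, supports this syntax natively):

-- The derived table is named t, and its columns are renamed to x and y
SELECT t.x, t.y
FROM (VALUES (1, 'a'), (2, 'b')) AS t(x, y)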

More info about the feature here.

6. Null Safe Dereferencing

Null is the mother of all bikesheds. Of course, Kotlin is opinionated about null as well.

You’re not sure if your query will produce a result? No problem with Kotlin’s various null-related operators:

println("${ctx.fetchOne(b, b.ID.eq(5))?.title ?: "not found"}")

The above will print:

not found

Why? Because fetchOne() won’t return a record from our database, it will return null. We cannot dereference ?.title from null, but the ?. operator won’t throw an exception, it will just return null again. Finally, the Elvis operator will provide a default value if the expression on the left of it is null. Perfect!
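
Broken into individual steps (a sketch, reusing the b alias for BOOK), the same expression reads like this:

val book = ctx.fetchOne(b, b.ID.eq(5))  // null: there is no book with ID 5
val title = book?.title                 // safe call: null, because book is null
val display = title ?: "not found"      // Elvis operator: supplies the default
println(display)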

More info about the feature here.

7. Getters and setters feel like properties

Remember the last 20 years when we emulated properties in Java using the obnoxious JavaBeans convention? Tired of writing getters and setters? Well, at least, with Kotlin, it pays off. We can syntactically pretend that we have a property where there are only getters and setters. Check this out:

val author1 = ctx.selectFrom(a).where(a.ID.eq(1)).fetchOne()
println("${author1.firstName} ${author1.lastName}")

val author2 = ctx.newRecord(a)
author2.firstName = "Alice"
author2.lastName = "Doe"
author2.store()
println("${author2.firstName} ${author2.lastName}")

Isn’t that awesome? We can just dereference author1.firstName, which simply translates to author1.getFirstName(). Likewise, we can assign a value to author2.lastName = "Doe", which simply translates to author2.setLastName("Doe").

It seems so obvious, no?

More info about the feature here.

8. With to Abbreviate Setter Calls

We can take this feature one step further and use the nice with() inline function (not a keyword!). In the above example, when storing a new author, we have to repeat the author reference all the time. This is no longer true with… with!

val author3 = ctx.newRecord(a)
with (author3) {
    firstName = "Bob"
    lastName = "Doe"
    store()
}

All the code that is inside of the block supplied to the with() function is going to operate on the with() function’s first argument. We already know this interesting feature from JavaScript, but in Kotlin, it’s typesafe!

And notice, the store() call is also affected!

More info about this feature here.

9. With to Locally Import Tables

We can also use with() with queries! In jOOQ, columns are instance members inside of tables, and as such, must be dereferenced from a concrete table. That can be tedious (SQL doesn’t have this requirement), especially when selecting only from a single table. You can turn on static member generation in jOOQ’s code generator, but then table aliasing becomes much harder.

Or you use Kotlin again! The with() function can help:

with (AUTHOR) {
    ctx.select(FIRST_NAME, LAST_NAME)
       .from(AUTHOR)
       .where(ID.lt(5))
       .orderBy(ID)
       .fetch {
           println("${it[FIRST_NAME]} ${it[LAST_NAME]}")
       }
}

Resulting in

George Orwell
Paulo Coelho

Isn’t that nice? Now, all the FIRST_NAME, LAST_NAME, ID column references are dereferenced from AUTHOR in a much more concise way.

More info about this feature here.

10. Destructuring jOOQ records into sets of local variables

This is so great, we’re going to add more native support for it in jOOQ 3.10 right away.

Kotlin’s destructuring feature is implemented by convention. This means that any type that has methods with the name componentN() can be destructured. We currently don’t have those methods on jOOQ records, but they can be added really easily using operator extension functions:

operator fun <T, R : Record3<T, *, *>> R.component1(): T {
    return this.value1()
}

operator fun <T, R : Record3<*, T, *>> R.component2(): T {
    return this.value2()
}

operator fun <T, R : Record3<*, *, T>> R.component3(): T {
    return this.value3()
}

These operator extension functions are added to the R type, which is any subtype of Record3, and then they can be used as such:

for ((first, last, title) in ctx
   .select(a.FIRST_NAME, a.LAST_NAME, b.TITLE)
   .from(a)
   .join(b).on(a.ID.eq(b.AUTHOR_ID))
   .orderBy(1, 2, 3))
       println("$title by $first $last")

The result is again:

1984 by George Orwell
Animal Farm by George Orwell
Brida by Paulo Coelho
O Alquimista by Paulo Coelho

As can be seen, our query returns the type Record3<String, String, String>. Our inline functions will add component1(), component2(), and component3() methods to the type, which are used by the Kotlin language to destructure the record into three local loop variables: (first, last, title).

This is incredibly useful!

(note also that we don’t have to execute the query explicitly in the loop because any jOOQ ResultQuery is Iterable)

More info about this feature here.

Conclusion

Kotlin is getting increasingly popular. It contains a lot of smart yet pragmatic features that make daily work with programming on the JVM a lot easier. Kotlin’s high focus on Java interoperability makes it really really powerful, and working with Kotlin and jOOQ is a really productive way of writing SQL on the JVM.

Do give it a shot. The sources are here:

https://github.com/jOOQ/jOOQ/tree/master/jOOQ-examples/jOOQ-kotlin-example

10 Tips on How to be a Great Programmer

I was recently asked in an interview about my opinion on how to be a great programmer. That’s an interesting question, and I think we can all be great programmers, regardless of our talent, if we follow a couple of rules that – I believe – should be common sense. In fact, these rules don’t all apply to programmers only, but to any professional.

Not everything in this list is meant to be taken entirely seriously, some things are just my opinions and yours may vary, and not all descriptions of programmer personalities match real-world situations I may have encountered, so if in doubt, please do not take offense. I didn’t mean you 🙂

Here they are:

1. Learn how to ask questions

There are essentially these types of programmers asking questions:

  • The perfectionist: Especially when asking a question about some open source tool, they may have already debugged through the code and found the real cause of their problem. But even if not, the perfectionist will write an introduction to the problem, steps to reproduce, potentially a suggested workaround, and as I said, potentially a suggested fix. In fact, the perfectionist doesn’t have questions. Only answers.
  • The chatterbox: This person will not really ask questions. Rather, they arrange their thoughts publicly and occasionally put a rhetorical question mark here and there. What may appear to be a question is really just a stream of thoughts and if you wait with the answer, they may either find the answer themselves, or ask the real question in email number 37. Or not. Or maybe, if we try it this way. You know what? It turned out that the requirement is completely wrong, I solved it with some other technique. Oh, actually, I changed libraries entirely. Hehe. No more questions.
  • The slacker: Here’s the code. What’s wrong? Halp plz.
  • The manager: For this type, time is money. The questions must be short and the answers ASAP. There’s something ironic about this approach, because by keeping the questions short (read: incomplete, not concise), most often, a lot of important details are missing, which only leads to requests for details. Then, the manager (naturally being disappointed because the reply is not an answer but a new question) will send yet again a short message and demand an answer with even more urgency. This can go back and forth for quite a while. In the end, it may take 1-2 weeks until what could’ve been the answer will actually be answered.
  • The complainer: This person doesn’t ask questions. They complain. Until the question goes away. Perhaps. If it doesn’t, all the better. More reasons to complain.

By now, it should be clear that a well prepared question (concise, simple, short, yet with enough details) will yield much better answers. If you know what exactly you want to learn with your question, chances are, you’ll get exactly what you wanted.

2. Learn how to avoid asking questions

In fact, mostly, it’s better to try to avoid asking questions. Perhaps you can figure it out yourself? Not always, of course. Many things you simply cannot know, and by asking a domain expert, you’ll find the quickest and most efficient path to success. But often, trying something yourself has these benefits:

  • You learn it the “hard way” which is the way that sticks to our memory much better – we’ll remember what we’ve learned
  • It’s more rewarding to figure out stuff yourself
  • You don’t create “noise”. Remember the chatterbox? Unless the person you’re asking the question(s) is routinely answering questions (and thus postponing their answer), they may not see through your stream of thoughts and try to answer each individual incomplete “question”. That doesn’t help anyone.
  • By postponing a question (for a while, at least), you can gather more relevant information that you can provide to someone who might be able to answer your question. Think of the “perfectionist”. Spend more time looking for details first, then ask the question.
  • You’ll get better at asking questions by training to ask good questions. And that takes a bit more time.

3. Don’t leave broken windows

There was a very interesting article on reddit recently, about not leaving broken windows. The essence of the article is to never compromise on quality. To never become a slacker. To never leave … broken windows. Here’s a quote from the article:

When we take certain shortcuts to deliver something in the shortest period of time, and our code reflects how careless we’ve been, developers coming after us (from the same team, a future team, and even ourselves!), will reach an important conclusion: it’s not important to pay enough attention to the code we produce. Little by little our application will start deteriorating in what becomes an unstoppable process.

This isn’t about being a perfectionist, in fact. Sometimes, fixing broken windows can be postponed (and must be, of course). Oftentimes, by allowing windows to break and stay broken, we’re introducing a situation where no one is happy. Not us the programmers, not our customers, not our users, not our project managers. This is an attitude thing and thus at the core of being professional. How did Benjamin Franklin put it?

The bitterness of poor quality remains long after the sweetness of low price is forgotten

That’s true for everything. In this case, “low price” is the quick win we might get by implementing something in a sloppy way.

4. Software ought to be deterministic. Aim for it!

In an ideal world, everything in software is “deterministic”. We’d all be functional programmers, writing side-effect free, pure functions. Like String.contains(). No matter how many times you execute the following:

assertTrue("abcde".contains("bc"));

… the result is always the same, expected result. The universe and all its statefulness will have absolutely no impact on this computation. It’s deterministic.

We can do that, too, in our own programs, not just in standard libraries. We can try to write side-effect free, deterministic modules as often as possible. This isn’t really a matter of what technology we choose. Deterministic programming can be done in any language – even if functional languages have more tools to prevent accidental side-effects through more sophisticated type systems. But the example I’ve shown is a Java example. Object orientation permits determinism. Heck, procedural languages like PL/SQL allow for determinism. It’s a requirement for a function to be deterministic, if it is to be used in an index:

CREATE INDEX upper_first_name ON customer (upper (first_name));
-- Deterministic function here: -----------^^^^^^^^^^^^^^^^^^
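
To make the distinction concrete, here’s a contrived Kotlin sketch (not from the original examples) contrasting the two kinds of functions:

// Deterministic: same input always yields the same output, no side-effects
fun upperFirstName(firstName: String): String = firstName.uppercase()

// Non-deterministic: the result depends on hidden, mutable state
var counter = 0
fun nextLabel(firstName: String): String = "$firstName-${counter++}"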

So again, this is a matter of discipline. You could see a side-effectful procedure / method / “function” as a “broken window”. Perhaps it was easier to maintain a side-effect, hoping that eventually, it can be removed. But that’s usually a lie. The price will be paid later, when non-determinism suddenly strikes. And it will.

5. Expect the unexpected

That previous point is key: Murphy’s Law is something we programmers should observe all the time. Everything can break. And it will. Come on, as software engineers, we know it will break. Because our world is not deterministic, neither are the business requirements that we’re implementing. We can implement tip #4 (determinism) only until it is no longer possible. From then on, we’ll inevitably enter the world of non-determinism (the “real world”), where stuff will go wrong. So, aim for it. Expect the unexpected. Train your inner pessimist to foresee all kinds of things.

How to write that pessimistic code in a concise manner is another story, of course. And how to distinguish things that will fail (and need to be dealt with) from things that might fail (and don’t need to be dealt with), that takes some practice 🙂

6. Never cargo cult. Never follow dogma. Always embrace: “It depends”

Everything you’ve been taught is potentially wrong. Including (or probably: especially) when someone really popular says it. Here’s a nice quote:

Our profession is full of hypocrisy. We like to think of ourselves as mathematicians, where only the purest of ideas persist, and they’re necessarily correct.

That’s wrong. Our profession is built on top of mathematics, but unless you go into the funky world of category theory or relational algebra (and even then, I doubt everything is “correct”), you’re in the pragmatic world of real-world business requirements. And that’s, quite frankly, very far from perfect. Let’s look at some of the most popular programming languages out there:

  • C
  • Java
  • SQL

Do you really think that these languages resemble mathematics in the least bit? If so, let’s discuss segmentation faults, Java generics, or SQL three-valued logic. These languages are platforms built by pragmatists. There is some really cool theoretical background in all of them, but ultimately, they had to do the job.

The same is true for everything built on top of languages. Libraries. Frameworks. Heck, even architectures. Design patterns. Nothing is right or wrong. Everything is a tool that was designed for some context. Think of the tool in its context. Never think of the tool as a standalone raison d’être. We’re not doing art for art’s sake.

So, say no to unquestioned:

  • XML
  • JSON
  • Functional programming
  • Object oriented programming
  • Design patterns
  • Microservices
  • Three tier architectures
  • DDD
  • TDD
  • In fact: *DD

All of these are nice tools given a certain context. They don’t always hold true. By staying curious and thinking out of the box, you’ll be a better programmer and know when to use which one of these tools.

7. Do it

True. There are luminaries out there who outperform everyone.

But most programmers are simply “good”. Or they have the potential to be “good”. How to be a good programmer? By doing it. Great software wasn’t written in a day, and popular people aren’t the only heroes of our time. I’ve met many great programmers no one knows about, because they lead private lives, solve private problems of small companies.

But great programmers all have one thing in common: They just do it. They practice. They work every day to get better and better.

Want to get better at SQL? Do it! Try to write a SQL statement with some new feature in it, every day. Use window functions. Grouping sets. Recursion. Partitioned outer join. The MODEL and/or MATCH_RECOGNIZE clauses. It doesn’t have to ship to production every time, but the practice will be worth it.
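
For instance, a first window function exercise against the Sakila database used elsewhere in this post might look like this (a made-up practice query, nothing more):

-- Rank all films by length, longest first
SELECT title, length,
       rank() OVER (ORDER BY length DESC) AS length_rank
FROM film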

8. Focus on one subject (in the long run)

There might be only very few “good” full stack developers out there. Most full stack developers will rather be mediocre at everything. Sure, a small team might need only a few of them and they can cover a lot of business logic to quickly bootstrap a piece of new software. But everything will be quite sloppy and “kinda works”. Perhaps that’s good enough for the minimum viable product phase, but in the long run, there will be more complex problems that a full stack developer will not have the time to properly analyse (or foresee!).

Focus on one subject mostly, and get really good at that. A specialist will always be in demand as long as that specialist’s niche exists, and many niches will outlive us all (hello COBOL or SQL). So, do your career a favour and do something really well, rather than many things “just ok”.

9. Play

While you should mostly focus on one subject, you shouldn’t forget the other subjects completely. You may never be really really really good at all of SQL, scaling up, scaling out, low level performance, CSS (who’s good at that anyway!?), object orientation, requirements engineering, architecture, etc. all at once (see tip #8). It’s just not possible.

But you should at least understand the essence of each of these. You should understand when SQL is the right choice (and when it isn’t). When low level performance tuning is important (and when it isn’t). How CSS works in principle. The advantage of object orientation, FP, etc.

You should spend some time playing with these (and many more) concepts and technologies, to better understand why they’re important. To know when to apply them, and then perhaps to find an expert to actually execute the work.

By playing around with new paradigms and technologies, you’ll open up your mind to entirely different ways of thinking, and chances are, you’ll be able to use that in your everyday work in one way or another.

10. Keep it simple, stupid

Albert Einstein said:

Everything should be made as simple as possible, but no simpler

No one is able to handle huge complexity. Not in software, not in any other aspect of life. Complexity is the killer of good software and thus simplicity is the enabler. Easy to understand. Hard to implement. Simplicity is something that takes a lot of time and practice to recognise and produce. There are many rules you can follow, of course.

One of the simplest rules is to have methods/functions with only a few parameters. Let’s look at that. Surely, the aforementioned String.contains() method qualifies. We can write "abcde".contains("bcd") and without reading any documentation, everyone will immediately understand what this does and why. The method does one thing and one thing only. There are no complicated context/settings/other arguments that can be passed to the method. There are no “special cases”, nor any caveats.
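
To contrast, here’s a contrived Kotlin sketch of a method with too many knobs:

// The call site tells us nothing (dummy body, for illustration only)
fun search(text: String, q: String, cs: Boolean, ww: Boolean, rx: Boolean): Boolean =
    text.contains(q, ignoreCase = !cs)

val opaque = search("abcde", "bcd", true, false, false) // what do these flags mean?

// The simple alternative: one thing, and one thing only
val simple = "abcde".contains("bcd")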

Again, producing simplicity in a library is much easier than doing that in business logic. Can we achieve it? Perhaps. By practicing. By refactoring. But like great software, simplicity is not built in a day.

(Protip: Conway’s Law applies. It is utterly impossible to write good, simple software in an environment where the business is super complex. Either you like complexity and ugly legacy, or you’d better get out of that business.)

The Difficulty of Tuning Queries Over a Database Link – Or How I Learned to Stop Worrying and Love the DUAL@LINK Table

A large-ish customer in banking (largest tables on that particular system: ~1 billion rows) once decided to separate the OLTP database from the “log database” in order to better use resources and prevent contention on some tables, as the append-only log database is used heavily for analytic querying of all sorts. That seems to make perfect sense. Except that sometimes, joins need to be done between “main database” and “log database” tables. This is when things get really hard to tune in Oracle – and probably in other databases too.

In this article, however, I’d like to focus on a much simpler example. One that seems to cause no trouble to the optimiser because all joined tables are from the “log database” only. Let’s use the following setup:

-- This is the database link
CREATE PUBLIC DATABASE LINK LOOPBACK 
CONNECT TO TEST IDENTIFIED BY TEST 
USING 'ORCLCDB';

-- Just making sure we get all statistics in execution plans
ALTER SESSION SET statistics_level = ALL;

And then, create this schema:

CREATE TABLE t (
  a INT NOT NULL,
  b INT NOT NULL,
  CONSTRAINT pk_t PRIMARY KEY (a)
);
CREATE TABLE u (
  a INT NOT NULL,
  b INT NOT NULL,
  CONSTRAINT pk_u PRIMARY KEY (a)
);

INSERT INTO t
SELECT
  level,
  level
FROM dual
CONNECT BY level <= 500000;

INSERT INTO u
SELECT
  level,
  level
FROM dual
CONNECT BY level <= 500000;

CREATE INDEX i_t ON t(b);

ALTER TABLE u ADD CONSTRAINT fk_u FOREIGN KEY (a) REFERENCES t;

EXEC dbms_stats.gather_table_stats('TEST', 'T');
EXEC dbms_stats.gather_table_stats('TEST', 'U');

It’s a really boring emulation of the real schema, and it doesn’t have nearly as many columns / rows. But the essence is:

  • There are (at least) two tables
  • Both have quite a few rows (that’s important here. I’ll show why, later)
  • We’ll use an index for searching rows
  • We’ll join by a one-to-many relationship

There may be other setups to reproduce the same issue, of course.

Now, let’s consider the following query (not using the database link yet).

SELECT CASE WHEN EXISTS (
  SELECT *
  FROM t 
  JOIN u USING (a)
  WHERE t.b BETWEEN 0 AND 1000
) THEN 1 ELSE 0 END
FROM dual

Unfortunately, Oracle doesn’t support boolean types and always requires a FROM clause. Otherwise, we could be writing this more concise version, as in PostgreSQL:

SELECT EXISTS (
  SELECT *
  FROM t 
  JOIN u USING (a)
  WHERE t.b BETWEEN 0 AND 1000
)

We’re checking for the existence of rows in both tables, given a predicate that runs on the previously created index.

As shown in a previous article, it’s much better to use EXISTS rather than COUNT(*), in pretty much all databases. The algorithm is optimal, because the usage of the EXISTS predicate hints to the optimiser that a SEMI JOIN can be used instead of an INNER JOIN.
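
For contrast, the COUNT(*) variant (shown here only as a sketch; it’s not part of this article’s benchmark) would have to count all matching rows before answering the yes/no question:

SELECT CASE WHEN (
  SELECT count(*)
  FROM t
  JOIN u USING (a)
  WHERE t.b BETWEEN 0 AND 1000
) > 0 THEN 1 ELSE 0 END
FROM dual

The EXISTS() version, on the other hand, yields this plan: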

--------------------------------------------------------------------------
| Operation                            | Name | Starts | E-Rows | A-Rows |
--------------------------------------------------------------------------
| SELECT STATEMENT                     |      |      1 |        |      1 |
|  NESTED LOOPS SEMI                   |      |      1 |      4 |      1 |
|   TABLE ACCESS BY INDEX ROWID BATCHED| T    |      1 |   1000 |      1 |
|    INDEX RANGE SCAN                  | I_T  |      1 |   1000 |      1 |
|   INDEX UNIQUE SCAN                  | PK_U |      1 |    333K|      1 |
|  FAST DUAL                           |      |      1 |      1 |      1 |
--------------------------------------------------------------------------

Some observations:

  • The EXISTS() predicate leads to the desired NESTED LOOPS SEMI join
  • Even though the optimiser estimates 1000 rows from the predicate on T.B (E-Rows), only a single row is actually fetched per operation (A-Rows), because the semi join can abort as soon as the first matching row is found

So, this is optimal (until I learn a new trick, of course).

Let’s bring in database links

Assuming that these two tables are on a remote database, we might naively proceed with writing this query:

SELECT CASE WHEN EXISTS (
  SELECT *
  FROM t@loopback 
  JOIN u@loopback USING (a)
  WHERE t.b BETWEEN 0 AND 1000
) THEN 1 ELSE 0 END
FROM dual;

So now, we’re selecting from T@LOOPBACK and U@LOOPBACK, but the rest is exactly the same. For the sake of simplicity, I’m running this reproduction on the same instance, thus “LOOPBACK”. The logical impact is the same, though.

------------------------------------------------------
| Operation        | Name | Starts | E-Rows | A-Rows |
------------------------------------------------------
| SELECT STATEMENT |      |      1 |        |      1 |
|  REMOTE          |      |      1 |        |      1 |
|  FAST DUAL       |      |      1 |      1 |      1 |
------------------------------------------------------

Interesting. Or rather: Not too interesting. Sure, our own database knows the correct estimate: 1 row that comes out of the EXISTS() predicate. But the interesting thing happens at the remote database. Let’s look at the plan there. The query being executed on the remote database is this:

SQL_ID  80fczs4r1c9yd, child number 0
-------------------------------------

SELECT 0 FROM "T" "A2","U" "A1" WHERE "A2"."B">=0 AND "A2"."B"=0

So, the EXISTS() predicate is not propagated to the remote database. Thus, the plan:

Plan hash value: 165433672
 
--------------------------------------------------------------------------
| Operation                            | Name | Starts | E-Rows | A-Rows |
--------------------------------------------------------------------------
| SELECT STATEMENT                     |      |      1 |        |      1 |
|  HASH JOIN                           |      |      1 |   1000 |      1 |
|   TABLE ACCESS BY INDEX ROWID BATCHED| T    |      1 |   1000 |   1000 |
|    INDEX RANGE SCAN                  | I_T  |      1 |   1000 |   1000 |
|   INDEX FAST FULL SCAN               | PK_U |      1 |    500K|      1 |
--------------------------------------------------------------------------

Oops. Observations:

  • We’re now running a hash join (as expected, given the query that the remote database knows of)
  • We’re materialising the expected 1000 rows from the predicate on T.B
  • But we’re still not fetching all the expected 500,000 rows from the U table because the database that calls this query will abort as soon as it finds a single row

Huh. Bummer. So while we’re not running into a major catastrophe (of materialising all the rows from U), this is still far from optimal. The remote database has no knowledge at all of the fact that we’re going to be selecting 0 or 1 rows only, and that it thus should always run a SEMI JOIN.

You can try adding a /*+FIRST_ROWS(1)*/ hint, but that doesn’t work. It won’t make it to the remote database.
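
For illustration, the hinted attempt would look like this (the hint is not propagated through the link, so the remote plan stays the same):

SELECT /*+FIRST_ROWS(1)*/ CASE WHEN EXISTS (
  SELECT *
  FROM t@loopback 
  JOIN u@loopback USING (a)
  WHERE t.b BETWEEN 0 AND 1000
) THEN 1 ELSE 0 END
FROM dual;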

Arcane DUAL@LINK to the rescue

This is when I had an idea. The problem might just be the fact that Oracle always needs a FROM clause, even if it doesn’t make any sense here. And in the previous query, that FROM clause referenced DUAL, which, technically, is a table on our own database. So even though it looks as though the entire query could run on the remote database, the local DUAL prevents exactly that. What if we use DUAL@LOOPBACK instead of DUAL? Then every table in the statement is a remote table, and the whole query, including the EXISTS() predicate, can be shipped to the remote database. Let’s try this:

SELECT CASE WHEN EXISTS (
  SELECT *
  FROM t@loopback 
  JOIN u@loopback USING (a)
  WHERE t.b BETWEEN 0 AND 1000
) THEN 1 ELSE 0 END
FROM dual@loopback; -- Subtle difference here!

As I hoped, this subtle change leads to the EXISTS() predicate being sent to the remote database. The query executed on the remote database is now:

SQL_ID  9bz87xw0zc23c, child number 0
-------------------------------------
SELECT CASE  WHEN  EXISTS (SELECT 0 FROM "T" "A3","U" "A2" WHERE 
"A3"."B">=0 AND "A3"."B"<=1000 AND "A3"."A"="A2"."A") THEN 1 ELSE 0 END 
 FROM "DUAL" "A1"

And the plan, now again including the desired SEMI JOIN:

Plan hash value: 1561559448
 
--------------------------------------------------------------------------
| Operation                            | Name | Starts | E-Rows | A-Rows |
--------------------------------------------------------------------------
| SELECT STATEMENT                     |      |      1 |        |      1 |
|  NESTED LOOPS SEMI                   |      |      1 |      4 |      1 |
|   TABLE ACCESS BY INDEX ROWID BATCHED| T    |      1 |   1000 |      1 |
|    INDEX RANGE SCAN                  | I_T  |      1 |   1000 |      1 |
|   INDEX UNIQUE SCAN                  | PK_U |      1 |    333K|      1 |
|  FAST DUAL                           |      |      1 |      1 |      1 |
--------------------------------------------------------------------------

Excellent!

Benchmark time

Plans and estimates are one thing. What ultimately counts to business is wall clock time. So, let’s try this again using a benchmark:

SET SERVEROUTPUT ON
DECLARE
  v_ts TIMESTAMP WITH TIME ZONE;
  v_repeat CONSTANT NUMBER := 100;
BEGIN

  -- Repeat benchmark several times to avoid warmup penalty
  FOR r IN 1..5 LOOP
    v_ts := SYSTIMESTAMP;
      
    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT CASE WHEN EXISTS (
          SELECT *
          FROM t 
          JOIN u USING (a)
          WHERE t.b BETWEEN 0 AND 1000
        ) THEN 1 ELSE 0 END
        FROM dual
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;
      
    dbms_output.put_line('Run ' || r ||', Statement 1 : ' 
      || (SYSTIMESTAMP - v_ts));
    v_ts := SYSTIMESTAMP;
      
    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT CASE WHEN EXISTS (
          SELECT *
          FROM t@loopback 
          JOIN u@loopback USING (a)
          WHERE t.b BETWEEN 0 AND 1000
        ) THEN 1 ELSE 0 END
        FROM dual
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;
      
    dbms_output.put_line('Run ' || r ||', Statement 2 : ' 
      || (SYSTIMESTAMP - v_ts));
    v_ts := SYSTIMESTAMP;
      
    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT CASE WHEN EXISTS (
          SELECT *
          FROM t@loopback 
          JOIN u@loopback USING (a)
          WHERE t.b BETWEEN 0 AND 1000
        ) THEN 1 ELSE 0 END
        FROM dual@loopback
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;
      
    dbms_output.put_line('Run ' || r ||', Statement 3 : ' 
      || (SYSTIMESTAMP - v_ts));
    dbms_output.put_line('');
  END LOOP;
END;
/

And here are the resulting times:

Run 1, Statement 1 : +000000000 00:00:00.008110000
Run 1, Statement 2 : +000000000 00:00:00.213404000
Run 1, Statement 3 : +000000000 00:00:00.043044000

Run 2, Statement 1 : +000000000 00:00:00.003466000
Run 2, Statement 2 : +000000000 00:00:00.198487000
Run 2, Statement 3 : +000000000 00:00:00.042717000

Run 3, Statement 1 : +000000000 00:00:00.003077000
Run 3, Statement 2 : +000000000 00:00:00.191802000
Run 3, Statement 3 : +000000000 00:00:00.048740000

Run 4, Statement 1 : +000000000 00:00:00.005008000
Run 4, Statement 2 : +000000000 00:00:00.192828000
Run 4, Statement 3 : +000000000 00:00:00.043461000

Run 5, Statement 1 : +000000000 00:00:00.002970000
Run 5, Statement 2 : +000000000 00:00:00.190786000
Run 5, Statement 3 : +000000000 00:00:00.043910000

Clearly, not using the database link is always the fastest, roughly by a factor of 10 compared to the DUAL@LOOPBACK solution. But due to the system design, we don’t have this choice. Nonetheless, you can still see that DUAL@LOOPBACK consistently outperforms DUAL by another factor of around 5 as it still prevents the HASH JOIN!

Caveat: Small data != Big data

There, I said it. “Big Data”. Before, we had a predicate that ran on 1,000 rows in a 500,000 row strong table. Our customer had millions of rows. But what happens if you query small data sets? Let’s reduce the predicate to this:

WHERE t.b BETWEEN 0 AND 10

The benchmark result is now completely different:

Run 1, Statement 1 : +000000000 00:00:00.007093000
Run 1, Statement 2 : +000000000 00:00:00.047539000
Run 1, Statement 3 : +000000000 00:00:00.071546000

Run 2, Statement 1 : +000000000 00:00:00.003023000
Run 2, Statement 2 : +000000000 00:00:00.041259000
Run 2, Statement 3 : +000000000 00:00:00.052132000

Run 3, Statement 1 : +000000000 00:00:00.002767000
Run 3, Statement 2 : +000000000 00:00:00.034190000
Run 3, Statement 3 : +000000000 00:00:00.054023000

Run 4, Statement 1 : +000000000 00:00:00.003468000
Run 4, Statement 2 : +000000000 00:00:00.026141000
Run 4, Statement 3 : +000000000 00:00:00.047415000

Run 5, Statement 1 : +000000000 00:00:00.002818000
Run 5, Statement 2 : +000000000 00:00:00.026100000
Run 5, Statement 3 : +000000000 00:00:00.046875000

And as you can see, the DUAL@LOOPBACK solution actually worsens performance for these queries. The reason for this is that we’re now running, again, a NESTED LOOP JOIN (but not SEMI JOIN) rather than a HASH JOIN on the remote database:

Query on remote database:

SQL_ID  7349t2363uc9m, child number 0
-------------------------------------
SELECT 0 FROM "T" "A2","U" "A1" WHERE "A2"."B">=0 AND "A2"."B"=0

Plan on remote database:

Plan hash value: 2558931407
 
--------------------------------------------------------------------------
| Operation                            | Name | Starts | E-Rows | A-Rows |
--------------------------------------------------------------------------
| SELECT STATEMENT                     |      |      1 |        |      1 |
|  NESTED LOOPS                        |      |      1 |     10 |      1 |
|   TABLE ACCESS BY INDEX ROWID BATCHED| T    |      1 |     10 |      1 |
|    INDEX RANGE SCAN                  | I_T  |      1 |     10 |      1 |
|   INDEX UNIQUE SCAN                  | PK_U |      1 |      1 |      1 |
--------------------------------------------------------------------------

I haven’t analysed what the reason for this difference is, as the difference is not significant enough, compared to the improvement for large data sets.

Conclusion

Tuning queries over database links is hard. Much much harder than tuning “ordinary” queries. Ideally, you’ll simply avoid database links and run all queries on a single instance. But sometimes that’s not possible.

In that case, the best solution is to move the logic to the remote query completely and collect only the result. Ideally, this is done using a stored procedure on the remote database and calculating this 1/0 result completely remotely. I think, hipsters these days call this a Microservice, or better, a Lambda:

CREATE FUNCTION t_u RETURN NUMBER IS
  v_result NUMBER;
BEGIN
  SELECT CASE WHEN EXISTS (
    SELECT *
    FROM t 
    JOIN u USING (a)
    WHERE t.b BETWEEN 0 AND 1000
  ) THEN 1 ELSE 0 END
  INTO v_result
  FROM dual;
  
  RETURN v_result;
END t_u;
/

Comparing the benchmark call with the other options:

T.B BETWEEN 0 AND 10

Statement 1 : +000000000 00:00:00.003022000 -- Local query
Statement 2 : +000000000 00:00:00.027416000 -- Local DUAL
Statement 3 : +000000000 00:00:00.043823000 -- Remote DUAL
Statement 4 : +000000000 00:00:00.022181000 -- Remote stored procedure

T.B BETWEEN 0 AND 1000

Statement 1 : +000000000 00:00:00.002877000 -- Local query
Statement 2 : +000000000 00:00:00.188588000 -- Local DUAL
Statement 3 : +000000000 00:00:00.050163000 -- Remote DUAL
Statement 4 : +000000000 00:00:00.018736000 -- Remote stored procedure

But that, too, might not be possible as you may not have the required rights to create stored procedures on that database. You could call DBMS_SQL on the remote database and run a PL/SQL block dynamically there (I didn’t try that in my benchmark).

Or, you simply use an occasional DUAL@LINK, which might already do the trick with a minimal change to the original query.

Don’t Use the String Concatenation “Trick” in SQL Predicates

In SQL, quite often, we want to compare several values with each other. For instance, when we’re looking for a specific user by their first and last names, we’ll write a query like this one:

SELECT *
FROM customer
WHERE first_name = 'SUSAN'
AND last_name = 'WILSON';

We’re getting:

CUSTOMER_ID   FIRST_NAME   LAST_NAME
------------------------------------
          8   SUSAN        WILSON

Surely, everyone agrees that this is correct and perfectly fine as we probably have an index on these two columns (or on at least one of them) to speed up such queries:

CREATE INDEX idx_customer_name ON customer (last_name, first_name);

The execution plan is thus optimal, e.g. with Oracle:

-------------------------------------------------------------------------
| Id  | Operation                           | Name              | Rows  |
-------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |                   |       |
|   1 |  TABLE ACCESS BY INDEX ROWID BATCHED| CUSTOMER          |     1 |
|*  2 |   INDEX RANGE SCAN                  | IDX_CUSTOMER_NAME |     1 |
-------------------------------------------------------------------------

But sometimes, we cannot use AND to connect two predicates. In particular, that doesn’t seem possible with an IN predicate, so people sometimes resort to using string concatenation, because that seems to work and to make sense.

For instance, let’s find all customers whose first and last names match those of an actor (as always, using the Sakila database):

SELECT *
FROM customer
WHERE first_name || last_name IN (
  SELECT first_name || last_name
  FROM actor
)

And yes indeed, what we’re getting here is the correct answer:

CUSTOMER_ID   FIRST_NAME   LAST_NAME
------------------------------------
          6   JENNIFER     DAVIS

But that answer is only accidentally correct!

Because we weren’t looking for customers called

first_name = 'JENNIFER' AND last_name = 'DAVIS'

We were looking for customers called

first_name || last_name = 'JENNIFERDAVIS'

Want proof? Let’s add a new customer:

INSERT INTO customer (customer_id, first_name, last_name )
VALUES               (600        , 'JENNI'   , 'FERDAVIS');

Yeah right? No one is called FERDAVIS. Or are they? As good programmers, we closely observe Murphy’s Law (i.e. always look both left and right when crossing a street).

In any case, let’s run our query again:

SELECT *
FROM customer
WHERE first_name || last_name IN (
  SELECT first_name || last_name
  FROM actor
)

And observe the result!

CUSTOMER_ID   FIRST_NAME   LAST_NAME
------------------------------------
          6   JENNIFER     DAVIS
        600   JENNI        FERDAVIS

Of course, because our predicate was really looking for customers called

first_name || last_name = 'JENNIFERDAVIS'

Which matches in both cases:

-- What we expected
first_name || last_name = 'JENNIFER' || 'DAVIS'

-- What we got
first_name || last_name = 'JENNI' || 'FERDAVIS'

Notice that I only added this customer to the customer table, not to the actor table. There’s no actor by the name FERDAVIS, so the result is clearly wrong.

AHA! Let’s use an “impossible” separator

So, we might proceed to fixing this as such:

SELECT *
FROM customer
WHERE first_name || '###' || last_name IN (
  SELECT first_name || '###' || last_name
  FROM actor
)

And now, the result is again correct. We get only JENNIFER DAVIS because we were looking for:

first_name || '###' || last_name = 'JENNIFER###DAVIS'

This works quite well for a while, as the separator is quite “impossible” (i.e. improbable) to be encountered in actual data. But we shouldn’t trust our judgement, because… Murphy’s Law. So you might think: better use a rarer separator, e.g. (if your database supports proper character sets)

SELECT *
FROM customer
WHERE first_name || '🙈🙉🙊' || last_name IN (
  SELECT first_name || '🙈🙉🙊' || last_name
  FROM actor
)

The use of emojis should indicate what my opinion of this approach is.

Too bad for performance, though

Remember that index we’ve created? Fact is, we also have such an index on the ACTOR table:

CREATE INDEX idx_actor_name ON actor (last_name, first_name);

And now, let’s assume our query is a bit different. We’ll be looking only for customers whose address_id is 10:

SELECT *
FROM customer
WHERE address_id = 10
AND first_name || '🙈🙉🙊' || last_name IN (
  SELECT first_name || '🙈🙉🙊' || last_name
  FROM actor
)

Now, our querymoji is using the index indeed, but for an INDEX FULL SCAN, so it’s only slightly faster than scanning the entire actor table:

-----------------------------------------------------------------------------------
| Id  | Operation                            | Name                       | Rows  |
-----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                     |                            |       |
|*  1 |  HASH JOIN SEMI                      |                            |     1 |
|   2 |   TABLE ACCESS BY INDEX ROWID BATCHED| CUSTOMER                   |     1 |
|*  3 |    INDEX RANGE SCAN                  | IDX_CUSTOMER_FK_ADDRESS_ID |     1 |
|   4 |   INDEX FULL SCAN                    | IDX_ACTOR_NAME             |     2 |
-----------------------------------------------------------------------------------

And what’s worse, even if all the cardinality estimates correctly indicate only 1-2 rows, we’ll perform a HASH JOIN and load the full index for it! We should be running a NESTED LOOP instead.

Is there a better way? Yes! Use row constructors to compare several values at once:

SELECT *
FROM customer
WHERE address_id = 10
AND (first_name, last_name) IN (
  SELECT first_name, last_name
  FROM actor
);

Or, if your database doesn’t support this syntax (luckily, Oracle and PostgreSQL do, for instance), then you can resort to an equivalent EXISTS predicate:

SELECT *
FROM customer c
WHERE address_id = 10
AND EXISTS (
  SELECT 1
  FROM actor a
  WHERE c.first_name = a.first_name
  AND c.last_name = a.last_name
);

Both of these queries are exactly equivalent and result in a nested loop semi join, rather than the previous hash join, which is perfectly reasonable for these small tables. We can now use the IDX_ACTOR_NAME for a quick INDEX RANGE SCAN operation:

-----------------------------------------------------------------------------------
| Id  | Operation                            | Name                       | Rows  |
-----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                     |                            |       |
|   1 |  NESTED LOOPS SEMI                   |                            |     1 |
|   2 |   TABLE ACCESS BY INDEX ROWID BATCHED| CUSTOMER                   |     1 |
|*  3 |    INDEX RANGE SCAN                  | IDX_CUSTOMER_FK_ADDRESS_ID |     1 |
|*  4 |   INDEX RANGE SCAN                   | IDX_ACTOR_NAME             |     1 |
-----------------------------------------------------------------------------------

But let’s not trust the estimated plans. Let’s benchmark (more info about benchmarking SQL here):

SET SERVEROUTPUT ON
DECLARE
  v_ts TIMESTAMP WITH TIME ZONE;
  v_repeat CONSTANT NUMBER := 2500;
BEGIN

  -- Repeat benchmark several times to avoid warmup penalty
  FOR r IN 1..5 LOOP
    v_ts := SYSTIMESTAMP;
      
    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT first_name, last_name
        FROM customer
        WHERE address_id = 10
        AND first_name || '###' || last_name IN (
          SELECT first_name || '###' || last_name
          FROM actor
        )
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;
      
    dbms_output.put_line('Run ' || r ||', Statement 1 : ' 
      || (SYSTIMESTAMP - v_ts));
    v_ts := SYSTIMESTAMP;
      
    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT first_name, last_name
        FROM customer
        WHERE address_id = 10
        AND (first_name, last_name) IN (
          SELECT first_name, last_name
          FROM actor
        )
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;
      
    dbms_output.put_line('Run ' || r ||', Statement 2 : '
      || (SYSTIMESTAMP - v_ts));
  END LOOP;
END;
/

As can be seen here, the benchmark shows that the query using the row constructor is drastically faster as it can properly use the index as it should:

Run 1, Statement 1 : +000000000 00:00:00.374471000
Run 1, Statement 2 : +000000000 00:00:00.062830000
Run 2, Statement 1 : +000000000 00:00:00.364168000
Run 2, Statement 2 : +000000000 00:00:00.066252000
Run 3, Statement 1 : +000000000 00:00:00.359559000
Run 3, Statement 2 : +000000000 00:00:00.063898000
Run 4, Statement 1 : +000000000 00:00:00.344775000
Run 4, Statement 2 : +000000000 00:00:00.086060000
Run 5, Statement 1 : +000000000 00:00:00.394163000
Run 5, Statement 2 : +000000000 00:00:00.063176000

Now, imagine we were running this against some much more impressive data sets than the Sakila database.

Conclusion

If you’re ever thinking about concatenating two fields for a comparison, try again. There are two major caveats that should indicate you’re about to do something silly:

  • There’s a major risk of your query being subtly wrong (accidental matches between JENNIFER DAVIS and JENNI FERDAVIS)
  • There’s a major risk of your query being quite slow

So, as a rule of thumb, don’t use concatenation in predicates. There’s (almost) always a better way.

Read also: Why You Should (Sometimes) Avoid Expressions in SQL Predicates

How to Calculate Multiple Aggregate Functions in a Single Query

At a customer site, I’ve recently encountered a report where a programmer needed to count quite a bit of stuff from a single table. The counts all differed in the way they used specific predicates. The report looked roughly like this (as always, I’m using the Sakila database for illustration):

-- Total number of films
SELECT count(*)
FROM film

-- Number of films with a given length
SELECT count(*)
FROM film
WHERE length BETWEEN 120 AND 150

-- Number of films with a given language
SELECT count(*)
FROM film
WHERE language_id = 1

-- Number of films for a given rating
SELECT count(*)
FROM film
WHERE rating = 'PG'

And then, unsurprisingly, combinations of these predicates were needed as well, i.e.

-- Number of films with a given length / language_id
SELECT count(*)
FROM film
WHERE length BETWEEN 120 AND 150
AND language_id = 1

-- Number of films with a given length / rating
SELECT count(*)
FROM film
WHERE length BETWEEN 120 AND 150
AND rating = 'PG'

-- Number of films with a given language_id / rating
SELECT count(*)
FROM film
WHERE language_id = 1
AND rating = 'PG'

-- Number of films with a given length / language_id / rating
SELECT count(*)
FROM film
WHERE length BETWEEN 120 AND 150
AND language_id = 1
AND rating = 'PG'

In the end, there were 32 queries in total (or 8 in my example) with all the possible combinations of predicates. Needless to say that running them all took quite a while, because the table had around 200M records and only one predicate could profit from an index.

But in fact, the improvement is really easy. There are several options to calculate all these counts in a single query.

Simplest solution works in all databases: Filtered aggregate functions (or manual pivot)

This solution allows for calculating all results in a single query by using 8 different, explicit, filtered aggregate functions and no GROUP BY clause (none in this example; more complex cases where a GROUP BY persists are still imaginable).

This is how it works on all databases:

SELECT 
  count(*),
  count(length),
  count(language_id),
  count(rating),
  count(length + language_id),
  count(length + rating),
  count(language_id + rating),
  count(length + language_id + rating)
FROM (
  SELECT
    CASE WHEN length BETWEEN 120 AND 150 THEN 1 END length,
    CASE WHEN language_id = 1            THEN 1 END language_id,
    CASE WHEN rating = 'PG'              THEN 1 END rating
  FROM film
) film

Which yields:

col1  col2  col3  col4  col5  col6  col7  col8
1000  224   1000  194   224   43    194   43

How to read the above query?

Instead of evaluating the three different predicates in a WHERE clause, we pre-calculate them in a derived table (a subquery in the FROM clause) and translate each predicate into some arbitrary value (e.g. 1) if TRUE and into NULL if FALSE. Note that I omitted the ELSE clause from the CASE expressions, which means that we get NULL by default. Running the nested select on its own…

SELECT
  CASE WHEN length BETWEEN 120 AND 150 THEN 1 END length,
  CASE WHEN language_id = 1            THEN 1 END language_id,
  CASE WHEN rating = 'PG'              THEN 1 END rating
FROM film

… yields something along the lines of:

length  language_id  rating
---------------------------
NULL    1            1
NULL    1            NULL
NULL    1            NULL
NULL    1            NULL
1       1            NULL
NULL    1            1
NULL    1            NULL
...

(Note, of course, we could have used actual BOOLEAN types, e.g. in PostgreSQL, but that wouldn’t work on all databases)

Now, in the outer query, we’re using once COUNT(*), which simply counts all the rows regardless of any predicates in the CASE expressions. The other COUNT(expr) aggregate functions do something that surprisingly few people are aware of (yet a lot of people use this form “by accident”). They count only the number of non-NULL rows. For instance:

SELECT 
  ...
  count(length),
  ...
FROM (
  SELECT
    CASE WHEN length BETWEEN 120 AND 150 THEN 1 END length,
    ...
  FROM film
) film

Or also:

SELECT 
  count(CASE WHEN length BETWEEN 120 AND 150 THEN 1 END)
FROM
  film

These queries will count those films whose length is BETWEEN 120 AND 150 (because those rows produce the value 1, which is non-NULL, and thus counted), whereas all the other films are not being counted.

Finally, I just used a trick to combine the nullable values: in SQL, the sum of two values is NULL as soon as either value is NULL, so the sum is non-NULL only if both predicates were TRUE:

SELECT 
  ...
  count(length + language_id)
FROM (
  SELECT
    CASE WHEN length BETWEEN 120 AND 150 THEN 1 END length,
    CASE WHEN language_id = 1            THEN 1 END language_id,
    ...
  FROM film
) film

This counts only those rows whose length is BETWEEN 120 AND 150 and whose language_id = 1, because if either predicate were FALSE, the corresponding value would be NULL, and thus the sum would be NULL as well.
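
The trick relies on SQL's NULL propagation in arithmetic: a sum is NULL as soon as any of its operands is NULL. A quick sanity check (using Oracle's DUAL table; other databases may not need a FROM clause here):

SELECT 
  1 + 1       AS both_true,  -- 2
  1 + NULL    AS one_false,  -- NULL
  NULL + NULL AS both_false  -- NULL
FROM dual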

PostgreSQL and HSQLDB variant: FILTER

In PostgreSQL and HSQLDB (and in the SQL standard), there's a special syntax for this: the FILTER clause. We can use it instead of encoding predicates as NULL / non-NULL values, like this:

SELECT 
  count(*),
  count(*) FILTER (WHERE length IS NOT NULL),
  count(*) FILTER (WHERE language_id IS NOT NULL),
  count(*) FILTER (WHERE rating IS NOT NULL),
  count(*) FILTER (WHERE length + language_id IS NOT NULL),
  count(*) FILTER (WHERE length + rating IS NOT NULL),
  count(*) FILTER (WHERE language_id + rating IS NOT NULL),
  count(*) FILTER (
    WHERE length + language_id + rating IS NOT NULL)
FROM (
  SELECT
    CASE WHEN length BETWEEN 120 AND 150 THEN 1 END length,
    CASE WHEN language_id = 1            THEN 1 END language_id,
    CASE WHEN rating = 'PG'              THEN 1 END rating
  FROM film
) film

Or we could even write out the entire predicates again:

SELECT 
  count(*),
  count(*) FILTER (WHERE length BETWEEN 120 AND 150),
  count(*) FILTER (WHERE language_id = 1),
  count(*) FILTER (WHERE rating = 'PG'),
  count(*) FILTER (
    WHERE length BETWEEN 120 AND 150 AND language_id = 1),
  count(*) FILTER (
    WHERE length BETWEEN 120 AND 150 AND rating = 'PG'),
  count(*) FILTER (
    WHERE language_id = 1 AND rating = 'PG'),
  count(*) FILTER (
    WHERE length BETWEEN 120 AND 150 
    AND language_id = 1 AND rating = 'PG')
FROM film

Usually, the FILTER clause is more convenient, but both approaches are equivalent, and we’re running only a single query!
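
By the way, both variants combine just fine with an ordinary GROUP BY clause, should one persist, as hinted at earlier. A hypothetical sketch in PostgreSQL syntax, counting all films per rating, as well as the films matching the length predicate per rating:

SELECT 
  rating,
  count(*),
  count(*) FILTER (WHERE length BETWEEN 120 AND 150)
FROM film
GROUP BY rating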

I also call this “manual PIVOT”, because it really works like a PIVOT table. And the good news is… There is a PIVOT syntax!

A more fancy solution: PIVOT

This solution is vendor-specific and only works in Oracle and, with fewer features, in SQL Server. Here's the Oracle version:

SELECT 
  a + b + c + d + e + f + g + h,
                  e + f + g + h,
          c + d         + g + h,
      b     + d     + f     + h,
                          g + h,
                      f     + h,
              d             + h,
                              h
FROM (
  SELECT
    CASE WHEN length BETWEEN 120 AND 150 
         THEN 1 ELSE 0 END length,
    CASE WHEN language_id = 1            
         THEN 1 ELSE 0 END language_id,
    CASE WHEN rating = 'PG'              
         THEN 1 ELSE 0 END rating
  FROM film
) film
PIVOT (
  count(*) FOR (length, language_id, rating) IN (
    (0, 0, 0) AS a,
    (0, 0, 1) AS b,
    (0, 1, 0) AS c,
    (0, 1, 1) AS d,
    (1, 0, 0) AS e,
    (1, 0, 1) AS f,
    (1, 1, 0) AS g,
    (1, 1, 1) AS h
  )
)

How to read this solution? There are 3 steps:

Step 1: The derived table

As in the previous example, we're translating the desired predicates for our report into three columns that produce the values 1 and 0. That's understood by now, so I won't repeat the explanation.

Step 2: The PIVOT clause

The PIVOT clause can be applied to a table expression to “pivot” it, similar to what we know from Microsoft Excel's powerful pivot tables. It takes three parts:

  • A list of aggregate functions
  • An expression (FOR clause)
  • A list of expected values (IN clause)

The resulting table expression groups the PIVOT's input table by all the remaining columns (i.e. all the columns that are not part of the FOR clause; in our example, there are none), and aggregates all the aggregate functions (in our case, only one) for each combination of values in the IN list.

If we SELECT * from this PIVOT table:

SELECT *
FROM (
  SELECT
    CASE WHEN length BETWEEN 120 AND 150 
         THEN 1 ELSE 0 END length,
    CASE WHEN language_id = 1            
         THEN 1 ELSE 0 END language_id,
    CASE WHEN rating = 'PG'              
         THEN 1 ELSE 0 END rating
  FROM film
) film
PIVOT (
  count(*) FOR (length, language_id, rating) IN (
    (0, 0, 0) AS a,
    (0, 0, 1) AS b,
    (0, 1, 0) AS c,
    (0, 1, 1) AS d,
    (1, 0, 0) AS e,
    (1, 0, 1) AS f,
    (1, 1, 0) AS g,
    (1, 1, 1) AS h
  )
)

… we’ll get these values:

a    b    c    d    e    f    g    h
------------------------------------
0    0  625  151    0    0  181   43

As you can see, the column names are generated from the IN list of expected values, and the values contained in these columns are aggregations for the different predicate combinations. These aggregations are not exactly the ones we wanted. For instance, column g counts all the films whose length is BETWEEN 120 AND 150 and whose language_id = 1, but whose rating != 'PG'.

Step 3: Summing the count values

So, in order to get the expected results, we have to sum up the partial counts. Each requested count is the sum of exactly those cells whose predicate bits match; e.g. the number of films matching the length predicate is e + f + g + h:

SELECT 
  a + b + c + d + e + f + g + h,
                  e + f + g + h,
          c + d         + g + h,
      b     + d     + f     + h,
                          g + h,
                      f     + h,
              d             + h,
                              h
FROM 
  ...

The result is now the same.

A more fancy solution: GROUPING SETS

GROUPING SETS are part of the SQL standard, and they're supported at least in:

  • DB2
  • HANA
  • Oracle
  • PostgreSQL
  • SQL Server
  • Sybase SQL Anywhere

Simply put, GROUPING SETS allow for grouping a table several times and creating a UNION of all the results. For example, the following two queries are the same, conceptually, although the GROUPING SETS one is usually faster:

-- Grouping once by language_id, then by rating
SELECT language_id, rating, count(*)
FROM film
GROUP BY GROUPING SETS (
  (language_id),
  (rating)
)

-- Grouping first by language_id
SELECT language_id, NULL, count(*)
FROM film
GROUP BY language_id
UNION ALL
SELECT NULL, rating, count(*)
FROM film
GROUP BY rating

Both queries yield:

language_id   rating   count
          1             1000 -- First grouping set / union subquery
              G          178 \
              PG         194  |
              PG-13      223  | Second grouping set / union subquery
              R          195  |
              NC-17      210 /

Clearly, the GROUPING SETS variant is more concise. Let's imagine we'd like to add more combinations of grouping columns, e.g.:

SELECT language_id, rating, count(*)
FROM film
GROUP BY GROUPING SETS (
  (),
  (language_id),
  (rating),
  (language_id, rating)
)

Now, we’re grouping by all the combinations of columns, and the result is:

language_id   rating   count
                        1000 -- First grouping set: ()
          1             1000 -- Second grouping set: (language_id)
              G          178 \
              PG         194  |
              PG-13      223  | Third grouping set: (rating)
              R          195  |
              NC-17      210 /
          1   G          178 \
          1   PG         194  |
          1   PG-13      223  | Fourth grouping set: (language_id, rating)
          1   R          195  |
          1   NC-17      210 /

Of course, this would all be more impressive if we had more than one language in the system…

So, how do we solve the original problem with GROUPING SETS? Here’s how:

SELECT 
  GROUPING_ID (length, language_id, rating),
  length,
  language_id,
  rating,
  count(*)
FROM (
  SELECT
    CASE WHEN length BETWEEN 120 AND 150 
         THEN 1 ELSE 0 END length,
    CASE WHEN language_id = 1            
         THEN 1 ELSE 0 END language_id,
    CASE WHEN rating = 'PG'              
         THEN 1 ELSE 0 END rating
  FROM film
) film
GROUP BY GROUPING SETS (
  (),
  (length),
  (language_id),
  (rating),
  (length, language_id),
  (length, rating),
  (rating, language_id),
  (length, language_id, rating)
)
HAVING COALESCE (length, 1) != 0 
AND COALESCE (language_id, 1) != 0 
AND COALESCE (rating, 1) != 0
ORDER BY GROUPING_ID (length, language_id, rating) DESC

Wow. How to read this? In 4 steps:

Step 1: Again, the derived table

This time, we'll encode FALSE as 0, not NULL, because NULL already has a different meaning in GROUPING SETS: it means that for a given GROUPING SET, we didn't group by that column. We'll see that in step 3.

Step 2: The GROUPING SETS

In this section, we’re just listing all the possible combinations of GROUP BY columns that we want to use, which produces 8 distinct GROUPING SETS. I’ve already explained this in the previous introduction to GROUPING SETS, so this is no different.

Step 3: Filter out unwanted groupings

Just like in the PIVOT example, we’re also getting results for which the predicates are FALSE, but we don’t want those in the result. So we’re filtering them out in the HAVING clause:

SELECT 
  ...
HAVING COALESCE (length, 1) != 0 
AND COALESCE (language_id, 1) != 0 
AND COALESCE (rating, 1) != 0
...

How to read this? E.g. LENGTH can be any of:

  • 1: The length predicate was TRUE
  • 0: The length predicate was FALSE
  • NULL: The length column is not considered for a given GROUPING SET, e.g. () or (rating, language_id)

So, using COALESCE, we’re making sure that we include only 1 and NULL lengths, not 0 lengths.
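
To make the COALESCE logic explicit, here's what each branch evaluates to (again illustrated with Oracle's DUAL table):

SELECT 
  COALESCE(1, 1)    AS predicate_true,  -- 1, passes the != 0 check: kept
  COALESCE(0, 1)    AS predicate_false, -- 0, fails the != 0 check: filtered out
  COALESCE(NULL, 1) AS not_grouped      -- 1, passes the != 0 check: kept
FROM dual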

Step 4: Ordering the results

This is optional, but in order to get the same output order as before, we can use the special GROUPING_ID() function (or GROUPING(), depending on the database), which returns an ID for each GROUPING SET. The output is:

grouping   length   language_id   rating   count
------------------------------------------------
       7     NULL          NULL     NULL    1000
       6     NULL          NULL        1     194
       5     NULL             1     NULL    1000
       4     NULL             1        1     194
       3        1          NULL     NULL     224
       2        1          NULL        1      43
       1        1             1     NULL     224
       0        1             1        1      43
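
In case you're wondering how these grouping IDs come about: GROUPING_ID() is simply a bit vector assembled from the individual GROUPING() values, where GROUPING(col) returns 1 if col was aggregated away in the current GROUPING SET, and 0 if it was grouped. Here's a sketch that makes the individual bits visible (Oracle syntax, reusing the query from above):

SELECT 
  GROUPING_ID (length, language_id, rating) AS grouping_id,
  GROUPING (length)      AS g_length,      -- contributes 4 to GROUPING_ID
  GROUPING (language_id) AS g_language_id, -- contributes 2
  GROUPING (rating)      AS g_rating,      -- contributes 1
  count(*)
FROM (
  SELECT
    CASE WHEN length BETWEEN 120 AND 150 THEN 1 ELSE 0 END length,
    CASE WHEN language_id = 1            THEN 1 ELSE 0 END language_id,
    CASE WHEN rating = 'PG'              THEN 1 ELSE 0 END rating
  FROM film
) film
GROUP BY GROUPING SETS (
  (),
  (length),
  (language_id),
  (rating),
  (length, language_id),
  (length, rating),
  (rating, language_id),
  (length, language_id, rating)
)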

Excellent! And hey, there's even syntactic sugar for “special” GROUPING SETS configurations like ours, where we list all the possible column combinations. In this case, we can use CUBE()!

SELECT 
  GROUPING_ID (length, language_id, rating),
  length,
  language_id,
  rating,
  count(*)
FROM (
  SELECT
    CASE WHEN length BETWEEN 120 AND 150 
         THEN 1 ELSE 0 END length,
    CASE WHEN language_id = 1            
         THEN 1 ELSE 0 END language_id,
    CASE WHEN rating = 'PG'              
         THEN 1 ELSE 0 END rating
  FROM film
) film
GROUP BY CUBE (length, language_id, rating)
HAVING COALESCE(length, 1) != 0 
AND COALESCE(language_id, 1) != 0 
AND COALESCE(rating, 1) != 0
ORDER BY GROUPING_ID (length, language_id, rating) DESC

Performance

Such a comparison blog post wouldn't be complete without a performance benchmark. This time, I'll be benchmarking only on Oracle, as PostgreSQL doesn't support PIVOT, and SQL Server's PIVOT is more limited than Oracle's.

Here’s the complete benchmark:

SET SERVEROUTPUT ON
DECLARE
  v_ts TIMESTAMP WITH TIME ZONE;
  v_repeat CONSTANT NUMBER := 2000;
BEGIN

  -- Repeat the whole benchmark several times to avoid warmup penalty
  FOR r IN 1..5 LOOP

    -- Individual statements
    v_ts := SYSTIMESTAMP;
      
    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT count(*) FROM film
      ) LOOP
        NULL;
      END LOOP;

      FOR rec IN (
        SELECT count(*) FROM film 
        WHERE length BETWEEN 120 AND 150
      ) LOOP
        NULL;
      END LOOP;

      FOR rec IN (
        SELECT count(*) FROM film 
        WHERE language_id = 1
      ) LOOP
        NULL;
      END LOOP;

      FOR rec IN (
        SELECT count(*) FROM film 
        WHERE rating = 'PG'
      ) LOOP
        NULL;
      END LOOP;

      FOR rec IN (
        SELECT count(*) FROM film 
        WHERE length BETWEEN 120 AND 150
        AND language_id = 1
      ) LOOP
        NULL;
      END LOOP;

      FOR rec IN (
        SELECT count(*) FROM film 
        WHERE length BETWEEN 120 AND 150
        AND rating = 'PG'
      ) LOOP
        NULL;
      END LOOP;

      FOR rec IN (
        SELECT count(*) FROM film 
        WHERE language_id = 1
        AND rating = 'PG'
      ) LOOP
        NULL;
      END LOOP;

      FOR rec IN (
        SELECT count(*) FROM film 
        WHERE length BETWEEN 120 AND 150
        AND language_id = 1
        AND rating = 'PG'
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;
      
    dbms_output.put_line('Run ' || r ||', Statement 1 : ' || (SYSTIMESTAMP - v_ts));

    -- Manual PIVOT
    v_ts := SYSTIMESTAMP;
      
    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT
          count(*),
          count(length),
          count(language_id),
          count(rating),
          count(length + language_id),
          count(length + rating),
          count(language_id + rating),
          count(length + language_id + rating)
        FROM (
          SELECT
            CASE WHEN length BETWEEN 120 AND 150 THEN 1 END length,
            CASE WHEN language_id = 1            THEN 1 END language_id,
            CASE WHEN rating = 'PG'              THEN 1 END rating
          FROM film
        ) film
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;
      
    dbms_output.put_line('Run ' || r ||', Statement 2 : ' || (SYSTIMESTAMP - v_ts));
    
    -- PIVOT
    v_ts := SYSTIMESTAMP;
      
    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT 
          a + b + c + d + e + f + g + h,
                          e + f + g + h,
                  c + d         + g + h,
              b     + d     + f     + h,
                                  g + h,
                              f     + h,
                      d             + h,
                                      h
        FROM (
          SELECT
            CASE WHEN length BETWEEN 120 AND 150 THEN 1 ELSE 0 END length,
            CASE WHEN language_id = 1            THEN 1 ELSE 0 END language_id,
            CASE WHEN rating = 'PG'              THEN 1 ELSE 0 END rating
          FROM film
        ) film
        PIVOT (
          count(*) FOR (length, language_id, rating) IN (
            (0, 0, 0) AS a,
            (0, 0, 1) AS b,
            (0, 1, 0) AS c,
            (0, 1, 1) AS d,
            (1, 0, 0) AS e,
            (1, 0, 1) AS f,
            (1, 1, 0) AS g,
            (1, 1, 1) AS h
          )
        )
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;
      
    dbms_output.put_line('Run ' || r ||', Statement 3 : ' || (SYSTIMESTAMP - v_ts));

    -- GROUPING SETS
    v_ts := SYSTIMESTAMP;
      
    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT 
          GROUPING_ID (length, language_id, rating),
          length,
          language_id,
          rating,
          count(*)
        FROM (
          SELECT
            CASE WHEN length BETWEEN 120 AND 150 THEN 1 ELSE 0 END length,
            CASE WHEN language_id = 1            THEN 1 ELSE 0 END language_id,
            CASE WHEN rating = 'PG'              THEN 1 ELSE 0 END rating
          FROM film
        ) film
        GROUP BY CUBE (length, language_id, rating)
        HAVING COALESCE (length, 1) != 0 
        AND COALESCE (language_id, 1) != 0 
        AND COALESCE (rating, 1) != 0
        ORDER BY GROUPING_ID (length, language_id, rating) DESC
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;
      
    dbms_output.put_line('Run ' || r ||', Statement 4 : ' || (SYSTIMESTAMP - v_ts));
  END LOOP;
END;
/

And the results:

Run 1, Statement 1 : +000000000 00:00:01.928497000
Run 1, Statement 2 : +000000000 00:00:01.136341000
Run 1, Statement 3 : +000000000 00:00:02.751679000
Run 1, Statement 4 : +000000000 00:00:00.797529000

Run 2, Statement 1 : +000000000 00:00:01.695543000
Run 2, Statement 2 : +000000000 00:00:01.004073000
Run 2, Statement 3 : +000000000 00:00:02.490895000
Run 2, Statement 4 : +000000000 00:00:00.838979000

Run 3, Statement 1 : +000000000 00:00:01.634047000
Run 3, Statement 2 : +000000000 00:00:01.016266000
Run 3, Statement 3 : +000000000 00:00:02.566895000
Run 3, Statement 4 : +000000000 00:00:00.790159000

Run 4, Statement 1 : +000000000 00:00:01.669844000
Run 4, Statement 2 : +000000000 00:00:01.015502000
Run 4, Statement 3 : +000000000 00:00:02.574646000
Run 4, Statement 4 : +000000000 00:00:00.807804000

Run 5, Statement 1 : +000000000 00:00:01.653498000
Run 5, Statement 2 : +000000000 00:00:00.980375000
Run 5, Statement 3 : +000000000 00:00:02.556186000
Run 5, Statement 4 : +000000000 00:00:00.890283000

Very disappointingly, the PIVOT solution is the slowest every time. I’m assuming there’s some substantial temporary object overhead which wouldn’t be as severe if the table were much larger, but clearly, the manual PIVOT solution (COUNT(CASE ...)) and the GROUPING SETS solution heavily outperform the initial attempt, where we calculate 8 counts individually.

To get back to the original report where 32 counts were calculated: the report ran roughly 20x as fast with the manual PIVOT approach on 200M rows. And imagine if you also need to JOIN: you definitely want to avoid those 32 individual queries and calculate everything in one go.

Cheers!

jOOQ Tuesdays: Richard North Makes Database Testing More Reproducible with Testcontainers

Welcome to the jOOQ Tuesdays series. In this series, we’ll publish an article on the third Tuesday every other month where we interview someone we find exciting in our industry from a jOOQ perspective. This includes people who work with SQL, Java, Open Source, and a variety of other related topics.

I’m very excited to feature today Richard North, creator of Testcontainers, a promising new database and UI testing tool for reproducible tests.

Hi Richard – you work at Skyscanner, which means you have tons of travel data to work with. What’s the most exciting thing about working with this data?

Skyscanner is a really data-led company, using data at all levels in decision-making. The volume of data we gather and process really helps us understand what travellers want and serve them better. For example, our destination recommender system helps people discover new and interesting places to go, based upon a vast amount of data, algorithms and experiments. But it's interesting how this differs from recommendations in internet media companies: there are far fewer possible destinations than books, songs and movies, yet users' reasons for travelling and their tastes can be more nuanced and varied.

There’s yet more data-oriented work under the surface, too – for example, the infrastructure needed to gather and analyse such large amounts of data, and the running of experiments to help us improve. There are a lot of smart people working hard to make this all happen, and it’s an exciting place to be!

You’ve created an increasingly popular testing framework, Testcontainers. What made you do it? What itch does it scratch?

Well, I think it's something that scratches several itches at the same time – things that previously we only had isolated solutions to. The common element, though, is reproducibility of test environments. In all my time developing and writing tests for JVM-based systems, it's always been the non-JVM dependencies that have caused the most complexity, unreliability and maintenance overhead.

I remember my first day as a developer, years ago: I was given a desktop machine and 2 days’ worth of step-by-step instructions that I needed to follow – just so that I’d be able to develop and run tests with all dependencies in place. A few months later I had to repeat the same task many times over when building new CI servers.

A lot has changed since then in terms of how we deploy and manage our production infrastructure, and thankfully Docker has done a lot to further bring prod-parity to developers’ machines.

Testcontainers started out as my effort to bring the full power of Docker to integrated testing on the JVM in two areas that I’ve experienced the most pain: testing against a clean, representative database, and making browser-based selenium testing more reproducible, both for developers and on CI.

We're mostly curious about testing databases here. Your documentation mentions Testcontainers as an alternative to using H2 as a test database. What are the disadvantages of emulating a database, e.g. with H2? Do you have any personal experience with that?

Yes, definitely – it was one of the tipping point factors that triggered me to create Testcontainers. I do think H2 is a fantastic piece of work in what it manages to deliver, and it’s something I’ve used on a number of projects to good effect.

However, compatibility with real databases has often been a sticking point. Back in 2015, before I started Testcontainers, we were struggling with a few MySQL features that didn’t have equivalents in H2. We were facing the unpleasant prospect of having to constrain our implementation to what H2 would allow us to test against. It became fairly obvious that there was a gap in the market for an H2-like tool that was actually a facade to a Docker-based database container – and Testcontainers was born.

What do you think of mocking the database at any layer, including the DAO layer, service layer, etc.?

I’m all in favour of keeping tests small, light and layered, and using mocks to accomplish this. This might sound strange coming from somebody who has developed an integrated testing tool, but it’s true!

Still, I feel that we need to be pragmatic about how we approach automated tests and how we make sure we’re testing the right thing – especially when crossing boundaries. Are we testing how this code behaves against reality, or are we testing against our own (potentially false) understanding of how external components work?

My feeling is that it’s quite straightforward to mock layers of your system that you yourself wrote, or where you can easily jump into the source code, a spec or documentation. With an external component, you can still produce a mock that behaves how you expect, or how you witness the real thing behaving. But does that mock continue to represent the real thing, especially after accretion of other features, or the additional perils of state that a database entails – schema changes and actual data?

My ideal is to mock the data access layer for consumption by higher layers, but to be quite careful about what the data access layer itself talks to in my tests. It should probably be a real database. Hopefully Testcontainers is one tool that helps make this particular thing a little less painful – so that when you find yourself needing to do this, there’s a way to do it easily.

What’s the biggest challenge you’ve faced when testing databases, or other things?

It's not databases, but I'd say that by far the hardest testing challenge I've faced as a developer is mobile apps, especially iOS. I've always enjoyed mobile development as a whole, but when switching from a Java server-side/web project to mobile, it really feels like you're going back in time. Some of the challenges are harder – such as asynchronicity and platform APIs that make it harder to structure software in a testable way. But it also feels like the tooling is much further behind, and until quite recently it received far less attention. I feel the net result has been that developers have been discouraged from investing in automated tests, which is sad given that we know how valuable they can be.

Things are getting better, but I do greatly prefer the testing aspects of working on server-side JVM projects. For all its difficulties, we are actually quite lucky!

How to Fetch Multiple Oracle Execution Plans in One Nice Query

When looking at execution plans in Oracle, we have to go through several steps before we can call the DBMS_XPLAN package functions. In fact, we have to find out the SQL_ID of a given statement first, and only then can we get its plan. I've blogged about this previously, here.
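
As a reminder, the classic two-step approach looks something like this (a sketch; the LIKE pattern is just a hypothetical marker to locate your statement, and the SQL_ID / CHILD_NUMBER values in step 2 must be copied from the result of step 1):

-- Step 1: Find the SQL_ID (and CHILD_NUMBER) of your statement
SELECT sql_id, child_number, sql_text
FROM v$sql
WHERE sql_text LIKE '%some marker contained in your statement%';

-- Step 2: Fetch the plan, using the values from step 1
SELECT *
FROM TABLE (dbms_xplan.display_cursor (
  '3gv1fd3dcj3b0', 0, 'ALLSTATS LAST'
));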

However, thanks to lateral unnesting, we can do the two steps in one go. More than that, we can even fetch several plans in one go this way! Check this out…

Let’s run the queries from the previous benchmarking blog post, and let’s add the /*+GATHER_PLAN_STATISTICS*/ hint to get some actual execution values, not just estimates:

SELECT /*+GATHER_PLAN_STATISTICS*/ 
  first_name, last_name, count(fa.actor_id) AS c
FROM actor a
LEFT JOIN film_actor fa
ON a.actor_id = fa.actor_id
WHERE last_name LIKE 'A%'
GROUP BY a.actor_id, first_name, last_name
ORDER BY c DESC
;

SELECT /*+GATHER_PLAN_STATISTICS*/
  first_name, last_name, (
    SELECT count(*)
    FROM film_actor fa
    WHERE a.actor_id =
    fa.actor_id
  ) AS c
FROM actor a
WHERE last_name LIKE 'A%' 
ORDER BY c DESC
;

Both queries do the same thing: they try to find those actors whose last name starts with the letter A and count their corresponding films. Now, what about the execution plans? Run the following query, and you don't have to know any SQL_ID in advance:

SELECT s.sql_id, p.*
FROM v$sql s, TABLE (
  dbms_xplan.display_cursor (
    s.sql_id, s.child_number, 'ALLSTATS LAST'
  )
) p
WHERE s.sql_text LIKE '%/*+GATHER_PLAN_STATISTICS*/%';

As you can see, with “LATERAL unnesting”, we can unnest a nested table (as returned by DBMS_XPLAN.DISPLAY_CURSOR) into the calling query, and we can pass column values from the V$SQL table to each function call. In other words, the above query reads as:

  • Get all SQL statements from the cursor cache V$SQL
  • Keep only those that contain our GATHER_PLAN_STATISTICS hint (replace this with your own query matching pattern)
  • Cross-join the table unnested from DBMS_XPLAN.DISPLAY_CURSOR, which gives us the plan per SQL_ID and CHILD_NUMBER

This implicit “LATERAL unnesting” is a bit obscure in my opinion (but it's brief). More formally correct would be to use the actual LATERAL keyword, or better, the SQL Server-style CROSS APPLY:

-- LATERAL: A bit verbose
SELECT s.sql_id, p.*
FROM v$sql s CROSS JOIN LATERAL (SELECT * FROM TABLE (
  dbms_xplan.display_cursor (
    s.sql_id, s.child_number, 'ALLSTATS LAST'
  )
)) p
WHERE s.sql_text LIKE '%/*+GATHER_PLAN_STATISTICS*/%';

-- CROSS APPLY: Very neat!
SELECT s.sql_id, p.*
FROM v$sql s CROSS APPLY TABLE (
  dbms_xplan.display_cursor (
    s.sql_id, s.child_number, 'ALLSTATS LAST'
  )
) p
WHERE s.sql_text LIKE '%/*+GATHER_PLAN_STATISTICS*/%';

In any case, the result is the same (I’ve removed some columns for brevity):

SQL_ID          PLAN_TABLE_OUTPUT
3gv1fd3dcj3b0	SQL_ID  3gv1fd3dcj3b0, child number 0
3gv1fd3dcj3b0	-------------------------------------
3gv1fd3dcj3b0	SELECT /*+GATHER_PLAN_STATISTICS*/ first_name, last_name, 
3gv1fd3dcj3b0	count(fa.actor_id) AS c FROM actor a LEFT JOIN film_actor fa ON 
3gv1fd3dcj3b0	a.actor_id = fa.actor_id WHERE last_name LIKE 'A%' GROUP BY a.actor_id, 
3gv1fd3dcj3b0	first_name, last_name ORDER BY c DESC
3gv1fd3dcj3b0	 
3gv1fd3dcj3b0	Plan hash value: 3014447605
3gv1fd3dcj3b0	 
3gv1fd3dcj3b0	-----------------------------------------------------------------------------------------------------
3gv1fd3dcj3b0	| Id  | Operation                              | Name                    | Starts | E-Rows | A-Rows |
3gv1fd3dcj3b0	-----------------------------------------------------------------------------------------------------
3gv1fd3dcj3b0	|   0 | SELECT STATEMENT                       |                         |      1 |        |      7 |
3gv1fd3dcj3b0	|   1 |  SORT ORDER BY                         |                         |      1 |      7 |      7 |
3gv1fd3dcj3b0	|   2 |   HASH GROUP BY                        |                         |      1 |      7 |      7 |
3gv1fd3dcj3b0	|*  3 |    HASH JOIN OUTER                     |                         |      1 |    154 |    196 |
3gv1fd3dcj3b0	|   4 |     TABLE ACCESS BY INDEX ROWID BATCHED| ACTOR                   |      1 |      6 |      7 |
3gv1fd3dcj3b0	|*  5 |      INDEX RANGE SCAN                  | IDX_ACTOR_LAST_NAME     |      1 |      6 |      7 |
3gv1fd3dcj3b0	|   6 |     INDEX FAST FULL SCAN               | IDX_FK_FILM_ACTOR_ACTOR |      1 |   5462 |   5462 |
3gv1fd3dcj3b0	-----------------------------------------------------------------------------------------------------
3gv1fd3dcj3b0	 
3gv1fd3dcj3b0	Predicate Information (identified by operation id):
3gv1fd3dcj3b0	---------------------------------------------------
3gv1fd3dcj3b0	 
3gv1fd3dcj3b0	   3 - access("A"."ACTOR_ID"="FA"."ACTOR_ID")
3gv1fd3dcj3b0	   5 - access("A"."LAST_NAME" LIKE 'A%')
3gv1fd3dcj3b0	       filter("A"."LAST_NAME" LIKE 'A%')
3gv1fd3dcj3b0	 
3gv1fd3dcj3b0	Note
3gv1fd3dcj3b0	-----
3gv1fd3dcj3b0	   - dynamic statistics used: dynamic sampling (level=2)
3gv1fd3dcj3b0	   - this is an adaptive plan
3gv1fd3dcj3b0	   - 1 Sql Plan Directive used for this statement
3gv1fd3dcj3b0	 
6a3nrpcw22avr	SQL_ID  6a3nrpcw22avr, child number 0
6a3nrpcw22avr	-------------------------------------
6a3nrpcw22avr	SELECT /*+GATHER_PLAN_STATISTICS*/ first_name, last_name, (   SELECT 
6a3nrpcw22avr	count(*)   FROM film_actor fa   WHERE a.actor_id =   fa.actor_id ) AS c 
6a3nrpcw22avr	FROM actor a WHERE last_name LIKE 'A%'  ORDER BY c DESC
6a3nrpcw22avr	 
6a3nrpcw22avr	Plan hash value: 3873085786
6a3nrpcw22avr	 
6a3nrpcw22avr	---------------------------------------------------------------------------------------------------
6a3nrpcw22avr	| Id  | Operation                            | Name                    | Starts | E-Rows | A-Rows |
6a3nrpcw22avr	---------------------------------------------------------------------------------------------------
6a3nrpcw22avr	|   0 | SELECT STATEMENT                     |                         |      1 |        |      7 |
6a3nrpcw22avr	|   1 |  SORT AGGREGATE                      |                         |      7 |      1 |      7 |
6a3nrpcw22avr	|*  2 |   INDEX RANGE SCAN                   | IDX_FK_FILM_ACTOR_ACTOR |      7 |     27 |    196 |
6a3nrpcw22avr	|   3 |  SORT ORDER BY                       |                         |      1 |      6 |      7 |
6a3nrpcw22avr	|   4 |   TABLE ACCESS BY INDEX ROWID BATCHED| ACTOR                   |      1 |      6 |      7 |
6a3nrpcw22avr	|*  5 |    INDEX RANGE SCAN                  | IDX_ACTOR_LAST_NAME     |      1 |      6 |      7 |
6a3nrpcw22avr	---------------------------------------------------------------------------------------------------
6a3nrpcw22avr	 
6a3nrpcw22avr	Predicate Information (identified by operation id):
6a3nrpcw22avr	---------------------------------------------------
6a3nrpcw22avr	 
6a3nrpcw22avr	   2 - access("FA"."ACTOR_ID"=:B1)
6a3nrpcw22avr	   5 - access("LAST_NAME" LIKE 'A%')
6a3nrpcw22avr	       filter("LAST_NAME" LIKE 'A%')
6a3nrpcw22avr	 

This is really neat!