Prevent SQL Injection with SQL Builders Like jOOQ


As long as we allow ourselves to write string-based dynamic SQL embedded in other programming languages like Java, we will have a certain risk of being vulnerable to SQL injection. That’s a fact. Don’t believe it? Check out this website, which exposes SQL injection vulnerabilities found in Stack Overflow PHP questions:

https://laurent22.github.io/so-injections

In a previous blog post, I’ve shown how fatal a single such vulnerability can be, if discovered. A lot of blog posts out there warn about the potential of attackers injecting a DROP DATABASE statement.


Also, everyone knows this famous xkcd:
https://xkcd.com/327

But in my opinion, a much more important threat is not immediate damage to your system, but data leakage. Using tools like sqlmap, every script kiddie can download your credit card information and other sensitive data from your database.

The best remedy: Avoid string-based embedded SQL

The best remedy is to avoid string-based embedded SQL whenever you can, i.e. try not to do things like this too often:

try (PreparedStatement s = c.prepareStatement(
    "SELECT first_name, last_name "
  + "FROM users "
  + "WHERE user_id = ? ")) {

    s.setInt(1, userId);
    try (ResultSet rs = s.executeQuery()) {
        ...
    }
}

Sure, there is currently nothing wrong with the above Java code. There is no vulnerability as we’re using a bind variable. But the risk of some developer not paying attention and accidentally adding a vulnerability when adding another predicate is too high.
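
For illustration, here’s a sketch of how such a vulnerability typically sneaks in later on (the lastName parameter is hypothetical, user-supplied input):

try (PreparedStatement s = c.prepareStatement(
    "SELECT first_name, last_name "
  + "FROM users "
  + "WHERE user_id = ? "
  // The new predicate concatenates user input: SQL injection!
  + "AND last_name = '" + lastName + "' ")) {

    s.setInt(1, userId);
    ...
}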

Fine, so we’re not using string-based embedded SQL. But what are the alternatives?

Use a query DSL like jOOQ to build your dynamic SQL

APIs like jOOQ help you build SQL statements in a type safe, composable way as if the Java language actually understood SQL. The previous JDBC prepared statement now translates to the following:

Result<?> result =
ctx.select(USERS.FIRST_NAME, USERS.LAST_NAME)
   .from(USERS)
   // Bind variable embedded in statement
   .where(USERS.USER_ID.eq(userId))
   .fetch();

In jOOQ, the underlying JDBC PreparedStatement is created transparently and the userId bind variable is placed right in the middle of the statement so you don’t have to worry about the boring details. There’s no way you can have any accidental SQL injection vulnerability in such jOOQ API calls, because every SQL clause and expression is part of an expression tree that is managed by jOOQ.

And what’s best: The Java compiler can now type check your SQL statement to a certain degree. This is a huge benefit in terms of productivity and code quality.
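
For example, assuming the code generator produced USERS.USER_ID as a Field<Integer> from an INTEGER column (a sketch, not actual generated code), the compiler will reject predicates that compare the column with a value of the wrong type:

// Compiles, because the types match:
Condition ok = USERS.USER_ID.eq(userId);

// Doesn't compile, because a String is not an Integer:
// Condition notOk = USERS.USER_ID.eq("42");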

Of course, SQL builders aren’t a perfect shield for SQL injection, as they usually expose some API to insert custom SQL strings. For instance, in jOOQ, you can write:

Result<?> result =
ctx.select(USERS.FIRST_NAME, USERS.LAST_NAME)
   .from(USERS)
   // Bind variable embedded in "plain SQL" string
   .where("user_id = ?", userId)
   .fetch();

This alternative API usage is just as convenient as the previous one. The bind variable is located right where it is needed (not applied later on a PreparedStatement), but of course this is again string-based embedded SQL, which is potentially vulnerable if you get it wrong.

The important thing here is that the string-based method is the exception, which is used only very rarely when you need a very advanced SQL feature that is not supported by jOOQ. By default, you’re using the “safe” approach without strings. And since jOOQ 3.9, you can also add a type checker that generates a compilation warning or error as soon as you’re using jOOQ’s “plain SQL” API. Read more about that here:
JSR-308 and the Checker Framework Add Even More Typesafety to jOOQ 3.9

Use views and stored procedures

The safest and least intrusive way to prevent such problems, of course, is not to use embedded SQL at all, but to use views and/or stored procedures instead. By moving and storing the SQL statements into the database, you get a few advantages in addition to the lack of SQLi vulnerability:

  • The database can type-check your statement in its entirety
  • You can easily reuse a statement in different applications, e.g. not all written in Java
  • You could even revoke grants to the underlying tables and grant access only to these views and stored procedures, which adds another layer of security to your system

If you’re writing PL/SQL in Oracle, the previous statement would be transformed to something like this:

FOR rec IN (
  SELECT first_name, last_name
  FROM users
  -- Bind variable again embedded in SQL statement
  WHERE user_id = l_user_id
) LOOP
  ...
END LOOP;

Just like with jOOQ, you can easily embed the SQL statement in the “host language”, except that this time, it isn’t Java but PL/SQL. There’s absolutely no way the above statement will ever be vulnerable to SQL injection, as the statement isn’t even a dynamic SQL statement anymore.

Again, you can write dynamic SQL in PL/SQL (or other databases’ stored procedure languages) as well, e.g. by using EXECUTE IMMEDIATE, but here, too, the important thing is that dynamic, string-based, embedded SQL is the exception, not the default.

Now, if you do decide to use stored procedures, then jOOQ is again here to help you, as you can easily call that procedure from jOOQ, again in a type safe way. More about that in this blog post:
Painless Access from Java to PL/SQL Procedures with jOOQ
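
As a teaser, here’s a hedged sketch of what such a call looks like (the actual class and method names depend on your schema and code generator configuration):

// Hypothetical generated code for a stored function GET_USER_NAME:
String name = Routines.getUserName(configuration, userId);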

Beware! SQL isn’t the only language vulnerable to injection

Vlad Mihalcea demonstrated an equally important threat to JPA based applications. Scroll down in his blog post to find him mentioning JPQL injection:
A beginner’s guide to SQL injection and how you should prevent it

Yes, if you’re doing something as silly as this:

public List<Post> getPostsByTitle(String title) {
    return doInJPA(entityManager -> {
        return entityManager.createQuery(
            "select p " +
            "from Post p " +
            "where" +
            "   p.title = '" + title + "'", Post.class)
        .getResultList();
    });
}

… then an attacker can inject any sort of JPQL code into the title variable. The possibilities are a bit more limited than with SQL injection, but it is still perfectly possible to read and dump the entire database (including credit card information, remember?) from such a vulnerability.
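
The remedy is the same as with SQL: use bind parameters. Here’s a sketch of the corrected version (same hypothetical Post entity and doInJPA helper as above):

public List<Post> getPostsByTitle(String title) {
    return doInJPA(entityManager -> {
        return entityManager.createQuery(
            "select p " +
            "from Post p " +
            "where p.title = :title", Post.class)
        // The title is now bound as a parameter, not concatenated
        .setParameter("title", title)
        .getResultList();
    });
}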

Again, if you’re doing this quite often, you should consider switching to a more type safe way to write JPQL (e.g. using the infamous Criteria API), or again switch to SQL and jOOQ.

Read Vlad’s post for more details:
A beginner’s guide to SQL injection and how you should prevent it

Conclusion

There are many external DSLs, like SQL, JPQL, XPath, regular expressions, and whatnot. Some of them are extremely powerful, and they’re used to operate on sensitive data. This means that if you hand over control of the language to someone outside of your application, you’re very vulnerable.

Vulnerability mostly happens when you embed those external DSLs into Java in a string-based form. The best remedies are:

  • Use an internal DSL that models the external DSL instead
  • Keep the external DSL external, e.g. by using views and stored procedures

Both of these approaches work because they both avoid strings.

The Java Ecosystem’s Obsession with NonNull Annotations


I’m not well known for my love of annotations. While I do recognise that they can serve a very limited purpose in some areas (e.g. hinting stuff to the compiler or extending the language where we don’t want new keywords), I certainly don’t think they were ever meant to be used for API design.

“Unfortunately” (but this is a matter of taste), Java 8 introduced type annotations: an entirely new extension to the annotation type system, which allows you to do things like:

@Positive int positive = 1;

Thus far, I’ve seen comparable type restriction features only in languages like Ada or PL/SQL, in a much more rigid form, but other languages may have similar features.

The nice thing about Java 8’s implementation is the fact that the meaning of the type of the above local variable (@Positive int) is unknown to the Java compiler (and to the runtime), until you write and activate a specific compiler plugin to enforce your custom meaning. The easiest way to do that is by using the Checker Framework (and yes, we’re guilty at jOOQ, too: we have our own checker for SQL dialect validation). You can implement any semantics, for instance:

// This compiles because @Positive int is a subtype of int
int number = positive;

// Doesn't compile, because number might be negative
@Positive int positive2 = number;

// Doesn't compile, because -1 is negative
@Positive int positive3 = -1;

As you can see, using type annotations is a very strategic decision. Either you want to create hundreds of types in this parallel universe, or, in my opinion, you had better leave this set of features alone, because probably: YAGNI

Unfortunately, and to the disappointment of Mike Ernst, the author of the Checker Framework (whom I’ve talked to about this some years ago), most people abuse this new JSR-308 feature for boring and simple null checking. For instance, just recently, there was a feature request on the popular Javaslang library to add support for such annotations, which help users and IDEs guarantee that Javaslang API methods return non-null results.

Please no. Don’t use this atomic bomb for boring null checks

Let me make this very clear:

Type annotations are the wrong tool to enforce nullability

– Lukas Eder, timeless

You may quote me on that. The only exception to the above: if you strategically embrace JSR-308 type annotations in every possible way and start adding annotations for all sorts of type restrictions, including the @Positive example that I’ve given, then yes, adding nullability annotations won’t hurt you much anymore, as your types will take 50 lines of code to declare and reference anyway. But frankly, this is an extremely niche approach to type systems that only few general purpose programs, let alone publicly available APIs, can profit from. If in doubt, don’t use type annotations.

One important reason why a library like Javaslang shouldn’t add such annotations is the fact that in a library like Javaslang, you can be very sure that you will hardly ever encounter null, because references in Javaslang are mostly one of three things:

  • A collection, which is never null, but may be empty
  • An Option which replaces null (it is in fact a collection of cardinality 0..1; see the sketch after this list)
  • A non-null reference (because in the presence of Option, all references can be expected to be non-null)
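
For illustration, here’s what the second bullet looks like in practice (a sketch using Javaslang’s Option API; possiblyNull() is a hypothetical method that may return null):

// Option.of() wraps a possibly-null reference in a non-null Option
Option<String> name = Option.of(possiblyNull());

// Client code never sees null; it maps over the value or falls back
String greeting = name.map(n -> "Hello, " + n).getOrElse("Hello, stranger");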

Of course, these rules aren’t valid for every API. There are some low quality APIs out there that return “unexpected” null values, or leak “internal” null values (and historically, some of the JDK APIs, unfortunately, are part of these “low quality APIs”). But Javaslang is not one of them, and the APIs you are designing also shouldn’t be one of them.

So, let go of your null fear. null is not a problem in well-designed software. You can spare yourself the work of adding a @NonNull annotation on 99% of all of your types just to shut up your IDE, in case you turned on those warnings. Focus on writing high-quality software rather than bikeshedding null.

Because: YAGNI.

And, if you haven’t had enough bikeshedding already, consider watching this entertaining talk by Stuart Marks on the subject.

Use jOOQ to Read / Write Oracle PL/SQL RECORD Types


One of the biggest limitations when working with Oracle PL/SQL from Java is the lack of support for a variety of PL/SQL features through the JDBC interface. This lack of support is actually not limited to JDBC, but extends to Oracle SQL as well. For instance, if you’re using the useful PL/SQL BOOLEAN type as such:

CREATE OR REPLACE FUNCTION yes RETURN boolean AS
BEGIN
  RETURN true;
END yes;
/

It would now be terrific if you could do this:

SELECT yes FROM dual;

But it’s not possible. You’ll be getting an error along the lines of the following:

ORA-06552: PL/SQL: Statement ignored
ORA-06553: PLS-382: expression is of wrong type
06552. 00000 -  "PL/SQL: %s"
*Cause:    
*Action:

It’s crazy to think that the Oracle SQL language still doesn’t support the SQL standard boolean type, which is so useful, as I’ve shown in previous blog posts. Here’s where you can upvote the feature request:

https://community.oracle.com/ideas/2633

BOOLEAN isn’t the only “inaccessible” SQL feature

Now, there are a couple of other data types, which cannot be “bridged” to the SQL engine, and thus (for some reason only the OJDBC driver gods can fathom) cannot be “bridged” to a JDBC client. Among them: The very useful PL/SQL RECORD type.

Very often, you want to do this:

CREATE PACKAGE customers AS
  TYPE person IS RECORD (
    first_name VARCHAR2(50),
    last_name VARCHAR2(50)
  );
  
  FUNCTION get_customer(p_customer_id NUMBER) RETURN person;
END customers;
/

CREATE PACKAGE BODY customers AS
  FUNCTION get_customer(p_customer_id NUMBER) RETURN person IS
    v_person customers.person;
  BEGIN
    SELECT c.first_name, c.last_name
    INTO v_person
    FROM customer c
    WHERE c.customer_id = p_customer_id;
    
    RETURN v_person;
  END get_customer;
END customers;
/

(we’re running this on the SAKILA database).

As in any language with the least bit of sophistication, we can define “structs” or records in PL/SQL, which we can then reuse all over the place. Everyone knows what a PERSON is, and we can pass one around between procedures and functions.

For instance, in PL/SQL client code:

SET SERVEROUTPUT ON
DECLARE
  v_person customers.person;
BEGIN
  v_person := customers.get_customer(1);
  
  dbms_output.put_line(v_person.first_name);
  dbms_output.put_line(v_person.last_name);
END;
/

… which yields:

MARY
SMITH

What about JDBC client code?

After having added support for PL/SQL BOOLEAN types, jOOQ 3.9 now finally supports PL/SQL record types in stored procedures as well, at least in standalone calls, which are not embedded in SQL statements. The jOOQ code generator will pick up all of these package-level PL/SQL record types and their structures, and generate the boring boilerplate code for you. E.g. (simplified):

package org.jooq.demo.sakila.packages.customers.records;

public class PersonRecord extends UDTRecordImpl<PersonRecord> {
    public void   setFirstName(String value) { ... }
    public String getFirstName()             { ... }
    public void   setLastName(String value)  { ... }
    public String getLastName()              { ... }

    public PersonRecord() { ... }
    public PersonRecord(String firstName, String lastName) { ... }
}

Notice how jOOQ doesn’t really make any distinction in its API between the generated code for an Oracle SQL OBJECT type and an Oracle PL/SQL RECORD type. They’re essentially the same thing (from a jOOQ API perspective).

More interestingly, what happened to the generated package and the function? This code is generated (simplified):

public class Customers extends PackageImpl {
    public static PersonRecord getCustomer(
        Configuration configuration, Long pCustomerId
    ) { ... }
}

That’s all! So all we now need to do is pass the ubiquitous jOOQ Configuration (which contains information such as SQLDialect or JDBC Connection) and the actual stored function parameter, the P_CUSTOMER_ID value, and we’re done!

This is how jOOQ client code might look:

PersonRecord person = Customers.getCustomer(configuration(), 1L);
System.out.println(person);

As you can see, this is just the same thing as the corresponding PL/SQL code. And the output of this println call is this:

SAKILA.CUSTOMERS.PERSON('MARY', 'SMITH')

A fully qualified RECORD declaration with schema, package, and type name.

How does it work?

Let’s turn on jOOQ’s built-in TRACE logging to see what jOOQ did behind the scenes:

Calling routine          : 
  DECLARE
    r1 "CUSTOMERS"."PERSON";
  BEGIN
    r1 := "CUSTOMERS"."GET_CUSTOMER"("P_CUSTOMER_ID" => ?);
    ? := r1."FIRST_NAME";
    ? := r1."LAST_NAME";
  END;
Binding variable 1       : 1 (class java.lang.Long)
Registering variable 2   : class java.lang.String
Registering variable 3   : class java.lang.String
Fetched OUT parameters   : +-----------------+
                         : |RETURN_VALUE     |
                         : +-----------------+
                         : |('MARY', 'SMITH')|
                         : +-----------------+

So, jOOQ usually doesn’t use JDBC’s very limited escape syntax to call stored procedures. It just produces an anonymous PL/SQL block with a local variable of type CUSTOMERS.PERSON, i.e. of our RECORD type. The function call’s result is assigned to this local variable, and the local variable is then destructured into its individual parts.

In the TRACE log, you can see the individual bind variables, i.e. there’s an IN variable at index 1 of type Long, and two OUT variables of type String at indexes 2 and 3.
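
In other words, what jOOQ runs is roughly equivalent to the following plain JDBC code (a sketch for illustration only):

try (CallableStatement s = connection.prepareCall(
    "DECLARE\n"
  + "  r1 \"CUSTOMERS\".\"PERSON\";\n"
  + "BEGIN\n"
  + "  r1 := \"CUSTOMERS\".\"GET_CUSTOMER\"(\"P_CUSTOMER_ID\" => ?);\n"
  + "  ? := r1.\"FIRST_NAME\";\n"
  + "  ? := r1.\"LAST_NAME\";\n"
  + "END;")) {

    // IN variable at index 1, OUT variables at indexes 2 and 3
    s.setLong(1, 1L);
    s.registerOutParameter(2, Types.VARCHAR);
    s.registerOutParameter(3, Types.VARCHAR);
    s.execute();

    String firstName = s.getString(2);
    String lastName  = s.getString(3);
}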

How does jOOQ know the record types?

At runtime, all the information is hard-wired into the generated code. So, the magic is inside the code generator. Warning: Some serious SQL ahead!

This beauty here queries the dictionary views for PL/SQL record types:

AllArguments a = ALL_ARGUMENTS.as("a");
AllArguments x = ALL_ARGUMENTS.as("x");
Field<BigDecimal> nextSibling = field(name("next_sibling"), x.SEQUENCE.getDataType());

DSL.using(configuration)
   .select(x.TYPE_OWNER, x.TYPE_NAME, x.TYPE_SUBNAME)
   .select(
       a.ARGUMENT_NAME                                               .as(ALL_TYPE_ATTRS.ATTR_NAME),
       a.SEQUENCE                                                    .as(ALL_TYPE_ATTRS.ATTR_NO),
       a.TYPE_OWNER                                                  .as(ALL_TYPE_ATTRS.ATTR_TYPE_OWNER),
       nvl2(a.TYPE_SUBNAME, a.TYPE_NAME, inline(null, a.TYPE_NAME))  .as("package_name"),
       coalesce(a.TYPE_SUBNAME, a.TYPE_NAME, a.DATA_TYPE)            .as(ALL_TYPE_ATTRS.ATTR_TYPE_NAME),
       a.DATA_LENGTH                                                 .as(ALL_TYPE_ATTRS.LENGTH),
       a.DATA_PRECISION                                              .as(ALL_TYPE_ATTRS.PRECISION),
       a.DATA_SCALE                                                  .as(ALL_TYPE_ATTRS.SCALE))
   .from(a)
   .join(table(
       select(
           a.TYPE_OWNER,
           a.TYPE_NAME,
           a.TYPE_SUBNAME,
           min(a.OWNER        ).keepDenseRankFirstOrderBy(a.OWNER, a.PACKAGE_NAME, a.SUBPROGRAM_ID, a.SEQUENCE).as(a.OWNER),
           min(a.PACKAGE_NAME ).keepDenseRankFirstOrderBy(a.OWNER, a.PACKAGE_NAME, a.SUBPROGRAM_ID, a.SEQUENCE).as(a.PACKAGE_NAME),
           min(a.SUBPROGRAM_ID).keepDenseRankFirstOrderBy(a.OWNER, a.PACKAGE_NAME, a.SUBPROGRAM_ID, a.SEQUENCE).as(a.SUBPROGRAM_ID),
           min(a.SEQUENCE     ).keepDenseRankFirstOrderBy(a.OWNER, a.PACKAGE_NAME, a.SUBPROGRAM_ID, a.SEQUENCE).as(a.SEQUENCE),
           min(nextSibling    ).keepDenseRankFirstOrderBy(a.OWNER, a.PACKAGE_NAME, a.SUBPROGRAM_ID, a.SEQUENCE).as(nextSibling),
           min(a.DATA_LEVEL   ).keepDenseRankFirstOrderBy(a.OWNER, a.PACKAGE_NAME, a.SUBPROGRAM_ID, a.SEQUENCE).as(a.DATA_LEVEL))
      .from(table(
          select(
              lead(a.SEQUENCE, 1, a.SEQUENCE).over(
                  partitionBy(a.OWNER, a.PACKAGE_NAME, a.SUBPROGRAM_ID, a.DATA_LEVEL)
                 .orderBy(a.SEQUENCE)
              ).as("next_sibling"),
              a.TYPE_OWNER,
              a.TYPE_NAME,
              a.TYPE_SUBNAME,
              a.OWNER,
              a.PACKAGE_NAME,
              a.SUBPROGRAM_ID,
              a.SEQUENCE,
              a.DATA_LEVEL,
              a.DATA_TYPE)
         .from(a)
         .where(a.OWNER.in(getInputSchemata()))
      ).as("a"))
      .where(a.TYPE_OWNER.in(getInputSchemata()))
      .and(a.OWNER.in(getInputSchemata()))
      .and(a.DATA_TYPE.eq("PL/SQL RECORD"))
      .groupBy(a.TYPE_OWNER, a.TYPE_NAME, a.TYPE_SUBNAME)
   ).as("x"))
   .on(row(a.OWNER, a.PACKAGE_NAME, a.SUBPROGRAM_ID).eq(x.OWNER, x.PACKAGE_NAME, x.SUBPROGRAM_ID))
   .and(a.SEQUENCE.between(x.SEQUENCE).and(nextSibling))
   .and(a.DATA_LEVEL.eq(x.DATA_LEVEL.plus(one())))
   .orderBy(x.TYPE_OWNER, x.TYPE_NAME, x.TYPE_SUBNAME, a.SEQUENCE)
   .fetch();

This is a nice little jOOQ query, which corresponds to the following, equally impressive SQL query, which you can run directly in SQL Developer or some other SQL client for Oracle:

SELECT 
  "x"."TYPE_OWNER",
  "x"."TYPE_NAME",
  "x"."TYPE_SUBNAME",
  "a"."ARGUMENT_NAME" "ATTR_NAME",
  "a"."SEQUENCE" "ATTR_NO",
  "a"."TYPE_OWNER" "ATTR_TYPE_OWNER",
  nvl2("a"."TYPE_SUBNAME", "a"."TYPE_NAME", NULL) "package_name",
  COALESCE("a"."TYPE_SUBNAME", "a"."TYPE_NAME", "a"."DATA_TYPE") "ATTR_TYPE_NAME",
  "a"."DATA_LENGTH" "LENGTH",
  "a"."DATA_PRECISION" "PRECISION",
  "a"."DATA_SCALE" "SCALE"
FROM "SYS"."ALL_ARGUMENTS" "a"
JOIN (
  SELECT 
    "a"."TYPE_OWNER",
    "a"."TYPE_NAME",
    "a"."TYPE_SUBNAME",
    MIN("a"."OWNER") KEEP (DENSE_RANK FIRST
      ORDER BY "a"."OWNER" ASC, "a"."PACKAGE_NAME" ASC, "a"."SUBPROGRAM_ID" ASC, "a"."SEQUENCE" ASC) "OWNER",
    MIN("a"."PACKAGE_NAME") KEEP (DENSE_RANK FIRST
      ORDER BY "a"."OWNER" ASC, "a"."PACKAGE_NAME" ASC, "a"."SUBPROGRAM_ID" ASC, "a"."SEQUENCE" ASC) "PACKAGE_NAME",
    MIN("a"."SUBPROGRAM_ID") KEEP (DENSE_RANK FIRST
      ORDER BY "a"."OWNER" ASC, "a"."PACKAGE_NAME" ASC, "a"."SUBPROGRAM_ID" ASC, "a"."SEQUENCE" ASC) "SUBPROGRAM_ID",
    MIN("a"."SEQUENCE") KEEP (DENSE_RANK FIRST
      ORDER BY "a"."OWNER" ASC, "a"."PACKAGE_NAME" ASC, "a"."SUBPROGRAM_ID" ASC, "a"."SEQUENCE" ASC) "SEQUENCE",
    MIN("next_sibling") KEEP (DENSE_RANK FIRST
      ORDER BY "a"."OWNER" ASC, "a"."PACKAGE_NAME" ASC, "a"."SUBPROGRAM_ID" ASC, "a"."SEQUENCE" ASC) "next_sibling",
    MIN("a"."DATA_LEVEL") KEEP (DENSE_RANK FIRST
      ORDER BY "a"."OWNER" ASC, "a"."PACKAGE_NAME" ASC, "a"."SUBPROGRAM_ID" ASC, "a"."SEQUENCE" ASC) "DATA_LEVEL"
  FROM (
    SELECT 
	  lead("a"."SEQUENCE", 1, "a"."SEQUENCE") OVER (
	    PARTITION BY "a"."OWNER", "a"."PACKAGE_NAME", "a"."SUBPROGRAM_ID", "a"."DATA_LEVEL" 
		ORDER BY "a"."SEQUENCE" ASC
	  ) "next_sibling",
      "a"."TYPE_OWNER",
      "a"."TYPE_NAME",
      "a"."TYPE_SUBNAME",
      "a"."OWNER",
      "a"."PACKAGE_NAME",
      "a"."SUBPROGRAM_ID",
      "a"."SEQUENCE",
      "a"."DATA_LEVEL",
      "a"."DATA_TYPE"
    FROM "SYS"."ALL_ARGUMENTS" "a"
    WHERE "a"."OWNER" IN ('SAKILA', 'SYS')     -- Possibly replace schema here
    ) "a"
  WHERE ("a"."TYPE_OWNER" IN ('SAKILA', 'SYS') -- Possibly replace schema here
  AND "a"."OWNER"         IN ('SAKILA', 'SYS') -- Possibly replace schema here
  AND "a"."DATA_TYPE"      = 'PL/SQL RECORD')
  GROUP BY 
    "a"."TYPE_OWNER",
    "a"."TYPE_NAME",
    "a"."TYPE_SUBNAME"
  ) "x"
ON (("a"."OWNER", "a"."PACKAGE_NAME", "a"."SUBPROGRAM_ID")
 = (("x"."OWNER", "x"."PACKAGE_NAME", "x"."SUBPROGRAM_ID"))
AND "a"."SEQUENCE" BETWEEN "x"."SEQUENCE" AND "next_sibling"
AND "a"."DATA_LEVEL" = ("x"."DATA_LEVEL" + 1))
ORDER BY 
  "x"."TYPE_OWNER" ASC,
  "x"."TYPE_NAME" ASC,
  "x"."TYPE_SUBNAME" ASC,
  "a"."SEQUENCE" ASC

Whew.

The output shows that we got all the required information for our RECORD type:

[  TYPE INFO COLUMNS  ]   ATTR_NAME   ATTR_TYPE_NAME  LENGTH
SAKILA CUSTOMERS PERSON   FIRST_NAME  VARCHAR2        50
SAKILA CUSTOMERS PERSON   LAST_NAME   VARCHAR2        50

All of this also works for:

  • Nested types
  • Multiple IN and OUT parameters

I’ll blog about a more advanced use-case in the near future, so stay tuned.

Applying Queueing Theory to Dynamic Connection Pool Sizing with FlexyPool


I’m very happy to have another interesting blog post by Vlad Mihalcea on the jOOQ blog, this time about his open source library FlexyPool. Read his previous jOOQ Tuesdays post on Hibernate here.

Vlad is a Hibernate developer advocate, he’s the author of the popular book High Performance Java Persistence, and he knows a thing or two about connection pooling.


Introduction

Back in 2014, I was working as a software architect, and our team was building a real-estate platform which was composed of multiple nodes, as depicted in the following diagram:

[Diagram: database integration points]

This is a classic enterprise architecture layout. The database is replicated to provide better throughput and availability in case of node failures. There are front-end nodes that deliver the website content, and there are many back-end nodes as well, like email schedulers or data import batch processors.

All these nodes require database connectivity, either to the Master node for read-write transactions, or to the Slave nodes for read-only transactions.

Because acquiring a database connection is an expensive process, each system node uses its own connection pool. By reusing physical database connections, the connection acquisition becomes very fast, thereby reducing the overall transaction response time.

Not only can a connection pool reduce the transaction response time, it can also level out traffic spikes. Without a connection pool, during a traffic spike, the front-end nodes might acquire all database connections, leaving the back-end processors with no database connectivity.

The connection pool, having a maximum number of database connections, allows connection requests to queue whenever a traffic spike happens. So, during a traffic spike, the transaction response time will increase due to the queuing mechanism, but this is way better than taking down the whole system.

For these two reasons, the connection pool is a very good choice in many enterprise systems.

Based on the underlying hardware resources, a relational database can only offer a limited number of connections. For this reason, we must be very careful when choosing the pool size for each particular system node.

Connection pool sizing

I was the lucky person to get the task of figuring out how many connections we should allocate for each system node in our real-estate platform. Since I graduated in Electronics and Telecommunications, I remembered that we had learned about a similar problem when having to provision telecommunications networks. Agner Krarup Erlang invented queueing theory for solving this problem, and I was curious if we could also find the right pool size by applying Erlang’s queueing models.

I was not the only one trying to apply queueing theory principles to software systems. Percona has a very interesting study: Forecasting MySQL Scalability with the Universal Scalability Law. But estimating the actual service time in a system that is affected by a myriad of variables is far from trivial.

In the end, I realized that the best way to tackle this problem is constant measuring and adjusting. For this reason, I needed a tool to capture database connection metrics, as well as a way to adjust a given connection pool while the enterprise system is running.

And, that’s how FlexyPool was born.

Basically, FlexyPool is a DataSource Proxy that stands in front of the actual JDBC DataSource or other proxies (e.g. statement logging).

[Diagram: FlexyPool DataSource proxy architecture]

FlexyPool supports a great variety of stand-alone connection pools, and it collects the following metrics:

  • concurrent connections histogram
  • concurrent connection requests histogram
  • data source connection acquiring time histogram
  • connection lease time histogram
  • maximum pool size histogram
  • total connection acquiring time histogram
  • overflow pool size histogram
  • retry attempts histogram

For instance, the concurrent connection count metric gives you an insight into how many connections are required by a certain application under a given traffic load:

[Chart: concurrent connection count]

The connection acquisition metric tells you how much time it takes to obtain a database connection from the pool:

[Chart: connection acquisition time]

The connection lease time allows you to spot long-running transactions, which are undesirable in high-performance OLTP applications:

[Chart: connection lease time]

For the stand-alone connection pools, FlexyPool can increment the pool size beyond the maximum capacity, as it offers an overflow buffer. The benefit of this overflow buffer is that it allows you to increase the pool size only when the incoming traffic causes a certain connection acquisition timeout.
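
Configuring this is done when assembling the FlexyPool DataSource. The following is a sketch based on FlexyPool’s builder API; the exact class names and factory signatures may vary between FlexyPool versions (hikariDataSource is a hypothetical, pre-configured pool):

FlexyPoolDataSource<HikariDataSource> flexyPoolDataSource =
    new FlexyPoolDataSource<>(
        new Configuration.Builder<>(
            "unique-pool-name", hikariDataSource, HikariCPPoolAdapter.FACTORY
        ).build(),
        // Grow the pool by up to 5 extra connections on acquisition timeout
        new IncrementPoolOnTimeoutConnectionAcquiringStrategy.Factory<>(5),
        // Retry a failed connection acquisition up to 2 times
        new RetryConnectionAcquiringStrategy.Factory<>(2)
    );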

Although FlexyPool can also monitor Java EE connection pools, it cannot increase the pool size in Java EE environments since the DataSource is an application server managed resource.

Conclusion

Because enterprise systems evolve, so do the underlying data access patterns. For this reason, the underlying database connection usage is a very important metric, which needs to be monitored on a regular basis. FlexyPool builds on top of Codahale / Dropwizard Metrics, so you can easily integrate it with well-known Application Performance Monitoring tools, such as Graphite or Grafana.

FlexyPool is open source, and it uses the Apache License 2.0. You can find the project repository on GitHub, and all the released dependencies are available on Maven Central, so it’s very easy to integrate it into your own project.

FlexyPool is powering many enterprise systems, like Etuovi, Mitch&Mates, and ScentBird. If you decide to use it in your current enterprise system, and you are willing to provide a testimonial, you can win a free copy of my High-Performance Java Persistence book.

Why You Should (Sometimes) Avoid Expressions in SQL Predicates


I’ve recently discovered a rather significant performance issue on a productive Oracle 11g customer database. And I’m sure you have this issue too, which is why I’m documenting it here.

This is a simplified representation of the setup at the customer site:

        ID PAYMENT_DATE TEXT                               
---------- ------------ -----------------------------------
     33803 21.05.16     DcTNBOrkQIgMtbietUWOsSFNMIqGLlDw...
     29505 09.03.16     VIuPaOAQqzCMlFBYPQtvqUSbWYPDndJD...
     10738 25.07.16     TUkxGpZPrGKaHzDRxczrktkFWvGmiwjR...

Let’s produce the above table with the following statement:

CREATE TABLE payment (
  id            NOT NULL PRIMARY KEY,
  payment_date  NOT NULL,
  text
) AS
SELECT
  level,
  SYSDATE - dbms_random.value(1, 500),
  dbms_random.string('a', 500)
FROM dual
CONNECT BY level <= 50000
ORDER BY dbms_random.value;

CREATE INDEX i_payment_date ON payment(payment_date);

EXEC dbms_stats.gather_table_stats('TEST', 'PAYMENT');

What we’re doing here: There’s a payment table with an ID primary key column, an indexed payment_date column and some “text” info. The real table is, of course, much bigger than this.

Now, this table needs infrequent housekeeping. In nightly batch jobs, the customer would run through the table and remove all payments that were older than some fixed number of days, e.g.:

DELETE
FROM payment
WHERE payment_date < SYSDATE - 470 -- Older than 470 days

This might look fine at first, because:

  1. We have an index on payment_date, and it can be used for the deletion
  2. SYSDATE is constant per query, so the expression SYSDATE - 470 should also be constant per query

We can prove both statements easily:

1. Index usage

Let’s run:

EXPLAIN PLAN FOR
DELETE
FROM payment
WHERE payment_date < SYSDATE - 470;

SELECT * FROM TABLE(dbms_xplan.display);

And we’re getting:

----------------------------------------------------------------------------
| Id  | Operation         | Name           | Rows  | Cost (%CPU)| Time     |
----------------------------------------------------------------------------
|   0 | DELETE STATEMENT  |                |  3008 |    10   (0)| 00:00:01 |
|   1 |  DELETE           | PAYMENT        |       |            |          |
|*  2 |   INDEX RANGE SCAN| I_PAYMENT_DATE |  3008 |    10   (0)| 00:00:01 |
----------------------------------------------------------------------------
                                                                            
Predicate Information (identified by operation id):                         
---------------------------------------------------
                                                                            
   2 - access("PAYMENT_DATE"<SYSDATE@!-470)                                 

So, we get a relatively cheap access predicate on the I_PAYMENT_DATE index, and there’s this weird SYSDATE@!-470 constant in the plan.

2. SYSDATE being constant

There’s a lot of confusion about what the actual value of these non-deterministic, non-pure date/time functions is in each database. Sometimes, the function is evaluated on each row individually and produces a new value each time. In my opinion, that’s the worst, as such functions are completely unpredictable.

Sometimes, there’s a guarantee that these timestamps stay the same for the entire duration of a transaction. That’s a bit weird as a default, but OK why not.

In Oracle, it seems that SYSDATE (and SYSTIMESTAMP) are evaluated only a single time on a per-statement level. This would explain that weird SYSDATE@! notation in the execution plan, and it can be “proven” by doing something like this:

CREATE OR REPLACE FUNCTION sleep(i NUMBER) RETURN NUMBER 
AS
BEGIN
  dbms_lock.sleep(i);
  RETURN i;
END;
/

SELECT 
  min(CAST(sysdate AS TIMESTAMP)), 
  max(CAST(sysdate AS TIMESTAMP))
FROM dual
CONNECT BY level <= 5 AND sleep(1) > 0;

DROP FUNCTION sleep;
/

The result of this statement (which unsurprisingly takes around 4 seconds to execute) is:

MIN(CAST(SYSDATEASTIMESTAMP)) MAX(CAST(SYSDATEASTIMESTAMP))
----------------------------- -----------------------------
01.11.16 11:15:20.000000000   01.11.16 11:15:20.000000000  

Some background info can be found in this OTN thread. Unfortunately, the SYSDATE documentation fails to clearly specify this behaviour…

So, what’s the problem?

Even if we’re probably relying on an undocumented feature, it looks like everything is fine, right? SYSDATE - 470 is more or less a constant, so we’re fine putting it in that WHERE clause.

Wrong!

Let’s run a benchmark, comparing three equivalent queries, running each query 100 times (and for repeatability, we use SELECT, not DELETE):

SET SERVEROUTPUT ON
DECLARE
  v_ts TIMESTAMP;
  v_repeat CONSTANT NUMBER := 100;
  v_range CONSTANT NUMBER := 470;
  v_date CONSTANT DATE := SYSDATE - v_range;
BEGIN
  v_ts := SYSTIMESTAMP;
  
  -- Original query with inline expression
  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT *
      FROM payment
      WHERE payment_date < SYSDATE - v_range
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;
    
  dbms_output.put_line('Statement 1 : ' || (SYSTIMESTAMP - v_ts));
  v_ts := SYSTIMESTAMP;
  
  -- Pre-calculated PL/SQL local variable
  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT *
      FROM payment
      WHERE payment_date < v_date
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;
    
  dbms_output.put_line('Statement 2 : ' || (SYSTIMESTAMP - v_ts));
  v_ts := SYSTIMESTAMP;
  
  -- Magical 11g scalar subquery caching
  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT *
      FROM payment
      WHERE payment_date < (SELECT SYSDATE - v_range FROM dual)
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;
    
  dbms_output.put_line('Statement 3 : ' || (SYSTIMESTAMP - v_ts));
END;
/

The above benchmark uses three techniques that all produce the same result but have different timing characteristics:

  1. Using an inline expression in the predicate
  2. Using a bind variable (the pre-calculated PL/SQL local variable)
  3. Using a scalar subquery

When we check out the estimated execution plans, nothing seems to indicate that any solution might be better than the others:

EXPLAIN PLAN FOR
SELECT *
FROM payment
WHERE payment_date < SYSDATE - :v_range;

SELECT * FROM TABLE(dbms_xplan.display);

EXPLAIN PLAN FOR
SELECT *
FROM payment
WHERE payment_date < :v_date;

SELECT * FROM TABLE(dbms_xplan.display);

-- CAST is there because of a "bug" in the EXPLAIN PLAN parser
EXPLAIN PLAN FOR
SELECT *
FROM payment
WHERE payment_date < (
  SELECT CAST(SYSDATE - :v_range AS DATE) FROM dual
);

SELECT * FROM TABLE(dbms_xplan.display);

The output is:

--------------------------------------------------------------------------------------
| Id  | Operation                   | Name           | Rows  | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |                |  2500 |   453   (0)| 00:00:06 |
|   1 |  TABLE ACCESS BY INDEX ROWID| PAYMENT        |  2500 |   453   (0)| 00:00:06 |
|*  2 |   INDEX RANGE SCAN          | I_PAYMENT_DATE |   450 |     3   (0)| 00:00:01 |
--------------------------------------------------------------------------------------
                                                                                      
Predicate Information (identified by operation id):                                   
---------------------------------------------------                                   
                                                                                      
   2 - access("PAYMENT_DATE"<SYSDATE@!-:V_RANGE)                                      


--------------------------------------------------------------------------------------
| Id  | Operation                   | Name           | Rows  | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |                |  2500 |   453   (0)| 00:00:06 |
|   1 |  TABLE ACCESS BY INDEX ROWID| PAYMENT        |  2500 |   453   (0)| 00:00:06 |
|*  2 |   INDEX RANGE SCAN          | I_PAYMENT_DATE |   450 |     3   (0)| 00:00:01 |
--------------------------------------------------------------------------------------
                                                                                      
Predicate Information (identified by operation id):                                   
---------------------------------------------------                                   
                                                                                      
   2 - access("PAYMENT_DATE"<:V_DATE)                                                 


--------------------------------------------------------------------------------------
| Id  | Operation                   | Name           | Rows  | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |                |  2500 |   455   (0)| 00:00:06 |
|   1 |  TABLE ACCESS BY INDEX ROWID| PAYMENT        |  2500 |   453   (0)| 00:00:06 |
|*  2 |   INDEX RANGE SCAN          | I_PAYMENT_DATE |   450 |     3   (0)| 00:00:01 |
|   3 |    FAST DUAL                |                |     1 |     2   (0)| 00:00:01 |
--------------------------------------------------------------------------------------
                                                                                      

Predicate Information (identified by operation id):                                   
---------------------------------------------------                                   
                                                                                      
   2 - access("PAYMENT_DATE"< (SELECT CAST(SYSDATE@!-:V_RANGE AS DATE) FROM           
              "SYS"."DUAL" "DUAL"))                                                   

The only exception is that in the 3rd plan, we have this FAST DUAL operation which hints at scalar subquery caching kicking in.

Let’s compare actual results, though! (as always, I’m not publishing the real results, just qualitative numbers in a fictive unit of measurement to comply with Oracle legal blah blah)

-- Run 1
Statement 1 : 1.98 wombos
Statement 2 : 1.17 wombos
Statement 3 : 0.80 wombos

-- Run 2
Statement 1 : 1.49 blorfs
Statement 2 : 1.19 blorfs
Statement 3 : 0.78 blorfs

-- Run 3
Statement 1 : 1.46 gnarls
Statement 2 : 1.20 gnarls
Statement 3 : 0.75 gnarls

Wow! As you can see, using a bind variable certainly helps, but when you apply scalar subquery caching, THIS is where you get the most benefits. Now, this query is just a very simple example. The actual customer query was immensely more complex, and trust me – the performance improvement was 10x (I couldn’t believe it myself at first, and I still don’t know exactly why!).

Conclusion (for Oracle 11g)

Modern optimisers “recognise” a lot of expressions to be constant. For instance, in most databases, it doesn’t matter if you’re writing COUNT(1) or COUNT(*), they’re both translated to the same thing.

In this particular case, however, I was quite disappointed by the fact that there is a significant difference between these three queries (which are perfectly equivalent, as far as I’m concerned), and that the least intuitive solution, using scalar subquery caching, performed best on Oracle 11g.

But wait (Oracle 12c)

I’ve tested the same behaviour on my Oracle 12c on Docker instance:

Statement 1 : 1.23 xmfs
Statement 2 : 1.11 xmfs
Statement 3 : 1.14 xmfs

Statement 1 : 1.27 xlorgs
Statement 2 : 1.07 xlorgs
Statement 3 : 1.30 xlorgs

Statement 1 : 1.23 glrls (I can invent units of measurement all day!)
Statement 2 : 1.00 glrls
Statement 3 : 1.39 glrls

… where now there seems to be a “regression” in the scalar subquery solution (it’s now the same as without the scalar subquery) and the bind variable solution now seems to be the fastest.

If anything at all, this just proves that with SQL, you will always have to measure stuff on your end, because your setup and database versions may differ. The bottom line is: If performance of a single statement matters to you, chances are that you can improve things by 2-digit percentages with just some simple tricks, and in a batch job or under heavy load, this definitely matters!

In particular, you should avoid expressions of the form SYSDATE - some_value in your predicates.
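
If you’re building such statements with jOOQ, the simplest fix is to do the date arithmetic in Java and let jOOQ bind the result (a sketch, assuming a generated PAYMENT table):

// The cutoff date is computed once and passed as a bind variable:
Timestamp cutoff = Timestamp.valueOf(LocalDateTime.now().minusDays(470));

ctx.deleteFrom(PAYMENT)
   .where(PAYMENT.PAYMENT_DATE.lt(cutoff))
   .execute();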

A Little Known SQL Feature: Use Logical Windowing to Aggregate Sliding Ranges


I’m frequently telling developers to put window functions almost everywhere, because they’re so awesome!

One feature that I rarely see in the wild (even if it is extremely useful for reporting) is called “logical windowing” in Oracle, and it’s most useful when used with INTERVAL ranges. Let’s see what we may want to do.

I have a payment table in my Sakila DVD rental store database. The payment table has payment dates, as can be seen here:

SELECT CAST(payment_date AS TIMESTAMP) ts
FROM payment
ORDER BY ts

output:

TS                        
---------------------------
24.05.05 22:53:30.000000000 
24.05.05 22:54:33.000000000 
24.05.05 23:03:39.000000000 
24.05.05 23:04:41.000000000 
24.05.05 23:05:21.000000000 
24.05.05 23:08:07.000000000 
24.05.05 23:11:53.000000000 
24.05.05 23:31:46.000000000 
25.05.05 00:00:40.000000000 
25.05.05 00:02:21.000000000 
25.05.05 00:09:02.000000000 
25.05.05 00:19:27.000000000 

Now, let’s assume I’m interested in these things:

  1. How many payments were there in the same hour as any given payment?
  2. How many payments were there in the same hour before any given payment?
  3. How many payments were there within one hour before any given payment?

Those are three entirely different questions. The expected result is this:

TS                                1   2   3
-------------------------------------------
24.05.05 22:53:30.000000000       2   1   1
24.05.05 22:54:33.000000000       2   2   2
24.05.05 23:03:39.000000000       6   1   3
24.05.05 23:04:41.000000000       6   2   4
24.05.05 23:05:21.000000000       6   3   5
24.05.05 23:08:07.000000000       6   4   6
24.05.05 23:11:53.000000000       6   5   7
24.05.05 23:31:46.000000000       6   6   8
25.05.05 00:00:40.000000000       4   1   7
25.05.05 00:02:21.000000000       4   2   8
25.05.05 00:09:02.000000000       4   3   5
25.05.05 00:19:27.000000000       4   4   5

As you can see, in column #1, we always get the same number of payments for any given hour. If we were using GROUP BY trunc(payment_date, 'HH24'), it would be simple to see that this is the result we were looking for:

TS                                1
-----------------------------------
24.05.05 22:00:00                 2
24.05.05 23:00:00                 6
25.05.05 00:00:00                 4

Column #2 is a bit more sophisticated, as it also aggregates the number of payments per hour, but only those that happened before any given payment. For instance, in the hour of 24.05.05 23:00:00, we have 6 payments (see above), and those payments are numbered within that hour. E.g. 24.05.05 23:11:53 is the fifth payment within that hour:

TS                                2
-----------------------------------
24.05.05 23:03:39.000000000       1
24.05.05 23:04:41.000000000       2
24.05.05 23:05:21.000000000       3
24.05.05 23:08:07.000000000       4
24.05.05 23:11:53.000000000       5
24.05.05 23:31:46.000000000       6

Finally, column #3 is using a sliding interval. For each payment, it checks how many payments there were in the time range between one hour before the current payment and the current payment itself. Now, excuse my ASCII art:

TS                                3
-----------------------------------
24.05.05 22:53:30.000000000       1
24.05.05 22:54:33.000000000       2
24.05.05 23:03:39.000000000       3 <------\
24.05.05 23:04:41.000000000       4        |
24.05.05 23:05:21.000000000       5        |
24.05.05 23:08:07.000000000       6        |
24.05.05 23:11:53.000000000       7 <----\ |
24.05.05 23:31:46.000000000       8 <--\ | |
25.05.05 00:00:40.000000000       7    | | |
25.05.05 00:02:21.000000000       8 ---|-|-/
25.05.05 00:09:02.000000000       5 ---|-/
25.05.05 00:19:27.000000000       5 ---/

All of these are really useful for reporting, and all of these can be done rather easily with window functions. I’m using Oracle syntax here:

SELECT 
  CAST(payment_date AS TIMESTAMP) ts,

  -- Column 1: Payments in the same hour
  COUNT(*) OVER (
    PARTITION BY TRUNC(payment_date, 'HH24')
  ),

  -- Column 2: Preceding payments in the same hour
  COUNT(*) OVER (
    PARTITION BY TRUNC(payment_date, 'HH24') 
    ORDER BY payment_date
  ),

  -- Column 3: Preceding payments within one hour
  COUNT(*) OVER (
    ORDER BY payment_date 
    RANGE BETWEEN INTERVAL '1' HOUR PRECEDING AND CURRENT ROW
  )
FROM payment
ORDER BY ts

Remember how column #1 is essentially the same thing as running a GROUP BY operation? That’s simple with window functions too. Just use a single PARTITION BY clause, and you get that grouping by the hour:

COUNT(*) OVER (
  PARTITION BY TRUNC(payment_date, 'HH24')
)

Column #2 is more interesting. Because we’re ordering the window, and because we’re not putting any explicit frame clause (ROWS or RANGE), by default, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW is implied. The following two are the same:

COUNT(*) OVER (
  PARTITION BY TRUNC(payment_date, 'HH24') 
  ORDER BY payment_date
)

COUNT(*) OVER (
  PARTITION BY TRUNC(payment_date, 'HH24') 
  ORDER BY payment_date
  RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
)

In other words, we group (PARTITION BY) all payments by the hour, order each group by the actual date, and limit / frame the window to the payments that precede the current payment.
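
Incidentally, the same window translates quite literally to jOOQ, in case you prefer building such reports in Java (a sketch, assuming a generated PAYMENT table and static imports of the DSL methods):

// COUNT(*) OVER (PARTITION BY TRUNC(payment_date, 'HH24')
//                ORDER BY payment_date)
Field<Integer> precedingSameHour =
    count().over()
           .partitionBy(trunc(PAYMENT.PAYMENT_DATE, DatePart.HOUR))
           .orderBy(PAYMENT.PAYMENT_DATE);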

Column #3 is the most interesting and the least known, however. Unfortunately, it is not yet supported in all databases (e.g. PostgreSQL doesn’t have it), even if it’s part of the SQL standard.

COUNT(*) OVER (
  ORDER BY payment_date 
  RANGE BETWEEN INTERVAL '1' HOUR PRECEDING AND CURRENT ROW
)

We’re no longer using the PARTITION BY clause, because we don’t want to group payments into hours. Instead, we want to look back one hour from each individual payment to see how many payments there were in that hour. But these hours now don’t start / end at 00:00 minutes/seconds; they end at a given payment’s payment_date value (where an Oracle DATE is really a timestamp).

This is a really nice feature that is available with DATE or TIMESTAMP columns only, when you pass an INTERVAL data type to the frame clause. I bet, however, that most of you have already had 1-2 reports where this might have been useful to know!


Don’t Even use COUNT(*) For Primary Key Existence Checks


In a recent blog post, I’ve advocated against the use of COUNT(*) in SQL, when a simple EXISTS() would suffice. This is important stuff. I keep tuning productive queries where a customer runs a COUNT(*) query like so:

SELECT count(*)
INTO v_any_wahlbergs
FROM actor a
JOIN film_actor fa USING (actor_id)
WHERE a.last_name = 'WAHLBERG'

… whereafter they discard the exact count, only to check for existence:

IF v_any_wahlbergs = 0 THEN
  something();
ELSE
  something_else();
END IF;

It doesn’t matter if the client logic is written in PL/SQL (as above) or in any other language like Java: the overhead is significant compared to the following, much simpler EXISTS() query:

SELECT CASE WHEN EXISTS (
  SELECT * FROM actor a
  JOIN film_actor fa USING (actor_id)
  WHERE a.last_name = 'WAHLBERG'
) THEN 1 ELSE 0 END
INTO v_any_wahlbergs
FROM dual

Clearly, you can see that the effect is the same, and the database can optimise the existence check by taking a shortcut once it has found at least one result (instead of going through the entire result set to get the exact count, which wasn’t needed in the first place).

This is true also for primary key checks

The above is obvious. But then, during a recent execution of my SQL Masterclass training, one of the delegates asked me a very interesting question.

Is this also true for primary key checks?

Now, I personally always prefer the EXISTS() syntax, because it clearly communicates that I’m after an existence check, not an actual count query. But in principle, the following two queries are exactly the same:

-- Can be 0 or 1
SELECT count(*) FROM film WHERE film_id = 1;

-- Can also be 0 or 1
SELECT CASE WHEN EXISTS (
  SELECT * FROM film WHERE film_id = 1
) THEN 1 ELSE 0 END
FROM dual;

I always say: Never guess, always measure.

Comparing techniques in Oracle

Here’s a benchmark in Oracle 11g XE running against the Sakila database:

SET SERVEROUTPUT ON
DECLARE
  v_ts TIMESTAMP;
  v_repeat CONSTANT NUMBER := 1000000;
BEGIN
  v_ts := SYSTIMESTAMP;
    
  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT count(*) FROM film WHERE film_id = 1
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;
    
  dbms_output.put_line('Statement 1 : ' || (SYSTIMESTAMP - v_ts));
  v_ts := SYSTIMESTAMP;
    
  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT CASE WHEN EXISTS (
        SELECT * FROM film WHERE film_id = 1
      ) THEN 1 ELSE 0 END
      FROM dual
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;
    
  dbms_output.put_line('Statement 2 : ' || (SYSTIMESTAMP - v_ts));
END;
/

The result is (only qualitative numbers, not actual seconds, because benchmark results comparing Oracle with other databases must not be published 😦):

Statement 1 : 0.0000012 slurbs
Statement 2 : 0.0000011 slurbs

As you can see, the EXISTS() query still slightly outperforms the COUNT(*) query in Oracle.

Comparing techniques in PostgreSQL

Let’s repeat the same in PostgreSQL 9.5:

DO $$
DECLARE
  v_ts TIMESTAMP;
  v_repeat CONSTANT INT := 1000000;
  rec RECORD;
BEGIN
  v_ts := clock_timestamp();

  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT count(*) FROM film WHERE film_id = 1
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;

  RAISE INFO 'Statement 1: %', (clock_timestamp() - v_ts); 
  v_ts := clock_timestamp();

  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT EXISTS (
        SELECT * FROM film WHERE film_id = 1
      )
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;

  RAISE INFO 'Statement 2: %', (clock_timestamp() - v_ts); 
END$$;

This time, with actual seconds because yay, Open Source can be benchmarked freely:

INFO:  Statement 1: 00:00:13.32556
INFO:  Statement 2: 00:00:08.350491

Oh wow! That’s a massive improvement in PostgreSQL! OK, we know that COUNT(*) is slow in PostgreSQL. There are tons of excuses why.

Comparing techniques in SQL Server

Let’s repeat on SQL Server 2014. And please, observe the beautiful procedural language called T-SQL, which doesn’t even support implicit cursor loops like Oracle and PostgreSQL do:

USE sakila
DECLARE @ts DATETIME;
DECLARE @repeat INT = 1000000;
DECLARE @i INT;
DECLARE @dummy VARCHAR;

DECLARE @s1 CURSOR;
DECLARE @s2 CURSOR;

SET @s1 = CURSOR FOR 
SELECT count(*) FROM film WHERE film_id = 1;

SET @s2 = CURSOR FOR 
SELECT CASE WHEN EXISTS (
  SELECT * FROM film WHERE film_id = 1
) THEN 1 ELSE 0 END;


SET @ts = current_timestamp;
SET @i = 0;
WHILE @i < @repeat
BEGIN
  SET @i = @i + 1

  OPEN @s1;
  FETCH NEXT FROM @s1 INTO @dummy;
  WHILE @@FETCH_STATUS = 0
  BEGIN
    FETCH NEXT FROM @s1 INTO @dummy;
  END;

  CLOSE @s1;
END;

DEALLOCATE @s1;
PRINT 'Statement 1: ' + CAST(DATEDIFF(ms, @ts, current_timestamp) AS VARCHAR) + 'ms';

SET @ts = current_timestamp;
SET @i = 0;
WHILE @i < @repeat
BEGIN
  SET @i = @i + 1

  OPEN @s2;
  FETCH NEXT FROM @s2 INTO @dummy;
  WHILE @@FETCH_STATUS = 0
  BEGIN
    FETCH NEXT FROM @s2 INTO @dummy;
  END;

  CLOSE @s2;
END;

DEALLOCATE @s2;
PRINT 'Statement 2: ' + CAST(DATEDIFF(ms, @ts, current_timestamp) AS VARCHAR) + 'ms';

Again, I must not publish actual results, because benchmark results must not be published for SQL Server either, so let’s anonymise by introducing another new unit of measurement:

Statement 1: OVER 9000
Statement 2: OVER 8999

Oh excellent! No measurable difference in SQL Server, this time!

Conclusion

While the difference between COUNT(*) and EXISTS() queries is drastic for ordinary queries that may possibly return thousands of rows (and thus counts of > 1000), the difference for primary key checks is:

  • Marginal but still worth improving in Oracle
  • Significant in PostgreSQL
  • Non-existent in SQL Server

So, there’s nothing wrong with consistently applying EXISTS() in all of your queries.

If you’re using jOOQ, getting it right is even easier. Just run:

boolean exists = ctx.fetchExists(
  select()
  .from(ACTOR)
  .join(FILM_ACTOR).using(ACTOR.ACTOR_ID)
  .where(ACTOR.LAST_NAME.eq("WAHLBERG"))
);

And jOOQ will wrap your query in that EXISTS() block for you.