PostgreSQL 9.3 Released!

A great database has just gotten better. PostgreSQL 9.3 has been released today. While improved reliability and availability are certainly thrilling additions, from the jOOQ perspective, the most interesting features are the new SQL syntax elements. These include:

  • Better support for JSON
  • LATERAL JOIN (or, as SQL:1999 calls it, the lateral derived table; see the sketch after this list)
  • Materialised views
  • Updatable views
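
For example, a lateral derived table allows a subquery in the FROM clause to reference columns of tables to its left. A hedged sketch, assuming hypothetical author and book tables, fetching each author’s three most recent books:

SELECT author.name, recent_books.title
FROM author
CROSS JOIN LATERAL (
  SELECT title
  FROM book
  WHERE book.author_id = author.id
  ORDER BY book.published_at DESC
  LIMIT 3
) AS recent_books;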

See all the improvements for yourself:
https://wiki.postgresql.org/wiki/What%27s_new_in_PostgreSQL_9.3

The Myth About Slow SQL JOIN Operations

In my recent SQL work for a large Swiss bank, I have maintained nested database view monsters whose unnested SQL code amounted to some 5k lines of code, joining the same table over and over again in separate subselects combined via UNION operations. This monster performed in well under 50ms, no matter how we queried it (see “10 more common mistakes” about the speed of queries). Of course, this performance was only achieved after lots of fine-tuning, load-testing and benchmarking. But it worked. Our Oracle database never let us down on these things.

Nonetheless, many SQL users think that JOIN operations are slow. Why? Perhaps because they are (or used to be) slow in MySQL? I’m currently reading an interesting book by Markus Winand, called SQL Performance Explained. He’s also the author of Use-The-Index-Luke.com, where you can get free insight into his book. I still recommend reading the whole book, though. Even SQL old-timers and SQL nerds like me will find 1-2 novel, very interesting approaches, some of which will be incorporated into jOOQ very soon!

In particular, consider this page which explains very well how Hash JOIN operations work:
http://use-the-index-luke.com/sql/join/hash-join-partial-objects

MySQL Bad Idea #573

This is MySQL’s Bad Idea #573 (after #384, which I’ve blogged about before). I’ve just had a terrible experience with a bug report from the jOOQ User Group, related to the escaping of backslashes in string literals in MySQL. At first, I thought to myself, whatever. SQL doesn’t escape backslashes. The only escape character within a string literal, according to the early SQL standards, is the quote itself, written as two quotes. Citing from SQL-1992 (slightly simplified):

<character string literal> ::=
    <quote> [ <character representation>... ] <quote>

<character representation> ::=
    <nonquote character>
  | <quote symbol>

<nonquote character> ::= !! See the Syntax Rules.

<quote symbol> ::= <quote><quote>

Alright? Crystal clear. There’s no escaping other than <quote><quote>.
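
As a hedged illustration (the customer table and name column are hypothetical), this is the standard way of getting a single quote into a string literal, whereas MySQL’s default mode would also accept the non-portable backslash variant:

SELECT *
FROM customer
WHERE name = 'O''Reilly';     -- standard SQL: the quote is doubled
-- WHERE name = 'O\'Reilly';  -- accepted by MySQL's default mode only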

The only time you need to be able to escape something is with the LIKE predicate, in case you want to escape the % and _ wildcard symbols. You can then use the ESCAPE clause:

<like predicate> ::=
    <match value> [ NOT ] LIKE <pattern>
    [ ESCAPE <escape character> ]
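
A hedged example with a hypothetical products table: find descriptions that literally contain “50%”, escaping the % wildcard with a user-chosen escape character:

SELECT *
FROM products
WHERE description LIKE '%50!%%' ESCAPE '!';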

I have now learned that MySQL (and, of course, MariaDB), unlike most other databases, also supports escaping with backslashes, similar to Java and other languages. That’s not a problem per se, although from a cross-vendor compatibility perspective, it’s quite nasty. But then, I discovered that there is an SQL mode called NO_BACKSLASH_ESCAPES, which disables this behaviour altogether.

This just reminds me of PHP’s horrible magic quotes. In fact, combine arbitrary configurations of PHP and MySQL on your server, and good luck ever getting string literals right. Sigh.

10 Common Mistakes Java Developers Make when Writing SQL

This article is part of a series.

Java developers mix object-oriented thinking with imperative thinking, depending on their levels of:

  • Skill (anyone can code imperatively)
  • Dogma (some use the “Pattern-Pattern”, i.e. the pattern of applying patterns everywhere and giving them names)
  • Mood (true OO is more clumsy to write than imperative code. At first)

But when Java developers write SQL, everything changes. SQL is a declarative language that has nothing to do with either object-oriented or imperative thinking. It is very easy to express a query in SQL. It is not so easy to express it optimally or correctly. Not only do developers need to re-think their programming paradigm, they also need to think in terms of set theory.

Here are common mistakes that a Java developer makes when writing SQL through JDBC or jOOQ (in no particular order). For 10 More Common Mistakes, see this article here.

1. Forgetting about NULL

Misunderstanding NULL is probably the biggest mistake a Java developer can make when writing SQL. This is also (but not exclusively) due to the fact that NULL is also called UNKNOWN. If it were only called UNKNOWN, it would be easier to understand. Another reason is that JDBC maps SQL NULL to Java null when fetching data or when binding variables. This may lead one to think that NULL = NULL (SQL) behaves the same way as null == null (Java).

One of the crazier examples of misunderstanding NULL is when NULL predicates are used with row value expressions.

Another, subtle problem appears when misunderstanding the meaning of NULL in NOT IN anti-joins.
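
To illustrate the NOT IN trap with a minimal, hedged sketch (the author and book tables are hypothetical): if a single book.author_id is NULL, this query returns no rows at all, because the NOT IN predicate evaluates to UNKNOWN for every author:

SELECT *
FROM author
WHERE id NOT IN (
  SELECT author_id
  FROM book
);

A NULL-safe alternative is the NOT EXISTS predicate with a correlated subquery.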

The Cure:

Train yourself. There is no way around explicitly thinking about NULL, every time you write SQL:

  • Is this predicate correct with respect to NULL?
  • Does NULL affect the result of this function?

2. Processing data in Java memory

Few Java developers know SQL very well. The occasional JOIN, the odd UNION, fine. But window functions? Grouping sets? A lot of Java developers load SQL data into memory, transform the data into some appropriate collection type, execute nasty maths on that collection with verbose loop structures (at least, before Java 8’s Collection improvements).

But some SQL databases support advanced (and SQL standard!) OLAP features that tend to perform a lot better and are much easier to write. A (non-standard) example is Oracle’s awesome MODEL clause. Just let the database do the processing and fetch only the results into Java memory. After all, some very smart people have optimised these expensive products. So by moving OLAP into the database, you gain two things:

  • Simplicity. It’s probably easier to write correctly in SQL than in Java
  • Performance. The database will probably be faster than your algorithm. And more importantly, you don’t have to transmit millions of records over the wire.
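
As a small, hedged example of letting the database aggregate (the sales table is hypothetical), standard grouping sets compute several levels of aggregation in a single pass, something that would take several loops or several queries when done in Java:

SELECT region, product, SUM(amount) AS total
FROM sales
GROUP BY GROUPING SETS (
  (region, product),
  (region),
  ()
);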

The Cure:

Every time you implement a data-centric algorithm in Java, ask yourself: Is there a way to let the database perform that work for me?

3. Using UNION instead of UNION ALL

It’s a shame that UNION ALL needs an extra keyword compared to UNION. It would be much better if the SQL standard had been defined to support:

  • UNION (allowing duplicates)
  • UNION DISTINCT (removing duplicates)

Not only is the removal of duplicates rarely needed (or sometimes even wrong), it is also quite slow for large result sets with many columns, as the two subselects need to be ordered, and each tuple needs to be compared with its subsequent tuple.

Note that even if the SQL standard specifies INTERSECT ALL and EXCEPT ALL, hardly any database implements these less useful set operations.
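
A hedged sketch with hypothetical tables: when the two subselects cannot (or need not) produce duplicates across each other, UNION ALL simply concatenates them without the costly de-duplication step:

SELECT id, name FROM current_customer
UNION ALL
SELECT id, name FROM archived_customer;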

The Cure:

Every time you write a UNION, think about whether you actually meant UNION ALL.

4. Using JDBC Pagination to paginate large results

Most databases support some way of paginating ordered results through LIMIT .. OFFSET, TOP .. START AT, or OFFSET .. FETCH clauses. In the absence of support for these clauses, there is still the possibility of filtering by ROWNUM (Oracle) or by ROW_NUMBER() OVER() (DB2, SQL Server 2008 and earlier), which is much faster than pagination in memory. This is specifically true for large offsets!
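
For instance, with the SQL standard OFFSET .. FETCH syntax (a hedged sketch against a hypothetical film table), the database ships only the requested page to the client instead of the whole result set:

SELECT id, title
FROM film
ORDER BY title
OFFSET 20 ROWS
FETCH NEXT 10 ROWS ONLY;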

The Cure:

Just use those clauses, or a tool (such as jOOQ) that can simulate those clauses for you.

5. Joining data in Java memory

Ever since the early days of SQL, some developers have had an uneasy feeling when expressing JOINs in their SQL. There is an inherent fear of JOIN being slow. This can be true if a cost-based optimiser chooses to perform a nested loop, possibly loading complete tables into database memory, before creating a joined table source. But that happens rarely. With appropriate predicates, constraints and indexes, MERGE JOIN and HASH JOIN operations are extremely fast. It’s all about the correct metadata (I cannot cite Tom Kyte often enough for this). Nonetheless, there are probably still quite a few Java developers who will load two tables from separate queries into maps and join them in Java memory in one way or another.
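
As a hedged sketch (hypothetical author and book tables): instead of loading both tables with separate queries into Java Maps and matching them there, one well-indexed join keeps the work in the database:

SELECT a.name, b.title
FROM author a
JOIN book b ON b.author_id = a.id;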

The Cure:

If you’re selecting from various tables in various steps, think again to see if you cannot express your query in a single statement.

6. Using DISTINCT or UNION to remove duplicates from an accidental cartesian product

With heavy joining, one can lose track of all the relations that are playing a role in a SQL statement. Specifically, if multi-column foreign key relationships are involved, it is possible to forget to add the relevant predicates in JOIN .. ON clauses. This might result in duplicate records, but maybe only in exceptional cases. Some developers may then choose to use DISTINCT to remove those duplicates again. This is wrong in three ways:

  • It (may) solve the symptoms but not the problem. It may even fail to solve the symptoms in edge cases.
  • It is slow for large result sets with many columns, as DISTINCT needs to sort or hash the entire result set to remove duplicates.
  • It is slow for large cartesian products, which will still load lots of data into memory
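
A hedged sketch of how this typically happens, assuming invoice and invoice_item share a composite key (company_id, invoice_no). Forgetting half of the composite join predicate silently matches items from other companies’ invoices that happen to share an invoice number:

SELECT i.invoice_no, ii.amount
FROM invoice i
JOIN invoice_item ii ON ii.invoice_no = i.invoice_no;
-- Correct: ... ON ii.invoice_no = i.invoice_no AND ii.company_id = i.company_id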

The Cure:

As a rule of thumb, when you get unwanted duplicates, always review your JOIN predicates. There’s probably a subtle cartesian product in there somewhere.

7. Not using the MERGE statement

This isn’t really a mistake, but probably a lack of knowledge of, or some fear of, the powerful MERGE statement. Some databases know other forms of UPSERT statements, e.g. MySQL’s ON DUPLICATE KEY UPDATE clause. But MERGE is really powerful, most importantly in databases that heavily extend the SQL standard, such as SQL Server.
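
A minimal, hedged MERGE sketch with hypothetical customer tables, expressing the INSERT-or-UPDATE in one atomic statement:

MERGE INTO customer c
USING staged_customer s
  ON (c.id = s.id)
WHEN MATCHED THEN
  UPDATE SET c.name = s.name
WHEN NOT MATCHED THEN
  INSERT (id, name) VALUES (s.id, s.name);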

The Cure:

If you’re UPSERTING by chaining INSERT and UPDATE or by chaining SELECT .. FOR UPDATE and then INSERT or UPDATE, think again. Apart from risking race conditions, you might be able to express a simpler MERGE statement.

8. Using aggregate functions instead of window functions

Before the introduction of window functions, the only means of aggregating data in SQL was the GROUP BY clause along with aggregate functions in the projection. This works well in many cases, and if aggregated data needs to be enriched with regular data, the grouped query can be pushed down into a joined subquery.

But SQL:2003 defined window functions, which are implemented by many popular database vendors. Window functions can aggregate data on result sets that are not grouped. In fact, each window function supports its own, independent PARTITION BY clause, which is an awesome tool for reporting.
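
A hedged sketch with a hypothetical employee table: every row is enriched with its department’s average salary, without any extra grouped subquery or join:

SELECT id, department_id, salary,
       AVG(salary) OVER (PARTITION BY department_id) AS avg_department_salary
FROM employee;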

Using window functions will:

  • Lead to more readable SQL (less dedicated GROUP BY clauses in subqueries)
  • Improve performance, as an RDBMS is likely to optimise window functions more easily

The Cure:

When you write a GROUP BY clause in a subquery, think about whether this could be done with a window function instead.

9. Using in-memory sorting for sort indirections

The SQL ORDER BY clause supports many types of expressions, including CASE expressions, which can be very useful for sort indirections. You should probably never sort data in Java memory because you think that

  • SQL sorting is too slow
  • SQL sorting cannot do it
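
A hedged sketch of such a sort indirection (the task table is hypothetical): open tasks first, then everything else, each group sorted by due date:

SELECT id, title, status, due_date
FROM task
ORDER BY
  CASE WHEN status = 'OPEN' THEN 0 ELSE 1 END,
  due_date;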

The Cure:

If you sort any SQL data in memory, think again if you cannot push sorting into your database. This goes along well with pushing pagination into the database.

10. Inserting lots of records one by one

JDBC knows batching, and you should use it. Do not INSERT thousands of records one by one, re-creating a new PreparedStatement every time. If all of your records go to the same table, create a batch INSERT statement with a single SQL statement and multiple bind value sets. Depending on your database and database configuration, you may need to commit after a certain number of inserted records, in order to keep the UNDO log slim.
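
Where the target database supports it, a single multi-row INSERT (a hedged sketch against a hypothetical table) already illustrates the idea of one statement carrying many value sets; JDBC batching achieves the same effect by reusing one PreparedStatement with many sets of bind values:

INSERT INTO log_entry (id, message)
VALUES (1, 'first'),
       (2, 'second'),
       (3, 'third');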

The Cure:

Always batch-insert large sets of data.

10 Things in SQL Server Which Don’t Work as Expected

So far, I have been blogging about curious RDBMS caveats mostly related to Oracle and MySQL databases.

But other databases also have a thing or two that does not work as expected. SQL Server, for example. Here’s an interesting blog post showing 10 things in SQL Server which don’t work as expected:
http://tech.pro/tutorial/1419/10-things-in-sql-server-which-don-t-work-as-expected

Pinterest and SQL vs. NoSQL

I’ve recently discovered a very interesting read about Pinterest’s architecture experimentation. One of the key messages is the fact that SQL and NoSQL data storage systems can coexist, each of them having its place. Here’s the full article:

http://highscalability.com/blog/2013/4/15/scaling-pinterest-from-0-to-10s-of-billions-of-page-views-a.html

This reminds me of a previous article about how Instagram successfully implemented large-scale sharding on Postgres:

https://blog.jooq.org/2011/12/10/a-success-story-of-sql-scaling-horizontally/

Easy Mocking of Your Database

Test-driven development is something wonderful! Once you’ve established it in your organisation, you will start to:

  • Greatly improve your quality (things break less often)
  • Greatly improve your processes (things can be changed more easily)
  • Greatly improve your developer atmosphere (things are more fun to do)

An important part of doing test-driven development right is finding a good ratio of which kind of code is to be covered…

  • by automated unit tests
  • by automated integration tests
  • by manual “smoke tests”
  • by manual “acceptance tests”
  • not at all

Finding that ratio can be grounds for heated, religious discussions. I will soon blog about my own opinion on that subject. In this post, however, we will focus on the first kind of test: unit tests.

Unit testing your data access

When databases are involved, people will probably quickly jump to writing integration tests, because all they have to do is create a little Derby, H2 or HSQLDB (or other) test database, and run a couple of data-setup queries prior to the actual test. Their code module will then hopefully not notice the difference from a production environment, and the whole system can be tested as a black box. The advantage of this is that your tests can be written in a way that verifies your business requirements, your user stories, or whatever you call them. So much for the theory.

When these database integration tests pile up, it becomes increasingly difficult to shield them from one another. Avoiding inter-dependencies and, at the same time, costly database setups is hard. You won’t be able to run the whole test suite immediately after building / committing. You need nightly builds, weekly builds. But unit testing the data access layer isn’t that much easier, because JDBC is an awful API to mock. There are so many different ways of configuring and executing queries through this highly stateful API that your unit tests quickly become unmanageable.

There are a few libraries that help you with database testing. Just to name a few:

  • MockRunner: This one has some JDBC-specific extensions that allow for simulating JDBC ResultSets, as well as for checking whether actual queries are executed
  • jMock: An “ordinary” Java mocking library
  • mockito: An “ordinary” Java mocking library
  • DBUnit: This one doesn’t mock your database, it’s good for testing your database. Another use-case, but still worth mentioning here

Some of the above libraries will not get you around the fact that JDBC is an awkward API to mock, specifically if you need to support several (incompatible!) versions of JDBC at the same time.

Mocking the database with jOOQ

When you’re using jOOQ in your application, mocking your database has just become really easy with jOOQ 3.0, which now also ships with a Mock JDBC Connection. Unlike with other frameworks, you only have to implement a single functional interface with jOOQ and provide that implementation to your MockConnection: the MockDataProvider. Here’s a simple implementation example:

MockDataProvider provider = new MockDataProvider() {

    // Your contract is to return execution results, given a context
    // object, which contains SQL statement(s), bind values, and some
    // other context values
    @Override
    public MockResult[] execute(MockExecuteContext context) 
    throws SQLException {

        // Use ordinary jOOQ API to create an org.jooq.Result object.
        // You can also use ordinary jOOQ API to load CSV files or
        // other formats, here!
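        // ("configuration", MY_TABLE and MyTableRecord are assumed to be in
        // scope, e.g. an org.jooq.Configuration and generated table classes)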
        DSLContext create = DSL.using(configuration);
        Result<MyTableRecord> result = create.newResult(MY_TABLE);
        result.add(create.newRecord(MY_TABLE));

        // Now, return 1-many results, depending on whether this is
        // a batch/multi-result context
        return new MockResult[] {
            new MockResult(1, result)
        };
    }
};

// Put your provider into a MockConnection and use that connection
// in your application. In this case, with a jOOQ DSLContext:
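// ("dialect" is assumed to be the org.jooq.SQLDialect of the mocked database)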
Connection connection = new MockConnection(provider);
DSLContext create = DSL.using(connection, dialect);

// Done! just use regular jOOQ API. It will return the values
// that you've specified in your MockDataProvider
assertEquals(1, create.selectOne().fetch().size());

The above implementation acts as a callback for JDBC’s various executeXXX() methods. Through a very simple MockExecuteContext API, you can thus:

  • Get access to the executed SQL and bind values (Use general jOOQ API to inline bind values into the SQL statement)
  • Distinguish between regular SQL statements and both single-statement/multi-bind-value and multi-statement/no-bind-value batch executions
  • Return one or several results using jOOQ’s org.jooq.Result objects (which you can easily import from CSV, XML, JSON, TEXT formats)
  • Return “generated keys” results through the same API
  • Let jOOQ’s MockStatement take care of the serialisation of your mock data through the JDBC API

There is also an experimental implementation of a MockFileDatabase, a text-based mock database that uses the following format:

# This is a sample test database for MockFileDatabase
# Its syntax is inspired from H2's test script files

# When this query is executed...
select 'A' from dual;
# ... then, return the following result
> A
> -
> A
@ rows: 1

# Just list all possible query / result combinations
select 'A', 'B' from dual;
> A B
> - -
> A B
@ rows: 1

select "TABLE1"."ID1", "TABLE1"."NAME1" from "TABLE1";
> ID1 NAME1
> --- -----
> 1   X
> 2   Y
@ rows: 2

MockFileDatabase implements MockDataProvider, so it’s dead-simple to provide your unit tests with sample data. Future versions of jOOQ will allow for:

  • Matching SQL statements against regex patterns to provide mock results
  • Loading results from other formats, such as jOOQ’s supported export formats
  • Specifying the behaviour of batch statements, multi-result statements, etc.

Using jOOQ’s MockConnection in other contexts

Things don’t stop here. As jOOQ’s MockConnection is the entry point for this mocking sub-API of jOOQ, you can also use it in other environments, such as when running JPA queries, Hibernate queries, iBatis or just your plain old legacy JDBC queries.

jOOQ has just become your preferred JDBC mock framework! ;-)

You never stop learning about Oracle features

Oracle’s name is no coincidence. It is truly an oracle, telling you mystical, secret things about your database. It may take great expertise and skill to optimally extract that knowledge from this monster. While it implements vast parts of the SQL:2008 standard, it ships with a lot of features no other database has, but which are likely to make it into the SQL:2011 and subsequent standards, eventually.

One particular syntax clause is the PARTITION BY clause. It is very similar to the GROUP BY clause, but it can appear in various other contexts. The most trivial one is that of window functions / analytical functions. The most advanced one is that of the MODEL clause (which I shall blog about soon). But one of the most hidden ones is that of the OUTER JOIN clause. With Oracle, you can issue a “partitioned outer join”.

The formal definition of Oracle’s OUTER JOIN clause allows for an optional query_partition_clause on either side of the join.

The query_partition_clause is a regular “PARTITION BY expr” clause. This useful addition can be put to the left or to the right of the actual OUTER JOIN keywords, and it specifies how to partition that side of the join such that there will be at least one (possibly empty) row for every partition, even when the OUTER JOIN does not produce any rows for that partition. Whew.

In more intelligible English: this clause helps fill your result set with blank records. There!
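
A hedged sketch with hypothetical sales and calendar_day tables: for every product, return one row per calendar day, even for days on which that product had no sales at all:

SELECT sales.product_id, calendar_day.cal_date, NVL(sales.amount, 0) AS amount
FROM sales PARTITION BY (sales.product_id)
RIGHT OUTER JOIN calendar_day
  ON (sales.sale_date = calendar_day.cal_date)
ORDER BY sales.product_id, calendar_day.cal_date;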

So, whenever you feel you need to fill in some blank rows with Oracle, remember that you might just have found yourself a use-case for the partitioned outer join! (This clause will be supported by jOOQ 2.5.0)

SLICK, integrating SQL into Scala

Now it’s official – even if the version number still starts with a zero: SLICK has been publicly announced by Typesafe.

SLICK stands for Scala Language-Integrated Connection Kit, which is more or less the Scala equivalent for LINQ-to-SQL. Note that I say LINQ-to-SQL, not LINQ in general, as Scala already has sufficient means of querying collections using the Scala language itself.

Here’s a sample of what SLICK code will look like (taken from the SLICK website):

object Coffees extends Table[(String, Int, Double)]("COFFEES") {
  def name = column[String]("COF_NAME", O.PrimaryKey)
  def supID = column[Int]("SUP_ID")
  def price = column[Double]("PRICE")
  def * = name ~ supID ~ price
}

Coffees.insertAll(
  ("Colombian",         101, 7.99),
  ("Colombian_Decaf",   101, 8.99),
  ("French_Roast_Decaf", 49, 9.99)
)

val q = for {
  c <- Coffees if c.supID === 101
  //                       ^ comparing Rep[Int] to Rep[Int]!
} yield (c.name, c.price)

println(q.selectStatement)

q.foreach { case (n, p) => println(n + ": " + p) }

As you can see, SLICK neatly integrates with Scala’s own syntax. As with LINQ-to-SQL, SLICK’s goal is to

“write your database queries in Scala instead of SQL”

This is quite orthogonal to what jOOQ is aiming for:

“SQL was never meant to be anything other than… SQL!”

As a reminder, see my previous blog post about how jOOQ integrates with Scala, and how you can write almost-SQL queries in Scala using jOOQ against 13 popular databases. It would be interesting to compare the two approaches side-by-side in an independent evaluation, to see the pros and cons of each one, in terms of

  • Developer productivity
  • Maintainability
  • Performance
  • Feature scope
  • etc.

I think it’s time someone made that evaluation. An example can be seen here:

http://stackoverflow.com/questions/10537766/closest-equivalent-to-sqlalchemy-for-java-scala

MySQL Bad Idea #384

MySQL is a database of compromise. Compromise between running a production-ready relational database and being popular with all sorts of hackers – mostly the ones that don’t really like SQL. And because they don’t really like SQL, they choose MySQL, as MySQL is very forgiving. It is just as forgiving as their favourite language PHP, which forgives their mistakes involving escaping and quoting through funny things like “magic quotes”. Not only is MySQL forgiving, it allows you to write “wrong” SQL and still does something with it. Here’s what I mean by “wrong” SQL:

In MySQL, you can legally execute the following statement:

SELECT o.custid, c.name, MAX(o.payment)
  FROM orders AS o, customers AS c
  WHERE o.custid = c.custid
  GROUP BY o.custid;

The statement was taken from here:
http://dev.mysql.com/doc/refman/5.6/en/group-by-hidden-columns.html

So what does this statement even mean? What will be returned in the c.name projection? MAX(c.name)? ANY(c.name)? FIRST(c.name)? NULL? 42? According to the documentation, ANY(c.name) would best describe what’s going on. This peculiar syntax is probably quite clever for those few who really know when it is useful: when they know for sure that o.custid and c.name correlate 1:1, and they can speed things up a little by not having to write MAX(c.name), or by not having to add c.name to the GROUP BY clause (“yes, saved yet another 8 characters”).
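
A hedged rewrite of the statement above that behaves the same way but works on any SQL database (and documents the intended 1:1 correlation explicitly):

SELECT o.custid, c.name, MAX(o.payment)
  FROM orders AS o, customers AS c
  WHERE o.custid = c.custid
  GROUP BY o.custid, c.name;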

But the bulk of newbie MySQL users will be confused by this.

  • First, they will be confused because they don’t get the c.name they’d expect.
  • Secondly, they will eventually switch over to another database that gets these things right, and be frustrated all over again by funny syntax errors such as “ORA-00979: not a GROUP BY expression”

So please,

  • MySQL users: stop using this non-feature. It will only cause pain and suffering, even if you know how/why it works. SQL’s GROUP BY was not meant to work that way.
  • MySQL: Deprecate this non-feature.