Faster SQL Through Occasionally Choosing Natural Keys Over Surrogate Keys


There are many, many opinions out there regarding the old surrogate key vs. natural key debate. Most of the time, surrogate keys (e.g. sequence-generated IDs) win because they’re much easier to design:

  • They’re easy to keep consistent across a schema (e.g. every table has an ID column, and that’s always the primary key)
  • They’re thus a no-brainer to add. When you create a new table, you don’t need to worry about any candidate keys
  • They’re guaranteed to be unique, because they have absolutely no business value, only a technical value

Great. So why bother using natural keys in the first place?

Well, there is a very compelling reason!

Performance!

Whenever you introduce surrogate keys, this means that your key data becomes completely meaningless. From a design perspective, that’s not too bad. You can easily join that other table to get the interesting, meaningful information that hides behind the surrogate foreign key value. For example, in our Sakila database

… we have a typical many-to-many relationship modelled with a relationship table between the FILM table and the CATEGORY table – state-of-the-art normalisation. But check out this interesting thing:

  • The FILM_CATEGORY relationship table doesn’t contain any interesting information at all. Just the relationships
  • The category table only contains a single useful column: The NAME column
  • The remaining columns (CATEGORY_ID and LAST_UPDATE) have absolutely no meaning

With this in mind, we could design a much simpler schema in which we use the category name as a natural key. In fact, we don’t even need the CATEGORY table anymore and can remove it (that’s optional here: to ensure data correctness, we could keep it around, containing only the NAME column as its primary key). Check this out:
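
A rough sketch of what such a natural-key relationship table could look like (the exact DDL and types are illustrative, modelled after Sakila’s CATEGORY.NAME column):

CREATE TABLE film_category_natural (
  film_id NUMBER(10)   NOT NULL REFERENCES film (film_id),
  name    VARCHAR2(25) NOT NULL,
  CONSTRAINT pk_film_category_natural
    PRIMARY KEY (film_id, name)
);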

Now, if we run a query like the following one against our entire Sakila schema:

SELECT c.name, count(*)
FROM film_actor fa
JOIN film_category fc USING (film_id)
JOIN category c USING (category_id)
WHERE actor_id = 1
GROUP BY c.name
ORDER BY count(*) DESC

The query finds all categories a given actor played in, and the number of films that the given actor played in each category. For instance, this could be the result:

NAME       COUNT(*)
-------------------
Horror     3
Classics   2
Family     2
New        2
Games      2
Animation  1
Sports     1
Children   1
...

With an alternative schema where the category NAME has been moved to a new FILM_CATEGORY_NATURAL table, we could run this much simpler query:

SELECT fc.name, count(*)
FROM film_actor fa 
JOIN film_category_natural fc 
  USING (film_id)
WHERE actor_id = 1
GROUP BY fc.name
ORDER BY count(*) DESC

Notice how we can omit an entire JOIN.

The execution plans (here on Oracle) are quite different. Check this out:

Before:

After:

Unfortunately, the cost difference (8 vs 5) cannot be taken as a tool to compare actual costs between the two queries/plans. But the plans are otherwise very similar, except that we’re simply missing one table access (CATEGORY) and an entire JOIN. That’s a significant improvement for something this simple. Imagine the improvement if we rolled out this kind of query simplification throughout the entire system.

We could look into more execution plan measurements (especially from the actual plan results), but what if we simply benchmark the two queries using the same silly benchmark, as always, repeating each statement 100 times:

SET SERVEROUTPUT ON
DECLARE
  v_ts TIMESTAMP;
  v_repeat CONSTANT NUMBER := 100;
BEGIN
  v_ts := SYSTIMESTAMP;
    
  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT c.name, count(*)
      FROM film_actor fa
      JOIN film_category fc USING (film_id)
      JOIN category c USING (category_id)
      WHERE actor_id = 1
      GROUP BY c.name
      ORDER BY count(*) DESC
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;
    
  dbms_output.put_line('Statement 1 : ' || (SYSTIMESTAMP - v_ts));
  v_ts := SYSTIMESTAMP;
    
  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT fc.name, count(*)
      FROM film_actor fa 
      JOIN film_category_natural fc 
        USING (film_id)
      WHERE actor_id = 1
      GROUP BY fc.name
      ORDER BY count(*) DESC
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;
    
  dbms_output.put_line('Statement 2 : ' || (SYSTIMESTAMP - v_ts));
END;
/

The results are drastic:

Statement 1 : 00:00:00.122070000
Statement 2 : 00:00:00.051179000

A factor of roughly 2.4x faster!

(Disclaimer: Benchmarks are an inaccurate way of measuring things because they suffer (or benefit) from “unfair” side-effects like heavy caching, but they’re a helpful tool to assess the order of magnitude of a difference between two queries).

The JOIN is completely unnecessary

The only reason why we joined the CATEGORY table each time is because we needed to display the meaningful business value of a category to the user, the CATEGORY.NAME. We could have avoided the JOIN if that was not a requirement, but displaying surrogate keys to the user (the CATEGORY_ID) would be rather harsh, wouldn’t it?

And we’re doing this all the time with all sorts of tables. We have:

  • Category tables (category name is a good candidate key)
  • Translation tables (label is a good candidate key)
  • Country tables (ISO 3166 country codes are a good candidate key)
  • Language tables (ISO 639 language codes are a good candidate key)
  • Currency tables (ISO 4217 currency codes are a good candidate key)
  • Stock symbol tables (ISIN security codes are a good candidate key)
  • … and many more

Working with natural keys can be quite cumbersome. But for some entities, the internationally standardised codes are really good candidate keys, and most of the time, they’re sufficient. What does the LANGUAGE_ID 47 even mean? It means nothing. An experienced DBA will remember, after a while, that it means “English”. But wouldn’t EN be a much better value?

You would have EN as both the primary key AND the foreign key value. Because everyone (including frontend developers who probably hard-code some translations anyway) knows what language EN is – but no one knows what 47 means – chances are you will almost never need to join the language table again, except in those rare cases where you want to work with secondary language columns, such as, for instance, DESCRIPTION.

Now, imagine what happens if we search for English entries, or as in our previous example, for films of category “Classics”? Our entire JOIN graph would be simplified.
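
For example, finding films of category “Classics” no longer needs the extra hop through the CATEGORY table (a sketch against the FILM_CATEGORY_NATURAL variant from above):

SELECT f.title
FROM film f
JOIN film_category_natural fc USING (film_id)
WHERE fc.name = 'Classics'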

(Another place where we don’t need additional surrogate keys is the relationship table. In this particular case, there’s no such key anyway)

Caveats

Our category strings are quite short. If natural keys become longer, then the duplication itself can become a problem at a lower storage level, as you might need more pages and blocks to store the same number of rows.

Please do take the advice in this article with a grain of salt. The essence here is to not always follow strict rules that were established only as a good default. There are always tradeoffs!

Conclusion

This should be quite a straightforward refactoring for many applications. If your tables are extremely obvious picks for a natural key (like the above), then do use natural keys. Your queries will immediately be faster – not necessarily much faster, but probably you can speed up a significant number of queries, i.e. take load off your entire system. Plus: your database will be more user-friendly.

And all of this at the price of not using the identical table design everywhere. I mean – when was having an identical table design a real business case anyway, right?

Side-note

In some cases, you could take this even one step further and denormalise your schema by putting categories as arrays or XML or JSON data structures directly inside your films. You’ll lose the normalisation benefits, but you could further win in terms of performance. For more details, read this very interesting article (about PostgreSQL) here:

http://www.databasesoup.com/2015/01/tag-all-things.html

Don’t Even Use COUNT(*) For Primary Key Existence Checks


In a recent blog post, I advocated against the use of COUNT(*) in SQL when a simple EXISTS() would suffice. This is important stuff. I keep tuning production queries where a customer runs a COUNT(*) query like so:

SELECT count(*)
INTO v_any_wahlbergs
FROM actor a
JOIN film_actor fa USING (actor_id)
WHERE a.last_name = 'WAHLBERG'

… after which they discard the exact count and only check for existence:

IF v_any_wahlbergs = 0 THEN
  something();
ELSE
  something_else();
END IF;

It doesn’t matter whether the client logic is written in PL/SQL (as above) or in any other language like Java: the overhead is significant compared to the following, much simpler EXISTS() query:

SELECT CASE WHEN EXISTS (
  SELECT * FROM actor a
  JOIN film_actor fa USING (actor_id)
  WHERE a.last_name = 'WAHLBERG'
) THEN 1 ELSE 0 END
INTO v_any_wahlbergs
FROM dual

Clearly, the effect is the same, and the database can optimise the existence check by taking a shortcut once it has found at least one result (instead of going through the entire result set to get the exact count, which wasn’t needed in the first place).

This is also true for primary key checks

The above is obvious. But then, during a recent execution of my SQL Masterclass training, one of the delegates asked me a very interesting question.

Is this also true for primary key checks?

Now, I personally always prefer the EXISTS() syntax, because it clearly communicates that I’m after an existence check, not an actual count query. But in principle, the following two queries are exactly the same:

-- Can be 0 or 1
SELECT count(*) FROM film WHERE film_id = 1;

-- Can also be 0 or 1
SELECT CASE WHEN EXISTS (
  SELECT * FROM film WHERE film_id = 1
) THEN 1 ELSE 0 END
FROM dual;

I always say: Never guess, always measure.

Comparing techniques in Oracle

Here’s a benchmark in Oracle 11g XE running against the Sakila database:

SET SERVEROUTPUT ON
DECLARE
  v_ts TIMESTAMP;
  v_repeat CONSTANT NUMBER := 1000000;
BEGIN
  v_ts := SYSTIMESTAMP;
    
  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT count(*) FROM film WHERE film_id = 1
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;
    
  dbms_output.put_line('Statement 1 : ' || (SYSTIMESTAMP - v_ts));
  v_ts := SYSTIMESTAMP;
    
  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT CASE WHEN EXISTS (
        SELECT * FROM film WHERE film_id = 1
      ) THEN 1 ELSE 0 END
      FROM dual
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;
    
  dbms_output.put_line('Statement 2 : ' || (SYSTIMESTAMP - v_ts));
END;
/

The result is (only qualitative numbers, not actual seconds, because benchmark results comparing Oracle with other DBs must not be published, per Oracle’s license terms 😦 )

Statement 1 : 0.0000012 slurbs
Statement 2 : 0.0000011 slurbs

As you can see, the EXISTS() query still slightly outperforms the COUNT(*) query in Oracle.

Comparing techniques in PostgreSQL

Let’s repeat the same in PostgreSQL 9.5:

DO $$
DECLARE
  v_ts TIMESTAMP;
  v_repeat CONSTANT INT := 1000000;
  rec RECORD;
BEGIN
  v_ts := clock_timestamp();

  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT count(*) FROM film WHERE film_id = 1
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;

  RAISE INFO 'Statement 1: %', (clock_timestamp() - v_ts); 
  v_ts := clock_timestamp();

  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT EXISTS (
        SELECT * FROM film WHERE film_id = 1
      )
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;

  RAISE INFO 'Statement 2: %', (clock_timestamp() - v_ts); 
END$$;

This time, with actual seconds because yay, Open Source can be benchmarked freely:

INFO:  Statement 1: 00:00:13.32556
INFO:  Statement 2: 00:00:08.350491

Oh wow! That’s a massive improvement in PostgreSQL! OK, we know that COUNT(*) is slow in PostgreSQL. Here are tons of excuses why.

Comparing techniques in SQL Server

Let’s repeat this on SQL Server 2014. And please, observe the beautiful procedural language called T-SQL, which doesn’t even support implicit cursor loops the way Oracle and PostgreSQL do:

USE sakila
DECLARE @ts DATETIME;
DECLARE @repeat INT = 1000000;
DECLARE @i INT;
DECLARE @dummy VARCHAR;

DECLARE @s1 CURSOR;
DECLARE @s2 CURSOR;

SET @s1 = CURSOR FOR 
SELECT count(*) FROM film WHERE film_id = 1;

SET @s2 = CURSOR FOR 
SELECT CASE WHEN EXISTS (
  SELECT * FROM film WHERE film_id = 1
) THEN 1 ELSE 0 END;


SET @ts = current_timestamp;
SET @i = 0;
WHILE @i < @repeat
BEGIN
  SET @i = @i + 1

  OPEN @s1;
  FETCH NEXT FROM @s1 INTO @dummy;
  WHILE @@FETCH_STATUS = 0
  BEGIN
    FETCH NEXT FROM @s1 INTO @dummy;
  END;

  CLOSE @s1;
END;

DEALLOCATE @s1;
PRINT 'Statement 1: ' + CAST(DATEDIFF(ms, @ts, current_timestamp) AS VARCHAR) + 'ms';

SET @ts = current_timestamp;
SET @i = 0;
WHILE @i < @repeat
BEGIN
  SET @i = @i + 1

  OPEN @s2;
  FETCH NEXT FROM @s2 INTO @dummy;
  WHILE @@FETCH_STATUS = 0
  BEGIN
    FETCH NEXT FROM @s2 INTO @dummy;
  END;

  CLOSE @s2;
END;

DEALLOCATE @s2;
PRINT 'Statement 2: ' + CAST(DATEDIFF(ms, @ts, current_timestamp) AS VARCHAR) + 'ms';

Again, I must not publish actual results, because benchmark results must not be published for SQL Server either, so let’s anonymise them by introducing another new unit of measurement:

Statement 1: OVER 9000
Statement 2: OVER 8999

Oh excellent! No measurable difference in SQL Server, this time!

Conclusion

While the difference between COUNT(*) and EXISTS() queries is drastic for ordinary queries that may possibly return thousands of rows (and thus counts of > 1000), the difference for primary key checks is:

  • Marginal but still worth improving in Oracle
  • Significant in PostgreSQL
  • Non-existent in SQL Server

So, there’s nothing wrong with consistently applying EXISTS() in all of your queries.

If you’re using jOOQ, getting it right is even easier. Just run:

boolean exists = ctx.fetchExists(
  select()
  .from(ACTOR)
  .join(FILM_ACTOR).using(ACTOR.ACTOR_ID)
  .where(ACTOR.LAST_NAME.eq("WAHLBERG"))
);

And jOOQ will wrap your query in that EXISTS() block for you.

jOOQ Tuesdays: Chris Saxon Explains the 3 Things Every Developer Should Know About SQL


Welcome to the jOOQ Tuesdays series. In this series, we’ll publish an article on the third Tuesday every other month (today, exceptionally on a Wednesday because of technical issues) where we interview someone we find exciting in our industry from a jOOQ perspective. This includes people who work with SQL, Java, Open Source, and a variety of other related topics.


I’m very excited to feature today Chris Saxon who has worked with Oracle forever, and who is one of the brains behind the famous Ask Tom website.

Chris, you’re part of the famous Ask Tom team. Everyone working with Oracle has wound up on Ask Tom’s website at least once. You’ve answered an incredible amount of questions already. What’s it like to work for such a big community as Oracle’s?

It’s an amazing experience! My first real job was as a PL/SQL developer. My only knowledge of SQL was a couple of vaguely remembered lectures at university. Ask Tom was the main place I learned about SQL and Oracle Database. So it’s a huge honor to be on the other side, helping others get the best out of the technology.

The best part has to be the positive comments when you help someone solve a problem that’s been troubling them for days. That’s why we’re here. To help developers learn more about Oracle and improve their SQL skills. When you use the database to its full extent, you can write better, faster applications with less code!

What were your three most interesting questions, so far?

Any question that has a clear definition and a complete test case is interesting! 😉 Personally I enjoy using SQL to solve complex problems the best. So the first two do just that:

1. Finding the previous row in a different group

The poster had a series of transactions. These were grouped into two types. For each row, they wanted to show the id of the previous transaction from the other group.

At first this sounds like a problem you can solve using LAG or LEAD. But these only search for values within the same group. So you need a different method.

I provided a solution using the model clause. Using this, you can generate columns based on complex, spreadsheet-like formulas. Rows in your table are effectively cells in the sheet. You identify them by defining dimensions which can be other columns or expressions. By setting the transaction type as a dimension, you can then easily reference – and assign – values from one type to the other.

This worked well. But commenters were quick to provide solutions using window functions and 12c’s match_recognize clause. Both of these were faster than my approach!
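
For the curious, such a window-function variant could look roughly like this, assuming a simple transactions table with an id, a type ('A' or 'B') and a timestamp (all names here are illustrative, not the poster’s actual schema):

SELECT id, type, tx_date,
       CASE type
         WHEN 'A' THEN last_value(CASE WHEN type = 'B' THEN id END IGNORE NULLS)
                       OVER (ORDER BY tx_date
                             ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)
         ELSE          last_value(CASE WHEN type = 'A' THEN id END IGNORE NULLS)
                       OVER (ORDER BY tx_date
                             ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)
       END AS prev_id_of_other_type
FROM transactions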

I like this because it shows the flexibility of SQL. And it shows the value of an engaged community. No one knows everything. By sharing our knowledge and working together we can all become better developers.

2. Improving SQL that deliberately generates a cartesian product

The poster had a set of abbreviations for words. For example, Saint could also be “St.” or “St”. They wanted to take text containing these words. Then generate all combinations of strings using these abbreviations.

The “obvious” solution the user had was to split the text into words. Then for each word, join the abbreviation table, replacing the string as needed. So for a five-word string, you have five joins.

There are a couple of problems with this method. The number of joins limits the number of words. So if you have a string with seven words, but only six table joins, you won’t abbreviate the final word.

The other issue is performance. Joining the same table N times increases the work you do. If you have long sentences and/or a large number of abbreviations, the query could take a long time to run.

To overcome these you need to ask: “how can I join to the abbreviation table just once?”

The solution to do this starts the same as the original. Split the sentence into a table of words. Then join this to the abbreviations to give a row for each replacement needed.

You can then recursively walk down these rows using CTEs. These build up the sentence again, replacing words with their abbreviations as needed. A scalable solution that only needs a single pass of each table!

The final question relates to performance. Tom Kyte’s mantra was always “if you can do it in SQL, do it in SQL”. The reason is that a pure SQL solution is normally faster than one which combines SQL and other code. Yet a question came in that cast doubt on this:

3. Difference in performance: SQL vs PL/SQL

The poster was updating a table. The new values came from another table. He was surprised that PL/SQL using bulk processing came out faster than the pure SQL approach.

The query in question was in the form:

update table1 t1
set col1 = (select col2 from table2 t2 where t1.code = t2.code);

It turned out the reason was a “missing” index. Oracle executes the subquery once for every row in table1. Unless there’s an index on table2 (code), this will full scan table2 once for every row in table1!

The PL/SQL only approach avoided this problem by reading the whole of table2 into an array. So there was only one full scan of table2.
 

The problem here is that there was no index on the join condition (t1.code = t2.code). With such an index in place, Oracle does an index lookup of table2 for each row in table1. A massive performance improvement!
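
Creating it is a one-liner (the index name is illustrative):

create index table2_code_idx on table2 (code);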
 

The moral being: if your SQL is “slow”, particularly compared to a combined SQL + other language method, it’s likely you have a missing index (or two).

This question again showed the strength and value of the Oracle community. Shortly after I posted the explanation, a reviewer was quick to point out the following SQL solution:

merge into table1 t1
using  table2 t2
on   (t1.code = t2.code)
when matched
  then update set t1.col1 = t2.col2;

This came out significantly faster than both the original update and PL/SQL – without needing any extra indexes!

You’re running a YouTube channel called “The Magic of SQL”. Are SQL developers magicians?

Of course they are! In fact, I’d say that all developers are magicians. As Arthur C Clarke said:

“Any sufficiently advanced technology is indistinguishable from magic”

The amount of computing power you carry around in your phone today is mind blowing. Just ask your grandparents!

I think SQL developers have a special kind of magic though :). The ability to answer hard questions with a few lines of SQL is amazing. And for it to adapt to changes in the underlying data to give great performance without you changing it is astounding.

Your Twitter account has a pinned tweet about window functions. I frequently talk to Java developers at conferences, and few of them know about window functions, even if they’ve been in databases like Oracle for a very long time. Why do you think they’re still so “obscure”?

Oracle Database has had window functions since the nineties. But many other RDBMSes have only fully supported them recently. So a combination of writing “database independent” code and people using other databases is certainly a factor.

Use of tools which hide SQL from developers is also a problem. If you’re not actively using SQL, it’s easy to overlook many of its features.

Fortunately I think this is changing. As more and more developers are realizing, SQL is a powerful language. Learning how to use it effectively is a key skill for all developers. Window functions and other SQL features mean you can write better-performing applications with less code. Who doesn’t want that? 😉
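
For instance, a single window function can replace a whole self-join or procedural loop. A quick illustrative sketch of a running total against Sakila’s PAYMENT table:

SELECT payment_date, amount,
       SUM(amount) OVER (ORDER BY payment_date, payment_id) AS running_total
FROM payment
WHERE customer_id = 1
ORDER BY payment_date, payment_id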

What are three things that every developer should know about SQL?

1. Understand set based processing

If you find yourself writing a cursor loop (select … from … loop), and inside that loop you run more SQL, you’re doing it wrong.

Think about it. Do you eat your cornflakes by placing one flake in your bowl, adding the milk, and eating that one piece? Then doing the same for the next. And the next. And so on? Or do you pour yourself a big bowl and eat all the flakes at once?

If you have a cursor loop with more SQL within the loop, you’re effectively doing this. There’s a lot of overhead in executing each SQL statement. This will slow you down if you have a large number of statements that each process one row. Instead you want few statements that process lots of rows where possible.

It’s also possible to do this by accident. As developers we’re taught that code reuse is A Good Thing. So if there’s an API available we’ll often use it. For example, say you’re building a batch process. This finds the unshipped orders, places them on shipments and marks them as sent.

If a ship_order function exists, you could write something like:

for o in (select order_id from unshipped_orders) loop
  ship_order ( o.order_id );
end loop;

The problem here is that ship_order almost certainly contains SQL. SQL you’ll be executing once for every order awaiting postage. If it’s only a few, this may be fine. But if there are hundreds or thousands, this process could take a long time to run.

The way to make this faster is to process all the orders in one go. You can do this with SQL like:

insert into shipments
  select … from unshipped_orders;

update unshipped_orders
set  shipment_date = sysdate;

You may counter that there’s other, non-SQL processing you need to do, such as sending emails. So you still need a query to find the order ids.

But you can overcome this! With update’s returning clause, you can get values from all the changed rows:

update unshipped_orders
set  shipment_date = sysdate
returning order_id bulk collect into order_array;

This gives you all the order ids to use as you need.

2. Learn what an execution plan is and how to create and read one

“How can I make my SQL faster” is one of the most common classes of questions posted on Ask Tom. The trouble is there’s scant one-size-fits-all advice when it comes to SQL performance. To help we need to know what your query is, what the tables and indexes are and details about the data. But most importantly we need to know what the query is actually doing!

For example, say you want me to help you figure out a faster route to work. To do this I need to know which route you currently use and how long each part of it takes. We can then compare this against other routes, checking how far they are, expected traffic and predicted speeds. But we need the data to do this!

So when answering performance questions, the first thing we look for is an execution plan. Many confuse this with an explain plan. An explain plan is just a prediction. Often it’s wrong. And even when it’s right, we still don’t know how much work each step does.

An execution plan shows exactly what the database did. It also gives stats about how much work, how often and how long it took to process each step. Combine this with a basic understanding of indexes and join methods and you can often solve your own performance problems.
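
In Oracle, one common way to capture such an actual plan, including per-step row counts and timings, is the DBMS_XPLAN package (a sketch; run the statement first, then ask for the plan of the last cursor):

SELECT /*+ GATHER_PLAN_STATISTICS */ count(*)
FROM film_actor
WHERE actor_id = 1;

SELECT *
FROM TABLE(dbms_xplan.display_cursor(format => 'ALLSTATS LAST'));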

3. Use bind variables

Sadly data breaches are all too common. There hardly seems to be a week that goes by without news of a major company leaking sensitive data. And the root cause of these attacks is often SQL injection.

This is a simple, well known attack vector. If you write vulnerable SQL in a web-enabled application, eventually you’ll be attacked.

And this isn’t something you can avoid by using NoSQL databases. SQL-injection-like attacks are possible there too!

Fortunately the solution is easy: use bind variables. Not only do these secure your application, they can improve performance too.
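
A minimal sketch of the difference, against Sakila’s ACTOR table (the bind variable name is arbitrary):

-- Vulnerable: user input concatenated into the statement text, e.g.
--   'SELECT * FROM actor WHERE last_name = ''' || p_input || ''''
--
-- Safe: the statement text stays constant, the value is bound separately
SELECT * FROM actor WHERE last_name = :last_name;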

Make sure your company is not tomorrow’s data leak headline. Use bind variables!

Last but not least: When will Oracle have a BOOLEAN type? 🙂

We have a BOOLEAN type! It’s just only in PL/SQL ;P

There’s currently a push in the community for us to add a SQL BOOLEAN type. If this is a feature you’d like to see, you can vote for it on the Database Ideas forum. The more support there is, the more likely we are to implement it! But no promises 😉

Why Most Programmers Get Pagination Wrong


Pagination is one of those things that almost everyone gets wrong for two reasons:

  • User experience
  • Database performance

Here’s why.

What’s wrong with pagination?

Most applications blindly produce pagination like this:

(Screenshot: GMail-style pagination controls)

This is how GMail implements pagination. With my current settings, it displays 100 E-Mails at a time and also shows how many E-Mails there are in total, namely 1094. Those aren’t the total number of E-Mails I’ve ever had, they’re the total number of E-Mails in my “blog-idea” label (I’m using GMail as a TODO list, and yes, this blog won’t run out of articles any time soon).

What’s wrong with this practice?

Bad user experience

As a user, in most cases, I don’t really care about the total number of objects that are in my result set. At least, I don’t care about the exact number. Does it make any difference if I have 1094 or 1093 E-Mails? What about if I had 1067? Or 1000? 1000 would be precise enough as far as I’m concerned.

Also, as a user, in most cases, I don’t care that I’m on page 317 of my paginated screen that displays rows 3170-3179 (assuming 10 rows per page). I really don’t. The page number is absolutely useless in terms of user experience.

Who got it right?

  • Facebook
  • Twitter
  • Reddit

And all the other websites that do timelines. Yes, I want to display only 10 rows at a time (or perhaps 100, but almost never all). So, pagination is important. But I don’t care about the fact that I’ve clicked 317 times on that “next page” button. If I ever browse that many pages (and I hardly ever do), then the only thing that matters is the next 10 rows. Just like when you play Civilization. You don’t care that you’re in turn 317. You just want to play one more turn:

(Screenshot: Civilization’s “one more turn” prompt)

Moreover, I never ever ever want to jump to page 317 right from the beginning. There’s absolutely no use case out there where I search for something and then say: hey, I believe my search result will be item #3175 in the current sort order. Exactly. Instead, I will do any of these:

  • Refine the search
  • Sort the result

In both cases, I will get a result where the record that I’m looking for is much more likely to appear on page #1 or perhaps #2. Again, when was the last time you googled for SQL and then went to page #18375 to find that particular blog post that you were looking for? No. You searched for “Java 8 SQL” to find jOOQ, the best way to write SQL in Java 8. For instance.

How to implement a timeline with SQL

If your data source is a SQL database, you might have implemented pagination by using LIMIT .. OFFSET, or OFFSET .. FETCH or some ROWNUM / ROW_NUMBER() filtering (see the jOOQ manual for some syntax comparisons across RDBMS). OFFSET is the right tool to jump to page 317, but remember, no one really wants to jump to that page, and besides, OFFSET just skips a fixed number of rows. If there are new rows in the system between the time page number 316 is displayed to a user and when the user skips to page number 317, the rows will shift, because the offsets will shift. No one wants that either, when they click on “next”.
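
For reference, jumping to “page 317” with classic offset pagination looks something like this (standard OFFSET .. FETCH syntax, here against Sakila’s RENTAL table):

SELECT rental_id, rental_date
FROM rental
ORDER BY rental_date, rental_id
OFFSET 3170 ROWS FETCH NEXT 10 ROWS ONLY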

Instead, you should be using what we refer to as “keyset pagination” (as opposed to “offset pagination”). We’ve described this in past articles here:

The SQL syntax is a bit cumbersome as the pagination criteria becomes an ordinary predicate, but if you’re using jOOQ, you can use the simple synthetic SQL SEEK clause as such:

DSL.using(configuration)
   .select(PLAYERS.PLAYER_ID,
           PLAYERS.FIRST_NAME,
           PLAYERS.LAST_NAME,
           PLAYERS.SCORE)
   .from(PLAYERS)
   .where(PLAYERS.GAME_ID.eq(42))
   .orderBy(PLAYERS.SCORE.desc(),
            PLAYERS.PLAYER_ID.asc())
   .seek(949, 15)
   .limit(10)
   .fetch();

The above will fetch the next 10 players after the player with SCORE 949 and ID 15. The pagination really depends on the ORDER BY clause, which is why you have to provide as many values in the pagination as you provided columns in the ORDER BY clause.
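
In plain SQL, that seek predicate spells out the ORDER BY logic explicitly, roughly like this (table and column names as in the jOOQ example above):

SELECT player_id, first_name, last_name, score
FROM players
WHERE game_id = 42
AND (score < 949 OR (score = 949 AND player_id > 15))
ORDER BY score DESC, player_id ASC
FETCH FIRST 10 ROWS ONLY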

Now that we’ve fixed the user experience, let’s also look at …

How OFFSET pagination is bad for performance

The previously linked articles about keyset pagination also mention the poor performance of OFFSET pagination. Which is kind of obvious as OFFSET has to skip a given number of rows after applying all predicates and sorting, etc. So the database has to do all the work and then throw away 3170 records (if you’re jumping to page 317 with 10 rows per page). What a waste.

The following diagram shows very nicely how OFFSET gets slower and slower for large offsets:

Reproduced from use-the-index-luke.com with permission by Markus Winand

That’s the obvious problem, but there’s another one. People always count the total number of rows to calculate the total number of possible pages. Why? To display nonsense like the following:

Page number:
1 2 3 ... 315 316 317 318 319 ... 50193 50194

Wow. OK so we’re on page number 317, which we don’t really care about in the first place, but we could just as well jump to page number 50194. This means that the database needed to run the query across all the rows just to be sure we get exactly 50194 pages in total.

Google something like “page number pagination” and observe the number of tutorials that show how you can implement the above nonsense. On Google Image search, you’ll find:

(Screenshot: Google Image search results for “pagination”)

At the same time, the Google search itself reveals:

(Screenshot: Google’s own search result pagination)

As you can see, Google estimates that there are probably at least 10 pages for your search and you can go “next”. Yes, you can skip some pages, but you cannot skip to a page number 50194, because, again:

  • No one wants that
  • It’s costly to predict, even for Google

In fact, Google search implements keyset pagination as well, just like Twitter, Facebook, Reddit. And they don’t display the total number of pages because counting that total can be very costly, depending on your database.

In particular, databases that do not support window functions will require you to run two separate queries:

  1. The actual query with a LIMIT clause
  2. An additional query replacing the SELECT column list with a simple COUNT(*) (sketched below)
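
Against Sakila’s RENTAL table (the same query used further down), that two-step variant could look roughly like this:

-- Query 1: the page itself
SELECT rental_date, inventory_id
FROM rental
WHERE customer_id = 1
ORDER BY rental_date
LIMIT 10;

-- Query 2: a second round trip, just for the total
SELECT count(*)
FROM rental
WHERE customer_id = 1;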

Needless to say, this is not the best approach. If your database supports window functions (read about that miraculous SQL feature here on the jOOQ blog), you could produce the total row count in one go as such:

SELECT 
  rental_date, 
  inventory_id,
  COUNT(*) OVER()
FROM rental
WHERE customer_id = 1
ORDER BY rental_date
LIMIT 10

That COUNT(*) OVER() window function is like an ordinary aggregate function, except that it doesn’t group your results. It just counts all the rows of your result and produces that count in each row, prior to limiting the result to 10.

When run against the Sakila database, the above produces:

rental_date          inventory_id  count
2005-05-25 11:30:37          3021     32
2005-05-28 10:35:23          4020     32
2005-06-15 00:54:12          2785     32
2005-06-15 18:02:53          1021     32
2005-06-15 21:08:46          1407     32
2005-06-16 15:18:57           726     32
2005-06-18 08:41:48           197     32
2005-06-18 13:33:59          3497     32
2005-06-21 06:24:45          4566     32
2005-07-08 03:17:05          1443     32

So, we’re displaying the first page with 10 rows and we need to provide navigational links for a total of 4 pages because we have a total of 32 rows.

What happens when we benchmark this query on PostgreSQL? The first run doesn’t calculate this COUNT(*) OVER() value, whereas the second one does:

DO $$
DECLARE
  v_ts TIMESTAMP;
  v_repeat CONSTANT INT := 10000;
  rec RECORD;
BEGIN
  v_ts := clock_timestamp();

  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT 
        rental_date, 
        inventory_id
      FROM rental
      WHERE customer_id = 1
      ORDER BY rental_date
      LIMIT 10
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;

  RAISE INFO 'Statement 1: %', (clock_timestamp() - v_ts); 
  v_ts := clock_timestamp();

  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT 
        rental_date, 
        inventory_id,
        COUNT(*) OVER()
      FROM rental
      WHERE customer_id = 1
      ORDER BY rental_date
      LIMIT 10
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;

  RAISE INFO 'Statement 2: %', (clock_timestamp() - v_ts); 
END$$;

The result clearly indicates that in PostgreSQL, there’s a significant overhead in calculating this value:

INFO:  Statement 1: 00:00:01.041823
INFO:  Statement 2: 00:00:03.57145

Oracle optimises things a bit better when you’re using ROWNUM to paginate:

SET SERVEROUTPUT ON
DECLARE
  v_ts TIMESTAMP;
  v_repeat CONSTANT NUMBER := 5000;
BEGIN
  v_ts := SYSTIMESTAMP;
     
  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT 
        rental_date, 
        inventory_id
      FROM (
        SELECT 
          rental.*, 
          ROWNUM rn
        FROM rental
        WHERE customer_id = 1
        ORDER BY rental_date
      ) rental
      WHERE rn < 5
      ORDER BY rn
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;
     
  dbms_output.put_line('Statement 1: ' || (SYSTIMESTAMP - v_ts));
  v_ts := SYSTIMESTAMP;
     
  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT 
        rental_date, 
        inventory_id,
        COUNT(*) OVER()
      FROM (
        SELECT 
          rental.*,  
          ROWNUM rn
        FROM rental
        WHERE customer_id = 1
        ORDER BY rental_date
      ) rental
      WHERE rn < 5
      ORDER BY rn
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;
     
  dbms_output.put_line('Statement 2: ' || (SYSTIMESTAMP - v_ts));
END;
/

Result:

Statement 1: X
Statement 2: X +/- 1%

So, the COUNT(*) seems to be calculated “for free”. Bonus question: Why is that?

Due to Oracle license restrictions, we cannot publish benchmark results comparing Oracle with PostgreSQL here, sorry, but you can run the above code yourself against the Sakila database:
https://github.com/jOOQ/jOOQ/tree/master/jOOQ-examples/Sakila

Conclusion

TL;DR: OFFSET pagination bad. Keyset pagination good.


If you want to paginate in your application, please ask yourself whether you really, really, really need:

  • Exact page number
  • High page numbers
  • The last page number
  • The total number of rows

Because if you don’t (and in 98% of all UIs, you really don’t), then you can drastically speed up your queries while providing your users a much better experience. If that’s not a win-win situation worth thinking about…?

And don’t forget, jOOQ ships with native keyset pagination support!

jOOQ Tuesdays: Thorben Janssen Shares his Hibernate Performance Secrets


Welcome to the jOOQ Tuesdays series. In this series, we’ll publish an article on the third Tuesday every other month where we interview someone we find exciting in our industry from a jOOQ perspective. This includes people who work with SQL, Java, Open Source, and a variety of other related topics.


I’m very excited to feature today Thorben Janssen who has spent most of his professional life with Hibernate.

Thorben, with your blog and training, you are one of the few daring “annotatioficionados” as we like to call them, who risks diving deep into JPA’s more sophisticated annotations – like @SqlResultSetMapping. What is your experience with JPA’s advanced, declarative programming style?

From my point of view, the declarative style of JPA is great and a huge problem at the same time.

If you know what you’re doing, you just add an annotation, set a few properties and your JPA implementation takes care of the rest. That makes it very easy to use complex features and avoids a lot of boilerplate code.

But it can also become a huge issue, when someone is not that familiar with JPA and just copies a few annotations from stack overflow and hopes that it works.

It will work in most cases. JPA and Hibernate are highly optimized and handle suboptimal code and annotations quite well. At least as long as it is tested with one user on a local machine. But that changes quickly when the code gets deployed to production and several hundred or thousand users use it in parallel. These issues then often get posted on Stack Overflow or other forums, together with a complaint about the bad performance of Hibernate…

Your training goes far beyond these rather esoteric use-cases and focuses on JPA / Hibernate performance. What are three things every ORM user should know about JPA / SQL performance?

Only three things? I could talk about a lot more things related to JPA and Hibernate performance.

By far the most important one is to remember that your ORM framework is using SQL to store your data in a relational database. That seems to be pretty obvious, but you can avoid the most common performance issues by analyzing and optimizing the executed SQL statements. One example of that is the popular N+1 select issue, which you can easily find and fix, as I show in my free, 3-part video course.

Another important thing is that no framework or specification provides a good solution for every problem. JPA and Hibernate make it very easy to insert and update data in a relational database. And they provide a set of advanced features for performance optimizations, like caching or the ordering of statements to improve the efficiency of JDBC batches.

But Hibernate and JPA are not a good fit for applications that have to perform a lot of very complex queries for reporting or data mining use cases. The feature set of JPQL is too limited for these use cases. You can, of course, use native queries to execute plain SQL, but you should have a look at other frameworks if you need a lot of these queries.

So, always make sure that your preferred framework is a good fit for your project.

The third thing you should keep in mind is that you should prefer lazy fetching for the relationships between your entities. This prevents Hibernate from executing additional SQL queries to initialize the relationships to other entities when it gets an entity from the database. Most use cases don’t need the related entities, and the additional queries slow down the application. And if one of your use cases uses the relationships, you can use FETCH JOIN statements or entity graphs to initialize them with the initial query.

This approach avoids the overhead of unnecessary SQL queries for most of your use cases and allows you to initialize the relationships if you need them.

These are the 3 most important things you should keep in mind if you want to avoid performance problems with Hibernate. If you want to dive deeper into this topic, have a look at my Hibernate Performance Tuning Online Training. The next one starts on 23rd July.

What made you focus your training mostly on Hibernate, rather than also on EclipseLink / OpenJPA, or just plain SQL / jOOQ? Do you have plans to extend to those topics?

To be honest, that decision was quite easy for me. I’ve been working with Hibernate for about 15 years now and have used it in a lot of different projects with very different requirements. That gives me the experience and knowledge about the framework which you need if you want to optimize its performance. I also tried EclipseLink, but not to the same extent as Hibernate.

And I also asked my readers which JPA implementation they use, and most of them told me that they either use plain JPA or Hibernate. That made it pretty easy to focus on Hibernate.

I might integrate jOOQ into one of my future trainings. Because as I said before, Hibernate and JPA are a good solution if you want to create or update data or if your queries are not too complex. As soon as your queries get complex, you have to use native queries with plain SQL. In these cases, jOOQ can provide some nice benefits.

What’s the advantage of your online training over a more classic training format, where people meet physically – both for you and for your participants?

The good thing about a classroom training is that you can discuss your questions with other students and the instructor. But it also requires you to be in a certain place at a certain time, which creates additional costs, requires you to get out of your current projects and keeps you away from home.

With the Hibernate Performance Tuning Online Training, I want to provide a similar experience to a classroom training, in which you study with other students and ask your questions, but without having to travel somewhere. You can watch my training videos and do the exercises from your office or home, and meet with me and other students in the forum or in group coaching calls to discuss your questions.

So you get the best of both worlds without declaring any travel expenses 😉

Your blog also includes a weekly digest of all things happening in the Java ecosystem called Java Weekly. What are the biggest insights into our ecosystem that you’ve gotten out of this work, yourself?

The Java ecosystem is always changing and improving, and you need to learn constantly if you want to stay up to date. One way to do that is to read good blog posts. And there are A LOT of great, small blogs out there written by very experienced Java developers who like to share their knowledge. You just have to find them. That’s probably the biggest insight I got.

I read a lot about Java and Java EE each week (that’s probably the only advantage of a 1.5-hour commute with public transportation) and present the most interesting ones every Monday in a new issue of Java Weekly.

How to Detect Slow Queries with jOOQ


Just now, we implemented a nice little feature in jOOQ’s code generator:
https://github.com/jOOQ/jOOQ/issues/4974

It detects whenever the jOOQ code generator runs a slow query to reverse engineer schema meta information. Why?

In our development and integration test environment, we don’t have huge schemas with all the different performance edge cases put in place. For instance, we don’t have 5000 Oracle synonyms. Or 10000 procedures with 500 parameters each. We do cover some common edge cases, but not in all databases.

Users on the other hand tend to accept the status quo after a while. The code generator is slow-ish? Sure, because we have a huge schema. This lazy acceptance is an impediment to our product quality. We’d rather our users reported all sorts of issues they encounter, so we want to encourage them to do so.

And we did

In the upcoming jOOQ version 3.8 (and patch releases for 3.5.5, 3.6.5, and 3.7.3), we’ve added a nice little ExecuteListener to jOOQ-meta, which looks roughly like this:

class PerformanceListener 
    extends DefaultExecuteListener {

    StopWatch watch;
    class SQLPerformanceWarning 
        extends Exception {}

    @Override
    public void executeStart(ExecuteContext ctx) {
        super.executeStart(ctx);
        watch = new StopWatch();
    }

    @Override
    public void executeEnd(ExecuteContext ctx) {
        super.executeEnd(ctx);
        if (watch.split() > 5_000_000_000L)
            log.warn(
                "Slow SQL",
                "jOOQ Meta executed a slow query"
              + "\n\n"
              + "Please report this bug here: "
              + "https://github.com/jOOQ/jOOQ/issues/new\n\n"
              + formatted(ctx.query()),
                new SQLPerformanceWarning());
    }
}

It’s very simple. Every time we start executing a query, a “stop watch” is started. Every time execution ends, we check whether the watch measured more than 5 seconds. If so, we log a warning, a link to our issue tracker, a formatted version of the SQL query, and a stack trace to help find the exact spot where the slow statement was executed.

Let’s run this

The reason we did this is that we’ve seen for ourselves that the PostgreSQL code generator runs a slow query to fetch all the stored procedures (and to generate overload indexes). The produced error message is:

[WARNING] Slow SQL                 : jOOQ Meta executed a slow query (slower than 5 seconds)

Please report this bug here: https://github.com/jOOQ/jOOQ/issues/new

select
  "r1"."routine_schema",
  "r1"."routine_name",
  "r1"."specific_name",
  case when exists (
        select 1 as "one"
        from "information_schema"."parameters"
        where (
          "information_schema"."parameters"."specific_schema" = "r1"."specific_schema"
          and "information_schema"."parameters"."specific_name" = "r1"."specific_name"
          and upper("information_schema"."parameters"."parameter_mode")  'IN'
        )
      ) then 'void'
       else "r1"."data_type"
  end as "data_type",
  "r1"."character_maximum_length",
  "r1"."numeric_precision",
  "r1"."numeric_scale",
  "r1"."type_udt_schema",
  "r1"."type_udt_name",
  case when exists (
        select 1 as "one"
        from "information_schema"."routines" as "r2"
        where (
          "r2"."routine_schema" in (
            'public', 'multi_schema', 'pg_catalog'
          )
          and "r2"."routine_schema" = "r1"."routine_schema"
          and "r2"."routine_name" = "r1"."routine_name"
          and "r2"."specific_name"  "r1"."specific_name"
        )
      ) then (
        select count(*)
        from "information_schema"."routines" as "r2"
        where (
          "r2"."routine_schema" in (
            'public', 'multi_schema', 'pg_catalog'
          )
          and "r2"."routine_schema" = "r1"."routine_schema"
          and "r2"."routine_name" = "r1"."routine_name"
          and "r2"."specific_name" <= "r1"."specific_name"
        )
      ) end as "overload",
  "pg_catalog"."pg_proc"."proisagg"
from "information_schema"."routines" as "r1"
  join "pg_catalog"."pg_namespace"
  on "pg_catalog"."pg_namespace"."nspname" = "r1"."specific_schema"
  join "pg_catalog"."pg_proc"
  on (
    "pg_catalog"."pg_proc"."pronamespace" = "pg_catalog"."pg_namespace".oid
    and (("pg_catalog"."pg_proc"."proname" || '_') || cast("pg_catalog"."pg_proc".oid as varchar)) = "r1"."specific_name"
  )
where (
  "r1"."routine_schema" in (
    'public', 'multi_schema', 'pg_catalog'
  )
  and not("pg_catalog"."pg_proc"."proretset")
)
order by
  "r1"."routine_schema" asc,
  "r1"."routine_name" asc,
  "overload" asc
org.jooq.util.AbstractDatabase$1$SQLPerformanceWarning
        at org.jooq.util.AbstractDatabase$1.executeEnd(AbstractDatabase.java:230)
        at org.jooq.impl.ExecuteListeners.executeEnd(ExecuteListeners.java:163)
        at org.jooq.impl.AbstractResultQuery.execute(AbstractResultQuery.java:269)
        at org.jooq.impl.AbstractQuery.execute(AbstractQuery.java:346)
        at org.jooq.impl.AbstractResultQuery.fetch(AbstractResultQuery.java:308)
        at org.jooq.impl.SelectImpl.fetch(SelectImpl.java:2703)
        at org.jooq.util.postgres.PostgresDatabase.getRoutines0(PostgresDatabase.java:707)
        at org.jooq.util.AbstractDatabase.getRoutines(AbstractDatabase.java:1131)
        at org.jooq.util.JavaGenerator.generate(JavaGenerator.java:417)
        at org.jooq.util.JavaGenerator.generate(JavaGenerator.java:314)
        at org.jooq.util.JavaGenerator.generate(JavaGenerator.java:279)
        at org.jooq.util.GenerationTool.run(GenerationTool.java:490)
        at org.jooq.util.GenerationTool.generate(GenerationTool.java:193)
        at org.jooq.util.maven.Plugin.execute(Plugin.java:131)
        ...

We can now proceed with fixing the query, easily.

You can do the same!

The implementation of the ExecuteListener was straightforward. You can do the same, very easily. Just hook a simple execute listener into your jOOQ Configuration, measure execution speeds, and log warnings after a threshold. Done.

Happy debugging!

Further reading

Coincidentally, a very similar approach has been documented by the engineering team at Square – The Query Sniper:

https://corner.squareup.com/2016/01/query-sniper.html

Still Using Windows 3.1? So why stick to SQL-92?


We’ve been blogging a lot about the merits of modern SQL on the jOOQ blog. Specifically, window functions are one of the most fascinating features. But there are many, many others.

Markus Winand, author of the popular book SQL Performance Explained, has recently given a very well-researched talk about modern SQL. We particularly like his headline:

Should this wonderful presentation have convinced you to buy a copy of SQL Performance Explained (our review here), do not forget to enter the “jOOQ” promo code for an exclusive 10% discount!