When to Choose SQL and When to Choose NoSQL


Some people make architecture decisions purely based on what the loudest consultant says.

For most others, however, decisions are not as simple as this. For instance: When should we start evaluating NoSQL storage systems as an alternative to RDBMS?

RDBMS as the hammer for all nails

This question obviously assumes that you’re starting out with an RDBMS, which is classically the database system that solves pretty much any problem decently enough not to be replaced easily. What does this mean? Simply put:

  • RDBMS have been around forever, so they have a huge advantage compared to “newcomers” in the market, who don’t yet have all the excellent tooling, community, support, and maturity
  • E.F. Codd’s work may have been the single biggest influence on our whole industry. There has hardly been anything as revolutionary as the relational model ever since. It’s hard for an alternative database to be equally universal, i.e. they’re mostly solving niche problems

Having said that, sometimes you do have a niche problem. For instance, a graph database problem. In fact, a graph is nothing fundamentally different from what you can represent in the relational model: it is easy to model a graph with a many-to-many relationship table, as the following sketch shows.
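
A minimal sketch, with hypothetical node / edge tables (not from any particular product):

-- A directed graph as two plain relational tables
CREATE TABLE node (
  id    INT PRIMARY KEY,
  label VARCHAR(50) NOT NULL
);

CREATE TABLE edge (
  source_id INT NOT NULL REFERENCES node,
  target_id INT NOT NULL REFERENCES node,
  PRIMARY KEY (source_id, target_id)
);

-- All direct neighbours of node 1
SELECT n.*
FROM edge e
JOIN node n ON n.id = e.target_id
WHERE e.source_id = 1;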

The same is true for XML/JSON in the database (don’t forget, JSON is just XML with less syntax and fewer features, so it’s less awesome than XML). Sometimes you need to actually store document structures (hierarchical data) in their hierarchical form in your database, instead of normalising them. Sure, you could normalise a document, but that’s a lot of work and not necessarily beneficial. This was actually the question Gareth asked on Twitter, which triggered the subsequent discussion.

Most modern RDBMS offer means to store (and much more importantly, query) XML/JSON data structures in simple forms. This includes PostgreSQL, Oracle, DB2, SQL Server and others.
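
For example, PostgreSQL’s jsonb type lets you store and query documents directly. A minimal sketch, assuming a hypothetical books table:

CREATE TABLE books (
  id  SERIAL PRIMARY KEY,
  doc JSONB NOT NULL
);

INSERT INTO books (doc) VALUES
  ('{"title": "SQL Performance Explained", "tags": ["sql", "performance"]}');

-- Query into the document, no normalisation required
SELECT doc->>'title'
FROM books
WHERE doc @> '{"tags": ["sql"]}';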

So, when do we decide to switch?

As developers, we may be inclined to switch quite quickly. E.g. when working with graphs, we’d love to use Neo4j which implements the awesome Cypher query language. When working with JSON, we’d love to use something like Couchbase which implements the interesting N1QL query language. Both of these query languages are heavily inspired by SQL, which were wise vendor choices in my opinion (as opposed to MongoDB’s weird JSON-based query language), because in the end, the SQL language is still the most powerful and popular 4GL that was ever created.

But as developers, we shouldn’t make that decision lightly. Yes, at first, these specialised databases seem like a better fit. But the additional cost for operations teams to maintain, monitor, patch, and tune production systems must not be underestimated. This is even true within the RDBMS ecosystem. A prominent recent example was Uber’s switch (back!) from PostgreSQL to MySQL:

Why Uber Engineering Switched from Postgres to MySQL

Do note that they had switched the other way round before, only to regret it. Truth is, there are tons of reasons why your operations teams prefer to always use the same database, even if that’s quite expensive in terms of licensing. In many cases, however, it would be even more expensive to:

  • Engage in additional license and/or support contracts with new database vendors
  • Find skilled DBAs for the new technology (which can be very hard with niche databases)
  • Maintain two data silos and possibly sustain the cost of keeping them in sync

Ultimately, there’s a threshold:

In terms of using JSON in the database, it’s simple:

  • Occasional JSON storage: Stick with RDBMS.
  • Everything is JSON: Perhaps not RDBMS.

The same is true for graph problems. SQL can totally handle graphs and traverse them with recursive SQL. Here’s a funky statement that recursively calculates the subset sum problem:

A recursive SQL statement calculating the closest subset sum

Full article here: How to Find the Closest Subset Sum with SQL

If you only have a few tree/graph traversals to calculate (e.g. for a simple menu structure), don’t jump ship from your RDBMS yet. If graphs are your business and that’s all you do, then RDBMS probably aren’t the right choice.
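
For illustration, such a menu traversal might look like this (a sketch in PostgreSQL syntax, assuming a hypothetical menu_item table with a self-referencing parent_id column):

WITH RECURSIVE menu_tree (id, title, depth) AS (
  SELECT id, title, 1
  FROM menu_item
  WHERE parent_id IS NULL
  UNION ALL
  SELECT c.id, c.title, p.depth + 1
  FROM menu_item c
  JOIN menu_tree p ON c.parent_id = p.id
)
SELECT lpad(' ', 2 * (depth - 1)) || title AS menu
FROM menu_tree;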

Conclusion

Whatever problem you’re solving, do remember: Yes, if all you have is a hammer, every problem starts looking like a nail. But don’t dismiss RDBMS as a silly hammer. It is a very mighty hammer and in 2016, it can do a lot of non-relational niche things decently well.

RDBMS are still the best default choice for all sorts of data problems. Only once you have gone above a certain threshold (or if you can foresee reaching it) should you look for alternatives. Because those alternatives will give you a much rougher time when you go outside of that niche (JSON, graphs, etc.) back to your “ordinary” relational business.

A Hidden jOOQ Gem: Foreach Loop Over ResultQuery


A recent question on Stack Overflow about jOOQ caught my attention. The question essentially asked:

Why do both of these loops work?

// With fetch()
for (MyTableRecord rec : DSL
    .using(configuration)
    .selectFrom(MY_TABLE)
    .orderBy(MY_TABLE.COLUMN)
    .fetch()) { // fetch() here

    doThingsWithRecord(rec);
}

// Without fetch()
for (MyTableRecord rec : DSL
    .using(configuration)
    .selectFrom(MY_TABLE)
    .orderBy(MY_TABLE.COLUMN)) { // No fetch() here

    doThingsWithRecord(rec);
}

And indeed, just like in PL/SQL, you can use any jOOQ ResultQuery as a Java 5 Iterable, because that’s what it is: an Iterable<R> where R extends Record.

The semantics are simple: when Iterable.iterator() is invoked, the query is executed and Result.iterator() is returned. So, the result is materialised in client memory just as if fetch() had been called. Unsurprisingly, this is the implementation of AbstractResultQuery.iterator():

@Override
public final Iterator<R> iterator() {
    return fetch().iterator();
}

No magic. But it’s great that this works like PL/SQL:

FOR rec IN (SELECT * FROM my_table ORDER BY my_table.column)
LOOP
  doThingsWithRecord(rec);
END LOOP;

Note, unfortunately, there’s no easy way to manage resources through Iterable, i.e. there’s no AutoCloseableIterable returning an AutoCloseableIterator, which could be used in an auto-closing try-with-resources style loop. This is why the entire result set needs to be fetched at the beginning of the loop. For lazy fetching, you can still use ResultQuery.fetchLazy():

try (Cursor<MyTableRecord> cursor = DSL
    .using(configuration)
    .selectFrom(MY_TABLE)
    .orderBy(MY_TABLE.COLUMN)
    .fetchLazy()) {

    for (MyTableRecord rec : cursor)
        doThingsWithRecord(rec);
}

Happy coding!

Avoid Using COUNT() in SQL When You Could Use EXISTS()


A while ago, I blogged about the importance of avoiding unnecessary COUNT(*) queries:
SQL Tip of the Day: Be Wary of SELECT COUNT(*)

… and how to replace them with equivalent EXISTS queries.


As I’m updating the jOOQ SQL Masterclass to also show PostgreSQL performance characteristics in addition to Oracle’s, I really have to reiterate this topic. Please repeat after me:

Thou shalt not use COUNT(*) when EXISTS sufficeth thy needs

The rationale is simple

COUNT(*) needs to return the exact number of rows. EXISTS only needs to answer a question like:

  “Are there any rows at all?”

In other words, EXISTS can short-circuit after having found the first matching row. If your client code (e.g. written in Java or in PL/SQL, or any other client language) needs to know something like:

  “Did actors called ‘Wahlberg’ play in any films at all?”

Then you have two options to write that query:

Very very bad: Use COUNT(*)

Using PostgreSQL syntax:

SELECT count(*)
FROM actor a
JOIN film_actor fa USING (actor_id)
WHERE a.last_name = 'WAHLBERG'

The above query will return a number > 0 if any Wahlberg played in a film, or 0 if not. Notice that we don’t care how many films all the Wahlbergs played in, yet we ask the database to calculate the precise number.

Let’s run the above query against the Sakila database. The execution plans for the above query in Oracle:

Oracle Execution Plan for a Query with COUNT(*)

And in PostgreSQL:

PostgreSQL Execution Plan for a Query with COUNT(*)

Much much better: Use EXISTS()

Using PostgreSQL syntax:

SELECT EXISTS (
  SELECT * FROM actor a
  JOIN film_actor fa USING (actor_id)
  WHERE a.last_name = 'WAHLBERG'
)

The execution plans for the above query in Oracle:

Oracle Execution Plan for a Query with EXISTS()

And in PostgreSQL:

PostgreSQL Execution Plan for a Query with EXISTS()

How to read this?

As you can see from the above execution plans, the cost in Oracle is slightly better (going from 3 to 2) when using EXISTS than when using COUNT(*), because of a much better cardinality estimate in the middle of the execution plan. In other words, Oracle “knows” that we’re looking for only one record and as soon as that record has been found, we can stop looking.

In PostgreSQL, things are more drastic (going from 123 to 3.4). The EXISTS version has an associated cost that is roughly 36x lower than the version that uses COUNT(*) for the same result.

You can gaze at the plan for a while and figure out what the exact difference is, or you can believe me that this is true:

It is obviously much faster to check for existence rather than to count all results, if what you’re really trying to do is to check for existence

Duh.

Does this apply to me?

Yes. I’m taking bets. Many, many code bases out there get this wrong all the time. Checking for sizes to be zero is just too convenient. Not only in SQL, but also in Java. Consider this. Which one is better?

Collection<?> collection = ...

// EXISTS
if (!collection.isEmpty())
    doSomething();

// COUNT(*)
if (collection.size() != 0)
    doSomething();

Sometimes, this doesn’t really matter, e.g. in ArrayList, whose isEmpty() method reads:

public boolean isEmpty() {
    return size == 0;
}

But what if your collection is a lazily loaded Hibernate collection? Not all collections cache this size value, and even if they do, they may still produce overhead in the source system in order to calculate the exact size. In fact, they might even run a completely unnecessary query fetching all the child entities from the database just to check for existence.

Bonus exercise for my Hibernate-aficionado readers out there: Do the exercise with Hibernate. Because at this point, I for one would say: Just use SQL™
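
In SQL, the existence check is a one-liner that fetches no entities at all (PostgreSQL syntax; hypothetical parent / child tables):

SELECT EXISTS (
  SELECT 1
  FROM child
  WHERE child.parent_id = 42
);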

OK, costs. But what does it mean?

Let’s benchmark these two statements in Oracle and PostgreSQL.

Oracle

SET SERVEROUTPUT ON
DECLARE
  v_ts TIMESTAMP WITH TIME ZONE;
  v_repeat CONSTANT NUMBER := 10000;
BEGIN
  v_ts := SYSTIMESTAMP;
    
  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT CASE WHEN EXISTS (
        SELECT * FROM actor a
        JOIN film_actor fa USING (actor_id)
        WHERE a.last_name = 'WAHLBERG'
      ) THEN 1 ELSE 0 END
      FROM dual
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;
    
  dbms_output.put_line('Statement 1 : ' || (SYSTIMESTAMP - v_ts));
  v_ts := SYSTIMESTAMP;
    
  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT count(*)
      FROM actor a
      JOIN film_actor fa USING (actor_id)
      WHERE a.last_name = 'WAHLBERG'
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;
    
  dbms_output.put_line('Statement 2 : ' || (SYSTIMESTAMP - v_ts));
END;
/

We get a slight but significant performance improvement of factor 1.3x:

Statement 1 : 3
Statement 2 : 4

(not actual times, because thank you Oracle legal for prohibiting all sorts of stuff). But you can check out the Sakila database yourself and run the above benchmark on your machine.

PostgreSQL

DO $$
DECLARE
  v_ts TIMESTAMP;
  v_repeat CONSTANT INT := 1000;
  rec RECORD;
BEGIN
  v_ts := clock_timestamp();

  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT CASE WHEN EXISTS (
        SELECT * FROM actor a
        JOIN film_actor fa USING (actor_id)
        WHERE a.last_name = 'WAHLBERG'
      ) THEN 1 ELSE 0 END
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;

  RAISE INFO 'Statement 1: %', (clock_timestamp() - v_ts); 
  v_ts := clock_timestamp();

  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT count(*)
      FROM actor a
      JOIN film_actor fa USING (actor_id)
      WHERE a.last_name = 'WAHLBERG'
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;

  RAISE INFO 'Statement 2: %', (clock_timestamp() - v_ts); 
END$$;

A whopping factor of more than 33x in terms of wall-clock time gain!

INFO:  Statement 1: 00:00:00.023656
INFO:  Statement 2: 00:00:00.7944

Let me repeat this:

Factor 33x on PostgreSQL

That’s something! It looks as though COUNT(*) is much better optimised on Oracle (e.g. by counting leaf nodes in an index) than on PostgreSQL, but in any case, the amount of extra work is prohibitive in both databases.

Conclusion

I’m repeating myself, but this is important. Print it out and put it on your office wall:

Thou shalt not use COUNT(*) when EXISTS sufficeth thy needs

Thank you.

Using jOOλ to Combine Several Java 8 Collectors into One


With Java 8 being mainstream now, people start using Streams for everything, even in cases where that’s a bit exaggerated (a.k.a. completely nuts, if you were expecting a hyperbole here). For instance, take Mkyong’s article here, showing how to collect a Map’s entry set stream into a list of keys and a list of values:

http://www.mkyong.com/java8/java-8-convert-map-to-list

The code posted on mkyong.com does it in two steps:

package com.mkyong.example;

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class ConvertMapToList {
    public static void main(String[] args) {
        Map<Integer, String> map = new HashMap<>();
        map.put(10, "apple");
        map.put(20, "orange");
        map.put(30, "banana");
        map.put(40, "watermelon");
        map.put(50, "dragonfruit");

        System.out.println("\n1. Export Map Key to List...");

        List<Integer> result = map.entrySet().stream()
                .map(x -> x.getKey())
                .collect(Collectors.toList());

        result.forEach(System.out::println);

        System.out.println("\n2. Export Map Value to List...");

        List<String> result2 = map.entrySet().stream()
                .map(x -> x.getValue())
                .collect(Collectors.toList());

        result2.forEach(System.out::println);
    }
}

This is probably not what you should do in your own code. First off, if you’re OK with iterating the map twice, the simplest way to collect a map’s keys and values would be this:

List<Integer> result1 = new ArrayList<>(map.keySet());
List<String> result2 = new ArrayList<>(map.values());

There’s absolutely no need to resort to Java 8 streams for this particular example. The above is about as simple and speedy as it gets.

Don’t shoehorn Java 8 Streams into every problem

But if you really want to use streams, then I would personally prefer a solution where you do it in one go. There’s no need to iterate the Map twice in this particular case. For instance, you could do it by using jOOλ’s Tuple.collectors() method, a method that combines two collectors into a new collector that returns a tuple of the individual collections.

Code speaks more clearly than the above description. mkyong.com’s code could be replaced by this:

Tuple2<List<Integer>, List<String>> result = 
map.entrySet()
    .stream()
    .collect(Tuple.collectors(
        Collectors.mapping(Entry::getKey, Collectors.toList()),
        Collectors.mapping(Entry::getValue, Collectors.toList())
    ));

The only jOOλ code put in place here is the call to Tuple.collectors(), which combines the standard JDK collectors that apply mapping on the Map entries before collecting keys and values into lists.

When printing the above result, you’ll get:

([50, 20, 40, 10, 30], [dragonfruit, orange, watermelon, apple, banana])

i.e. a tuple containing the two resulting lists.

Even simpler, don’t use the Java 8 Stream API, use jOOλ’s Seq (for sequential stream) and write this shorty instead:

Tuple2<List<Integer>, List<String>> result = 
Seq.seq(map)
   .collect(
        Collectors.mapping(Tuple2::v1, Collectors.toList()),
        Collectors.mapping(Tuple2::v2, Collectors.toList())
   );

Where Collectable.collect(Collector, Collector) provides awesome syntactic sugar over the previous example.

Convinced? Get jOOλ here: https://github.com/jOOQ/jOOL

jOOQ Tuesdays: Chris Saxon Explains the 3 Things Every Developer Should Know About SQL


Welcome to the jOOQ Tuesdays series. In this series, we’ll publish an article on the third Tuesday every other month (today, exceptionally on a Wednesday because of technical issues) where we interview someone we find exciting in our industry from a jOOQ perspective. This includes people who work with SQL, Java, Open Source, and a variety of other related topics.


I’m very excited to feature today Chris Saxon who has worked with Oracle forever, and who is one of the brains behind the famous Ask Tom website.

Chris, you’re part of the famous Ask Tom team. Everyone working with Oracle has wound up on Ask Tom’s website at least once. You’ve answered an incredible amount of questions already. What’s it like to work for such a big community as Oracle’s?

It’s an amazing experience! My first real job was as a PL/SQL developer. My only knowledge of SQL was a couple of vaguely remembered lectures at university. Ask Tom was the main place I learned about SQL and Oracle Database. So it’s a huge honor to be on the other side, helping others get the best out of the technology.

The best part has to be the positive comments when you help someone solve a problem that’s been troubling them for days. That’s why we’re here. To help developers learn more about Oracle and improve their SQL skills. When you use the database to its full extent, you can write better, faster applications with less code!

What were your three most interesting questions, so far?

Any question that has a clear definition and a complete test case is interesting! 😉 Personally, I most enjoy using SQL to solve complex problems. So the first two do just that:

1. Finding the previous row in a different group

The poster had a series of transactions. These were grouped into two types. For each row, they wanted to show the id of the previous transaction from the other group.

At first this sounds like a problem you can solve using LAG or LEAD. But these only search for values within the same group. So you need a different method.

I provided a solution using the model clause. Using this, you can generate columns based on complex, spreadsheet-like formulas. Rows in your table are effectively cells in the sheet. You identify them by defining dimensions which can be other columns or expressions. By setting the transaction type as a dimension, you can then easily reference – and assign – values from one type to the other.

This worked well. But commenters were quick to provide solutions using window functions and 12c’s match_recognize clause. Both of which were faster than my approach!

I like this because it shows the flexibility of SQL. And it shows the value of an engaged community. No one knows everything. By sharing our knowledge and working together we can all become better developers.

2. Improving SQL that deliberately generates cartesian product

The poster had a set of abbreviations for words. For example, Saint could also be “St.” or “St”. They wanted to take text containing these words. Then generate all combinations of strings using these abbreviations.

The “obvious” solution the user had was to split the text into words. Then for each word, join the abbreviation table, replacing the string as needed. So for a five word string, you have five joins.

There are a couple of problems with this method. The number of joins limits the number of words. So if you have a string with seven words, but only six table joins, you won’t abbreviate the final word.

The other issue is performance. Joining the same table N times increases the work you do. If you have long sentences and/or a large number of abbreviations, the query could take a long time to run.

To overcome these you need to ask: “how can I join to the abbreviation table just once?”

The solution to do this starts the same as the original. Split the sentence into a table of words. Then join this to the abbreviations to give a row for each replacement needed.

You can then recursively walk down these rows using CTEs. These build up the sentence again, replacing words with their abbreviations as needed. A scalable solution that only needs a single pass of each table!
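
The idea can be sketched as follows (PostgreSQL syntax; hypothetical words(pos, word) and abbreviations(word, abbrev) tables rather than the poster’s actual schema):

WITH RECURSIVE
  -- Each position's original word, plus each of its abbreviations
  variants (pos, variant) AS (
    SELECT pos, word FROM words
    UNION ALL
    SELECT w.pos, a.abbrev
    FROM words w
    JOIN abbreviations a ON a.word = w.word
  ),
  -- Walk the sentence word by word, branching once per variant
  rebuilt (pos, sentence) AS (
    SELECT 0, ''::text
    UNION ALL
    SELECT v.pos, r.sentence || ' ' || v.variant
    FROM rebuilt r
    JOIN variants v ON v.pos = r.pos + 1
  )
SELECT trim(sentence) AS combination
FROM rebuilt
WHERE pos = (SELECT max(pos) FROM words);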

The final question relates to performance. Tom Kyte’s mantra was always “if you can do it in SQL, do it in SQL”. The reason is because a pure SQL solution is normally faster than one which combines SQL and other code. Yet a question came in that cast doubt on this:

3. Difference in performance SQL vs PL/SQL

The poster was updating a table. The new values came from another table. He was surprised that PL/SQL using bulk processing came out faster than the pure SQL approach.

The query in question was in the form:

update table1 t1
set col1 = (select t2.col2 from table2 t2 where t1.code = t2.code);

It turned out the reason was due to “missing” indexes. Oracle executes the subquery once for every row in table1. Unless there’s an index on table2(code), this will full scan table2 once for every row in table1!

The PL/SQL only approach avoided this problem by reading the whole of table2 into an array. So there was only one full scan of table2.
 

The problem here is there was no index on the join condition (t1.code = t2.code). With this in place, Oracle does an index lookup of table2 for each row in table1. A massive performance improvement!
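
The fix is a one-liner (index name hypothetical):

CREATE INDEX table2_code_idx ON table2 (code);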
 

The moral being: if your SQL is “slow”, particularly compared to a combined SQL + other language method, it’s likely that you have a missing index (or two).

This question again showed the strength and value of the Oracle community. Shortly after I posted the explanation, a reviewer was quick to point out the following SQL solution:

merge into table1 t1
using table2 t2
on (t1.code = t2.code)
when matched then
  update set t1.col1 = t2.col2;

This came out significantly faster than both the original update and the PL/SQL version – without needing any extra indexes!

You’re running a YouTube channel called “The Magic of SQL”. Are SQL developers magicians?

Of course they are! In fact, I’d say that all developers are magicians. As Arthur C Clarke said:

“Any sufficiently advanced technology is indistinguishable from magic”

The amount of computing power you carry around in your phone today is mind-blowing. Just ask your grandparents!

I think SQL developers have a special kind of magic though 🙂. The ability to answer hard questions with a few lines of SQL is amazing. And for it to adapt to changes in the underlying data to give great performance without you changing it is astounding.

Your Twitter account has a pinned tweet about window functions. I frequently talk to Java developers at conferences, and few of them know about window functions, even though they’ve been available in databases like Oracle for a very long time. Why do you think they’re still so “obscure”?

Oracle Database has had window functions since the nineties, but many other RDBMSes have only fully supported them recently. So a combination of writing “database independent” code and people using other databases is certainly a factor.

Use of tools which hide SQL from developers is also a problem. If you’re not actively using SQL, it’s easy to overlook many of its features.

Fortunately I think this is changing. As more and more developers are realizing, SQL is a powerful language. Learning how to use it effectively is a key skill for all developers. Window functions and other SQL features mean you can write better-performing applications with less code. Who doesn’t want that? 😉

What are three things that every developer should know about SQL?

1. Understand set based processing

If you find yourself writing a cursor loop (select … from … loop), and inside that loop you run more SQL, you’re doing it wrong.

Think about it. Do you eat your cornflakes by placing one flake in your bowl, adding the milk, and eating that one piece? Then doing the same for the next. And the next. And so on? Or do you pour yourself a big bowl and eat all the flakes at once?

If you have a cursor loop with more SQL within the loop, you’re effectively doing this. There’s a lot of overhead in executing each SQL statement. This will slow you down if you have a large number of statements that each process one row. Instead you want few statements that process lots of rows where possible.

It’s also possible to do this by accident. As developers we’re taught that code reuse is A Good Thing. So if there’s an API available we’ll often use it. For example, say you’re building a batch process. This finds the unshipped orders, places them on shipments and marks them as sent.

If a ship_order function exists, you could write something like:

for rec in (select order_id from unshipped_orders) loop
  ship_order(rec.order_id);
end loop;

The problem here is ship_order almost certainly contains SQL. SQL you’ll be executing once for every order awaiting postage. If it’s only a few, this may be fine. But if there are hundreds or thousands, this process could take a long time to run.

The way to make this faster is to process all the orders in one go. You can do this with SQL like:

insert into shipments
  select … from unshipped_orders;

update unshipped_orders
set shipment_date = sysdate;

You may counter that there’s other, non-SQL processing you need to do, such as sending emails. So you still need a query to find the order ids.

But you can overcome this! With update’s returning clause, you can get values from all the changed rows:

update unshipped_orders
set shipment_date = sysdate
returning order_id bulk collect into order_array;

This gives you all the order ids to use as you need.

2. Learn what an execution plan is and how to create and read one

“How can I make my SQL faster” is one of the most common classes of questions posted on Ask Tom. The trouble is there’s scant one-size-fits-all advice when it comes to SQL performance. To help we need to know what your query is, what the tables and indexes are and details about the data. But most importantly we need to know what the query is actually doing!

For example, say you want me to help you figure out a faster route to work. To do this I need to know which route you currently use and how long each part of it takes. We can then compare this against other routes, checking how far they are, expected traffic and predicted speeds. But we need the data to do this!

So when answering performance questions, the first thing we look for is an execution plan. Many confuse this with an explain plan. An explain plan is just a prediction. Often it’s wrong. And even when it’s right, we still don’t know how much work each step does.

An execution plan shows exactly what the database did. It also gives stats about how much work, how often and how long it took to process each step. Combine this with a basic understanding of indexes and join methods and you can often solve your own performance problems.
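
For example, one common way to capture an actual plan in Oracle is to run the statement with rowsource statistics enabled and then display the cursor’s real plan (a sketch, reusing the actor table from the examples above):

SELECT /*+ GATHER_PLAN_STATISTICS */ count(*)
FROM actor
WHERE last_name = 'WAHLBERG';

-- Actual rows, starts and timings per step of the last statement
SELECT *
FROM TABLE(dbms_xplan.display_cursor(format => 'ALLSTATS LAST'));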

3. Use bind variables

Sadly data breaches are all too common. There hardly seems to be a week that goes by without news of a major company leaking sensitive data. And the root cause of these attacks is often SQL injection.

This is a simple, well known attack vector. If you write vulnerable SQL on a web enabled application, eventually you’ll be attacked.

And this isn’t something you can avoid by using NoSQL databases. SQL-injection-like attacks are possible there too!

Fortunately the solution is easy: use bind variables. Not only do these secure your application, they can improve performance too.
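
As a sketch in PL/SQL dynamic SQL (reusing the actor table from above; the first variant is the vulnerable one):

DECLARE
  v_count NUMBER;
  v_name  VARCHAR2(50) := 'WAHLBERG'; -- imagine this arrives as user input
BEGIN
  -- Vulnerable: the input is concatenated straight into the statement
  EXECUTE IMMEDIATE
    'SELECT count(*) FROM actor WHERE last_name = ''' || v_name || ''''
    INTO v_count;

  -- Safe (and cursor-cache friendly): a bind variable
  EXECUTE IMMEDIATE
    'SELECT count(*) FROM actor WHERE last_name = :name'
    INTO v_count USING v_name;
END;
/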

Make sure your company is not tomorrow’s data leak headline. Use bind variables!

Last but not least: When will Oracle have a BOOLEAN type? 🙂

We have a BOOLEAN type! It’s just only in PL/SQL ;P

There’s currently a push in the community for us to add a SQL BOOLEAN type. If this is a feature you’d like to see, you can vote for it on the Database Ideas forum. The more support there is, the more likely we are to implement it! But no promises 😉

All Libraries Should Follow a Zero-Dependency Policy


This hilarious article with a click-bait title caught my attention recently:

View story at Medium.com

It’s a hilarious (although not entirely true or serious) rant about the current state of JavaScript development in the node ecosystem.

Dependency hell isn’t new

Dependency hell is a term that made it into Wikipedia, which defines it as follows:

Dependency hell is a colloquial term for the frustration of some software users who have installed software packages which have dependencies on specific versions of other software packages.

The big problem in dependency hell is the fact that small libraries pull in additional dependencies on which they rely in order to avoid too much code duplication. For instance, check out who is using Guava 18.0:
https://mvnrepository.com/artifact/com.google.guava/guava/18.0/usages

You’ll find libraries like:

  • com.fasterxml.jackson.core » jackson-databind
  • org.springframework » spring-context-support
  • org.reflections » reflections
  • org.joda » joda-convert
  • … 2000 more

Now, let’s assume you’re still using Guava 15.0 in your application. You want to upgrade Spring. Will your application still work? Is that upgrade binary compatible? Will Spring guarantee this to you, or are you on your own? Now what if Spring also uses Joda Time, which in turn also uses Guava? Does this even work? Conversely, can Guava ever depend on Joda Time or will a circular Maven Central dependency cause all singularities to collapse into one huge black hole?

Truth is: You don’t need the dependency

… and by you, I don’t mean the end user who writes complex enterprise applications with Guava (or whatever). You need it. But YOU, dear library developer. You certainly don’t need any dependency.

An example from jOOQ: being a SQL string manipulation library, jOOQ pulled in a dependency on Apache Commons Lang because:

  • They have some nice StringUtils, which we like to use
  • Their license is also ASL 2.0, which is compatible with jOOQ’s license

But instead of hard-wiring a jOOQ 3.x to commons-lang 2.x dependency, we opted to internalise parts of one of their classes, repackaging them as org.jooq.tools.StringUtils. Essentially, we needed things like:

  • abbreviate()
  • isEmpty()
  • isBlank()
  • leftPad() (hello node developers)

… and some more. That certainly doesn’t justify pulling in the entire dependency, does it? Because while it wouldn’t matter to us, it would matter to our thousands of users, who might prefer to use an older or newer version of commons-lang. And that’s just the beginning. What if commons-lang had transitive dependencies? Etc.

Please, library developers, avoid dependencies

So, please, dear library developers. Please avoid adding dependencies to your libraries. The only things you should depend on are:

  • The JDK
  • Some API governed by a JSR (e.g. JPA)

That’s it. We can all start writing better software and stop downloading the whole internet if YOU, the library developers, start being reasonable and stop being lazy.

Exceptions to the above rules:

  • Framework and platform vendors (e.g. Spring, Java EE) are excluded. They define the whole platform, i.e. they impose a set of well-documented dependencies. If you’re using a framework / platform in your application, then you have to abide by the platform’s rules

That’s all. Small libraries like jOOQ must not have any dependency.

Why Most Programmers Get Pagination Wrong


Pagination is one of those things that almost everyone gets wrong for two reasons:

  • User experience
  • Database performance

Here’s why.

What’s wrong with pagination?

Most applications blindly produce pagination like this:

GMail’s pagination widget

This is how GMail implements pagination. With my current settings, it displays 100 E-Mails at a time and also shows how many E-Mails there are in total, namely 1094. Those aren’t the total number of E-Mails I’ve ever had; they’re the total number of E-Mails in my “blog-idea” label (I’m using GMail as a TODO list, and yes, this blog won’t run out of articles any time soon).

What’s wrong with this practice?

Bad user experience

As a user, in most cases, I don’t really care about the total number of objects that are in my result set. At least, I don’t care about the exact number. Does it make any difference if I have 1094 or 1093 E-Mails? What about if I had 1067? Or 1000? 1000 would be precise enough as far as I’m concerned.

Also, as a user, in most cases, I don’t care that I’m on page 317 of my paginated screen that displays rows 3161-3170 (assuming 10 rows per page). I really don’t. The page number is absolutely useless in terms of user experience.

Who got it right?

  • Facebook
  • Twitter
  • Reddit

And all the other websites that do timelines. Yes, I want to display only 10 rows at a time (or perhaps 100, but almost never all). So, pagination is important. But I don’t care about the fact that I’ve clicked 317 times on that “next page” button. If I ever browse that many pages (and I hardly ever do), then the only thing that matters is the next 10 rows. Just like when you play Civilization. You don’t care that you’re in turn 317. You just want to play one more turn:

Civilization’s “one more turn” screen

Moreover, I never ever ever want to jump to page 317 right from the beginning. There’s absolutely no use case out there where I search for something and then say: hey, I believe my search result will be item #3175 in the current sort order. Exactly. Instead, I will do any of these:

  • Refine the search
  • Sort the result

In both cases, I will get a result where the record that I’m looking for is much more likely to appear on page #1 or perhaps #2. Again, when was the last time you googled for SQL and then went to page #18375 to find that particular blog post that you were looking for? No. You searched for “Java 8 SQL” to find jOOQ, the best way to write SQL in Java 8. For instance.

How to implement a timeline with SQL

If your data source is a SQL database, you might have implemented pagination by using LIMIT .. OFFSET, or OFFSET .. FETCH or some ROWNUM / ROW_NUMBER() filtering (see the jOOQ manual for some syntax comparisons across RDBMS). OFFSET is the right tool to jump to page 317, but remember, no one really wants to jump to that page, and besides, OFFSET just skips a fixed number of rows. If there are new rows in the system between the time page number 316 is displayed to a user and when the user skips to page number 317, the rows will shift, because the offsets will shift. No one wants that either, when they click on “next”.

Instead, you should be using what we refer to as “keyset pagination” (as opposed to “offset pagination”). We’ve described this in past articles here:

The SQL syntax is a bit cumbersome as the pagination criteria becomes an ordinary predicate, but if you’re using jOOQ, you can use the simple synthetic SQL SEEK clause as such:

DSL.using(configuration)
   .select(PLAYERS.PLAYER_ID,
           PLAYERS.FIRST_NAME,
           PLAYERS.LAST_NAME,
           PLAYERS.SCORE)
   .from(PLAYERS)
   .where(PLAYERS.GAME_ID.eq(42))
   .orderBy(PLAYERS.SCORE.desc(),
            PLAYERS.PLAYER_ID.asc())
   .seek(949, 15)
   .limit(10)
   .fetch();

The above will fetch the next 10 players after the player with SCORE 949 and ID 15. The pagination really depends on the ORDER BY clause, which is why you have to provide as many values to SEEK as there are columns in the ORDER BY clause.
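
Rendered to plain SQL, the above SEEK clause corresponds to roughly the following predicate (a sketch; the exact rendering depends on the dialect):

SELECT player_id, first_name, last_name, score
FROM players
WHERE game_id = 42
AND (score < 949
  OR (score = 949 AND player_id > 15))
ORDER BY score DESC, player_id ASC
LIMIT 10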

Now that we’ve fixed the user experience, let’s also look at …

How OFFSET pagination is bad for performance

The previously linked articles about keyset pagination also mention the poor performance of OFFSET pagination. Which is kind of obvious, as OFFSET has to skip a given number of rows after applying all predicates and sorting, etc. So the database has to do all the work and then throw away 3160 records (if you’re jumping to page 317 with 10 rows per page). What a waste.

The following diagram shows very nicely how OFFSET gets slower and slower for large offsets:

Reproduced from use-the-index-luke.com with permission by Markus Winand

That’s the obvious problem, but there’s another one. People always count the total number of rows to calculate the total number of possible pages. Why? To display nonsense like the following:

Page number:
1 2 3 ... 315 316 317 318 319 ... 50193 50194

Wow. OK so we’re on page number 317, which we don’t really care about in the first place, but we could just as well jump to page number 50194. This means that the database needed to run the query across all the rows just to be sure we get exactly 50194 pages in total.

Google something like “page number pagination” and observe the number of tutorials that show how you can implement the above nonsense. On Google Image search, you’ll find:

Google Image search results for “page number pagination”

At the same time, the Google search itself reveals:

Google’s own search result pagination

As you can see, Google estimates that there are probably at least 10 pages for your search and you can go “next”. Yes, you can skip some pages, but you cannot skip to a page number 50194, because, again:

  • No one wants that
  • It’s costly to predict, even for Google

In fact, Google search implements keyset pagination as well, just like Twitter, Facebook, Reddit. And they don’t display the total number of pages because counting that total can be very costly, depending on your database.

In particular, databases that do not support window functions will require you to run two separate queries, as sketched below:

  1. The actual query with a LIMIT clause
  2. An additional query replacing the SELECT column list with a simple COUNT(*)
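
For example, with the Sakila rental query used further down, that means something like:

-- Query 1: the actual page
SELECT rental_date, inventory_id
FROM rental
WHERE customer_id = 1
ORDER BY rental_date
LIMIT 10;

-- Query 2: the total row count, repeating all the predicates
SELECT count(*)
FROM rental
WHERE customer_id = 1;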

Needless to say, this is not the best approach. If your database supports window functions (read about that miraculous SQL feature here on the jOOQ blog), you could produce the total row count in one go as such:

SELECT 
  rental_date, 
  inventory_id,
  COUNT(*) OVER()
FROM rental
WHERE customer_id = 1
ORDER BY rental_date
LIMIT 10

That COUNT(*) OVER() window function is like an ordinary aggregate function, except that it doesn’t group your results. It just counts all the rows of your result and produces that count in each row, prior to limiting the result to 10.

When run against the Sakila database, the above produces:

rental_date          inventory_id  count
2005-05-25 11:30:37          3021     32
2005-05-28 10:35:23          4020     32
2005-06-15 00:54:12          2785     32
2005-06-15 18:02:53          1021     32
2005-06-15 21:08:46          1407     32
2005-06-16 15:18:57           726     32
2005-06-18 08:41:48           197     32
2005-06-18 13:33:59          3497     32
2005-06-21 06:24:45          4566     32
2005-07-08 03:17:05          1443     32

So, we’re displaying the first page with 10 rows and we need to provide navigational links for a total of 4 pages because we have a total of 32 rows.

What happens when we benchmark this query on PostgreSQL? The first run doesn’t calculate this COUNT(*) OVER() value, whereas the second one does:

DO $$
DECLARE
  v_ts TIMESTAMP;
  v_repeat CONSTANT INT := 10000;
  rec RECORD;
BEGIN
  v_ts := clock_timestamp();

  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT 
        rental_date, 
        inventory_id
      FROM rental
      WHERE customer_id = 1
      ORDER BY rental_date
      LIMIT 10
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;

  RAISE INFO 'Statement 1: %', (clock_timestamp() - v_ts); 
  v_ts := clock_timestamp();

  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT 
        rental_date, 
        inventory_id,
        COUNT(*) OVER()
      FROM rental
      WHERE customer_id = 1
      ORDER BY rental_date
      LIMIT 10
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;

  RAISE INFO 'Statement 2: %', (clock_timestamp() - v_ts); 
END$$;

The result clearly indicates that in PostgreSQL, there’s a significant overhead in calculating this value:

INFO:  Statement 1: 00:00:01.041823
INFO:  Statement 2: 00:00:03.57145

Oracle optimises things a bit better when you’re using ROWNUM to paginate:

SET SERVEROUTPUT ON
DECLARE
  v_ts TIMESTAMP;
  v_repeat CONSTANT NUMBER := 5000;
BEGIN
  v_ts := SYSTIMESTAMP;
     
  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT 
        rental_date, 
        inventory_id
      FROM (
        SELECT 
          rental.*, 
          ROWNUM rn
        FROM rental
        WHERE customer_id = 1
        ORDER BY rental_date
      ) rental
      WHERE rn < 5
      ORDER BY rn
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;
     
  dbms_output.put_line('Statement 1: ' || (SYSTIMESTAMP - v_ts));
  v_ts := SYSTIMESTAMP;
     
  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT 
        rental_date, 
        inventory_id,
        COUNT(*) OVER()
      FROM (
        SELECT 
          rental.*,  
          ROWNUM rn
        FROM rental
        WHERE customer_id = 1
        ORDER BY rental_date
      ) rental
      WHERE rn < 5
      ORDER BY rn
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;
     
  dbms_output.put_line('Statement 2: ' || (SYSTIMESTAMP - v_ts));
END;
/

Result:

Statement 1: X
Statement 2: X +/- 1%

So, the COUNT(*) seems to be calculated “for free”. Bonus question: Why is that?

Due to Oracle license restrictions, we cannot publish benchmark results here, comparing Oracle with PostgreSQL, sorry, but you can run the above code yourself against the Sakila database:
https://github.com/jOOQ/jOOQ/tree/master/jOOQ-examples/Sakila

Conclusion

TL;DR: OFFSET pagination bad. Keyset pagination good.


If you want to paginate in your application, please ask yourself whether you really, really, really need:

  • Exact page number
  • High page numbers
  • The last page number
  • The total number of rows

Because if you don’t (and in 98% of all UIs, you really don’t), then you can drastically speed up your queries while providing your users a much better experience. If that’s not a win-win situation worth thinking about…?

And don’t forget, jOOQ ships with native keyset pagination support!