How to use SQL PIVOT to Compare Two Tables in Your Database


This can happen ever so easily. You adapt a table by adding a new column:

ALTER TABLE payments ADD code NUMBER(3);

You go on, implementing your business logic – absolutely no problem. But then, later on (perhaps in production), some batch job fails because it makes some strong assumptions about data types. Namely, it assumes that the two tables payments and payments_archive are of the same row type:

CREATE TABLE payments
  (
    id         NUMBER(18) NOT NULL,
    account_id NUMBER(18) NOT NULL,
    value_date DATE,
    amount     NUMBER(25, 2) NOT NULL
  );

CREATE TABLE payments_archive
  (
    id         NUMBER(18) NOT NULL,
    account_id NUMBER(18) NOT NULL,
    value_date DATE,
    amount     NUMBER(25, 2) NOT NULL
  );

Since the two tables are of the same row type, you can simply move rows from one table to the other, e.g. using a query like this one:

INSERT INTO payments_archive
SELECT * FROM payments
WHERE value_date < SYSDATE - 30;

(Not that using the above SELECT * syntax is a good idea in general – it’s actually a bad idea – but you get the point.)

What you’re getting now is this:

ORA-00913: too many values

The fix is obvious. But the poor soul who has to apply it is probably not you, but someone else, who first has to figure out, among possibly hundreds of columns, which ones don’t match. Here’s how (in Oracle):

Use PIVOT to compare two tables!

You could of course not use PIVOT and simply select all columns from either table from the dictionary views:

SELECT 
  table_name,
  column_name
FROM all_tab_cols
WHERE table_name LIKE 'PAYMENTS%';

This will produce the following result:

TABLE_NAME         COLUMN_NAME                  
------------------ ---------------
PAYMENTS           ID                             
PAYMENTS           ACCOUNT_ID                     
PAYMENTS           VALUE_DATE                     
PAYMENTS           AMOUNT                         
PAYMENTS           CODE                           
PAYMENTS_ARCHIVE   ID                             
PAYMENTS_ARCHIVE   ACCOUNT_ID                     
PAYMENTS_ARCHIVE   VALUE_DATE                     
PAYMENTS_ARCHIVE   AMOUNT    

Not very readable. You could of course use set operations and apply INTERSECT and MINUS (EXCEPT) to filter out matching values, as sketched below.
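
For instance, here is a minimal sketch using Oracle’s MINUS operator (the SQL standard calls it EXCEPT); swap the two operands to check the opposite direction:

SELECT column_name
FROM all_tab_cols
WHERE table_name = 'PAYMENTS'
MINUS
SELECT column_name
FROM all_tab_cols
WHERE table_name = 'PAYMENTS_ARCHIVE';

But much better: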

SELECT *
FROM (
  SELECT 
    table_name,
    column_name
  FROM all_tab_cols
  WHERE table_name LIKE 'PAYMENTS%'
) 
PIVOT ( 
  COUNT(*) AS cnt
  FOR (table_name) 
  IN (
    'PAYMENTS' AS payments, 
    'PAYMENTS_ARCHIVE' AS payments_archive 
  ) 
) t;

And the above now produces:

COLUMN_NAME  PAYMENTS_CNT PAYMENTS_ARCHIVE_CNT
------------ ------------ --------------------
CODE                    1                    0 
ACCOUNT_ID              1                    1 
ID                      1                    1 
VALUE_DATE              1                    1 
AMOUNT                  1                    1 

It is now very easy to identify the column that is missing from the PAYMENTS_ARCHIVE table. As you can see, the result from the original query produced one row per column AND per table. We took that result and pivoted it “FOR” the table name, such that we now get only one row per column.

How to read PIVOT?

It’s easy. Comments are inline:

SELECT *

-- This is the table that we're pivoting. Note that
-- we select only the minimum to prevent side-effects
FROM (
  SELECT 
    table_name,
    column_name
  FROM all_tab_cols
  WHERE table_name LIKE 'PAYMENTS%'
) 

-- PIVOT is a keyword that is applied to the above
-- table. It generates a new table, similar to JOIN
PIVOT (

  -- This is the aggregated value that we want to
  -- produce for each pivoted value
  COUNT(*) AS cnt

  -- This is the source of the values that we want to
  -- pivot
  FOR (table_name) 

  -- These are the values that we accept as pivot
  -- columns. The column names are produced from
  -- these values concatenated with the corresponding
  -- aggregate function name
  IN (
    'PAYMENTS' AS payments, 
    'PAYMENTS_ARCHIVE' AS payments_archive 
  ) 
) t;

That’s it. Not so hard, was it?

The nice thing about this syntax is that we can generate as many additional columns as we want, very easily:

SELECT *
FROM (
  SELECT 
    table_name,
    column_name,
    cast(data_type as varchar(9)) data_type
  FROM all_tab_cols
  WHERE table_name LIKE 'PAYMENTS%'
) 
PIVOT ( 
  COUNT(*) AS cnt,
  MAX(data_type) AS type -- new function here
  FOR (table_name) 
  IN (
    'PAYMENTS' AS p, 
    'PAYMENTS_ARCHIVE' AS a
  ) 
) t;

… producing (after additional erroneous DDL) …

COLUMN_NAME      P_CNT P_TYPE      A_CNT A_TYPE
----------- ---------- ------ ---------- ------
CODE                 1 NUMBER          0
ACCOUNT_ID           1 NUMBER          1 NUMBER 
ID                   1 NUMBER          1 NUMBER 
VALUE_DATE           1 DATE            1 TIMESTAMP
AMOUNT               1 NUMBER          1 NUMBER

This way, we can discover even more flaws between the different row types of the tables. In the above example, we’ve used MAX(), because we have to provide an aggregation function, even if each pivoted column corresponds to exactly one row in our example – but that need not be the case.

What if I’m not using Oracle?

SQL Server also supports PIVOT, but other databases don’t. You can always emulate PIVOT using GROUP BY and CASE. The following statement is equivalent to the previous one:

SELECT
  t.column_name,
  count(CASE table_name 
        WHEN 'PAYMENTS' THEN 1 END) p_cnt,
  max  (CASE table_name 
        WHEN 'PAYMENTS' THEN data_type END) p_type,
  count(CASE table_name 
        WHEN 'PAYMENTS_ARCHIVE' THEN 1 END) a_cnt,
  max  (CASE table_name 
        WHEN 'PAYMENTS_ARCHIVE' THEN data_type END) a_type
FROM (
  SELECT 
    table_name,
    column_name,
    data_type
  FROM all_tab_cols
  WHERE table_name LIKE 'PAYMENTS%'
) t
GROUP BY
  t.column_name;

This query will now produce the same result on all the other databases as well.

Isn’t that… ?

Yes, it is! The above usage of aggregate functions in combination with CASE can be shortened even more, using the SQL standard FILTER clause, which we’ve blogged about recently.

So, in PostgreSQL, you could write the following query:

SELECT
  t.column_name,
  count(table_name) 
    FILTER (WHERE table_name = 'payments') p_cnt,
  max(data_type) 
    FILTER (WHERE table_name = 'payments') p_type,
  count(table_name) 
    FILTER (WHERE table_name = 'payments_archive') a_cnt,
  max(data_type) 
    FILTER (WHERE table_name = 'payments_archive') a_type
FROM (
  SELECT 
    table_name,
    column_name,
    data_type
  FROM information_schema.columns
  WHERE table_name LIKE 'payments%'
) t
GROUP BY
  t.column_name;

Further reading

Excited? Yes. There are more awesome SQL features in various databases. Read on about:

How to Extract a Date Part in SQL


The Modern SQL Twitter account (by Markus Winand) published a hint about how to extract a date part in SQL:
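
The tweet itself was embedded as an image in the original post; its gist is the SQL standard EXTRACT() function, along these lines (a sketch, not the tweet’s exact wording):

-- The standard way to get, e.g., the year of the current date:
EXTRACT(YEAR FROM CURRENT_DATE)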

Is it true? Yes it is, in the SQL standard and in a variety of standards-compliant databases. But let’s check what jOOQ does when you run the following program on all 18 currently supported RDBMS:

import static org.jooq.impl.DSL.currentDate;
import static org.jooq.impl.DSL.extract;
import static org.jooq.impl.DSL.using;

import java.util.stream.Stream;

import org.jooq.DatePart;
import org.jooq.SQLDialect;

public class Extract {
    public static void main(String[] args) {
        // Get all distinct SQLDialect families
        Stream
        .of(SQLDialect.values())
        .map(SQLDialect::family)
        .distinct()
        .forEach(family -> {
            System.out.println();
            System.out.println(family);

            // Get all supported date parts
            Stream
            .of(DatePart.values())

            // For each family / part, get the
            // EXTRACT() function
            .map(part -> extract(currentDate(), part))
            .forEach(expr -> {
                System.out.println(
                    using(family).render(expr)
                );
            });
        });
    }
}

The output is:

Open Source databases

DEFAULT
extract(year from current_date())
extract(month from current_date())
extract(day from current_date())
extract(hour from current_date())
extract(minute from current_date())
extract(second from current_date())

CUBRID
extract(year from current_date())
extract(month from current_date())
extract(day from current_date())
extract(hour from current_date())
extract(minute from current_date())
extract(second from current_date())

DERBY
year(current_date)
month(current_date)
day(current_date)
hour(current_date)
minute(current_date)
second(current_date)

FIREBIRD
extract(year from current_date)
extract(month from current_date)
extract(day from current_date)
extract(hour from current_date)
extract(minute from current_date)
extract(second from current_date)

H2
extract(year from current_date())
extract(month from current_date())
extract(day from current_date())
extract(hour from current_date())
extract(minute from current_date())
extract(second from current_date())

HSQLDB
extract(year from current_date)
extract(month from current_date)
extract(day from current_date)
extract(hour from current_date)
extract(minute from current_date)
extract(second from current_date)

MARIADB
extract(year from current_date())
extract(month from current_date())
extract(day from current_date())
extract(hour from current_date())
extract(minute from current_date())
extract(second from current_date())

MYSQL
extract(year from current_date())
extract(month from current_date())
extract(day from current_date())
extract(hour from current_date())
extract(minute from current_date())
extract(second from current_date())

POSTGRES
extract(year from current_date)
extract(month from current_date)
extract(day from current_date)
extract(hour from current_date)
extract(minute from current_date)
extract(second from current_date)

SQLITE
strftime('%Y', current_date)
strftime('%m', current_date)
strftime('%d', current_date)
strftime('%H', current_date)
strftime('%M', current_date)
strftime('%S', current_date)

Commercial databases

ACCESS
datepart('yyyy', date())
datepart('m', date())
datepart('d', date())
datepart('h', date())
datepart('n', date())
datepart('s', date())

ASE
datepart(yy, current_date())
datepart(mm, current_date())
datepart(dd, current_date())
datepart(hh, current_date())
datepart(mi, current_date())
datepart(ss, current_date())

DB2
year(current_date)
month(current_date)
day(current_date)
hour(current_date)
minute(current_date)
second(current_date)

HANA
extract(year from current_date)
extract(month from current_date)
extract(day from current_date)
extract(hour from current_date)
extract(minute from current_date)
extract(second from current_date)

INFORMIX
year(current year to day)
month(current year to day)
day(current year to day)
current year to day::datetime hour to hour::char(2)::int
current year to day::datetime minute to minute::char(2)::int
current year to day::datetime second to second::char(2)::int

INGRES
extract(year from current_date)
extract(month from current_date)
extract(day from current_date)
extract(hour from current_date)
extract(minute from current_date)
extract(second from current_date)

ORACLE (in jOOQ 3.5)
to_char(trunc(sysdate), 'YYYY')
to_char(trunc(sysdate), 'MM')
to_char(trunc(sysdate), 'DD')
to_char(trunc(sysdate), 'HH24')
to_char(trunc(sysdate), 'MI')
to_char(trunc(sysdate), 'SS')

ORACLE (in jOOQ 3.6)
extract(year from current_date)
extract(month from current_date)
extract(day from current_date)
extract(hour from cast(current_date as timestamp))
extract(minute from cast(current_date as timestamp))
extract(second from cast(current_date as timestamp))

SQLSERVER
datepart(yy, convert(date, current_timestamp))
datepart(mm, convert(date, current_timestamp))
datepart(dd, convert(date, current_timestamp))
datepart(hh, convert(date, current_timestamp))
datepart(mi, convert(date, current_timestamp))
datepart(ss, convert(date, current_timestamp))

SYBASE
datepart(yy, current date)
datepart(mm, current date)
datepart(dd, current date)
datepart(hh, current date)
datepart(mi, current date)
datepart(ss, current date)

Yes. The standard… If only it were implemented thoroughly…

jOOQ vs. Slick – Pros and Cons of Each Approach


Every framework introduces a new compromise. A compromise that is introduced because the framework makes some assumptions about how you’d like to interact with your software infrastructure.

An example of where this compromise has struck users recently is the discussion “Are Slick queries generally isomorphic to the SQL queries?”. And, of course, the answer is: No. What appears to be a simple Slick query:

val salesJoin = sales 
      join purchasers 
      join products 
      join suppliers on {
  case (((sale, purchaser), product), supplier) =>
    sale.productId === product.id &&
    sale.purchaserId === purchaser.id &&
    product.supplierId === supplier.id
}

… turns into a rather large monster with tons of derived tables that are totally unnecessary, given the original query (formatting is mine):

select x2.x3, x4.x5, x2.x6, x2.x7 
from (
    select x8.x9 as x10, 
           x8.x11 as x12, 
           x8.x13 as x14, 
           x8.x15 as x7, 
           x8.x16 as x17, 
           x8.x18 as x3, 
           x8.x19 as x20, 
           x21.x22 as x23, 
           x21.x24 as x25, 
           x21.x26 as x6 
    from (
        select x27.x28 as x9,
               x27.x29 as x11, 
               x27.x30 as x13, 
               x27.x31 as x15, 
               x32.x33 as x16, 
               x32.x34 as x18, 
               x32.x35 as x19 
        from (
            select x36."id" as x28, 
                   x36."purchaser_id" as x29, 
                   x36."product_id" as x30, 
                   x36."total" as x31 
            from "sale" x36
        ) x27 
        inner join (
            select x37."id" as x33, 
                   x37."name" as x34, 
                   x37."address" as x35 
	    from "purchaser" x37
        ) x32 
        on 1=1
    ) x8 
    inner join (
        select x38."id" as x22, 
               x38."supplier_id" as x24, 
               x38."name" as x26 
        from "product" x38
    ) x21
    on 1=1
) x2 
inner join (
    select x39."id" as x40, 
           x39."name" as x5, 
           x39."address" as x41 
    from "supplier" x39
) x4 
on ((x2.x14 = x2.x23) 
and (x2.x12 = x2.x17)) 
and (x2.x25 = x4.x40) 
where x2.x7 >= ?

Christopher Vogt, a former Slick maintainer and still actively involved member of the Slick community, explains the above in the following words:

This means that Slick relies on your database’s query optimizer to be able to execute the sql query that Slick produced efficiently. Currently that is not always the case in MySQL

One of the main ideas behind Slick, according to Christopher, is:

Slick is not a DSL that allows you to build exactly specified SQL strings. Slick’s Scala query translation allows for re-use and composition and using Scala as the language to write your queries. It does not allow you to predict the exact sql query, only the semantics and the rough structure.

Slick vs. jOOQ

Since Christopher later on also compared Slick with jOOQ, I allowed myself to chime in and to add my two cents:

From a high level (without actual Slick experience) I’d say that Slick and jOOQ embrace compositionality equally well. I’ve seen crazy queries of several 100s of lines of [jOOQ] SQL in customer code, composed over several methods. You can do that with both APIs.

On the other hand, as Chris said: Slick has a focus on Scala collections, jOOQ on SQL tables.

  • From a conceptual perspective (= in theory), this focus shouldn’t matter.
  • From a type safety perspective, Scala collections are easier to type-check than SQL tables and queries, because SQL as a language itself is rather hard to type-check, given that the semantics of various advanced SQL clauses alter type configurations rather implicitly (e.g. outer joins, grouping sets, pivot clauses, unions, group by, etc.).
  • From a practical perspective, SQL itself is only an approximation of the original relational theories and has attained a life of its own. This may or may not matter to you.

I guess in the end it really boils down to whether you want to reason about Scala collections (queries are better integrated / more idiomatic with your client code) or about SQL tables (queries are better integrated / more idiomatic with your database).

At this point, I’d like to add another two cents to the discussion. Customers don’t buy the product that you’re selling. They never do. In the case of Hibernate, customers and users were hoping to be able to forget SQL forever. The opposite is true. As Gavin King himself (the creator of Hibernate) once told me:

(Image: a quote by Gavin King)

Because customers and users had never listened to Gavin (and to other ORM creators), we now have what many call the object-relational impedance mismatch. A lot of unjustified criticism has been expressed against Hibernate and JPA, APIs which are simply too popular for the limited scope they really cover.

With Slick (or C#’s LINQ, for that matter), a similar mismatch is impeding integrations, if users abuse these tools for what they believe to be a replacement for SQL. Slick does a great job at modelling the relational model directly in the Scala language. This is wonderful if you want to reason about relations just like you reason about collections. But it is not a SQL API. To illustrate how difficult it is to overcome these limitations, you can browse the issue tracker or user group to learn about:

We’ll simply call this:

The Functional-Relational Impedance Mismatch

SQL is much more

Markus Winand (the author of the popular SQL Performance Explained) has recently published a very good presentation about “modern SQL”, an idea that we fully embrace at jOOQ.

We believe that APIs that have been trying to hide the SQL language from general purpose languages like Java, Scala, C# are missing out on a lot of the very nice features that can add tremendous value to your application. jOOQ is an API that fully embraces the SQL language, with all its awesome features (and with all its quirks). You obviously may or may not agree with that.

We’ll leave this article open ended, hoping you’ll chime in to discuss the benefits and caveats of each approach. Of staying close to Scala vs. staying close to SQL.

As a small teaser, however, I’d like to announce a follow-up article showing that there is no such thing as an object-relational impedance mismatch. You (and your ORM) are just not using SQL correctly. Stay tuned!

Thou Shalt Not Name Thy Method “Equals”


(unless you really override Object.equals(), of course).

I’ve stumbled upon a rather curious Stack Overflow question by user Frank:

Why does Java’s Area#equals method not override Object#equals?

Interestingly, there is an Area.equals(Area) method which really takes an Area argument, instead of an Object argument as declared in Object.equals(). This leads to rather nasty behaviour, as discovered by Frank:

@org.junit.Test
public void testEquals() {
    java.awt.geom.Area a = new java.awt.geom.Area();
    java.awt.geom.Area b = new java.awt.geom.Area();
    assertTrue(a.equals(b)); // -> true

    java.lang.Object o = b;
    assertTrue(a.equals(o)); // -> false
}

Technically, it is correct for AWT’s Area to have been implemented this way (as hashCode() isn’t implemented either). But given how Java resolves methods, and how programmers digest code like the above, it is really a terrible idea to overload the equals method.

No static equals, either

These rules also hold true for static equals() methods, such as for instance Apache Commons Lang‘s

ObjectUtils.equals(Object o1, Object o2)

The confusion here arises from the fact that you cannot static-import this equals method:

import static org.apache.commons.lang.ObjectUtils.equals;

When you now type the following:

equals(obj1, obj2);

You will get a compiler error:

The method equals(Object) in the type Object is not applicable for the arguments (…, …)

The reason for this is that methods that are in the scope of the current class and its super types will always shadow anything that you import this way. The following doesn’t work either:

import static org.apache.commons.lang.ObjectUtils.defaultIfNull;

public class Test {
  void test() {
    defaultIfNull(null, null);
    // ^^ compilation error here
  }

  void defaultIfNull() {
  }
}

Details in this Stack Overflow question.

Conclusion

The conclusion is simple: never overload any of the methods declared in Object (overriding is fine, of course). This includes:

  • clone()
  • equals()
  • finalize()
  • getClass()
  • hashCode()
  • notify()
  • notifyAll()
  • toString()
  • wait()

Of course, it would be great if those methods weren’t declared in Object in the first place, but that ship sailed 20 years ago.

jOOQ Tuesdays: Raoul-Gabriel Urma Explores What Java 8 and English have in Common


Welcome to the jOOQ Tuesdays series. In this series, we’ll publish an article on the third Tuesday every other month where we interview someone we find exciting in our industry from a jOOQ perspective. This includes people who work with SQL, Java, Open Source, and a variety of other related topics.


In this second edition, we have the pleasure of talking to Raoul-Gabriel Urma, who will be telling us a little about what Java 8 and English have in common.

Hi Raoul – You’re about to publish a very interesting PhD regarding programming language evolution. What is it about?

Natural languages such as English and Spanish have evolved over the years. However, in general the evolution takes centuries. Consequently, society has had time to adapt.

Programming languages share a similar characteristic: they evolve. However, unlike natural languages, they evolve almost every year! This puts greater pressure on developers and existing code bases. For example, new features are introduced, idioms change, and sometimes different language versions are backward incompatible.

As part of my dissertation, I argue that developers need machine support focused on locating sources of backward incompatibilities, and I contribute various techniques for both statically and dynamically typed languages.

I’m also co-organising a workshop on the topic as part of ECOOP if that’s of interest! http://2015.ecoop.org/track/PLE-2015-papers

Very interesting! Programming languages seem to be a favourite topic of yours. You’ve written an impressive book with great reviews on Amazon (4.7 out of 5!). What do your readers learn from it?

I’ve written a book called Java 8 in Action together with Mario Fusco and Alan Mycroft. We wanted to write a book which teaches how to benefit from the various new Java 8 features. However, we also wanted to write a book that the Java community will still be reading in five or ten years. This is why the book covers many topics on top of lambda expressions and the Streams API such as testing & debugging, refactoring, enhanced concurrency, functional programming concepts and even a bit of Scala!

And what was your biggest learning from writing the book?

It always takes more time than you think ;-)

Being (blog) writers ourselves, we can certainly relate to that!

Java 8 with lambdas and streams is inevitable for all of us. From your earlier work experience at Google, Oracle, Ebay, Goldman Sachs, what do you expect will be the biggest challenges for adopters in the industry when upgrading to Java 8 and functional programming?

Java 8 makes Java sexy again. Unfortunately, a big part of software engineering is to deal with legacy frameworks and codebases, which may impede the adoption of Java 8 for various companies. Nonetheless, several recent surveys show that Java 8’s adoption is going strong.

From the various conference talks, discussions and training courses I’ve given so far, it is clear that the Java community is excited about using lambdas and the Streams API. Hopefully this means that more people will embrace the functional-style in the next 5 years :-).

At Data Geekery with jOOQ, we’ve been working to integrate SQL – a very powerful external domain-specific language – into Java, (ab)using the Java language to the extreme. From a Java / Java 8 perspective, what is your take on domain-specific languages in general, and on SQL in particular?

I think jOOQ is a fantastic project! I’m a believer in writing code that matches the business problem’s domain. It helps readability and maintenance, and ultimately it’s useful for productivity.

You’ve also been busy co-founding startups next to writing books and PhDs. What is it that you love about your various jobs?

I love contributing to people’s lives and knowledge!

<PLUG>
If you are interested in Java 8 training, check out the course I’ve put together with Richard Warburton and James Gough: http://java8training.com
</PLUG>

10 SQL Articles Everyone Must Read


We’ve been blogging about Java and SQL for a while now, on the jOOQ blog. Over the years, while researching interesting blog topics, we’ve discovered a lot of SQL gems in the blogosphere that have inspired our work and our passion for SQL.

Today, we’re presenting to you a list of 10 articles that we think you should absolutely read. At the end of the list, you will agree that either:

  • SQL is awesome
  • SQL is crazy

… or probably both. Here goes, in no particular order:

1. Joe Celko: “Divided We Stand: The SQL of Relational Division”

Relational division is a very powerful concept in relational algebra. It answers questions like:

Give me all the students that have completed a given set of courses

Unfortunately, division doesn’t have any direct equivalent in SQL. We wish there were a clause like

TABLE_A
  DIVIDE BY TABLE_B 
  ON [ some predicate ]

Nonetheless, you can express a division in SQL in various ways.
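
A classic encoding uses double negation with NOT EXISTS: find the students for whom no required course is missing. A minimal sketch, assuming hypothetical tables completed_courses(student_id, course_id) and required_courses(course_id):

SELECT DISTINCT c1.student_id
FROM completed_courses c1
WHERE NOT EXISTS (
  SELECT 1
  FROM required_courses r
  WHERE NOT EXISTS (
    SELECT 1
    FROM completed_courses c2
    WHERE c2.student_id = c1.student_id
    AND c2.course_id = r.course_id
  )
);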

Read Joe’s: “Divided We Stand: The SQL of Relational Division

2. Alex Bolenok: “Happy New Year!”

Alex Bolenok (a.k.a. Quassnoi) blogs about various interesting SQL-related things, but one of his top contributions every year is his “Happy New Year” series. Alex paints “beautiful” (beauty is in the eye of the beholder), and certainly impressive, pictures into your SQL console. For instance:
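
The picture itself is embedded in the original post. As a trivial, hedged sketch of the genre (nothing close to Quassnoi’s actual artwork), Oracle’s CONNECT BY LEVEL alone can paint a crude tree:

SELECT LPAD(' ', 10 - LEVEL) || RPAD('*', 2 * LEVEL - 1, '*') AS tree
FROM dual
CONNECT BY LEVEL <= 10;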

Read Alex’s: “Christmas tree in SQL

3. Markus Winand: “Clustering Data: The Second Power of Indexing”

Markus Winand is the author of the popular book SQL Performance Explained, parts of which you can also read on his blog “Use The Index Luke”. There is an incredible amount of very useful knowledge both in the book and on this page, but one of the most revealing and neat SQL tricks is to know about “covering indexes”, “clustering indexes”, or “index only scans”.
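
As a hedged sketch of the “index only scan” idea, reusing the payments table from the first article (the index name is made up): if an index contains every column a query needs, the database can answer the query from the index alone, without touching the table:

CREATE INDEX payments_covering_idx
  ON payments (account_id, value_date, amount);

-- Can be answered entirely from the index above
SELECT value_date, amount
FROM payments
WHERE account_id = 42;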

Read Markus’s: “Clustering Data: The Second Power of Indexing

4. Dimitri Fontaine: “Understanding Window Functions”

There was SQL before window functions and SQL after window functions

Window functions are some of the most powerful and underused features of SQL. They’re available in all commercial databases, in PostgreSQL, and soon also in Firebird 3.0. We’ve blogged about window functions a couple of times ourselves, but one of the best summaries and explanations about what they really are and how they work has been written by Dimitri Fontaine.
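
As a minimal sketch of the concept (again reusing the payments table from the first article): a window function computes a value over a set of related rows without collapsing them the way GROUP BY would. For example, numbering each account’s payments chronologically:

SELECT
  account_id,
  value_date,
  amount,
  ROW_NUMBER() OVER (
    PARTITION BY account_id
    ORDER BY value_date
  ) AS payment_no
FROM payments;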

Read Dimitri’s: “Understanding Window Functions

5. Lukas Eder: “10 Common Mistakes Java Developers Make when Writing SQL”

A bit of advertising for our own writing. We’ve collected 10 of the most common mistakes that Java developers make when writing SQL. These mistakes are actually not even specific to Java developers, they could happen to any developer. This article has generated so much traction on our blog, there must be some great truth in it.

Read Lukas’s: “10 Common Mistakes Java Developers Make when Writing SQL

6. András Gábor’s “Techniques for Pagination in SQL”

Up until recently, offset pagination has been rather difficult to implement in commercial databases like Oracle, SQL Server, DB2, Sybase, which didn’t feature the equivalent of MySQL’s / PostgreSQL’s LIMIT .. OFFSET clause. Pagination could be emulated, however, and there are a lot of techniques for doing that. Picking the right technique is essential for performance. If you’re using Oracle 11g or earlier, you should filter on ROWNUM:
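
A minimal sketch of that classic technique (fetching rows 21 to 30 of the payments table; note that ROWNUM must be assigned after the ordering, hence the nesting):

SELECT *
FROM (
  SELECT t.*, ROWNUM rn
  FROM (
    SELECT *
    FROM payments
    ORDER BY id
  ) t
  WHERE ROWNUM <= 30
)
WHERE rn > 20;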

Read András’s “Techniques for Pagination in SQL

On a side-note, you should probably consider not using OFFSET at all. For details about the NO OFFSET movement, read…

7. Markus Winand: “We need tool support for keyset pagination”

If you think about OFFSET pagination, it’s actually a rather dumb thing from a technical perspective, and a useless thing from a business perspective. Here’s why.

From a technical perspective…

… you need to apply filtering, grouping, and ordering on a vast amount of data, skipping and throwing away all the rows that appear before the offset, until you reach the first row of interest. That is a lot of wasted resources, given that…

From a business perspective…

… perhaps, pages 1-3 are interesting, but there is absolutely no point in offering users navigation to page 1337. After a certain offset, the meaning of the offset from a business perspective has vanished. You might as well display random, unordered data samples. The user wouldn’t notice. Probably, when you reach a higher page on Google search results, this is exactly what happens. Random stuff.

Or on reddit. There, you get random stuff already on the first page – such as this popular display of a Stabilized head on green vine snake.

Much better than offset pagination is keyset pagination (which we’ve blogged about as well).
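
A minimal sketch of keyset pagination, in PostgreSQL syntax (the two :last_* values are placeholders for the last row of the previous page): instead of counting and skipping rows, the predicate seeks directly past the last row already seen:

SELECT id, value_date, amount
FROM payments
WHERE (value_date, id) < (:last_value_date, :last_id)
ORDER BY value_date DESC, id DESC
LIMIT 10;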

Read Markus’s “We need tool support for keyset pagination


8. Josh Berkus “Tag All The Things”

Implementing tagging in a relational database can be a beast from a performance perspective. Should you normalise (one-to-many)? Should you normalise heavily (many-to-many)? Should you use nested collections, arrays, or even JSON data structures?

Josh has written a very interesting write-up on the performance of heavy tagging in PostgreSQL, showing that normalisation isn’t always the best choice.
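
The denormalised variant might look like this minimal PostgreSQL sketch (hypothetical posts table), where a GIN index accelerates “contains” queries on the tag array:

CREATE TABLE posts (
  id    serial PRIMARY KEY,
  title text,
  tags  text[]
);

CREATE INDEX posts_tags_idx ON posts USING gin (tags);

-- All posts tagged with both 'sql' and 'postgres'
SELECT id, title
FROM posts
WHERE tags @> ARRAY['sql', 'postgres'];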

Read Josh’s “Tag All The Things

9. Alek Bolenok’s “10 things in SQL Server (which don’t work as expected)”

This is again Alek’s (Quassnoi’s) work. A very interesting set of things that happen inside of SQL Server, which you might not have expected when you’re used to using other databases. Whether you’re using SQL Server or not, this is a must-read to re-raise awareness of the subtle little differences between SQL implementations

Read Alex’s “10 things in SQL Server (which don’t work as expected)

10. Aaron Bertrand: “Best approaches for running totals”

Running totals are a very typical use-case for SQL-based reporting. A running total is something that every project manager using Excel knows intuitively how to do. Just drag-and-drop that sweet sweet formula across your spreadsheet and done:

(Screenshot: a running total formula in Excel)

How to do the same in SQL? There are again tons of ways. Aaron Bertrand has summarised various solutions for SQL Server 2012.
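
Where window functions are available (e.g. SQL Server 2012), the most concise solution is a windowed SUM(); a minimal sketch over the payments table from the first article:

SELECT
  value_date,
  amount,
  SUM(amount) OVER (
    ORDER BY value_date, id
    ROWS UNBOUNDED PRECEDING
  ) AS running_total
FROM payments;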

Read Aaron’s “Best approaches for running totals

Many other articles

There are, of course, many other very good articles providing deep insight into useful SQL tricks. If you find you’ve encountered an article that would nicely complement this list, please leave a link and description in the comments section. Future readers will appreciate the additional insight.

“NoSQL Injection” – What 40000 Unsecured MongoDB Databases Mean for our Industry


The news is all over reddit:

Major security alert as 40,000 MongoDB databases left unsecured on the internet

Security is a feature that is often neglected until it’s too late. And when it’s too late, it is often hard to bake it into a well-established architecture without major refactoring efforts.

Every system and thus also every database is always vulnerable. Most databases, however, do offer a significant amount of features to implement a security layer – and MongoDB is no different from any other DBMS here. So, how could this massive security hole happen?

Security is a cultural thing. Either a company has security in its DNA, or it doesn’t. The same is true for scalability, or user experience, or any other aspect of software engineering. I’ve worked for companies that are at completely opposite ends of security awareness. Some (in the E-Banking field) were ultra-paranoid, implementing thorough security checks in around 7 layers of the application. Others were rather lenient, with management focusing much more on marketing than anything else. Without any empirical evidence, however, there seemed to be a certain correlation between the security-awareness of a company and its backend-orientedness, E-Banking being a very backend-oriented business.

Backend developers are more security aware

This is an over-generalisation and probably doesn’t do justice to many excellent frontend developers out there, but security is where the data is. Where the algorithms are. Where people reason about constraints, workflows, batch jobs, accounting, money, … algorithms. These folks focus on all the users. On the system. And they want to protect it. On the flip side, they might neglect usability.

There is little security-awareness where the user experience is. Where people reason about layout, formatting, usability, style, … user interfaces. These folks focus on single users. On their experience. And they want to make things easy for the user.

(and again, the same is true for scalability)

It is no coincidence that backend technology evolves extremely slowly. Java: 20 years and we’ve just finally gotten lambdas. SQL: 30 years and we still don’t have easy ways to reuse code.

At the same time, frontend technology evolves at the “speed of reddit”. The next hype is just 100 karma away, and we’ll throw all the previous tech out of the window, just to be part of the game.

Clearly, security is something that has to be reasoned about way too thoroughly for it to survive in the fast-paced frontend world.

What does MongoDB have to do with it?

The current event isn’t actually directly related to MongoDB (you could probably find just as many unprotected MySQL instances out there). But it strongly correlates with MongoDB’s sales and marketing strategies. MongoDB has done very aggressive and successful marketing in the past, claiming that the reign of the RDBMS is over – just as much as the reign of the RDBMS had been over before, when the astonishing object databases surfaced this planet. Well, we all know where object and XML databases went.


This time, the anti-RDBMS marketing resonated mostly with frontend developers, obviously, because JSON is their favourite data representation format, and MongoDB promised to be able to store data directly from the DOM into the DB. Not only did this mean “the end of the DBA” for some software vendors, but many vendors also hoped that they could omit operations, and perhaps even backend development. What obviously worked well for prototyping and simple applications doesn’t scale well to applications with sensitive data.

The Solution

The solution is obvious. Homogeneity kills your business. You should hire a variety of different types of personnel. You should have skilled frontend developers, backend developers, operations people, DBAs, and security experts on your team. You should make them all work together, hear each of their opinions, review each others’ code, and learn from each other. Because each one of them has a strong focus on an entirely different, yet equally important, aspect of your application.

Do not neglect any of these aspects. Because if you do, and if it’s security, and if you lose sensitive customer data – well, you’re not going to stay in business, you’ll be sued in court.

Got hooked on the security topic?

Continue reading about …

Top 10 Easy Performance Optimisations in Java


There has been a lot of hype about the buzzword “web scale“, and people are going to great lengths to reorganise their application architecture to get their systems to “scale”.

But what is scaling, and how can we make sure that we can scale?

Different aspects of scaling

The hype mentioned above is mostly about scaling load, i.e. making sure that a system that works for 1 user will also work well for 10 users, or 100 users, or millions. Ideally, your system is as “stateless” as possible, such that the few pieces of state that really remain can be transferred and transformed on any processing unit in your network. When load is your problem, latency is probably not, so it’s OK if individual requests take 50-100ms. This is often also referred to as scaling out.

An entirely different aspect of scaling is about scaling performance, i.e. making sure that an algorithm that works for 1 piece of information will also work well for 10 pieces, or 100 pieces, or millions. Whether this type of scaling is feasible is best described by Big O Notation. Latency is the killer when scaling performance. You want to do everything possible to keep all calculation on a single machine. This is often also referred to as scaling up.

If there were such a thing as a free lunch (there isn’t), we could combine scaling up and out indefinitely. Anyway, today, we’re going to look at some very easy ways to improve things on the performance side.

Big O Notation

Java 7’s ForkJoinPool as well as Java 8’s parallel Stream help parallelising stuff, which is great when you deploy your Java program onto a multi-core processor machine. The advantage of such parallelism compared to scaling across different machines on your network is the fact that you can almost completely eliminate latency effects, as all cores can access the same memory.

But don’t be fooled by the effect that parallelism has! Remember the following two things:

  • Parallelism eats up your cores. This is great for batch processing, but a nightmare for asynchronous servers (such as HTTP). There are good reasons why we’ve used the single-thread servlet model in the past decades. So parallelism only helps when scaling up.
  • Parallelism has no effect on your algorithm’s Big O Notation. If your algorithm is O(n log n), and you let that algorithm run on c cores, you will still have an O(n log n / c) algorithm, as c is an insignificant constant in your algorithm’s complexity. You will save wall-clock time, but not reduce complexity!

The best way to improve performance, of course, is by reducing algorithm complexity. The killer is to achieve O(1) or quasi-O(1), of course – for instance, a HashMap lookup. But that is not always possible, let alone easy.

If you cannot reduce your complexity, you can still gain a lot of performance if you tweak your algorithm where it really matters, if you can find the right spots. Assume the following visual representation of an algorithm:

(Figure: a tree-shaped algorithm, with a left branch N -> M -> heavy operation and a right branch N -> O -> P -> easy operation)

The overall complexity of the algorithm is O(N³), or O(N x O x P) if we want to deal with individual orders of magnitude. However, when profiling this code, you might find a funny scenario:

  • On your development box, the left branch (N -> M -> Heavy operation) is the only branch that you can see in your profiler, because the values for O and P are small in your development sample data.
  • On production, however, the right branch (N -> O -> P -> Easy operation or also N.O.P.E.) is really causing trouble. Your operations team might have figured this out using AppDynamics, or DynaTrace, or some similar software.

Without production data, you might quickly jump to conclusions and optimise the “heavy operation”. You ship to production and your fix has no effect.

There are no golden rules to optimisation apart from the facts that:

  • A well-designed application is much easier to optimise
  • Premature optimisation will not solve any performance problems, but it will make your application less well-designed, which in turn makes it harder to optimise

Enough theory. Let’s assume that you have found the right branch to be the issue. It may well be that a very easy operation is blowing up in production, because it is called lots and lots of times (if N, O, and P are large). Please read this article in the context of there being a problem at the leaf node of an inevitable O(N³) algorithm. These optimisations won’t help you scale. They’ll help you save your customer’s day for now, deferring the difficult improvement of the overall algorithm until later!

Here are the top 10 easy performance optimisations in Java:

1. Use StringBuilder

This should be your default in almost all Java code. Try to avoid the + operator. Sure, you may argue that it is just syntax sugar for a StringBuilder anyway, as in:

String x = "a" + args.length + "b";

… which compiles to

 0  new java.lang.StringBuilder [16]
 3  dup
 4  ldc <String "a"> [18]
 6  invokespecial java.lang.StringBuilder(java.lang.String) [20]
 9  aload_0 [args]
10  arraylength
11  invokevirtual java.lang.StringBuilder.append(int) : java.lang.StringBuilder [23]
14  ldc <String "b"> [27]
16  invokevirtual java.lang.StringBuilder.append(java.lang.String) : java.lang.StringBuilder [29]
19  invokevirtual java.lang.StringBuilder.toString() : java.lang.String [32]
22  astore_1 [x]

But what happens, if later on, you need to amend your String with optional parts?

String x = "a" + args.length + "b";

if (args.length == 1)
    x = x + args[0];

You will now have a second StringBuilder, that just needlessly consumes memory off your heap, putting pressure on your GC. Write this instead:

StringBuilder x = new StringBuilder("a");
x.append(args.length);
x.append("b");

if (args.length == 1)
    x.append(args[0]);

Takeaway

In the above example, it is probably completely irrelevant if you’re using explicit StringBuilder instances, or if you rely on the Java compiler creating implicit instances for you. But remember, we’re in the N.O.P.E. branch. Every CPU cycle that we’re wasting on something as stupid as GC or allocating a StringBuilder‘s default capacity, we’re wasting N x O x P times.

As a rule of thumb, always use a StringBuilder rather than the + operator. And if you can, keep the StringBuilder reference across several methods, if your String is more complex to build. This is what jOOQ does when you generate a complex SQL statement. There is only one StringBuilder that “traverses” your whole SQL AST (Abstract Syntax Tree).

And for crying out loud, if you still have StringBuffer references, do replace them with StringBuilder. You really hardly ever need to synchronize on a string being created.

2. Avoid regular expressions

Regular expressions are relatively cheap and convenient. But if you’re in the N.O.P.E. branch, they’re about the worst thing you can do. If you absolutely must use regular expressions in computation-intensive code sections, at least cache the Pattern reference instead of compiling it afresh all the time:

static final Pattern HEAVY_REGEX = 
    Pattern.compile("(((X)*Y)*Z)*");

But if your regular expression is really silly like

String[] parts = ipAddress.split("\\.");

… then you’re really better off resorting to ordinary char[] or index-based manipulation. For example, this utterly unreadable loop does the same thing:

// Assuming a plain IPv4 address with exactly 4 parts
String[] parts = new String[4];
int length = ipAddress.length();
int offset = 0;
int part = 0;
for (int i = 0; i < length; i++) {
    if (i == length - 1 || 
            ipAddress.charAt(i + 1) == '.') {
        parts[part] = 
            ipAddress.substring(offset, i + 1);
        part++;
        offset = i + 2;
    }
}

… which also shows why you shouldn’t do any premature optimisation. Compared to the split() version, this is unmaintainable.

Challenge: The clever ones among our readers might find even faster algorithms.

Takeaway

Regular expressions are useful, but they come at a price. If you’re deep down in a N.O.P.E. branch, you must avoid regular expressions at all costs. Beware of the variety of JDK String methods that use regular expressions, such as String.replaceAll() or String.split().

Use a popular library like Apache Commons Lang instead, for your String manipulation.

3. Do not use iterator()

Now, this advice is really not for general use-cases, but only applicable deep down in a N.O.P.E. branch. Nonetheless, you should think about it. Writing Java-5 style foreach loops is convenient. You can just completely forget about looping internals, and write:

for (String value : strings) {
    // Do something useful here
}

However, every time you run into this loop, if strings is an Iterable, you will create a new Iterator instance. If you’re using an ArrayList, this is going to be allocating an object with 3 ints on your heap:

private class Itr implements Iterator<E> {
    int cursor;
    int lastRet = -1;
    int expectedModCount = modCount;
    // ...

Instead, you can write the following, equivalent loop and “waste” only a single int value on the stack, which is dirt cheap:

int size = strings.size();
for (int i = 0; i < size; i++) {
    String value = strings.get(i);
    // Do something useful here
}

… or, if your list doesn’t really change, you might even operate on an array version of it:

for (String value : stringArray) {
    // Do something useful here
}

Takeaway

Iterators, Iterable, and the foreach loop are extremely useful from a writeability and readability perspective, as well as from an API design perspective. However, they create a small new instance on the heap for each single iteration. If you run this iteration many many times, you want to make sure to avoid creating this useless instance, and write index-based iterations instead.

Discussion

Some interesting disagreement about parts of the above (in particular replacing Iterator usage by access-by-index) has been discussed on Reddit here.

4. Don’t call that method

Some methods are simply expensive. In our N.O.P.E. branch example, we don’t have such a method at the leaf, but you may well have one. Let’s assume your JDBC driver needs to go through incredible trouble to calculate the value of ResultSet.wasNull(). Your homegrown SQL framework code might look like this:

if (type == Integer.class) {
    result = (T) wasNull(rs, 
        Integer.valueOf(rs.getInt(index)));
}

// And then...
static final <T> T wasNull(ResultSet rs, T value) 
throws SQLException {
    return rs.wasNull() ? null : value;
}

This logic will now call ResultSet.wasNull() every time you get an int from the result set. But the getInt() contract reads:

Returns: the column value; if the value is SQL NULL, the value returned is 0

Thus, a simple, yet possibly drastic improvement to the above would be:

static final <T extends Number> T wasNull(
    ResultSet rs, T value
) 
throws SQLException {
    return (value == null || 
           (value.intValue() == 0 && rs.wasNull())) 
        ? null : value;
}

So, this is a no-brainer:

Takeaway

Don’t call expensive methods in an algorithm’s “leaf nodes”, but cache the call instead, or avoid it if the method contract allows it.

5. Use primitives and the stack

The above example is from jOOQ, which uses a lot of generics, and thus is forced to use wrapper types for byte, short, int, and long – at least until generics become specialisable in Java 10 and Project Valhalla. But you may not have this constraint in your code, so you should take all measures to replace:

// Goes to the heap
Integer i = 817598;

… by this:

// Stays on the stack
int i = 817598;

Things get worse when you’re using arrays. Replace this:

// Three heap objects!
Integer[] i = { 1337, 424242 };

… by this:

// One heap object.
int[] i = { 1337, 424242 };

Takeaway

When you’re deep down in your N.O.P.E. branch, you should be extremely wary of using wrapper types. Chances are that you will create a lot of pressure on your GC, which has to kick in all the time to clean up your mess.

A particularly useful optimisation might be to use some primitive type and create large, one-dimensional arrays of it, and a couple of delimiter variables to indicate where exactly your encoded object is located on the array.

An excellent library for primitive collections, which are a bit more sophisticated than your average int[], is trove4j, which ships under the LGPL.

Exception

There is an exception to this rule: boolean and byte have few enough values to be cached entirely by the JDK. You can write:

Boolean a1 = true; // ... syntax sugar for:
Boolean a2 = Boolean.valueOf(true);

Byte b1 = (byte) 123; // ... syntax sugar for:
Byte b2 = Byte.valueOf((byte) 123);

The same is true for low values of the other integer primitive types, including char, short, int, long.

But only if you’re auto-boxing them, or calling TheType.valueOf(), not when you call the constructor!

Never call the constructor on wrapper types, unless you really want a new instance

This fact can also help you write a sophisticated, trolling April Fool’s joke for your co-workers.

Off heap

Of course, you might also want to experiment with off-heap libraries, although they’re more of a strategic decision, not a local optimisation.

An interesting article on that subject by Peter Lawrey and Ben Cotton is: OpenJDK and HashMap… Safely Teaching an Old Dog New (Off-Heap!) Tricks

6. Avoid recursion

Modern functional programming languages like Scala encourage the use of recursion, as they offer means of optimising tail-recursive algorithms back into iterative ones. If your language supports such optimisations, you might be fine. But even then, the slightest change to the algorithm might produce a branch that prevents your recursion from being tail-recursive. Hopefully the compiler will detect this! Otherwise, you might be wasting a lot of stack frames for something that might have been implemented using only a few local variables.

Takeaway

There’s not much to say about this apart from: always prefer iteration over recursion when you’re deep down in the N.O.P.E. branch.

7. Use entrySet()

When you want to iterate through a Map, and you need both keys and values, you must have a very good reason to write the following:

for (K key : map.keySet()) {
    V value = map.get(key);
}

… rather than the following:

for (Entry<K, V> entry : map.entrySet()) {
    K key = entry.getKey();
    V value = entry.getValue();
}

When you’re in the N.O.P.E. branch, you should be wary of maps anyway, because lots and lots of O(1) map access operations are still lots of operations. And the access isn’t free either. But at least, if you cannot do without maps, use entrySet() to iterate them! The Map.Entry instance is there anyway, you only need to access it.

Takeaway

Always use entrySet() when you need both keys and values during map iteration.

8. Use EnumSet or EnumMap

There are some cases where the number of possible keys in a map is known in advance – for instance when using a configuration map. If that number is relatively small, you should really consider using EnumSet or EnumMap instead of regular HashSet or HashMap. This is easily explained by looking at EnumMap.put():

private transient Object[] vals;

public V put(K key, V value) {
    // ...
    int index = key.ordinal();
    vals[index] = maskNull(value);
    // ...
}

The essence of this implementation is the fact that we have an array of indexed values rather than a hash table. When inserting a new value, all we have to do to look up the map entry is ask the enum for its constant ordinal, which is generated by the Java compiler on each enum type. If this is a global configuration map (i.e. only one instance), the increased access speed will help EnumMap heavily outperform HashMap, which may use a bit less heap memory, but which will have to run hashCode() and equals() on each key.

Takeaway

Enum and EnumMap are very close friends. Whenever you use enum-like structures as keys, consider actually making those structures enums and using them as keys in EnumMap.

9. Optimise your hashCode() and equals() methods

If you cannot use an EnumMap, at least optimise your hashCode() and equals() methods. A good hashCode() method is essential because it will prevent further calls to the much more expensive equals() as it will produce more distinct hash buckets per set of instances.

In every class hierarchy, you may have popular and simple objects. Let’s have a look at jOOQ’s org.jooq.Table implementations.

The simplest and fastest possible implementation of hashCode() is this one:

// AbstractTable, a common Table base implementation:

@Override
public int hashCode() {

    // [#1938] This is a much more efficient hashCode()
    // implementation compared to that of standard
    // QueryParts
    return name.hashCode();
}

… where name is simply the table name. We don’t even consider the schema or any other property of the table, as the table names are usually distinct enough across a database. Also, the name is a string, so it already has a cached hashCode() value inside.

The comment is important, because AbstractTable extends AbstractQueryPart, which is a common base implementation for any AST (Abstract Syntax Tree) element. The common AST element does not have any properties, so it cannot make any assumptions that would allow for an optimised hashCode() implementation. Thus, the overridden method looks like this:

// AbstractQueryPart, a common AST element
// base implementation:

@Override
public int hashCode() {
    // This is a working default implementation. 
    // It should be overridden by concrete subclasses,
    // to improve performance
    return create().renderInlined(this).hashCode();
}

In other words, the whole SQL rendering workflow has to be triggered to calculate the hash code of a common AST element.

Things get more interesting with equals():

// AbstractTable, a common Table base implementation:

@Override
public boolean equals(Object that) {
    if (this == that) {
        return true;
    }

    // [#2144] Non-equality can be decided early, 
    // without executing the rather expensive
    // implementation of AbstractQueryPart.equals()
    if (that instanceof AbstractTable) {
        if (StringUtils.equals(name, 
            (((AbstractTable<?>) that).name))) {
            return super.equals(that);
        }

        return false;
    }

    return false;
}

First thing: Always (not only in a N.O.P.E. branch) abort every equals() method early, if:

  • this == argument
  • the argument is of an incompatible type

Note that the latter condition includes argument == null, if you’re using instanceof to check for compatible types. We’ve blogged about this before in 10 Subtle Best Practices when Coding Java.

Now, after aborting comparison early in obvious cases, you might also want to abort comparison early when you can make partial decisions. For instance, the contract of jOOQ’s Table.equals() is that for two tables to be considered equal, they must be of the same name, regardless of the concrete implementation type. For instance, there is no way these two items can be equal:

  • com.example.generated.Tables.MY_TABLE
  • DSL.tableByName("MY_OTHER_TABLE")

If the argument cannot be equal to this, and if we can check that easily, let’s do so and abort if the check fails. If the check succeeds, we can still proceed with the more expensive implementation from super. Given that most objects in the universe are not equal, we’re going to save a lot of CPU time by shortcutting this method.

some objects are more equal than others

In the case of jOOQ, most instances are really tables as generated by the jOOQ source code generator, whose equals() implementation is even further optimised. The dozens of other table types (derived tables, table-valued functions, array tables, joined tables, pivot tables, common table expressions, etc.) can keep their “simple” implementation.

10. Think in sets, not in individual elements

Last but not least, there is a thing that is not Java-related but applies to any language. Besides, we’re leaving the N.O.P.E. branch, as this advice might just help you move from O(N³) to O(n log n), or something like that.

Unfortunately, many programmers think in terms of simple, local algorithms. They’re solving a problem step by step, branch by branch, loop by loop, method by method. That’s the imperative and/or functional programming style. While it is increasingly easy to model the “bigger picture” when going from pure imperative to object oriented (still imperative) to functional programming, all these styles lack something that only SQL and R and similar languages have:

Declarative programming.

In SQL (and we love it, as this is the jOOQ blog) you can declare the outcome you want to get from your database, without making any algorithmic implications whatsoever. The database can then take all the meta data available into consideration (e.g. constraints, keys, indexes, etc.) to figure out the best possible algorithm.

In theory, this has been the main idea behind SQL and relational calculus from the beginning. In practice, SQL vendors have implemented highly efficient CBOs (Cost-Based Optimisers) only over the last decade or so. So stay with us in the 2010s, when SQL will finally unleash its full potential (it was about time!).

But you don’t have to do SQL to think in sets. Sets / collections / bags / lists are available in all languages and libraries. The main advantage of using sets is the fact that your algorithms will become much much more concise. It is so much easier to write:

SomeSet INTERSECT SomeOtherSet

rather than:

// Pre-Java 8
Set<Object> result = new HashSet<>();
for (Object candidate : someSet)
    if (someOtherSet.contains(candidate))
        result.add(candidate);

// Even Java 8 doesn't really help
result = someSet.stream()
                .filter(someOtherSet::contains)
                .collect(Collectors.toSet());

Some may argue that functional programming and Java 8 will help you write easier, more concise algorithms. That’s not necessarily true. You can translate your imperative Java-7-loop into a functional Java-8 Stream collection, but you’re still writing the very same algorithm. Writing a SQL-esque expression is different. This…

SomeSet INTERSECT SomeOtherSet

… can be implemented in 1000 ways by the implementation engine. As we’ve learned today, perhaps it is wise to transform the two sets into EnumSet automatically, before running the INTERSECT operation. Perhaps we can parallelise this INTERSECT without making low-level calls to Stream.parallel().
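
To illustrate, here’s a minimal sketch of such an “engine” (SetEngine and intersect() are made-up names, not a jOOQ or JDK API). The caller merely declares the intersection; the implementation is free to pick a fast path based on what it knows about the operands:

import java.util.EnumSet;
import java.util.HashSet;
import java.util.Set;

final class SetEngine {

    @SuppressWarnings({ "unchecked", "rawtypes" })
    static <E> Set<E> intersect(Set<E> s1, Set<E> s2) {

        // Fast path: two EnumSets intersect via a simple
        // bit mask operation inside retainAll(), which is
        // far cheaper than hashing every element
        if (s1 instanceof EnumSet && s2 instanceof EnumSet) {
            EnumSet copy = EnumSet.copyOf((EnumSet) s1);
            copy.retainAll(s2);
            return copy;
        }

        // Generic fallback: iterate over the smaller set,
        // probe the larger one
        Set<E> smaller = s1.size() <= s2.size() ? s1 : s2;
        Set<E> larger = smaller == s1 ? s2 : s1;
        Set<E> result = new HashSet<>();
        for (E e : smaller)
            if (larger.contains(e))
                result.add(e);
        return result;
    }
}

The call site never changes; only the engine’s internals do. That’s the declarative mindset in miniature.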

Conclusion

In this article, we’ve talked about optimisations done on the N.O.P.E. branch, i.e. deep down in a high-complexity algorithm. In our case, being the jOOQ developers, we have an interest in optimising our SQL generation:

  • Every query is generated only on a single StringBuilder
  • Our templating engine actually parses characters, instead of using regular expressions
  • We use arrays wherever we can, especially when iterating over listeners
  • We stay clear of JDBC methods that we don’t have to call
  • etc…

jOOQ is at the “bottom of the food chain”, because it’s the (second-)last API that is being called by our customers’ applications before the call leaves the JVM to enter the DBMS. Being at the bottom of the food chain means that every line of code that is executed in jOOQ might be called N x O x P times, so we must optimise eagerly.

Your business logic is probably not deep down in the N.O.P.E. branch. But your own, home-grown infrastructure logic (custom SQL frameworks, custom libraries, etc.) may well be. It should be reviewed according to the rules that we’ve seen today, for instance using Java Mission Control or any other profiler.

Liked this article?

If you can’t go and profile your application right now, you might enjoy reading any of these articles instead:

Still Using Windows 3.1? So why stick to SQL-92?


We’ve been blogging a lot about the merits of modern SQL on the jOOQ blog. Specifically, window functions are one of the most fascinating features. But there are many many others.

Markus Winand, author of the popular book SQL Performance Explained, has recently given a very well-researched talk about modern SQL. We particularly like his headline.

Should this wonderful presentation have convinced you to buy a copy of SQL Performance Explained (our review here), do not forget to enter the “jOOQ” promo code for an exclusive 10% discount!

Top 5 Use-Cases For Nested Types


There has been an interesting discussion on reddit the other day: Static Inner Classes. When is it too much?

First, let’s review a little bit of basic historic Java knowledge. Java-the-language offers four ways of nesting classes, and by “Java-the-language”, I mean that these constructs are mere “syntax sugar”. They don’t exist in the JVM, which only knows ordinary classes.

(Static) Nested classes

class Outer {
    static class Inner {
    }
}

In this case, Inner is completely independent of Outer, except for a common, shared namespace.

Inner classes

class Outer {
    class Inner {
    }
}

In this case, Inner instances have an implicit reference to their enclosing Outer instance. In other words, there can be no Inner instance without an associated Outer instance.

The Java way of creating such an instance is this:

Outer.Inner yikes = new Outer().new Inner();

What looks totally awkward makes a lot of sense. Think about creating an Inner instance somewhere inside of Outer:

class Outer {
    class Inner {
    }

    void somewhereInside() {
        // We're already in the scope of Outer.
        // We don't have to qualify Inner explicitly.
        Inner aaahOK;

        // This is what we're used to writing.
        aaahOK = new Inner();

        // As all other locally scoped methods, we can
        // access the Inner constructor by 
        // dereferencing it from "this". We just
        // hardly ever write "this"
        aaahOK = this.new Inner();
    }
}

Note that much like the public or abstract keywords, the static keyword is implicit for nested interfaces. While the following hypothetical syntax might look familiar at first sight…:

class Outer {
    <non-static> interface Inner {
        default void doSomething() {
            Outer.this.doSomething();
        }
    }

    void doSomething() {}
}

… it is not possible to write the above. Apart from the lack of a <non-static> keyword, there doesn’t seem to be any obvious reason why “inner interfaces” shouldn’t be possible. I’d suspect the usual: some really edge-casey caveat related to backwards-compatibility and/or multiple inheritance prevents this.
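
To convince yourself that nested interfaces really are implicitly static, consider this minimal sketch (all names made up):

class Outer {
    interface Inner {} // no static keyword, yet implicitly static
}

// Inner can be implemented without any Outer instance,
// just like a static nested class:
class Elsewhere implements Outer.Inner {
}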

Local classes

class Outer {
    void somewhereInside() {
        class Inner {
        }
    }
}

Local classes are probably one of the least known features in Java, as there is hardly any use for them. Local classes are named types whose scope extends only to the enclosing method. Obvious use-cases are when you want to reuse such a type several times within that method, e.g. to construct several similar listeners in a JavaFX application.
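
Here’s a minimal, plain-Java sketch of that idea, using Runnable as a stand-in for an actual listener type:

import java.util.ArrayList;
import java.util.List;

class LocalClassExample {
    List<Runnable> createHandlers() {

        // The local class is visible only inside this method...
        class LoggingHandler implements Runnable {
            final String prefix;
            LoggingHandler(String prefix) { this.prefix = prefix; }

            @Override
            public void run() {
                System.out.println(prefix + ": event received");
            }
        }

        // ... but unlike an anonymous class, it can be
        // instantiated several times
        List<Runnable> handlers = new ArrayList<>();
        handlers.add(new LoggingHandler("INFO"));
        handlers.add(new LoggingHandler("WARN"));
        return handlers;
    }
}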

Anonymous classes

class Outer {
    Serializable dummy = new Serializable() {};
}

Anonymous classes are unnamed subtypes of another type, of which there is exactly one instance.

Top 5 Use-Cases For Nested Classes

All of anonymous, local, and inner classes keep a reference to their enclosing instance, if they’re not defined in a static context. This may cause a lot of trouble if you let instances of these classes leak outside of their scope. Read more about that trouble in our article: Don’t be Clever: The Double Curly Braces Anti Pattern.

Often, however, you do want to profit from that enclosing instance. It can be quite useful to have some sort of “message” object that you can return without disclosing the actual implementation:

// The contract that we expose to the outside
interface Message {
    void getMessage();
}

class Outer {

    // This implementation is private ...
    private class Inner implements Message {
        @Override
        public void getMessage() {
            Outer.this.someoneCalledMe();
        }
    }

    // ... but we can return it, being of
    // type Message
    Message hello() {
        return new Inner();
    }

    void someoneCalledMe() {}
}

With (static) nested classes, however, there is no enclosing scope as the Inner instance is completely independent of any Outer instance. So what’s the point of using such a nested class, rather than a top-level type?

1. Association with the outer type

If you want to communicate to the whole world, hey, this (inner) type is totally related to this (outer) type, and doesn’t make sense on its own, then you can nest the types. This has been done with Map and Map.Entry, for instance:

public interface Map<K, V> {
    interface Entry<K, V> {
    }
}
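
A quick usage example shows how naturally the nesting reads at the call site:

import java.util.HashMap;
import java.util.Map;

class EntryExample {
    public static void main(String[] args) {
        Map<String, Integer> map = new HashMap<>();
        map.put("a", 1);
        map.put("b", 2);

        // Map.Entry spells out the association explicitly
        for (Map.Entry<String, Integer> entry : map.entrySet())
            System.out.println(
                entry.getKey() + " -> " + entry.getValue());
    }
}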

2. Hiding from the outside of the outer type

If package (default) visibility isn’t enough for your types, you can create private static classes that are available only to their enclosing type and to all other nested types of the enclosing type. This is really the main use-case for static nested classes.

class Outer {
    private static class Inner {
    }
}

class Outer2 {
    Outer.Inner nope; // Compilation error: Inner is not visible here
}

3. Protected types

This is really a very rare use-case, but sometimes, within a class hierarchy, you need types that you want to make available only to subtypes of a given type. This is a use-case for protected static classes:

class Parent {
    protected static class OnlySubtypesCanSeeMe {
    }

    protected OnlySubtypesCanSeeMe someMethod() {
        return new OnlySubtypesCanSeeMe();
    }
}

class Child extends Parent {
    OnlySubtypesCanSeeMe wow = someMethod();
}

4. To emulate modules

Unlike Ceylon, Java doesn’t have first-class modules. With Maven or OSGi, it is possible to add some modular behaviour to Java’s build (Maven) or runtime (OSGi) environments, but if you want to express modules in code, this isn’t really possible.

However, you can establish modules by convention by using static nested classes. Let’s look at the java.util.stream package. We could consider it a module, and within this module, we have a couple of “sub-modules”, or groups of types, such as the internal java.util.stream.Nodes class, which roughly looks like this:

final class Nodes {
    private Nodes() {}
    private static abstract class AbstractConcNode {}
    static final class ConcNode {
        static final class OfInt {}
        static final class OfLong {}
    }
    private static final class FixedNodeBuilder {}
    // ...
}

Some of this Nodes stuff is available to the whole java.util.stream package, so we might say that, the way this is written, we have something like:

  • a synthetic java.util.stream.nodes sub-package, visible only to the java.util.stream “module”
  • a couple of java.util.stream.nodes.* types, visible also only to the java.util.stream “module”
  • a couple of “top-level” functions (static methods) in the synthetic java.util.stream.nodes package

Looks a lot like Ceylon to me!

5. Cosmetic reasons

The last bit is rather boring. Or some may find it interesting. It’s about taste, or ease of writing things. Some classes are just so small and unimportant, it’s just easier to write them inside of another class. Saves you a .java file. Why not.

Conclusion

In times of Java 8, thinking about the very old features of Java the language might not prove to be extremely exciting. Static nested classes are a well-understood tool for a couple of niche use-cases.

The takeaway from this article, however, is this: every time you nest a class, be sure to make it static if you don’t absolutely need a reference to the enclosing instance. You never know when that hidden reference will blow up your application in production.
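
To make that concrete, here’s a minimal sketch (made-up names, exaggerated sizes) of how the hidden reference keeps the enclosing instance alive:

import java.util.ArrayList;
import java.util.List;

class Cache {
    private final byte[] hugeState = new byte[10_000_000];

    // Non-static: every LeakyKey holds a synthetic this$0
    // reference to its enclosing Cache, keeping hugeState
    // reachable for as long as the key itself lives
    class LeakyKey {}

    // Static: no hidden reference to any Cache instance
    static class SafeKey {}
}

class LeakDemo {
    static final List<Object> keys = new ArrayList<>();

    public static void main(String[] args) {
        Cache cache = new Cache();
        keys.add(cache.new LeakyKey()); // cache can never be GC'd
        keys.add(new Cache.SafeKey());  // no such problem
    }
}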
