I just fixed a bug. The fix required me to initialise an Object[] array with the init values for each type, instead of just null, i.e. false for boolean, 0 for int, 0.0 for double, etc. So, instead of just doing:
Object[] converted = new Object[parameterTypes.length];
I needed:
Object[] converted = new Object[parameterTypes.length];
for (int i = 0; i < converted.length; i++)
converted[i] = Reflect.initValue(parameterTypes[i]);
For the subjective 8E17th time, I wrote a loop. A loop that did nothing interesting other than call a method for each of the looped structure’s elements. And I felt the pain of our friend Murtaugh.
Why do we distinguish between T and T[]?
What I really wanted to do is this. I have a method Reflect.initValue()
public static <T> T initValue(Class<T> type) { ... }
What I really want to do is this, in one way or another:
converted = initValue(parameterTypes);
(Yes, there are subtleties that need to be thought about, such as should this init an array or assign values to an array. Forget about them for now. Big picture first).
The point is, no one enjoys writing loops. No one enjoys writing map/flatMap either:
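For instance, a Stream based version of the very same conversion might look like this (a sketch, assuming the Reflect.initValue() method introduced above and java.util.stream.Stream):

Object[] converted = Stream
    .of(parameterTypes)
    .map(Reflect::initValue)
    .toArray();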
It’s so much useless, repetitive, infrastructural ceremony that I enjoy neither writing nor reading. My “business logic” here is simply
converted = initValue(parameterTypes);
I have 3 elements:
A source data structure parameterTypes
A target data structure converted
A mapping function initValue
That’s all I should be seeing in my code. All the infrastructure of how to iterate is completely meaningless and boring.
SQL joins
In fact, SQL joins are often the same. We use primary key / foreign key relationships, so the path between parent and child tables is very obvious in most cases. Joins are cool, relational algebra is cool, but in most cases, it just gets in the way of writing understandable business logic. In my opinion, this is one of Hibernate’s biggest innovations (probably others did this too, perhaps even before Hibernate): implicit joins, which jOOQ copied.
There’s much ceremony in writing this:
SELECT
cu.first_name,
cu.last_name,
co.country
FROM customer AS cu
JOIN address USING (address_id)
JOIN city USING (city_id)
JOIN country AS co USING (country_id)
When this alternative, intuitive syntax would be much more convenient:
SELECT
cu.first_name,
cu.last_name,
cu.address.city.country.country
FROM customer AS cu
It is immediately clear what is meant by the implicit join syntax. The syntactic ceremony of writing the explicit joins is not necessary.
Again, joins are really cool, and power users will be able to use them when needed. E.g. the occasional NATURAL FULL OUTER JOIN can still be done! But let’s admit it, 80% of all joins are boring, and could be replaced with the above syntax sugar.
Suggestion for Java
Of course, this suggestion will not be perfect, because it doesn’t deal with the gazillion edge cases of introducing such a significant feature to an old language. But again, if we allow ourselves to focus on the big picture, wouldn’t it be nice if we could:
class Author {
String firstName;
String lastName;
Book[] books; // Or use any collection type here
}
class Book {
String title;
}
And then:
Author[] authors = ...
// This...
String[] firstNames = authors.firstName;
// ...is sugar for this (oh how it hurts to type this):
String[] firstNames = new String[authors.length];
for (int i = 0; i < firstNames.length; i++)
firstNames[i] = authors[i].firstName;
// And this...
int[] firstNameLengths = authors.firstName.length();
// ... is sugar for this:
int[] firstNameLengths = new int[authors.length];
for (int i = 0; i < firstNameLengths.length; i++)
firstNameLengths[i] = authors[i].firstName.length();
// ... or even this, who cares (hurts to type even more):
int[] firstNameLengths = Stream
.of(authors)
.map(a -> a.firstName)
.mapToInt(String::length)
.toArray();
Ignore the usage of arrays; it could just as well be a List, Stream, Iterable, or whatever data structure or syntax that allows us to get from arity 1 to arity N.
Why do we have to keep spelling these things out? They’re not business logic, they’re meaningless, boring infrastructure. While yes, there are surely many edge cases (and we could live with the occasional compiler errors, if the compiler can’t figure out how to get from A to B), there are also many “very obvious” cases, where the ceremonial mapping logic (imperative or functional, doesn’t matter) is just completely obvious and boring.
But it gets in the way of writing and reading, and despite the fact that it seems obvious in many cases, it is still error prone!
I think it’s time to revisit the ideas behind APL, where everything is an array, and by consequence, operations on arity 1 types can be applied to arity N types just the same, because the distinction is often not very useful.
Bonus: Null
While difficult to imagine retrofitting a language like Java with this, a new language could do away with nulls forever, because the arity 0-1 is just a special case of the arity N: An empty array.
One of the Stream API’s greatest features is its laziness. The whole pipeline is constructed lazily, stored as a set of instructions, akin to a SQL execution plan. Only when we invoke a terminal operation is the pipeline started. It is still lazy, meaning that some operations may be short circuited.
Some third party libraries produce streams that are not entirely lazy. For example, jOOQ until version 3.12 eagerly executed a SQL query when calling ResultQuery.stream(), regardless of whether the Stream is consumed afterwards:
try (var stream = ctx.select(T.A, T.B).from(T).stream()) {
// Not consuming the stream here
}
While this is probably a bug in client code, not executing the statement in this case might still be a useful feature. The exception is, of course, a query containing a FOR UPDATE clause, in which case the user probably uses Query.execute() instead, if they don’t care about the result.
A more interesting example where laziness helps is the fact that we might not want this query to be executed right away, as we are perhaps still on the wrong thread to execute it. Or we would like any possible exceptions to be thrown from wherever the result is consumed, i.e. where the terminal operation is called. For example:
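A sketch, assuming T.A and T.B are string columns: the Stream is created in one place, but we would like the query to run only where the terminal operation is invoked, so that any exception surfaces at the consumption site:

public Stream<Record2<String, String>> records() {
    // With an eager stream() implementation, the query already runs here
    return ctx.select(T.A, T.B).from(T).stream();
}

// ... consumed later, possibly on another thread:
try (var s = records()) {
    s.forEach(System.out::println);
}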
While we’re fixing this in jOOQ 3.13 (https://github.com/jOOQ/jOOQ/issues/4934), you may be stuck on an older version of jOOQ, or have another library do the same thing. Luckily, there’s an easy trick to quickly make a third party stream “lazy”: flatMap it! Just write this instead:
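Reusing the earlier example, the wrapped version might look like this sketch:

try (var stream = Stream.of(1).flatMap(x ->
    ctx.select(T.A, T.B).from(T).stream())) {

    // Not consuming the stream here: flatMap() is an intermediate operation,
    // so the inner (eager) stream is only created once a terminal operation
    // is invoked on the outer stream; the query is never executed here.
}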
Writing a simple API is already an art of its own.
I didn’t have time to write a short letter, so I wrote a long one instead.
― Mark Twain
Keeping an API simple for beginners and most users, while making it extensible for power users, seems like an even greater challenge. But is it?
What does “extensible” mean?
Imagine an API like, oh say, jOOQ. In jOOQ, you can write SQL predicates like this:
ctx.select(T.A, T.B)
.from(T)
.where(T.C.eq(1)) // Predicate with bind value here
.fetch();
By default (as this should always be the default), jOOQ will generate and execute this SQL statement on your JDBC driver, using a bind variable:
SELECT t.a, t.b
FROM t
WHERE t.c = ?
The API made the most common use case simple. Just pass your bind variable as if the statement was written in e.g. PL/SQL, and let the language / API do the rest. So we passed that test.
You can turn your variable into an inline value explicitly for this single occasion:
ctx.select(T.A, T.B)
.from(T)
.where(T.C.eq(inline(1))) // Predicate without bind value here
.fetch();
This uses the statically imported DSL.inline() method. It works, but it is not very convenient if you have to do this for several queries, for several bind values, or worse, depending on some context.
This is a necessary API enhancement, but it does not make the API extensible.
On a global basis
Notice that ctx object there? It is the DSLContext object, the “contextual DSL”, i.e. the DSL API that is in the context of a jOOQ Configuration. You can thus set:
DSLContext ctx2 = DSL.using(ctx
.configuration()
.derive()
.set(new Settings()
.withStatementType(StatementType.STATIC_STATEMENT)));
// And now use this new DSLContext instead of the old one
ctx2.select(T.A, T.B)
.from(T)
.where(T.C.eq(1)) // No longer a bind variable
.fetch();
Different approaches to offering such extensibility
We have our clean and simple API. Now some user wants to extend it. So often, we’re tempted to resort to a hack, e.g. by using thread locals, because they work easily under the assumption of a thread-bound execution model – such as e.g. classic Java EE Servlets.
Given Java does not support optional method arguments, has anyone ever written Java APIs that use, e.g. ThreadLocal, to pass information into the API such that it does not need to be an explicit argument into the API? Are there other patterns people have seen?
It’s a hack, and as such it will break easily. If we offer this as functionality to a user, they will start depending on it, and we will have to support and maintain it
It’s a hack, and it is based on assumptions, such as thread-boundedness. It will not work in an async / reactive / parallel stream context, where our logic may jump back and forth between threads.
It’s a hack, and deep inside, we know it’s wrong. Obligatory XKCD: https://xkcd.com/292
This might obviously work, just like global (static) variables. You can set this variable globally (or “globally” for your own thread), and then the API’s internals will be able to read it. No need to pass around parameters, so no need to compromise on the API’s simplicity by adding optional and often ugly, distracting parameters.
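To make the discussion concrete, such a ThreadLocal hack might look like this sketch (all names are made up for illustration; runQuery() stands for any API entry point that reads the flag internally):

final class ApiSettings {

    // A thread-bound, global setting that the API internals read implicitly
    static final ThreadLocal<Boolean> INLINE_BIND_VALUES =
        ThreadLocal.withInitial(() -> false);
}

// Client code, somewhere up the call stack:
ApiSettings.INLINE_BIND_VALUES.set(true);
try {
    runQuery(); // The API reads ApiSettings.INLINE_BIND_VALUES.get() internally
}
finally {
    ApiSettings.INLINE_BIND_VALUES.remove();
}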
What are better approaches to offering such extensibility?
Dependency Injection
One way is to use explicit Dependency Injection (DI). If you have a container like Spring, you can rely on Spring injecting arbitrary objects into your method call / whatever, where you need access to it:
Most of the Spring Framework’s annotation-based services (transaction, security) work that way: register via a proxy on method entry, usable down the call stack.
This way, if you maintain several contextual objects of different lifecycle scopes, you can let the DI framework make appropriate decisions to figure out where to get that contextual information from. For example, when using JAX-RS, you can do this using an annotation based approach:
// These annotations bind the method to some HTTP address
@GET
@Produces("text/plain")
@Path("/api")
public String method(
// This annotation fetches a request-scoped object
// from the method call's context
@Context HttpServletRequest request,
// This annotation produces an argument from the
// URL's query parameters
@QueryParam("arg") String arg
) {
...
}
This approach works quite nicely for static environments (annotations being static), where you do not want to react to dynamic URLs or endpoints. It is declarative, and a bit magic, but well designed, so once you know all the options, you can choose the right one for your use case very easily.
While @QueryParam is mere convenience (you could have gotten the argument also from the HttpServletRequest), the @Context is powerful. It can help inject values of arbitrary lifecycle scope into your method / class / etc.
I personally favour explicit programming over annotation-based magic (e.g. using Guice for DI), but that’s probably a matter of taste. Both are a great way for implementors of APIs (e.g. HTTP APIs) to help get access to framework objects.
However, if you’re an API vendor, and want to give users of your API a way to extend the API, I personally favour jOOQ’s SPI approach.
SPIs
One of jOOQ’s strengths, IMO, is precisely this single, central place to register all SPI implementations that can be used for all sorts of purposes: The Configuration.
For example, on such a Configuration you can specify a JSR-310 java.time.Clock. This clock will be used by jOOQ’s internals to produce client side timestamps, instead of e.g. using System.currentTimeMillis(). Definitely a use case for power users only, but once you have this use case, you really only want to tweak a single place in jOOQ’s API: The Configuration.
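For example, a derived Configuration with a fixed Clock might look like this (a sketch; the derive(Clock) overload is assumed to exist in your jOOQ version; a fixed Clock makes client side timestamps deterministic, e.g. in tests):

DSLContext testCtx = DSL.using(ctx
    .configuration()
    .derive(Clock.fixed(Instant.parse("2019-01-01T00:00:00Z"), ZoneOffset.UTC)));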
All of jOOQ’s internals will always have a Configuration reference available. And it’s up to the user to decide what the scope of this object is, jOOQ doesn’t care. E.g.
per query
per thread
per request
per session
per application
In other words, to jOOQ, it doesn’t matter at all if you’re implementing a thread-bound, blocking, classic servlet model, or if you’re running your code reactively, or in parallel, or whatever. Just manage your own Configuration lifecycle, jOOQ doesn’t care.
In fact, you can have a global, singleton Configuration and implement thread-bound components of it, e.g. the ConnectionProvider SPI, which takes care of managing the JDBC Connection lifecycle for jOOQ. Typically, users will use e.g. a Spring DataSource, which manages JDBC Connection (and transactions) using a thread-bound model, internally using ThreadLocal. jOOQ does not care. The SPI specifies that jOOQ will:
Call ConnectionProvider.acquire() right before it needs a JDBC Connection to run a statement
Call ConnectionProvider.release() right after it is done with that Connection
Again, it does not matter to jOOQ what the specific ConnectionProvider implementation does. You can implement it in any way you want if you’re a power user. By default, you’ll just pass jOOQ a DataSource, and it will wrap it in a default implementation called DataSourceConnectionProvider for you.
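For illustration, a minimal ConnectionProvider delegating to a DataSource might look like the following sketch, which is roughly what the default DataSourceConnectionProvider does for you:

class MyConnectionProvider implements ConnectionProvider {
    final DataSource ds;

    MyConnectionProvider(DataSource ds) {
        this.ds = ds;
    }

    // jOOQ calls this right before it runs a statement
    @Override
    public Connection acquire() {
        try {
            return ds.getConnection();
        }
        catch (SQLException e) {
            throw new DataAccessException("Error acquiring connection", e);
        }
    }

    // jOOQ calls this when it is done with the connection
    @Override
    public void release(Connection connection) {
        try {
            connection.close();
        }
        catch (SQLException e) {
            throw new DataAccessException("Error releasing connection", e);
        }
    }
}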
The key here is again:
The API is simple by default, i.e. by default, you don’t have to know about this functionality, just pass jOOQ a DataSource as always when working with Java and SQL, and you’re ready to go
The SPI allows for easily extending the API without compromising on its simplicity, by providing a single, central access point to this kind of functionality
Other SPIs in Configuration include:
ExecuteListener: An extremely useful and simple way to hook into the entire jOOQ query management lifecycle, from generating the SQL string to preparing the JDBC statement, to binding variables, to execution, to fetching result sets. A single SPI can accommodate various use cases like SQL logging, patching SQL strings, patching JDBC statements, listening to result set events, etc. (see the sketch after this list).
ExecutorProvider: Whenever jOOQ runs something asynchronously, it will ask this SPI to provide a standard JDK Executor, which will be used to run the asynchronous code block. By default, this will be the JDK default (the default ForkJoinPool), as always. But you probably want to override this default, and you want to be in full control of this, and not think about it every single time you run a query.
MetaProvider: Whenever jOOQ needs to look up database meta information (schemas, tables, columns, types, etc.), it will ask this MetaProvider about the available meta information. By default, this will run queries on the JDBC DatabaseMetaData, which is good enough, but maybe you want to wire these calls to your jOOQ-generated classes, or something else.
RecordMapperProvider and RecordUnmapperProvider: jOOQ has a quite versatile default implementation of how to map between a jOOQ Record and an arbitrary Java class, supporting a variety of standard approaches including JavaBeans getter/setter naming conventions, JavaBeans @ConstructorProperties, and much more. These defaults apply e.g. when writing query.fetchInto(MyBean.class). But sometimes, the defaults are not good enough, and you want this particular mapping to work differently. Sure, you could write query.fetch(record -> mymapper(record)), but you may not want to remember this for every single query. Just override the mapper (and unmapper) at a single, central spot for your own chosen Configuration scope (e.g. per query, per request, per session, etc.) and you’re done.
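As announced above, here is a sketch of a minimal ExecuteListener based SQL logger, extending the DefaultExecuteListener convenience class and registered on a derived Configuration (the set(ExecuteListener...) convenience overload is assumed):

class SQLLogger extends DefaultExecuteListener {

    // Called by jOOQ right before a statement is executed
    @Override
    public void executeStart(ExecuteContext ectx) {
        System.out.println("Executing: " + ectx.sql());
    }
}

// Register the listener on a derived Configuration
DSLContext loggingCtx = DSL.using(ctx
    .configuration()
    .derive()
    .set(new SQLLogger()));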
Conclusion
Writing a simple API is difficult. Making it extensible in a simple way, however, is not. If your API has achieved “simplicity”, then it is very easy to support injecting arbitrary SPIs for arbitrary purposes at a single, central location, such as jOOQ’s Configuration.
Again, this is hard in terms of creating a simple API. But it is extremely easy when making this simple API extensible. Make your SPIs very easily discoverable. A jOOQ power user will always look for extension points in jOOQ’s Configuration. And because the extension points are explicit types which have to be implemented (as opposed to annotations and their magic), no documentation is needed to learn the SPI (of course it is still beneficial as a reference).
I’d love to hear your alternative approaches to this API design challenge in the comments.
Unit testing annotation processors is a bit more tricky than using them. Your processor hooks into the Java compiler and manipulates the compiled AST (or does other things). If you want to test your own processor, you need the test to run a Java compiler, but that is difficult to do in a normal project setup, especially if the expected behaviour for a given test is a compilation error.
Let’s assume we have the following two annotations:
@interface A {}
@interface B {}
And now, we would like to establish a rule that @A must always be accompanied by @B. For example:
// This must not compile
@A
class Bad {}
// This is fine
@A @B
class Good {}
We’ll enforce that with an annotation processor:
class AProcessor implements Processor {
boolean processed;
private ProcessingEnvironment processingEnv;
@Override
public Set<String> getSupportedOptions() {
return Collections.emptySet();
}
@Override
public Set<String> getSupportedAnnotationTypes() {
return Collections.singleton("*");
}
@Override
public SourceVersion getSupportedSourceVersion() {
return SourceVersion.RELEASE_8;
}
@Override
public void init(ProcessingEnvironment processingEnv) {
this.processingEnv = processingEnv;
}
@Override
public boolean process(Set<? extends TypeElement> annotations, RoundEnvironment roundEnv) {
for (TypeElement e1 : annotations)
if (e1.getQualifiedName().contentEquals(A.class.getName()))
for (Element e2 : roundEnv.getElementsAnnotatedWith(e1))
if (e2.getAnnotation(B.class) == null)
processingEnv.getMessager().printMessage(ERROR, "Annotation A must be accompanied by annotation B");
this.processed = true;
return false;
}
@Override
public Iterable<? extends Completion> getCompletions(Element element, AnnotationMirror annotation, ExecutableElement member, String userText) {
return Collections.emptyList();
}
}
Now, this works. We can easily verify that manually by adding the annotation processor to some Maven compiler configuration and by annotating a few classes with A and B. But then, someone changes the code and we don’t notice the regression. How can we unit test this, rather than doing things manually?
jOOR 0.9.10 support for annotation processors
jOOR is our little open source reflection library that we’re using internally in jOOQ
jOOR has a convenient API to invoke the javax.tools.JavaCompiler API through Reflect.compile(). The most recent release 0.9.10 now takes an optional CompileOptions argument where annotation processors can be registered.
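A unit test for the above AProcessor might then look like the following sketch. The CompileOptions.processors() method name is assumed from the 0.9.10 API, and the A annotation is assumed to be visible from the test package:

@Test
public void testFailAnnotationProcessing() {
    AProcessor p = new AProcessor();

    try {
        // Compiling a class annotated with @A but not @B must fail
        Reflect.compile(
            "org.joor.test.FailAnnotationProcessing",
            "package org.joor.test; " +
            "@A " +
            "public class FailAnnotationProcessing {}",
            new CompileOptions().processors(p)
        );
        Assert.fail("Expected a compilation error");
    }
    catch (ReflectException expected) {
        Assert.assertTrue(p.processed);
    }
}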
Yes. I am guilty. Evil? Don’t know. But guilty. I heavily use and abuse the java.lang.Boolean type to implement three valued logic in Java:
Boolean.TRUE means true (duh)
Boolean.FALSE means false
null can mean anything like “unknown” or “uninitialised”, etc.
I know – a lot of enterprise developers will bikeshed and cargo cult the old saying:
Code is read more often than it is written
But as with everything, there is a tradeoff. For instance, in algorithm-heavy, micro optimised library code, it is usually more important to have code that really performs well, rather than code that apparently doesn’t need comments because the author has written it in such a clear and beautiful way.
woot:
if (something) {
for (Object o : list)
if (something(o))
break woot;
throw new E();
}
Yes. You can break out of “labeled ifs”. Because in Java, any statement can be labeled, and if the statement is a compound statement (observe the curly braces following the if), then it may make sense to break out of it. Even if you’ve never seen that idiom, I think it’s quite immediately clear what it does.
Ghasp!
If Java were a bit more classic, it might have supported this syntax:
if (something) {
for (Object o : list)
if (something(o))
goto woot;
throw new E();
}
woot:
Nicolai suggested that the main reason I hadn’t written the following, equivalent, and arguably more elegant logic, is because jOOQ still supports Java 6:
if (something && list.stream().noneMatch(this::something))
throw new E();
It’s more concise! So, it’s better, right? Everything new is always better.
A third option would have been the less concise solution that essentially just replaces break by return:
if (something && noneMatchSomething(list))
throw new E();
// And then:
private boolean noneMatchSomething(List<?> list) {
for (Object o : list)
if (something(o))
return false;
return true;
}
There’s an otherwise useless method that has been extracted. The main benefit is that people are not used to breaking out of labeled statements (other than loops, and even then it’s rare), so this is again about some subjective “readability”. I personally find this particular example less readable, because the extracted method is no longer local. I have to jump around in the class and interrupt my train of thoughts. But of course, YMMV with respect to the two imperative alternatives.
Back to objectivity: Performance
When I tweet about Java these days, I’m mostly tweeting about my experience writing jOOQ. A library. A library that has been tuned so much over the past years, that the big client side bottleneck (apart from the obvious database call) is the internal StringBuilder that is used to generate dynamic SQL. And compared to most database queries, you will not even notice that.
But sometimes you do. E.g. if you’re using an in-memory H2 database and run some rather trivial queries, then jOOQ’s overhead can become measurable again. Yes. There are some use-cases, which I do want to take seriously as well, where the difference between an imperative loop and a stream pipeline is measurable.
With this simple example, break or return don’t matter. At some point, adding additional methods might start getting in the way of inlining (because of stacks getting too deep), but not creating additional methods might be getting in the way of inlining as well (because of method bodies getting too large). I don’t want to bet on either approach here at this level, nor is jOOQ tuned that much. Like most similar libraries, the traversal of the jOOQ expression tree generates stacks that are too deep to completely inline anyway.
But the very obvious loser here is the Stream approach, which is roughly 6.5x slower in this benchmark than the imperative approaches. This isn’t surprising. The stream pipeline has to be set up every single time to represent something as trivial as the above imperative loop. I’ve already blogged about this in the past, where I compared replacing simple for loops by Stream.forEach().
In your infrastructure logic? Maybe! If you’re writing a library, or if you’re using a library like jOOQ, then yes. Chances are that a lot of your logic is CPU bound. You should occasionally profile your application and spot such bottlenecks, both in your code and in third party libraries. E.g. in most of jOOQ’s internals, using a stream pipeline might be a very bad choice, because ultimately, jOOQ is something that might be invoked from within your loops, thus adding significant overhead to your application, if your queries are not heavy (e.g. again when run against an H2 in-memory database).
So, given that you’re clearly “micro-losing” on the performance side by using the Stream API, you may need to evaluate the readability tradeoff more carefully. When business logic is complex, readability is very important compared to micro optimisations. With infrastructure logic, it is much less likely so, in my opinion. And I’m not alone:
In Spring Data, we consistently observed Streams of any kind (and Optional) to add significant overhead over foreach loops so that we strictly avoid them for hot code paths.
Note: there’s that other cargo cult of premature optimisation going around. Yes, you shouldn’t worry about these details too early in your application implementation. But you should still know when to worry about them, and be aware of the tradeoffs.
And while you’re still debating what name to give to that extracted method, I’ve written 5 new labeled if statements! ;-)
Everyone knows the SQL SUM() aggregate function (and many people also know its window function variant).
When querying the Sakila database, we can get the daily revenue (using PostgreSQL syntax):
WITH p AS (
SELECT
CAST (payment_date AS DATE) AS date,
amount
FROM payment
)
SELECT
date,
SUM (amount) AS daily_revenue,
SUM (SUM (amount)) OVER (ORDER BY date) AS cumulative_revenue
FROM p
GROUP BY date
ORDER BY date
Now imagine we want to do the same thing with multiplication, e.g. accumulating a column of factors over time. This is exciting because we’re not only requiring multiplicative aggregation, but even cumulative multiplicative aggregation. So, another window function.
But regrettably, SQL doesn’t offer a MUL() aggregate function, even though it would be relatively simple to implement. We have two options:
Implementing a custom aggregate function (stay tuned for a future blog post)
Using a trick: summing logarithms rather than multiplying the operands directly (see the identity just below)
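The trick relies on a basic property of logarithms, which holds for strictly positive operands:

EXP(LN(x1) + LN(x2) + ... + LN(xn)) = x1 * x2 * ... * xn

For example, EXP(LN(2) + LN(3) + LN(5)) = EXP(LN(30)) = 30. Applied to our data, we simply sum the logarithms in a window function and exponentiate the result: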
SELECT
date,
factor,
EXP(SUM(LN(1000 * (1 + COALESCE(factor, 1))))
OVER (ORDER BY date)) AS accumulated
FROM t
And we get the nicely accumulated product. You may have to replace LN() by LOG() depending on your database.
Caveat: Negative numbers
Try running this:
SELECT LN(-1)
You’ll get:
SQL Error [2201E]: ERROR: cannot take logarithm of a negative number
Logarithms are defined only for strictly positive numbers, unless your database can handle complex numbers as well, in which case a single zero operand would still break the aggregation.
But if your data set is defined to contain only strictly positive numbers, you’ll be fine – give or take some floating point rounding errors. Or, you’ll do some sign handling, which looks like this:
WITH v(i) AS (VALUES (-2), (-3), (-4))
SELECT
CASE
WHEN SUM (CASE WHEN i < 0 THEN -1 END) % 2 < 0
THEN -1
ELSE 1
END * EXP(SUM(LN(ABS(i)))) multiplication1
FROM v;
With an even number of negative operands, the product is positive again:
WITH v(i) AS (VALUES (-2), (-3), (-4), (-5))
SELECT
CASE
WHEN SUM (CASE WHEN i < 0 THEN -1 END) % 2 < 0
THEN -1
ELSE 1
END * EXP(SUM(LN(ABS(i)))) multiplication2
FROM v;
If the data set contains a zero, however, LN(ABS(i)) fails again:
SQL Error [2201E]: ERROR: cannot take logarithm of zero
Zero is different from negative numbers. A product that has a zero operand is always zero, so we should be able to handle this. We’ll do it in two steps:
Exclude zero values from the actual aggregation that uses EXP() and LN()
Add an additional CASE expression that checks if any of the operands is zero
The first step might not be necessary depending on how your database optimiser executes the second step.
WITH v(i) AS (VALUES (2), (3), (0))
SELECT
CASE
WHEN SUM (CASE WHEN i = 0 THEN 1 END) > 0
THEN 0
WHEN SUM (CASE WHEN i < 0 THEN -1 END) % 2 < 0
THEN -1
ELSE 1
END * EXP(SUM(LN(ABS(NULLIF(i, 0))))) multiplication
FROM v;
Extension: DISTINCT
Calculating the product of all DISTINCT values requires repeating the DISTINCT keyword in 2 out of the above 3 SUM() calls:
WITH v(i) AS (VALUES (2), (3), (3))
SELECT
CASE
WHEN SUM (CASE WHEN i = 0 THEN 1 END) > 0
THEN 0
WHEN SUM (DISTINCT CASE WHEN i < 0 THEN -1 END) % 2 < 0
THEN -1
ELSE 1
END * EXP(SUM(DISTINCT LN(ABS(NULLIF(i, 0))))) multiplication
FROM v;
The result is now:
multiplication |
---------------|
6 |
Notice that the first SUM(), which checks for the presence of zero values, doesn’t require the DISTINCT keyword, so we omit it to improve performance.
Extension: Window functions
Of course, if we are able to emulate a PRODUCT() aggregate function, we’d love to turn it into a window function as well. This can be done simply by transforming each individual SUM() into a window function:
WITH v(i, j) AS (
VALUES (1, 2), (2, -3), (3, 4),
(4, -5), (5, 0), (6, 0)
)
SELECT i, j,
CASE
WHEN SUM (CASE WHEN j = 0 THEN 1 END)
OVER (ORDER BY i) > 0
THEN 0
WHEN SUM (CASE WHEN j < 0 THEN -1 END)
OVER (ORDER BY i) % 2 < 0
THEN -1
ELSE 1
END * EXP(SUM(LN(ABS(NULLIF(j, 0))))
OVER (ORDER BY i)) multiplication
FROM v;
Clock’s ticking. JDK 11 will remove a bunch of deprecated modules through JEP 320, which includes the Java EE modules, which again includes JAXB, a dependency of many libraries, including jOOQ. Thus far, few people have upgraded to Java 9 or 10, as these aren’t LTS releases. Unlike in the old days, however, people will be forced much earlier to upgrade to Java 11, because Java 8 (the free version) will reach end of life soon after Java 11 is released:
End of Public Updates for Oracle JDK 8
As outlined in the Oracle JDK Support Roadmap below, Oracle will not post further updates of Java SE 8 to its public download sites for commercial use after January 2019
So, we library developers must act and finally modularise our libraries. Which is, quite frankly, a pain. Not because of the module system itself, which works surprisingly well. But because of the toolchain, which is far from being production ready. This mostly includes:
It is still almost impossible to maintain a modularised project in an IDE (I’ve tried Eclipse and IntelliJ, not NetBeans so far), as there are still tons of bugs. Some of them are showstoppers, halting compilation in the IDE (despite compilation working in Maven).
But rather than just complaining, let’s complain and fix it
Let’s fix our own IDE by patching it
Disclaimer: The following procedure assumes that you have the right to modify your IDE’s source and binaries. To my understanding, this is the case with the EPL licensed Eclipse. It may not be the case for other IDEs.
Disclaimer 2: Note, as reddit user fubarbazqux so eloquently put it, there are cleaner ways to apply patches (and contribute them) to the Eclipse community, if you have more time. This article just displays a very easy way to do things without spending too much time figuring out how the Eclipse development processes work internally. It shows a QUICK FIX recipe.
The first bug was already discovered and fixed for Eclipse 4.8, but its RC4 version seems to have tons of other problems, so let’s not upgrade to that yet. Instead, let’s apply the fix that can be seen here to our own distribution:
Or just add all the available dependencies, it doesn’t really matter.
You can now open the type that you want to edit:
Now, simply copy the source code from the editor and paste it in a new class inside of your project, which you put in the same package as the original (split packages are still possible in this case, yay)
Inside of your copy, apply the desired patch and build the project. Since you already included all the dependencies, it will be easy to compile your copy of the class, and you don’t have to build the entirety of Eclipse.
Now, go to your Windows Explorer or Mac OS X Finder, or Linux shell or whatever and find the compiled class:
This class can now be copied into the Eclipse plugin. How to find the appropriate Eclipse plugin? Just go to your plugin dependencies and check out the location of the class you’ve opened earlier:
Open that plugin from your Eclipse distribution’s /plugins folder using 7zip or whatever zipping tool you prefer, and overwrite the original class file(s). You may need to close Eclipse first, before you can write to the plugin zip file. And it’s always a good idea to make backup copies of the original plugin(s).
Be careful that if your class has any nested classes, you will need to copy them all, e.g.
MyClass.class
MyClass$1.class // Anonymous class
MyClass$Nested.class // Named, nested class
No problemo, we can hack our way around that as well. Launch your normal Eclipse instance (not the “Eclipse IDE for Eclipse Committers” one) with a debug agent running, by adding the following lines to your eclipse.ini file:
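A typical JDWP agent configuration (the port number is arbitrary), added after the -vmargs line, looks like this:

-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=8000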
Launch Eclipse again, then connect to your Eclipse from your other “Eclipse IDE for Eclipse Committers” instance by connecting a debugger:
And start setting breakpoints wherever you need, e.g. here, in my case:
java.lang.NullPointerException
at org.eclipse.jdt.internal.compiler.problem.ProblemHandler.handle(ProblemHandler.java:145)
at org.eclipse.jdt.internal.compiler.problem.ProblemHandler.handle(ProblemHandler.java:226)
at org.eclipse.jdt.internal.compiler.problem.ProblemReporter.handle(ProblemReporter.java:2513)
at org.eclipse.jdt.internal.compiler.problem.ProblemReporter.deprecatedType(ProblemReporter.java:1831)
at org.eclipse.jdt.internal.compiler.problem.ProblemReporter.deprecatedType(ProblemReporter.java:1808)
at org.eclipse.jdt.internal.compiler.lookup.CompilationUnitScope.checkAndRecordImportBinding(CompilationUnitScope.java:960)
at org.eclipse.jdt.internal.compiler.lookup.CompilationUnitScope.faultInImports(CompilationUnitScope.java:471)
at org.eclipse.jdt.internal.compiler.lookup.CompilationUnitScope.faultInTypes(CompilationUnitScope.java:501)
at org.eclipse.jdt.internal.compiler.Compiler.process(Compiler.java:878)
at org.eclipse.jdt.internal.compiler.ProcessTaskManager.run(ProcessTaskManager.java:141)
at java.lang.Thread.run(Unknown Source)
And start analysing the problem like your own bugs. The nice thing is, you don’t have to fix the problem, just find it, and possibly comment out some lines of code if you think they’re not really needed. In my case, luckily, the regression was introduced by a new method that is applied to JDK 9+ projects only:
The method will check for the new @Deprecated(since="9") attribute on the @Deprecated annotation. Not an essential feature, so let’s just turn it off by adding this line to the source file:
String deprecatedSinceValue(Supplier<AnnotationBinding[]> annotations) {
if (true) return null;
// ...
}
This will effectively prevent the faulty logic from ever running. Not a fix, but a workaround. For more details about this specific issue, see the report. Of course, never forget to actually report the issue to Eclipse (or whatever your IDE is), so it can be fixed thoroughly for everyone else as well
Compile. Patch. Restart. Done!
Conclusion
Java is a cool platform. It has always been a very dynamic language at runtime, where compiled class files can be replaced by new versions at any moment, and re-loaded by the class loaders. This makes patching code by other vendors very easy, just:
Create a project containing the vendors’ code (or if you don’t have the code, the binaries)
Apply a fix / workaround to the Java class that is faulty (or if you don’t have the code, decompile the binaries if you are allowed to)
Compile your own version
Replace the version of the class file from the vendor by yours
Restart
This works with all software, including IDEs. In the case of jOOQ, all our customers have the right to modification, and they get the sources as well. We know how useful it is to be able to patch someone else’s code. This article shows it. Now, I can continue modularising jOOQ, and as a side product, improve the tool chain for everybody else as well.
Again, this article displayed a QUICK FIX approach (some call it “hack”). There are more thorough ways to apply patches / fixes, and contribute them back to the vendor.
Another, very interesting option would be to instrument your runtime and apply the fix only to byte code:
Could have used a Java agent to modify the class even without fixing it in the Eclipse source. Makes it easier to upgrade.
Again, I haven’t tried NetBeans yet (although I’ve heard its Java 9 support has been working very well for quite a while).
While IntelliJ’s Jigsaw support seems more advanced than Eclipse’s (still with a few flaws as well), it currently has a couple of performance issues when compiling projects like jOOQ or jOOλ. In a future blog post, I will show how to “fix” those by using a profiler, like:
Java Mission Control (can be used as a profiler, too)
YourKit
JProfiler
Profilers can be used to very easily track down the main source of a performance problem. I’ve reported a ton to Eclipse already. For instance, this one:
Where a lot of time is being spent in the processing of Task Tags, like:
TODO
FIXME
XXX
The great thing about profiling this is:
You can report a precise bug to the vendor
You can find the flawed feature and turn it off as a workaround. Turning off the above task tag feature was a no-brainer. I’m not even using the feature.
In this much overdue article, I will explain why I think that in almost all cases, you should implement a “database first” design in your application’s data models, rather than a “Java first” design (or whatever your client language is). The latter approach leads to a long road of pain and suffering once your project grows.
To my surprise, a small group of first-time jOOQ users seem to be appalled by the fact that jOOQ heavily relies on source code generation. No one keeps you from using jOOQ the way you want, and you don’t have to use code generation. But the default way to use jOOQ, according to the manual, is to start with a (legacy) database schema, reverse engineer it using jOOQ’s code generator to get a bunch of classes representing your tables, and then write type safe queries against those tables:
for (Record2<String, String> record : DSL.using(configuration)
// ^^^^^^^^^^^^^^^^^^^^^^^ Type information derived from the
// generated code referenced from the below SELECT clause
.select(ACTOR.FIRST_NAME, ACTOR.LAST_NAME)
// vvvvv ^^^^^^^^^^^^ ^^^^^^^^^^^^^^^ Generated names
.from(ACTOR)
.orderBy(1, 2)) {
// ...
}
There are different philosophies, advantages, and disadvantages regarding these manual/automatic approaches, which I don’t want to discuss in this article. But essentially, the point of generated code is that it provides a Java representation of something that we take for granted (a “truth”) either within or outside of our system. In a way, compilers do the same thing when they generate byte code, machine code, or some other type of source code from the original sources – we get a representation of our “truth” in a different language, for whatever reason. The pattern is always the same:
There is some truth (internal or external), like a specification, data model, etc.
We need a local representation of that truth in our programming language
And it almost always makes sense to generate the latter, to avoid redundancy.
Type providers and annotation processing
Noteworthy: Another, more modern approach to jOOQ’s particular code generation use-case would be Type Providers, as implemented by F#, in case of which the code is generated by the compiler while compiling. It never really exists in its source form. A similar (but less sophisticated) tool in Java is annotation processing, e.g. Lombok.
In a way, this does the same thing except:
You don’t see the generated code (perhaps that’s less appalling to some?)
You must ensure the types can be provided, i.e. the “truth” must always be available. Easy in the case of Lombok, which annotates the “truth”. A bit more difficult with database models, which rely on an always available live connection.
What’s the problem with code generation?
Apart from the tricky question whether to trigger code generation manually or automatically, some people seem to think that code must not be generated at all. The reason I hear the most is the idea that it is difficult to set up in a build pipeline. And yes, that is true. There is extra infrastructure overhead. Especially if you’re new to a certain product (like jOOQ, or JAXB, or Hibernate, etc.), setting up an environment takes time you would rather spend learning the API itself and getting value out of it.
If the overhead of learning how the code generator works is too high, then indeed, the API failed to make the code generator easy to use (and later on, to customise). That should be a high priority for any such API. But that’s the only argument against code generation. Other than that, it makes absolutely no sense at all to hand-write the local representation of the internal or external truth.
Many people argue that they don’t have time for that stuff. They need to ship their MVPs. They can finalise their build pipelines later. I say:
“But Hibernate / JPA makes coding Java first easy”
Yes that’s true. And it’s both a bliss and a curse for Hibernate and its users. In Hibernate, you can just write a couple of entities, such as:
@Entity
class Book {
@Id
int id;
String title;
}
And you’re almost set. Let Hibernate generate the boring “details” of how to define this entity in your SQL dialect’s DDL:
CREATE TABLE book (
id INTEGER GENERATED ALWAYS AS IDENTITY,
title VARCHAR(50),
CONSTRAINT pk_book PRIMARY KEY (id)
);
CREATE INDEX i_book_title ON book (title);
… and start running the application. That’s really cool to get started quickly and to try out things.
But, huh, wait. I cheated.
Will Hibernate really apply that named primary key definition?
Will it create the index on TITLE, which I know we’ll need?
Will it add an identity specification?
Probably not. While you’re developing your greenfield project, it is convenient to always throw away your entire database and re-generate it from scratch, once you’ve added the additional annotations. So, the Book entity would eventually look like this:
@Entity
@Table(name = "book", indexes = {
@Index(name = "i_book_title", columnList = "title")
})
class Book {
@Id
@GeneratedValue(strategy = IDENTITY)
int id;
String title;
}
Cool. Re-generate. Again, this makes it really easy to get started.
But you’ll pay the price later on
At some point, you go to production. And that’s when this model no longer works, because once you go live, you can no longer throw away your database, as it has become legacy.
From now on, you have to write DDL migration scripts, e.g. using Flyway. And then, what happens to your entities? You can either adapt them manually (so you double the work), or have Hibernate re-generate them for you (how big are your chances of the generation matching your expectations?) You can only lose.
Because once you go to production, you need hotfixes. And those have to go live fast. And since you didn’t prepare for pipelining your migrations to production smoothly, you’ll patch things wildly. And then you run out of time to do it right™. And you’ll blame Hibernate, because it’s always someone else’s fault…
Instead, you could have done things entirely differently from the beginning. Like using those round wheels.
Go Database First
The real “truth” of your database schema, and the “sovereignty” over it, resides with your database. The database is the only place where the schema is defined, and all clients have a copy of the database schema, not vice versa. The data is in your database, not in your client, so it makes perfect sense to enforce the schema and its integrity in the database, right where the data is.
And that’s not where it ends. For instance, if you’re using Oracle, you may want to specify:
In what tablespace your table resides
What PCTFREE value it has
What the cache size of your sequence (behind the identity) is
Maybe, all of this doesn’t matter in small systems, but you don’t have to go “big data” before you can profit from vendor-specific storage optimisations such as the above. None of the ORMs I’ve ever seen (including jOOQ) will allow you to use the full set of DDL options that you may want to use on your database. ORMs offer some tools to help you write DDL.
But ultimately, a well-designed schema is hand written in DDL. All generated DDL is only an approximation of that.
What about the client model?
As mentioned before, you will need a copy of your database schema in your client, a client representation. Needless to say that this client representation needs to be in-sync with the real model. How to best do that? By using a code generator.
All databases expose their meta information through SQL. Here’s how to get all tables from your database in various SQL dialects:
-- H2, HSQLDB, MySQL, PostgreSQL, SQL Server
SELECT table_schema, table_name
FROM information_schema.tables
-- DB2
SELECT tabschema, tabname
FROM syscat.tables
-- Oracle
SELECT owner, table_name
FROM all_tables
-- SQLite
SELECT name
FROM sqlite_master
-- Teradata
SELECT databasename, tablename
FROM dbc.tables
These queries (or similar ones, e.g. depending on whether views, materialised views, table valued functions should also be considered) are also run by JDBC’s DatabaseMetaData.getTables() call, or by the jOOQ-meta module.
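For example, a JDBC based sketch (dataSource being any configured javax.sql.DataSource) that lists all tables, and whose output could feed a trivial String constant generator:

try (Connection con = dataSource.getConnection();
    ResultSet rs = con.getMetaData().getTables(
        null, null, "%", new String[] { "TABLE" })) {

    // Print schema-qualified table names
    while (rs.next())
        System.out.println(
            rs.getString("TABLE_SCHEM") + "." + rs.getString("TABLE_NAME"));
}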
From the result of such queries, it’s relatively easy to generate any client representation of your database model, regardless what your client technology is.
If you’re using JDBC or Spring, you can create a bunch of String constants
If you’re using JPA, you can generate the entities themselves
If you’re using jOOQ, you can generate the jOOQ meta model
Now, any database increment will automatically lead to updated client code. For instance, imagine:
ALTER TABLE book RENAME COLUMN title TO book_title;
Would you really want to do this work twice? No way. Just commit the DDL, run it through your build pipeline, and have an updated entity:
@Entity
@Table(name = "book", indexes = {
// Would you have thought of this?
@Index(name = "i_book_title", columnList = "book_title")
})
class Book {
@Id
@GeneratedValue(strategy = IDENTITY)
int id;
@Column(name = "book_title")
String bookTitle;
}
Or an updated jOOQ class. Plus: Your client code might no longer compile, which can be a good thing! Most DDL changes are also semantic changes, not just syntactic ones. So, it’s great to be able to see in compiled client source code, what code is (or may be) affected by your database increment.
A single truth
Regardless what technology you’re using, there’s always one model that contains the single truth for a subsystem – or at least, we should aim for this goal and avoid the enterprisey mess where “truth” is everywhere and nowhere. It just makes everything much simpler. If you exchange XML files with some other system, you’re going to use XSD. Like jOOQ’s INFORMATION_SCHEMA meta model in XML form: https://www.jooq.org/xsd/jooq-meta-3.10.0.xsd
XSD is well understood
XSD specifies XML content very well, and allows for validation in all client languages
XSD can be versioned easily, and evolved backwards compatibly
XSD can be translated to Java code using XJC
The last bullet is important. When communicating with an external system through XML messages, we want to be sure our messages are valid. That’s really really easy to do with JAXB, XJC, and XSD. It would be outright nuts to think that a Java-first approach where we design our messages as Java objects could somehow be reasonably mapped to XML for someone else to consume. That generated XML would be of very poor quality, undocumented, and hard to evolve. If there’s an SLA on such an interface, we’d be screwed.
Frankly, that’s what happens to JSON APIs all the time, but that’s another story, another rant…
Databases: Same thing
When you’re using databases, it’s the same thing. The database owns its data and it should be the master of the schema. All modifications to the schema should be implemented using DDL directly, to update the single truth.
Once that truth is updated, all clients need to update their copies of the model as well. Some clients may be written in Java, using either (or both of) jOOQ and Hibernate, or JDBC. Other clients may be written in Perl (good luck to them). Even other clients may be written in C#. It doesn’t matter. The main model is in the database. ORM-generated models are of poor quality, not well documented, and hard to evolve.
So, don’t do it. And, don’t do it from the very beginning. Instead, go database first. Build a deployment pipeline that can be automated. Include code generators to copy your database model back into the clients. And stop worrying about code generation. It’s a good thing. You’ll be productive. All it takes is a bit of initial effort to set it up, and you’ll get years of improved productivity for the rest of your project.
Thank me later.
Clarification
Just to be sure: This article in no way asserts that your database model should be imposed on your entire system (e.g. your domain, your business logic, etc. etc.). The claim I made here is that client code interacting with the database should act upon the database model, and not have its own first class model of the database instead. This logic typically resides in the data access layer of your client.
In 2-tier architectures, which still have their place sometimes, that may be the only model of your system. In most systems, however, I consider the data access layer a “subsystem” that encapsulates the database model. So, there.
Exceptions
There are always exceptions, and I promised that the database first and code generation approach may not always be the right choice. These exceptions are (probably not exhaustive):
When the schema is unknown and must be discovered. E.g. you’re a tool vendor helping users navigate any schema. Duh… No code generation. But still database first.
When the schema needs to be generated on the fly for some task. This sounds a lot like a more or less sophisticated version of the entity attribute value pattern, i.e. you don’t really have a well-defined schema. In that case, it’s often not even sure if an RDBMS will be the right choice.
The nature of exceptions is that they’re exceptional. In the majority of RDBMS usage, the schema is known in advance, placed inside the RDBMS as the single source of “truth”, and clients will have derived copies from it – ideally generated using a code generator.
When inserting records into SQL databases, we often want to fetch back generated IDs and possibly other trigger, sequence, or default generated values. Let’s assume we have the following table:
-- DB2
CREATE TABLE x (
i INT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
j VARCHAR(50),
k DATE DEFAULT CURRENT_DATE
);
-- PostgreSQL
CREATE TABLE x (
i SERIAL4 PRIMARY KEY,
j VARCHAR(50),
k DATE DEFAULT CURRENT_DATE
);
-- Oracle
CREATE TABLE x (
i INT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
j VARCHAR2(50),
k DATE DEFAULT SYSDATE
);
DB2
DB2 is the only database currently supported by jOOQ that implements the SQL standard feature of selecting from any INSERT statement, including:
SELECT *
FROM FINAL TABLE (
INSERT INTO x (j)
VALUES ('a'), ('b'), ('c')
);
Pretty neat! This query can simply be run like any other query in JDBC, and you don’t have to go through any hassles.
PostgreSQL and Firebird
These databases have a vendor specific extension that does the same thing, almost as powerful:
-- Simple INSERT .. RETURNING query
INSERT INTO x (j)
VALUES ('a'), ('b'), ('c')
RETURNING *;
-- If you want to do more fancy stuff
WITH t AS (
INSERT INTO x (j)
VALUES ('a'), ('b'), ('c')
RETURNING *
)
SELECT * FROM t;
Both syntaxes work equally well. The latter is just as powerful as DB2’s, where the result of an insertion (or update, delete, merge) can be joined to other tables. Again, no problem with JDBC.
Oracle
In Oracle, this is a bit more tricky. The Oracle SQL language doesn’t have an equivalent of DB2’s FINAL TABLE (DML statement). The Oracle PL/SQL language, however, does support the same syntax as PostgreSQL and Firebird. This is perfectly valid PL/SQL:
-- Create a few auxiliary types first
CREATE TYPE t_i AS TABLE OF NUMBER(38);
/
CREATE TYPE t_j AS TABLE OF VARCHAR2(50);
/
CREATE TYPE t_k AS TABLE OF DATE;
/
DECLARE
-- These are the input values
in_j t_j := t_j('a', 'b', 'c');
out_i t_i;
out_j t_j;
out_k t_k;
c1 SYS_REFCURSOR;
c2 SYS_REFCURSOR;
c3 SYS_REFCURSOR;
BEGIN
-- Use PL/SQL's FORALL command to bulk insert the
-- input array type and bulk return the results
FORALL i IN 1 .. in_j.COUNT
INSERT INTO x (j)
VALUES (in_j(i))
RETURNING i, j, k
BULK COLLECT INTO out_i, out_j, out_k;
-- Fetch the results and display them to the console
OPEN c1 FOR SELECT * FROM TABLE(out_i);
OPEN c2 FOR SELECT * FROM TABLE(out_j);
OPEN c3 FOR SELECT * FROM TABLE(out_k);
dbms_sql.return_result(c1);
dbms_sql.return_result(c2);
dbms_sql.return_result(c3);
END;
/
A bit verbose, but it has the same effect. Now, from JDBC:
try (Connection con = DriverManager.getConnection(url, props);
Statement s = con.createStatement();
// The statement itself is much more simple as we can
// use OUT parameters to collect results into, so no
// auxiliary local variables and cursors are needed
CallableStatement c = con.prepareCall(
"DECLARE "
+ " v_j t_j := ?; "
+ "BEGIN "
+ " FORALL j IN 1 .. v_j.COUNT "
+ " INSERT INTO x (j) VALUES (v_j(j)) "
+ " RETURNING i, j, k "
+ " BULK COLLECT INTO ?, ?, ?; "
+ "END;")) {
try {
// Create the table and the auxiliary types
s.execute(
"CREATE TABLE x ("
+ " i INT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,"
+ " j VARCHAR2(50),"
+ " k DATE DEFAULT SYSDATE"
+ ")");
s.execute("CREATE TYPE t_i AS TABLE OF NUMBER(38)");
s.execute("CREATE TYPE t_j AS TABLE OF VARCHAR2(50)");
s.execute("CREATE TYPE t_k AS TABLE OF DATE");
// Bind input and output arrays
c.setArray(1, ((OracleConnection) con).createARRAY(
"T_J", new String[] { "a", "b", "c" })
);
c.registerOutParameter(2, Types.ARRAY, "T_I");
c.registerOutParameter(3, Types.ARRAY, "T_J");
c.registerOutParameter(4, Types.ARRAY, "T_K");
// Execute, fetch, and display output arrays
c.execute();
Object[] i = (Object[]) c.getArray(2).getArray();
Object[] j = (Object[]) c.getArray(3).getArray();
Object[] k = (Object[]) c.getArray(4).getArray();
System.out.println(Arrays.asList(i));
System.out.println(Arrays.asList(j));
System.out.println(Arrays.asList(k));
}
finally {
try {
s.execute("DROP TYPE t_i");
s.execute("DROP TYPE t_j");
s.execute("DROP TYPE t_k");
s.execute("DROP TABLE x");
}
catch (SQLException ignore) {}
}
}
This will correctly emulate the query for all of the databases that natively support the syntax. In the case of Oracle, since jOOQ cannot create nor assume any SQL TABLE types, PL/SQL types from the DBMS_SQL package will be used.
Something that has been said many times, but needs constant repeating until every developer is aware of the importance of this is the performance difference between row-by-row updating and bulk updating. If you cannot guess which one will be much faster, remember that row-by-row kinda rhymes with slow-by-slow (hint hint).
Disclaimer: This article will discuss only non-concurrent updates, which are much easier to reason about. In a concurrent update situation, a lot of additional factors will add complexity to the problem, including the locking strategy, transaction isolation levels, or simply how the database vendor implements things in detail. For the sake of simplicity, I’ll assume no concurrent updates are being made.
Example query
Let’s say we have a simple table for our blog posts (using Oracle syntax, but the effect is the same on all databases):
CREATE TABLE post (
id INT NOT NULL PRIMARY KEY,
text VARCHAR2(1000) NOT NULL,
archived NUMBER(1) NOT NULL CHECK (archived IN (0, 1)),
creation_date DATE NOT NULL
);
CREATE INDEX post_creation_date_i ON post (creation_date);
Now, let’s add some 10000 rows:
INSERT INTO post
SELECT
level,
lpad('a', 1000, 'a'),
0 AS archived,
DATE '2017-01-01' + (level / 100)
FROM dual
CONNECT BY level <= 10000;
EXEC dbms_stats.gather_table_stats('TEST', 'POST');
Now imagine we want to update this table and set all posts to ARCHIVED = 1 if they are from last year, e.g. CREATION_DATE < DATE '2018-01-01'. There are various ways to do this, but you should have built an intuition that doing the update in one single UPDATE statement is probably better than looping over each individual row and updating each individual row explicitly. Right?
Right.
Then, why do we keep doing it?
Let me ask this differently:
Does it matter?
The best way to find out is to benchmark. I’m doing two benchmarks for this:
One that is run in PL/SQL, showing the performance difference between different approaches that are available to PL/SQL (namely looping, the FORALL syntax, and a single bulk UPDATE)
One that is run in Java, doing JDBC calls, showing the performance difference between different approaches available to Java (namely looping, caching PreparedStatement but still looping, batching, and a single bulk UPDATE)
The results of the PL/SQL benchmark are as follows:
Run 1, Statement 1 : .01457 (avg : .0098)
Run 1, Statement 2 : .0133 (avg : .01291)
Run 1, Statement 3 : .02351 (avg : .02519)
Run 2, Statement 1 : .00882 (avg : .0098)
Run 2, Statement 2 : .01159 (avg : .01291)
Run 2, Statement 3 : .02348 (avg : .02519)
Run 3, Statement 1 : .01012 (avg : .0098)
Run 3, Statement 2 : .01453 (avg : .01291)
Run 3, Statement 3 : .02544 (avg : .02519)
Run 4, Statement 1 : .00799 (avg : .0098)
Run 4, Statement 2 : .01346 (avg : .01291)
Run 4, Statement 3 : .02958 (avg : .02519)
Run 5, Statement 1 : .00749 (avg : .0098)
Run 5, Statement 2 : .01166 (avg : .01291)
Run 5, Statement 3 : .02396 (avg : .02519)
These numbers show the time it takes for each statement type to complete, each time updating 3649 / 10000 rows. The difference between Statement 1 and 3 is a factor of 2.5x. The winner is:
Statement 1, running a bulk update
It looks like this:
UPDATE post
SET archived = 1
WHERE archived = 0 AND creation_date < DATE '2018-01-01';
Runner-up (not too far away) is:
Statement 2, using the PL/SQL FORALL syntax
It works like this:
DECLARE
TYPE post_ids_t IS TABLE OF post.id%TYPE;
v_post_ids post_ids_t;
BEGIN
SELECT id
BULK COLLECT INTO v_post_ids
FROM post
WHERE archived = 0 AND creation_date < DATE '2018-01-01';
FORALL i IN 1 .. v_post_ids.count
UPDATE post
SET archived = 1
WHERE id = v_post_ids(i);
END;
Loser (by a factor of 2.5x on our specific data set) is:
Statement 3, using an ordinary LOOP and running row-by-row updates
BEGIN
FOR rec IN (
SELECT id
FROM post
WHERE archived = 0 AND creation_date < DATE '2018-01-01'
) LOOP
UPDATE post
SET archived = 1
WHERE id = rec.id;
END LOOP;
END;
It does not really come as a surprise. We’re switching between the PL/SQL engine and the SQL engine many, many times, and also, instead of running through the post table only once in O(N) time, we’re looking up individual ID values in O(log N) time, N times, so the complexity went from O(N) to O(N log N).
We’d get far worse results for larger tables!
What about doing this from Java?
The difference is much more drastic if each call to the SQL engine has to be done over the network from another process. Again, the benchmark code is available from a gist, and I will paste it to the end of this blog post as well.
The result is (same time unit, printed as ISO-8601 durations):
Run 0, Statement 1: PT4.546S
Run 0, Statement 2: PT3.52S
Run 0, Statement 3: PT0.144S
Run 0, Statement 4: PT0.028S
Run 1, Statement 1: PT3.712S
Run 1, Statement 2: PT3.185S
Run 1, Statement 3: PT0.138S
Run 1, Statement 4: PT0.025S
Run 2, Statement 1: PT3.481S
Run 2, Statement 2: PT3.007S
Run 2, Statement 3: PT0.122S
Run 2, Statement 4: PT0.026S
Run 3, Statement 1: PT3.518S
Run 3, Statement 2: PT3.077S
Run 3, Statement 3: PT0.113S
Run 3, Statement 4: PT0.027S
Run 4, Statement 1: PT3.54S
Run 4, Statement 2: PT2.94S
Run 4, Statement 3: PT0.123S
Run 4, Statement 4: PT0.03S
The difference between Statement 1 and 4 is a factor of 100x!!
So, who’s winning? Again (by far):
Statement 4, running the bulk update
In fact, the time is not too far away from the time taken by PL/SQL. With larger data sets being updated, the two results will converge. The code boils down to a single JDBC call that executes the bulk UPDATE shown above.
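A minimal sketch of that call (not the original benchmark code from the gist), assuming an open JDBC Connection c as in the other listings:
try (Statement s = c.createStatement()) {
    // One single statement: the database updates all matching rows itself
    s.executeUpdate(
        "UPDATE post\n"
      + "SET archived = 1\n"
      + "WHERE archived = 0\n"
      + "AND creation_date < DATE '2018-01-01'");
}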
Followed by the runner-up, which is not that much worse (but still about 3.5x slower):
Statement 3, running the batch update
Batching can be compared to PL/SQL’s FORALL statement. While we’re running individual row-by-row updates, we’re sending all the update statements in one batch to the SQL engine. This does save a lot of time on the network and all the layers in between.
The code looks like this:
try (Statement s = c.createStatement();
ResultSet rs = s.executeQuery(
"SELECT id FROM post WHERE archived = 0\n"
+ "AND creation_date < DATE '2018-01-01'"
);
PreparedStatement u = c.prepareStatement(
"UPDATE post SET archived = 1 WHERE id = ?"
)) {
while (rs.next()) {
u.setInt(1, rs.getInt(1));
u.addBatch();
}
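// Send all queued updates to the server in a single batch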
u.executeBatch();
}
Followed by the losers:
Statements 1 and 2, running row-by-row updates
The difference between Statement 1 and Statement 2 is that Statement 2 caches the PreparedStatement, which allows for reusing some resources. This can be a good thing, but it didn’t have a very significant effect in our case, compared to the batch / bulk alternatives. The code is:
// Statement 1:
try (Statement s = c.createStatement();
ResultSet rs = s.executeQuery(
"SELECT id FROM post\n"
+ "WHERE archived = 0\n"
+ "AND creation_date < DATE '2018-01-01'"
)) {
while (rs.next()) {
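// A new PreparedStatement is created (and closed) for every single row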
try (PreparedStatement u = c.prepareStatement(
"UPDATE post SET archived = 1 WHERE id = ?"
)) {
u.setInt(1, rs.getInt(1));
u.executeUpdate();
}
}
}
// Statement 2:
try (Statement s = c.createStatement();
ResultSet rs = s.executeQuery(
"SELECT id FROM post\n"
+ "WHERE archived = 0\n"
+ "AND creation_date < DATE '2018-01-01'"
);
PreparedStatement u = c.prepareStatement(
"UPDATE post SET archived = 1 WHERE id = ?"
)) {
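// The PreparedStatement is prepared once and reused for every row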
while (rs.next()) {
u.setInt(1, rs.getInt(1));
u.executeUpdate();
}
}
Conclusion
As shown previously on this blog, there is a significant cost of JDBC server roundtrips, which can be seen in the JDBC benchmark. This cost is much more severe if we unnecessarily create many server roundtrips for a task that could be done in a single roundtrip, namely by using a SQL bulk UPDATE statement.
This is not only true for updates, but also for all the other statements, including SELECT, DELETE, INSERT, and MERGE. If doing everything in a single statement isn’t possible due to the limitations of SQL, we can still save roundtrips by grouping statements in a block, either by using anonymous blocks in databases that support them:
BEGIN
statement1;
statement2;
statement3;
END;
(you can easily send these anonymous blocks over JDBC, as well!)
Or by emulating anonymous blocks using the JDBC batch API (which has its limitations), or by writing stored procedures.
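For example, sending the anonymous block above over JDBC could look like this (a minimal sketch; statement1, statement2 and statement3 are placeholders, and c is an open JDBC Connection):
try (Statement s = c.createStatement()) {
    // One execute() call runs all three statements on the server
    s.execute(
        "BEGIN\n"
      + "  statement1;\n"
      + "  statement2;\n"
      + "  statement3;\n"
      + "END;");
}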
The performance gain is not always worth the trouble of moving logic from the client to the server, but very often (as in the above case), the move is a no-brainer and there’s absolutely no reason against it.
So, remember: Stop doing row-by-row (slow-by-slow) operations when you could run the same operation in bulk, in a single SQL statement.
Hint: Always know what your ORM (if you’re using one) is doing, because the ORM can help you with automatic batching / bulking in many cases. But often it cannot, or it is too difficult to make it do so, in which case resorting to SQL is the way to go.
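For example, with Hibernate as the JPA provider, JDBC batching is typically enabled through configuration properties. A minimal sketch (the persistence unit name "my-unit" is a placeholder):
// Hibernate-specific settings; the property names are standard Hibernate configuration keys
Map<String, String> props = new HashMap<>();
props.put("hibernate.jdbc.batch_size", "50");  // group up to 50 statements per JDBC batch
props.put("hibernate.order_updates", "true");  // order updates so identical statements can be batched together
EntityManagerFactory emf =
    Persistence.createEntityManagerFactory("my-unit", props);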
Appendix: PL/SQL benchmark code
SET SERVEROUTPUT ON
DROP TABLE post;
CREATE TABLE post (
id INT NOT NULL PRIMARY KEY,
text VARCHAR2(1000) NOT NULL,
archived NUMBER(1) NOT NULL CHECK (archived IN (0, 1)),
creation_date DATE NOT NULL
);
CREATE INDEX post_creation_date_i ON post (creation_date);
ALTER SYSTEM FLUSH SHARED_POOL;
ALTER SYSTEM FLUSH BUFFER_CACHE;
CREATE TABLE results (
run NUMBER(2),
stmt NUMBER(2),
elapsed NUMBER
);
DECLARE
v_ts TIMESTAMP WITH TIME ZONE;
PROCEDURE reset_post IS
BEGIN
EXECUTE IMMEDIATE 'TRUNCATE TABLE post';
INSERT INTO post
SELECT
level AS id,
lpad('a', 1000, 'a') AS text,
0 AS archived,
DATE '2017-01-01' + (level / 100) AS creation_date
FROM dual
CONNECT BY level <= 10000;
dbms_stats.gather_table_stats('TEST', 'POST');
END reset_post;
BEGIN
-- Repeat the whole benchmark several times to avoid warmup penalty
FOR r IN 1..5 LOOP
reset_post;
v_ts := SYSTIMESTAMP;
UPDATE post
SET archived = 1
WHERE archived = 0 AND creation_date < DATE '2018-01-01';
INSERT INTO results VALUES (r, 1, SYSDATE + ((SYSTIMESTAMP - v_ts) * 86400) - SYSDATE);
reset_post;
v_ts := SYSTIMESTAMP;
DECLARE
TYPE post_ids_t IS TABLE OF post.id%TYPE;
v_post_ids post_ids_t;
BEGIN
SELECT id
BULK COLLECT INTO v_post_ids
FROM post
WHERE archived = 0 AND creation_date < DATE '2018-01-01';
FORALL i IN 1 .. v_post_ids.count
UPDATE post
SET archived = 1
WHERE id = v_post_ids(i);
END;
INSERT INTO results VALUES (r, 2, SYSDATE + ((SYSTIMESTAMP - v_ts) * 86400) - SYSDATE);
reset_post;
v_ts := SYSTIMESTAMP;
FOR rec IN (
SELECT id
FROM post
WHERE archived = 0 AND creation_date < DATE '2018-01-01'
) LOOP
UPDATE post
SET archived = 1
WHERE id = rec.id;
END LOOP;
INSERT INTO results VALUES (r, 3, SYSDATE + ((SYSTIMESTAMP - v_ts) * 86400) - SYSDATE);
END LOOP;
FOR rec IN (
SELECT
run, stmt,
CAST(elapsed AS NUMBER(10, 5)) ratio,
CAST(AVG(elapsed) OVER (PARTITION BY stmt) AS NUMBER(10, 5)) avg_ratio
FROM results
ORDER BY run, stmt
)
LOOP
dbms_output.put_line('Run ' || rec.run ||
', Statement ' || rec.stmt ||
' : ' || rec.ratio || ' (avg : ' || rec.avg_ratio || ')');
END LOOP;
dbms_output.put_line('');
dbms_output.put_line('Copyright Data Geekery GmbH');
dbms_output.put_line('https://www.jooq.org/benchmark');
END;
/
DROP TABLE results;