Archive by Author | lukaseder

How to Access a Method’s Result Value From the Finally Block


While the JVM is a stack-based machine, the Java language doesn’t really offer you any way to access that stack – even though, on rare occasions, such access would be very useful.

An example

Method result values are put on the stack. Consider the following example:

public int method() {
    if (something)
        return 1;

    ...
    if (somethingElse)
        return 2;

    ...
    return 0;
}

If we ignore the halting problem, error handling, and other academic discussions, we can say that the above method will “certainly” return one of the values 1, 2, or 0. And that value is put on the stack prior to jumping out of the method.

Now, sometimes, there may be a use-case for taking some action only when a given result value is returned. People might then be lured into the old flame-war about whether multiple return statements are EVIL™ and whether the whole method should have been phrased like this, instead:

public int method() {
    int result = 0;

    if (something)
        result = 1;

    ...
    if (somethingElse)
        result = 2;

    ...
    // Important action here prior to return
    if (result == 1337)
        log.info("hehehe ;-)");

    return result;
}

Of course, the above example is wrong, because, previously, the if (something) return 1 and if (somethingElse) return 2 statements immediately aborted method execution. In order to achieve the same with the “single-return-statement” technique, we’ll have to rewrite our code like this:

public int method() {
    int result = 0;

    if (something)
        result = 1;
    else {

        ...
        if (somethingElse)
            result = 2;
        else {
            ...
        }
    }

    // Important action here prior to return
    if (result == 1337)
        log.info("hehehe ;-)");

    return result;
}

… and, of course, we can continue bike-shedding and flame-warring about the use of curly braces and/or indentation levels, which shows we haven’t gained anything.

Accessing the return value from the stack

What we really wanted to do in our original implementation is to check, just before returning, what value is on the stack, i.e. what value will be returned. Here’s some pseudo-Java:

public int method() {
    try {
        if (something)
            return 1;

        ...
        if (somethingElse)
            return 2;

        ...
        return 0;
    }

    // Important action here prior to return
    finally {
        if (reflectionMagic.methodResult == 1337)
            log.info("hehehe ;-)");
    }
}

The good news is: Yes we can! Here’s a simple trick to achieve the above:

public int method() {
    int result = 0;

    try {
        if (something)
            return result = 1;

        ...
        if (somethingElse)
            return result = 2;

        ...
        return result = 0;
    }

    // Important action here prior to return
    finally {
        if (result == 1337)
            log.info("hehehe ;-)");
    }
}

The less good news is: You must never forget to explicitly assign the result. But every once in a while, this technique can be very useful to “access the method stack” when the Java language doesn’t really allow you to.
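To illustrate the hazard: if a single return forgets the assignment, the finally block silently operates on a stale value. A minimal sketch, using the same something and log members as above:

public int method() {
    int result = 0;

    try {
        if (something)
            return 1337; // oops: should have been "return result = 1337"

        return result = 0;
    }

    finally {
        // result is still 0 when the first branch returns 1337,
        // so this action is silently skipped
        if (result == 1337)
            log.info("hehehe ;-)");
    }
}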

Of course…

Of course you could also just resort to this boring solution here:

public int method() {
    int result = actualMethod();

    if (result == 1337)
        log.info("hehehe ;-)");

    return result;
}

public int actualMethod() {
    if (something)
        return 1;

    ...
    if (somethingElse)
        return 2;

    ...
    return 0;
}

… and probably, most often, this technique is indeed better (because it is slightly more readable). But sometimes, you want to do more than just logging in that finally block, or you want to access more than just the result value, and you don’t want to refactor the method. An example of that follows below.
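For example, here’s a sketch of capturing intermediate state for the finally block with the same trick – again assuming the something / somethingElse / log members from the examples above:

public int method() {
    int result = 0;
    String detail = "init";

    try {
        detail = "first check";
        if (something)
            return result = 1;

        detail = "second check";
        if (somethingElse)
            return result = 2;

        return result = 0;
    }

    // The finally block now sees intermediate state, too
    finally {
        if (result == 1337)
            log.info("hehehe ;-) (after " + detail + ")");
    }
}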

Other approaches?

Now it’s your turn. What would be your preferred, alternative approach (with code examples)? E.g. using a Try monad? Or aspects?
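For the record, here’s a minimal sketch of one such alternative – extracting the method body into a lambda and doing the post-processing in a single wrapper. It assumes the same something / somethingElse / log members as above, plus java.util.function.IntSupplier:

int checked(IntSupplier body) {
    int result = body.getAsInt();

    // Important action here prior to return
    if (result == 1337)
        log.info("hehehe ;-)");

    return result;
}

public int method() {
    return checked(() -> {
        if (something)
            return 1;

        if (somethingElse)
            return 2;

        return 0;
    });
}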

Use This Preference to Speed up Your Eclipse m2e Configuration


Who doesn’t know them? The good old JFace dialogs in Eclipse that give you a visual representation of what is really a rather simple XML or properties file. In the case of m2e, it looks like this:

[Screenshot: the default m2e overview editor]

Unfortunately, this screen is a bit slow to load, and it doesn’t offer much value beyond checking version numbers and some other stuff that you’ll never change again. If you’re deep into using Maven, you’ll put plugins all over the place, and there is no way to visually manage the plugins in this screen. There can’t be, because plugins may contain “schemaless” XML, for which it would be impossible to design a non-XML editor.

So, most Eclipse/Maven/m2e users will immediately jump to the pom.xml representation, i.e. the tab on the far right, to edit the XML sources of their pom.xml files.

Luckily, you can change your workspace settings and make opening the XML tab the default for your pom.xml files:

[Screenshot: the workspace preference that makes the XML tab the default editor for pom.xml files]

A setting which, I believe, should be the default in new installations of Eclipse anyway. If you agree, show this issue some love:

https://bugs.eclipse.org/bugs/show_bug.cgi?id=465882

Functional Programming in Java 8 with Javaslang


We’re very happy to announce a guest post on the jOOQ Blog written by Daniel Dietrich, Senior Software Engineer at HSH Nordbank, husband and father of three. He currently creates a pricing framework for financial products as project lead and lead developer.

[Photo: Daniel Dietrich]

Besides his work, he is interested in programming languages, efficient algorithms, and data structures. Daniel wrote the short book Play Framework Starter on building web applications with the Play Framework for Java and Scala – and he has recently been creating Javaslang, a functional component library for Java 8, which has triggered our interest in particular.


It was a really exciting moment when I heard that Java would get lambdas. The fundamental idea of using functions as a means of abstraction has its origin in the “lambda calculus”, 80 years ago. Now, Java developers are able to pass behavior using functions.

List<Integer> list = Arrays.asList(2, 3, 1);

// passing the comparator as lambda expression
Collections.sort(list, (i1, i2) -> i1 - i2);

Lambda expressions reduce the verbosity of Java a lot. The new Stream API closes the gap between lambdas and the Java collection library. A closer look shows that parallel Streams are used rarely, or at least with caution. A Stream cannot be reused, and it is annoying that collections have to be converted back and forth.
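To illustrate the reuse restriction: a terminal operation consumes a Stream, and any further use fails at run time.

Stream<Integer> stream = Stream.of(2, 3, 1);

// The terminal operation consumes the stream
stream.sorted().collect(Collectors.toList());

// Any further use now throws java.lang.IllegalStateException:
// "stream has already been operated upon or closed"
stream.count();

And the conversion issue mentioned above: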

// stream a list, sort it and collect results
Arrays.asList(2, 3, 1)
  .stream()
  .sorted()
  .collect(Collectors.toList());
        
// a little bit shorter
Stream.of(2, 3, 1)
  .sorted()
  .collect(Collectors.toList());

// or better use an IntStream?
IntStream.of(2, 3, 1)
  .sorted()
  .collect(ArrayList::new, List::add, List::addAll);

// slightly simplified
IntStream.of(2, 3, 1)
  .sorted()
  .boxed()
  .collect(Collectors.toList());

Wow! Those are quite a few variants for sorting a list of integers. Generally, we want to focus on the what rather than wrapping our heads around the how. This extra dimension of complexity isn’t necessary. Here is how to achieve the same result with Javaslang:

List.of(2, 3, 1).sort();

Typically, every object-oriented language has an imperative core, and so does Java. We control the flow of our applications using conditional statements and loops.

String getContent(String location) throws IOException {
    try {
        final URL url = new URL(location);
        if (!"http".equals(url.getProtocol())) {
            throw new UnsupportedOperationException(
                "Protocol is not http");
        }
        final URLConnection con = url.openConnection();
        final InputStream in = con.getInputStream();
        return readAndClose(in);
    } catch(Exception x) {
        throw new IOException(
            "Error loading location " + location, x);
    }
}

Functional languages have expressions instead of statements; we think in values. Lambda expressions help us transform values. Here is one example, using Javaslang’s Try:

Try<String> getContent(String location) {
    return Try
        .of(() -> new URL(location))
        .filter(url -> "http".equals(url.getProtocol()))
        .flatMap(url -> Try.of(url::openConnection))
        .flatMap(con -> Try.of(con::getInputStream))
        .map(this::readAndClose);
}

The result is either a Success containing the content or a Failure containing an exception. In general, this notation is more concise than the imperative style and leads to robust programs we are able to reason about.
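A consumer of that Try can then fold both cases into a plain value. A small sketch – the method names follow later Javaslang versions, so adapt them to the Try API of the version you use:

Try<String> content = getContent("http://example.org");

// Success: the (transformed) content; Failure: a fallback value
String display = content
    .map(String::trim)
    .getOrElse("<no content available>");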

I hope this brief introduction has piqued your interest in javaslang.com! Please visit the site to learn more about functional programming with Java 8 and Javaslang.

Use this Neat Window Function Trick to Calculate Time Differences in a Time Series


Whenever you feel that itch…

Can’t I calculate this with SQL?

The answer is: Yes you can! And you should! Let’s see how…

Calculating time differences between rows

Let’s consider the following table containing timestamps (e.g. in a logging database). We’re using PostgreSQL syntax for this:

CREATE TABLE timestamps (
  ts timestamp
);

INSERT INTO timestamps VALUES
  ('2015-05-01 12:15:23.0'),
  ('2015-05-01 12:15:24.0'),
  ('2015-05-01 12:15:27.0'),
  ('2015-05-01 12:15:31.0'),
  ('2015-05-01 12:15:40.0'),
  ('2015-05-01 12:15:55.0'),
  ('2015-05-01 12:16:01.0'),
  ('2015-05-01 12:16:03.0'),
  ('2015-05-01 12:16:04.0'),
  ('2015-05-01 12:16:04.0');

Obviously, you’ll be adding constraints and indexes, etc. Now, let’s assume that each individual timestamp represents an event in your system, and you’d like to keep track of how long ago the previous event happened. I.e. you’d like the following result:

ts                   delta
-------------------------------
2015-05-01 12:15:23
2015-05-01 12:15:24  00:00:01
2015-05-01 12:15:27  00:00:03
2015-05-01 12:15:31  00:00:04
2015-05-01 12:15:40  00:00:09
2015-05-01 12:15:55  00:00:15
2015-05-01 12:16:01  00:00:06
2015-05-01 12:16:03  00:00:02
2015-05-01 12:16:04  00:00:01
2015-05-01 12:16:04  00:00:00

In other words

  • ts1 (12:15:23) + delta (00:00:01) = ts2 (12:15:24)
  • ts2 (12:15:24) + delta (00:00:03) = ts3 (12:15:27)

This can be achieved very easily with the LAG() window function:

SELECT 
  ts, 
  ts - lag(ts, 1) OVER (ORDER BY ts) delta
FROM timestamps
ORDER BY ts;

The above reads simply:

Give me the difference between the ts value of the current row and the ts value of the row that “lags” behind this row by one, with rows ordered by ts.

Easy, right? With LAG() you can actually access any row from another row within a “sliding window” by simply specifying the lag index.

We’ve already described this wonderful window function in a previous blog post.

Bonus: A running total interval

In addition to the difference between this timestamp and the previous one, we might be interested in the total difference between this timestamp and the first timestamp. This may sound like a running total (see our previous article about running totals using SQL), but it can be calculated much more easily using FIRST_VALUE() – a “cousin” of LAG():

SELECT 
  ts, 
  ts - lag(ts, 1) OVER w delta,
  ts - first_value(ts) OVER w total
FROM timestamps
WINDOW w AS (ORDER BY ts)
ORDER BY ts;

… the above query then yields

ts                   delta     total
---------------------------------------
2015-05-01 12:15:23            00:00:00
2015-05-01 12:15:24  00:00:01  00:00:01
2015-05-01 12:15:27  00:00:03  00:00:04
2015-05-01 12:15:31  00:00:04  00:00:08
2015-05-01 12:15:40  00:00:09  00:00:17
2015-05-01 12:15:55  00:00:15  00:00:32
2015-05-01 12:16:01  00:00:06  00:00:38
2015-05-01 12:16:03  00:00:02  00:00:40
2015-05-01 12:16:04  00:00:01  00:00:41
2015-05-01 12:16:04  00:00:00  00:00:41

Extra bonus: The total since a “reset” event

We can take this as far as we want. Let’s assume that we want to reset the total from time to time:

CREATE TABLE timestamps (
  ts timestamp,
  event varchar(50)
);

INSERT INTO timestamps VALUES
  ('2015-05-01 12:15:23.0', null),
  ('2015-05-01 12:15:24.0', null),
  ('2015-05-01 12:15:27.0', 'reset'),
  ('2015-05-01 12:15:31.0', null),
  ('2015-05-01 12:15:40.0', null),
  ('2015-05-01 12:15:55.0', 'reset'),
  ('2015-05-01 12:16:01.0', null),
  ('2015-05-01 12:16:03.0', null),
  ('2015-05-01 12:16:04.0', null),
  ('2015-05-01 12:16:04.0', null);

We can now run the following query:

SELECT
  ts, 
  ts - lag(ts, 1) 
       OVER (ORDER BY ts) delta,
  ts - first_value(ts) 
       OVER (PARTITION BY c ORDER BY ts) total
FROM (
  SELECT 
    COUNT(*) FILTER (WHERE EVENT = 'reset') 
             OVER (ORDER BY ts) c,
    ts
  FROM timestamps
) timestamps
ORDER BY ts;

… to produce

ts                   delta     total
---------------------------------------
2015-05-01 12:15:23            00:00:00
2015-05-01 12:15:24  00:00:01  00:00:01
2015-05-01 12:15:27  00:00:03  00:00:00 <-- reset
2015-05-01 12:15:31  00:00:04  00:00:04
2015-05-01 12:15:40  00:00:09  00:00:13
2015-05-01 12:15:55  00:00:15  00:00:00 <-- reset
2015-05-01 12:16:01  00:00:06  00:00:06
2015-05-01 12:16:03  00:00:02  00:00:08
2015-05-01 12:16:04  00:00:01  00:00:09
2015-05-01 12:16:04  00:00:00  00:00:09

The beautiful part is in the derived table

  SELECT 
    COUNT(*) FILTER (WHERE EVENT = 'reset') 
             OVER (ORDER BY ts) c,
    ts
  FROM timestamps

This derived table simply assigns a partition number c to each timestamp, incremented with each “reset” event. The result of the above subquery is:

c  ts
----------------------
0  2015-05-01 12:15:23
0  2015-05-01 12:15:24
1  2015-05-01 12:15:27 <-- reset
1  2015-05-01 12:15:31
1  2015-05-01 12:15:40
2  2015-05-01 12:15:55 <-- reset
2  2015-05-01 12:16:01
2  2015-05-01 12:16:03
2  2015-05-01 12:16:04
2  2015-05-01 12:16:04

As you can see, the COUNT(*) window function counts all the previous “reset” events, ordered by timestamp. This information can then be used as the PARTITION for the FIRST_VALUE() window function in order to find the first timestamp in each partition, i.e. at the time of the most recent “reset” event:

  ts - first_value(ts) 
       OVER (PARTITION BY c ORDER BY ts) total

Conclusion

It’s almost a running gag on this blog to say that…

There was SQL before window functions and SQL after window functions

Window functions are extremely powerful and they’re a part of the SQL standard, supported in most commercial databases, in PostgreSQL, in Firebird 3.0, and in CUBRID. If you aren’t using them already, start using them today!

If you’ve liked this article, find out more about window functions in our other articles on this blog.

PostgreSQL’s Best-Kept Secret, and how to Use it with jOOQ


PostgreSQL has a lot of secret data types. In recent times, PostgreSQL’s JSON and JSONB support was hyped as being the NoSQL-on-SQL secret (e.g. as advertised by ToroDB) that allows you to get the best of both worlds. But there are many other useful data types, one of which is the range type.

How does the range type work?

Ranges are very useful for things like age, date, price intervals etc. Let’s assume the following table:

CREATE TABLE age_categories (
  name VARCHAR(50),
  ages INT4RANGE
);

We can now fill the above table as such:

INSERT INTO age_categories
VALUES ('Foetus',   int4range(-1, 0)),
       ('Newborn',  int4range(0, 1)),
       ('Infant',   int4range(1, 2)),
       ('Kid',      int4range(2, 12)),
       ('Teenager', int4range(12, 20)),
       ('Tween',    int4range(20, 30)),
       ('Adult',    int4range(30, 60)),
       ('Senior',   int4range(60, 90)),
       ('Ancient',  int4range(90, 999));

And query it, e.g. by taking my age:

SELECT name
FROM age_categories
WHERE range_contains_elem(ages, 33);

… yielding

name
-----
Adult

There are a lot of other useful functions involved with range calculations. For instance, we can get the age span of a set of categories. Let’s assume we want to have the age span of Kid, Teenager, Senior. Let’s query:

SELECT int4range(min(lower(ages)), max(upper(ages)))
FROM age_categories
WHERE name IN ('Kid', 'Teenager', 'Senior');

… yielding

int4range
---------
[2,90)

And now, let’s go back to fetching all the categories that are within that range:

SELECT name
FROM age_categories
WHERE range_overlaps(ages, (
  SELECT int4range(min(lower(ages)), max(upper(ages)))
  FROM age_categories
  WHERE name IN ('Kid', 'Teenager', 'Senior')
));

… yielding

name
--------
Kid
Teenager
Tween
Adult
Senior

All of this could have been implemented with values contained in two separate columns, but using ranges is just much more expressive for range arithmetic.

How to use these types with jOOQ?

jOOQ doesn’t include built-in support for these advanced data types, but it allows you to bind these data types to your own, custom representation.

A good representation of PostgreSQL’s range types in Java would be jOOλ’s org.jooq.lambda.tuple.Range type, but you could also simply use int[] or Map.Entry for ranges. When we’re using jOOλ’s Range type, the idea is to be able to run the following statements using jOOQ:

// Assuming this static import:
import static org.jooq.lambda.tuple.Tuple.*;

DSL.using(configuration)
   .insertInto(AGE_CATEGORIES)
   .columns(AGE_CATEGORIES.NAME, AGE_CATEGORIES.AGES)
   .values("Foetus",   range(-1, 0))
   .values("Newborn",  range(0, 1))
   .values("Infant",   range(1, 2))
   .values("Kid",      range(2, 12))
   .values("Teenager", range(12, 20))
   .values("Tween",    range(20, 30))
   .values("Adult",    range(30, 60))
   .values("Senior",   range(60, 90))
   .values("Ancient",  range(90, 999))
   .execute();

And querying…

DSL.using(configuration)
   .select(AGE_CATEGORIES.NAME)
   .from(AGE_CATEGORIES)
   .where(rangeContainsElem(AGE_CATEGORIES.AGES, 33))
   .fetch();

DSL.using(configuration)
   .select(AGE_CATEGORIES.NAME)
   .from(AGE_CATEGORIES)
   .where(rangeOverlaps(AGE_CATEGORIES.AGES,
      select(int4range(min(lower(AGE_CATEGORIES.AGES)),
                       max(upper(AGE_CATEGORIES.AGES))))
     .from(AGE_CATEGORIES)
     .where(AGE_CATEGORIES.NAME.in(
       "Kid", "Teenager", "Senior"
     ))
   ))
   .fetch();

As always, the idea with jOOQ is that the SQL you want to get as output is the SQL you want to write in Java, type safely. In order to write the above, we’ll need to implement a couple of missing pieces. First off, we’ll need to create a data type binding (org.jooq.Binding) as described in this section of the manual. The binding can be written using the following Converter:

public class Int4RangeConverter 
implements Converter<Object, Range<Integer>> {
    private static final Pattern PATTERN = 
        Pattern.compile("\\[(.*?),(.*?)\\)");

    @Override
    public Range<Integer> from(Object t) {
        if (t == null)
            return null;

        Matcher m = PATTERN.matcher("" + t);
        if (m.find())
            return Tuple.range(
                Integer.valueOf(m.group(1)), 
                Integer.valueOf(m.group(2)));

        throw new IllegalArgumentException(
            "Unsupported range : " + t);
    }

    @Override
    public Object to(Range<Integer> u) {
        return u == null 
            ? null 
            : "[" + u.v1 + "," + u.v2 + ")";
    }

    @Override
    public Class<Object> fromType() {
        return Object.class;
    }

    @SuppressWarnings({ "unchecked", "rawtypes" })
    @Override
    public Class<Range<Integer>> toType() {
        return (Class) Range.class;
    }
}

… and the Converter can then be re-used in a Binding:

public class PostgresInt4RangeBinding 
implements Binding<Object, Range<Integer>> {

    @Override
    public Converter<Object, Range<Integer>> converter() {
        return new Int4RangeConverter();
    }

    @Override
    public void sql(BindingSQLContext<Range<Integer>> ctx) throws SQLException {
        ctx.render()
           .visit(DSL.val(ctx.convert(converter()).value()))
           .sql("::int4range");
    }

    // ...
}
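The remaining Binding methods (the // ... above) mostly shuffle the string representation through JDBC. Here’s a sketch that follows the pattern of the binding example in the jOOQ manual – the exact bodies may vary with your setup:

    @Override
    public void register(BindingRegisterContext<Range<Integer>> ctx) throws SQLException {
        // Ranges travel as strings on the JDBC level
        ctx.statement().registerOutParameter(ctx.index(), Types.VARCHAR);
    }

    @Override
    public void set(BindingSetStatementContext<Range<Integer>> ctx) throws SQLException {
        ctx.statement().setString(ctx.index(),
            Objects.toString(ctx.convert(converter()).value(), null));
    }

    @Override
    public void get(BindingGetResultSetContext<Range<Integer>> ctx) throws SQLException {
        ctx.convert(converter()).value(ctx.resultSet().getString(ctx.index()));
    }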

The important thing in the binding is just that every bind variable needs to be encoded in PostgreSQL’s string format for range types (i.e. [lower,upper)) and cast explicitly to ?::int4range. That’s all.

You can then configure your code generator to use those types, e.g. on all columns that are named [xxx]_RANGE:

<customType>
    <name>com.example.PostgresInt4RangeBinding</name>
    <type>org.jooq.lambda.tuple.Range&lt;Integer></type>
    <binding>com.example.PostgresInt4RangeBinding</binding>
</customType>
<forcedType>
    <name>com.example.PostgresInt4RangeBinding</name>
    <expression>.*?_RANGE</expression>
</forcedType>

The last missing pieces are the functions that you need to compare ranges, i.e.:

  • rangeContainsElem()
  • rangeOverlaps()
  • int4range()
  • lower()
  • upper()

These are quickly written:

static <T extends Comparable<T>> Condition 
    rangeContainsElem(Field<Range<T>> f1, T e) {
    return DSL.condition("range_contains_elem({0}, {1})", f1, val(e));
}

static <T extends Comparable<T>> Condition 
    rangeOverlaps(Field<Range<T>> f1, Range<T> f2) {
    return DSL.condition("range_overlaps({0}, {1})", f1, val(f2, f1.getDataType()));
}
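The remaining helpers – int4range(), lower(), and upper() – can be sketched along the same lines, using plain SQL templating. The unchecked cast below is a pragmatic shortcut for this illustration:

static Field<Integer> lower(Field<Range<Integer>> f) {
    return DSL.field("lower({0})", Integer.class, f);
}

static Field<Integer> upper(Field<Range<Integer>> f) {
    return DSL.field("upper({0})", Integer.class, f);
}

@SuppressWarnings({ "unchecked", "rawtypes" })
static Field<Range<Integer>> int4range(
    Field<Integer> lower, Field<Integer> upper) {
    return DSL.field("int4range({0}, {1})",
        (Class<Range<Integer>>) (Class) Range.class, lower, upper);
}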

Conclusion

Writing an extension for jOOQ data types takes a bit of time and effort, but it is easy to achieve. Once you’ve set up all the data types and the bindings, your generated source code will reflect the actual data types from your database, and you will be able to write powerful queries in a type safe way, directly in Java. Let’s consider again the plain SQL and the jOOQ version:

SQL

SELECT name
FROM age_categories
WHERE range_overlaps(ages, (
  SELECT int4range(min(lower(ages)), max(upper(ages)))
  FROM age_categories
  WHERE name IN ('Kid', 'Teenager', 'Senior')
));

jOOQ

DSL.using(configuration)
   .select(AGE_CATEGORIES.NAME)
   .from(AGE_CATEGORIES)
   .where(rangeOverlaps(AGE_CATEGORIES.AGES,
      select(int4range(min(lower(AGE_CATEGORIES.AGES)),
                       max(upper(AGE_CATEGORIES.AGES))))
     .from(AGE_CATEGORIES)
     .where(AGE_CATEGORIES.NAME.in(
       "Kid", "Teenager", "Senior"
     ))
   ))
   .fetch();

More information about data type bindings can be found in the jOOQ manual.

How SQL GROUP BY Should Have Been Designed – Like Neo4j’s Implicit GROUP BY


In the recent past, we’ve explained the syntactic implications of the SQL GROUP BY clause. If you haven’t already, you should read our article “Do You Really Understand SQL’s GROUP BY and HAVING clauses?“.

In essence, adding a GROUP BY clause to your query transforms your query on very implicit levels. The following reminder summarises the previous article:

  • Only column expressions referenced in the GROUP BY clause, or aggregations of other column expressions may appear in the SELECT clause
  • Aggregations without explicit GROUP BY clause imply the “grand total” GROUP BY () clause
  • Some databases (e.g. MySQL, and to some extent: the SQL standard) don’t follow these rules and allow for arbitrary column expressions (or at least functionally dependent column expressions) in the SELECT clause

How SQL GROUP BY should have been designed

There is another way of looking at GROUP BY, and it has been implemented in the equally fascinating, beautiful, and weird Cypher query language (those are good attributes) as supported by the Neo4j graph database. This alternative (yet SQL inspired) query language probably deserves a whole blog post series on its own, but let’s focus on aggregation. Because aggregation is the primary use case for grouping.

(for the record, check out the Neo4j docs about aggregation for details)

A quick wrap-up to understand Cypher:

Consider this simple Cypher query:

MATCH (me:Person)-->(friend:Person)
                 -->(friend_of_friend:Person)
WHERE me.name = 'A'
RETURN count(DISTINCT friend_of_friend), 
       count(friend_of_friend)

Furthermore

  • Cypher
    (me:Person)-->(friend:Person)
               -->(friend_of_friend:Person)
    

    corresponds roughly to SQL

         Person AS me 
    JOIN Person AS friend 
      ON [ implicit equi-join predicate ]
    JOIN Person as friend_of_friend
      ON [ implicit equi-join predicate ]
    

Cypher’s way of writing JOIN is actually extremely useful and could also be applied to SQL. It is only a matter of time before someone writes a Cypher-to-SQL transformer that implements this syntax, at least as syntactic sugar for the equivalent ANSI equi-join notation.

Let’s investigate aggregation in Cypher

Here’s the query again:

MATCH (me:Person)-->(friend:Person)
                 -->(friend_of_friend:Person)
WHERE me.name = 'A'
RETURN count(DISTINCT friend_of_friend), 
       count(friend_of_friend)

So, in SQL terms, this is exactly the same as:

SELECT count(DISTINCT friend_of_friend), 
       count(friend_of_friend)
FROM   [ Persons ... ]
WHERE  me.name = 'A'

In other words, the same implicit grand total GROUP BY () is implied and all values are aggregated into a single row.

The next example from the Neo4j docs is more intriguing. This will count the number of nodes connected to a node n with name = 'A':

MATCH (n { name: 'A' })-->(x)
RETURN n, count(*)

Which is a shorter form for writing:

MATCH (n)-->(x)
WHERE n.name = 'A'
RETURN n, count(*)

This example will also perform aggregation, but this time with an implicit GROUP BY n clause. In SQL, you’d write something like:

SELECT   n.id, count(*)
FROM     n
JOIN     x
  ON     [ implicit equi-join predicate ]
WHERE    n.name = 'A'
GROUP BY n.id

The nice thing in Cypher is that the obvious GROUP BY clause (it can only be GROUP BY n.id) is implied. It doesn’t have to be written explicitly.

Takeaway for SQL

We’ve seen a couple of nice Cypher language features, especially the incredibly nice way to write “JOIN” (or rather graph traversal in Neo4j). But a much more obvious, low-hanging fruit with actual chances to make it into the SQL standard would be to make the SQL GROUP BY clause optional, and dependent on the SELECT clause using the following rules:

  • If SELECT contains no aggregation functions, there shall be no implied GROUP BY clause
  • If SELECT contains 1-N aggregation functions, there shall be an implied GROUP BY clause formed from all non-aggregated column expressions in the SELECT clause
  • If SELECT contains only aggregation functions, the “grand total” GROUP BY () shall apply
  • An explicit GROUP BY clause will always be preferred to any implied GROUP BY clause

If any of you ISO / IEC committee members are reading this, this is on my wish list for a future SQL standard. And please, PostgreSQL. Implement this right away.

Liked this article?

Here’s some further reading about the SQL GROUP BY clause and aggregation on this blog.

This Common API Technique is Actually an Anti-Pattern


I admit, we’ve been lured into using this technique as well. It’s just so convenient, as it allows for avoiding a seemingly unnecessary cast. It’s the following technique here:

interface SomeWrapper {
  <T> T get();
}

Now you can, seemingly type safely, assign anything from the wrapper to any reference type:

SomeWrapper wrapper = ...

// Obviously
Object a = wrapper.get();

// Well...
Number b = wrapper.get();

// Risky
String[][] c = wrapper.get();

// Improbable
javax.persistence.SqlResultSetMapping d = 
    wrapper.get();

This is actually the API you can use when you’re using jOOR, our reflection library that we’ve written and open sourced to improve our integration tests. With jOOR, you can write things like:

Employee[] employees = on(department)
    .call("getEmployees").get();
 
for (Employee employee : employees) {
    Street street = on(employee)
        .call("getAddress")
        .call("getStreet")
        .get();
    System.out.println(street);
}

The API is rather simple. The on() method wraps an Object or a Class. The call() methods then call a method on that object using reflection (but without requiring exact signatures, without requiring the method to be public, and without throwing any checked exceptions). And without the need for casting, you can then call get() to assign the result to any arbitrary reference type.

This is probably OK with a reflection library like jOOR, because the whole library is not really type safe. It can’t be, because it’s reflection.

But the “dirty” feeling remains: the feeling of giving the call-site a promise with respect to the resulting type – a promise that cannot be kept, and that will result in a ClassCastException, a thing of the past that junior developers who started after Java 5 and generics hardly know.
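To see that promise break, consider this minimal sketch with a hypothetical implementation of the wrapper:

SomeWrapper wrapper = new SomeWrapper() {
    @SuppressWarnings("unchecked")
    @Override
    public <T> T get() {
        // The unchecked lie happens here
        return (T) Integer.valueOf(42);
    }
};

Integer i = wrapper.get(); // fine
String s  = wrapper.get(); // compiles, but throws ClassCastException
                           // at this assignment's hidden checkcast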

But the JDK libraries also do that…

Yes, they do. But very seldom, and only if the generic type parameter is really irrelevant. For instance, when getting Collections.emptyList(), whose implementation looks like this:

@SuppressWarnings("unchecked")
public static final <T> List<T> emptyList() {
    return (List<T>) EMPTY_LIST;
}

It’s true that the EMPTY_LIST is cast unsafely from List to List<T>, but from a semantic perspective, this is a safe cast. You cannot modify this List reference, and because it’s empty, there is no method in List<T> that will ever give you an instance of T or T[] that does not correspond to your target type. So, all of these are valid:

// perfectly fine
List<?> a = emptyList();

// yep
List<Object> b = emptyList();

// alright
List<Number> c = emptyList();

// no problem
List<String[][]> d = emptyList();

// if you must
List<javax.persistence.SqlResultSetMapping> e 
    = emptyList();

So, as always (or mostly), the JDK library designers have taken great care not to make any false promises about the generic type that you might get. This means that you will often get an Object type where you know that another type would be more suitable.

But even if YOU know this, the compiler won’t. Erasure comes at a price and the price is paid when your wrapper or collection is empty. There is no way of knowing the contained type of such an expression, so don’t pretend you do. In other words:

Do not use the just-to-avoid-casting generic method anti pattern
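If the call-site really does know more than the compiler, a safer alternative is to make it say so explicitly, for example by passing the expected type. A hypothetical variant of the wrapper:

interface SaferWrapper {
    <T> T get(Class<T> type);
}

class ValueWrapper implements SaferWrapper {
    private final Object value;

    ValueWrapper(Object value) {
        this.value = value;
    }

    @Override
    public <T> T get(Class<T> type) {
        // Class.cast() fails fast, right here,
        // instead of at some distant assignment
        return type.cast(value);
    }
}

// Usage:
// String s = new ValueWrapper("hello").get(String.class); // OK
// String t = new ValueWrapper(42).get(String.class);      // immediate CCE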

Do Not Think That One Second is Fast for Query Execution


I keep encountering situations where RDBMS users think that one second for query execution is anything near fast. Most recently, in this Stack Overflow question:

Hibernate SQL In clause making CPU usage to 100%

The poster’s original question was why the (seemingly) same query executes in one second in SQL Server Management Studio, but in 60 seconds when executed from Hibernate. The query looks similar to this:

select Student_Id 
from Student_Table 
where Roll_No in ('A101','A102','A103',.....'A250');

There might be many reasons for this difference. But the most striking message here is:

Please do not believe that 1 second is fast.

Databases are incredibly fast and simple queries like the one above should execute in virtually no time, even on mediocre servers. Even on your laptop! Markus Winand has made it a point to tell you that in 80% of all performance issues, all you have to do is to add that missing index. And that’s the case here as well!

The poster’s original table contains only two indexes:

CREATE TABLE student_table (
       Student_Id BIGINT NOT NULL IDENTITY
     , Class_Id BIGINT NOT NULL
     , Student_First_Name VARCHAR(100) NOT NULL
     , Student_Last_Name VARCHAR(100)
     , Roll_No VARCHAR(100) NOT NULL
     , PRIMARY KEY (Student_Id)
     , CONSTRAINT UK_StudentUnique_1 
         UNIQUE  (Class_Id, Roll_No)
);

There is an index to implement the PRIMARY KEY, and there’s another index for the UNIQUE constraint, but neither index is really useful here, as the query predicate filters on Roll_No, which is only the second column of the UNIQUE constraint. When executing the above query on roughly 8 million rows, I’m getting an index scan on the UNIQUE index, and the query runs in three seconds:

[Screenshot: execution plan showing an Index Scan on the UNIQUE index]

This “Index Scan” operation is not good at all. The database actually runs through the entire index, looking for the applicable Roll_No values in each index leaf node. This is explained well on Use The Index, Luke’s page about concatenated indexes.

The solution

But the good news is, SQL Server Management Studio gives you immediate tuning advice. Just right click on the execution plan and choose “Missing Index Details…” to get the following advice:

/*
Missing Index Details from SQLQuery1.sql -
    LUKAS-ENVY\SQLEXPRESS.test
The Query Processor estimates that implementing the 
    following index could improve the query cost 
    by 87.5035%.
*/

/*
USE [test]
GO
CREATE NONCLUSTERED INDEX [<Name of Missing Index>]
ON [dbo].[student_table] ([Roll_No])
INCLUDE ([Student_Id])
GO
*/

This doesn’t necessarily mean that the above index is the optimal choice for all your queries, but the fact that you’re querying with a predicate on Roll_No should be a strong enough indicator that you need an index on at least this Roll_No column. The simplest possible index here is simply:

CREATE INDEX i_student_roll_no
ON student_table (Roll_No);

With that index in place, we’ll now get an “Index Seek” operation, which runs instantly:

[Screenshot: execution plan showing an Index Seek]

Covering index

In this particular case, a “covering index” as suggested by Vlad Mihalcea in his answer might be appropriate. For instance:

CREATE INDEX i_student_roll_no
ON student_table (Roll_No, Student_Id);

The advantage of a covering index is that all the information needed for the query execution is already contained in the index. This is true in this particular case, but it can also be dangerous as:

  • The covering index needs more space, and if the table is already large, space might become an issue
  • The covering index only adds value as long as the query doesn’t also use additional columns (e.g. for projections, calculations, additional filtering, sorting, etc.). This is probably not the case in this simple example, but that might change quickly in the near future

Thus, a covering index shouldn’t be your default choice in such cases. Better be conservative and add only those columns to the index that add immediate value for filtering.

Conclusion

I’d like to repeat this time and again:

Do NOT think that one second is fast

In fact:

Do NOT think that anything beyond 2-3ms is fast!

Unless you’re doing heavy reporting or batch processing, where processing time obviously might take a bit longer, simple queries like these should never be anything slower than instant. And most of the time, you can achieve this speed by adding an index.

On a side-note

The poster of the aforementioned question obviously has other issues as well. None of the above explain the execution speed difference between Hibernate and “plain SQL”. But again, the message here is that if your “plain SQL” already takes more than one second, you have a very low-hanging fruit to fix.

Let’s Review How to Insert Clob or Blob via JDBC


LOBs are a PITA in all databases, as well as in JDBC. Handling them correctly takes a couple of lines of code, and you can be sure that you’ll get it wrong eventually. Because you have to think of a couple of things:

  • Foremost, LOBs are heavy resources that need special lifecycle management. Once you’ve allocated a LOB, you had better “free” it as well to decrease the pressure on your GC. This article shows more about why you need to free LOBs
  • The time when you allocate and free a LOB is crucial. It might have a longer life span than any of your ResultSet, PreparedStatement, or Connection / transaction. Each database manages such life spans individually, and you might have to read up on the specifications in edge cases
  • While you may use String instead of Clob, or byte[] instead of Blob for small to medium sized LOBs, this may not always work, and may even lead to some nasty errors, like Oracle’s dreaded ORA-01461: can bind a LONG value only for insert into a LONG column

So, if you’re working on a low level with JDBC (instead of abstracting JDBC via Hibernate or jOOQ), you had better write a small utility that takes care of proper LOB handling.

We’ve recently re-discovered our own utility that we’re using for jOOQ integration testing, at least in some databases, and thought this might be very useful to a couple of our readers who operate directly with JDBC. Consider the following class:

public class LOB implements AutoCloseable {

    private final Connection connection;
    private final List<Blob> blobs;
    private final List<Clob> clobs;

    public LOB(Connection connection) {
        this.connection = connection;
        this.blobs = new ArrayList<>();
        this.clobs = new ArrayList<>();
    }

    public final Blob blob(byte[] bytes) 
    throws SQLException {
        Blob blob;

        // You may write more robust dialect 
        // detection here
        if (connection.getMetaData()
                      .getDatabaseProductName()
                      .toLowerCase()
                      .contains("oracle")) {
            blob = BLOB.createTemporary(connection, 
                       false, BLOB.DURATION_SESSION);
        }
        else {
            blob = connection.createBlob();
        }

        blob.setBytes(1, bytes);
        blobs.add(blob);
        return blob;
    }

    public final Clob clob(String string) 
    throws SQLException {
        Clob clob;

        if (connection.getMetaData()
                      .getDatabaseProductName()
                      .toLowerCase()
                      .contains("oracle")) {
            clob = CLOB.createTemporary(connection, 
                       false, CLOB.DURATION_SESSION);
        }
        else {
            clob = connection.createClob();
        }

        clob.setString(1, string);
        clobs.add(clob);
        return clob;
    }


    @Override
    public final void close() throws Exception {
        blobs.forEach(JDBCUtils::safeFree);
        clobs.forEach(JDBCUtils::safeFree);
    }
}

This simple class has some nice traits:

  • It’s AutoCloseable, so you can free your lobs with the try-with-resources statement
  • It abstracts over the creation of LOBs across SQL dialects. No need to remember the Oracle way

To use this class, simply write something like the following:

try (
    LOB lob = new LOB(connection);
    PreparedStatement stmt = connection.prepareStatement(
        "insert into lobs (id, lob) values (?, ?)")
) {
    stmt.setInt(1, 1);
    stmt.setClob(2, lob.clob("abc"));
    stmt.executeUpdate();
}

That’s it! No need to keep references to the LOB, to free it safely if it’s not null, or to recover correctly from exceptions. Just put the LOB container in the try-with-resources statement, along with the PreparedStatement, and you’re done.

If you’re interested in why you have to call Clob.free() or Blob.free() in the first place, read our article about it. It’ll spare you one or two OutOfMemoryErrors

Is Your Eclipse Running a Bit Slow? Just Use This Simple Trick!


You wouldn’t believe it until you try it yourself. I’ve been using the Eclipse Mars developer milestones lately, and I’ve been having some issues with slow compilation. I always thought it was because of the m2e integration, which has never been famous for working perfectly. But then it dawned upon me when I added a JPA persistence.xml file to run some jOOQ + Hibernate tests… I ran into this issue and googled it, only to learn that many people are complaining about JPA validation running forever in their Eclipse installations.

So I searched for how to deactivate that, and boom!

My whole Eclipse got much, much faster

In fact, I didn’t just deactivate JPA validation, but all validation:

[Screenshot: deactivating all validation in the Eclipse workspace preferences to boost performance]

I don’t remember the last time I ever needed validation, or thought that it was a useful feature in the first place. If you want to help your whole team, you can also check in the following content as each of your projects’ .settings/org.eclipse.wst.validation.prefs file:

DELEGATES_PREFERENCE=delegateValidatorList
USER_BUILD_PREFERENCE=enabledBuildValidatorListorg.eclipse.wst.wsi.ui.internal.WSIMessageValidator;
USER_MANUAL_PREFERENCE=enabledManualValidatorListorg.eclipse.wst.wsi.ui.internal.WSIMessageValidator;
USER_PREFERENCE=overrideGlobalPreferencestruedisableAllValidationtrueversion1.2.600.v201501211647
eclipse.preferences.version=1
override=true
suspend=true
vf.version=3

This has the same effect, but can be checked into version control.

Found this tip useful? See also our list of Top 5 Useful Hidden Eclipse Features
