There is no Such Thing as Object-Relational Impedance Mismatch


Much of the ORM criticism of the last decade missed the point because it was inaccurate. By the end of this article, we will have come to the following conclusion:

There is no significant difference between the relational (data) model and object oriented models

How do we come to this conclusion? Read on!

How we came to believe in this fallacy

Many popular bloggers and opinion leaders have missed no chance to bash ORMs for their “obvious” impedance mismatch with the relational world. N+1, inefficient queries, library complexity, leaky abstractions, all sorts of buzzwords have been employed to dismiss ORMs – often containing a lot of truth, albeit without providing a viable alternative.

But are these articles really criticising the right thing?

Few of the above articles recognise a central fact, which has been elicited eloquently and humorously by Erik Meijer and Gavin Bierman in their very interesting paper “A co-Relational Model of Data for Large Shared Data Banks“, subtitled:

Contrary to popular belief, SQL and noSQL are really just two sides of the same coin.

Or in other words: The “hierarchical” object world and the “relational” database world model the exact same thing. The only difference is the direction of the arrows that you draw in your diagrams.

Let this sink in.

  • In the relational model, children point to their parent.
  • In the hierarchical model, parents point to their children.

That’s all there is to it.


What is an ORM?

ORMs bridge the gap between the two worlds. They’re the inverters of arrows, if you will. They make sure that every “relation” in your RDBMS can be materialised as an “aggregation” or “composition” in your “hierarchical” world (this works for objects, XML, JSON, and any other format). They make sure that such materialisation is properly transacted, and that changes to individual attributes or to relational (aggregational, compositional) attributes are properly tracked and flushed back into the master model, the database – where the model is persisted. Individual ORMs differ in terms of offered features and in how much mapping logic they offer in addition to mapping individual entities to individual types.

  • Some ORMs may help you implement locking
  • Some may help you to patch model mismatches
  • Some may focus merely on a 1:1 mapping between these classes and tables

But all ORMs do one very simple thing. Ultimately, they take rows from your tables and materialise them as objects in your class model and vice-versa.
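To make that one simple thing concrete, here is a minimal sketch of the materialisation step, using plain JDBC (java.sql) and the author table / Author class from the next section. This is only a sketch: the url variable is an assumption, and all of the change tracking and transaction handling that real ORMs add on top is left out:

// Rows in, objects out – the essence of any ORM
List<Author> authors = new ArrayList<>();

try (Connection con = DriverManager.getConnection(url);
     PreparedStatement stmt = con.prepareStatement(
         "SELECT first_name, last_name FROM author");
     ResultSet rs = stmt.executeQuery()) {

    while (rs.next()) {
        Author author = new Author();
        author.firstName = rs.getString("first_name");
        author.lastName = rs.getString("last_name");
        authors.add(author);
    }
}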

By the way, a very nice overview of different ORMs has recently been compiled on the Vertabelo blog.

Tables and classes are the same thing

Give or take 1-2 implementation details, an RDBMS’s table and an OO language’s class are the same thing: a specification of a set of grouped attributes, each with their associated type. Consider the following example, using SQL and Java:

SQL

CREATE TABLE author (
  first_name VARCHAR(50),
  last_name VARCHAR(50)
);

Java

class Author {
  String firstName;
  String lastName;
}

There is absolutely no conceptual difference between the two – the mapping is straightforward. The mapping is even straightforward when you consider “relations” / “compositions” between different entities / types:

SQL (let’s leave out constraints for simplicity)

CREATE TABLE author (
  id BIGINT,
  first_name VARCHAR(50),
  last_name VARCHAR(50)
);

CREATE TABLE book (
  id BIGINT,
  author_id BIGINT,
  title VARCHAR(50)
);

Java

class Author {
  Long id;
  String firstName;
  String lastName;
  Set<Book> books;
}

class Book {
  Long id;
  Author author;
  String title;
}

The implementation details are omitted (and probably account for half of the criticism). But omitting further details allows for a straightforward 1:1 mapping of individual rows from your database to your Java model, without any surprises. Most ORMs – in the Java ecosystem, Hibernate in particular – have managed to implement the above idea very well, hiding away all the technical details of actually doing such a model transfer between the RDBMS and Java.

In other words:

There is absolutely nothing wrong with this mapping approach!

Yet: There *IS* an impedance mismatch, somewhere

The “problems” that many bloggers criticise arise not from the non-existing mismatch between the two model representations (“relational” vs. “hierarchical”). The problems arise from SQL, which is a decent implementation of relational algebra.

In fact, the very same mismatch that everyone criticises also exists between the entity model and relational algebra / SQL:

Relational algebra has been defined in order to be able to query relations and to form new ad-hoc relations as an output of such queries. Depending on the operations and transformations that are applied, the resulting tuples may have absolutely nothing to do with the individual entities involved in a query. In other, ORM-y words: the product of relational algebra, and in particular of SQL, is of no use, as it can no longer be further processed by the ORM, let alone persisted back into the database.
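For instance, assuming the author / book schema shown above, the following perfectly reasonable query produces ad-hoc tuples that correspond to no entity at all – there is nothing an ORM could track changes on, let alone persist:

SELECT a.last_name, count(b.id) AS books
FROM author a
LEFT JOIN book b ON b.author_id = a.id
GROUP BY a.last_name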

To make things “worse”, SQL today is a large super-set of the features offered by relational algebra. It has gotten much more useful than when it was conceived.

Why this mismatch still affects modern ORMs

The previous paragraphs outlined the single main reason why ORMs are really criticised, even if such criticism often doesn’t mention this exact reason:

SQL / relational algebra is not really appropriate for partially materialising relations into a client, or for storing changes back into the database. Yet, most RDBMS offer only SQL for that job.

Back to the author / book example. When you want to load and display an author and their books to a web application’s user, you’d like to simply fetch that author and their books, call simple methods like author.add(book) and author.remove(book), and let some magic flush your data back into the storage system.
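With JPA / Hibernate, for instance, that magic looks roughly like this. This is only a sketch: the addBook() helper (a hypothetical method wiring up both sides of the association) and the configured EntityManager em are assumptions, not prescribed API:

// Load the author, attach a new book, and let the commit flush
// everything back – no hand-written SQL in sight
em.getTransaction().begin();

Author author = em.find(Author.class, 1L);

Book book = new Book();
book.title = "Book of Longing";
author.addBook(book); // hypothetical helper

em.persist(book);
em.getTransaction().commit();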

Thinking about the amount of SQL code to be written for such a simple CRUD task makes everyone squeal.

Life’s too short to spend time on CRUD

Perhaps QUEL might have been a better language for CRUD, but that ship has sailed. And unfortunately, because of SQL being an inappropriate language for this job, you cannot ignore that “magic” but have to know well what happens behind the scenes, e.g. by tweaking Hibernate’s fetching strategies.

Translated to SQL, this may be implemented in several ways:

1. Fetching with JOIN

Using outer joins, all the involved entities can be queried in one go:

SELECT author.*, book.*
FROM author
LEFT JOIN book ON author.id = book.author_id
WHERE author.id = ?

Advantages:

  • A single query can be issued and all the data can be transferred at once

Disadvantages:

  • The author attributes are repeated in every tuple. The client (ORM) has to de-duplicate authors first, before populating the author-book relationship (see the sketch below). This can be particularly bad when you have many nested relations that should be fetched at once.
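Such a de-duplication step might look like the following sketch, assuming the flat rows produced by the above JOIN, exposed through a hypothetical Row type with typed accessors:

// The same author arrives once per book; materialise each author only
// once and attach the books as they stream past
Map<Long, Author> authors = new LinkedHashMap<>();

for (Row row : rows) {
    Author author = authors.computeIfAbsent(
        row.getLong("author.id"),
        id -> {
            Author a = new Author();
            a.id = id;
            a.firstName = row.getString("author.first_name");
            a.lastName = row.getString("author.last_name");
            a.books = new LinkedHashSet<>();
            return a;
        });

    Book book = new Book();
    book.id = row.getLong("book.id");
    book.title = row.getString("book.title");
    book.author = author;
    author.books.add(book);
}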

2. Fetching with SELECT

A single query is issued for each entity:

SELECT *
FROM author
WHERE id = ?

SELECT *
FROM book
WHERE author_id = ?

Advantages:

  • The amount of data to be transferred is minimal: Each row is transferred exactly once.

Disadvantages:

  • The amount of queries that are issued may explode into the well-known N+1 problem.

Hibernate in particular knows other fetch strategies, although they are essentially variants / optimisations of one of the above.

Why not use SQL MULTISET?

The ideal way to fetch all data in this case using advanced SQL would be by using MULTISET:

SELECT author.*, MULTISET (
  SELECT book.*
  FROM book
  WHERE book.author_id = author.id
) AS books
FROM author
WHERE id = ?

The above will essentially create a nested collection for each author:

first_name  last_name   books (nested collection)
--------------------------------------------------

Leonard     Cohen       title
                        --------------------------
                        Book of Mercy
                        Stranger Music
                        Book of Longing

Ernest      Hemingway   title
                        --------------------------
                        For Whom the Bell Tolls
                        The Old Man and the Sea

If you add another nested entity, it is easy to see how another MULTISET could allow for additionally nested data:

SELECT author.*, MULTISET (
  SELECT book.*, MULTISET (
    SELECT language.*
    FROM language
    JOIN book_language AS bl
    ON language.id = bl.language_id
    AND book.id = bl.book_id
  ) AS languages
  FROM book
  WHERE book.author_id = author.id
) AS books
FROM author
WHERE id = ?

The outcome would now be along the lines of:

first_name  last_name   books
-----------------------------------------------------

Leonard     Cohen       title            languages
                        -----------------------------
                        Book of Mercy    language
                                         ------------
                                         en

                        Stranger Music   language
                                         ------------
                                         en
                                         de

                        Book of Longing  language
                                         ------------
                                         en
                                         fr
                                         es

Advantages:

  • A single query can materialise all eager-loaded rows with minimal bandwidth usage.

Disadvantages:

  • None.

Unfortunately, MULTISET is poorly supported by RDBMS.

MULTISET (as well as arrays and other collection types) was introduced formally into the SQL standard as of SQL:2003, as part of an initiative to embed OO features into the SQL language. Oracle, for instance, has implemented much of it, much like Informix did, or the lesser-known CUBRID (albeit using vendor-specific syntax).

Other databases like PostgreSQL allow for aggregating nested rows into typed arrays, which works the same way although with a bit more syntactic effort.
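A hedged PostgreSQL sketch of that approach, reusing the author / book schema from above – array_agg() collects each author’s book rows into a typed array:

SELECT
  author.id,
  author.first_name,
  author.last_name,
  array_agg(book) AS books
FROM author
LEFT JOIN book ON book.author_id = author.id
WHERE author.id = ?
GROUP BY author.id, author.first_name, author.last_name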

MULTISET and other ORDBMS SQL features are the perfect compromise, combining the best of the “relational” model with the best of the “hierarchical” model. They allow for combining CRUD operations with querying in one go, removing the need for sophisticated ORMs, as the SQL language can be used directly to map all your data from your (relational) database to your (hierarchical) client representation with no friction.

Conclusion and call to action!

We’re living through exciting times in our industry. The elephant (SQL) in the room is still here, learning new tricks all the time. The relational model has served us well, and has been enriched with hierarchical models in various implementations. Functional programming is gaining traction, complementing object orientation in very useful ways.

Think of the glue, putting all these great technological concepts together, allowing for:

  • Storing data in the relational model
  • Materialising data in the hierarchical model
  • Processing data using functional programming

That awesome combination of techniques is hard to beat – we’ve shown how SQL and functional programming can work with jOOQ. All that’s missing – in our opinion – is better support for MULTISET and other ORDBMS features from RDBMS vendors.

Thus, we urge you, PostgreSQL developers: You’re creating one of the most innovative databases out there. Oracle is ahead of you in this area – but their implementation is too strongly tied to PL/SQL, which makes it clumsy. Yet, you’re missing out on one of the most awesome SQL feature sets. The ability to construct nested collections (not just arrays), and to query them efficiently. If you lead the way, other RDBMS will follow.

And we can finally stop wasting time talking about the object-relational impedance non-mismatch.

Divided we Stand: Optional


Our recent article “NULL is Not The Billion Dollar Mistake. A Counter-Rant” got us a lot of reads, controversial comments, and a 50/50 upvote / downvote ratio pretty much everywhere a blog post can be posted and voted on. This was expected.

Objectively, NULL is just a “special” value that has been implemented in a variety of languages and type systems, and in a variety of ways – including perhaps the set of natural numbers (a.k.a. “zero”, the original null – them Romans sure didn’t like that idea).

Or, as Charles Roth has put it adequately in the comments:

Chuckle. Occasionally a mathematics background comes in handy. Now we could argue about whether NULL was “invented” or “discovered”…

Now, Java’s null is a particularly obnoxious implementation of that “special value” for reasons like:

Compile-time typing vs. runtime typing

// We can assign null to any reference type
Object s = null;

// Yet, null is of no type at all
if (null instanceof Object)
    throw new Error("Will never happen");

The null literal is even more special

// Nothing can be put in this list, right?
List<?> list = new ArrayList<Void>();

// Yet, null can:
list.add(null);

Methods are present on the null literal

// This compiles, but does it run?
((Object) null).getClass();

Java 8’s Optional

The introduction of Optional might have changed everything. Many functional programmers love it so much because the type clearly communicates the cardinality of an attribute. In a way:

// Cardinality 1:
Type t1;

// Cardinality 0-1:
Optional<Type> t01;

// Cardinality 0..n:
Iterable<Type> tn;

A lot of Java 8’s Optional‘s interesting history has been dug out by Nicolai Parlog on his excellent blog.

Be sure to check it out:
http://blog.codefx.org/tag/optional

In the Java 8 expert groups, Optional wasn’t an easy decision:

[…] There has been a lot of discussion about [Optional] here and there over the years. I think they mainly amount to two technical problems, plus at least one style/usage issue:

  1. Some collections allow null elements, which means that you cannot unambiguously use null in its otherwise only reasonable sense of “there’s nothing there”.
  2. If/when some of these APIs are extended to primitives, there is no value to return in the case of nothing there. The alternative to Optional is to return boxed types, which some people would prefer not to do.
  3. Some people like the idea of using Optional to allow more fluent APIs.
    As in
    x = s.findFirst().or(valueIfEmpty)
    vs
    if ((x = s.findFirst()) == null) x = valueIfEmpty;
    Some people are happy to create an object for the sake of being able to do this. Although sometimes less happy when they realize that Optionalism then starts propagating through their designs, leading to Set<Optional<T>>’s and so on.

It’s hard to win here.

Doug Lea

Arguably, the main true reason for the JDK to have introduced Optional is the lack of availability of Project Valhalla’s specialisation in Java 8, which meant that a performant primitive type stream (such as IntStream) needed some new type like OptionalInt to encode absent values as returned from IntStream.findAny(), for instance. For API consistency, such an OptionalInt from the IntStream type had to be matched by a “similar” Optional from the Stream type.
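That consistency requirement is easy to see in code:

// The primitive stream avoids boxing by returning OptionalInt...
OptionalInt i = IntStream.of(1, 2, 3).findAny();

// ... which, for consistency, is matched by Optional on the object stream
Optional<Integer> o = Stream.of(1, 2, 3).findAny();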

Can Optional be introduced late in a platform?

While Doug’s concerns are certainly valid, there are some other, more significant arguments that make me wary of Optional (in Java). While Scala developers embrace their awesome Option type as they have no alternative and hardly ever see any null reference or NullPointerException – except when working with some Java libraries – this is not true for Java developers. We have our legacy collections API, which (ab-)uses null all over the place. Take java.util.Map, for instance. Map.get()‘s Javadoc reads:

Returns the value to which the specified key is mapped, or null if this map contains no mapping for the key.

[…]

If this map permits null values, then a return value of null does not necessarily indicate that the map contains no mapping for the key; it’s also possible that the map explicitly maps the key to null. The containsKey operation may be used to distinguish these two cases.

This is how much of the pre-Java 8 collection API worked, and we’re still using it actively with Java 8, with new APIs such as the Streams API, which makes extensive use of Optional.
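A quick illustration of that ambiguity:

Map<String, String> map = new HashMap<>();
map.put("key", null);

map.get("key");          // null, because the key maps to null
map.get("absent");       // null, because there is no mapping
map.containsKey("key");  // true – the only way to tell the two apart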

A contrived (and obviously wrong) example:

Map<Integer, List<Integer>> map =
Stream.of(1, 1, 2, 3, 5, 8)
      .collect(Collectors.groupingBy(n -> n % 5));

IntStream.range(0, 5)
         .mapToObj(map::get)
         .map(List::size)
         .forEach(System.out::println);

Boom, NullPointerException. Can you spot it?

The map contains remainders of a modulo-5 operation as keys, and the associated, collected dividends as a value.

We then go through all numbers from 0 to 4 (the only possible remainders), extract the list of associated dividends, List::size them… wait. Oh. Map.get may return null.

You’re getting used to the fluent style of Java 8’s new APIs, you’re getting used to the functional and monadic programming style where streams and optionals behave similarly, and you may quickly be surprised that anything passed to a Stream.map() method can be null.

In fact, if APIs were allowed to be retrofitted, then the Map.get method might look like this:

public interface Map<K,V> {
    Optional<V> get(Object key);
}

(it probably still wouldn’t because most maps allow for null values or even keys, which is hard to retrofit)

If we had such a retrofitting, the compiler would complain that we have to unwrap Optional before calling List::size. We’d fix it and write:

IntStream.range(0, 5)
         .mapToObj(map::get)
         .map(l -> l.orElse(Collections.emptyList()))
         .map(List::size)
         .forEach(System.out::println);

Java’s Crux – Backwards compatibility

Backwards compatibility will lead to a mediocre adoption of Optional. Some parts of JDK API make use of it, others use null to encode the absent value. You can never be sure and you always have to remember both possibilities, because you cannot trust a non-Optional type to be truly “@NotNull“.

If you prefer using Optional over null in your business logic, that’s fine. But you will have to make very sure to apply this strategy thoroughly. Take the following blog post, for instance, which has gotten lots of upvotes on reddit:
http://shekhargulati.com/2015/07/28/day-4-lets-write-null-free-java-code

It inadvertently introduces a new anti-pattern:

public class User {
 
    private final String username;
    private Optional<String> fullname;
 
    public User(String username) {
        this.username = username;
        this.fullname = Optional.empty();
    }
 
    public String getUsername() {
        return username;
    }
 
    public Optional<String> getFullname() {
        return fullname;
    }

    //      good--------^^^
    // vvvv--------bad
 
    public void setFullname(String fullname) {
        this.fullname = Optional.of(fullname);
    }
}

The domain object establishes an “Optional opt-in” contract, without opting out of null entirely. While getFullname() forces API consumers to reason about the possible absence of a full name, setFullname() doesn’t accept such an Optional argument type, but a nullable one – on which Optional.of() will throw a NullPointerException. What was meant as a clever convenience will result only in confusion at the call site.
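A consistent version of the setter would opt in to Optional on both ends, or at least tolerate null – a sketch of either option:

// Symmetric contract: the setter speaks Optional, like the getter
public void setFullname(Optional<String> fullname) {
    this.fullname = Objects.requireNonNull(fullname);
}

// ... or, if a nullable parameter is preferred, at least don't blow up:
public void setFullname(String fullname) {
    this.fullname = Optional.ofNullable(fullname);
}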

The anti-pattern is repeated by Stephen Colebourne (who brought us Joda Time and JSR-310) on his blog, calling this a “pragmatic” approach:

public class Address {
    private final String addressLine;  // never null
    private final String city;         // never null
    private final String postcode;     // optional, thus may be null

    // constructor ensures non-null fields really are non-null
    // optional field can just be stored directly, as null means optional
    public Address(String addressLine, String city, String postcode) {
      this.addressLine = Preconditions.checkNotNull(addressLine);
      this.city = Preconditions.checkNotNull(city);
      this.postcode = postcode;
    }

    // normal getters
    public String getAddressLine() { return addressLine; }
    public String getCity() { return city; }

    // special getter for optional field
    public Optional<String> getPostcode() {
      return Optional.ofNullable(postcode);
    }

    // return optional instead of null for business logic methods that may not find a result
    public static Optional<Address> findAddress(String userInput) {
      return ... // find the address, returning Optional.empty() if not found
    }
}

See the full article here:
http://blog.joda.org/2015/08/java-se-8-optional-pragmatic-approach.html

Choose your poison

We cannot change the JDK. JDK APIs are a mix of nullable and Optional. But we can change our own business logic. Think carefully before introducing Optional, as this new optional type – unlike what its name suggests – is an all-or-nothing type. Remember that by introducing Optional into your code-base, you implicitly assume the following:

// Cardinality 1:
Type t1;

// Cardinality 0-1:
Optional<Type> t01;

// Cardinality 0..n:
Iterable<Type> tn;

From there on, your code-base should no longer use the simple non-Optional Type type for 0-1 cardinalities. Ever.

jOOQ Tuesdays: Thomas Müller Unveils How HSQLDB Evolved into the Popular H2 Database


Welcome to the jOOQ Tuesdays series. In this series, we’ll publish an article on the third Tuesday every other month where we interview someone we find exciting in our industry from a jOOQ perspective. This includes people who work with SQL, Java, Open Source, and a variety of other related topics.


We have the pleasure of talking to Thomas Müller in this fifth edition, who will be telling us about the exciting history of Java’s most popular embedded database, H2.

Hi Thomas – Your H2 database is virtually everywhere. It has become many Java developers’ favourite integration test database. How did it grow so popular?

I guess because it’s easy to use, relatively fast, and, up to some point, compatible with popular databases.

I understand that you have a common, personal history with HSQLDB, previously also known as Hypersonic SQL. How and why did H2 evolve from HSQLDB?

Back in 1998, I wanted to learn Java. For fun, I implemented the relatively new skip list algorithm, added a SQL interface, and published it as open source. I got feedback from people who thought it’s useful, so I continued and gave it the name Hypersonic SQL. In 2000 I got a job offer from a Silicon Valley startup, PointBase Inc. The plan was to continue with Hypersonic SQL, and keep it open source. But after I started, the company decided it’s better if I stop. This was a surprise to me. I told them I can’t prevent others from continuing. And so Fred Toussi took the code and started HSQLDB with it. Around 2005, PointBase ran out of money, and I wanted to go back to HSQLDB. But I felt more radical changes were needed, and it would be better to start a new project instead, which was then H2.

Very interesting historic facts! … and when will you replace Derby / JavaDB in the JDK? :-)

Hopefully never! If H2 is integrated in the JDK, Oracle would put constraints on the future of H2. I don’t want that. I want to keep H2 independent.

You’ve been in the database industry for a while. What is your most interesting anecdote that you’d like to share?

When Oracle bought MySQL, I was very surprised to get mail from the European Union with a large questionnaire about the merger, and how it would affect the industry and competition. I don’t know how they found my work address; it is not on the web site, and they never asked me by email. And Switzerland is not even part of the European Union. H2 is a very small fish in the “database pond”, but it seems H2 does matter.

It’s a small world or small pond, I guess!

You’re one of the few developers I know who is working both on a SQL database (H2) and on a NoSQL database (JackRabbit, Adobe CRX). Tell us a little bit about how those databases compare.

They are quite different. H2 is a relational database with the traditional SQL and JDBC API, and Jackrabbit is a mix between a file system and a database, with a very different API and a hierarchical data model. The query language is different as well: for Jackrabbit, XPath is more commonly used, even though SQL is available as well. Both the relational and the hierarchical models have advantages and disadvantages. The hierarchical model more easily supports semi-structured, JSON style data. The relational model, on the other hand, is more “mature”, and there is more competition.

At Adobe / JackRabbit, you’re heavily involved with implementing storage algorithms in Java. Is the JVM even good at implementing low-level storage stuff?

Yes! The real advantage of Java is that a programmer can concentrate on the algorithms, and doesn’t have to spend so much time on memory management and low level stuff. That way, there is more time to improve the algorithms. And in relational databases, the most important aspect is using the best possible algorithms, for example to reduce I/O. Even for very low level, CPU intensive stuff like data compression and encryption, things like concurrency and cache efficiency nowadays are more important than whether to use Java or C.

That’s an interesting thought along the lines of avoiding premature optimisation!

One last question: What problems are you working on right now?

Optimizing the database for solid state disks (SSDs) and new file systems. Almost all relational databases still use algorithms optimized for rotating disks, where overwriting small blocks (for example 4 KB) was the way to go. With SSDs and Btrfs this doesn’t work well. H2 version 1.4.x (beta) already uses a new storage subsystem (MVStore) that should work well, however there is still some work needed before it is ready for production.

Other than database stuff, I’m interested in various programming topics. If I implement something that might be useful, I publish it as open source within my H2 database project, in the “tools” directory, until it is used in the database or moved to another project. I wrote an archiving utility (like zip, gzip) called “ArchiveTool” that combines de-duplication with regular compression. It is fast, but still compresses large directories (source code, databases) very well. As part of my work on the new storage subsystem, I came across minimal perfect hash tables. I invented a new algorithm that needs less space than all known ones (“MinimalPerfectHash”). I would like to publish a paper about that. There are plenty of interesting problems to solve.

Common SQL Clauses and Their Equivalents in Java 8 Streams


Functional programming allows for quasi-declarative programming in a general purpose language. By using powerful fluent APIs like Java 8’s Stream API, or jOOλ’s sequential Stream extension Seq, or more sophisticated libraries like javaslang or functionaljava, we can express data transformation algorithms in an extremely concise way. Compare Mario Fusco’s imperative and functional version of the same algorithm, the gist of which goes along these lines (a sketch; fileName is an assumption):
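// Imperative: spell out *how* to collect the first 40 "ERROR" lines
List<String> errors = new ArrayList<>();
int errorCount = 0;

try (BufferedReader reader =
        new BufferedReader(new FileReader(fileName))) {
    String line = reader.readLine();

    while (errorCount < 40 && line != null) {
        if (line.startsWith("ERROR")) {
            errors.add(line);
            errorCount++;
        }
        line = reader.readLine();
    }
}

// Functional: declare *what* to collect
List<String> errors = Files.lines(Paths.get(fileName))
                           .filter(line -> line.startsWith("ERROR"))
                           .limit(40)
                           .collect(Collectors.toList());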

Using such APIs, functional programming certainly feels like true declarative programming.

The most popular true declarative programming language is SQL. When you join two tables, you don’t tell the RDBMS how to implement that join. It may decide at its discretion whether a nested loop, merge join, hash join, or some other algorithm is the most suitable in the context of the complete query and of all the available meta information. This is extremely powerful because the performance assumptions that are valid for a simple join may no longer be valid for a complex one, where a different algorithm would outperform the original one. By this abstraction, you can just easily modify a query in 30 seconds, without worrying about low-level details like algorithms or performance.

When an API allows you to combine both (e.g. jOOQ and Streams), you will get the best of both worlds – and those worlds aren’t too different.

In the following sections, we’ll compare common SQL constructs with their equivalent expressions written in Java 8 using Streams and jOOλ, in case the Stream API doesn’t offer enough functionality.

Tuples

For the sake of this article, we’re going to assume that SQL rows / records have an equivalent representation in Java. For this, we’ll be using jOOλ’s Tuple type, which is essentially:

public class Tuple2<T1, T2> {

    public final T1 v1;
    public final T2 v2;

    public Tuple2(T1 v1, T2 v2) {
        this.v1 = v1;
        this.v2 = v2;
    }
}

… plus a lot of useful gimmicks like Tuple being Comparable, etc.

Note that we’re assuming the following imports in this and all subsequent examples.

import static org.jooq.lambda.Seq.*;
import static org.jooq.lambda.tuple.Tuple.*;

import java.util.*;
import java.util.function.*;
import java.util.stream.*;

import org.jooq.lambda.*;

Much like SQL rows, a tuple is a “value-based” type, meaning that it doesn’t really have an identity. Two tuples (1, 'A') and (1, 'A') can be considered exactly equivalent. Removing identity from the game makes SQL and functional programming with immutable data structures extremely elegant.
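For those value semantics to hold in Java, Tuple2 also needs value-based equals() and hashCode() – a minimal sketch of what jOOλ implements for us:

@Override
public boolean equals(Object o) {
    if (!(o instanceof Tuple2))
        return false;

    Tuple2<?, ?> that = (Tuple2<?, ?>) o;
    return Objects.equals(v1, that.v1)
        && Objects.equals(v2, that.v2);
}

@Override
public int hashCode() {
    return Objects.hash(v1, v2);
}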

FROM = of(), stream(), etc.

In SQL, the FROM clause logically (but not syntactically) precedes all the other clauses. It is used to produce a set of tuples from at least one table, possibly multiple joined tables. A single-table FROM clause can be trivially mapped to Stream.of(), for instance, or to any other method that simply produces a stream:

SQL

SELECT *
FROM (
  VALUES(1, 1),
        (2, 2)
) t(v1, v2)

yielding

+----+----+
| v1 | v2 |
+----+----+
|  1 |  1 |
|  2 |  2 |
+----+----+

Java

Stream.of(
  tuple(1, 1),
  tuple(2, 2)
).forEach(System.out::println);

yielding

(1, 1)
(2, 2)

CROSS JOIN = flatMap()

Selecting from multiple tables is already more interesting. The easiest way to combine two tables in SQL is by producing a cartesian product, either via a table list or using a CROSS JOIN. The following two are equivalent SQL statements:

SQL

-- Table list syntax
SELECT *
FROM (VALUES( 1 ), ( 2 )) t1(v1), 
     (VALUES('A'), ('B')) t2(v2)

-- CROSS JOIN syntax
SELECT *
FROM       (VALUES( 1 ), ( 2 )) t1(v1)
CROSS JOIN (VALUES('A'), ('B')) t2(v2)

yielding

+----+----+
| v1 | v2 |
+----+----+
|  1 |  A |
|  1 |  B |
|  2 |  A |
|  2 |  B |
+----+----+

In a cross join (or cartesian product), every value from t1 is combined with every value from t2 producing size(t1) * size(t2) rows in total.

Java

In functional programming using Java 8’s Stream, the Stream.flatMap() method corresponds to SQL CROSS JOIN as can be seen in the following example:

Stream<Integer> s1 = Stream.of(1, 2);
Supplier<Stream<String>> s2 = () -> Stream.of("A", "B");

s1.flatMap(v1 -> s2.get()
                   .map(v2 -> tuple(v1, v2)))
  .forEach(System.out::println);

yielding

(1, A)
(1, B)
(2, A)
(2, B)

Note how we have to wrap the second stream in a Supplier because streams can be consumed only once, but the above algorithm is really implementing a nested loop, combining all elements of stream s2 with each element from stream s1. An alternative would be not to use streams but lists (which we will do in subsequent examples, for simplicity):

List<Integer> s1 = Arrays.asList(1, 2);
List<String> s2 = Arrays.asList("A", "B");

s1.stream()
  .flatMap(v1 -> s2.stream()
                   .map(v2 -> tuple(v1, v2)))
  .forEach(System.out::println);

In fact, CROSS JOIN can be chained easily both in SQL and in Java:

SQL

-- Table list syntax
SELECT *
FROM (VALUES( 1 ), ( 2 )) t1(v1), 
     (VALUES('A'), ('B')) t2(v2), 
     (VALUES('X'), ('Y')) t3(v3)

-- CROSS JOIN syntax
SELECT *
FROM       (VALUES( 1 ), ( 2 )) t1(v1)
CROSS JOIN (VALUES('A'), ('B')) t2(v2)
CROSS JOIN (VALUES('X'), ('Y')) t3(v3)

yielding

+----+----+----+
| v1 | v2 | v3 |
+----+----+----+
|  1 |  A |  X |
|  1 |  A |  Y |
|  1 |  B |  X |
|  1 |  B |  Y |
|  2 |  A |  X |
|  2 |  A |  Y |
|  2 |  B |  X |
|  2 |  B |  Y |
+----+----+----+

Java

List<Integer> s1 = Arrays.asList(1, 2);
List<String> s2 = Arrays.asList("A", "B");
List<String> s3 = Arrays.asList("X", "Y");

s1.stream()
  .flatMap(v1 -> s2.stream()
                   .map(v2 -> tuple(v1, v2)))
  .flatMap(v12-> s3.stream()
                   .map(v3 -> tuple(v12.v1, v12.v2, v3)))
  .forEach(System.out::println);

yielding

(1, A, X)
(1, A, Y)
(1, B, X)
(1, B, Y)
(2, A, X)
(2, A, Y)
(2, B, X)
(2, B, Y)

Note how we explicitly unnested the tuples from the first CROSS JOIN operation to form “flat” tuples in the second operation. This is optional, of course.

Java with jOOλ’s crossJoin()

Us jOOQ developers, we’re a very SQL-oriented people, so it is only natural to have added a crossJoin() convenience method for the above use-case. So our triple-cross join can be written like this:

Seq<Integer> s1 = Seq.of(1, 2);
Seq<String> s2 = Seq.of("A", "B");
Seq<String> s3 = Seq.of("X", "Y");

s1.crossJoin(s2)
  .crossJoin(s3)
  .forEach(System.out::println);

yielding

((1, A), X)
((1, A), Y)
((1, B), X)
((1, B), Y)
((2, A), X)
((2, A), Y)
((2, B), X)
((2, B), Y)

In this case, we didn’t unnest the tuple produced in the first cross join. From a merely relational perspective, this doesn’t matter either. Nested tuples are the same thing as flat tuples. In SQL, we just don’t see the nesting. Of course, we could still unnest as well by adding a single additional mapping:

Seq<Integer> s1 = Seq.of(1, 2);
Seq<String> s2 = Seq.of("A", "B");
Seq<String> s3 = Seq.of("X", "Y");

s1.crossJoin(s2)
  .crossJoin(s3)
  .map(t -> tuple(t.v1.v1, t.v1.v2, t.v2))
  .forEach(System.out::println);

yielding, again

(1, A, X)
(1, A, Y)
(1, B, X)
(1, B, Y)
(2, A, X)
(2, A, Y)
(2, B, X)
(2, B, Y)

(You may have noticed that map() corresponds to SELECT as we’ll see again later on)

INNER JOIN = flatMap() with filter()

The SQL INNER JOIN is essentially just syntactic sugar for a SQL CROSS JOIN with a predicate that reduces the tuple set after cross-joining. In SQL, the following two ways of inner joining are equivalent:

SQL

-- Table list syntax
SELECT *
FROM (VALUES(1), (2)) t1(v1), 
     (VALUES(1), (3)) t2(v2)
WHERE t1.v1 = t2.v2

-- INNER JOIN syntax
SELECT *
FROM       (VALUES(1), (2)) t1(v1)
INNER JOIN (VALUES(1), (3)) t2(v2)
ON t1.v1 = t2.v2

yielding

+----+----+
| v1 | v2 |
+----+----+
|  1 |  1 |
+----+----+

(note that the keyword INNER is optional).

So, the value 2 from t1 and the value 3 from t2 are “thrown away”, as they do not produce any rows for which the join predicate yields true.

The same can be expressed easily, yet more verbosely, in Java:

Java (inefficient solution!)

List<Integer> s1 = Arrays.asList(1, 2);
List<Integer> s2 = Arrays.asList(1, 3);

s1.stream()
  .flatMap(v1 -> s2.stream()
                   .map(v2 -> tuple(v1, v2)))
  .filter(t -> Objects.equals(t.v1, t.v2))
  .forEach(System.out::println);

The above correctly yields

(1, 1)

But beware that you’re attaining this result only after producing a cartesian product, the nightmare of every DBA! As mentioned at the beginning of this article, unlike in declarative programming, in functional programming you instruct your program to execute exactly the operations you specify, in exactly the order you specify. In other words:

In functional programming, you define the exact “execution plan” of your query.

In declarative programming, an optimiser may reorganise your “program”

There is no optimiser to transform the above into the much more efficient:

Java (more efficient)

List<Integer> s1 = Arrays.asList(1, 2);
List<Integer> s2 = Arrays.asList(1, 3);

s1.stream()
  .flatMap(v1 -> s2.stream()
                   .filter(v2 -> Objects.equals(v1, v2))
                   .map(v2 -> tuple(v1, v2)))
  .forEach(System.out::println);

The above also yields

(1, 1)

Notice how the join predicate has moved from the “outer” stream into the “inner” stream that is produced in the function passed to flatMap().

Java (optimal)

As mentioned previously, functional programming doesn’t necessarily allow you to rewrite algorithms depending on knowledge of the actual data. The join implementations presented above always perform nested loop joins, going from the first stream to the second. If you join more than two streams, or if the second stream is very large, this approach can be terribly inefficient. A sophisticated RDBMS would never blindly apply nested loop joins like that, but would consider constraints, indexes, and histograms on the actual data.
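For the equi-join case, a hash join can at least be sketched by hand: build a hash index over one side, then probe it with the other, turning the O(n * m) nested loop into roughly O(n + m):

List<Integer> s1 = Arrays.asList(1, 2);
List<Integer> s2 = Arrays.asList(1, 3);

// Build phase: index the right side by its join key (here: the value)
Map<Integer, List<Integer>> index = s2.stream()
    .collect(Collectors.groupingBy(v -> v));

// Probe phase: a single O(1) lookup per element of the left side
s1.stream()
  .flatMap(v1 -> index.getOrDefault(v1, Collections.emptyList())
                      .stream()
                      .map(v2 -> tuple(v1, v2)))
  .forEach(System.out::println);

The above also yields

(1, 1)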

Going deeper into that topic would be out of scope for this article, though.

Java with jOOλ’s innerJoin()

Again, inspired by our work on jOOQ we’ve also added an innerJoin() convenience method for the above use-case:

Seq<Integer> s1 = Seq.of(1, 2);
Seq<Integer> s2 = Seq.of(1, 3);

s1.innerJoin(s2, (t, u) -> Objects.equals(t, u))
  .forEach(System.out::println);

yielding

(1, 1)

… because after all, when joining two streams, the only really interesting operation is the join Predicate. All else (flatmapping, etc.) is just boilerplate.

LEFT OUTER JOIN = flatMap() with filter() and a “default”

SQL’s OUTER JOIN works like INNER JOIN, except that additional “default” rows are produced in case the JOIN predicate yields false for a pair of tuples. In terms of set theory / relational algebra, a LEFT OUTER JOIN can be expressed in a SQL-esque dialect as such:

R LEFT OUTER JOIN S ::=

R INNER JOIN S
UNION (
  (R EXCEPT (SELECT R.* FROM R INNER JOIN S))
  CROSS JOIN
  (null, null, ..., null)
)

This simply means that when left outer joining S to R, there will be at least one row in the result for each row in R, with possibly an empty value for S.

Inversely, when right outer joining S to R, there will be at least one row in the result for each row in S, with possibly an empty value for R.

And finally, when full outer joining S to R, there will be at least one row in the result for each row in R with possibly an empty value for S AND for each row in S with possibly an empty value for R.

Let us look at LEFT OUTER JOIN, which is used most often in SQL.

SQL

-- Table list, Oracle syntax (don't use this!)
SELECT *
FROM (SELECT 1 v1 FROM DUAL
      UNION ALL 
      SELECT 2 v1 FROM DUAL) t1, 
     (SELECT 1 v2 FROM DUAL
      UNION ALL
      SELECT 3 v2 FROM DUAL) t2
WHERE t1.v1 = t2.v2 (+)

-- OUTER JOIN syntax
SELECT *
FROM            (VALUES(1), (2)) t1(v1)
LEFT OUTER JOIN (VALUES(1), (3)) t2(v2)
ON t1.v1 = t2.v2

yielding

+----+------+
| v1 |   v2 |
+----+------+
|  1 |    1 |
|  2 | null |
+----+------+

(note that the keyword OUTER is optional).

Java

Unfortunately, the JDK’s Stream API doesn’t provide us with an easy way to produce “at least” one value from a stream, in case the stream is empty. We could write a utility function as explained by Stuart Marks on Stack Overflow:

static <T> Stream<T> defaultIfEmpty(
    Stream<T> stream, Supplier<T> supplier) {
    Iterator<T> iterator = stream.iterator();

    if (iterator.hasNext()) {
        return StreamSupport.stream(
            Spliterators.spliteratorUnknownSize(
                iterator, 0
            ), false);
    } else {
        return Stream.of(supplier.get());
    }
}

Or, we just use jOOλ’s Seq.onEmpty()

List<Integer> s1 = Arrays.asList(1, 2);
List<Integer> s2 = Arrays.asList(1, 3);

seq(s1)
    .flatMap(v1 -> seq(s2)
        .filter(v2 -> Objects.equals(v1, v2))
        .onEmpty(null)
        .map(v2 -> tuple(v1, v2)))
    .forEach(System.out::println);

(notice, we’re putting null in a stream. This might not always be a good idea. We’ll follow up with that in a future blog post)

The above also yields

(1, 1)
(2, null)

How to read the implicit left outer join?

  • We’ll take each value v1 from the left stream s1
  • For each such value v1, we flatmap the right stream s2 to produce a tuple (v1, v2) (a cartesian product, cross join)
  • We’ll apply the join predicate for each such tuple (v1, v2)
  • If the join predicate leaves no tuples for any value v2, we’ll generate a single tuple containing the value of the left stream v1 and null

Java with jOOλ

For convenience, jOOλ also supports leftOuterJoin() which works as described above:

Seq<Integer> s1 = Seq.of(1, 2);
Seq<Integer> s2 = Seq.of(1, 3);

s1.leftOuterJoin(s2, (t, u) -> Objects.equals(t, u))
  .forEach(System.out::println);

yielding

(1, 1)
(2, null)

RIGHT OUTER JOIN = inverse LEFT OUTER JOIN

Trivially, a RIGHT OUTER JOIN is just the inverse of the previous LEFT OUTER JOIN. The jOOλ implementation of rightOuterJoin() looks like this:

default <U> Seq<Tuple2<T, U>> rightOuterJoin(
    Stream<U> other, BiPredicate<T, U> predicate) {
    return seq(other)
          .leftOuterJoin(this, (u, t) -> predicate.test(t, u))
          .map(t -> tuple(t.v2, t.v1));
}

As you can see, the RIGHT OUTER JOIN inverses the results of a LEFT OUTER JOIN, that’s it. For example:

Seq<Integer> s1 = Seq.of(1, 2);
Seq<Integer> s2 = Seq.of(1, 3);

s1.rightOuterJoin(s2, (t, u) -> Objects.equals(t, u))
  .forEach(System.out::println);

yielding

(1, 1)
(null, 3)

WHERE = filter()

The most straightforward mapping is probably SQL’s WHERE clause, which has an exact equivalent in the Stream API: Stream.filter().

SQL

SELECT *
FROM (VALUES(1), (2), (3)) t(v)
WHERE v % 2 = 0

yielding

+---+
| v |
+---+
| 2 |
+---+

Java

Stream<Integer> s = Stream.of(1, 2, 3);

s.filter(v -> v % 2 == 0)
 .forEach(System.out::println);

yielding

2

The interesting thing about filter() and the Stream API in general is that the operation can apply at any place in the call chain, unlike the WHERE clause, which may only be placed right after the FROM clause – even if SQL’s JOIN .. ON and HAVING clauses are semantically similar.
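For instance, the same filter() can run before or after a map(), which loosely corresponds to WHERE vs. HAVING:

Stream.of(1, 2, 3)
      .filter(v -> v % 2 == 1) // before map(): WHERE-like
      .map(v -> v * 3)
      .filter(v -> v > 5)      // after map(): HAVING-like
      .forEach(System.out::println);

yielding

9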

GROUP BY = collect()

The least straight-forward mapping is GROUP BY vs. Stream.collect().

First off, SQL’s GROUP BY may be a bit tricky to fully understand. It is really part of the FROM clause, transforming the set of tuples produced by FROM .. JOIN .. WHERE into groups of tuples, where each group has an associated set of aggregatable tuples, which can be aggregated in the HAVING, SELECT, and ORDER BY clauses. Things get even more interesting when you use OLAP features like GROUPING SETS, which allow for duplicating tuples according to several grouping combinations.

In most SQL implementations that don’t support ARRAY or MULTISET, the aggregatable tuples are not available as such (i.e. as nested collections) in the SELECT. Here, the Stream API’s feature set excels. On the other hand, the Stream API can group values only as a terminal operation, whereas in SQL, GROUP BY is applied purely declaratively (and thus, lazily). The execution planner may choose not to execute the GROUP BY at all if it is not needed. For instance:

SELECT *
FROM some_table
WHERE EXISTS (
    SELECT x, sum(y)
    FROM other_table
    GROUP BY x
)

The above query is semantically equivalent to

SELECT *
FROM some_table
WHERE EXISTS (
    SELECT 1
    FROM other_table
)

The grouping in the subquery was unnecessary. Someone may have copy-pasted that subquery in there from somewhere else, or refactored the query as a whole. In Java, using the Stream API, each operation is always executed.

For the sake of simplicity, we’ll stick to the simplest examples here.

Aggregation without GROUP BY

A special case is when we do not specify any GROUP BY clause. In that case, we can specify aggregations over all columns of the FROM clause, always producing exactly one record. For instance:

SQL

SELECT sum(v)
FROM (VALUES(1), (2), (3)) t(v)

yielding

+-----+
| sum |
+-----+
|   6 |
+-----+

Java

Stream<Integer> s = Stream.of(1, 2, 3);

int sum = s.collect(Collectors.summingInt(i -> i));
System.out.println(sum);

yielding

6

Aggregation with GROUP BY

A more common case of aggregation in SQL is to specify an explicit GROUP BY clause as explained before. For instance, we may want to group by even and odd numbers:

SQL

SELECT v % 2, count(v), sum(v)
FROM (VALUES(1), (2), (3)) t(v)
GROUP BY v % 2

yielding

+-------+-------+-----+
| v % 2 | count | sum |
+-------+-------+-----+
|     0 |     1 |   2 |
|     1 |     2 |   4 |
+-------+-------+-----+

Java

For this simple grouping / collection use-case, luckily, the JDK offers a utility method called Collectors.groupingBy(), which produces a collector that generates a Map<K, List<V>> type like this:

Stream<Integer> s = Stream.of(1, 2, 3);

Map<Integer, List<Integer>> map = s.collect(
    Collectors.groupingBy(v -> v % 2)
);

System.out.println(map);

yielding

{0=[2], 1=[1, 3]}

This certainly takes care of the grouping. Now we want to produce aggregations for each group. The slightly awkward JDK way to do this would be:

Stream<Integer> s = Stream.of(1, 2, 3);

Map<Integer, IntSummaryStatistics> map = s.collect(
    Collectors.groupingBy(
        v -> v % 2,
        Collectors.summarizingInt(i -> i)
    )
);

System.out.println(map);

we’ll now get:

{0=IntSummaryStatistics{count=1, sum=2, min=2, average=2.000000, max=2},
 1=IntSummaryStatistics{count=2, sum=4, min=1, average=2.000000, max=3}}

As you can see, the count() and sum() values have been calculated along the way.

More sophisticated GROUP BY

When doing multiple aggregations with Java 8’s Stream API, you will quickly be forced to wrestle with low-level APIs, implementing complicated collectors and accumulators yourself. This is tedious and unnecessary. Consider the following SQL statement:

SQL

CREATE TABLE t (
  w INT,
  x INT,
  y INT,
  z INT
);

SELECT
    z, w, 
    MIN(x), MAX(x), AVG(x), 
    MIN(y), MAX(y), AVG(y) 
FROM t
GROUP BY z, w;

In one go, we want to:

  • Group by several values
  • Aggregate from several values

Java

In a previous article, we’ve explained in detail how this can be achieved using convenience API from jOOλ via Seq.groupBy()

class A {
    final int w;
    final int x;
    final int y;
    final int z;
 
    A(int w, int x, int y, int z) {
        this.w = w;
        this.x = x;
        this.y = y;
        this.z = z;
    }
}

Map<
    Tuple2<Integer, Integer>, 
    Tuple2<IntSummaryStatistics, IntSummaryStatistics>
> map =
Seq.of(
    new A(1, 1, 1, 1),
    new A(1, 2, 3, 1),
    new A(9, 8, 6, 4),
    new A(9, 9, 7, 4),
    new A(2, 3, 4, 5),
    new A(2, 4, 4, 5),
    new A(2, 5, 5, 5))
 
// Seq.groupBy() is just short for 
// Stream.collect(Collectors.groupingBy(...))
.groupBy(
    a -> tuple(a.z, a.w),
 
    // ... because once you have tuples, 
    // why not add tuple-collectors?
    Tuple.collectors(
        Collectors.summarizingInt(a -> a.x),
        Collectors.summarizingInt(a -> a.y)
    )
);

System.out.println(map);

The above yields

{(1, 1)=(IntSummaryStatistics{count=2, sum=3, min=1, average=1.500000, max=2},
         IntSummaryStatistics{count=2, sum=4, min=1, average=2.000000, max=3}),
 (4, 9)=(IntSummaryStatistics{count=2, sum=17, min=8, average=8.500000, max=9},
         IntSummaryStatistics{count=2, sum=13, min=6, average=6.500000, max=7}),
 (5, 2)=(IntSummaryStatistics{count=3, sum=12, min=3, average=4.000000, max=5},
         IntSummaryStatistics{count=3, sum=13, min=4, average=4.333333, max=5})}

For more details, read the full article here.

Notice how using Stream.collect() or Seq.groupBy() already makes for an implicit SELECT clause, which we thus no longer need to obtain via map() (see below).

HAVING = filter(), again

As mentioned before, there aren’t really different ways of applying predicates with the Stream API, there is only Stream.filter(). In SQL, HAVING is a “special” predicate clause that is syntactically put after the GROUP BY clause. For instance:

SQL

SELECT v % 2, count(v)
FROM (VALUES(1), (2), (3)) t(v)
GROUP BY v % 2
HAVING count(v) > 1

yielding

+-------+-------+
| v % 2 | count |
+-------+-------+
|     1 |     2 |
+-------+-------+

Java

Unfortunately, as we have seen before, collect() is a terminal operation in the Stream API, which means that it eagerly produces a Map, instead of transforming the Stream<T> into a Stream<Tuple2<K, Stream<V>>>, which would compose much better in complex stream pipelines. This means that any operation that we’d like to apply right after collecting will have to be applied on a new stream produced from the output Map:

Stream<Integer> s = Stream.of(1, 2, 3);

s.collect(Collectors.groupingBy(
      v -> v % 2,
      Collectors.summarizingInt(i -> i)
  ))
  .entrySet()
  .stream()
  .filter(e -> e.getValue().getCount() > 1)
  .forEach(System.out::println);

yielding

1=IntSummaryStatistics{count=2, sum=4, min=1, average=2.000000, max=3}

As you can see, the type transformation that is applied is:

  • Map<Integer, IntSummaryStatistics>
  • Set<Entry<Integer, IntSummaryStatistics>>
  • Stream<Entry<Integer, IntSummaryStatistics>>

SELECT = map()

The SELECT clause in SQL is nothing more than a tuple transformation function that takes the cartesian product of tuples produced by the FROM clause and transforms it into a new tuple expression, which is fed either to the client, or to some higher-level query if this is a nested SELECT. An illustration:

FROM output

+------+------+------+------+------+
| T1.A | T1.B | T1.C | T2.A | T2.D |
+------+------+------+------+------+
|    1 |    A |    a |    1 |    X |
|    1 |    B |    b |    1 |    Y |
|    2 |    C |    c |    2 |    X |
|    2 |    D |    d |    2 |    Y |
+------+------+------+------+------+

Applying SELECT

SELECT t1.a, t1.c, t1.b || t2.d

+------+------+--------------+
| T1.A | T1.C | T1.B || T2.D |
+------+------+--------------+
|    1 |    a |           AX |
|    1 |    b |           BY |
|    2 |    c |           CX |
|    2 |    d |           DY |
+------+------+--------------+

Using Java 8 Streams, SELECT can be achieved very simply by using Stream.map(), as we’ve already seen in previous examples, where we unnested tuples using map(). The following examples are functionally equivalent:

SQL

SELECT t.v1 * 3, t.v2 + 5
FROM (
  VALUES(1, 1),
        (2, 2)
) t(v1, v2)

yielding

+----+----+
| c1 | c2 |
+----+----+
|  3 |  6 |
|  6 |  7 |
+----+----+

Java

Stream.of(
  tuple(1, 1),
  tuple(2, 2)
).map(t -> tuple(t.v1 * 3, t.v2 + 5))
 .forEach(System.out::println);

yielding

(3, 6)
(6, 7)

DISTINCT = distinct()

The DISTINCT keyword that can be supplied with the SELECT clause simply removes duplicate tuples right after they have been produced by the SELECT clause. An illustration:

FROM output

+------+------+------+------+------+
| T1.A | T1.B | T1.C | T2.A | T2.D |
+------+------+------+------+------+
|    1 |    A |    a |    1 |    X |
|    1 |    B |    b |    1 |    Y |
|    2 |    C |    c |    2 |    X |
|    2 |    D |    d |    2 |    Y |
+------+------+------+------+------+

Applying SELECT DISTINCT

SELECT DISTINCT t1.a

+------+
| T1.A |
+------+
|    1 |
|    2 |
+------+

Using Java 8 Streams, SELECT DISTINCT can be achieved very simply by using Stream.distinct() right after Stream.map(). The following examples are functionally equivalent:

SQL

SELECT DISTINCT t.v1 * 3, t.v2 + 5
FROM (
  VALUES(1, 1),
        (2, 2),
        (2, 2)
) t(v1, v2)

yielding

+----+----+
| c1 | c2 |
+----+----+
|  3 |  6 |
|  6 |  7 |
+----+----+

Java

Stream.of(
  tuple(1, 1),
  tuple(2, 2),
  tuple(2, 2)
).map(t -> tuple(t.v1 * 3, t.v2 + 5))
 .distinct()
 .forEach(System.out::println);

yielding

(3, 6)
(6, 7)

UNION ALL = concat()

Set operations are powerful both in SQL and using the Stream API. The UNION ALL operation maps to Stream.concat(), as can be seen below:

SQL

SELECT *
FROM (VALUES(1), (2)) t(v)
UNION ALL
SELECT *
FROM (VALUES(1), (3)) t(v)

yielding

+---+
| v |
+---+
| 1 |
| 2 |
| 1 |
| 3 |
+---+

Java

Stream<Integer> s1 = Stream.of(1, 2);
Stream<Integer> s2 = Stream.of(1, 3);

Stream.concat(s1, s2)
      .forEach(System.out::println);

yielding

1
2
1
3

Java (using jOOλ)

Unfortunately, concat() exists in Stream only as a static method, while Seq.concat() also exists on instances when working with jOOλ.

Seq<Integer> s1 = Seq.of(1, 2);
Seq<Integer> s2 = Seq.of(1, 3);

s1.concat(s2)
  .forEach(System.out::println);

UNION = concat() and distinct()

In SQL, UNION is defined to remove duplicates after concatenating the two sets via UNION ALL. The following two statements are equivalent:

SELECT * FROM t
UNION
SELECT * FROM u;

-- equivalent

SELECT DISTINCT *
FROM (
  SELECT * FROM t
  UNION ALL
  SELECT * FROM u
);

Let’s put this in action:

SQL

SELECT *
FROM (VALUES(1), (2)) t(v)
UNION
SELECT *
FROM (VALUES(1), (3)) t(v)

yielding

+---+
| v |
+---+
| 1 |
| 2 |
| 3 |
+---+

Java

Stream<Integer> s1 = Stream.of(1, 2);
Stream<Integer> s2 = Stream.of(1, 3);

Stream.concat(s1, s2)
      .distinct()
      .forEach(System.out::println);

ORDER BY = sorted()

The ORDER BY mapping is trivial

SQL

SELECT *
FROM (VALUES(1), (4), (3)) t(v)
ORDER BY v

yielding

+---+
| v |
+---+
| 1 |
| 3 |
| 4 |
+---+

Java

Stream<Integer> s = Stream.of(1, 4, 3);

s.sorted()
 .forEach(System.out::println);

yielding

1
3
4

LIMIT = limit()

The LIMIT mapping is even more trivial

SQL

SELECT *
FROM (VALUES(1), (4), (3)) t(v)
LIMIT 2

yielding

+---+
| v |
+---+
| 1 |
| 4 |
+---+

Java

Stream<Integer> s = Stream.of(1, 4, 3);

s.limit(2)
 .forEach(System.out::println);

yielding

1
4

OFFSET = skip()

The OFFSET mapping is trivial as well

SQL

SELECT *
FROM (VALUES(1), (4), (3)) t(v)
OFFSET 1

yielding

+---+
| v |
+---+
| 4 |
| 3 |
+---+

Java

Stream<Integer> s = Stream.of(1, 4, 3);

s.skip(1)
 .forEach(System.out::println);

yielding

4
3

Conclusion

In the above article, we’ve seen pretty much all the useful SQL SELECT query clauses and how they can be mapped to the Java 8 Stream API, or to jOOλ’s Seq API, in case Stream doesn’t offer sufficient functionality.

The article shows that SQL’s declarative world is not that much different from Java 8’s functional world. SQL clauses can compose ad-hoc queries just as well as Stream methods can be used to compose functional transformation pipelines. But there is a fundamental difference.

While SQL is truly declarative, functional programming is still imperative in this respect: the Stream API does not make optimisation decisions based on constraints, indexes, histograms, and other meta information about the data that you’re transforming. Using the Stream API is like using all possible optimisation hints in SQL to force the SQL engine to choose one particular execution plan over another. However, while SQL is a higher-level algorithm abstraction, the Stream API may allow you to implement more customisable algorithms.

Top 10 Useful, Yet Paranoid Java Programming Techniques


After coding for a while (eek, almost 20 years or so in my case – time flies when you’re having fun), one starts to embrace certain habits. Because, you know…

Anything that Can Possibly Go Wrong, Does.

This is why people embrace “defensive programming”, i.e. paranoid habits that sometimes make total sense, and sometimes are rather obscure and/or clever and perhaps a bit eerie when you think of the person who wrote it. Here’s my personal list of top 10 useful, yet paranoid Java programming techniques. Let’s go:

1. Put the String literal first

It’s just never a bad idea to prevent the occasional NullPointerException by putting the String literal on the left side of an equals() comparison as such:

// Bad
if (variable.equals("literal")) { ... }

// Good
if ("literal".equals(variable)) { ... }

This is a no-brainer. Nothing is lost from rephrasing the expression from the less good version to the better one. If only we had true Options though, right? Different discussion…

2. Don’t trust the early JDK APIs

In the early days of Java, programming must’ve been a big pain. The APIs were still very immature and you might’ve come across a piece of code like this:

String[] files = file.list();

// Watch out
if (files != null) {
    for (int i = 0; i < files.length; i++) {
        ...
    }
}

Looking paranoid? Perhaps, but read the Javadoc:

If this abstract pathname does not denote a directory, then this method returns null. Otherwise an array of strings is returned, one for each file or directory in the directory.

Yeah right. Better add another check, though, just to be sure:

if (file.isDirectory()) {
    String[] files = file.list();

    // Watch out
    if (files != null) {
        for (int i = 0; i < files.length; i++) {
            ...
        }
    }
}

Bummer! Violation of rules #5 and #6 of our 10 Subtle Best Practices when Coding Java list. So be prepared and add that null check!

3. Don’t trust that “-1”

This is paranoid, I know. The Javadoc of String.indexOf() clearly states that…

the index of the first occurrence of the character in the character sequence represented by this object [is returned], or -1 if the character does not occur.

So, -1 can be taken for granted, right? I say nay. Consider this:

// Bad
if (string.indexOf(character) != -1) { ... }

// Good
if (string.indexOf(character) >= 0) { ... }

Who knows. Perhaps they’ll need ANOTHER return value at some point in time to say that the character would have been found, had the comparison been case-insensitive… Perhaps a good case for returning -2? Who knows.

After all, we’ve had billions of discussions about the billion dollar mistake, which is NULL. Why shouldn’t we start discussions about -1, which is – in a way – an alternative null for primitive type int?
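
If you want to banish that magic -1 from your own APIs, Java 8’s java.util.OptionalInt can at least wrap it. A minimal sketch – firstIndexOf() is a hypothetical helper, not part of the JDK:

// Hypothetical wrapper around the JDK's -1 convention:
static OptionalInt firstIndexOf(String string, char character) {
    int index = string.indexOf(character);
    return index >= 0 ? OptionalInt.of(index) : OptionalInt.empty();
}

This way, the “alternative null” -1 never leaks into calling code.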

4. Avoid the accidental assignment

Yep. It happens to the best (although, not to me. See #7).

(Assume this is JavaScript, but let’s be paranoid about the language as well)

// Ooops
if (variable = 5) { ... }

// Better (because causes an error)
if (5 = variable) { ... }

// Intent (remember. Paranoid JavaScript: ===)
if (5 === variable) { ... }

Again: if you have a literal in your expression, put it on the left side. Then, forgetting the second (or third) = sign can’t silently turn your comparison into an assignment – the compiler will complain instead.

5. Check for null AND length

Whenever you have a collection, array, etc., make sure it’s present AND not empty.

// Bad
if (array.length > 0) { ... }

// Good
if (array != null && array.length > 0) { ... }

You never know where those arrays come from. Perhaps from early JDK API?

6. All methods are final

You can tell me all you want about your open/closed principles, that’s all bollocks. I don’t trust you (to correctly extend my classes) and I don’t trust myself (to not accidentally extend my classes). Which is why everything that is not explicitly intended for subtyping (i.e. only interfaces) is strictly final. See also item #9 of our 10 Subtle Best Practices when Coding Java list.

// Bad
public void boom() { ... }

// Good. Don't touch.
public final void dontTouch() { ... }

Yes. It’s final. If that doesn’t work for you, patch it, or instrument it, or rewrite the byte code. Or send a feature request. I’m sure that your intent of overriding the above isn’t a good idea anyway.

7. All variables and parameters are final

As I said. I don’t trust myself (to not accidentally overwrite my values). Having said so, I don’t trust myself at all. Because…

yesterdays-regex

… which is why all variables and parameters are made final, too.

// Bad
void input(String importantMessage) {
    String answer = "...";

    answer = importantMessage = "LOL accident";
}

// Good
final void input(final String importantMessage) {
    final String answer = "...";
}

OK, I admit. This one, I don’t apply very often, really, although I should. I wish Java got it right like Scala, where people just type val all over the place, without even thinking about mutability – except when they need it explicitly (rarely!), via var.

8. Don’t trust generics when overloading

Yes. It can happen. You believe you wrote that super nice API which totally rocks and is totally intuitive, and along comes some user who just raw-casts everything up to Object until the darn compiler stops bitching, and suddenly they’ll link the wrong method, thinking it’s your fault (it always is).

Consider this:

// Bad
<T> void bad(T value) {
    bad(Collections.singletonList(value));
}

<T> void bad(List<T> values) {
    ...
}

// Good
final <T> void good(final T value) {
    if (value instanceof List)
        good((List<?>) value);
    else
        good(Collections.singletonList(value));
}

final <T> void good(final List<T> values) {
    ...
}

Because, you know… Your users, they’re like

// This library sucks
@SuppressWarnings("all")
Object t = (Object) (List) Arrays.asList("abc");
bad(t);

Trust me. I’ve seen everything.

It’s good to be paranoid.

9. Always throw on switch default

Switch… One of those funny statements where I don’t know whether to petrify with awe or to just cry. Anyway, we’re stuck with switch, so we may as well get it right when we have to. I.e.

// Bad
switch (value) {
    case 1: foo(); break;
    case 2: bar(); break;
}

// Good
switch (value) {
    case 1: foo(); break;
    case 2: bar(); break;
    default:
        throw new ThreadDeath("That'll teach them");
}

Because the moment when value == 3 is introduced into the software will come, for sure! And don’t say enum, because it’ll happen to enums as well – as the following sketch shows.
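
A minimal sketch, reusing foo() and bar() from above, with a hypothetical Status enum that grows a third constant in a later release:

enum Status { ACTIVE, INACTIVE /* , SUSPENDED added in v2.0 */ }

static void handle(Status status) {
    switch (status) {
        case ACTIVE:   foo(); break;
        case INACTIVE: bar(); break;

        // Without this, a newly added SUSPENDED constant
        // silently falls through and does nothing:
        default:
            throw new ThreadDeath("That'll teach them");
    }
}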

10. Switch with curly braces

In fact, switch is the most wicked statement anyone has ever allowed into a language while they were either drunk or had lost a bet. Consider the following example:

// Bad, doesn't compile
switch (value) {
    case 1: int j = 1; break;
    case 2: int j = 2; break;
}

// Good
switch (value) {
    case 1: {
        final int j = 1;
        break;
    }
    case 2: {
        final int j = 2;
        break;
    }

    // Remember:
    default: 
        throw new ThreadDeath("That'll teach them");
}

Within the switch statement, there is only one scope defined among all the case statements. In fact, these case statements aren’t even really statements, they’re like labels and the switch is a goto call. In fact, you could even compare case statements with the astonishing FORTRAN 77 ENTRY statement, a device whose mystery is only exceeded by its power.

This means that, without the blocks, a variable like int j would be in scope for all the subsequent cases, regardless of whether we issue a break or not. Not very intuitive. Which is why it’s always a good idea to create a new, nested scope per case statement via a simple block (but don’t forget the break within the block!).

Conclusion

Paranoid programming may seem weird at times, as code often turns out to be a bit more verbose than really needed. You might think, “oh this is never gonna happen”, but as I said. After 20 years or so programming, you just don’t want to fix those stupid little unnecessary bugs anymore that exist only because the language is so old and flawed. Because you know…

Now, it’s your turn!

What’s your most paranoid quirk in programming?


RAM is the new SSD


Your data fits in RAM. Yes, it does. Don’t believe it? Visit the hilarious yourdatafitsinram.com website.

But there is an entirely new dimension to this since last week’s announcement by Intel, which hasn’t gotten enough attention in the blogosphere yet.

New 3D XPoint™ technology brings non-volatile memory speeds up to 1,000 times faster than NAND, the most popular non-volatile memory in the marketplace today.

And then:

The companies invented unique material compounds and a cross point architecture for a memory technology that is 10 times denser than conventional memory

This is colossal news, which you can read from the official source here:
http://newsroom.intel.com/community/intel_newsroom/blog/2015/07/28/intel-and-micron-produce-breakthrough-memory-technology

What does it mean for software?

SSD has already had a big impact on how we think about software, especially in the database business. Many of an RDBMS’s internal optimisations are based on the assumption that the database is installed on a system with few CPUs, a bit of RAM, and a large HDD. HDDs are very slow and suffer from a lot of latency due to their spinning platters. Data needs to be cached on several layers – in the operating system that accesses blocks on the disk, as well as in the database that accesses rows from tables or indexes.

SSD changed a lot of this, as the spinning (and its associated latency) has gone, which is most useful for index lookups as Markus Winand from use-the-index-luke.com explains:

index lookups have a tendency to cause many random IO operations and can thus benefit from the fast response time of SSDs. The fun part is that properly indexed databases get better benefits from SSD than poorly indexed ones

SSD is still relatively new and not yet fully adopted in enterprise data centers and associated software, yet already, we’re seeing this new trend:

RAM is the new SSD

One of the most impressive displays of yourdatafitsinram.com is Stack Exchange, the platform running the popular Stack Overflow. According to their website, the platform is transferring 48 TB of data per month to its users, at an average of 225 requests per second.

From our perspective, the database metrics are even more interesting: Stack Overflow is essentially running a single SQL Server instance (with a hot standby), accommodating 440M queries per day – roughly 5,000 queries per second – via 384 GB of RAM and a DB size of 2.4 TB.

The full metrics can be found on this website:
http://stackexchange.com/performance

Now, let’s apply Intel’s new 3D XPoint™ technology to this model – perhaps we don’t need any disk anymore, after all (except for logging and backups)?

Don’t scale out. Yet.

A lot of recent hype has been revolving around the need for scaling out, as Moore’s Law has come to a halt and we now need to parallelise on many, many cores. But this doesn’t mean that we absolutely need to parallelise on many machines. Keeping all data processing in one place that can be scaled up greatly with processors and RAM will help us avoid hard-to-manage network latency, and will again allow us to continue using established, slightly adapted RDBMS technology. Prices for hardware will crumble soon enough.

We’re looking forward to an exciting new era of scaling up massively. With SQL, of course!

INTERSECT – the Underestimated Two-Way IN Predicate


Have you ever wondered how you could express a predicate that “feels” like the following, in SQL:

WHERE Var1 OR Var2 IN (1, 2, 3)

/u/CyBerg90 has, on reddit. The idea was to create a predicate that yields true whenever either of the two values Var1 and Var2 yields 1, 2, or 3.

The canonical solution

The canonical solution would obviously be to write it all out as:

WHERE Var1 = 1 OR Var1 = 2 OR Var1 = 3
OR    Var2 = 1 OR Var2 = 2 OR Var2 = 3

A lot of duplication, though.

Using IN predicates

Most readers would just connect the two IN predicates:

WHERE Var1 IN (1, 2, 3)
OR    Var2 IN (1, 2, 3)

Or the clever ones might reverse the predicates as such, to form the equivalent:

WHERE 1 IN (Var1, Var2)
OR    2 IN (Var1, Var2)
OR    3 IN (Var1, Var2) 

Nicer solution using EXISTS and JOIN

All of the previous solutions require syntax / expression repetition to some extent. While this may not have any significant impact performance-wise, it can definitely explode in terms of expression length. Better solutions (from that perspective) make use of the EXISTS predicate, constructing ad-hoc sets that are non-empty whenever either Var1 or Var2 yields 1, 2, or 3.

Here’s EXISTS with JOIN

WHERE EXISTS (
    SELECT 1
    FROM (VALUES (Var1), (Var2)) t1(v)
    JOIN (VALUES (1), (2), (3)) t2(v)
    ON t1.v = t2.v
)

This solution constructs two tables with a single value each, joining them on that value:

+------+    +------+
| t1.v |    | t2.v |
+------+    +------+
| Var1 |    |    1 |
| Var2 |    |    2 |
+------+    |    3 |
            +------+

Looking at a Venn Diagram, it is easy to see how JOIN will produce only those values from t1 and t2 that are present in both sets:

intersect

Nicest solution using EXISTS and INTERSECT

However, people might not think of a set intersection when they read JOIN. So why not make use of actual set intersection via INTERSECT? The following is the nicest solution in my opinion:

WHERE EXISTS (
    SELECT v
    FROM (VALUES (Var1), (Var2)) t1(v)
    INTERSECT
    SELECT v
    FROM (VALUES (1), (2), (3)) t2(v)
)

Observe how the length of the SQL statement increases with O(m + n) (or simply O(N)), where m, n are the numbers of values in each set, whereas the original solutions using IN increase with O(m * n) (or simply O(N²)).
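
Incidentally, should you ever need the same two-way predicate on the Java side, it maps nicely to Collections.disjoint(). A sketch, assuming two hypothetical int variables var1 and var2:

int var1 = 1, var2 = 5;

// true if {var1, var2} and {1, 2, 3} share at least one value:
boolean matches = !Collections.disjoint(
    Arrays.asList(var1, var2),
    Arrays.asList(1, 2, 3)
);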

INTERSECT Support in popular RDBMS

INTERSECT is widely supported, both in the SQL standard as well as in all of the following RDBMS that are supported by jOOQ:

  • CUBRID
  • DB2
  • Derby
  • H2
  • HANA
  • HSQLDB
  • Informix
  • Ingres
  • Oracle
  • PostgreSQL
  • SQLite
  • SQL Server
  • Sybase ASE
  • Sybase SQL Anywhere

In fact, the following databases also support the less commonly used INTERSECT ALL, which doesn’t remove duplicate values from resulting bags (see also UNION vs. UNION ALL)

  • CUBRID
  • PostgreSQL

Happy intersecting!

Java 8’s Method References Put Further Restrictions on Overloading


Method overloading has always been a topic with mixed feelings. We’ve blogged about it and the caveats that it introduces a couple of times:

There are two main reasons why overloading is useful:

  1. To allow for defaulted arguments
  2. To allow for disjunct argument type alternatives

Both reasons are motivated simply by providing convenience for API consumers. Good examples are easy to find in the JDK:

Defaulted arguments

public class Integer {
    public static int parseInt(String s) {
        return parseInt(s, 10);
    }

    public static int parseInt(String s, int radix) { ... }
}

In the above example, the first parseInt() method is simply a convenience method for calling the second one with the most commonly used radix.

Disjunct argument type alternatives

Sometimes, similar behaviour can be achieved using different types of parameters, which mean similar things but which are not compatible in Java’s type system. For example when constructing a String:

public class String {
    public static String valueOf(char c) {
        char data[] = {c};
        return new String(data, true);
    }

    public static String valueOf(boolean b) {
        return b ? "true" : "false";
    }

    // and many more...
}

As you can see, the behaviour of the same method is optimised depending on the argument type. This does not affect the “feel” of the method when reading or writing source code as the semantics of the two valueOf() methods is the same.

Another use-case for this technique is when commonly used, similar but incompatible types need convenient conversion between each other. As an API designer, you don’t want to make your API consumer goof around with such tedious conversions. Instead, you offer:

public class IOUtils {
    public static void copy(InputStream input, OutputStream output);
    public static void copy(InputStream input, Writer output);
    public static void copy(InputStream input, Writer output, String encoding);
    public static void copy(InputStream input, Writer output, Charset encoding);
}

This is a nice example showing both defaulted parameters (the optional encoding) as well as argument type alternatives (OutputStream vs. Writer, or String vs. Charset encoding representation).

Side-note

I suspect that the union type and defaulted argument ships have sailed for Java a long time ago – while union types might be implemented as syntax sugar, defaulted arguments would be a beast to introduce into the JVM as it would depend on the JVM’s missing support for named arguments.

As demonstrated by the Ceylon language, these two features cover about 99% of all method overloading use-cases, which is why Ceylon can do completely without overloading – on top of the JVM!

Overloading is dangerous and unnecessary

The above examples show that overloading is essentially just a means to help humans interact with an API. For the runtime, there is no such thing as overloading. There are only different, unique method signatures to which calls are linked “statically” in byte code (give or take more recent opcodes like invokedynamic). But the point is, there’s no difference for the computer if the above methods are all called copy(), or if they had been called unambiguously m1(), m2(), m3(), and m4().

On the other hand, overloading is real in Java source code, and the compiler has to do a lot of work to find the most specific method, and otherwise apply the JLS’s complex overload resolution algorithm. Things get worse with each new Java language version. In Java 8, for instance, method references will add additional pain to API consumers, and require additional care from API designers. Consider the following simplification of an example by Josh Bloch – you can copy-paste it into Eclipse to verify the compilation error (note that not-up-to-date compilers may report type inference side-effects instead of the actual error). The compilation error reported by Eclipse for this snippet:

static void pfc(List<Integer> x) {
    Stream<?> s = x.stream().map(Integer::toString);
}

… is

Ambiguous method reference: both toString() and 
toString(int) from the type Integer are eligible

Oops!

The above expression is ambiguous. It can mean either of the following two expressions:

// Instance method:
x.stream().map(i -> i.toString());

// Static method:
x.stream().map(i -> Integer.toString(i));

As can be seen, the ambiguity is immediately resolved by using lambda expressions rather than method references. Another way to resolve this ambiguity (towards the instance method) would be to use the super-type declaration of toString() instead, which is no longer ambiguous:

// Instance method:
x.stream().map(Object::toString);

Conclusion

The conclusion here for API designers is very clear:

Method overloading has become an even more dangerous tool for API designers since Java 8

While the above isn’t really “severe”, API consumers will waste a lot of time overcoming this cognitive friction when their compilers reject seemingly correct code. One big faux-pas that is a takeaway from this example is to:

Never mix similar instance and static method overloads

And in fact, this problem is amplified when your static method overloads a name from java.lang.Object, as we’ve explained in a previous blog post.

There’s a simple reason for the above rule. Because there are only two valid reasons for overloading (defaulted parameters and incompatible parameter alternatives), there is no point in providing a static overload for a method in the same class. A much better design (as exposed by the JDK) is to have “companion classes” – similar to Scala’s companion objects. For instance:

// Instance logic
public interface Collection<E> {}
public class Object {}

// Utilities
public class Collections {}
public final class Objects {}

By changing the namespace for methods, overloading has been circumvented somewhat elegantly, and the previous problems would not have appeared.
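
As a bonus, the companion class also defuses the earlier method reference ambiguity. java.util.Objects only declares static toString() overloads, and only one of them is applicable as a one-argument function, so the following should link unambiguously:

static void pfc(List<Integer> x) {
    // Unambiguous: only the static Objects.toString(Object) applies
    Stream<String> s = x.stream().map(Objects::toString);
}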

TL;DR: Avoid overloading unless the added convenience really adds value!

NULL is Not The Billion Dollar Mistake. A Counter-Rant


A short while ago, I gave this answer on Quora. The question was “What is the significance of NULL in SQL?” and most of the existing answers went on about citing C.J. Date or Tony Hoare and unanimously declared NULL as “evil”.

So, everyone rants about NULL all the time. Let me counter-rant.

Academics

Of course, academics like C.J. Date will rant about NULL (see Greg Kemnitz’s interesting answer on Quora). Let me remind you that C.J. Date also ranted about UNION ALL, as pure relational theory operates only on sets, not on bags (like SQL does). While in theory, sets are probably much purer than bags, in practice, bags are just very useful.

These people probably also still mourn over the fact that SQL (useful) won over QUEL (pure), and I don’t blame them. Theory is always more beautiful than the real world, which is exposed to real world requirements.

Purists

There are also other kinds of purists who will run around and educate everyone about their black/white opinions, which leave no room for pragmatic “it depends…” approaches. I like to display this witty comic strip for such occasions: New intern knows best: GOTO. Purists like extreme abstraction when they describe their world, and such abstraction asks for very simple models, no complexity. NULL adds tremendous complexity to the SQL “model”, and thus does not fit their view.

Fact is: It depends

The only factual opinion ever is one where there’s no clear opinion. NULL is an incredibly useful value, and some representation of NULL is inevitable in all languages / models that want to model cardinalities of the form:

  • 0 or 1 (here’s where NULL is useful)
  • exactly 1 (here, you don’t need NULL)
  • 0 .. many (here, you don’t need NULL)

Functional programming languages like to make use of the Optional “monad” (see Mario Fusco’s excellent explanation of what a monad is) to model the 0 or 1 cardinality, but that’s just another way of modelling NULL: the (possibly) absent value. If you like to discuss style, NULL vs. Optional may matter to you, but they’re really exactly the same thing. We’ve just been shifting whitespace and curly braces.
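
To see how little difference there really is – a minimal sketch with a hypothetical Person type, modelling the same 0 or 1 cardinality twice:

class Person {
    String firstName;   // exactly 1
    String middleName;  // 0 or 1, absence encoded as null

    // The very same 0 or 1 cardinality, encoded as Optional instead:
    Optional<String> middleName() {
        return Optional.ofNullable(middleName);
    }
}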

The only way to really do without the absent value would be to disallow the optional cardinality and use 0 .. many instead, which would be much less descriptive.

So, regardless of what purists or academics say about a perfect world, we engineers need potent tools that help us get our work done, and NULL (or “Optional”) is one of these potent tools that allow us to do so.

Caveat: SQL NULL is not an absent value

Now, the caveat with SQL’s NULL is that it doesn’t behave like an absent value. It is the UNKNOWN value, as others have also explained. This subtle difference has severe impact on a variety of operations and predicates, which do not behave very intuitively if you’re not aware of this distinction. NOT IN predicates containing a NULL value, or comparisons like NULL = NULL (which yields NULL, not TRUE), are just two of many, many examples.

Even with this specification of SQL NULL being UNKNOWN, most people abuse SQL NULL to model the absent value instead, which works just nicely in most cases until you run into a caveat. It turns out that the UNKNOWN value is even more useful than the absent value, as it allows for modelling things with even more descriptiveness. One might think that having two “special” values would solve problems, like JavaScript, which distinguishes between null (UNKNOWN) and undefined (absent).

JavaScript itself is a beacon of usefulness that is inversely proportional to its purity or beauty, so long story short:

Pick your favourite spot on the useful <-> pure scale

Programming, languages, data models are always a tradeoff between purity and usefulness. Pick your favourite spot on that scale, but stop ranting about NULL being evil. Or as Simon Peyton Jones said:

Haskell is useless

What the sun.misc.Unsafe Misery Teaches Us


Oracle will remove the internal sun.misc.Unsafe class in Java 9. While most people are probably rather indifferent regarding this change, some other people – mostly library developers – are not. There have been a couple of recent articles in the blogosphere painting a dark picture of what this change will imply.

Maintaining a public API is extremely difficult, especially when the API is as popular as that of the JDK. There is simply (almost) no way to keep people from shooting themselves in the foot. Oracle (and previously Sun) have always declared the sun.* packages as internal and not to be used. Citing from the page called “Why Developers Should Not Write Programs That Call ‘sun’ Packages”:

The sun.* packages are not part of the supported, public interface.

A Java program that directly calls into sun.* packages is not guaranteed to work on all Java-compatible platforms. In fact, such a program is not guaranteed to work even in future versions on the same platform.

This disclaimer is just one out of many similar disclaimers and warnings. Whoever goes ahead and uses Unsafe does so … “unsafely“.

What do we learn from this?

The concrete solution to this misery is still being discussed and remains open. A good idea would be to provide a formal and public replacement before removing Unsafe, in order to allow for migration paths of the offending libraries.

But there’s a more important message to all of this. The message is:

When all you have is a hammer, every problem looks like a thumb

Translated to this situation: the hammer is Unsafe, and given that it’s a very poor hammer – but the only option – library developers simply didn’t have much of a choice. They’re not really to blame. In fact, they took a gamble in one of the world’s most stable and backwards-compatible software environments (= Java) and fared extremely well for more than 10 years. Would you have made a different choice in a similar situation? Or, let me ask differently: was betting on AWT or Swing a much safer choice at the time?

If something can somehow be used by someone, then it will be, no matter how obviously they’re gonna shoot themselves in the foot. The only way to currently write a library / API and really prevent users from accessing internals is to put everything in a single package and make everything package-private. This is what we’ve been doing in jOOQ from the beginning, knowing that jOOQ’s internals are extremely delicate and subject to change all the time.
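
In code, that pattern is as unspectacular as it sounds. A minimal sketch with hypothetical names, not jOOQ’s actual sources:

package org.example.library;

// The only public entry point:
public final class Library {
    public static void run() {
        new InternalEngine().execute();
    }
}

// Package-private, hence invisible outside org.example.library,
// and free to change at any time:
final class InternalEngine {
    void execute() { /* delicate internals ... */ }
}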

For more details about this rationale, see our previous blog posts on the topic.

However, this solution has a severe drawback for those developing those internals. It’s a hell of a package with almost no structure. That makes development rather difficult.

What would be a better Java, then?

Java has always had an insufficient set of visibilities:

  • public
  • protected
  • default (package-private)
  • private

There should be a fifth visibility that behaves like public but prevents access from “outside” of a module. In a way, that’s between the existing public and default visibilities. Let’s call this the hypothetical module visibility.

In fact, not only should we be able to declare this visibility on a class or member, we should be able to govern module inter-dependencies on a top level, just like the Ceylon language allows us to do:

module org.hibernate "3.0.0.beta" {
    import ceylon.collection "1.0.0";
    import java.base "7";
    shared import java.jdbc "7";
}

This reads very similar to OSGi’s bundle system, where bundles can be imported / exported, although the above module syntax is much much simpler than configuring OSGi.

A sophisticated module system would go even further. Not only would it match OSGi’s features, it would also match those of Maven. With the possibility of declaring dependencies on a Java language module basis, we might no longer need the XML-based Maven descriptors, as those could be generated from a simple module syntax (or Gradle, or ant/ivy).

And with all of this in place, classes like sun.misc.Unsafe could be declared as module-visible for only a few JDK modules – not the whole world. I’m sure the number of people abusing reflection to get a hold of those internals would decrease by 50%.
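
Incidentally, this is roughly the direction Project Jigsaw is taking with qualified exports. A sketch of what such a declaration might look like in module-info.java syntax (the module name and export targets here are made up):

// module-info.java of a hypothetical JDK module:
module jdk.unsafe {
    // sun.misc is visible to these modules only; to everyone
    // else, it is as good as package-private:
    exports sun.misc to java.base, jdk.management;
}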

Conclusion

I do hope that in a future Java, this Ceylon language feature (and also Fantom language feature, btw) will be incorporated into the Java language. A nice overview of Java 9 / Jigsaw’s modular encapsulation can be seen in this blog post:
http://blog.codefx.org/java/dev/features-project-jigsaw-java-9/#Encapsulation

Until then, if you’re an API designer, do know that all disclaimers won’t work. Your internal APIs will be used and abused by your clients. They’re part of your ordinary public API from day 1 after you publish them. It’s not your user’s fault. That’s how things work.
