Type Safe Implicit JOIN Through Path Navigation in jOOQ 3.11

One of the biggest contributors to SQL syntax verbosity is the need to explicitly JOIN every table that somehow contributes to the query, even if that contribution is “trivial”. When looking at the Sakila database, an example could be seen easily when fetching customer data:

SELECT 
  cu.first_name,
  cu.last_name,
  co.country
FROM customer AS cu
JOIN address USING (address_id)
JOIN city USING (city_id)
JOIN country AS co USING (country_id)  

That single access to the country information cost us 3 additional lines of SQL code as well as the cognitive overhead of mentally navigating through the to-one relationships in order to get the joins right.

This is just as tedious when reading the SQL as when writing it! There is a separation of concerns (projection vs joins) where there shouldn’t be in this particular case. We’re just projecting the country, not doing anything with it, let alone caring about the individual table / primary key / foreign key names. Imagine if we had composite keys in the path from customer to country…

Implicit JOIN from SELECT clause

Wouldn’t it be much better (in this case) to be able to write:

SELECT 
  cu.first_name,
  cu.last_name,
  cu.address.city.country.country
FROM customer AS cu

Because after all, that’s really the same thing. We’re fetching only customers, and we load some additional content from their parent table(s). Since we’re navigating to-one relationships only (as opposed to navigating to-many relationships), we don’t really need actual JOIN semantics, a JOIN being a filtered cartesian product.

Implicit JOIN from WHERE clause

The same is true when fetching customers from a particular country. Why write:

SELECT 
  cu.first_name,
  cu.last_name
FROM customer AS cu
JOIN address USING (address_id)
JOIN city USING (city_id)
JOIN country AS co USING (country_id)
WHERE co.country = 'Switzerland'

When this would be a lot more natural:

SELECT 
  cu.first_name,
  cu.last_name
FROM customer AS cu
WHERE cu.address.city.country.country = 'Switzerland'

Implicit JOIN from multiple clauses

Another example would be when grouping by country to find out how many customers per country we have. Standard SQL, explicit JOIN version:

SELECT 
  co.country,
  COUNT(*),
  COUNT(DISTINCT city.city)
FROM customer AS cu
JOIN address USING (address_id)
JOIN city USING (city_id)
JOIN country AS co USING (country_id)  
GROUP BY co.country
ORDER BY co.country

Again, the many JOINs could be seen as noise, when the implicit version may seem much leaner:

SELECT 
  cu.address.city.country.country,
  COUNT(*),
  COUNT(DISTINCT cu.address.city.city)
FROM customer AS cu
GROUP BY cu.address.city.country.country
ORDER BY cu.address.city.country.country

Even if the same expression is repeated 3x (and we could easily alias it, of course), the output query would still produce only the single JOIN graph that we’ve seen before. In fact, there are two different paths:

  • cu.address.city.*
  • cu.address.city.country.*

Internally, we should recognise that the paths are part of the same tree traversal, so the JOIN graph produced by cu.address.city.* can be re-used for cu.address.city.country.*
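As a rough sketch of that idea (this is not jOOQ's implementation, and the class name PathTree is made up for illustration), shared path prefixes can be collected into an ordered set, so that each parent table is joined only once no matter how many paths run through it:

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of jOOQ: collects the distinct JOIN steps
// needed to resolve a set of dotted path expressions.
final class PathTree {

    // Returns every distinct path prefix that points at a parent table,
    // in first-seen order. The first segment is the root alias and the last
    // segment is a column, so neither of them produces a JOIN on its own.
    static Set<String> joinSteps(List<String> paths) {
        Set<String> steps = new LinkedHashSet<>();
        for (String path : paths) {
            String[] segments = path.split("\\.");
            StringBuilder prefix = new StringBuilder(segments[0]);
            for (int i = 1; i < segments.length - 1; i++) {
                prefix.append('.').append(segments[i]);
                steps.add(prefix.toString());
            }
        }
        return steps;
    }

    public static void main(String[] args) {
        // Prints [cu.address, cu.address.city, cu.address.city.country]
        System.out.println(joinSteps(List.of(
            "cu.address.city.city",
            "cu.address.city.country.country")));
    }
}
```

The two paths from the query above contribute three distinct steps in total, which matches the three JOIN lines of the explicit version.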

In fact, we could actually add one (semi-)explicit JOIN to avoid the repetition:

SELECT 
  ci.country.country,
  COUNT(*),
  COUNT(DISTINCT ci.city)
FROM customer AS cu
IMPLICIT JOIN cu.address.city AS ci
GROUP BY ci.country.country
ORDER BY ci.country.country

Implicit JOIN from correlated subqueries

A more sophisticated case would be an implicit join in a correlated subquery, which should really affect the outer query rather than the subquery. Consider finding all customers and the number of customers from the same country:

SELECT 
  cu.first_name,
  cu.last_name, 
  (
    SELECT COUNT(*)
    FROM customer AS cu2
    JOIN address USING (address_id)
    JOIN city AS ci2 USING (city_id)
    WHERE ci2.country_id = ci.country_id
  ) AS customers_from_same_country
FROM customer AS cu
JOIN address USING (address_id)
JOIN city AS ci USING (city_id)

Now clearly, the JOINs start getting in the way of readability (and writeability as well). There’s a slight risk of getting the semantics wrong because of all the aliasing going on. A much leaner solution is:

SELECT 
  cu.first_name,
  cu.last_name, 
  (
    SELECT COUNT(*)
    FROM customer AS cu2
    WHERE cu2.address.city.country_id =
          cu.address.city.country_id
  ) AS customers_from_same_country
FROM customer AS cu

Now, of course, many of you cringed and were ready to point out that a correlated subquery isn’t the best solution in this case, and you’re absolutely correct. Use window functions, instead!

Implicit JOIN from window functions

Still, you can profit from implicit JOIN again. Plain SQL version:

SELECT 
  cu.first_name,
  cu.last_name, 
  COUNT(*) OVER (PARTITION BY ci.country_id)
    AS customers_from_same_country
FROM customer AS cu
JOIN address USING (address_id)
JOIN city AS ci USING (city_id)

Implicit JOIN version:

SELECT 
  cu.first_name,
  cu.last_name, 
  COUNT(*) OVER (PARTITION BY cu.address.city.country_id)
    AS customers_from_same_country
FROM customer AS cu

It doesn’t matter where the implicit JOIN appears, i.e. where the path-based parent table access appears. The translation from implicit JOIN syntax to explicit JOIN will always append a JOIN or several JOINs to the left-most child table in the JOIN path, wherever that table is declared. This is a simple matter of scope resolution.
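To make the translation a bit more tangible, here is a minimal sketch of how a to-one path could be expanded into explicit JOIN clauses. It is an illustration under assumptions, not what jOOQ does internally: the class ImplicitJoinTranslator, the generated alias names, and the choice of a plain inner JOIN are all hypothetical. The foreign key columns are the ones used in the Sakila examples above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical translator, not jOOQ's implementation: expands a to-one path
// into explicit JOIN clauses appended to the left-most child table.
final class ImplicitJoinTranslator {

    // Foreign key column used to reach each parent table (from the examples above).
    private static final Map<String, String> FK_COLUMN = Map.of(
        "address", "address_id",
        "city", "city_id",
        "country", "country_id");

    // rootAlias: alias of the left-most child table, e.g. "cu"
    // parents:   parent tables in navigation order, e.g. address, city, country
    static List<String> toJoins(String rootAlias, List<String> parents) {
        List<String> joins = new ArrayList<>();
        String childAlias = rootAlias;
        StringBuilder path = new StringBuilder(rootAlias);
        for (String parent : parents) {
            String fk = FK_COLUMN.get(parent);
            if (fk == null) {
                throw new IllegalArgumentException("No to-one foreign key to " + parent);
            }
            // Assumption: derive a unique alias from the path walked so far.
            path.append('_').append(parent);
            String parentAlias = path.toString();
            joins.add("JOIN " + parent + " AS " + parentAlias
                + " ON " + childAlias + "." + fk + " = " + parentAlias + "." + fk);
            childAlias = parentAlias;
        }
        return joins;
    }

    public static void main(String[] args) {
        toJoins("cu", List.of("address", "city", "country")).forEach(System.out::println);
    }
}
```

Because the generated aliases start with the root alias, the same path used on cu and on cu2 yields two independent JOIN chains, each one attached to the query block in which its root table is declared.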

Drawbacks

Technically, there are no drawbacks of the implicit JOIN syntax for to-one relationships compared to the explicit JOIN syntax. But of course, as always with syntax sugar, there’s a slight risk of a developer not fully aware of how things work behind the scenes choosing a less optimal (but visually more elegant) solution over a more performant one.

This could be the case when modelling ANTI JOINs as implicit JOINs with an IS NULL predicate. In some databases, that might still be better, but in most databases, using NOT EXISTS() should be preferred when ANTI JOIN semantics is implemented.

Implicit JOIN for to-many relationship

Having a syntax for navigating to-many relationships is desirable as well, although the implications on semantics are vastly different. While implicit JOINs on to-one relationships have no unexpected effects on the semantics of the query, implicit JOINs on to-many relationships implicitly change the cardinalities of queries they’re contained in. For example:

SELECT
  a.first_name,
  a.last_name,
  a.film.title
FROM actor AS a

When navigating from the ACTOR to the FILM table (via the FILM_ACTOR relationship table), we’re going to duplicate the actor results. It is rather unexpected to have an expression in the SELECT clause to modify the cardinalities of a query, and thus, probably not a good idea. Specifically, there are many cases of implicit JOINs on to-many relationships where the semantics is unclear, ambiguous, or even wrong, because of this change of cardinalities.
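The difference in cardinality can be illustrated outside of SQL with a few in-memory collections. The sketch below uses made-up placeholder data and a hypothetical class name (CardinalityDemo). A to-one lookup behaves like a map() and keeps the row count, whereas a to-many navigation behaves like a flatMap() and multiplies it:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical in-memory illustration, not related to any jOOQ API.
final class CardinalityDemo {

    // To-one navigation: every customer maps to exactly one country,
    // so the result has as many rows as there are customers.
    static List<String> toOne(List<String> customers, Map<String, String> countryOf) {
        return customers.stream()
            .map(c -> c + " / " + countryOf.get(c))
            .collect(Collectors.toList());
    }

    // To-many navigation: every actor maps to a list of films,
    // so each actor is repeated once per film (and dropped if there is none).
    static List<String> toMany(List<String> actors, Map<String, List<String>> filmsOf) {
        return actors.stream()
            .flatMap(a -> filmsOf.getOrDefault(a, List.of()).stream().map(f -> a + " / " + f))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Placeholder data, invented for this example.
        System.out.println(toOne(
            List.of("customer 1", "customer 2"),
            Map.of("customer 1", "Switzerland", "customer 2", "Canada")));
        System.out.println(toMany(
            List.of("actor 1", "actor 2"),
            Map.of("actor 1", List.of("film A", "film B"), "actor 2", List.of("film C"))));
    }
}
```

Two customers stay two rows, but two actors turn into three rows, purely because of an expression that would sit in the SELECT clause.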

For the sake of simplicity, this discussion is out of scope for this article, and for the upcoming jOOQ feature.

jOOQ support for implicit JOIN

Some ORMs like Hibernate, Doctrine, and others have implemented this feature in the past in their own respective query languages, such as HQL and DQL. jOOQ 3.11 follows suit and offers this feature as well through its type safe SQL query API (see https://github.com/jOOQ/jOOQ/issues/1502).

This will be done for the entirety of the SQL language, not just a limited subset, such as HQL or DQL.

All of the above queries can be written in jOOQ as such:

Customer cu = CUSTOMER.as("cu");

ctx.select(
      cu.FIRST_NAME,
      cu.LAST_NAME,
      cu.address().city().country().COUNTRY)
   .from(cu)
   .fetch();

ctx.select(
      cu.FIRST_NAME,
      cu.LAST_NAME)
   .from(cu)
   .where(cu.address().city().country().COUNTRY.eq("Switzerland"))
   .fetch();

ctx.select(cu.address().city().country().COUNTRY, count())
   .from(cu)
   .groupBy(cu.address().city().country().COUNTRY)
   .orderBy(cu.address().city().country().COUNTRY)
   .fetch();

Customer cu2 = CUSTOMER.as("cu2");

ctx.select(
      cu.FIRST_NAME,
      cu.LAST_NAME,
      field(selectCount()
          .from(cu2)
          .where(cu2.address().city().COUNTRY_ID.eq(
                 cu.address().city().COUNTRY_ID))
      ).as("customers_from_same_country"))
   .from(cu)
   .fetch();

ctx.select(
      cu.FIRST_NAME,
      cu.LAST_NAME,
      count().over(partitionBy(cu.address().city().COUNTRY_ID))
        .as("customers_from_same_country"))
   .from(cu)
   .fetch();

The navigation is completely type safe thanks to jOOQ’s code generator which generates navigational methods from child table to parent table in the presence of foreign keys. By default, the method name matches the parent table name (single foreign key between child and parent) or the foreign key constraint name (multiple foreign keys between child and parent), but as always, this can be overridden easily using generator strategies.
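To give an idea of why this is type safe, here is a heavily simplified, hand-written stand-in for such navigational methods. These are not the classes that jOOQ's code generator actually produces; the class NavigationSketch and its string-based "columns" are assumptions made purely for illustration. The point is only that each table type exposes methods for its own foreign keys, so an invalid path does not compile:

```java
// Hand-written, simplified stand-ins for generated table classes (hypothetical).
// Columns are modelled as plain strings holding the rendered path.
final class NavigationSketch {

    static final class Customer {
        final String path;
        final String FIRST_NAME;
        final String LAST_NAME;

        Customer(String alias) {
            this.path = alias;
            this.FIRST_NAME = alias + ".first_name";
            this.LAST_NAME = alias + ".last_name";
        }

        // One navigation method per foreign key, named after the parent table.
        Address address() { return new Address(path + ".address"); }
    }

    static final class Address {
        final String path;

        Address(String path) { this.path = path; }

        City city() { return new City(path + ".city"); }
    }

    static final class City {
        final String path;
        final String CITY;
        final String COUNTRY_ID;

        City(String path) {
            this.path = path;
            this.CITY = path + ".city";
            this.COUNTRY_ID = path + ".country_id";
        }

        Country country() { return new Country(path + ".country"); }
    }

    static final class Country {
        final String path;
        final String COUNTRY;

        Country(String path) {
            this.path = path;
            this.COUNTRY = path + ".country";
        }
    }

    public static void main(String[] args) {
        Customer cu = new Customer("cu");
        // Prints cu.address.city.country.country
        System.out.println(cu.address().city().country().COUNTRY);
        // cu.city() would not compile: Customer has no foreign key to city.
    }
}
```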

The feature is really extremely powerful. For a much more complex example, see:

Bringing implicit JOIN to actual SQL

A nice jOOQ feature that hasn’t been advertised too often yet is the new jOOQ parser, whose main purpose (so far) is to offer support for the DDLDatabase, a tool that reverse engineers your DDL scripts to generate jOOQ code. The parser will have many other uses in the future, though, including its capability of being exposed behind a JDBC proxy API, which can parse any JDBC based application’s SQL and re-generate it using different settings (e.g. a different dialect).

Of course, the parser (if supplied with schema meta information, see https://github.com/jOOQ/jOOQ/issues/5296) will be able to resolve such path expressions and transform the input SQL string using implicit JOINs to the equivalent output SQL string with explicit SQL JOINs.

This topic is still under research. More information will follow as the scope of this functionality becomes clearer.

Availability in jOOQ

jOOQ 3.11 is due for late Q3 2018 / early Q4 2018. You can already play around with this feature by checking out jOOQ from GitHub:
https://github.com/jOOQ/jOOQ

Your feedback is very welcome!

jOOQ Tuesdays: Ming-Yee Iu Gives Insight into Language Integrated Querying

Welcome to the jOOQ Tuesdays series. In this series, we’ll publish an article on the third Tuesday every other month where we interview someone we find exciting in our industry from a jOOQ perspective. This includes people who work with SQL, Java, Open Source, and a variety of other related topics.

Ming-Yee Iu

In this eighth edition, we have the pleasure of talking to Ming-Yee Iu, who will be telling us about how different people in our industry have tackled the integration of query systems into general purpose languages, including his own library JINQ, which does so for Java.

Ming, everyone coming from C# to Java will google LINQ for Java. You have implemented just that with JINQ. What made you do it?

Jinq actually grew out of my PhD research at EPFL university in Switzerland. When I started a PhD there in 2005, I needed a thesis topic, and I heard that my supervisor Willy Zwaenepoel was interested in making it easier to write database code. I had a bit of a background with Java internals from when I was an intern with one of IBM’s JVM teams in 1997, so when I took a look at the problem, I looked at it from a lower-level systems perspective. As a result, I came up with the idea of using a bytecode rewriting scheme to rewrite certain types of Java code into database queries. There were other research groups looking at the problem at the same time, including the LINQ group. Different groups came up with different approaches based on their own backgrounds. The basic assumption was that programmers had difficulty writing database code because there was a semantic gap–the relational database model was so different from the object-oriented programming model that programmers wasted mental effort bridging the differences. The hope was that this semantic gap could be reduced by letting programmers write normal Java code and having the computer figure out how to run this code on a database. Different approaches would result in tools that could handle more complex database queries or could be more flexible in the style of code they accept.

Although I came up with an initial approach fairly quickly, it took me many years to refine the algorithms into something more robust and usable. Similar to the LINQ researchers, I found that my algorithms worked best with functional code. Because functional-style code has no side effects, it’s easier to analyze. It’s also easier to explain to programmers how to write complex code that the algorithms could still understand. Unfortunately, when I finished my PhD in 2010, Java still didn’t properly support functional programming, so I shelved the research to work on other things. But when Java 8 finally came out in 2014 with lambdas, I decided to revisit my old research. I adapted my research to make use of Java 8 lambdas and to integrate with current enterprise tools. And the result was Jinq, an open source tool that provided support for LINQ-style queries in Java.

In a recent discussion on reddit, you’ve mentioned that the Java language stewards will never integrate query systems into the language, and that LINQ has been a mistake. Yet, LINQ is immensely popular in C#. Why was LINQ a mistake?

My opinion is a little more nuanced than that. LINQ makes a lot of sense for the C# ecosystem, but I think it is totally inappropriate for Java. Different languages have different trade-offs, different philosophies, and different historical baggage. Integrating a query system into Java would run counter to the Java philosophy and would be considered a mistake. C# was designed with different trade-offs in mind, so adding a feature like query integration to C# is more acceptable.

C# was designed to evolve quickly. C# regularly forces programmers to leave behind old code so that it can embrace new ways of doing things. There’s an old article on Joel on Software describing how Microsoft has two camps: the Raymond Chen camp that always tries to maintain backwards compatibility and the MSDN Magazine camp that is always evangelizing shiny new technology that may be abandoned after a few years. The Raymond Chen camp allows me to run 20 year old Windows programs on Windows 10. The MSDN Magazine camp produces cool new technology like C# and TypeScript and LINQ. There is nothing wrong with the MSDN philosophy. Many programmers prefer using languages built using this philosophy because the APIs and languages end up with less legacy cruft in them. You don’t have to understand the 30 year history of an API to figure out the proper way to use it. Apple uses this philosophy, and many programmers love it despite the fact that they have to rewrite all their code every few years to adapt to the latest APIs. For C#, adopting a technology that is immature and still evolving is fine because they can abandon it later if it doesn’t work out.

The Java philosophy is to never break backwards compatibility. Old Java code from the 1990s still compiles and runs perfectly fine on modern Java. As such, there’s a huge maintenance burden to adding new features to Java. Any feature has to be maintained for decades. Once a feature is added to Java, it can’t be changed or it might break backwards compatibility. As a result, only features that have withstood the test of time are candidates for being added to Java. When features are added to Java that haven’t yet fully matured, it “locks-in” a specific implementation and prevents the feature from evolving as people’s needs change. This can cause major headaches for the language in the future.

One example of this lock-in is Java serialization. Being able to easily write objects to disk is very convenient. But the feature locked in an architecture that isn’t flexible enough for future use-cases. People want to serialize objects to JSON or XML, but can’t do that using the existing serialization framework. Serialization has led to many security errors, and a huge amount of developer resources were required to get lambdas and serialization to work correctly together. Another example of this premature lock-in is synchronization support for all objects. At the time, it seemed very forward-looking to have multi-threading primitives built right into the language. Since every object could be used as a multi-threaded monitor, you could easily synchronize access to every object. But we now know that good multi-threaded programs avoid that sort of fine-grained synchronization. It’s better to work with higher-level synchronization primitives. All that low-level synchronization slows down the performance of both single-threaded and multi-threaded code. Even if you don’t use the feature, all Java objects have to be burdened by the overhead of having lock support. Serialization and synchronization were both added to Java with the best of intentions. But those features are now treated like “goto”: they don’t pass the smell test. If you see any code that uses those features, it usually means that the code needs extra scrutiny.

Adding LINQ-style queries to Java would likely cause similar problems. Don’t get me wrong. LINQ is a great system. It is currently the most elegant system we have for integrating a query language into an object-oriented language. Many people love using C# specifically because of LINQ. But the underlying technology is still too immature to be added to Java. Researchers are still coming up with newer and better ways of embedding query systems into languages, so there is a very real danger of locking Java into an approach that would later be considered obsolete. Already, researchers have many improvements to LINQ that Microsoft can’t adopt without abandoning its old code.

For example, to translate LINQ expressions to database queries, Microsoft added some functionality to C# that lets LINQ inspect the abstract syntax trees of lambda expressions at runtime. This functionality is convenient, but it limits LINQ to only working with expressions. LINQ doesn’t work with statements because it can’t inspect the abstract syntax trees of lambdas containing statements. This restriction on what types of lambdas can be inspected is inelegant. Although this functionality for inspecting lambdas is really powerful, it is so restricted that very few other frameworks use it. In a general-purpose programming language, all the language primitives should be expressive enough that they can be used as building blocks for many different structures and frameworks. But this lambda inspection functionality has ended up only being useful for query frameworks like LINQ. In fact, Jinq has shown that this functionality isn’t even necessary. It’s possible to build a LINQ-style query system using only the compiled bytecode, and the resulting query system ends up being more flexible in that it can handle statements and other imperative code structures.

As programmers have gotten more experience with LINQ, they have also started to wonder if there might be alternate approaches that would work better than LINQ. LINQ is supposed to make it easier for programmers to write database queries because they can write functional-style code instead of having to learn SQL. In reality though, to use LINQ well, a programmer still needs to understand SQL too. But if a programmer already understands SQL, what advantages does LINQ give them? Would it then be better to use a query system like jOOQ, which matches SQL syntax more closely than LINQ and can quickly evolve to encompass new SQL features? Perhaps query systems aren’t even necessary. More and more companies are adopting NoSQL databases that don’t even support queries at all.

Given how quickly our understanding of LINQ-style query systems is evolving, it would definitely be a mistake to add that functionality directly to a language like Java at the moment. Any approach might end up being obsolete, and it would impose a large maintenance burden on future versions of Java. Fortunately, Java programmers can use libraries such as Jinq and jOOQ instead, which provide most of the benefits of LINQ but don’t require tight language integration like LINQ.

Lightbend maintains Slick – LINQ for Scala. How does JINQ compare to Slick?

They both try to provide a LINQ-style interface for querying databases. Since Slick is designed for Scala, it has great Scala integration and is able to use Scala’s more expressive programming model to provide a very elegant implementation. To get the full benefits of Slick, you have to embrace the Scala ecosystem though.

Jinq is primarily designed for use with Java. It integrates with existing Java technologies like JPA and Hibernate. You don’t have to abandon your existing Java enterprise code when adopting Jinq because Jinq works with your existing JPA entity classes. Jinq is designed for incremental adoption. You can selectively use it some places and fall back to using regular JPA code elsewhere. Although Jinq can be used with Scala, it’s more useful for organizations that are using Scala but haven’t embraced the full Scala ecosystem. For example, Jinq allows you to use your existing Hibernate entities in your Scala code while still using a modern LINQ-style functional query system for them.

JINQ has seen the biggest improvement when Java 8 introduced the Stream API. What is your opinion about functional programming in Java?

I’m really happy that Java finally has support for lambdas. It’s a huge improvement that really makes my life as a programmer much easier. Over time, I’m hoping that the Java language stewards will be able to refine lambdas further though.

From Jinq’s perspective, one of the major weaknesses of Java 8’s lambdas is the total lack of any reflection support. Jinq needs reflection support to decode lambdas and to translate them to queries. Since there is no reflection support, Jinq needs to use slow and brittle alternate techniques to get the same information. Personally, I think the lack of reflection is a significant oversight, and this lack of reflection support could potentially weaken the entire Java ecosystem in the long term.
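For readers wondering what such an alternate technique can look like, one workaround that exists in plain Java is to make the lambda serializable and read its java.lang.invoke.SerializedLambda form through a reflective call to writeReplace(). The sketch below only illustrates that general trick. Whether Jinq does exactly this is not stated in the interview, and the names LambdaInspector and SerializableFunction are made up for this example:

```java
import java.io.Serializable;
import java.lang.invoke.SerializedLambda;
import java.lang.reflect.Method;
import java.util.function.Function;

// Hypothetical example of inspecting a lambda without official reflection support.
final class LambdaInspector {

    // A functional interface that is also Serializable. For such lambdas, the
    // runtime-generated lambda class carries a private writeReplace() method.
    interface SerializableFunction<T, R> extends Function<T, R>, Serializable {}

    // Returns "<implementation class>.<implementation method>" of the lambda.
    static String describe(SerializableFunction<?, ?> fn) {
        try {
            Method writeReplace = fn.getClass().getDeclaredMethod("writeReplace");
            writeReplace.setAccessible(true);
            SerializedLambda info = (SerializedLambda) writeReplace.invoke(fn);
            return info.getImplClass() + "." + info.getImplMethodName();
        } catch (ReflectiveOperationException e) {
            // Brittle by nature: this relies on an implementation detail.
            throw new IllegalStateException("Cannot inspect lambda", e);
        }
    }

    static String demo() {
        // A method reference is used because its implementation method has a
        // stable name; lambda bodies get compiler-generated synthetic names.
        SerializableFunction<String, Integer> length = String::length;
        return describe(length);
    }

    public static void main(String[] args) {
        // Prints java/lang/String.length
        System.out.println(demo());
    }
}
```

This yields only the name of the implementation method. Translating the lambda to a query still requires analysing that method's bytecode, which gives a sense of why such techniques can be slow and brittle.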

I have a few small annoyances with the lack of annotation support and lack of good JavaDoc guidelines for how to treat lambdas. The Streams API and lambda metafactories also seem a little bit overly complex to me, and I wonder if something simpler would have been better there.

From a day-to-day programming perspective though, I’ve found that the lack of syntactic sugar for calling lambdas is the main issue that has repeatedly frustrated me. It seems like a fairly minor thing, but the more I use lambdas, the more I feel that it is really important. In Java 8, it’s so easy to create and pass around lambdas, that I’m usually able to completely ignore the fact that lambdas are represented as classes with a single method. I’m able to think of my code in terms of lambdas. My mental model when I write Java 8 code is that I’m creating lambdas and passing them around. But when I actually have to invoke a lambda, the lambda magic completely breaks down. I have to stop and switch gears and think of lambdas in terms of classes. Personally, I can never remember the name of the method I need to call in order to invoke a lambda. Is it run(), accept(), consume(), or apply()? I often end up having to look up the documentation for the method name, which breaks my concentration. If Java 8 had syntactic sugar for calling lambdas, then I would never need to break out of the lambda abstraction. I would be able to create, pass around, and call lambdas without having to think about them as classes.
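The point about method names is easy to see with the standard functional interfaces from java.lang and java.util.function. The following small example (the class name LambdaCalls is arbitrary) creates five lambdas in the same uniform way, yet each one is invoked through a differently named method:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.function.Supplier;

// Creating lambdas looks uniform, calling them does not.
final class LambdaCalls {

    static List<String> demo() {
        List<String> log = new ArrayList<>();

        Runnable task = () -> log.add("run");
        Consumer<String> sink = s -> log.add("accept " + s);
        Function<Integer, Integer> increment = x -> x + 1;
        Supplier<String> source = () -> "get";
        Predicate<String> empty = String::isEmpty;

        task.run();                               // Runnable  -> run()
        sink.accept("x");                         // Consumer  -> accept()
        log.add("apply " + increment.apply(1));   // Function  -> apply()
        log.add(source.get());                    // Supplier  -> get()
        log.add("test " + empty.test(""));        // Predicate -> test()
        return log;
    }

    public static void main(String[] args) {
        // Prints [run, accept x, apply 2, get, test true]
        System.out.println(demo());
    }
}
```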

Java 9 will introduce the Flow API for reactive interoperability. Do you plan to implement a reactive JINQ?

To be honest, I’m not too familiar with reactive APIs. Lately, I’ve been working mostly on desktop applications, so I haven’t had to deal with problems at a sufficient scale where a reactive approach would make sense.

You’ve mentioned to me in the past that you have other projects running. What are you currently working on?

After a while, it’s easy to accumulate projects. Jinq is mostly stable at the moment though I do occasionally add bug fixes and other changes. There are still a few major features that could be added such as support for bulk updates or improved code generation, but those are fairly major undertakings that would require some funding to do.

I occasionally work on a programming language called Babylscript, which is a multilingual programming language that lets you write code in a mix of French, Chinese, Arabic, and other non-English languages. As a companion project to that, I also run a website for teaching programming to kids called Programming Basics that teaches programming in 17 different languages. Currently, though, I’m spending most of my time on two projects. One is an art tool called Omber, which is a vector drawing program that specializes in advanced gradients. The other project involves using HTML5 as the UI front-end for desktop Java programs. All your UI code would still be written in Java, but instead of using AWT or Swing, you would just manipulate HTML using a standard DOM interface bound to Java. As a side benefit, all your UI code can be recompiled using GWT to JavaScript, so you can reuse your UI code for web pages too.

Further info

Thank you very much for this very interesting interview, Ming. Want to learn more about JINQ? Read about it in this previous guest post on the jOOQ blog, and watch Ming’s JVMLS 2015 talk:

Popular ORMs Don’t do SQL

I’m contemplating what has happened in the ISO / IEC SQL standard during the last 15 years. We’ve had quite a few new features added to our beloved SQL language. Check this out:

  • With the ISO/IEC SQL:1999 standard, we could take advantage of grouping sets and (recursive) common table expressions.
  • With the ISO/IEC SQL:2003 standard, we’ve had the very sophisticated window functions and the MERGE statement.
  • With the ISO/IEC SQL:2008 standard, we could perform partitioned JOINs.
  • With the ISO/IEC SQL:2011 standard, we can now interoperate with temporal databases (implemented so far in IBM DB2 and Oracle).

And obviously, there’s much more goodness hiding in the almost unreadable 1423-page-long documents.

But JPA…

Now, do any of these awesome features appear in JPA? Nope. Will the next SQL standard introduce new awesome features? I’m sure it will! I could imagine that the Oracle / CUBRID CONNECT BY clause, or the Oracle / SQL Server PIVOT / UNPIVOT clauses are good candidates for standardisation. I’d go absolutely nuts if Oracle’s crazy MODEL clause made it, too.

While exciting things happen at these ends, the ORM impedance mismatch will further deepen and confirm Charles Humble’s recent findings from QCon, where he observed an increasing number of people who are unhappy with popular ORMs’ ever increasing complexity. A complexity example: NamedEntityGraph!

@NamedEntityGraph(
    name="ExecutiveProjects",
    attributeNodes={
        @NamedAttributeNode("address"),
        @NamedAttributeNode(
            value="projects",
            subgraph="projects"
        )
    },
    subgraphs={
        @NamedSubgraph(
            name="projects",
            attributeNodes={
                @NamedAttributeNode("properties")
            }
        ),
        @NamedSubgraph(
            name="projects",
            type=LargeProject.class,
            attributeNodes={
                @NamedAttributeNode("executive")
            }
        )
    }
)

Man, did this really have to be added to JPA? Stack Overflow cannot display that many annotations on a single screen! Well, if this is JEE’s answer to SQL’s recent evolutions, then I’m glad I don’t do too much JEE these days. I’m doing SQL, which is an awesome language if let loose.