Transform Your SQL Data into Charts Using jOOQ and JavaFX

In the recent past, we’ve shown how Java 8 and functional programming will bring a new perspective to Java developers when it comes to functional data transformation of SQL data using jOOQ and Java 8 lambdas and Streams.

Today, we take this a step further and transform the data into JavaFX XYChart.Series to produce nice-looking bar charts from our data.

Setting up the database

We’re going to be using a small subset of the World Bank’s Open Data again, in a PostgreSQL database. The data that we’re using is this here:

DROP SCHEMA IF EXISTS world;

CREATE SCHEMA world;

CREATE TABLE world.countries (
  code CHAR(2) NOT NULL,
  year INT NOT NULL,
  gdp_per_capita DECIMAL(10, 2) NOT NULL,
  govt_debt DECIMAL(10, 2) NOT NULL
);
 
INSERT INTO world.countries
VALUES ('CA', 2009, 40764, 51.3),
       ('CA', 2010, 47465, 51.4),
       ('CA', 2011, 51791, 52.5),
       ('CA', 2012, 52409, 53.5),
       ('DE', 2009, 40270, 47.6),
       ('DE', 2010, 40408, 55.5),
       ('DE', 2011, 44355, 55.1),
       ('DE', 2012, 42598, 56.9),
       ('FR', 2009, 40488, 85.0),
       ('FR', 2010, 39448, 89.2),
       ('FR', 2011, 42578, 93.2),
       ('FR', 2012, 39759,103.8),
       ('GB', 2009, 35455,121.3),
       ('GB', 2010, 36573, 85.2),
       ('GB', 2011, 38927, 99.6),
       ('GB', 2012, 38649,103.2),
       ('IT', 2009, 35724,121.3),
       ('IT', 2010, 34673,119.9),
       ('IT', 2011, 36988,113.0),
       ('IT', 2012, 33814,131.1),
       ('JP', 2009, 39473,166.8),
       ('JP', 2010, 43118,174.8),
       ('JP', 2011, 46204,189.5),
       ('JP', 2012, 46548,196.5),
       ('RU', 2009,  8616,  8.7),
       ('RU', 2010, 10710,  9.1),
       ('RU', 2011, 13324,  9.3),
       ('RU', 2012, 14091,  9.4),
       ('US', 2009, 46999, 76.3),
       ('US', 2010, 48358, 85.6),
       ('US', 2011, 49855, 90.1),
       ('US', 2012, 51755, 93.8);

(see also this article here about another awesome set of SQL queries against the above data)

What we want to do now is plot the two sets of values in two different bar charts:

  • Each country’s GDP per capita in each year between 2009-2012
  • Each country’s debt as a percentage of its GDP in each year between 2009-2012

This will then create 8 series with four data points for each series in both charts. In addition to the above, we’d like to order the series among themselves by the average projected value between 2009-2012, such that the series – and thus the countries – can be compared easily.

This is obviously easier to explain visually via the resulting chart than in text, so stay tuned until the end of the article.

Collecting the data with jOOQ and JavaFX

The query that we would write to calculate the above data series would look as follows in plain SQL:

select 
    COUNTRIES.YEAR, 
    COUNTRIES.CODE, 
    COUNTRIES.GOVT_DEBT 
from 
    COUNTRIES 
join (
    select 
        COUNTRIES.CODE, 
        avg(COUNTRIES.GOVT_DEBT) avg 
    from 
        COUNTRIES 
    group by 
        COUNTRIES.CODE
) c1 
on COUNTRIES.CODE = c1.CODE 
order by 
    avg asc, 
    COUNTRIES.CODE asc, 
    COUNTRIES.YEAR asc

In other words, we’ll simply select the relevant columns from the COUNTRIES table, and we’ll self-join the average projected value per country such that we can order the result by that average. The same query could be written using window functions. We’ll get to that later on.

The code that we’ll write to create such a bar chart with jOOQ and JavaFX is the following:

CategoryAxis xAxis = new CategoryAxis();
NumberAxis yAxis = new NumberAxis();
xAxis.setLabel("Country");
yAxis.setLabel("% of GDP");

BarChart<String, Number> bc = 
    new BarChart<>(xAxis, yAxis);
bc.setTitle("Government Debt");
bc.getData().addAll(

    // SQL data transformation, executed in the DB
    // -------------------------------------------
    DSL.using(connection)
       .select(
           COUNTRIES.YEAR,
           COUNTRIES.CODE,
           COUNTRIES.GOVT_DEBT)
       .from(COUNTRIES)
       .join(
           table(
               select(
                   COUNTRIES.CODE, 
                   avg(COUNTRIES.GOVT_DEBT).as("avg"))
               .from(COUNTRIES)
               .groupBy(COUNTRIES.CODE)
           ).as("c1")
       )
       .on(COUNTRIES.CODE.eq(
           field(
               name("c1", COUNTRIES.CODE.getName()), 
               String.class
           )
       ))

       // order countries by their average 
       // projected value
       .orderBy(
           field(name("avg")),
           COUNTRIES.CODE,
           COUNTRIES.YEAR)

       // The result produced by the above statement
       // looks like this:
       // +----+----+---------+
       // |year|code|govt_debt|
       // +----+----+---------+
       // |2009|RU  |     8.70|
       // |2010|RU  |     9.10|
       // |2011|RU  |     9.30|
       // |2012|RU  |     9.40|
       // |2009|CA  |    51.30|
       // +----+----+---------+

    // Java data transformation, executed in app memory
    // ------------------------------------------------

       // Group results by year, keeping sort 
       // order in place
       .fetchGroups(COUNTRIES.YEAR)

       // The generic type of this is inferred...
       // Stream<Entry<Integer, Result<
       //     Record3<BigDecimal, String, Integer>>
       // >>
       .entrySet()
       .stream()

       // Map entries into { Year -> Projected value }
       .map(entry -> new XYChart.Series<>(
           entry.getKey().toString(),
           observableArrayList(

           // Map records into a chart Data
           entry.getValue().map(country -> 
                new XYChart.Data<String, Number>(
                  country.getValue(COUNTRIES.CODE),
                  country.getValue(COUNTRIES.GOVT_DEBT)
           ))
           )
       ))
       .collect(toList())
);

The interesting thing here is really that we can fetch data from the database, and later on, transform it into JavaFX data structures all in one go. The whole thing is almost a single Java statement.

SQL and Java is cleanly separated

As we’ve blogged on this blog before, there is a very important difference when comparing the above approach to LINQ or to JPQL’s DTO fetching capabilities. The SQL query is cleanly separated from the Java in-memory data transformation, even if we express the whole transformation in one single statement.

We want to be as precise as possible when expressing our SQL query for the database to be able to calculate the optimal execution plan. Only once we have materialised our data set, the Java 8 Stream transformation will kick in.

The importance of this is made clear when we change the above SQL-92 compatible query with a SQL-1999 compatible one that makes use of awesome window functions. The jOOQ part of the above statement could be replaced by the following query:

DSL.using(connection)
   .select(
       COUNTRIES.YEAR,
       COUNTRIES.CODE,
       COUNTRIES.GOVT_DEBT)
   .from(COUNTRIES)
   .orderBy(
       avg(COUNTRIES.GOVT_DEBT)
           .over(partitionBy(COUNTRIES.CODE)),
       COUNTRIES.CODE,
       COUNTRIES.YEAR)
   ;

… or in SQL:

select
    COUNTRIES.YEAR,
    COUNTRIES.CODE,
    COUNTRIES.GOVT_DEBT
from
    COUNTRIES
order by
    avg(COUNTRIES.GOVT_DEBT) 
        over (partition by COUNTRIES.CODE),
    COUNTRIES.CODE,
    COUNTRIES.YEAR

As you can see, staying in control of you actual SQL statement is of the essence when you run such reports. There is no way you could have refactored ordering via nested selects into a much more efficient ordering via window functions as easily as this. Let alone refactoring dozens of lines of Java sorting logic.

Yep. It’s hard to beat the beauty of window functions

If we add some additional JavaFX boilerplate to put the chart into a Pane, a Scene, and a Stage, we’ll get these nice-looking charts below:

SQL with JavaFX and jOOQ

Play with it yourself

You can download and run the above example yourself. Simply download the following example and run mvn clean install:

https://github.com/jOOQ/jOOQ/tree/master/jOOQ-examples/jOOQ-javafx-example

How to Emulate the MEDIAN() Aggregate Function Using Inverse Distribution Functions

Some databases are awesome enough to implement the MEDIAN() aggregate function. Remember that the MEDIAN() is sligthly different from (and often more useful than) the MEAN() or AVG() (average).

While the average is calculated as the SUM(exp) / COUNT(exp), the MEDIAN() tells you that 50% of all values in the sample are higher than the MEDIAN() whereas the other 50% of the set are lower than the MEDIAN().

So, in other words, if you take the following query:

WITH t(value) AS (
  SELECT 1   FROM DUAL UNION ALL
  SELECT 2   FROM DUAL UNION ALL
  SELECT 3   FROM DUAL
)
SELECT
  avg(value),
  median(value)
FROM
  t;

… then both average and median are the same:

avg   median
2     2

But if you heavily skew your data like this:

WITH t(value) AS (
  SELECT 1   FROM DUAL UNION ALL
  SELECT 2   FROM DUAL UNION ALL
  SELECT 100 FROM DUAL
)
SELECT
  avg(value),
  median(value)
FROM
  t;

Then your average will also be skewed, whereas your median will still indicate where most of the values are in your sample

avg      median
34.333   2

The above sample is of course statistically insignificant, but you can easily see that the effect can be dramatic and relevant, if you have more data:

Image license CC-BY-SA 3.0. Uploaded to Wikipedia by Cmglee
Image license CC-BY-SA 3.0. Uploaded to Wikipedia by Cmglee

The skewing effect is very important in statistics and in order to make any interesting claim about anything, using percentiles is most often more useful than using averages. Take the average income vs. the median income in a country, for instance. While the average income in the U.S. (and in many other countries) has been steadily increasing, the median income has seen a decline over the past decade. This is due to wealth being heavily skewed towards the super-rich more and more.

This blog is not about politics but about Java and SQL, so let’s get back into calculating the actual facts.

Using precentiles in SQL

As we’ve seen before, the MEDIAN() divides a sample into two equally-sized groups and takes the value “between” those two groups. This particular value is also called the 50th percentile because 50% of all values in the sample are lower than the MEDIAN(). We can thus establish:

  • MIN(exp): The 0-percentile
  • MEDIAN(exp): The 50th-percentile
  • MAX(exp): The 100th-percentile

All of the above are special cases of percentiles, and while MIN() and MAX() are supported in all SQL databases (and the SQL standard), MEDIAN() is not in the SQL standard and only supported by the following jOOQ databases:

  • CUBRID
  • HSQLDB
  • Oracle
  • Sybase SQL Anywhere

There is another way of calculating the MEDIAN() in particular and any sort of percentile in general in the SQL standard, and since PostgreSQL 9.4 also in PostgreSQL using …

Ordered-set aggregate functions

Interestingly, apart from window functions, you can also specify ORDER BY clauses to certain aggregate functions that aggregate data based on ordered sets.

One such function is the SQL standard percentile_cont function, which takes the percentile as an argument, and then accepts an additional WITHIN GROUP clause that takes an ORDER BY clause as an argument. These particular ordered-set functions are also called inverse distribution functions, because we want to find where a particular percentile is located in the distribution of all values in the sample (if you’re not scared by the math, check out the wikipedia article)

So, in PostgreSQL 9.4+, the MEDIAN() function can be emulated like this:

WITH t(value) AS (
  SELECT 1   UNION ALL
  SELECT 2   UNION ALL
  SELECT 100
)
SELECT
  avg(value),
  percentile_cont(0.5) WITHIN GROUP (ORDER BY value)
FROM
  t;

This interesting syntax is standardised and may be known to some of you from Oracle’s LISTAGG(), which allows to aggregate values into concatenated strings:

WITH t(value) AS (
  SELECT 1   FROM DUAL UNION ALL
  SELECT 2   FROM DUAL UNION ALL
  SELECT 100 FROM DUAL
)
SELECT
  listagg(value, ', ') WITHIN GROUP (ORDER BY value)
FROM
  t;

This query yields simply:

listagg
---------
1, 2, 100

On a side-note: LISTAGG() is, of course, completely useless, because it returns VARCHAR2, which again has a max length of 4000 in Oracle. Useless…

Emulation out-of-the-box with jOOQ

As always, jOOQ will emulate these kinds of things out of the box. You can either use the DSL.median() function, or with the upcoming jOOQ 3.6, the new DSL.percentileCont() function to produce the same value:

DSL.using(configuration)
   .select(
       median(T.VALUE),
       percentileCont(0.5).withinGroupOrderBy(T.VALUE)
   )
   .from(T)
   .fetch();

jOOQ is the best way to write SQL in Java