Using SQL to Calculate the Popularity (on Stack Overflow) of Derby, H2, and HSQLDB

Few people know about one truly awesome feature of the Stack Exchange platform: the Stack Exchange Data Explorer.

[Image: Stack Exchange Data Explorer]

To be found here:
http://data.stackexchange.com

As you may know, much of the Stack Exchange platform runs on SQL Server (interesting architecture details here: http://stackexchange.com/performance), and the team has had the courtesy of making a lot of data publicly available through a SQL web API. Here’s the schema that you can query:

[Image: Stack Exchange Data Explorer schema]

Using a running total to calculate cumulative daily questions per tag

The analytics possibilities with such a public schema are endless. Today, we’ll look into a question that has interested a lot of users in the past: What is the most “popular” Java in-memory database among Derby (also known as Java DB, which ships with the JDK), the popular test database H2 (see also our interview with Thomas Müller, its creator), and HSQLDB?

What we’d like to do is sum up the number of questions per database, up to any given date. This should give us one of those nice exponential curves that managers like so much.

Here’s the SQL query that we’ll run:

SELECT 
  d,
  SUM(h2)     OVER (ORDER BY d) AS h2,
  SUM(hsqldb) OVER (ORDER BY d) AS hsqldb,
  SUM(derby)  OVER (ORDER BY d) AS derby
FROM (
  SELECT 
    CAST(CreationDate AS DATE) AS d, 
    COUNT(CASE WHEN Tags LIKE '%<h2>%'     THEN 1 END) AS h2,
    COUNT(CASE WHEN Tags LIKE '%<hsqldb>%' THEN 1 END) AS hsqldb,
    COUNT(CASE WHEN Tags LIKE '%<derby>%'  THEN 1 END) AS derby
  FROM Posts
  GROUP BY CAST(CreationDate AS DATE)
) AS DailyPosts
ORDER BY d ASC

A short explanation:

The nested select "DailyPosts" creates a PIVOT table with the aggregated number of questions per database and date. We could have used the SQL Server PIVOT clause, if the Stack Exchange platform had stored tagging information in a normalised form, but the equivalent COUNT(CASE) expressions work just as nicely (see also our article about PostgreSQL’s aggregation FILTER clause for more inspiration).

Now that we have the number of posts per tag and day, all that’s left to do is sum up those numbers from the first day to any given day. This is often called a “running total”, and it can be calculated very easily using the SUM() OVER() window function.
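To experiment with the pivot-plus-running-total pattern without access to the Data Explorer, here is a minimal sketch in Python using the built-in sqlite3 module. It is not the Stack Exchange setup: the five rows are made-up sample data, the Posts table is reduced to the two columns the query reads, SQLite's DATE() stands in for SQL Server's CAST(... AS DATE), and window functions require SQLite 3.25 or later.

```python
import sqlite3

# Hypothetical miniature of the Posts table (made-up rows, only two columns).
rows = [
    ("2015-01-01 09:00:00", "<java><h2>"),
    ("2015-01-01 10:30:00", "<hsqldb>"),
    ("2015-01-02 08:15:00", "<h2><jpa>"),
    ("2015-01-02 12:00:00", "<derby>"),
    ("2015-01-03 17:45:00", "<java><h2><hsqldb>"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Posts (CreationDate TEXT, Tags TEXT)")
conn.executemany("INSERT INTO Posts VALUES (?, ?)", rows)

# Inner query: one row per day with a count per tag (the "pivot").
# Outer query: running totals over those daily counts.
query = """
SELECT
  d,
  SUM(h2)     OVER (ORDER BY d) AS h2,
  SUM(hsqldb) OVER (ORDER BY d) AS hsqldb,
  SUM(derby)  OVER (ORDER BY d) AS derby
FROM (
  SELECT
    DATE(CreationDate) AS d,
    COUNT(CASE WHEN Tags LIKE '%<h2>%'     THEN 1 END) AS h2,
    COUNT(CASE WHEN Tags LIKE '%<hsqldb>%' THEN 1 END) AS hsqldb,
    COUNT(CASE WHEN Tags LIKE '%<derby>%'  THEN 1 END) AS derby
  FROM Posts
  GROUP BY DATE(CreationDate)
) AS DailyPosts
ORDER BY d ASC
"""

running_totals = conn.execute(query).fetchall()
for row in running_totals:
    print(row)
```

Each output row carries the cumulative count per tag up to and including that day, so the columns never decrease from one row to the next.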

Now we’re done. You can run and play around with this query here:
http://data.stackexchange.com/stackoverflow/query/469392/java-in-memory-database-popularity-by-time

The raw result is not very interesting. It’s a lot of numbers and dates. But if we plot that result as a chart, we get this nice-looking curve:

[Image: chart of cumulative questions per tag for Derby, H2, and HSQLDB]

As we can see, all three databases are roughly equivalent in terms of “popularity”, although H2 seems to be gaining momentum while HSQLDB is on a slight decline.

(Obviously, this “popularity” is not representative of true market share. More questions might just mean that people struggle more with the technology, or that less skilled people are using it.)

Have fun further exploring the Stack Exchange Data Explorer:
http://data.stackexchange.com

(all Stack Exchange Subscriber Content is licensed under the terms of the CC BY-SA 3.0 license. For more details, see: http://stackexchange.com/legal)

Further articles

Further articles that are interesting in the context of the displayed query:

Apache Derby About to Adopt the Awesome SQL:2003 MERGE Statement

Apache Derby is one of three popular embeddable Java databases (the others being H2 and HSQLDB). It is very compliant with the SQL and JDBC standards, but perhaps a bit behind on more advanced SQL features. Around 6 years after its first submission, there has recently been some action on the Apache Derby DERBY-3155 ticket. Rick Hillegas has attached a first, promising draft for the MERGE statement specification, which can be seen here:
https://issues.apache.org/jira/secure/attachment/12597795/MergeStatement.html

Among all 14 of the SQL databases supported by jOOQ, Derby would thus be the 8th to support the SQL:2003 MERGE statement, lining up with:

Other databases support proprietary versions of MERGE:

… or other forms of UPSERT:

Visit DERBY-3155 and show the maintainers some love for implementing this awesome and powerful SQL statement!

Simulation of TRUNC() in Derby

Derby is missing a lot of the functions that other databases usually provide. One example is the TRUNC(value, decimals) function. According to Wikipedia, truncation can be achieved as follows:

-- trunc(x, n) 
CASE WHEN x > 0 
THEN
  floor(power(10, n) * x) / power(10, n) 
ELSE 
  ceil(power(10, n) * x) / power(10, n) 
END

Unfortunately, there is no POWER(base, exponent) function in Derby either. But no problem, we can simulate that as well. Consulting Wikipedia again, we find:

power(b, x) = exp(x * ln(b))

If we substitute that into the original simulation, we get for Derby:

-- trunc(x, n) 
CASE WHEN x > 0 
THEN 
  floor(exp(n * ln(10)) * x) / exp(n * ln(10))
ELSE 
  ceil(exp(n * ln(10)) * x) / exp(n * ln(10)) 
END

Verbose, probably quite inefficient, but effective! Let’s run a short test, too:

create table test (x numeric(10, 5), n int);

insert into test values (11.111, 0);
insert into test values (11.111, 1);
insert into test values (11.111, 2);
insert into test values (11.111, -1);

select
  x, n, 
  case when x >= 0
  then
    floor(exp(n * ln(10)) * x) / exp(n * ln(10))
  else
    ceil(exp(n * ln(10)) * x) / exp(n * ln(10))
  end "trunc(x, n)"
from test;

The above yields

X       N   TRUNC(X, N)
11.111  0   11
11.111  1   11.1
11.111  2   11.11
11.111  -1  10
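The same arithmetic can be cross-checked outside the database. The following is a minimal Python sketch that mirrors the CASE expression, with exp(n * ln(10)) standing in for power(10, n). The function name trunc_sim is made up for illustration, and Python's floating point math is not guaranteed to behave exactly like Derby's DOUBLE arithmetic.

```python
import math


def trunc_sim(x: float, n: int) -> float:
    """Truncate x to n decimals using only exp, ln, floor and ceil."""
    # exp(n * ln(10)) stands in for power(10, n); it is a floating point
    # approximation, not an exact power of ten.
    scale = math.exp(n * math.log(10))
    if x >= 0:
        return math.floor(scale * x) / scale
    return math.ceil(scale * x) / scale


# Same inputs as the SQL test above; rounding only tidies the printed value.
for n in (0, 1, 2, -1):
    print(n, round(trunc_sim(11.111, n), 5))
```

Because the scale factor is only an approximation, an input whose scaled value lies exactly on an integer boundary may end up one unit off in the last kept digit, so the results are best treated as approximate.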

Derby casting madness – the sequel

I have recently blogged about the general bind variable casting madness in SQL:

https://lukaseder.wordpress.com/2011/08/31/rdbms-bind-variable-casting-madness/

So this is the sequel to the above story, purely dedicated to Derby with its “conversion table from hell”. One of jOOQ’s goals is to make SQL as compatible as possible across various databases, so that you can re-use the same SQL in various environments. For instance:

  • Use Derby to develop your database
  • Use DB2 for production

While I personally discourage such set-ups, I know that many developers prefer them, especially when it comes to running fast integration tests. And the above coupling of Derby and DB2 is an especially good one, as Derby is quite similar to DB2. See also this Stack Overflow question:

http://stackoverflow.com/questions/4419684/portable-schema-between-derby-and-db2

But back to casting. In order to make casting as compatible as possible, jOOQ generates casting SQL according to the following rules:

Casting NUMERIC to VARCHAR

Interestingly, this is not supported, but casting to CHAR is. So jOOQ generates:

-- When 123 is inlined:
trim(cast(cast(123 as char(38)) as varchar(32672)))

-- When 123 is bound as a variable
trim(cast(cast(cast(? as int) as char(38)) as varchar(32672)))

Casting CHAR/VARCHAR to DOUBLE/FLOAT/REAL

Again, this isn’t supported for some reason. So jOOQ generates:

-- When 123.0 is inlined:
cast(cast('123.0' as decimal) as float)

-- When 123.0 is bound as a variable
cast(cast(cast(? as varchar(32672)) as decimal) as float)

Casting NUMERIC to BOOLEAN

This can’t be expressed simply with a CAST clause. Instead, jOOQ renders a CASE .. WHEN expression (note that Derby doesn’t support the simple CASE clause either…):

case when cast(? as int) = 0 then false
     when cast(? as int) is null then null
     else true
end

Casting CHAR/VARCHAR to BOOLEAN

The Derby documentation claims that this should work, but I’ve experienced quite some trouble. Derby seems to accept only SQL standard boolean literals and rejects values such as ‘0’, ‘1’, etc., whereas most databases accept ‘0’ and ‘1’ as boolean string values as well. So jOOQ simulates the cast as follows:

case when       cast(? as varchar(32672))  = '0' then false
     when lower(cast(? as varchar(32672))) = 'false' then false
     when lower(cast(? as varchar(32672))) = 'f' then false
     when cast(? as varchar(32672)) is null then null
     else true
end
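The mapping that this CASE expression encodes can be read more easily as a small function. Here is a sketch in Python that follows the branches above. The function name is hypothetical and this is not jOOQ code, only an illustration of the rule: ‘0’, ‘false’ and ‘f’ (case-insensitive for the latter two) map to false, NULL stays NULL, and everything else maps to true.

```python
from typing import Optional


def varchar_to_boolean(value: Optional[str]) -> Optional[bool]:
    """Illustrative mirror of the CASE expression above (not jOOQ's API)."""
    # In SQL, comparing NULL with '0' is unknown, so a NULL input falls
    # through to the IS NULL branch; checking None first is equivalent.
    if value is None:
        return None
    if value == "0":
        return False
    if value.lower() in ("false", "f"):
        return False
    # Every other string, including '1' and 'true', is treated as true.
    return True


for sample in ("0", "1", "FALSE", "f", "yes", None):
    print(repr(sample), "->", varchar_to_boolean(sample))
```

Note that the fallback branch is deliberately permissive: any string that is not one of the recognised false values counts as true.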

Casting other pairs

Luckily, all other common types of casts seem to work as expected, also in the Derby database.