Why You Should Design Your Database to Optimise for Statistics

In my SQL Masterclass, I frequently remind participants of how important statistics are for a modern cost-based optimiser. Consider, for instance, that in an average e-banking system’s bookings table, you will probably have a transaction amount histogram like the following:

[Figure: histogram of transaction amounts]

In other words, most of your transactions are probably small things, like a payment in a restaurant (around USD -50.00), or, on the other side, the restaurant receiving that payment (around USD +50.00). Only a few bookings have an amount of several thousands, or, if you’re an insanely good SQL trainer like me, several USD 100k, autographs included.

That’s a classic bell curve, although nothing says that your data must follow a bell curve rather than any other distribution. Let’s not delve into the different types of distributions, but simply accept that there are some value ranges for amount that appear much more frequently than others, and that it’s not always easy to predict this distribution.
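
If you want to eyeball this distribution yourself, a simple bucketing query will do. This is only a sketch against a hypothetical bookings table with an amount column (neither is defined in this post):

-- Bucket the amounts into ranges of 1000 and count per bucket
SELECT FLOOR(amount / 1000) * 1000 AS bucket_from,
       COUNT(*)                    AS cnt
FROM   bookings
GROUP BY FLOOR(amount / 1000) * 1000
ORDER BY bucket_from;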

So, what does this mean for a database?

Concretely, let’s look at the following two predicates:

-- Predicate 1
WHERE amount BETWEEN -2000 AND 2000

-- Predicate 2
WHERE amount BETWEEN 5000 AND 9000

Both predicates query a range of amount whose size is 4000, so the two ranges are of equal size. But do both queries return the same number of rows? In other words, are both predicates equally selective? Or, put differently, do both predicates benefit equally from indexing?

The answer is: Not at all!

Here’s what the first predicate does:

[Figure: the histogram, highlighting the range -2000 to 2000]

It’s not selective at all. We might as well scan the entire table to select almost 95% of our rows!

Whereas the second predicate …

[Figure: the histogram, highlighting the range 5000 to 9000]

Now we’re talking! We’re selecting only 2% – 3% of the data in the table. We’ll definitely want to use an index for this!
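
To put numbers on those percentages, you can measure the actual selectivity of both predicates in a single pass. Again, this is just a sketch against the hypothetical bookings table:

SELECT COUNT(CASE WHEN amount BETWEEN -2000 AND 2000 THEN 1 END) / COUNT(*) AS selectivity_1,
       COUNT(CASE WHEN amount BETWEEN  5000 AND 9000 THEN 1 END) / COUNT(*) AS selectivity_2
FROM   bookings;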

The importance of statistics

I’m frequently surprised by how many developers are shielded from production environments in a way that prevents them from learning about the importance of database statistics. Or maybe they know that statistics exist, but they’re not sure how important they are.

This is extremely important, folks. Databases make a lot of decisions based on statistics, because cost-based optimisers (most databases have one) use a tremendous amount of information about your data (including statistics) to decide which algorithm is the best option for running your SQL query. There’s absolutely no way you could figure out and manually write a better algorithm in a non-4GL language, because your assumptions are inevitably wrong once you go to production. That’s why SQL is so strong: stuff still works, because the optimiser adapts execution plans as your data characteristics change.

Let me repeat this:

It’s crazy to think that there are still some folks out there who think you (and your ORM) can outsmart a database optimiser when running a complex SQL query. You cannot. Just write the query in SQL already and use the power of your database!

Obviously, you (both you the developer and you the DBA) must always ensure that your statistics are up to date.
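
As a quick sanity check, the data dictionary tells you when statistics were last gathered, and dbms_stats refreshes them. A minimal sketch, again assuming the hypothetical bookings table:

-- When were statistics last gathered?
SELECT table_name, num_rows, last_analyzed
FROM   user_tab_statistics
WHERE  table_name = 'BOOKINGS';

-- Refresh them
BEGIN
  dbms_stats.gather_table_stats(user, 'BOOKINGS');
END;
/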

OK, they’re up to date, why is my query still slow?

Check out the above histogram again. Real-world histograms may work slightly (or entirely) differently, but the key here is that in our example, we only have 22 buckets of equal range size (e.g. amount BETWEEN 1000 AND 2000). Each bucket comprises a range [A, B[ where B - A = 1000. It’s just an example, but you get the idea. This is already a great help in assessing that most of our data falls into the middle 4 buckets, i.e. amount BETWEEN -2000 AND 2000.

The way statistics are calculated is simple to understand. We can’t have an infinite number of buckets – we must keep the number of buckets small, because when we write a predicate like amount BETWEEN -2000 AND 2000, the optimiser needs to derive a cardinality estimate from 4 buckets (in our example). Imagine the optimiser having to run through thousands of buckets just to sum up their sizes – the estimate alone would take a long time, and the optimiser has to evaluate a large number of possible execution plans to find the best one. We simply don’t have that time.
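
In Oracle, those buckets can actually be inspected in the data dictionary. The following is a sketch, once more assuming the hypothetical bookings table and its amount column:

SELECT endpoint_number, endpoint_value
FROM   user_tab_histograms
WHERE  table_name  = 'BOOKINGS'
AND    column_name = 'AMOUNT'
ORDER BY endpoint_number;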

So, statistics are high level approximations of our data. Fine, some error is probably OK, we’re still fast enough, right?

Usually yes, but let’s look at a very peculiar edge case where you might not have thought of statistics at all – a hypothetical YEAR_MONTH data type:

The hypothetical YEAR_MONTH data type

In the following part of the post, I’m using the Oracle database, but everything I’m saying is true for other databases as well.

When we’re in the domain of date time arithmetic, we humans often like to have dates or date parts in a human-readable form. Date time arithmetic is hard enough on its own. Here’s some fun reading to illustrate what I mean.

Let’s say we’re in the business of doing reporting. And in reporting (and accounting), we often like to have stuff aggregated by a YEAR_MONTH, i.e. something of the form YYYY-MM.

Most databases have no native data type for this. But we can encode it in different ways, e.g. as a string, a number, or a date. And for each type, there are different encodings, too. Let’s look at four strategies:

CREATE TABLE dates (
  d_date              DATE        NOT NULL,
  d_string            VARCHAR2(6) NOT NULL,
  d_number            NUMBER(6)   NOT NULL,
  d_number_continuous NUMBER(6)   NOT NULL
);

The idea here is that the d_date column will contain the YEAR_MONTH value truncated to the first day of the month. For instance, we’ll store 2016-12-01 when we really mean 201612. That looks totally OK, right? If you want to enforce data integrity (and you should), you can optionally add a CHECK constraint as such:

ALTER TABLE dates
ADD CONSTRAINT d_date_chk 
  CHECK (d_date = TRUNC(d_date, 'MM'));

The d_string and d_number columns will contain string and number representations of values like '201612' or 201612, respectively. Again, we could add CHECK constraints (a sketch follows below), but that’s not the point.
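
For completeness, such constraints could look like the following. The constraint names and expressions here are purely illustrative, not part of the original example:

ALTER TABLE dates
ADD CONSTRAINT d_string_chk
  CHECK (REGEXP_LIKE(d_string, '^[0-9]{4}(0[1-9]|1[0-2])$'));

ALTER TABLE dates
ADD CONSTRAINT d_number_chk
  CHECK (MOD(d_number, 100) BETWEEN 1 AND 12);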

And the magic d_number_continuous column will encode the month as the “number of months since January 1970”. We just need some epoch; by convention, let’s use the unix epoch and start counting from 1, because … well, months (what kind of troll designed JavaScript months to be zero-based is beyond me). So, 201612 is encoded as

564 = (2016 - 1970) * 12 + 12

Yes, 564 is not human readable, but we’ll see afterwards why this encoding is so magic for our query.
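
The encoding is also trivially reversible, which matters if you ever need to display the value again. Here’s a small sketch of both directions – the encoding arithmetic is the same as above, the decoding expressions are my own:

-- Encode December 2016
SELECT (2016 - 1970) * 12 + 12 AS d_number_continuous
FROM   dual;                                   -- 564

-- Decode 564 back into year and month
SELECT 1970 + TRUNC((564 - 1) / 12) AS yyyy,   -- 2016
       MOD(564 - 1, 12) + 1         AS mm      -- 12
FROM   dual;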

Fill the table

Let’s see how these encodings fare and compare against each other. First off, let’s add some actual data:

INSERT INTO dates
SELECT 
  trunc(d, 'MM'),
  to_char(d, 'YYYYMM'),
  to_number(to_char(d, 'YYYYMM')),
  (to_number(to_char(d, 'YYYY')) - 1970) * 12 +
  (to_number(to_char(d, 'MM')))
FROM (
  SELECT DATE '2010-01-01' + level AS d 
  FROM dual
  CONNECT BY level <= 100000
) t
ORDER BY dbms_random.random -- Btw: Don’t do this
;

The above query generates 100000 consecutive dates, truncates each date to its month, orders the values randomly to add some entropy to the test setup (don’t do this in real queries, btw!), and then stores the resulting YEAR_MONTH value in each of the 4 encodings. Easy.

Now, let’s add indexes on each column:

CREATE INDEX idx_date              ON dates(d_date);
CREATE INDEX idx_string            ON dates(d_string);
CREATE INDEX idx_number            ON dates(d_number);
CREATE INDEX idx_number_continuous ON dates(d_number_continuous);

And, of course, calculate statistics!

BEGIN
  dbms_stats.gather_table_stats (user, 'DATES');
END;
/
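
As an optional sanity check (not part of the original setup), you can verify what the optimiser now knows about each column:

SELECT column_name, num_distinct, num_buckets, histogram
FROM   user_tab_col_statistics
WHERE  table_name = 'DATES';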

Showtime

Here are 4 queries that produce the exact same result, each with a predicate on one of the 4 encodings of our YEAR_MONTH values, and each covering the range December 2016 – January 2017, i.e. we should be getting 62 rows each time:

SELECT count(*) FROM dates 
WHERE d_date BETWEEN DATE '2016-12-01' AND DATE '2017-01-01';

SELECT count(*) FROM dates
WHERE d_string BETWEEN '201612' AND '201701';

SELECT count(*) FROM dates 
WHERE d_number BETWEEN 201612 AND 201701;

SELECT count(*) FROM dates 
WHERE d_number_continuous BETWEEN 564 AND 565;

The result is:

  COUNT(*)
----------
        62

  COUNT(*)
----------
        62

  COUNT(*)
----------
        62

  COUNT(*)
----------
        62

Now, what happens if we SELECT *? Let’s do it, and instead of executing the query, let’s focus on the execution plan:

EXPLAIN PLAN FOR 
SELECT * FROM dates 
WHERE d_date BETWEEN DATE '2016-12-01' AND DATE '2017-01-01';

SELECT * FROM TABLE(dbms_xplan.display);

EXPLAIN PLAN FOR 
SELECT * FROM dates
WHERE d_string BETWEEN '201612' AND '201701';

SELECT * FROM TABLE(dbms_xplan.display);

EXPLAIN PLAN FOR 
SELECT * FROM dates 
WHERE d_number BETWEEN 201612 AND 201701;

SELECT * FROM TABLE(dbms_xplan.display);

EXPLAIN PLAN FOR 
SELECT * FROM dates 
WHERE d_number_continuous BETWEEN 564 AND 565;

SELECT * FROM TABLE(dbms_xplan.display);

The above queries will yield:

--------------------------------------------------------
| Id  | Operation                   | Name     | Rows  |
--------------------------------------------------------
|   0 | SELECT STATEMENT            |          |    92 |
|   1 |  TABLE ACCESS BY INDEX ROWID| DATES    |    92 |
|*  2 |   INDEX RANGE SCAN          | IDX_DATE |    92 |
--------------------------------------------------------

-------------------------------------------
| Id  | Operation         | Name  | Rows  |
-------------------------------------------
|   0 | SELECT STATEMENT  |       |   387 |
|*  1 |  TABLE ACCESS FULL| DATES |   387 |
-------------------------------------------

-------------------------------------------
| Id  | Operation         | Name  | Rows  |
-------------------------------------------
|   0 | SELECT STATEMENT  |       |   387 |
|*  1 |  TABLE ACCESS FULL| DATES |   387 |
-------------------------------------------

---------------------------------------------------------------------
| Id  | Operation                   | Name                  | Rows  |
---------------------------------------------------------------------
|   0 | SELECT STATEMENT            |                       |    91 |
|   1 |  TABLE ACCESS BY INDEX ROWID| DATES                 |    91 |
|*  2 |   INDEX RANGE SCAN          | IDX_NUMBER_CONTINUOUS |    91 |
---------------------------------------------------------------------

Interesting! All of our cardinality estimates (the Rows columns) are off, but they’re off by an excessive amount for the middle two encodings (d_string and d_number). This leads the database to erroneously believe that a full table scan is the best option when we query those columns.

On the other hand, the estimates for the d_date and d_number_continuous columns are still “good enough” for the query to be reasonably run on the index.

Why is that the case?

In all cases, our statistics have been calculated, but not in all cases are they accurate and precise enough. Let’s look at a subtle difference between the encodings, keeping in mind that we’re querying the range 201612 to 201701, with a year switch right in the middle:

  • d_date “wastes” quite a few values between each valid date, but that’s OK, because the amount of values left out of the type’s domain is equally distributed across our data set: there are always between 27 and 30 days per month that never appear in our data. In other words, while we have gaps between our values, the uniform distribution spreads the values “evenly” across each statistics bucket, preventing side effects in this case.
  • d_string, on the other hand, has a lot of possible values between '201612' and '201701'. There’s no way the optimiser can know from a sample that a value like '2016aa' will never appear in between – it might simply have slipped past the statistics. So, in this data type, a range that spans the end of a year is an entirely different range than, for instance, d_string BETWEEN '201701' AND '201702'.
  • d_number is similar to d_string and suffers from the same problem. There are 88 integer values strictly between 201612 and 201701, none of which can ever be a valid month, but the optimiser cannot know this (see the small illustration after this list).
  • d_number_continuous appears to be an encoding just as good as d_date with respect to statistics, as there are no possible gaps between consecutive values, and our statistics are thus most accurate. Bonus: prove, by finding an edge case, that this is a better encoding than d_date (or prove that it is not).
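
The “gap” problem for d_number can be made visible with a tiny query against our table: the BETWEEN range spans 90 integers, but only 2 of them ever occur as months.

SELECT COUNT(DISTINCT d_number) AS existing_values,  -- 2
       201701 - 201612 + 1      AS possible_values   -- 90
FROM   dates
WHERE  d_number BETWEEN 201612 AND 201701;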

Edge case

Do note that the above queries expose an edge case, where we query a range that spans the end of a year. Compare them to the following queries, where we query between May and June:

EXPLAIN PLAN FOR 
SELECT * FROM dates 
WHERE d_date BETWEEN DATE '2016-05-01' AND DATE '2016-06-01';

SELECT * FROM TABLE(dbms_xplan.display);

EXPLAIN PLAN FOR 
SELECT * FROM dates
WHERE d_string BETWEEN '201605' AND '201606';

SELECT * FROM TABLE(dbms_xplan.display);

EXPLAIN PLAN FOR 
SELECT * FROM dates 
WHERE d_number BETWEEN 201605 AND 201606;

SELECT * FROM TABLE(dbms_xplan.display);

EXPLAIN PLAN FOR 
SELECT * FROM dates 
WHERE d_number_continuous BETWEEN 557 AND 558;

SELECT * FROM TABLE(dbms_xplan.display);

And we would now get index range scans for all predicates. Yay! Hmm… yay? The d_string and d_number encodings would even get better cardinality estimates in this case, which is the other side of the coin, i.e. the other extreme:

--------------------------------------------------------
| Id  | Operation                   | Name     | Rows  |
--------------------------------------------------------
|   0 | SELECT STATEMENT            |          |    92 |
|   1 |  TABLE ACCESS BY INDEX ROWID| DATES    |    92 |
|*  2 |   INDEX RANGE SCAN          | IDX_DATE |    92 |
--------------------------------------------------------

----------------------------------------------------------
| Id  | Operation                   | Name       | Rows  |
----------------------------------------------------------
|   0 | SELECT STATEMENT            |            |    65 |
|   1 |  TABLE ACCESS BY INDEX ROWID| DATES      |    65 |
|*  2 |   INDEX RANGE SCAN          | IDX_STRING |    65 |
----------------------------------------------------------

----------------------------------------------------------
| Id  | Operation                   | Name       | Rows  |
----------------------------------------------------------
|   0 | SELECT STATEMENT            |            |    65 |
|   1 |  TABLE ACCESS BY INDEX ROWID| DATES      |    65 |
|*  2 |   INDEX RANGE SCAN          | IDX_NUMBER |    65 |
----------------------------------------------------------

---------------------------------------------------------------------
| Id  | Operation                   | Name                  | Rows  |
---------------------------------------------------------------------
|   0 | SELECT STATEMENT            |                       |    91 |
|   1 |  TABLE ACCESS BY INDEX ROWID| DATES                 |    91 |
|*  2 |   INDEX RANGE SCAN          | IDX_NUMBER_CONTINUOUS |    91 |
---------------------------------------------------------------------

However! Don’t be fooled into thinking that the accidentally better cardinality estimate is a good thing. It just shows that the estimate’s error fluctuates a lot more for d_string and d_number. Let’s query May – August, for instance, i.e. twice as much data (123 rows):

EXPLAIN PLAN FOR 
SELECT * FROM dates 
WHERE d_date BETWEEN DATE '2016-05-01' AND DATE '2016-08-01';

SELECT * FROM TABLE(dbms_xplan.display);

EXPLAIN PLAN FOR 
SELECT * FROM dates
WHERE d_string BETWEEN '201605' AND '201608';

SELECT * FROM TABLE(dbms_xplan.display);

EXPLAIN PLAN FOR 
SELECT * FROM dates 
WHERE d_number BETWEEN 201605 AND 201608;

SELECT * FROM TABLE(dbms_xplan.display);

EXPLAIN PLAN FOR 
SELECT * FROM dates 
WHERE d_number_continuous BETWEEN 557 AND 560;

SELECT * FROM TABLE(dbms_xplan.display);

Here’s what we’re getting:

-------------------------------------------
| Id  | Operation         | Name  | Rows  |
-------------------------------------------
|   0 | SELECT STATEMENT  |       |   153 |
|*  1 |  TABLE ACCESS FULL| DATES |   153 |
-------------------------------------------

----------------------------------------------------------
| Id  | Operation                   | Name       | Rows  |
----------------------------------------------------------
|   0 | SELECT STATEMENT            |            |    72 |
|   1 |  TABLE ACCESS BY INDEX ROWID| DATES      |    72 |
|*  2 |   INDEX RANGE SCAN          | IDX_STRING |    72 |
----------------------------------------------------------

----------------------------------------------------------
| Id  | Operation                   | Name       | Rows  |
----------------------------------------------------------
|   0 | SELECT STATEMENT            |            |    72 |
|   1 |  TABLE ACCESS BY INDEX ROWID| DATES      |    72 |
|*  2 |   INDEX RANGE SCAN          | IDX_NUMBER |    72 |
----------------------------------------------------------

-------------------------------------------
| Id  | Operation         | Name  | Rows  |
-------------------------------------------
|   0 | SELECT STATEMENT  |       |   152 |
|*  1 |  TABLE ACCESS FULL| DATES |   152 |
-------------------------------------------

Now we’re getting the inverse situation: our cardinality estimates are way too low for the d_string and d_number queries, which results in an index range scan that was estimated to be rather cheap, when it is in fact much more expensive. The full table scan is what we should be doing in this case, as it will ultimately perform faster for this query.

For this example, we can conclude that sometimes we’re lucky (May – June), but quite often we’re very unlucky (December – January, or May – August), and the optimiser will choose the wrong plan, despite our statistics being up to date.
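
If you want to see how far off an estimate really is, you can compare estimated and actual row counts in a single plan. This is standard Oracle tooling rather than part of the original example; the GATHER_PLAN_STATISTICS hint collects the actual values for just this one statement:

SELECT /*+ GATHER_PLAN_STATISTICS */ *
FROM   dates
WHERE  d_string BETWEEN '201605' AND '201608';

SELECT *
FROM   TABLE(dbms_xplan.display_cursor(NULL, NULL, 'ALLSTATS LAST'));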

Conclusion

These things matter! When we design our data in a way that makes it hard for a database to gather accurate statistics (without domain knowledge), we don’t help the optimiser as much as we could. Some domains are very hard to encode in a way that keeps statistics accurate. Other domains (especially temporal ones) are extremely easy to encode “properly”. As a general rule of thumb, remember that:

  • Actual DATE or TIMESTAMP types are great
  • Discrete numeric encodings that count from some epoch (e.g. the unix timestamp) are good as well, although date time arithmetic may become a bit harder
  • String representations are bad, because their values are not uniformly distributed across the type’s domain, even though the optimiser may assume they are. That’s a very high price to pay for a bit of readability

The conclusion is:

Sometimes, you should help the machine do its job a bit more than you help the human reading the data. The latter can still be done in the UI.

Intrigued? Want to learn more? Book our SQL Masterclass!

Logon Triggers: The Oracle Database Magic Bullet

Imagine you want to collect detailed usage statistics to tune your Oracle database, e.g. if you want to have A-Rows and A-Time values in your execution plans. By default, Oracle only reports E-Rows and E-Time, with “E” for “Estimated”, but usually you will care more about the “A” for “Actual”. All you have to do is connect as a sysdba user and issue the following statement:

C:\> sqlplus "/ as sysdba"

Connected to:
Oracle Database 11g Express Edition Release 11.2.0.2.0

SQL> alter system set statistics_level = all;

System altered.

But of course, you hardly ever have the required privileges to connect as sysdba or to issue ALTER SYSTEM statements. So how can we get those advanced statistics? Easy: by using a logon trigger:

CREATE OR REPLACE TRIGGER logon_actions
AFTER LOGON
ON DATABASE
ENABLE
BEGIN
    EXECUTE IMMEDIATE 
    'ALTER SESSION SET STATISTICS_LEVEL = all';
END;
/

This will set your session’s statistics level to all every time you log on to the database. Of course, you will need the privilege to create such a system trigger, but maybe your DBA will allow the trigger body to consult a package that provides a debug level for things like these:

DECLARE
    v_loglevel VARCHAR2(100);
BEGIN
    v_loglevel := logger_package.loglevel;

    IF v_loglevel = 'DEBUG' THEN
        EXECUTE IMMEDIATE 
        'ALTER SESSION SET STATISTICS_LEVEL = all';
    END IF;
END;

To be sure that the statistics_level has been set correctly, run this query:

SELECT SID, NAME, VALUE
FROM   V$SES_OPTIMIZER_ENV
WHERE  NAME = 'statistics_level'
AND SID = (
    SELECT SID
    FROM V$MYSTAT
    WHERE ROWNUM = 1
);

To learn about how to get A-Rows and A-Time values in your execution plan, read this article here.
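
In short: once statistics_level = all is active for your session (e.g. via the logon trigger above), dbms_xplan can show those actual values for the most recently executed statement. A minimal sketch:

-- Run your query of interest first, in the same session, then:
SELECT *
FROM   TABLE(dbms_xplan.display_cursor(
         NULL,               -- SQL_ID: NULL means the last statement of this session
         NULL,               -- child cursor number
         'ALLSTATS LAST'));  -- include A-Rows / A-Time of the last execution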

Happy statistics collecting!

Silly Metrics: The Most Used Java Keywords

Tell me…

  • Haven’t you ever wondered how many times you actually “synchronized” something?
  • Didn’t you worry about not using the “do {} while ()” loop structure often enough?
  • Are you an expert in applying “volatile”?
  • Do you “catch” more often than you “try”?
  • Is your program rather “true” or rather “false”?
  • And how did that “goto” make it into your source code??

Here’s a bit of a distraction among all the other, rather informative posts I’ve written recently: an utterly useless ranking of the top Java keywords in jOOQ. After all, useful metrics can already be reviewed at ohloh, or collected with FindBugs and JArchitect.

So now you can finally find out. Here’s the ranking!

Keyword      Count
public       8127
return       6801
final        6608
import       5938
static       3903
new          3110
extends      2111
int          1822
throws       1756
void         1707
if           1661
this         1464
private      1347
class        1239
case         841
else         839
package      711
boolean      506
throw        495
for          421
long         404
true         384
byte         345
interface    337
false        332
protected    293
super        265
break        200
try          149
switch       146
implements   139
catch        127
default      112
instanceof   107
char         96
short        91
abstract     54
double       43
transient    42
finally      34
float        34
enum         25
while        23
continue     12
synchronized 8
volatile     6
do           1

Curious about your own Java keyword ranking? I’ve published the script to calculate these values on GitHub, under the ASL 2.0 license. Check out the sources here:

https://github.com/lukaseder/silly-metrics

Use it, and publish your own rankings! And feel free to provide pull requests to count keywords from other languages, or to calculate entirely different silly and useless metrics.

Get Hidden Feature Requests From Your Users

In general, I’m not a marketing guy; I prefer to develop code. But when I look at the modern marketing tools that we developers have created for our marketing friends, I get a bit jealous. Take this blog, for instance: it’s the perfect jOOQ marketing tool. Check out my 2013 visitor statistics:

[Figure: jOOQ blog 2013 visitor statistics]

Quite obviously, my February posts got a lot more attention than my January posts, even when taking into account the average visitor who visits old posts and generates “background noise”. The reason is simple:

  • In February, I blogged mostly about controversial Java topics. Before, the topics were more technical and SQL-related (and thus objective and boring). Java blogs get more traction than SQL blogs.
  • My two blogging partners DZone and JCG tend to reblog Java posts more than SQL posts

Aha. So I should write more Java blog posts in order to get even more traffic. But I don’t just want traffic, I want “relevant” traffic.

How to generate “relevant” traffic

“Relevant” traffic for jOOQ is traffic that generates “conversions”, e.g. people who may not have known jOOQ before hitting this blog, then check out jOOQ and download it because they read my articles. That’s an obvious case of “relevant” traffic.

But I also want to get people onto my blog / website because they are already (or not yet) using jOOQ and are looking for something specific, like a jOOQ feature that they’re missing. Only few people will actually take the time to write up a nice, understandable e-mail with a well-explained use case and issue a feature request on the jOOQ user group. They’re more likely to give up after a couple of Google searches. Hence, it is important to take notice of those searches and to make sure they end up on my blog / website, not on some arbitrary Google search result or on Stack Overflow (which is good, but which I cannot analyse). Here are some interesting search strings that have led to “hidden” feature requests or blog posts in the past:

Search strings leading to blog posts

These search strings indicated that there are grounds for generating more relevant traffic on the blog:

  • jooq vs linq: jOOQ is a good answer to Java’s missing LINQ (Language INtegrated Query).
  • jooq vs hibernate: jOOQ is a viable alternative to Hibernate. This is a good opportunity to read about the pros / cons of each approach.
  • scala slick: Slick (the Scala Language-Integrated Connection Kit) is another “LINQ-esque” API.
  • jooq vs mybatis: MyBatis is a different approach at abstracting the data layer in Java.
  • jooq vs jpa: See jOOQ vs Hibernate
  • java fluent sql: All users that are looking for fluent APIs should come here. There is probably no better fluent implementation of SQL in Java than jOOQ.
  • scala sql: Yes, Scala users should use jOOQ to write SQL statements.
  • where are there good java sql open source projects: Here! ;-)
  • jooq migration: Some people may expect jOOQ to add a database migration module. There are good tools for that already, e.g.: Liquibase or Flyway.
  • jooq caching: Some day, I will get this bad idea out of people’s head. The data access layer shouldn’t implement caching!
  • jooq google app engine: Yes, that should work, too. Although I don’t have a running instance anymore.
  • bad software: I don’t know how they got here :-)
  • jooq clojure: I wonder what that would look like. Anything like SQLKorma?
  • jooq alternative: Why?
  • jooq enterprise sql: You can donate here, to feel a bit more enterprisey!
  • how to create a java internal dsl from a bnf: I should sell this idea, or create a non-free tool from it!
  • martin fowler jooq: Yes, I’ve asked him too. Neither he himself, nor his company ThoughtWorks responded to my inquiries so far. I should really get on their tech radar…

The above examples show that I should probably write a post about the main disadvantages of LINQ (heavily reduced expressivity) compared to using SQL directly (a very expressive, feature-rich language). In a way, LINQ and JPQL both attempt to cripple / standardise SQL by removing most of SQL’s features. Unsurprisingly, one of the most popular articles on this blog is this one:

https://blog.jooq.org/2012/07/30/when-will-we-have-linq-in-java

Search strings leading to “hidden” feature requests

These search strings indicated that users may be interested in a particular feature:

  • jooq ddl: It’s about time that jOOQ also supports DDL statements.
  • jooq cte: Common Table Expressions (CTEs) are an important feature that jOOQ is still missing today. jOOQ’s current fluent API technique doesn’t yet take full advantage of the expressivity of CTEs.
  • postgres insert returning: Postgres’ INSERT/UPDATE .. RETURNING syntax is one of the nice features in jOOQ.
  • jooq transaction: jOOQ currently doesn’t handle transactions. jOOQ can run, but jOOQ can’t hide from this feature (delegating it to other APIs).
  • relational division: Beautiful feature. Implemented in jOOQ!
  • bnf sql: Yes, jOOQ’s API has an underlying (informal) BNF notation. Time to formalise it!
  • jooq cursor: jOOQ can read cursors from stored procedures, which is a pain to do with JDBC.
  • derby trunc: jOOQ simulates functions unavailable in some dialects.
  • jooq logging: jOOQ allows for easy logging of executed SQL statements.
  • jooq meta: jOOQ has an API to query database meta data, which is more intuitive than JDBC’s
  • java multiple cursors: Yes, fetching several result sets is much easier with jOOQ than with JDBC.
  • hibernate olap: No, unfortunately, Hibernate can only handle simple OLTP. For OLAP, you’ve come to the right place.
  • simulate skip locked h2: Beautiful! Challenge accepted.
  • connect by: Oracle and CUBRID support this nice and concise way of expressing recursive SQL.
  • group by rollup: A nice OLAP feature.
  • load mysql script java .sql: jOOQ’s Loader API gets you started quickly, when batch loading data into your database.
  • derby bitand: Ah, yes. Bitwise operations are poorly supported in most SQL dialects. jOOQ simulates them for all databases.
  • jooq blob: Most JDBC drivers don’t really care about the distinction between Blob / byte[], or Clob / String. But things might get nasty for very large Blobs
  • production schema: Yes, jOOQ’s distinction between development schema and production schema has proven very useful to many users.
  • jooq batch insert: Batch queries. A pain to do with JDBC. So simple, with jOOQ
  • jooq customtable: I get a feeling that this “dynamic SQL” thing is still not 100% well implemented in jOOQ.
  • jooq informix support: Yes, one day!
  • for update sql server: jOOQ can simulate pessimistic locking through the JDBC API!

The above examples show that a library that is “true to SQL”, in the sense that it does not try to hide SQL’s syntax complexity, hits the spot with many users.

Conclusion

These search strings help add even more value to jOOQ. They are the “hidden feature requests” of jOOQ’s users. Tool vendors: use this knowledge and blog about your experience, products, thoughts, and ideas. Give away free information that is interesting to a broad audience. It will pay back when you analyse your incoming traffic and the Google search strings that people used to find your relevant blog (granted that your blog is somewhat relevant)!