Sure, you get a little convenience when writing the table. But from now on, you’re stuck with weird, system generated names both for the constraint and for the backing index. For instance, when doing execution plan analyses:
EXPLAIN PLAN FOR
SELECT *
FROM order1
JOIN order2 USING (order_id)
WHERE order_id = 1;
SELECT * FROM TABLE (dbms_xplan.display);
The simplified execution plan (output of the above queries) is this:
-------------------------------------------
| Id | Operation | Name |
-------------------------------------------
| 0 | SELECT STATEMENT | |
| 1 | NESTED LOOPS | |
|* 2 | INDEX UNIQUE SCAN| SYS_C0042007 |
|* 3 | INDEX UNIQUE SCAN| SYS_C0042005 |
-------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access("ORDER2"."ORDER_ID"=1)
3 - access("ORDER1"."ORDER_ID"=1)
So, I got these system generated index names called SYS_C0042007 and SYS_C0042005. What do they mean? I can derive the actual meaning perhaps from the predicate information, as SYS_C0042007 is accessed in operation #2, which uses an access predicate on ORDER2. Fine. But do I really need to look these things up all the time?
Just name your constraints. Always!
Don’t be fooled into this convenience. It’ll hurt you time and again, not just when doing analyses. You might not be able to easily import / export your schema to some other database, because another database might already occupy these generated names.
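For comparison, this is all it takes to get readable names. A minimal sketch of the order1 / order2 tables used above (the column lists are assumptions; only order_id appears in the article):

CREATE TABLE order1 (
  order_id NUMBER(10) NOT NULL,

  -- The explicit constraint name is also used for the backing unique index
  CONSTRAINT pk_order1 PRIMARY KEY (order_id)
);

CREATE TABLE order2 (
  order_id NUMBER(10) NOT NULL,
  CONSTRAINT pk_order2 PRIMARY KEY (order_id)
);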
Find a naming scheme (any naming scheme), like for instance PK_[table name]. If you’re cleaning up an existing database, this might help:
SET SERVEROUTPUT ON
BEGIN
  FOR stmts IN (
    SELECT
      'ALTER TABLE ' || table_name ||
      ' RENAME CONSTRAINT ' || constraint_name ||
      ' TO PK_' || table_name AS stmt
    FROM user_constraints
    WHERE constraint_name LIKE 'SYS%'
    AND constraint_type = 'P'
  ) LOOP
    dbms_output.put_line(stmts.stmt);
    EXECUTE IMMEDIATE stmts.stmt;
  END LOOP;

  FOR stmts IN (
    SELECT
      'ALTER INDEX ' || index_name ||
      ' RENAME TO PK_' || table_name AS stmt
    FROM user_constraints
    WHERE index_name LIKE 'SYS%'
    AND constraint_type = 'P'
  ) LOOP
    dbms_output.put_line(stmts.stmt);
    EXECUTE IMMEDIATE stmts.stmt;
  END LOOP;
END;
/
The above yields (and runs)
ALTER TABLE ORDER1 RENAME CONSTRAINT SYS_C0042005 TO PK_ORDER1
ALTER TABLE ORDER2 RENAME CONSTRAINT SYS_C0042007 TO PK_ORDER2
ALTER INDEX SYS_C0042005 RENAME TO PK_ORDER1
ALTER INDEX SYS_C0042007 RENAME TO PK_ORDER2
You can of course repeat the exercise for unique constraints, etc. I omit the example here because the naming scheme might be a bit more complicated there. Now re-calculate the execution plan and check this out:
EXPLAIN PLAN FOR
SELECT *
FROM order1
JOIN order2 USING (order_id)
WHERE order_id = 1;
SELECT * FROM TABLE (dbms_xplan.display);
The simplified execution plan (output of the above queries) is this:
----------------------------------------
| Id | Operation | Name |
----------------------------------------
| 0 | SELECT STATEMENT | |
| 1 | NESTED LOOPS | |
|* 2 | INDEX UNIQUE SCAN| PK_ORDER2 |
|* 3 | INDEX UNIQUE SCAN| PK_ORDER1 |
----------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access("ORDER2"."ORDER_ID"=1)
3 - access("ORDER1"."ORDER_ID"=1)
A lot of people use SQL constraints mainly to enforce data integrity, and that’s already a very good thing. A UNIQUE constraint, for instance, makes sure that there is at most one instance of any possible value (or tuple, in the case of a composite constraint) in a table. For instance:
CREATE TABLE x (
a NUMBER(10),
UNIQUE (a)
);
-- This works:
INSERT INTO x VALUES (1);
-- This fails:
INSERT INTO x VALUES (1);
Constraints are also good for (SELECT) performance
One thing that people often do not think about, though, is the fact that constraints can also be used as very valuable meta information by the database optimiser when finding a good execution plan. It makes a big difference for an optimiser…
To be unaware of how many different values there are in any given column (worst case)
To be able to estimate the different values in any given column (statistics are present)
To know that each value can appear at most once
In the last case, a UNIQUE (or PRIMARY KEY) constraint tells the optimiser exactly that. Once the optimiser knows that a certain operation returns at most one row (and not 10, 100, or even 10M rows), a nested loop suddenly becomes extremely cheap, as it can abort as soon as one row was found.
A Java analogy
Compare this with the following Java code:
// This is much "harder" ...
List<Object> objects = unknownSize();
for (Object object : objects) {
doSomethingWith(object);
}
// ... than this, where we "know" the list
// only contains one value
for (Object object : Collections.singletonList("abc")) {
doSomethingWith(object);
}
If Java had such optimisation capabilities (bonus question: can the JIT do it?), then the second loop could be optimised as such:
// The loop can be avoided:
doSomethingWith("abc");
Let’s look at Oracle
Let’s look at a concrete example with Oracle:
CREATE TABLE x1 (
a NUMBER(10) NOT NULL,
data VARCHAR2(100),
CONSTRAINT pk_y1 PRIMARY KEY (a)
);
CREATE TABLE y1 (
a NUMBER(10) NOT NULL,
optional_data VARCHAR2(100),
CONSTRAINT fk_y1 FOREIGN KEY (a) REFERENCES x1(a)
);
CREATE INDEX i1 ON y1(a);
CREATE INDEX i1data ON x1(data);
INSERT INTO x1
SELECT level, dbms_random.string('a', 100)
FROM dual
CONNECT BY level <= 10000;
INSERT INTO y1
SELECT level, dbms_random.string('a', 100)
FROM dual
CONNECT BY level <= 5000;
This is a typical one-to-many relationship between x1 and y1. With the constraints in place, there can be between 0 and N rows in y1 for each row in x1. As a user, we know that there is only one value in y1 for any value in x1, but we don’t enforce this knowledge with a constraint.
Let’s look at the following query:
SELECT count(*)
FROM x1
JOIN y1 USING (a)
WHERE data LIKE 'a%';
Here, we want all values in x1 whose data starts with the letter ‘a’, and for which we also have any optional_data. The execution plan is
-----------------------------------------------------
| Operation | Name | Rows | Cost |
-----------------------------------------------------
| SELECT STATEMENT | | | 29 |
| SORT AGGREGATE | | 1 | |
| HASH JOIN | | 176 | 29 |
| VIEW | | 176 | 24 |
| HASH JOIN | | | |
| INDEX RANGE SCAN | I1DATA | 176 | 4 |
| INDEX FAST FULL SCAN| PK_Y1 | 176 | 24 |
| INDEX FAST FULL SCAN | I1 | 5000 | 5 |
-----------------------------------------------------
As you can see, Oracle chooses to run a hash join operation, which means that all the values from x1 starting with ‘a’ are fetched (around 176), and joined in a hashmap with the entire set of values in y1, fetched from the index i1 (5000 values).
How does this compare with using a UNIQUE constraint?
We’ll create almost the exact same schema as follows:
CREATE TABLE x2 (
a NUMBER(10) NOT NULL,
data VARCHAR2(100),
CONSTRAINT pk_x2 PRIMARY KEY (a)
);
CREATE TABLE y2 (
a NUMBER(10) NOT NULL,
optional_data VARCHAR2(100),
CONSTRAINT uk_y2 UNIQUE (a),
CONSTRAINT fk_y2 FOREIGN KEY (a) REFERENCES x2(a)
);
CREATE INDEX i2data ON x2(data);
INSERT INTO x2
SELECT * FROM x1;
INSERT INTO y2
SELECT * FROM y1;
BEGIN
dbms_stats.gather_table_stats('TEST', 'X2');
dbms_stats.gather_table_stats('TEST', 'Y2');
END;
/
The data is exactly the same, but now we enforce a UNIQUE constraint on y2’s foreign key, making this effectively a one-to-one relationship. Check out what happens when we run the exact same query…
SELECT count(*)
FROM x2
JOIN y2 USING (a)
WHERE data LIKE 'a%';
This time, the overall cost has decreased from 29 to 25: we’re no longer running a hash join but a nested loop join, which is probably faster (provided our statistics are not way off), as we only have to look up the single value in y2 corresponding to each of x2’s estimated 176 rows starting with the letter ‘a’. Let’s measure the difference with a simple benchmark:
SET SERVEROUTPUT ON
DECLARE
  v_ts TIMESTAMP;
  v_repeat CONSTANT NUMBER := 1000;
BEGIN
  v_ts := SYSTIMESTAMP;

  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT count(*)
      FROM x1
      JOIN y1 USING (a)
      WHERE data LIKE 'a%'
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;

  dbms_output.put_line('Without UNIQUE constraint: '
    || (SYSTIMESTAMP - v_ts));

  v_ts := SYSTIMESTAMP;

  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT count(*)
      FROM x2
      JOIN y2 USING (a)
      WHERE data LIKE 'a%'
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;

  dbms_output.put_line('With UNIQUE constraint: '
    || (SYSTIMESTAMP - v_ts));
END;
/
The above benchmark repeats each individual query 1000 times. The results speak for themselves:
Without UNIQUE constraint: +000000000 00:00:04.250000000
With UNIQUE constraint: +000000000 00:00:02.847000000
Remark
The lack of a UNIQUE constraint may happen in situations where you prefer using a surrogate primary key in the referencing table (which I omitted in the examples for brevity). If you’re “sharing” the primary key or use natural keys, this issue won’t happen to you, of course.
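If you do use a surrogate primary key in the referencing table, the fix is simply to declare the foreign key column UNIQUE as well. A minimal sketch (this y3 variant and its surrogate id column are assumptions, not part of the article’s benchmark):

CREATE TABLE y3 (
  id NUMBER(10) NOT NULL,        -- surrogate primary key
  a NUMBER(10) NOT NULL,         -- foreign key to x2
  optional_data VARCHAR2(100),
  CONSTRAINT pk_y3 PRIMARY KEY (id),
  CONSTRAINT uk_y3 UNIQUE (a),   -- tells the optimiser: at most one row per x2 row
  CONSTRAINT fk_y3 FOREIGN KEY (a) REFERENCES x2(a)
);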
Conclusion
The execution planner can make a more informed decision if it has formal knowledge about your data via an additional UNIQUE constraint. This formal knowledge is by far more powerful than any statistics that might indicate the same thing. In the absence of a formal UNIQUE constraint, the database will always have to make sure there is not another row once it has found one. With a formal UNIQUE constraint, it can stop looking as soon as that unique row was found. This can drastically speed up queries. As we’ve seen in the above example, this improves things by a factor of 1.5, so the second query is 50% faster!
Always tell your database as much as you can. Your SELECT performance will greatly increase, at the small cost of a little overhead when inserting data.
When writing DDL in SQL, you can specify a couple of constraints on columns, like NOT NULL or DEFAULT constraints. Some people might wonder whether the two are actually redundant: is it still necessary to specify a NOT NULL constraint if there is already a DEFAULT clause?
The answer is: Yes!
Yes, you should still specify that NOT NULL constraint. And no, the two constraints are not redundant. The answer I gave here on Stack Overflow wraps it up by example, which I’m going to repeat here on our blog:
DEFAULT is the value that will be inserted in the absence of an explicit value in an INSERT / UPDATE statement. Let’s assume your DDL did not have the NOT NULL constraint:
ALTER TABLE tbl
ADD COLUMN col VARCHAR(20)
DEFAULT 'MyDefault';
Then you could issue these statements
-- 1. This will insert "MyDefault"
-- into tbl.col
INSERT INTO tbl (A, B)
VALUES (NULL, NULL);
-- 2. This will insert "MyDefault"
-- into tbl.col
INSERT INTO tbl (A, B, col)
VALUES (NULL, NULL, DEFAULT);
-- 3. This will insert "MyDefault"
-- into tbl.col
INSERT INTO tbl (A, B, col)
DEFAULT VALUES;
-- 4. This will insert NULL
-- into tbl.col
INSERT INTO tbl (A, B, col)
VALUES (NULL, NULL, NULL);
Alternatively, you can also use DEFAULT in UPDATE statements, according to the SQL-1992 standard:
-- 5. This will update "MyDefault"
-- into tbl.col
UPDATE tbl SET col = DEFAULT;
-- 6. This will update NULL
-- into tbl.col
UPDATE tbl SET col = NULL;
Note that not all databases support all of these SQL standard syntaxes. Adding the NOT NULL constraint will cause an error for statements 4 and 6, while 1-3 and 5 remain valid. So, to answer the question:
No, NOT NULL and DEFAULT are not redundant
That’s already quite interesting: the DEFAULT clause really only interacts with DML statements and how they specify the various columns that they’re writing to. The NOT NULL constraint is a much more universal guarantee that constrains a column’s content also “outside” of the manipulating DML statements.
For instance, if you have a set of data and then you add a DEFAULT constraint, this will not affect your existing data, only new data being inserted.
If, however, you have a set of data and then you add a NOT NULL constraint, you can actually only do so if the constraint is valid – i.e. when there are no NULL values in your column. Otherwise, an error will be raised.
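For instance, reusing the tbl / col names from above (PostgreSQL-style syntax; Oracle would use ALTER TABLE tbl MODIFY col NOT NULL instead):

-- Fails if col already contains NULL values:
ALTER TABLE tbl ALTER COLUMN col SET NOT NULL;

-- Typical workaround: clean up the data first, then add the constraint
UPDATE tbl SET col = 'MyDefault' WHERE col IS NULL;
ALTER TABLE tbl ALTER COLUMN col SET NOT NULL;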
Query performance
Another very interesting use case that applies only to NOT NULL constraints is their usefulness for query optimisers and query execution plans. Assume that you have such a constraint on your column and then, you’re using a NOT IN predicate:
SELECT *
FROM table
WHERE value NOT IN (
SELECT not_nullable
FROM other_table
)
In particular, when you’re using Oracle, the above query will be much faster when the not_nullable column has an index AND that particular constraint, because unfortunately, NULL values are not included in Oracle indexes.
CHECK constraints are already pretty great when you want to sanitize your data. But there are some limitations to CHECK constraints, including the fact that they apply to the table as a whole, whereas sometimes you want to specify constraints that only apply in certain situations.
This can be done with the SQL standard WITH CHECK OPTION clause, which is implemented by at least Oracle and SQL Server. Here’s how to do that:
CREATE TABLE books (
id NUMBER(10) NOT NULL,
title VARCHAR2(100 CHAR) NOT NULL,
price NUMBER(10, 2) NOT NULL,
CONSTRAINT pk_book PRIMARY KEY (id)
);
/
CREATE VIEW expensive_books
AS
SELECT id, title, price
FROM books
WHERE price > 100
WITH CHECK OPTION;
/
INSERT INTO books
VALUES (1, '1984', 35.90);
INSERT INTO books
VALUES (
2,
'The Answer to Life, the Universe, and Everything',
999.90
);
As you can see, expensive_books are all those books whose price is more than 100.00. This view will only report the second book:
SELECT * FROM expensive_books;
The above query yields:
ID TITLE PRICE
-- ----------------------------------------- -------
2 The Answer to Life, the Universe, and ... 999.9
But now that we have that CHECK OPTION, we can also prevent users from inserting “expensive books” that aren’t really expensive. For instance, let’s run this query:
INSERT INTO expensive_books
VALUES (3, '10 Reasons why jOOQ is Awesome', 9.99);
This query won’t work now. We’re getting:
ORA-01402: view WITH CHECK OPTION where-clause violation
We also cannot update any of the “expensive books” to be non-expensive:
UPDATE expensive_books
SET price = 9.99;
This query results in the same ORA-01402 error message.
Inline WITH CHECK OPTION
In case you need to locally prevent bogus data from being inserted into a table, you can also use inline WITH CHECK OPTION clauses like so:
INSERT INTO (
SELECT *
FROM expensive_books
WHERE price > 1000
WITH CHECK OPTION
) really_expensive_books
VALUES (3, 'Modern Enterprise Software', 999.99);
And the above query again results in an ORA-01402 error.
Using SQL transformation to generate ad-hoc constraints
While the WITH CHECK OPTION is very useful for stored views, which can have proper grants for those users that may not access the underlying table directly, the inline WITH CHECK OPTION is mainly useful when you transform dynamic SQL in an intermediate SQL transformation layer in your application.
This can be done with jOOQ’s SQL transformation capabilities, for instance, where you can watch out for a certain table in your SQL statements, and then centrally prevent bogus DML from being executed. This is a great way to implement multi-tenancy, if your database doesn’t natively support row-level security.
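To give a rough idea of the target shape (this is not jOOQ’s transformation API itself, and it assumes a hypothetical tenant_id column on books plus an application context my_app_ctx that your application sets per session): a tenant-restricted view with WITH CHECK OPTION already gives you both read filtering and write protection:

CREATE VIEW my_books AS
SELECT id, title, price, tenant_id
FROM books
WHERE tenant_id = SYS_CONTEXT('my_app_ctx', 'tenant_id')
WITH CHECK OPTION;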
Stay tuned for a future blog post explaining how to transform your SQL with jOOQ to implement row-level security for any database.
(Sorry for that click-bait heading. Couldn’t resist ;-) )
We’re on a mission. To teach you SQL. But mostly, we want to teach you how to appreciate SQL. You’ll love it!
Getting SQL right or wrong shouldn’t be about that You’re-Doing-It-Wrong™ attitude that can be encountered often when evangelists promote their object of evangelism. Getting SQL right should be about the fun you’ll have once you do get it right. The things you start appreciating when you notice that you can easily replace 2000 lines of slow, hard-to-maintain, and ugly imperative (or object-oriented) code with 300 lines of lean functional code (e.g. using Java 8), or even better, with 50 lines of SQL.
We’re glad to see that our blogging friends have started appreciating SQL, and most specifically, window functions, after reading our posts.
Yet Another 10 Common Mistakes Java Developers Make When Writing SQL
And of course, this doesn’t apply to Java developers alone, but it’s written from the perspective of a Java (and SQL) developer. So here we go (again):
1. Not Using Window Functions
After all that we’ve been preaching, this must be our number 1 mistake in this series. Window functions are probably the coolest SQL feature of them all. They’re so incredibly useful, they should be the number one reason for anyone to switch to a better database, e.g. PostgreSQL:
Mind bending talk by @lukaseder about @JavaOOQ at tonight's @jsugtu. My new resolution: Install PostgreSQL and study SQL standard at once.
If free and/or Open Source is important to you, you have absolutely no better choice than using PostgreSQL (and you’ll even get to use the free jOOQ Open Source Edition, if you’re a Java developer).
And if you’re lucky enough to work in an environment with Oracle or SQL Server (or DB2, Sybase) licenses, you get even more out of your new favourite tool.
We won’t repeat all the window function goodness in this section; we’ve blogged about window functions often enough.
Start playing with window functions. You’ll never go back, guaranteed.
2. Not declaring NOT NULL constraints
This one was already part of a previous list where we claimed that you should add as much metadata as possible to your schema, because your database will be able to leverage that metadata for optimisations. For instance, if your database knows that a foreign key value in BOOK.AUTHOR_ID must also be contained exactly once in AUTHOR.ID, then a whole set of optimisations can be achieved in complex queries.
Now let’s have another look at NOT NULL constraints. If you’re using Oracle, NULL values will not be part of your index. This doesn’t matter if you’re expressing an IN constraint, for instance:
SELECT * FROM table
WHERE value IN (
SELECT nullable_column FROM ...
)
But what happens with a NOT IN constraint?
SELECT * FROM table
WHERE value NOT IN (
SELECT nullable_column FROM ...
)
Due to SQL’s slightly unintuitive way of handling NULL, there is a slight risk of the second query unexpectedly not returning any results at all, namely if there is at least one NULL value as a result from the subquery. This is true for all databases that get SQL right.
But because the index on nullable_column doesn’t contain any NULL values, Oracle has to look up the complete content in the table, resulting in a FULL TABLE SCAN. Now that is unexpected! Details about this can be seen in this article.
The Cure:
Carefully review all your nullable, yet indexed columns, and check if you really cannot add a NOT NULL constraint to those columns.
The Tool:
If you’re using Oracle, use this query to detect all nullable, yet indexed columns:
SELECT
i.table_name,
i.index_name,
LISTAGG(
LPAD(i.column_position, 2) || ': ' ||
RPAD(i.column_name , 30) || ' ' ||
DECODE(t.nullable, 'Y', '(NULL)', '(NOT NULL)'),
', '
) WITHIN GROUP (ORDER BY i.column_position)
AS "NULLABLE columns in indexes"
FROM user_ind_columns i
JOIN user_tab_cols t
ON (t.table_name, t.column_name) =
((i.table_name, i.column_name))
WHERE EXISTS (
SELECT 1
FROM user_tab_cols t
WHERE (t.table_name, t.column_name, t.nullable) =
((i.table_name, i.column_name, 'Y' ))
)
GROUP BY i.table_name, i.index_name
ORDER BY i.index_name ASC;
Example output:
TABLE_NAME | INDEX_NAME | NULLABLE columns in indexes
-----------+--------------+----------------------------
PERSON | I_PERSON_DOB | 1: DATE_OF_BIRTH (NULL)
And then, fix it!
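In Oracle, the fix could be as simple as this (possible only once the column really contains no NULL values, of course):

-- Using the PERSON / DATE_OF_BIRTH names from the example output above
ALTER TABLE person MODIFY date_of_birth NOT NULL;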
If you’re curious about more details, see also our related posts on this topic.
3. Using PL/SQL package state
Now, this is a boring one if you’re not using Oracle, but if you are (and you’re a Java developer), be very wary of PL/SQL package state. Are you really doing what you think you’re doing?
CREATE OR REPLACE PACKAGE pkg IS
-- Package state here!
n NUMBER := 1;
FUNCTION next_n RETURN NUMBER;
END pkg;
CREATE OR REPLACE PACKAGE BODY pkg IS
FUNCTION next_n RETURN NUMBER
IS
BEGIN
n := n + 1;
RETURN n;
END next_n;
END pkg;
Wonderful, so you’ve created yourself an in-memory counter that generates a new number every time you call pkg.next_n. But who owns that counter? Yes, the session. Each session has its own initialised “package instance”.
But no, it’s probably not the session you might have thought of.
We Java developers connect to databases through connection pools. When we obtain a JDBC Connection from such a pool, we recycle that connection from a previous “session”, e.g. a previous HTTP Request (not HTTP Session!). But that’s not the same. The database session (probably) outlives the HTTP Request and will be inherited by the next request, possibly from an entirely different user. Now, imagine you had a credit card number in that package…?
And no, the cure is not to simply slap a PRAGMA SERIALLY_REUSABLE onto the package:
CREATE OR REPLACE PACKAGE pkg IS
PRAGMA SERIALLY_REUSABLE;
n NUMBER := 1;
FUNCTION next_n RETURN NUMBER;
END pkg;
Because:
You cannot even use that package from SQL, now (see ORA-06534).
Mixing this PRAGMA with regular package state from other packages just makes things a lot more complex.
So, don’t.
Not The Cure:
I know. PL/SQL can be a beast. It often seems like such a quirky language. But face it: many things run much, much faster when written in PL/SQL, so don’t give up just yet. Dropping PL/SQL is not the solution either.
The Cure:
At all costs, try to avoid package state in PL/SQL. Think of package state as the PL/SQL equivalent of static variables in Java. While it might be useful for caches (and constants, of course) every now and then, you might not actually be accessing the state you expect. Think about load balancers suddenly transferring you to another JVM. Think about class loaders that might have loaded the same class twice, for some reason.
Instead, pass state as arguments through procedures and functions. This will avoid side effects and make your code much cleaner and more predictable.
Or, obviously, persist state to some table.
4. Running the same query all the time
Master data is boring. You probably wrote some utility to get the latest version of your master data (e.g. language, locale, translations, tenant, system settings), and you can query it every time, once it is available.
At all costs, don’t do that. You don’t have to cache many things in your application, as modern databases have grown to be extremely fast when it comes to caching:
Table / column content
Index content
Query / materialized view results
Procedure results (if they’re deterministic)
Cursors
Execution plans
So, for your average query, there’s virtually no need for an ORM second-level cache, at least from a performance perspective (ORM caches mainly fulfil other purposes, of course).
But when you query master data, i.e. data that never changes, then, network latency, traffic and many other factors will impair your database experience.
The Cure:
Please do take 10 minutes, download Guava, and use its excellent and easy to set up cache, that ships with various built-in invalidation strategies. Choose time-based invalidation (i.e. polling), choose Oracle AQ or Streams, or PostgreSQL’s NOTIFY for event-based invalidation, or just make your cache permanent, if it doesn’t matter. But don’t issue an identical master data query all the time.
… This obviously brings us to
5. Not knowing about the N+1 problem
You had a choice. At the beginning of your software product, you had to choose between:
SELECT * FROM book
-- And then, for each book:
SELECT * FROM author WHERE id = ?
SELECT * FROM author WHERE id = ?
SELECT * FROM author WHERE id = ?
Of course, you could go and tweak your hundreds of annotations to correctly prefetch or eager fetch each book’s associated author information to produce something along the lines of:
SELECT *
FROM book
JOIN author
ON book.author_id = author.id
But that would be an awful lot of work, and you’ll risk eager-fetching too many things that you didn’t want, resulting in another performance issue.
Maybe, you could upgrade to JPA 2.1 and use the new @NamedEntityGraph to express beautiful annotation trees like this one:
em.createQuery("select p from Post p where p.id=:id",
Post.class)
.setHint("javax.persistence.fetchgraph",
postGraph)
.setParameter("id", this.id)
.getResultList()
.get(0);
Let us all appreciate the above application of JEE standards with all due respect, and then consider…
The Cure:
Just listen to the wise words at the beginning of this article and replace thousands of lines of tedious Java / Annotatiomania™ code with a couple of lines of SQL. Because that will also likely help you prevent another issue that we haven’t even touched on yet, namely selecting too many columns, as we’ve discussed in previous posts.
Since you’re already using an ORM, this might just mean resorting to native SQL – or maybe you manage to express your query with JPQL. Of course, we agree with Alessio Harri in believing that you should use jOOQ together with JPA:
Loved the type safety of @JavaOOQ today. OpenJPA is the workhorse and @JavaOOQ is the artist :) #80/20
While the above will certainly help you work around some real world issues that you may have with your favourite ORM, you could also take it one step further and think about it this way. After all these years of pain and suffering from the object-relational impedance mismatch, the JPA 2.1 expert group is now trying to tweak their way out of this annotation madness by adding more declarative, annotation-based fetch graph hints to JPQL queries, that no one can debug, let alone maintain.
The alternative is simple and straightforward SQL. And with Java 8, we’ll add functional transformation through the Streams API. That’s hard to beat.
But obviously, your views and experiences on that subject may differ from ours, so let’s head on to a more objective discussion about…
6. Not using Common Table Expressions
While common table expressions obviously offer readability improvements, they may also offer performance improvements. Consider the following query that I have recently encountered in a customer’s PL/SQL package (not the actual query):
SELECT round (
(SELECT amount FROM payments WHERE id = :p_id)
*
(
SELECT e.bid
FROM currencies c, exchange_rates e
WHERE c.id =
(SELECT cur_id FROM payments WHERE id = :p_id)
AND e.cur_id =
(SELECT cur_id FROM payments WHERE id = :p_id)
AND e.org_id =
(SELECT org_id FROM payments WHERE id = :p_id)
) / (
SELECT c.factor
FROM currencies c, exchange_rates e
WHERE c.id =
(SELECT cur_id FROM payments WHERE id = :p_id)
AND e.cur_id =
(SELECT cur_id FROM payments WHERE id = :p_id)
AND e.org_id =
(SELECT org_id FROM payments WHERE id = :p_id)
), 0
)
INTO amount
FROM dual;
So what does this do? This essentially converts a payment’s amount from one currency into another. Let’s not delve into the business logic too much, let’s head straight to the technical problem. The above query results in the following execution plan (on Oracle):
------------------------------------------------------
| Operation | Name |
------------------------------------------------------
| SELECT STATEMENT | |
| TABLE ACCESS BY INDEX ROWID | PAYMENTS |
| INDEX UNIQUE SCAN | PAYM_PK |
| NESTED LOOPS | |
| INDEX UNIQUE SCAN | CURR_PK |
| TABLE ACCESS BY INDEX ROWID | PAYMENTS |
| INDEX UNIQUE SCAN | PAYM_PK |
| TABLE ACCESS BY INDEX ROWID | EXCHANGE_RATES |
| INDEX UNIQUE SCAN | EXCH_PK |
| TABLE ACCESS BY INDEX ROWID | PAYMENTS |
| INDEX UNIQUE SCAN | PAYM_PK |
| TABLE ACCESS BY INDEX ROWID | PAYMENTS |
| INDEX UNIQUE SCAN | PAYM_PK |
| NESTED LOOPS | |
| TABLE ACCESS BY INDEX ROWID | CURRENCIES |
| INDEX UNIQUE SCAN | CURR_PK |
| TABLE ACCESS BY INDEX ROWID| PAYMENTS |
| INDEX UNIQUE SCAN | PAYM_PK |
| INDEX UNIQUE SCAN | EXCH_PK |
| TABLE ACCESS BY INDEX ROWID | PAYMENTS |
| INDEX UNIQUE SCAN | PAYM_PK |
| TABLE ACCESS BY INDEX ROWID | PAYMENTS |
| INDEX UNIQUE SCAN | PAYM_PK |
| FAST DUAL | |
------------------------------------------------------
The actual execution time is negligible in this case, but as you can see, the same objects are accessed again and again within the query. This is a violation of Common Mistake #4: Running the same query all the time.
The whole thing would be so much easier to read, to maintain, and for Oracle to execute, if we had used a common table expression. From the original source code, observe the following:
-- We're always accessing a single payment:
FROM payments WHERE id = :p_id
-- Joining currencies and exchange_rates twice:
FROM currencies c, exchange_rates e
So, let’s factor out the payment first:
-- "payment" contains only a single payment
-- But it contains all the columns that we'll need
-- afterwards
WITH payment AS (
SELECT cur_id, org_id, amount
FROM payments
WHERE id = :p_id
)
SELECT round(p.amount * e.bid / c.factor, 0)
-- Then, we simply don't need to repeat the
-- currencies / exchange_rates joins twice
FROM payment p
JOIN currencies c ON p.cur_id = c.id
JOIN exchange_rates e ON e.cur_id = p.cur_id
AND e.org_id = p.org_id
Note that we’ve also replaced table lists with ANSI JOINs, as suggested in our previous list.
You wouldn’t believe it’s the same query, would you? And what about the execution plan? Here it is!
---------------------------------------------------
| Operation | Name |
---------------------------------------------------
| SELECT STATEMENT | |
| NESTED LOOPS | |
| NESTED LOOPS | |
| NESTED LOOPS | |
| FAST DUAL | |
| TABLE ACCESS BY INDEX ROWID| PAYMENTS |
| INDEX UNIQUE SCAN | PAYM_PK |
| TABLE ACCESS BY INDEX ROWID | EXCHANGE_RATES |
| INDEX UNIQUE SCAN | EXCH_PK |
| TABLE ACCESS BY INDEX ROWID | CURRENCIES |
| INDEX UNIQUE SCAN | CURR_PK |
---------------------------------------------------
No doubt that this is much much better.
The Cure:
If you’re lucky enough and you’re using one of those databases that supports window functions, chances are incredibly high (100%) that you also have common table expression support. (Even MySQL 8.0 will finally have them).
Common table expressions are like local variables in SQL. In every large statement, you should consider using them, as soon as you feel that you’ve written something before.
The Takeaway:
Some databases (e.g. PostgreSQL, or SQL Server) also support common table expressions in DML statements, as sketched below.
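For instance, PostgreSQL supports data-modifying CTEs. A minimal sketch, assuming hypothetical log / log_archive tables with identical columns:

-- Move old rows into an archive table in a single statement
WITH moved AS (
  DELETE FROM log
  WHERE created_at < now() - INTERVAL '1 year'
  RETURNING *
)
INSERT INTO log_archive
SELECT * FROM moved;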
7. Not using row value expressions
Few people know that row value expressions can also be used in an UPDATE statement, in most databases. Check out the following query, which I again found in a customer’s PL/SQL package (simplified again, of course):
UPDATE u
SET n = (SELECT n + 1 FROM t WHERE u.n = t.n),
s = (SELECT 'x' || s FROM t WHERE u.n = t.n),
x = 3;
So this query takes a subquery as a data source for updating two columns, and the third column is updated “regularly”. How does it perform? Moderately:
-----------------------------
| Operation | Name |
-----------------------------
| UPDATE STATEMENT | |
| UPDATE | U |
| TABLE ACCESS FULL| U |
| TABLE ACCESS FULL| T |
| TABLE ACCESS FULL| T |
-----------------------------
Let’s ignore the full table scans, as this query is a constructed example; the actual query could leverage indexes. But T is accessed twice, i.e. in both subqueries, and Oracle didn’t seem to be able to apply scalar subquery caching in this case.
To the rescue: row value expressions. Let’s simply rephrase our UPDATE to this:
UPDATE u
SET (n, s) = ((
SELECT n + 1, 'x' || s FROM t WHERE u.n = t.n
)),
x = 3;
Let’s ignore the funny, Oracle-specific double-parentheses syntax for the right hand side of such a row value expression assignment, but let’s appreciate the fact that we can easily assign a new value to the tuple (n, s) in one go! Note that we could also have written this instead, assigning x as well:
UPDATE u
SET (n, s, x) = ((
SELECT n + 1, 'x' || s, 3
FROM t WHERE u.n = t.n
));
As you will have expected, the execution plan has also improved, and T is accessed only once:
-----------------------------
| Operation | Name |
-----------------------------
| UPDATE STATEMENT | |
| UPDATE | U |
| TABLE ACCESS FULL| U |
| TABLE ACCESS FULL| T |
-----------------------------
The Cure:
Use row value expressions wherever you can. They make your SQL code considerably more expressive, and chances are, they make it faster as well.
Note that the above is supported by jOOQ’s UPDATE statement. This is the moment we would like to make you aware of this cheap, in-article advertisement:
;-)
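By the way, row value expressions are not limited to UPDATE .. SET. A couple of hypothetical predicate examples on the t / u tables from above (note that the ordering comparison is standard SQL and works e.g. in PostgreSQL or MySQL, but not in Oracle):

-- Compare several columns at once:
SELECT * FROM t
WHERE (n, s) IN (SELECT n, s FROM u);

-- "Seek method" style keyset pagination comparisons:
SELECT * FROM t
WHERE (n, s) > (1, 'abc')
ORDER BY n, s;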
8. Using MySQL when you could use PostgreSQL
Disclaimer: I wrote this article in 2014. Since then, MySQL has made tremendous progress, so I wouldn’t bet all my money on PostgreSQL any longer, when comparing it with MySQL! So, read this tip with the 2014 context in mind!
To some, this may appear to be a bit of a hipster discussion. But let’s consider the facts:
MySQL claims to be the “most popular Open Source database”.
PostgreSQL claims to be the “most advanced Open Source database”.
Let’s consider a bit of history. MySQL has always been very easy to install, maintain, and it has had a great and active community. This has led to MySQL still being the RDBMS of choice with virtually every web hoster on this planet. Those hosters also host PHP, which was equally easy to install and maintain.
BUT!
We Java developers tend to have an opinion about PHP, right? It’s summarised by this image here:
The PHP Hammer
Well, it works, but how does it work?
The same can be said about MySQL. MySQL has always worked somehow, but while commercial databases like Oracle have made tremendous progress both in terms of query optimisation and feature scope, MySQL has hardly moved in the last decade.
Many people choose MySQL primarily because of its price (USD $ 0.00). But often, the same people have found MySQL to be slow and quickly concluded that SQL is slow per se – without evaluating the options. This is also why all NoSQL stores compare themselves with MySQL, not with Oracle, the database that has been winning the Transaction Processing Performance Council’s (TPC) benchmarks almost forever. Some examples:
While the last article bluntly adds “(and other RDBMS)” it doesn’t go into any sort of detail whatsoever, what those “other RDBMS” do wrong. It really only compares MongoDB with MySQL.
The Cure:
We say: Stop complaining about SQL, when in fact, you’re really complaining about MySQL. There are at least four very popular databases out there that are incredibly good. These are:
Both companies are solving a problem that few people have. They’re selling us niche products as a commodity, making us think that our real commodity databases (the RDBMS) no longer fulfil our needs. They are well-funded and have big marketing teams that throw blunt claims around.
In the mean time, PostgreSQL just got even better, and you, as a reader of this blog / post, are about to bet on the winning team :-)
This article has been quite strongly against MySQL. We don’t mean to talk badly about a database that perfectly fulfils its purpose, as this isn’t a black and white world. Heck, you can get happy with SQLite in some situations. MySQL has its place as the cheap, easy-to-use, easy-to-install commodity database. We just wanted to make you aware of the fact that you’re expressly choosing the cheap, not-so-good database, rather than the cheap, awesome one.
9. Forgetting about UNDO / REDO logs
We have claimed that MERGE statements or bulk / batch updates are good. That’s correct, but nonetheless, you should be wary when updating huge data sets in transactional contexts. If your transaction “takes too long”, i.e. if you’re updating 10 million records at a time, you will run into two problems:
You increase the risk of race conditions if another process is also writing to the same table. This may cause a rollback of their or of your transaction, possibly forcing you to run the huge update all over again.
You put a lot of strain on your system, because every other transaction / session that wants to read the data you’re about to update will first have to temporarily roll back all of your uncommitted changes, in order to reconstruct the state that was on disk before your huge update. That’s the price of ACID.
One way to work around this issue is to allow for other sessions to read uncommitted data.
Another way to work around this issue is to frequently commit your own work, e.g. after 1000 inserts / updates.
In any case, you will have to make a compromise. Frequent commits produce the risk of an inconsistent database in the event of the multi-million row update going wrong after 5 million (committed) records; a “rollback” would then mean reverting all database changes from a backup.
The Cure:
There is no definitive cure to this issue. But beware that you are very very rarely in a situation where it is OK to simply update 10 million records of a live and online table outside of an actual scheduled maintenance window. The simplest acceptable workaround is indeed to commit your work after N inserts / updates.
The Takeaway:
By this time, NoSQL aficionados will claim (again due to excessive marketing by aforementioned companies) that NoSQL has solved this by dropping schemas and typesafety. “Don’t update, just add another property!” – they said.
First off, I can add columns to my database without any issue at all. An ALTER TABLE ADD statement is executed instantly on live databases. Filling the column with data doesn’t bother anyone either, because no one reads the column yet (remember, don’t SELECT * !). So adding columns in RDBMS is as cheap as adding JSON properties to a MongoDB document.
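For instance, in Oracle, adding a nullable column without a default is essentially a metadata-only operation (the book table and column name are just illustrative):

ALTER TABLE book ADD review_count NUMBER(10);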
But what about altering columns? Removing them? Merging them?
It is simply not true that denormalisation gets you very far. Denormalisation is always a short-term win for the developer, and hardly ever a long-term win for the operations teams. Having redundant data in your database for the sake of speeding up an ALTER TABLE statement is like sweeping dirt under the carpet.
Don’t believe the marketers. And while you’re at it, perform some doublethink and forget that we’re SQL tool vendors ourselves ;-) Here’s again the “correct” message:
10. Not using the BOOLEAN type correctly
This is not really a mistake per se. It’s just again something that hardly anyone knows. When the SQL:1999 standard introduced the new BOOLEAN data type, they really did it right. Because before, we already had something like booleans in SQL: the <search condition> of SQL-92, which is essentially a predicate for use in WHERE, ON, and HAVING clauses, as well as in CASE expressions.
SQL:1999, however, simply defined the new <boolean value expression> as a regular <value expression>, and redefined the <search condition> as such:
<search condition> ::=
<boolean value expression>
Done! Now, for most of us Java / Scala / etc. developers, this doesn’t seem like such an innovation. Heck, it’s a boolean. Obviously, it can be used interchangeably as a predicate and as a value.
But in the mind-set of the keyword-heavy SQL folks who have taken inspiration from COBOL when designing the language, this was quite a step forward.
Now, what does this mean? This means that you can use any predicate also as a column! For instance:
SELECT a, b, c
FROM (
SELECT EXISTS (SELECT ...) a,
MY_COL IN (1, 2, 3) b,
3 BETWEEN 4 AND 5 c
FROM MY_TABLE
) t
WHERE a AND b AND NOT(c)
This is a bit of a dummy query, agreed, but are you aware of how powerful this is?
Luckily, again, PostgreSQL fully supports this (unlike Oracle, which still doesn’t have any BOOLEAN data type in SQL).
The Cure:
Every now and then, using BOOLEAN types feels very right, so do it! You can transform boolean value expressions into predicates and predicates into boolean value expressions. They’re the same. This makes SQL all so powerful.
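For instance, in PostgreSQL (hypothetical table and column names), a BOOLEAN column and a predicate are freely interchangeable:

CREATE TABLE orders (
  id        BIGINT PRIMARY KEY,
  amount    NUMERIC(10, 2) NOT NULL,
  is_large  BOOLEAN NOT NULL
);

-- The column is itself a predicate:
SELECT * FROM orders WHERE is_large;

-- And any predicate is a boolean value expression:
SELECT id, amount >= 1000 AS is_large_now
FROM orders;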
Conclusion
SQL has evolved steadily over the past years through great standards like SQL:1999, SQL:2003, SQL:2008 and now SQL:2011. It is the only surviving mainstream declarative language, now that XQuery can be considered pretty dead for the mainstream. It can be easily mixed with procedural languages, as PL/SQL and T-SQL (and other procedural dialects) have shown. It can be easily mixed with object-oriented or functional languages, as jOOQ has shown.
At Data Geekery, we believe that SQL is the best way to query data. You don’t agree with any of the above? That’s fine, you don’t have to. Sometimes, even we agree with Winston Churchill who is known to have said:
SQL is the worst form of database querying, except for all the other forms.
Say you have a big database with lots of tables and foreign key references. Now you would like to know all tables that are somehow inter-connected by their respective foreign key relationship “paths”. You could call this a “convex hull” around all of your “correlated tables”. Here’s a pseudo-algorithm to achieve this:
// Initialise the hull with an "origin" table
Set tables = {"any table"};
int size = 0;
// Grow the "tables" result until no new tables are added
while (size < tables.size) {
size = tables.size;
for (table in tables) {
tables.addAll(table.referencedTables);
tables.addAll(table.referencingTables);
}
}
At the end of this algorithm, you would have all tables in the “tables” set, that are somehow connected with the original “any table”.
Calculate this with jOOQ
With jOOQ’s generated classes, you can easily implement the above algorithm in Java. This would be an example implementation
public class Hull {
    public static Set<Table<?>> hull(Table<?>... tables) {
        Set<Table<?>> result =
            new HashSet<Table<?>>(Arrays.asList(tables));

        // Loop until no new tables are added
        int size = 0;
        while (result.size() > size) {
            size = result.size();

            for (Table<?> table : new ArrayList<Table<?>>(result)) {

                // Follow all outbound foreign keys
                for (ForeignKey<?, ?> fk : table.getReferences()) {
                    result.add(fk.getKey().getTable());
                }

                // Follow all inbound foreign keys from tables
                // within the same schema
                for (Table<?> other : table.getSchema().getTables()) {
                    if (other.getReferencesTo(table).size() > 0) {
                        result.add(other);
                    }
                }
            }
        }

        return result;
    }

    public static void main(String[] args) {

        // Calculate the "convex hull" for the T_AUTHOR table
        System.out.println(hull(T_AUTHOR));
    }
}
Do it with SQL
Now this still looks straightforward. But we’re SQL pros and we love weird queries, so let’s give Oracle SQL a shot at resolving this problem in a single SQL statement. Here goes (warning, some serious SQL ahead)!
-- "graph" denotes an undirected foreign key reference graph
-- for schema "TEST"
with graph as (
select c1.table_name t1, c2.table_name t2
from all_constraints c1
join all_constraints c2
on c1.owner = c2.r_owner
and c1.constraint_name = c2.r_constraint_name
where c1.owner = 'TEST'
union all
select c2.table_name t1, c1.table_name t2
from all_constraints c1
join all_constraints c2
on c1.owner = c2.r_owner
and c1.constraint_name = c2.r_constraint_name
where c1.owner = 'TEST'
),
-- "paths" are all directed paths within that schema
-- as a #-delimited string
paths as (
select sys_connect_by_path(t1, '#') || '#' path
from graph
connect by nocycle prior t1 = t2
),
-- "subgraph" are all those directed paths that go trough
-- a given table T_AUTHOR
subgraph as (
select distinct t.table_name,
regexp_replace(p.path, '^#(.*)#$', '\1') path
from paths p
cross join all_tables t
where t.owner = 'TEST'
and p.path like '%#' || t.table_name || '#%'
),
-- This XML-trick splits paths and generates rows for every distinct
-- table name
split_paths as (
select distinct table_name origin,
cast(t.column_value.extract('//text()') as varchar2(4000)) table_names
from
subgraph,
table(xmlsequence(xmltype(
'<x><x>' || replace(path, '#', '</x><x>') ||
'</x></x>').extract('//x/*'))) t
),
-- "table_graphs" lists every table and its associated graph
table_graphs as (
select
origin,
count(*) graph_size,
listagg(table_names, ', ') within group (order by 1) table_names
from split_paths
group by origin
)
select
origin,
graph_size "SIZE",
dense_rank() over (order by table_names) id,
table_names
from table_graphs
order by origin
When run against the jOOQ integration test database, this beautiful query returns each table (the origin) along with its graph size and the list of tables it is connected to.
I was recently wondering about some issue I had encountered between two tables. If tables undergo a lot of INSERT / UPDATE / DELETE statements, it may appear to be better to remove some constraints, at least temporarily for the loading of data. In this particular case, the foreign key relationship was permanently absent, and I found the join between the two tables to be a potential source of bad query execution plans in larger queries. So my intuition told me that this could be optimised by adding the constraint again, as Oracle would then be able to formally use that information for query transformations. I asked the question on Stack Overflow and it seems I was right.
But what does that mean? It’s simple. If you have two tables A and B, and you join them on A.ID = B.A_ID, having a foreign key constraint on B.A_ID may make all the difference. Let’s say, you execute this:
select B.* from B
join A on A.ID = B.A_ID
Without a foreign key on B.A_ID
When B.A_ID is set (i.e. is not null), there is still no guarantee that there actually exists a corresponding A.ID. A is not part of the projection and is thus, intuitively, not needed. But the join cannot be optimised away, because the database still has to check, for each B.A_ID, whether a matching A.ID exists.
With a foreign key constraint
When B.A_ID is set, then there must be a corresponding, unique A.ID. Hence, the JOIN can be ignored in this case. This has powerful implications for all sorts of transformation operations.
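So, if the relationship really holds, it can pay off to declare it explicitly (the constraint name is just an example, and A.ID is assumed to be the primary key):

alter table B
  add constraint fk_b_a
  foreign key (A_ID) references A (ID);

-- With the constraint in place, the optimiser may now eliminate the join
-- entirely when no columns of A are projected:
select B.* from B
join A on A.ID = B.A_ID;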
For more details, have a look at Tom Kyte’s presentation “Meta Data Matters”