How to Quickly Rename all Primary Keys in Oracle


Are you working with someone else’s schema and they haven’t declared nice names for all their constraints?

Unfortunately, it is all too easy to create a table like this:

CREATE TABLE order1 (
  order_id NUMBER(18) NOT NULL PRIMARY KEY
);

Or like this:

CREATE TABLE order2 (
  order_id NUMBER(18) NOT NULL,

  PRIMARY KEY (order_id)
);

Sure, you get a little convenience when writing the table. But from now on, you’re stuck with weird, system generated names both for the constraint and for the backing index. For instance, when doing execution plan analyses:

EXPLAIN PLAN FOR
SELECT *
FROM order1
JOIN order2 USING (order_id)
WHERE order_id = 1;

SELECT * FROM TABLE (dbms_xplan.display);

The simplified execution plan (output of the above queries) is this:

-------------------------------------------
| Id  | Operation          | Name         |
-------------------------------------------
|   0 | SELECT STATEMENT   |              |
|   1 |  NESTED LOOPS      |              |
|*  2 |   INDEX UNIQUE SCAN| SYS_C0042007 |
|*  3 |   INDEX UNIQUE SCAN| SYS_C0042005 |
-------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - access("ORDER2"."ORDER_ID"=1)
   3 - access("ORDER1"."ORDER_ID"=1)

So, I got these system-generated index names called SYS_C0042007 and SYS_C0042005. What do they mean? Perhaps I can derive the actual meaning from the predicate information, as SYS_C0042007 is accessed in operation #2, which uses an access predicate on ORDER2. Fine. But do I really need to look these things up all the time?

Just name your constraints. Always!

Don’t be fooled by this convenience. It’ll hurt you time and again, not just when doing analyses. You might not be able to easily import / export your schema to some other database, because the generated names might already be taken there.

So, do this instead:

CREATE TABLE order2 (
  order_id NUMBER(18) NOT NULL,

  CONSTRAINT pk_order2 PRIMARY KEY (order_id)
);

Find a naming scheme (any naming scheme will do), for instance PK_[table name]. If you’re cleaning up an existing database, this might help:

SET SERVEROUTPUT ON
BEGIN
  FOR stmts IN (
    SELECT 
      'ALTER TABLE ' || table_name || 
      ' RENAME CONSTRAINT ' || constraint_name || 
      ' TO PK_' || table_name AS stmt
    FROM user_constraints
    WHERE constraint_name LIKE 'SYS%'
    AND constraint_type = 'P'
  ) LOOP
    dbms_output.put_line(stmts.stmt);
    EXECUTE IMMEDIATE stmts.stmt;
  END LOOP;
  
  FOR stmts IN (
    SELECT 
      'ALTER INDEX ' || index_name || 
      ' RENAME TO PK_' || table_name AS stmt
    FROM user_constraints
    WHERE index_name LIKE 'SYS%'
    AND constraint_type = 'P'
  ) LOOP
    dbms_output.put_line(stmts.stmt);
    EXECUTE IMMEDIATE stmts.stmt;
  END LOOP;
END;
/

The above yields (and runs) the following statements:

ALTER TABLE ORDER1 RENAME CONSTRAINT SYS_C0042005 TO PK_ORDER1
ALTER TABLE ORDER2 RENAME CONSTRAINT SYS_C0042007 TO PK_ORDER2
ALTER INDEX SYS_C0042005 RENAME TO PK_ORDER1
ALTER INDEX SYS_C0042007 RENAME TO PK_ORDER2
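
The same approach extends to other system-named constraints. Here’s a hedged sketch for unique constraints – the UK_[table name]_[n] naming scheme with a counter is just one possible convention (a table can have several unique constraints), and the generated names must still fit within Oracle’s identifier length limit:

SET SERVEROUTPUT ON
BEGIN
  FOR stmts IN (
    SELECT 
      'ALTER TABLE ' || table_name || 
      ' RENAME CONSTRAINT ' || constraint_name || 
      ' TO UK_' || table_name || '_' ||
      ROW_NUMBER() OVER (
        PARTITION BY table_name 
        ORDER BY constraint_name
      ) AS stmt
    FROM user_constraints
    WHERE constraint_name LIKE 'SYS%'
    AND constraint_type = 'U'
  ) LOOP
    dbms_output.put_line(stmts.stmt);
    EXECUTE IMMEDIATE stmts.stmt;
  END LOOP;
END;
/

Renaming the backing indexes works just like in the primary key example above.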

You can of course repeat the exercise for unique constraints etc., as sketched above – in a real database, the naming scheme might be a bit more involved there. Now re-calculate the execution plan and check this out:

EXPLAIN PLAN FOR
SELECT *
FROM order1
JOIN order2 USING (order_id)
WHERE order_id = 1;

SELECT * FROM TABLE (dbms_xplan.display);

The simplified execution plan (output of the above queries) is this:

----------------------------------------
| Id  | Operation          | Name      |
----------------------------------------
|   0 | SELECT STATEMENT   |           |
|   1 |  NESTED LOOPS      |           |
|*  2 |   INDEX UNIQUE SCAN| PK_ORDER2 |
|*  3 |   INDEX UNIQUE SCAN| PK_ORDER1 |
----------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - access("ORDER2"."ORDER_ID"=1)
   3 - access("ORDER1"."ORDER_ID"=1)

There! That’s much more like it.

Be Careful When Emulating Parameterised Views with SYS_CONTEXT in Oracle


Everyone who writes complex SQL (or, dare I say, moves business logic into the database?) wishes for parameterised views from time to time. This hypothetical syntax would be terrific:

CREATE VIEW v_categories_per_actor(
  p_actor_id NUMBER
) AS
SELECT DISTINCT c.name
FROM category c
JOIN film_category fc USING (category_id)
JOIN film_actor fa USING (film_id)
WHERE fa.actor_id = p_actor_id

Or as user-defined functions if you insist (as available in SQL Server):

CREATE FUNCTION v_categories_per_actor(
  p_actor_id NUMBER
) RETURNS TABLE AS
SELECT DISTINCT c.name
FROM category c
JOIN film_category fc USING (category_id)
JOIN film_actor fa USING (film_id)
WHERE fa.actor_id = p_actor_id

As always on this blog, I’m using the useful Sakila database for examples.

The above parameterised view could now be used very nicely with awesome SQL features like APPLY / LATERAL:

-- Oracle 12c / SQL Server syntax:
SELECT a.first_name, a.last_name, c.name
FROM actor a
OUTER APPLY v_categories_per_actor(a.actor_id)

-- Oracle 12c / PostgreSQL / SQL standard syntax:
SELECT a.first_name, a.last_name, c.name
FROM actor a
LEFT JOIN LATERAL v_categories_per_actor(a.actor_id) ON 1 = 1

More info about LATERAL here. The good news about the above is: SQL Server supports this through inline table valued functions, where the database optimiser can actually go inside the function / view and optimise it away, potentially implementing it with ordinary JOIN algorithms!

Unfortunately, this isn’t true for Oracle (yet). Even in Oracle 12c (where APPLY / LATERAL is now supported), table valued functions are opaque for the optimiser, which cannot “peek inside” to apply nice SQL transformation operations to optimise such statements.

People emulate parameterised views with SYS_CONTEXT

A lot of people use Oracle’s SYS_CONTEXT feature to emulate parameterised views:

A SYS_CONTEXT is essentially a global variable (per session), which you can set e.g. when getting a connection out of your connection pool in your Java client application. From then on, all SQL statements that use SYS_CONTEXT are globally “parameterised”. For example, our previous view becomes this:

CREATE VIEW v_categories_per_actor AS
SELECT DISTINCT c.name
FROM category c
JOIN film_category fc USING (category_id)
JOIN film_actor fa USING (film_id)
WHERE fa.actor_id = TO_NUMBER(sys_context('MY_APP', 'ACTOR_ID'))

Now, of course, this view cannot be used (yet), because it will always return an empty result:

SELECT * 
FROM v_categories_per_actor
ORDER BY name;

Which yields:

NAME 
----

Now, let’s do that SYS_CONTEXT initialisation magic:

-- Just some boilerplate you have to do once
CREATE CONTEXT my_app USING set_ctx;

CREATE OR REPLACE PROCEDURE set_ctx(
  p_actor_id NUMBER := NULL
) IS
BEGIN
  dbms_session.set_context('MY_APP', 'ACTOR_ID', p_actor_id);
END;
/

-- This will now set the ACTOR_ID for our session
EXEC set_ctx(1);

If we re-run our previous statement that queries the view now:

SELECT * 
FROM v_categories_per_actor
ORDER BY name;

We’ll get an actual result!

NAME
-----------
Animation
Children
Classics
Comedy
Documentary
Family
Foreign
Games
Horror
Music
New
Sci-Fi
Sports

Now, of course this is nowhere near as powerful as actual parameterised views, because I cannot lateral join this thing to an actual actor table, setting that context on each row. But it already helps if you want your Java client application to be able to set a parameter for such a view. Let’s call it:

Oracle’s poor man’s parameterised views

How does SYS_CONTEXT impact performance?

A client of mine makes heavy use of SYS_CONTEXT for security purposes, in order to implement something similar to row-level security. I’ll show an example later on.

You’ll also find a simpler "security" use-case in an interesting Ask Tom article:

Note how in the WHERE clause, I always use the SYS_CONTEXT function. I never put the literal value into the query – that would be very bad for

  • Performance
  • Scalability
  • Shared pool utilization
  • Perhaps most important: security (SQL injection)

I don’t really agree with this, because in my opinion, bind variables are a much better approach to this simplistic security concern.
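
For comparison, here’s a minimal sketch of the bind variable alternative – a plain parameterised query rather than a view, with :actor_id as the placeholder, using the same Sakila tables as above:

-- With a bind variable, the optimiser can peek at the actual value
-- and adapt the execution plan to it (adaptive cursor sharing)
SELECT DISTINCT c.name
FROM category c
JOIN film_category fc USING (category_id)
JOIN film_actor fa USING (film_id)
WHERE fa.actor_id = :actor_id;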

In addition to that, since Oracle 11g, we have adaptive cursor sharing and bind variable peeking, which means that the same query can produce distinct execution plans depending on the actual bind variable values. This isn’t (currently) being done for SYS_CONTEXT. See also this very interesting article by Connor McDonald:
https://connormcdonald.wordpress.com/2016/10/20/taking-a-peek-at-sys_context

Connor was kind enough to write this article as a reply to a question about SYS_CONTEXT that I asked on Twitter.

Not peeking at SYS_CONTEXT may be good or bad, depending on your taste and on your opinion about Oracle 11g’s adaptive cursor sharing feature. I’m personally a big fan of it, although I do appreciate that some experts don’t like it too much (see also Tony Hasler’s opinions in Expert Oracle SQL).

Row level security with SYS_CONTEXT

The important thing to remember is that no peeking is done at SYS_CONTEXT values. What does this mean?

This client of mine has always used SYS_CONTEXT just like I mentioned before, for security reasons. More specifically, they implemented sophisticated row-level security with it (they did this before Oracle had out-of-the-box support for row-level security). Again, if you look at the previous query:

CREATE VIEW v_categories_per_actor AS
SELECT DISTINCT c.name
FROM category c
JOIN film_category fc USING (category_id)
JOIN film_actor fa USING (film_id)
WHERE fa.actor_id = TO_NUMBER(sys_context('MY_APP', 'ACTOR_ID'))

You could interpret this as being a security feature. Actors can only access their own data across the entire database. In this case, if clients don’t get GRANTs to tables, but only to these views, it’s possible to prevent access to all sorts of data that is not related to “my own actor_id”, even if client code forgets to add the appropriate predicate.
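
A hedged sketch of what such a grant setup might look like (APP_USER is a hypothetical application schema or role):

-- Grant access to the view only, never to the underlying tables
GRANT SELECT ON v_categories_per_actor TO app_user;

-- There are no grants on CATEGORY, FILM_CATEGORY or FILM_ACTOR for APP_USER,
-- so the SYS_CONTEXT predicate cannot be bypassed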

The execution plan for a query against this view shows that it’s quite decent:

-----------------------------------------------------------------------------
| Id  | Operation                 | Name         | Starts | E-Rows | A-Rows |
-----------------------------------------------------------------------------
|   0 | SELECT STATEMENT          |              |      1 |        |     13 |
|   1 |  SORT ORDER BY            |              |      1 |     16 |     13 |
|   2 |   VIEW                    | V_CATEGORIE..|      1 |     16 |     13 |
|   3 |    HASH UNIQUE            |              |      1 |     16 |     13 |
|*  4 |     HASH JOIN             |              |      1 |     27 |     19 |
|   5 |      NESTED LOOPS         |              |      1 |     27 |     19 |
|   6 |       INDEX FAST FULL SCAN| PK_FILM_CAT..|      1 |   1000 |   1000 |
|*  7 |       INDEX UNIQUE SCAN   | PK_FILM_ACTOR|   1000 |      1 |     19 |
|   8 |      TABLE ACCESS FULL    | CATEGORY     |      1 |     16 |     16 |
-----------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   4 - access("C"."CATEGORY_ID"="FC"."CATEGORY_ID")
   7 - access("FA"."ACTOR_ID"=TO_NUMBER(SYS_CONTEXT('MY_APP','ACTOR_ID')) 
              AND "FC"."FILM_ID"="FA"."FILM_ID")

As you can see, the cardinality estimates are all pretty OK and this query does the right thing as the database knows it can use the single SYS_CONTEXT value for a fast access predicate on the FILM_ACTOR primary key.

Let’s ignore for a moment the slightly wrong cardinality estimate on PK_FILM_ACTOR, whose effects can be fixed by using a /*+LEADING(fa fc)*/ hint inside the view. But this is not related to using SYS_CONTEXT.
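
For completeness, a hedged sketch of what that hinted view might look like – whether the hint actually helps depends on your data and your Oracle version:

CREATE OR REPLACE VIEW v_categories_per_actor AS
SELECT /*+LEADING(fa fc)*/ DISTINCT c.name
FROM category c
JOIN film_category fc USING (category_id)
JOIN film_actor fa USING (film_id)
WHERE fa.actor_id = TO_NUMBER(sys_context('MY_APP', 'ACTOR_ID'))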

Using SYS_CONTEXT to overload view behaviour

Now, that particular client evaluated whether SYS_CONTEXT can be used for something else entirely: putting two UNION ALL subqueries in a view and, depending on the "session mode", executing one or the other query. Let’s assume you have different types of users in your client application:

  • Normal customers: “C”
  • Operators: “O”

The latter have more privileges and can see entirely different data. In terms of our view, this might mean the following:

CREATE OR REPLACE VIEW v_categories_per_actor AS

SELECT DISTINCT actor_id, c.name
FROM category c
JOIN film_category fc USING (category_id)
JOIN film_actor fa USING (film_id)
WHERE fa.actor_id = TO_NUMBER(sys_context('MY_APP', 'ACTOR_ID'))
AND sys_context('MY_APP', 'USER_TYPE') = 'C'

UNION ALL

SELECT DISTINCT actor_id, c.name
FROM category c
JOIN film_category fc USING (category_id)
JOIN film_actor fa USING (film_id)
WHERE sys_context('MY_APP', 'USER_TYPE') = 'O'

What this means now is that the same view is reused for both types of users: customers/actors (C) and operators (O). Customers only get to see their own data whereas operators get to see all data. Imagine that the real query is much more complex.

So, this is really nice, because you can start reusing complex views, nest them inside other complex views, and the behaviour of your entire application changes depending on who is logged in. So, this shouldn’t just be called row-level security; it’s really an access control list, because that’s what they’re implementing.

Excellent!

So where’s the problem? The problem lies in the execution plans. Let’s update our procedure again:

CREATE OR REPLACE PROCEDURE set_ctx(
  p_actor_id NUMBER := NULL
) IS
BEGIN
  dbms_session.set_context('MY_APP', 'ACTOR_ID', p_actor_id);
  dbms_session.set_context('MY_APP', 'USER_TYPE', 
    CASE WHEN p_actor_id IS NULL THEN 'O' ELSE 'C' END);
END;
/

EXEC set_ctx(1);

Not exactly high security: if we initialise the context with a P_ACTOR_ID, we’re a customer (C); otherwise, we’re an operator (O). What’s our plan now?
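
Side note: the sql_id passed to dbms_xplan.display_cursor() below was looked up in the cursor cache first. A hedged sketch of such a lookup – the filter on sql_text is just an assumption for illustration:

-- Find the sql_id of the most recent statement touching the view
SELECT sql_id, sql_text
FROM v$sql
WHERE lower(sql_text) LIKE '%v_categories_per_actor%'
AND lower(sql_text) NOT LIKE '%v$sql%'
ORDER BY last_active_time DESC;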

SELECT * FROM TABLE (dbms_xplan.display_cursor(
  sql_id => 'cmdfbydppfqbu',
  format => 'ALLSTATS LAST'
));

Which yields…

-----------------------------------------------------------------------------------------
| Id  | Operation                   | Name                   | Starts | E-Rows | A-Rows |
-----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |                        |      1 |        |     13 |
|   1 |  SORT ORDER BY              |                        |      1 |   2275 |     13 |
|   2 |   VIEW                      | V_CATEGORIES_PER_ACTOR |      1 |   2275 |     13 |
|   3 |    UNION-ALL                |                        |      1 |        |     13 |
|   4 |     HASH UNIQUE             |                        |      1 |     12 |     13 |
|*  5 |      FILTER                 |                        |      1 |        |     19 |
|*  6 |       HASH JOIN             |                        |      1 |     27 |     19 |
|   7 |        NESTED LOOPS         |                        |      1 |     27 |     19 |
|   8 |         INDEX FAST FULL SCAN| PK_FILM_CATEGORY       |      1 |   1000 |   1000 |
|*  9 |         INDEX UNIQUE SCAN   | PK_FILM_ACTOR          |   1000 |      1 |     19 |
|  10 |        TABLE ACCESS FULL    | CATEGORY               |      1 |     16 |     16 |
|  11 |     HASH UNIQUE             |                        |      1 |   2263 |      0 |
|* 12 |      FILTER                 |                        |      1 |        |      0 |
|* 13 |       HASH JOIN             |                        |      0 |   5462 |      0 |
|* 14 |        HASH JOIN            |                        |      0 |   1000 |      0 |
|  15 |         TABLE ACCESS FULL   | CATEGORY               |      0 |     16 |      0 |
|  16 |         INDEX FAST FULL SCAN| PK_FILM_CATEGORY       |      0 |   1000 |      0 |
|  17 |        INDEX FAST FULL SCAN | PK_FILM_ACTOR          |      0 |   5462 |      0 |
-----------------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   5 - filter(SYS_CONTEXT('MY_APP','USER_TYPE')='C')
   6 - access("C"."CATEGORY_ID"="FC"."CATEGORY_ID")
   9 - access("FA"."ACTOR_ID"=TO_NUMBER(SYS_CONTEXT('MY_APP','ACTOR_ID')) 
              AND "FC"."FILM_ID"="FA"."FILM_ID")
  12 - filter(SYS_CONTEXT('MY_APP','USER_TYPE')='O')
  13 - access("FC"."FILM_ID"="FA"."FILM_ID")
  14 - access("C"."CATEGORY_ID"="FC"."CATEGORY_ID")

As you can see, the first part of the UNION-ALL concatenation is still roughly the same, except that we now have this FILTER operation at operation #5. The second part of the UNION-ALL operation, however, got its cardinalities completely wrong: operation #11 estimates 2263 rows even though there are none.

The E-Rows column (estimated rows) suggests that all data is selected from our Sakila database, i.e. 1000 FILM_CATEGORY relationships and 5462 FILM_ACTOR relationships, which is all of our data. But the A-Rows column (actual rows) shows zero, as expected, because we set our USER_TYPE context value to 'C' for customer, not 'O' for operator.

Also interesting is the Starts column, which shows that the operations below the FILTER operation on line #12 aren’t started at all. This was confirmed in that interesting Twitter conversation by Franck Pachot, and also by Chris Antognini.

Good news, no?

Yes and no.

  • YES: Because even if the plan looks quite bad (for customer usage), it performed optimally. At least, Oracle knew when to stop even if estimates were wrong.
  • NO: Because all these cardinality (and cost) estimates will propagate, leading to bigger and bigger errors, depending on how you use this view.

Check out the following query:

SELECT actor_id, name, COUNT(*)
FROM v_categories_per_actor ca
JOIN category c USING (name)
JOIN film_category fc USING (category_id)
JOIN film_actor fa USING (film_id, actor_id)
GROUP BY actor_id, name
ORDER BY actor_id, name;

The above query could of course be stored in a view again to fit our security concept of giving grants only to views, not tables… In any case, what we’re doing here is the following:

We take all categories per actor from our previous view, and then we count the total number of films in each of those categories for that actor. If we’re still running this in a customer (C) context with ACTOR_ID = 1, we’ll get:

  ACTOR_ID NAME                        COUNT(*)
---------- ------------------------- ----------
         1 Animation                          1
         1 Children                           1
         1 Classics                           2
         1 Comedy                             1
         1 Documentary                        1
         1 Family                             2
         1 Foreign                            1
         1 Games                              2
         1 Horror                             3
         1 Music                              1
         1 New                                2
         1 Sci-Fi                             1
         1 Sports                             1

Which translates to: I (my user = ACTOR_ID = 1) have played in these categories, and each category contains this many films that I played in. Again, the actual query might be much more complex, where things can’t easily be factored out (e.g. avoiding duplicate access to various tables). I’m just trying to make a point here.

What’s the plan for this query? It’s very bad:

---------------------------------------------------------------------------------------
| Id  | Operation                       | Name             | Starts | E-Rows | A-Rows |
---------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                |                  |      1 |        |     13 |
|   1 |  SORT ORDER BY                  |                  |      1 |   2263 |     13 |
|   2 |   HASH GROUP BY                 |                  |      1 |   2263 |     13 |
|   3 |    NESTED LOOPS                 |                  |      1 |   3883 |     19 |
|*  4 |     HASH JOIN                   |                  |      1 |  62130 |    247 |
|*  5 |      HASH JOIN                  |                  |      1 |   2275 |     13 |
|   6 |       TABLE ACCESS FULL         | CATEGORY         |      1 |     16 |     16 |
|   7 |       VIEW                      | V_CATEGORIES_... |      1 |   2275 |     13 |
|   8 |        UNION-ALL                |                  |      1 |        |     13 |
|   9 |         HASH UNIQUE             |                  |      1 |     12 |     13 |
|* 10 |          FILTER                 |                  |      1 |        |     19 |
|* 11 |           HASH JOIN             |                  |      1 |     27 |     19 |
|  12 |            NESTED LOOPS         |                  |      1 |     27 |     19 |
|  13 |             INDEX FAST FULL SCAN| PK_FILM_CATEGORY |      1 |   1000 |   1000 |
|* 14 |             INDEX UNIQUE SCAN   | PK_FILM_ACTOR    |   1000 |      1 |     19 |
|  15 |            TABLE ACCESS FULL    | CATEGORY         |      1 |     16 |     16 |
|  16 |         HASH UNIQUE             |                  |      1 |   2263 |      0 |
|* 17 |          FILTER                 |                  |      1 |        |      0 |
|* 18 |           HASH JOIN             |                  |      0 |   5462 |      0 |
|* 19 |            HASH JOIN            |                  |      0 |   1000 |      0 |
|  20 |             TABLE ACCESS FULL   | CATEGORY         |      0 |     16 |      0 |
|  21 |             INDEX FAST FULL SCAN| PK_FILM_CATEGORY |      0 |   1000 |      0 |
|  22 |            INDEX FAST FULL SCAN | PK_FILM_ACTOR    |      0 |   5462 |      0 |
|  23 |      INDEX FAST FULL SCAN       | PK_FILM_ACTOR    |      1 |   5462 |   5462 |
|* 24 |     INDEX UNIQUE SCAN           | PK_FILM_CATEGORY |    247 |      1 |     19 |
---------------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   4 - access("CA"."ACTOR_ID"="FA"."ACTOR_ID")
   5 - access("CA"."NAME"="C"."NAME")
  10 - filter(SYS_CONTEXT('MY_APP','USER_TYPE')='C')
  11 - access("C"."CATEGORY_ID"="FC"."CATEGORY_ID")
  14 - access("FA"."ACTOR_ID"=TO_NUMBER(SYS_CONTEXT('MY_APP','ACTOR_ID'))
       AND "FC"."FILM_ID"="FA"."FILM_ID")
  17 - filter(SYS_CONTEXT('MY_APP','USER_TYPE')='O')
  18 - access("FC"."FILM_ID"="FA"."FILM_ID")
  19 - access("C"."CATEGORY_ID"="FC"."CATEGORY_ID")
  24 - access("FC"."FILM_ID"="FA"."FILM_ID" AND "C"."CATEGORY_ID"="FC"."CATEGORY_ID")

Why is it so bad?

Observe how at some point, I had a cardinality estimate of 62130 (operation #4), and the whole query was still expected to return 2263 rows (operation #1), when in fact, I got only 13.

Even if the optimiser got the number of rows almost right for the first UNION ALL subquery (estimated 12, got 13 on operation #9), the estimate for the second UNION ALL subquery made it think that, with so many rows coming out of the view (2263 on operation #16), a hash join would be optimal to count the number of films (operations #4 and #5). While the actual numbers aren’t as bad as estimated, the hash join operation is much more costly for small data sets than an equivalent nested loop join operation.

If we remove that UNION ALL operation from the view again, restricting it back to the original customer-only (C) use case:

CREATE OR REPLACE VIEW v_categories_per_actor AS
SELECT DISTINCT actor_id, c.name
FROM category c
JOIN film_category fc USING (category_id)
JOIN film_actor fa USING (film_id)
WHERE fa.actor_id = TO_NUMBER(sys_context('MY_APP', 'ACTOR_ID'))
AND sys_context('MY_APP', 'USER_TYPE') = 'C'

… then, we get a much better plan for that aggregation, where we now get those nested loops:

-------------------------------------------------------------------------------------
| Id  | Operation                     | Name             | Starts | E-Rows | A-Rows |
-------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT              |                  |      1 |        |     13 |
|   1 |  SORT ORDER BY                |                  |      1 |     12 |     13 |
|   2 |   HASH GROUP BY               |                  |      1 |     12 |     13 |
|   3 |    VIEW                       | VM_NWVW_1        |      1 |     47 |     19 |
|   4 |     HASH UNIQUE               |                  |      1 |     47 |     19 |
|*  5 |      FILTER                   |                  |      1 |        |     33 |
|   6 |       NESTED LOOPS            |                  |      1 |     47 |     33 |
|*  7 |        HASH JOIN              |                  |      1 |    746 |    361 |
|   8 |         NESTED LOOPS          |                  |      1 |     27 |     19 |
|*  9 |          HASH JOIN            |                  |      1 |   1000 |   1000 |
|* 10 |           HASH JOIN           |                  |      1 |     16 |     16 |
|  11 |            TABLE ACCESS FULL  | CATEGORY         |      1 |     16 |     16 |
|  12 |            TABLE ACCESS FULL  | CATEGORY         |      1 |     16 |     16 |
|  13 |           INDEX FAST FULL SCAN| PK_FILM_CATEGORY |      1 |   1000 |   1000 |
|* 14 |          INDEX UNIQUE SCAN    | PK_FILM_ACTOR    |   1000 |      1 |     19 |
|* 15 |         INDEX RANGE SCAN      | PK_FILM_ACTOR    |      1 |     27 |     19 |
|* 16 |        INDEX UNIQUE SCAN      | PK_FILM_CATEGORY |    361 |      1 |     33 |
-------------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   5 - filter(SYS_CONTEXT('MY_APP','USER_TYPE')='C')
   7 - access("FA"."ACTOR_ID"="FA"."ACTOR_ID")
   9 - access("C"."CATEGORY_ID"="FC"."CATEGORY_ID")
  10 - access("C"."NAME"="C"."NAME")
  14 - access("FA"."ACTOR_ID"=TO_NUMBER(SYS_CONTEXT('MY_APP','ACTOR_ID')) 
              AND "FC"."FILM_ID"="FA"."FILM_ID")
  15 - access("FA"."ACTOR_ID"=TO_NUMBER(SYS_CONTEXT('MY_APP','ACTOR_ID')))
  16 - access("FC"."FILM_ID"="FA"."FILM_ID" AND "C"."CATEGORY_ID"="FC"."CATEGORY_ID")

All the cardinality estimates are now much better, unsurprisingly. Let’s benchmark and measure, too:

SET SERVEROUTPUT ON
DECLARE
  v_ts TIMESTAMP;
  v_repeat CONSTANT NUMBER := 1000;
BEGIN
  v_ts := SYSTIMESTAMP;
    
  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT actor_id, name, COUNT(*)
      FROM v_categories_per_actor ca -- No UNION ALL
      JOIN category c USING (name)
      JOIN film_category fc USING (category_id)
      JOIN film_actor fa USING (film_id, actor_id)
      GROUP BY actor_id, name
      ORDER BY actor_id, name
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;
    
  dbms_output.put_line('Statement 1 : ' || (SYSTIMESTAMP - v_ts));
  v_ts := SYSTIMESTAMP;
    
  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT actor_id, name, COUNT(*)
      FROM v_categories_per_actor2 ca -- With UNION ALL
      JOIN category c USING (name)
      JOIN film_category fc USING (category_id)
      JOIN film_actor fa USING (film_id, actor_id)
      GROUP BY actor_id, name
      ORDER BY actor_id, name
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;
    
  dbms_output.put_line('Statement 2 : ' || (SYSTIMESTAMP - v_ts));
END;
/

Which yields:

Statement 1 : +000000000 00:00:01.940000000
Statement 2 : +000000000 00:00:02.923000000

Clearly, that UNION ALL and its resulting hash joins are hurting us drastically!

Conclusion: Use SYS_CONTEXT with care

SYS_CONTEXT can be useful to emulate parameterised views as we’ve seen. There’s nothing wrong with doing "extended row-level security" by adding SYS_CONTEXT predicates in views and granting access only to views, not tables, e.g.

CREATE OR REPLACE VIEW v_categories_per_actor AS
SELECT DISTINCT actor_id, c.name
FROM category c
JOIN film_category fc USING (category_id)
JOIN film_actor fa USING (film_id)
WHERE fa.actor_id = TO_NUMBER(sys_context('MY_APP', 'ACTOR_ID'))
AND sys_context('MY_APP', 'USER_TYPE') = 'C'

However, due to the lack of peeking at those SYS_CONTEXT values, we cannot profit from the adaptive cursor sharing feature. This would have been possible with real parameterised views (as they are supported in SQL Server, for instance), but as it stands, we don’t get multiple alternative execution plans for the same SQL query depending on SYS_CONTEXT values. This has been shown by Connor McDonald in his blog post:
https://connormcdonald.wordpress.com/2016/10/20/taking-a-peek-at-sys_context

I do hope that a future version of Oracle will treat SYS_CONTEXT more like bind variables, because ultimately, that’s what they are: Constant external values for the scope of a single query execution, or “parameters” like in “parameterised views”. Until we have that (or real parameterised views), I strongly advise against using SYS_CONTEXT for the use-case that my client was testing (cutting off individual UNION ALL subtrees from execution plans).

But the good news is: There’s nothing wrong with the ordinary use-case (forming predicates in WHERE clauses).

PS: If you like the idea of peeking at SYS_CONTEXT, be sure to also leave a comment here: https://community.oracle.com/ideas/15826

jOOQ Tuesdays: Daniel Dietrich Explains the Benefits of Object-Functional Programming


Welcome to the jOOQ Tuesdays series. In this series, we’ll publish an article on the third Tuesday every other month where we interview someone we find exciting in our industry from a jOOQ perspective. This includes people who work with SQL, Java, Open Source, and a variety of other related topics.


I’m very excited to feature Daniel Dietrich today, whose popular library JΛVΛSLΛNG is picking up a lot of momentum among functional programming aficionados working with Java.

Daniel, you created JΛVΛSLΛNG – Object-Functional Programming in Java, a library that is becoming more and more popular among functional programmers. Why is Javaslang so popular?

Thank you Lukas for giving me the opportunity to share my thoughts.

I think that many users were disappointed with Java 8 on the whole, especially those who are already familiar with more advanced languages. The Java language architects did an awesome job. Java 8 brought groundbreaking new features like Lambdas, the new Stream API and CompletableFuture. But the new abstractions were only poorly integrated into the language from an API perspective.

There is already an increasing amount of write-ups about the disadvantages of Java 8, starting with the drawbacks of the Optional type. We read that we have to take care when using parallel Streams. These are self-made problems that keep us busy, stealing our expensive time. Javaslang provides us with alternatives.

There is no reason to reinvent the wheel. My vision is to bring as much as possible of the Scala goodness to Java. In fact Scala emerged from Java in the form of the Pizza language. Back in 2001 it had features like generics, function pointers (aka lambdas), case classes (aka value types) and pattern matching. In 2004 Java got generics, in 2014 came lambdas, and hopefully Java 10 will include value types. Scala left Java far behind. It used the last 15 years to evolve.

Object-functional programming is nothing new. It is the best of both worlds, object-oriented programming and functional programming. Scala is one of the better choices to do it on the JVM. Java’s Lambdas are an enabling feature. They allowed us to create a Javaslang API that is similar to Scala.

Java developers who get their hands on Javaslang often react in a way that I call the nice-effect: “Wow that’s nice, it feels like Scala”.

You published a guest post about Javaslang on the jOOQ blog more than a year ago. Since then, Javaslang has moved forward quite a bit and you’ve recently published the roadmap for version 3.0. What have you done since then and where are you going?

Yes, that is true, it has changed a lot since then. We released Javaslang 1.2.2 two weeks before the first jOOQ guest post went online. Besides enriched functions, that release offered popular Scala features like Option for null-safety, Try for performing computations headache-free in the presence of exceptions, and a fluent pattern matching DSL. Also notably, we shipped two common persistent collections, an eagerly evaluated linked List and its lazy counterpart, also called Stream.

Roughly one year later we released Javaslang 2.0.0. We hardened the existing features and most notably included Future and Promise for concurrent programming and a full-fledged, Scala-like persistent collection library. Besides that, we replaced the pattern matching DSL with a more powerful pattern matching API that allows us to recursively match arbitrary object trees.

I spent a significant amount of time and energy abstracting on the type level over the mentioned features, as far as this is possible in Java. For Java developers it is not important to call things monads, sum-types or products. For example we do not need to know group theory in order to calculate 1 + 1. My duty as library developer is to make it as simple as possible for users of Javaslang to reach their goals. The need to learn new APIs and DSLs should be reduced to the minimum. This is the main reason for aligning Javaslang to Scala.

Our efforts for the next release concentrate on adding more syntactic sugar and missing persistent collections beyond those of Scala. It will be sufficient to add one import to reach 90% of Javaslang’s API. There will be new persistent collections BitSet, several MultiMaps and a PriorityQueue. We are improving the performance of our collections, most notably our persistent Vector. It will be faster than Java’s Stream for some operations and have a smaller memory footprint than Java’s ArrayList for primitive elements.

Beyond library features, we pay special attention to three things: backward compatibility, controlled growth and integration aspects. Web is important. Our Jackson module ensures that all Javaslang types can be sent over the wire as serialized JSON. The next release will include a GWT module; first tests already run Javaslang in the browser. However, the Javaslang core will stay thin. It will not depend on any libraries other than the JDK.

Towards the next major release 3.0.0 I’m starting to adjust the roadmap I sketched in a previous blog post. I’ve learned that it is most important to our users that they can rely on backward compatibility. Major releases should not appear often, following the 2.x line is a better strategy. We will start to deprecate a few APIs that will be removed in a future major release. Also I keep an eye on some interesting developments that will influence the next major release. For example a new major Scala release is in the works and there are new interesting Java features that will appear in Java 10.

Looking at the current issues I don’t have to be an oracle to foresee that the next minor release 2.1.0 will take some more time. I understand that users want to start using the new Javaslang features but we need the time and the flexibility to get things right. Therefore we target a first beta release of 2.1.0 in Q4 2016.

In the meantime, there is a variety of functional(-ish) libraries for Java 8, like our own jOOλ, StreamEx, Cyclops, or the much older FunctionalJλvλ. How do all these libraries compare and how is yours different?

This question goes a little bit in the philosophical direction, maybe it is also political. These are my subjective thoughts, please treat them as such.

Humans have the ability to abstract over things. They express themselves in various ways, e.g. with painting and music. These areas split into different fields. For example in literature things are expressed in manifold ways like rhythmic prose and poetry. Furthermore different styles can be applied within these fields, like the iambic trimeter in poetry. The styles across different areas are often embossed by outer circumstances, bound to time, like an epoch.

In the area of mathematics there are also several fields, like algebra and mathematical analysis. Both have a notion of functions. Which field should I take when I want to express myself in a functional style?

Personally, I’m not able to afford the time to write non-trivial applications in each of the mentioned libraries. But I took a look at the source code and followed discussions. I see that nearly all libraries are embossed by the outer circumstance that lambdas finally made it to all curly-braces languages, especially to Java in our case. Library designers are keen to modernize their APIs in order to keep pace. But library designers are also interested in staying independent of 3rd party libraries for reasons like stability and progression.

The field of jOOQ is SQL in Java, the field of Cyclops is asynchronous systems. Both libraries are similar in the way that they adapted the new Java Lambda feature. I already mentioned that the new Java features are only poorly integrated into the language. This is the reason why we see a variety of new libraries that try to close this gap.

jOOQ needs jOOλ in order to stay independent. On the technical level, StreamEx is similar to jOOλ in that both sit on top of Java’s Stream. They augment it with additional functionality that can be accessed using a fluent API. The biggest difference between them is that StreamEx supports parallel computations while jOOλ concentrates on sequential computations only. The SQL-ish method names show that jOOλ is tailored to be used with jOOQ.

Cyclops states to be the answer to the cambrian explosion of functional(-ish) libraries. It offers a facade that is backed by one of several integration modules. From the developer perspective I see this with skepticism. The one-size-fits-all approach did not work well for me in the past because it does not cover all features of the backing libraries. An abstraction layer adds another source of errors, which is unnecessary.

Many names of Cyclops look unfamiliar to me, maybe because of the huge amount of types. Looking at the API, the library seems to be a black hole, a cambrian implosion of reactive and functional features. John McClean did a great job abstracting over all the different libraries and providing a common API but I prefer to use a library directly.

FunctionalJλvλ is different. It existed long before the other libraries and has the noble goal of purely functional programming: If it does compile, it is correct. FunctionalJλvλ was originally driven by people well known from the Scala community, more specifically from the Scalaz community. Scalaz is highly influenced by Haskell, a purely functional language.

Haskell and Scala are much more expressive than Java. Porting the algebraic abstractions from Scalaz to Java turned out to be awkward. Java’s type system isn’t powerful enough, it does not allow us to reach that goal in a practical way. The committers seem to be disillusioned to me. Some state that Java is not the right language for functional programming.

Javaslang is a fresh take on porting Scala functionality to Java. At its core it is not as highly influenced by Scalaz and Haskell as FunctionalJλvλ is. However, for purely functional abstractions it offers an algebra module that depends on the core. The relation algebra/core can be compared to Scalaz/Scala.

Javaslang is similar to StreamEx in the way that it is not bound to a specific domain, in contrast to jOOλ and Cyclops. It is different from StreamEx in the sense that it does not build on top of Java’s Stream. I understand Javaslang as a language addition that integrates well with existing Java features.

You have never spoken at conferences; you let other people do that for you. What’s your secret? 🙂

In fact I never attended a conference at all. My secret is to delegate the real work to others.

Joking aside, I feel more comfortable spending my time on the Javaslang source code than preparing conferences and travelling. Currently I am working on Javaslang beside my job but I’m still looking for opportunities to do it full-time.

It is awesome to see other people jumping on the Javaslang train. We receive help from all over the world. Besides IntelliJ and YourKit, we recently got TouK as a new sponsor and produced Javaslang stickers that are handed out at conferences.

Because of the increasing popularity of Javaslang there is also an increasing amount of questions and pull requests. Besides conception and development, I concentrate on code reviews, discussions and managing the project.

Where do you see Java’s future with projects like Valhalla?

Java stands for stability and safety. New language features are added in moderation, like salt to a soup. This is what we can expect from a future Java.

In his recent mission statement Brian Goetz gives us a great overview about the goals of Project Valhalla. From the developer point of view I really love to see that the Java language architects attach great importance to improve the expressiveness of Java. Value types for example will reduce a lot of redundant code and ceremony we are currently confronted with. It is also nice to see that value types will be immutable.

Another feature I’m really looking forward to is the extension of generics. It will allow us to remove several specializations that exist only for primitive types and void. Popular functional interfaces like Predicate, Consumer, Supplier and Runnable will be equivalent to Function. In Javaslang we currently provide additional API for performing side-effects. With extended generics, that API can be reduced to the general case, like it should have been from the beginning.

There are two more features I’m really interested in: local variable type inference, that will come to Java, and reified generics, that might come. Reified generics are needed when we want to get the type of a generic parameter at runtime. We already have type inference for lambdas. Extending it to local variables will increase conciseness and readability of method and lambda bodies while preserving type-safety. I think it is a good idea that we will still have to specify the return type of methods. It is a clear documentation of the API of an application.

I’m deeply impressed how Java and the JVM evolve over time without breaking backward compatibility. It is a safe platform we can rely on. The gap between Java and other, more modern languages is getting smaller but Java is still behind. Some popular features might never come and most probably outdated API will not get a complete refresh or a replacement. This is a field where libraries such as Javaslang can help.

Does Your Database Really Use Your Index?


Adding the right index to speed up your queries is essential. But after a while, as your system grows, you may find yourself with tons of indexes, which all slow down writing to the database – as with each write to the table, the index needs to be updated as well in the same transaction.

Perhaps, 5 years later, your database (and your queries) have evolved in a way that some indexes are no longer needed. For instance, there are some obvious cases when two indexes are redundant:

-- Original design
CREATE INDEX ON customer (first_name);

-- 5 years later
CREATE INDEX ON customer (first_name, last_name);

But in many other cases, things aren’t so obvious. For instance…

  • you may have added an index on a foreign key, but as the table grew, your database started running more hash joins rather than nested loop joins, in which case indexes are not used.
  • Or, you’ve just entirely stopped querying for first names / last names.
  • Or you’ve started using a predicate that is way more selective than actual names.
  • Or your customers are suddenly all called Smith.

Everyone is called Smith – tough luck for indexing!

If your indexes are no longer used, you can (and should) delete them.

But how do you find unused indexes?

If you’re using the Oracle database and you have access to the production system, there’s actually a very nice way to query the diagnostics tables in order to find out whether any query in the cursor cache is currently using your index. Just run:

SELECT sql_fulltext
FROM v$sql
WHERE sql_id IN (
  SELECT sql_id
  FROM v$sql_plan
  WHERE (object_owner, object_name)
     = (('OWNER', 'IDX_CUSTOMER_FIRST_NAME'))
)
ORDER BY sql_text;

What does this query do? It runs through all the SQL statements in the cursor cache (v$sql) and for each one of them, checks if there is any execution plan element in the cursor cache (v$sql_plan) which accesses the index. Done.

Of course, if this doesn’t return any results, it doesn’t mean that no one uses your index. There might still be a very rare query that happens only once a year and has otherwise been purged from the cursor cache.

But if you run this as a job over a longer period of time and it never returns any rows, your index is probably no longer needed.
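
A hedged sketch of what such a job could look like – INDEX_USAGE_LOG is a hypothetical logging table, the hourly interval is arbitrary, and the job owner needs direct SELECT privileges on the underlying V$ views:

-- Hypothetical log table remembering which indexes were seen in the cursor cache
CREATE TABLE index_usage_log (
  logged_at    TIMESTAMP DEFAULT SYSTIMESTAMP,
  object_owner VARCHAR2(128),
  object_name  VARCHAR2(128)
);

-- Sample the cursor cache every hour
BEGIN
  dbms_scheduler.create_job(
    job_name        => 'LOG_INDEX_USAGE',
    job_type        => 'PLSQL_BLOCK',
    job_action      => q'[
      BEGIN
        INSERT INTO index_usage_log (object_owner, object_name)
        SELECT DISTINCT object_owner, object_name
        FROM v$sql_plan
        WHERE object_owner = 'OWNER'
        AND operation LIKE 'INDEX%';
        COMMIT;
      END;]',
    repeat_interval => 'FREQ=HOURLY',
    enabled         => TRUE
  );
END;
/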

Can I discover unneeded indexes?

Sure! Run a similar query that lists all the indexes that are not referenced from the v$sql_plan table:

SELECT owner, index_name
FROM all_indexes
WHERE owner = 'OWNER'
AND (owner, index_name) NOT IN (
  SELECT object_owner, object_name
  FROM v$sql_plan
  WHERE object_owner IS NOT NULL
  AND object_name IS NOT NULL
)
ORDER BY 1, 2

Again, this doesn’t say that your indexes will never be used, but they haven’t been used recently.

Now, I won’t actually show you the query that would take the above statement, iterate over its result in a PL/SQL loop, and drop all the indexes using EXECUTE IMMEDIATE, because you might just actually do that to try it in production. But just in case you do want to try, here’s a hint 🙂

BEGIN
  FOR i IN (/* above query here */) LOOP
    EXECUTE IMMEDIATE 
     'DR0P INDEX "' || i.owner || '"."' || i.index_name || '"';
  END LOOP;
END;
/

But as I said. Don’t actually do this!

Update: New Oracle 12cR2 feature: DBA_INDEX_USAGE

As you can see in the comments section, Oracle 12cR2 now has an exciting new DBA_INDEX_USAGE view that tracks index usage for you.
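
A hedged sketch of querying it – the column names are taken from the 12cR2 documentation, and note that usage information is sampled and flushed periodically rather than tracked in real time:

SELECT owner, name, total_access_count, total_exec_count, last_used
FROM dba_index_usage
WHERE owner = 'OWNER'
ORDER BY last_used NULLS FIRST;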

Thanks, Dan McGhan for mentioning this!

Why You Should Design Your Database to Optimise for Statistics


In my SQL Masterclass, I frequently remind participants of how important statistics are for a modern cost-based optimiser. For instance, consider the fact that in an average e-banking system’s bookings table, you will probably have a transaction amount histogram like the following:

(Figure: transaction amount histogram, roughly bell-shaped around small amounts)

In other words, most of your transactions are probably small things, like a small payment in a restaurant (around USD -50.00, for example), or, on the other side, the restaurant receiving that payment (around USD +50.00). Only a few bookings have an amount of several thousands, or, if you’re an insanely good SQL trainer like me, several USD 100k, autographs included.

That’s a classic bell curve, although nothing really indicates that you should have a bell curve rather than any other distribution of your data across your database. Let’s not delve into the different types of distributions but just accept the fact that there are some value ranges for amount that appear much more frequently than others, and that it’s not always easy to predict this distribution.

So, what does this mean for a database?

Concretely, let’s look at the following two predicates:

-- Predicate 1
WHERE amount BETWEEN -2000 AND 2000

-- Predicate 2
WHERE amount BETWEEN 5000 AND 9000

Both predicates query a range of amount whose size is 4000, so the two ranges are of equal size. But do both queries return the same number of rows? In other words, are both predicates equally selective? And do both predicates profit equally from indexing?

The answer is: Not at all!

Here’s what the first predicate does:

(Figure: histogram with the range amount BETWEEN -2000 AND 2000 highlighted – it covers the bulk of the data)

It’s not selective at all. We might as well scan the entire table to select almost 95% of our rows!

Whereas the second predicate …

(Figure: histogram with the range amount BETWEEN 5000 AND 9000 highlighted – it covers only a thin tail of the data)

Now we’re talking! Selecting only 2% – 3% of our data from the table. We’ll definitely want to use an index for this!

The importance of statistics

I’m frequently surprised by how many developers are shielded from production environments to the point that they don’t know about the importance of database statistics. Or maybe they know that statistics exist, but they’re not sure how important they are.

This is extremely important, folks. Databases make a lot of decisions based on statistics, because cost based optimisers (most databases have one) will use a tremendous amount of information about your data (including statistics) in order to decide what algorithm is the best option to run your SQL query. There’s absolutely no way you could possibly figure out and manually write a better algorithm with a non-4GL language because your assumptions are always completely wrong when you go to production. That’s why SQL is so strong. Stuff still works because the optimiser adapts execution plans as your data characteristics change.

Let me repeat this:

It’s crazy to think that there are still some folks out there who think you (and your ORM) can outsmart a database optimiser when running a complex SQL query. You cannot. Just write the query in SQL already and use the power of your database!

Obviously, you (both you the developer and you the DBA) must always ensure that your statistics are up to date.
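
In Oracle, a hedged sketch of keeping those statistics (including histograms) current might look like this – BOOKINGS is our hypothetical bookings table, and most other databases offer an equivalent mechanism:

BEGIN
  dbms_stats.gather_table_stats(
    ownname    => user,
    tabname    => 'BOOKINGS',
    method_opt => 'FOR ALL COLUMNS SIZE AUTO', -- let Oracle decide where histograms help
    cascade    => TRUE                         -- also gather index statistics
  );
END;
/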

OK, they’re up to date, why is my query still slow?

Check out the above histogram again. Real-world histograms may work slightly (or entirely) differently, but the key here is that in our example, we only have 22 buckets of equal range size (e.g. amount BETWEEN 1000 AND 2000). Each bucket comprises a range [A, B[ where B - A = 1000. It’s just an example, but you get the idea. This is already a great help in assessing that most of our data falls into the middle 4 buckets, i.e. amount BETWEEN -2000 AND 2000.

The way statistics are calculated is simple to understand. We can’t have an infinite number of buckets – we must keep the number of buckets small, because when we write a predicate like amount BETWEEN -2000 AND 2000, the optimiser will need to get a cardinality estimate from 4 buckets (in our example). Imagine the optimiser having to run through thousands of buckets to sum their sizes – it would take a long time for the estimate alone, and the optimiser has to evaluate a large number of possible execution plans to find the best one – we simply don’t have that time.
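
If you’re curious what the buckets for a column actually look like, here’s a hedged sketch for Oracle, again using a hypothetical BOOKINGS table and its AMOUNT column:

-- Inspect the histogram bucket endpoints the optimiser works with
SELECT endpoint_number, endpoint_value
FROM user_tab_histograms
WHERE table_name = 'BOOKINGS'
AND column_name = 'AMOUNT'
ORDER BY endpoint_number;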

So, statistics are high level approximations of our data. Fine, some error is probably OK, we’re still fast enough, right?

Usually yes, but let’s look at a very peculiar edge case, where you might not have thought of statistics at all. Let’s look at a hypothetical YEAR_MONTH data type:

The hypothetical YEAR_MONTH data type

In the following part of the post, I’m using the Oracle database, but everything I’m saying is true for other databases as well.

When we’re in the domain of date time arithmetic, we humans often like to have dates or date parts in a human readable version. Date time arithmetic is hard enough on its own. Here’s some fun reading to illustrate what I mean.

Let’s say we’re in the business of doing reporting. And in reporting (and accounting), we often like to have stuff aggregated by a YEAR_MONTH, i.e. something of the form YYYY-MM.

There’s no native data type for this particular type in most databases. But we can encode it in different ways. E.g. by using a string, number, or date. And for each type, there are different encodings, too. Let’s look at four strategies:

CREATE TABLE dates (
  d_date              DATE        NOT NULL,
  d_string            VARCHAR2(6) NOT NULL,
  d_number            NUMBER(6)   NOT NULL,
  d_number_continuous NUMBER(6)   NOT NULL
);

The idea here is that the d_date column will contain the YEAR_MONTH value truncated to the first day of the month. For instance, we’ll store 2016-12-01 when we really mean 201612. That looks totally OK, right? If you want to enforce data integrity (and you should), you could optionally add a CHECK constraint like this:

ALTER TABLE dates
ADD CONSTRAINT d_date_chk 
  CHECK (d_date = TRUNC(d_date, 'MM'));

The d_string and d_number columns will contain string and number representations of values like '201612' or 201612, respectively. Again, we could add CHECK constraints (sketched below), but that’s not the point.
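
For completeness, hedged sketches of what such constraints might look like, given the formats described above:

ALTER TABLE dates
ADD CONSTRAINT d_string_chk
  CHECK (REGEXP_LIKE(d_string, '^[0-9]{4}(0[1-9]|1[0-2])$'));

ALTER TABLE dates
ADD CONSTRAINT d_number_chk
  CHECK (MOD(d_number, 100) BETWEEN 1 AND 12);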

And the magic d_number_continuous column will encode the month in terms of “number of months since Jan 1970”. Just using some random epoch, by convention, let’s use the unix epoch and start counting from 1, because … well, months (what kind of troll designed JavaScript months to be zero-based is beyond me). So, 201612 is encoded as

564 = (2016 - 1970) * 12 + 12

Yes, 564 is not human readable, but we’ll see afterwards why this is so magic for our query.
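
Converting between the two representations is cheap, though. A quick sketch:

-- From YEAR_MONTH to "months since Jan 1970" (1-based), and back again
SELECT (EXTRACT(YEAR FROM DATE '2016-12-01') - 1970) * 12
       + EXTRACT(MONTH FROM DATE '2016-12-01') AS encoded,  -- 564
       ADD_MONTHS(DATE '1970-01-01', 564 - 1)  AS decoded   -- 2016-12-01
FROM dual;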

Create the table

Let’s see how these encodings fare and compare against each other. First off, let’s add some actual data:

INSERT INTO dates
SELECT 
  trunc(d, 'MM'),
  to_char(d, 'YYYYMM'),
  to_number(to_char(d, 'YYYYMM')),
  (to_number(to_char(d, 'YYYY')) - 1970) * 12 +
  (to_number(to_char(d, 'MM')))
FROM (
  SELECT DATE '2010-01-01' + level AS d 
  FROM dual
  CONNECT BY level <= 100000
) t
ORDER BY dbms_random.random -- Btw: Don’t do this
;

The above query generates 100000 consecutive dates, truncates the YYYY-MM-DD format to YYYYMM only, orders the values randomly to have some more entropy in the test setup (don’t do this in real queries, btw!) and then stores each YEAR_MONTH value in each of the 4 encodings. Easy.

Now, let’s add indexes on each column

CREATE INDEX idx_date              ON dates(d_date);
CREATE INDEX idx_string            ON dates(d_string);
CREATE INDEX idx_number            ON dates(d_number);
CREATE INDEX idx_number_continuous ON dates(d_number_continuous);

And, of course, calculate statistics!

BEGIN
  dbms_stats.gather_table_stats (user, 'DATES');
END;
/

Showtime

Here, we have 4 queries that produce the exact same result, each with a predicate on one of the 4 encodings of our YEAR_MONTH values, each for the range of December 2016 – January 2017, i.e. we should be getting 62 each time:

SELECT count(*) FROM dates 
WHERE d_date BETWEEN DATE '2016-12-01' AND DATE '2017-01-01';

SELECT count(*) FROM dates
WHERE d_string BETWEEN '201612' AND '201701';

SELECT count(*) FROM dates 
WHERE d_number BETWEEN 201612 AND 201701;

SELECT count(*) FROM dates 
WHERE d_number_continuous BETWEEN 564 AND 565;

The result is:

  COUNT(*)
----------
        62

  COUNT(*)
----------
        62

  COUNT(*)
----------
        62

  COUNT(*)
----------
        62

Now, what happens if we SELECT *? Let’s do it, and instead of executing the query, let’s focus on the execution plan:

EXPLAIN PLAN FOR 
SELECT * FROM dates 
WHERE d_date BETWEEN DATE '2016-12-01' AND DATE '2017-01-01';

SELECT * FROM TABLE(dbms_xplan.display);

EXPLAIN PLAN FOR 
SELECT * FROM dates
WHERE d_string BETWEEN '201612' AND '201701';

SELECT * FROM TABLE(dbms_xplan.display);

EXPLAIN PLAN FOR 
SELECT * FROM dates 
WHERE d_number BETWEEN 201612 AND 201701;

SELECT * FROM TABLE(dbms_xplan.display);

EXPLAIN PLAN FOR 
SELECT * FROM dates 
WHERE d_number_continuous BETWEEN 564 AND 565;

SELECT * FROM TABLE(dbms_xplan.display);

The above queries will yield:

--------------------------------------------------------
| Id  | Operation                   | Name     | Rows  |
--------------------------------------------------------
|   0 | SELECT STATEMENT            |          |    92 |
|   1 |  TABLE ACCESS BY INDEX ROWID| DATES    |    92 |
|*  2 |   INDEX RANGE SCAN          | IDX_DATE |    92 |
--------------------------------------------------------

-------------------------------------------
| Id  | Operation         | Name  | Rows  |
-------------------------------------------
|   0 | SELECT STATEMENT  |       |   387 |
|*  1 |  TABLE ACCESS FULL| DATES |   387 |
-------------------------------------------

-------------------------------------------
| Id  | Operation         | Name  | Rows  |
-------------------------------------------
|   0 | SELECT STATEMENT  |       |   387 |
|*  1 |  TABLE ACCESS FULL| DATES |   387 |
-------------------------------------------

---------------------------------------------------------------------
| Id  | Operation                   | Name                  | Rows  |
---------------------------------------------------------------------
|   0 | SELECT STATEMENT            |                       |    91 |
|   1 |  TABLE ACCESS BY INDEX ROWID| DATES                 |    91 |
|*  2 |   INDEX RANGE SCAN          | IDX_NUMBER_CONTINUOUS |    91 |
---------------------------------------------------------------------

Interesting! All of our cardinality estimates (the Rows columns) are off, but the error is excessive for the middle two encodings (d_string and d_number). This leads the database to erroneously think that a full table scan is the best option when we query those columns.

On the other hand, the estimates on the d_date and d_number_continuous columns are still “good enough” for the database to reasonably run the query on the index.

Why is that the case?

In all cases, our statistics have been calculated, but not in all cases have they been accurate and precise enough. Let’s look at a subtle difference between the different encodings, and keep in mind, we’re querying the range 201612 to 201701, where we have a full year switch in between:

  • d_date “wastes” quite a few values in between each valid date, but that’s OK, because the number of values left out of the type’s domain is equally distributed across our data set: there are always between 27 and 30 days per month that never appear in our data. In other words, while we have gaps between our values, their uniform distribution spreads them “evenly” across each statistics bucket, preventing side effects in this case.
  • d_string, on the other hand, has a lot of possible values between '201612' and '201701'. There’s no way the optimiser can know from a sample that a value like '2016aa' will never appear in between, having just slipped past the statistics. So, a range that spans the end of a year is an entirely different kind of range in this data type than, for instance, d_string BETWEEN '201701' AND '201702'.
  • d_number is similar to d_string and suffers from the same problem. There are 88 possible values between 201612 and 201701 that don’t really exist, but the optimiser cannot know this (see the small sanity check right after this list).
  • d_number_continuous appears to be an encoding just as good as d_date with respect to statistics, as there are no gaps at all between consecutive values, so our statistics are the most accurate. Bonus: prove, by finding an edge case, that this is an even better encoding than d_date (or prove that it is not).
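
To make the gap problem tangible, here is the small sanity check referenced above (a sketch, run against the dates table populated earlier), comparing the width of each numeric range with the number of distinct values it actually contains:

SELECT
  max(d_number) - min(d_number) + 1    AS number_range_width,
  count(DISTINCT d_number)             AS number_distinct_values,
  max(d_number_continuous) -
  min(d_number_continuous) + 1         AS continuous_range_width,
  count(DISTINCT d_number_continuous)  AS continuous_distinct_values
FROM dates
WHERE d_number BETWEEN 201612 AND 201701;

Given how the data was generated, the d_number range spans 90 possible integers of which only 2 actually occur, whereas the d_number_continuous range spans exactly the 2 values that occur. That unseen emptiness is precisely what the optimiser has to guess about.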

Edge case

Do note that the above query exposes an edge case where we query for a range that spans over the end of year. We could compare these queries here where we query between May and June:

EXPLAIN PLAN FOR 
SELECT * FROM dates 
WHERE d_date BETWEEN DATE '2016-05-01' AND DATE '2016-06-01';

SELECT * FROM TABLE(dbms_xplan.display);

EXPLAIN PLAN FOR 
SELECT * FROM dates
WHERE d_string BETWEEN '201605' AND '201606';

SELECT * FROM TABLE(dbms_xplan.display);

EXPLAIN PLAN FOR 
SELECT * FROM dates 
WHERE d_number BETWEEN 201605 AND 201606;

SELECT * FROM TABLE(dbms_xplan.display);

EXPLAIN PLAN FOR 
SELECT * FROM dates 
WHERE d_number_continuous BETWEEN 557 AND 558;

SELECT * FROM TABLE(dbms_xplan.display);

And we would now get index range scans for all predicates. Yay! Hmm… yay? The d_string and d_number encodings would even get better cardinality estimates in this case, which is just the other side of the coin, i.e. the other extreme:

--------------------------------------------------------
| Id  | Operation                   | Name     | Rows  |
--------------------------------------------------------
|   0 | SELECT STATEMENT            |          |    92 |
|   1 |  TABLE ACCESS BY INDEX ROWID| DATES    |    92 |
|*  2 |   INDEX RANGE SCAN          | IDX_DATE |    92 |
--------------------------------------------------------

----------------------------------------------------------
| Id  | Operation                   | Name       | Rows  |
----------------------------------------------------------
|   0 | SELECT STATEMENT            |            |    65 |
|   1 |  TABLE ACCESS BY INDEX ROWID| DATES      |    65 |
|*  2 |   INDEX RANGE SCAN          | IDX_STRING |    65 |
----------------------------------------------------------

----------------------------------------------------------
| Id  | Operation                   | Name       | Rows  |
----------------------------------------------------------
|   0 | SELECT STATEMENT            |            |    65 |
|   1 |  TABLE ACCESS BY INDEX ROWID| DATES      |    65 |
|*  2 |   INDEX RANGE SCAN          | IDX_NUMBER |    65 |
----------------------------------------------------------

---------------------------------------------------------------------
| Id  | Operation                   | Name                  | Rows  |
---------------------------------------------------------------------
|   0 | SELECT STATEMENT            |                       |    91 |
|   1 |  TABLE ACCESS BY INDEX ROWID| DATES                 |    91 |
|*  2 |   INDEX RANGE SCAN          | IDX_NUMBER_CONTINUOUS |    91 |
---------------------------------------------------------------------

However! Don’t be fooled into thinking that the accidentally better cardinality estimate is a good thing. It just shows that the estimation error fluctuates a lot more for d_string and d_number. Let’s query May – August instead, i.e. roughly double the amount of data (123 rows):

EXPLAIN PLAN FOR 
SELECT * FROM dates 
WHERE d_date BETWEEN DATE '2016-05-01' AND DATE '2016-08-01';

SELECT * FROM TABLE(dbms_xplan.display);

EXPLAIN PLAN FOR 
SELECT * FROM dates
WHERE d_string BETWEEN '201605' AND '201608';

SELECT * FROM TABLE(dbms_xplan.display);

EXPLAIN PLAN FOR 
SELECT * FROM dates 
WHERE d_number BETWEEN 201605 AND 201608;

SELECT * FROM TABLE(dbms_xplan.display);

EXPLAIN PLAN FOR 
SELECT * FROM dates 
WHERE d_number_continuous BETWEEN 557 AND 560;

SELECT * FROM TABLE(dbms_xplan.display);

Here’s what we’re getting:

-------------------------------------------
| Id  | Operation         | Name  | Rows  |
-------------------------------------------
|   0 | SELECT STATEMENT  |       |   153 |
|*  1 |  TABLE ACCESS FULL| DATES |   153 |
-------------------------------------------

----------------------------------------------------------
| Id  | Operation                   | Name       | Rows  |
----------------------------------------------------------
|   0 | SELECT STATEMENT            |            |    72 |
|   1 |  TABLE ACCESS BY INDEX ROWID| DATES      |    72 |
|*  2 |   INDEX RANGE SCAN          | IDX_STRING |    72 |
----------------------------------------------------------

----------------------------------------------------------
| Id  | Operation                   | Name       | Rows  |
----------------------------------------------------------
|   0 | SELECT STATEMENT            |            |    72 |
|   1 |  TABLE ACCESS BY INDEX ROWID| DATES      |    72 |
|*  2 |   INDEX RANGE SCAN          | IDX_NUMBER |    72 |
----------------------------------------------------------

-------------------------------------------
| Id  | Operation         | Name  | Rows  |
-------------------------------------------
|   0 | SELECT STATEMENT  |       |   152 |
|*  1 |  TABLE ACCESS FULL| DATES |   152 |
-------------------------------------------

Now we’re getting the inverse situation. Our cardinality estimates are way too low for the d_string and d_number queries, which results in an index range scan that is estimated to be rather cheap, when in fact it is much more expensive. The full table scan is what we should be doing in this case, as it will ultimately perform faster for this query.

For this example, we can conclude that sometimes we’re lucky (May – June), but very often we’re very unlucky (December – January, or May – August) and the optimiser will choose the wrong plan, despite our statistics being up to date.
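
If you want to verify estimated vs. actual cardinalities on your own system, one common approach (sketched here) is to execute the query with the GATHER_PLAN_STATISTICS hint and then display the cursor’s runtime statistics; the E-Rows and A-Rows columns of the resulting plan show the estimate and the reality side by side:

SELECT /*+ GATHER_PLAN_STATISTICS */ count(*)
FROM dates
WHERE d_number BETWEEN 201605 AND 201608;

-- Note: with SERVEROUTPUT ON, the "last" cursor may be the DBMS_OUTPUT call instead
SELECT *
FROM TABLE(dbms_xplan.display_cursor(format => 'ALLSTATS LAST'));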

Conclusion

These things matter! When we design our data in a way that makes it hard for a database to gather accurate statistics (without domain knowledge), we don’t help the optimiser as much as we could. Some domains are very hard to encode in a way that makes statistics accurate. Other domains (especially temporal ones) are extremely easy to encode “properly”. As a general rule of thumb, do remember that:

  • Actual DATE or TIMESTAMP types are great
  • Discrete numeric encodings that start at some epoch are good as well (e.g. the unix timestamp), although date/time arithmetic may become a bit harder
  • String representations are bad, because their values are never uniformly distributed over the string domain, even when the underlying data is. That’s a very high price to pay for a bit of readability

The conclusion is:

Sometimes you should help the machine do its job a bit more than you should help the human reading the data. The latter can still be done in the UI.
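
For instance, turning the continuous month number back into a readable label is a one-liner at presentation time (a sketch, assuming the (YYYY - 1970) * 12 + MM encoding used in the INSERT above):

SELECT DISTINCT
  d_number_continuous,
  -- 564 decodes to '2016-12', 565 to '2017-01'
  to_char(
    add_months(DATE '1970-01-01', d_number_continuous - 1),
    'YYYY-MM'
  ) AS readable_month
FROM dates
WHERE d_number_continuous IN (564, 565);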

Intrigued? Want to learn more? Book our SQL Masterclass!

When to Choose SQL and When to Choose NoSQL


Some people make architecture decisions purely based on the loudest consultant.

For most others, however, decisions are not as simple as this. For instance: When should we start evaluating NoSQL storage systems as an alternative to RDBMS?

RDBMS as the hammer for all nails

This question obviously assumes that you’re starting out with an RDBMS, which is classically the database system that solves pretty much any problem decently enough not to be replaced easily. What does this mean? Simply put:

  • RDBMS have been around forever, so they have a huge advantage compared to “newcomers” in the market, who don’t have all the excellent tooling, community, support, maturity yet
  • E.F. Codd’s work may have been the single biggest influence on our whole industry. There has hardly been anything as revolutionary as the relational model ever since. It’s hard for an alternative database to be equally universal, i.e. they’re mostly solving niche problems

Having said that, sometimes you do have a niche problem, for instance a graph database problem. In fact, a graph is nothing fundamentally different from what you can represent in the relational model: it is easy to model a graph with a many-to-many relationship table, as sketched below.
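
For illustration, here’s a minimal sketch of such a model, traversed with Oracle’s CONNECT BY syntax (the node/edge tables and column names are made up for this example):

CREATE TABLE node (
  node_id NUMBER(18) NOT NULL,
  CONSTRAINT pk_node PRIMARY KEY (node_id)
);

CREATE TABLE edge (
  from_node NUMBER(18) NOT NULL REFERENCES node,
  to_node   NUMBER(18) NOT NULL REFERENCES node,
  CONSTRAINT pk_edge PRIMARY KEY (from_node, to_node)
);

-- All nodes reachable from node 1
SELECT DISTINCT to_node
FROM edge
START WITH from_node = 1
CONNECT BY NOCYCLE PRIOR to_node = from_node;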

The same is true for XML/JSON in the database (don’t forget, JSON is just XML with less syntax and fewer features, so it’s less awesome than XML). Sometimes you need to store document structures (hierarchical data) in their hierarchical form in your database, instead of normalising them. Sure, you could normalise a document, but that’s a lot of work and not necessarily beneficial. This was actually the question by Gareth on Twitter that triggered the subsequent discussion.

Most modern RDBMS offer means to store (and much more importantly, query) XML/JSON data structures in simple forms. This includes PostgreSQL, Oracle, DB2, SQL Server and others.
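
As an illustration of the “occasional JSON” case, here is a small sketch using Oracle 12c’s native JSON support (the table and JSON structure are made up for this example; PostgreSQL, SQL Server, and others have their own equivalents):

CREATE TABLE customer_event (
  event_id NUMBER(18) NOT NULL,
  payload  CLOB NOT NULL CHECK (payload IS JSON), -- validate documents on insert
  CONSTRAINT pk_customer_event PRIMARY KEY (event_id)
);

-- Query scalar values out of the stored documents
SELECT json_value(payload, '$.customer.name') AS customer_name
FROM customer_event
WHERE json_value(payload, '$.type') = 'SIGNUP';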

So, when do we decide to switch?

As developers, we may be inclined to switch quite quickly. E.g. when working with graphs, we’d love to use Neo4j, which implements the awesome Cypher query language. When working with JSON, we’d love to use something like Couchbase, which implements the interesting N1QL query language. Both of these query languages are heavily inspired by SQL, which was a wise choice by those vendors in my opinion (as opposed to MongoDB’s weird JSON-based query language), because in the end, the SQL language is still the most powerful and popular 4GL ever created.

But as developers, we shouldn’t make that decision lightly. Yes, at first, these specialised databases seem like a better fit. But the additional cost for operations teams to maintain, monitor, patch, and tune production systems must not be underestimated. This is true even within the RDBMS ecosystem. A prominent recent example was Uber’s switch (back!) from PostgreSQL to MySQL:

Why Uber Engineering Switched from Postgres to MySQL

Do note that they had switched the other way round before, only to regret it. Truth is, there are tons of reasons why your operations teams prefer to always use the same database, even if that’s quite expensive in terms of licensing. In many cases, however, it would be even more expensive to:

  • Engage in additional license and/or support contracts with new database vendors
  • Find skilled DBAs for the new technology (which can be very hard with niche databases)
  • Maintain two data silos and possibly sustain the cost of keeping them in sync

Ultimately, there’s a threshold.

In terms of using JSON in the database, it’s simple:

  • Occasional JSON storage: Stick with RDBMS.
  • Everything is JSON: Perhaps not RDBMS.

The same is true for graph problems. SQL can totally handle graphs and traverse them with recursive SQL. There is, for instance, a funky statement that recursively solves the closest subset sum problem; see the full article here: How to Find the Closest Subset Sum with SQL

If you only have a few tree/graph traversals to calculate (e.g. for a simple menu structure), don’t jump ship from your RDBMS yet. If graphs are your business and that’s all you do, then an RDBMS probably isn’t the right choice.

Conclusion

Whatever problem you’re solving, do remember: yes, when all you have is a hammer, every problem starts looking like a nail. But don’t dismiss the RDBMS as a silly hammer. It is a very mighty hammer, and in 2016, it can do a lot of non-relational niche things decently well.

An RDBMS is still the best default choice for all sorts of data problems. Only once you have gone above a certain threshold (or if you can foresee doing so) should you look for alternatives, because those alternatives will give you a much rougher time once you step outside their niche (JSON, graphs, etc.) and back into your “ordinary” relational business.

A Hidden jOOQ Gem: Foreach Loop Over ResultQuery


A recent question on Stack Overflow about jOOQ caught my attention. The question essentially asked:

Why do both of these loops work?

// With fetch()
for (MyTableRecord rec : DSL
    .using(configuration)
    .selectFrom(MY_TABLE)
    .orderBy(MY_TABLE.COLUMN)
    .fetch()) { // fetch() here

    doThingsWithRecord(rec);
}

// Without fetch()
for (MyTableRecord rec : DSL
    .using(configuration)
    .selectFrom(MY_TABLE)
    .orderBy(MY_TABLE.COLUMN)) { // No fetch() here

    doThingsWithRecord(rec);
}

And indeed, just like in PL/SQL, you can use any jOOQ ResultQuery as a Java 5 Iterable, because that’s what it is: an Iterable<R> where R extends Record.

The semantics are simple: when Iterable.iterator() is invoked, the query is executed and Result.iterator() is returned. So, the result is materialised in client memory just as if fetch() had been called. Unsurprisingly, this is the implementation of AbstractResultQuery.iterator():

@Override
public final Iterator<R> iterator() {
    return fetch().iterator();
}

No magic. But it’s great that this works like PL/SQL:

FOR rec IN (SELECT * FROM my_table ORDER BY my_table.column)
LOOP
  doThingsWithRecord(rec);
END LOOP;

Note that, unfortunately, there’s no easy way to manage resources through Iterable, i.e. there’s no AutoCloseableIterable returning an AutoCloseableIterator, which could be used in an auto-closing, try-with-resources style loop. This is why the entire result set needs to be fetched at the beginning of the loop. For lazy fetching, you can still use ResultQuery.fetchLazy():

try (Cursor<MyTableRecord> cursor = DSL
    .using(configuration)
    .selectFrom(MY_TABLE)
    .orderBy(MY_TABLE.COLUMN)
    .fetchLazy()) {

    for (MyTableRecord rec : cursor)
        doThingsWithRecord(rec);
}

Happy coding!