Why You Should Design Your Database to Optimise for Statistics

In my SQL Masterclass, I frequently remind participants of how important statistics are for a modern cost-based optimiser. Consider an average E-Banking system’s bookings table: its transaction amounts will probably follow a histogram like the following:

[figure: histogram of transaction amounts, peaking around small values]

In other words, most of your transactions are small things, like a payment in a restaurant (around USD -50.00), or, on the other side, the restaurant receiving that payment (around USD +50.00). Only few bookings have an amount of several thousands, or, if you’re an insanely good SQL trainer like me, several USD 100k, autographs included. That’s a classic bell curve, although nothing really says you should get a bell curve rather than any other distribution of your data. Let’s not delve into the different types of distributions, but simply accept that some value ranges for amount appear much more frequently than others, and that it’s not always easy to predict this distribution.

So, what does this mean for a database?

Concretely, let’s look at the following two predicates:

-- Predicate 1
WHERE amount BETWEEN -2000 AND 2000

-- Predicate 2
WHERE amount BETWEEN 5000 AND 9000

Both predicates query a range of amount whose size is 4000. The two ranges are of equal size. But do both queries return the same number of rows? In other words, are both predicates equally selective? Or, again, do both predicates benefit from indexing? The answer is: not at all! Here’s what the first predicate does:

[figure: histogram with the range -2000 to 2000 highlighted]

It’s not selective at all. We might as well scan the entire table, as we’re selecting almost 95% of our rows! Whereas the second predicate …

[figure: histogram with the range 5000 to 9000 highlighted]

… now we’re talking! It selects only 2% – 3% of the data from the table. We’ll definitely want to use an index for this!
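If you have such a table at hand, you can measure the actual selectivity of the two predicates yourself. Here’s a quick sketch, assuming a hypothetical bookings(amount) table like the one described above:

-- A sketch for measuring actual selectivity (bookings and amount are
-- hypothetical names, not from a real schema):
SELECT 100 * count(CASE WHEN amount BETWEEN -2000 AND 2000 THEN 1 END)
           / count(*) AS pct_predicate_1,
       100 * count(CASE WHEN amount BETWEEN  5000 AND 9000 THEN 1 END)
           / count(*) AS pct_predicate_2
FROM bookings;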

The importance of statistics

I’m frequently surprised by how many developers are shielded from production environments to the point that they don’t know about the importance of database statistics. Or maybe they know that statistics exist, but they’re not sure how important they are. This is extremely important, folks. Databases make a lot of decisions based on statistics, because cost-based optimisers (most databases have one) use a tremendous amount of information about your data (including statistics) to decide which algorithm is the best option to run your SQL query. There’s no way you can figure out and hand-write a better algorithm in a non-4GL language, because your assumptions about the data will be wrong by the time you go to production. That’s why SQL is so strong: stuff still works because the optimiser adapts execution plans as your data characteristics change. Let me repeat this:
It’s crazy to think that there are still some folks out there who think you (and your ORM) can outsmart a database optimiser when running a complex SQL query. You cannot. Just write the query in SQL already and use the power of your database!
Obviously, you (both you the developer and you the DBA) must always ensure that your statistics are up to date.

OK, they’re up to date, why is my query still slow?

Check out the above histogram again. Real-world histograms may work slightly (or entirely) differently, but the key here is that in our example, we only have 22 buckets of equal range size (e.g. amount BETWEEN 1000 AND 2000). Each bucket comprises a half-open range [A, B[ where B - A = 1000. It’s just an example, but you get the idea. This is already a great help in assessing that most of our data falls into the middle 4 buckets, amount BETWEEN -2000 AND 2000.

The way statistics are calculated is simple to understand. We can’t have an infinite number of buckets – we must keep the number of buckets small, because when we write a predicate like amount BETWEEN -2000 AND 2000, the optimiser needs to get a cardinality estimate from 4 buckets (in our example). Imagine the optimiser having to run through thousands of buckets to sum up their sizes – the estimate alone would take a long time, and the optimiser has to evaluate a large number of possible execution plans to find the best one – we simply don’t have that time. So, statistics are high-level approximations of our data.

Fine, some error is probably OK, we’re still fast enough, right? Usually yes, but let’s look at a very peculiar edge case, where you might not have thought of statistics at all: a hypothetical YEAR_MONTH data type.

The hypothetical YEAR_MONTH data type

In the following part of the post, I’m using the Oracle database, but everything I’m saying holds for other databases as well. When we’re in the domain of date/time arithmetic, we humans often like to have dates or date parts in a human-readable form. Date/time arithmetic is hard enough on its own. Here’s some fun reading to illustrate what I mean. Let’s say we’re in the business of doing reporting. And in reporting (and accounting), we often like to have things aggregated by a YEAR_MONTH, i.e. something of the form YYYY-MM. There’s no native data type for this in most databases, but we can encode it in different ways, e.g. by using a string, a number, or a date. And for each type, there are different possible encodings, too. Let’s look at four strategies:

CREATE TABLE dates (
  d_date              DATE        NOT NULL,
  d_string            VARCHAR2(6) NOT NULL,
  d_number            NUMBER(6)   NOT NULL,
  d_number_continuous NUMBER(6)   NOT NULL
);

The idea here is that the d_date column will contain the YEAR_MONTH value as the first day of the month. For instance, we’ll store 2016-12-01 when we really mean 201612. That looks totally OK, right? If you want to enforce data integrity (and you should), you can optionally add a CHECK constraint like this:

ALTER TABLE dates
ADD CONSTRAINT d_date_chk 
  CHECK (d_date = TRUNC(d_date, 'MM'));

The d_string and d_number columns will contain string and number representations of values like '201612' or 201612, respectively. Again, we could add CHECK constraints, but that’s not the point. And the magic d_number_continuous column will encode the month as the “number of months since January 1970”. We just need some arbitrary epoch; by convention, let’s use the Unix epoch, and start counting from 1, because … well, months (what kind of troll designed JavaScript months to be zero-based is beyond me). So, 201612 is encoded as
564 = (2016 - 1970) * 12 + 12
Yes, 564 is not human-readable, but we’ll see shortly why this encoding is so magic for our queries.
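Decoding goes the other way round. Just to make the arithmetic explicit, here’s a quick sketch:

-- Decode d_number_continuous = 564 back into year and month (a sketch):
SELECT 1970 + trunc((564 - 1) / 12) AS yyyy,  -- 2016
       mod(564 - 1, 12) + 1         AS mm     -- 12
FROM dual;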

Create the table

Let’s see how these encodings fare and compare against each other. First off, let’s add some actual data:

INSERT INTO dates
SELECT 
  trunc(d, 'MM'),
  to_char(d, 'YYYYMM'),
  to_number(to_char(d, 'YYYYMM')),
  (to_number(to_char(d, 'YYYY')) - 1970) * 12 +
  (to_number(to_char(d, 'MM')))
FROM (
  SELECT DATE '2010-01-01' + level AS d 
  FROM dual
  CONNECT BY level <= 100000
) t
ORDER BY dbms_random.random -- Btw: Don’t do this
;

The above query generates 100000 consecutive dates, truncates each date to its month (i.e. keeps only the YYYY-MM information), orders the values randomly to add some more entropy to the test setup (don’t do this in real queries, btw!), and then stores each YEAR_MONTH value in each of the 4 encodings. Easy. A quick sanity check of the load might look like the sketch below.
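-- Optional sanity check (a sketch): row count and month range of the load
SELECT count(*)               AS total_rows,
       min(d_date)            AS first_month,
       max(d_date)            AS last_month,
       count(DISTINCT d_date) AS distinct_months
FROM dates;

Now, let’s add indexes on each column: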

CREATE INDEX idx_date              ON dates(d_date);
CREATE INDEX idx_string            ON dates(d_string);
CREATE INDEX idx_number            ON dates(d_number);
CREATE INDEX idx_number_continuous ON dates(d_number_continuous);

And, of course, calculate statistics!

BEGIN
  dbms_stats.gather_table_stats (user, 'DATES');
END;
/
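If you’re curious what the optimiser now knows about these columns, you can peek at the gathered statistics in the data dictionary. A sketch (note that, depending on your METHOD_OPT defaults and prior query workload, you may or may not see actual histograms on a freshly created table):

-- Column-level statistics, including the number of histogram buckets:
SELECT column_name, num_distinct, num_buckets, histogram
FROM user_tab_col_statistics
WHERE table_name = 'DATES';

-- Individual histogram bucket endpoints (if any):
SELECT column_name, endpoint_number, endpoint_value
FROM user_tab_histograms
WHERE table_name = 'DATES'
ORDER BY column_name, endpoint_number;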

Showtime

Here, we have 4 queries that produce the exact same result, each with a predicate on one of the 4 encodings of our YEAR_MONTH values, each querying the range December 2016 – January 2017, i.e. we should be getting 62 rows each time:

SELECT count(*) FROM dates 
WHERE d_date BETWEEN DATE '2016-12-01' AND DATE '2017-01-01';

SELECT count(*) FROM dates
WHERE d_string BETWEEN '201612' AND '201701';

SELECT count(*) FROM dates 
WHERE d_number BETWEEN 201612 AND 201701;

SELECT count(*) FROM dates 
WHERE d_number_continuous BETWEEN 564 AND 565;

The result is:
  COUNT(*)
----------
        62

  COUNT(*)
----------
        62

  COUNT(*)
----------
        62

  COUNT(*)
----------
        62
Now, what happens if we SELECT *? Let’s do it, and instead of executing the query, let’s focus on the execution plan:

EXPLAIN PLAN FOR 
SELECT * FROM dates 
WHERE d_date BETWEEN DATE '2016-12-01' AND DATE '2017-01-01';

SELECT * FROM TABLE(dbms_xplan.display);

EXPLAIN PLAN FOR 
SELECT * FROM dates
WHERE d_string BETWEEN '201612' AND '201701';

SELECT * FROM TABLE(dbms_xplan.display);

EXPLAIN PLAN FOR 
SELECT * FROM dates 
WHERE d_number BETWEEN 201612 AND 201701;

SELECT * FROM TABLE(dbms_xplan.display);

EXPLAIN PLAN FOR 
SELECT * FROM dates 
WHERE d_number_continuous BETWEEN 564 AND 565;

SELECT * FROM TABLE(dbms_xplan.display);

The above queries will yield:
--------------------------------------------------------
| Id  | Operation                   | Name     | Rows  |
--------------------------------------------------------
|   0 | SELECT STATEMENT            |          |    92 |
|   1 |  TABLE ACCESS BY INDEX ROWID| DATES    |    92 |
|*  2 |   INDEX RANGE SCAN          | IDX_DATE |    92 |
--------------------------------------------------------

-------------------------------------------
| Id  | Operation         | Name  | Rows  |
-------------------------------------------
|   0 | SELECT STATEMENT  |       |   387 |
|*  1 |  TABLE ACCESS FULL| DATES |   387 |
-------------------------------------------

-------------------------------------------
| Id  | Operation         | Name  | Rows  |
-------------------------------------------
|   0 | SELECT STATEMENT  |       |   387 |
|*  1 |  TABLE ACCESS FULL| DATES |   387 |
-------------------------------------------

---------------------------------------------------------------------
| Id  | Operation                   | Name                  | Rows  |
---------------------------------------------------------------------
|   0 | SELECT STATEMENT            |                       |    91 |
|   1 |  TABLE ACCESS BY INDEX ROWID| DATES                 |    91 |
|*  2 |   INDEX RANGE SCAN          | IDX_NUMBER_CONTINUOUS |    91 |
---------------------------------------------------------------------
Interesting! All of our cardinality estimates (the Rows columns) are off, but the amount by which they are off is excessive for the middle two encodings (d_string and d_number). This leads to the database erroneously thinking that a full table scan is the best option when we query those columns. On the other hand, the estimates for the d_date and d_number_continuous columns are still “good enough” for the query to be reasonably run on the index.

Why is that the case?

In all cases, our statistics have been calculated, but not in all cases have they been accurate and precise enough. Let’s look at a subtle difference between the encodings, and keep in mind that we’re querying the range 201612 to 201701, which spans a year boundary:
  • d_date “wastes” quite a few values in between each valid date, but that’s OK, because the number of values left out of the type’s domain is roughly equally distributed across our data set: we always have between 27 and 30 days that never appear in our data. In other words, while we have gaps between our values, the uniform distribution spreads values “evenly” across the statistics buckets, preventing side effects in this case.
  • d_string, on the other hand, has a lot of possible values between '201612' and '201701'. There’s no way the optimiser can know from a sample that there will never be a value like '2016aa' in between, which might have just slipped by the statistics (see the sketch after this list). So, a range that spans the end of a year is an entirely different kind of range in this data type than, for instance, d_string BETWEEN '201701' AND '201702'.
  • d_number is similar to d_string and suffers from the same problem. The range from 201612 to 201701 spans 90 integer values, of which 88 (201613 through 201700) can never actually occur in our data, but the optimiser cannot know this.
  • d_number_continuous appears to be an encoding just as good as d_date with respect to statistics, as there are no possible gaps between consecutive values, so our statistics are the most accurate here. Bonus: prove, by finding an edge case, that this is a better encoding than d_date (or prove that it is not).
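To make the d_string point concrete, here’s a quick check (a sketch) showing that a value like '2016aa' sorts inside the queried string range, even though it can never occur in our data:

-- '2016aa' falls lexicographically between '201612' and '201701':
SELECT CASE
         WHEN '2016aa' BETWEEN '201612' AND '201701'
         THEN 'inside the d_string range'
         ELSE 'outside the d_string range'
       END AS result
FROM dual;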

Edge case

Do note that the above query exposes an edge case, where we query a range that spans the end of a year. Compare this with the following queries, which query the range between May and June:

EXPLAIN PLAN FOR 
SELECT * FROM dates 
WHERE d_date BETWEEN DATE '2016-05-01' AND DATE '2016-06-01';

SELECT * FROM TABLE(dbms_xplan.display);

EXPLAIN PLAN FOR 
SELECT * FROM dates
WHERE d_string BETWEEN '201605' AND '201606';

SELECT * FROM TABLE(dbms_xplan.display);

EXPLAIN PLAN FOR 
SELECT * FROM dates 
WHERE d_number BETWEEN 201605 AND 201606;

SELECT * FROM TABLE(dbms_xplan.display);

EXPLAIN PLAN FOR 
SELECT * FROM dates 
WHERE d_number_continuous BETWEEN 557 AND 558;

SELECT * FROM TABLE(dbms_xplan.display);

And we would now get index range scans for all four predicates. Yay! Hmm… yay? The d_string and d_number encodings even get better cardinality estimates in this case, but that’s just the other side of the coin, i.e. the other extreme:
--------------------------------------------------------
| Id  | Operation                   | Name     | Rows  |
--------------------------------------------------------
|   0 | SELECT STATEMENT            |          |    92 |
|   1 |  TABLE ACCESS BY INDEX ROWID| DATES    |    92 |
|*  2 |   INDEX RANGE SCAN          | IDX_DATE |    92 |
--------------------------------------------------------

----------------------------------------------------------
| Id  | Operation                   | Name       | Rows  |
----------------------------------------------------------
|   0 | SELECT STATEMENT            |            |    65 |
|   1 |  TABLE ACCESS BY INDEX ROWID| DATES      |    65 |
|*  2 |   INDEX RANGE SCAN          | IDX_STRING |    65 |
----------------------------------------------------------

----------------------------------------------------------
| Id  | Operation                   | Name       | Rows  |
----------------------------------------------------------
|   0 | SELECT STATEMENT            |            |    65 |
|   1 |  TABLE ACCESS BY INDEX ROWID| DATES      |    65 |
|*  2 |   INDEX RANGE SCAN          | IDX_NUMBER |    65 |
----------------------------------------------------------

---------------------------------------------------------------------
| Id  | Operation                   | Name                  | Rows  |
---------------------------------------------------------------------
|   0 | SELECT STATEMENT            |                       |    91 |
|   1 |  TABLE ACCESS BY INDEX ROWID| DATES                 |    91 |
|*  2 |   INDEX RANGE SCAN          | IDX_NUMBER_CONTINUOUS |    91 |
---------------------------------------------------------------------
However! Don’t be fooled into thinking that the accidentally better cardinality estimate is a good thing. It just means that the estimate’s error fluctuates a lot more for d_string and d_number. Let’s query May – August, for instance, fetching twice the amount of data (123 rows):

EXPLAIN PLAN FOR 
SELECT * FROM dates 
WHERE d_date BETWEEN DATE '2016-05-01' AND DATE '2016-08-01';

SELECT * FROM TABLE(dbms_xplan.display);

EXPLAIN PLAN FOR 
SELECT * FROM dates
WHERE d_string BETWEEN '201605' AND '201608';

SELECT * FROM TABLE(dbms_xplan.display);

EXPLAIN PLAN FOR 
SELECT * FROM dates 
WHERE d_number BETWEEN 201605 AND 201608;

SELECT * FROM TABLE(dbms_xplan.display);

EXPLAIN PLAN FOR 
SELECT * FROM dates 
WHERE d_number_continuous BETWEEN 557 AND 560;

SELECT * FROM TABLE(dbms_xplan.display);

Here’s what we’re getting:
-------------------------------------------
| Id  | Operation         | Name  | Rows  |
-------------------------------------------
|   0 | SELECT STATEMENT  |       |   153 |
|*  1 |  TABLE ACCESS FULL| DATES |   153 |
-------------------------------------------

----------------------------------------------------------
| Id  | Operation                   | Name       | Rows  |
----------------------------------------------------------
|   0 | SELECT STATEMENT            |            |    72 |
|   1 |  TABLE ACCESS BY INDEX ROWID| DATES      |    72 |
|*  2 |   INDEX RANGE SCAN          | IDX_STRING |    72 |
----------------------------------------------------------

----------------------------------------------------------
| Id  | Operation                   | Name       | Rows  |
----------------------------------------------------------
|   0 | SELECT STATEMENT            |            |    72 |
|   1 |  TABLE ACCESS BY INDEX ROWID| DATES      |    72 |
|*  2 |   INDEX RANGE SCAN          | IDX_NUMBER |    72 |
----------------------------------------------------------

-------------------------------------------
| Id  | Operation         | Name  | Rows  |
-------------------------------------------
|   0 | SELECT STATEMENT  |       |   152 |
|*  1 |  TABLE ACCESS FULL| DATES |   152 |
-------------------------------------------
Now we’re getting the inverse situation. Our cardinality estimates are way too low for the d_string and d_number column queries, which results in an index range scan that was estimated to be rather cheap, when it is in fact much more expensive. The full table scan is what we should be doing in this case, as it will ultimately perform faster for this query. For this example, we can conclude that sometimes we’re lucky (May – June), but very often we’re very unlucky (December – January, or May – August) and the optimiser will choose the wrong plan, despite our statistics being up to date.
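If you want to see how far off the estimates are on your own system, one option (a sketch; unlike EXPLAIN PLAN, this actually executes the query and requires access to the V$ views) is to combine the GATHER_PLAN_STATISTICS hint with DBMS_XPLAN.DISPLAY_CURSOR, and then compare the estimated (E-Rows) against the actual (A-Rows) cardinalities:

-- Run the query with runtime statistics enabled ...
SELECT /*+ GATHER_PLAN_STATISTICS */ *
FROM dates
WHERE d_number BETWEEN 201605 AND 201608;

-- ... then display the last cursor with estimated vs. actual row counts:
SELECT * FROM TABLE(dbms_xplan.display_cursor(format => 'ALLSTATS LAST'));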

Conclusion

These things matter! When we design our data in a way that makes it hard for a database to gather accurate statistics (without domain knowledge), we don’t help the optimiser as much as we could. Some domains are very hard to encode in a way that makes statistics accurate. Other domains (especially temporal ones) are extremely easy to encode “properly”. As a general rule of thumb, do remember that:
  • Actual DATE or TIMESTAMP types are great
  • Discrete numeric encodings that start at some epoch (e.g. the Unix timestamp) are good as well, although date/time arithmetic may become a bit harder
  • String representations are bad, because the actual values never cover the string domain uniformly, even though the optimiser has to assume they do. That’s a very high price to pay for a bit of readability
The conclusion is:
Sometimes you should help the machine do its job a bit more than you should help the human reading the data. The latter can still be done in the UI.
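For example (a sketch), you can keep filtering on the statistics-friendly d_date column and derive the human-readable representation only when presenting the data:

-- Filter on the statistics-friendly column, format only for display:
SELECT to_char(d_date, 'YYYY-MM') AS year_month,
       count(*)                   AS cnt
FROM dates
WHERE d_date BETWEEN DATE '2016-12-01' AND DATE '2017-01-01'
GROUP BY to_char(d_date, 'YYYY-MM')
ORDER BY year_month;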
Intrigued? Want to learn more? Book our SQL Masterclass!

Comments

  1. Thanks for this thorough review. While I hadn’t considered the d_string and d_number options, it was interesting to see how an optimizer could potentially be fooled into making the wrong decision (as well as be lucky in making the right decision). d_number_continuous gives me something to consider.

  2. Great article!
    “It’s crazy to think that there are still some folks out there who think you (and your ORM) can outsmart a database optimiser when running a complex SQL query.”
    That’s my intuition as well. Still, I often hear the advice to avoid joins by executing multiple queries, effectively doing the join yourself. E.g. to fetch all posts of a user given the username, instead of doing it directly in a single query, you would first look up the user’s ID and then use it to filter the posts.
    So, the query is quite simple and has an “obviously” preferred execution plan, which you enforce by doing it yourself. What are your thoughts on this?
