A very interesting feature of the SQL Server and PostgreSQL databases (and some others, including SQLite) is the
partial index (
sometimes also called “filtered index”). That’s an index that contains only “parts” of the table data. For instance, we can write the following index in SQL Server and PostgreSQL:
CREATE INDEX i ON message WHERE deleted = 1;
Let’s imagine you have a house keeping job that periodically removes deleted messages. Now, let’s assume you have discovered, that only 0.1% of all messages are really deleted, so an index on the
DELETED
column is very selective
if you’re looking for deleted messages.
On the other hand, it’s not selective
at all if you’re looking for non-deleted messages, as such a query would return 99.9% of all messages, in case of which a full table scan is more efficient.
So, since the index is never useful for non-deleted messages, why index those messages at all? If we can avoid indexing non-deleted messages, then we can:
- Save a lot of disk space, as the index will be much smaller
- Save a lot of time inserting into the messages table, since we don’t have to update the index all the time
- Save a lot of time scanning the index, since it will contain a lot less blocks
Unfortunately, Oracle doesn’t support this feature
… but “luckily”, Oracle has another controversial “feature”. In Oracle,
all indexes are partial indexes, because they don’t contain
NULL
values. This is probably due to an ancient optimisation (remember, partial indexes are smaller), which occasionally gets into your way, performance wise, if you do want to query for
NULL
values.
But in this case, it’s useful. Check this out:
CREATE TABLE message(deleted number(1));
CREATE INDEX i ON message (
CASE WHEN deleted > 0 THEN deleted END
);
The above index is a function-based index, i.e. an index that contains not the value of the deleted column itself, but an expression based on it. Concretely, it contains only deleted values that are strictly greater than zero, because if the value is zero, then it is turned to
NULL
by the
CASE
expression, and Oracle doesn’t index
NULL
values. Check this out:
INSERT INTO message
SELECT DECODE(LEVEL, 1, 1, 0)
FROM dual
CONNECT BY LEVEL < 100000;
The above query is inserting a single row containing a deleted value of 1, and almost 100k rows containing a value of 0. The insert is very quick, because only one row has to be added to the index. The other almost 100k rows are skipped:
EXEC dbms_stats.gather_table_stats('SANDBOX', 'MESSAGE');
SELECT NUM_ROWS, BLOCKS
FROM SYS.ALL_TAB_STATISTICS
WHERE TABLE_NAME = 'MESSAGE';
SELECT NUM_ROWS, LEAF_BLOCKS
FROM SYS.ALL_IND_STATISTICS
WHERE TABLE_NAME = 'MESSAGE';
The result is:
NUM_ROWS BLOCKS
---------------------
99999 152 <-- table size
NUM_ROWS LEAF_BLOCKS
---------------------
1 1 <-- index size
The “trouble” with this kind of emulation is: It’s a function-based index. We can use this index only if we really reproduce the same “function” (or in this case, expression) as in the index itself. So, in order to fetch all the deleted messages, we must not write the following query:
SELECT *
FROM message
WHERE deleted = 1;
But this one, instead:
SELECT *
FROM message
WHERE CASE WHEN deleted > 0 THEN deleted END = 1;
Check out execution plans:
EXPLAIN PLAN FOR
SELECT *
FROM message
WHERE deleted = 1;
SELECT * FROM TABLE(dbms_xplan.display);
EXPLAIN PLAN FOR
SELECT *
FROM message
WHERE CASE WHEN deleted > 0 THEN deleted END = 1;
SELECT * FROM TABLE(dbms_xplan.display);
The output being:
------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)|
------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 50000 | 146K| 44 (3)|
|* 1 | TABLE ACCESS FULL| MESSAGE | 50000 | 146K| 44 (3)|
------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("DELETED"=1)
And
----------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)|
----------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 3 | 2 (0)|
| 1 | TABLE ACCESS BY INDEX ROWID| MESSAGE | 1 | 3 | 2 (0)|
|* 2 | INDEX RANGE SCAN | I | 1 | | 1 (0)|
----------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access(CASE WHEN "DELETED">0 THEN "DELETED" END =1)
As you can see, the first query runs a full table scan, estimating to retrieve 50% of all the rows, when the actual result is only 1 row as can be seen in the second execution plan!
Insertion speed
What’s even more impressive is the difference in insertion speed. Consider the following code block, which measures the time it takes to insert 1 million times 0 and one million times 1:
SET SERVEROUTPUT ON
DECLARE
ts TIMESTAMP;
BEGIN
ts := SYSTIMESTAMP;
INSERT INTO message
SELECT 0 FROM dual CONNECT BY level <= 1000000;
dbms_output.put_line(SYSTIMESTAMP - ts);
ts := SYSTIMESTAMP;
INSERT INTO message
SELECT 1 FROM dual CONNECT BY level <= 1000000;
dbms_output.put_line(SYSTIMESTAMP - ts);
END;
/
The result being:
+000000000 00:00:01.501000000
+000000000 00:00:08.162000000
The insertion is much faster if we don’t have to modify the index!
Conclusion
Partial indexes are a very neat trick in cases where your data is highly skewed and some values in a column are extremely rare and very frequently queried. They may drastically reduce the index size, which greatly improves performance in some situations, including inserting into the table, and querying the index.
In Oracle, they can be emulated using function-based indexes, which means you have to use the exact function expression from the index also in queries, in order to profit. But it may well be worth the trouble!
Like this:
Like Loading...
It’s really cool. With just one mentioned caveat. You have to alway remember to write it this way.
And it also becomes more ugly with multicolumn indexes.
Yes, sure. Those are significant caveats, unfortunately…
As of version 11.2, define a virtual column and index that column. As of version 12c, optionally make that column “invisible”.
alter table message add is_deleted invisible as (case when deleted > 0 then 1 end);
create index i on message(is_deleted);
select * from message where is_deleted = 1;
Interesting, thanks for your hint. I hadn’t thought of virtual columns. Will investigate!