SQL is a really cool language. I can write really complex business
logic with this
logic programming language. I was again thrilled about SQL recently, at a customer site:
But whenever I tweet something like the above, the inevitable happened.
I was nerd sniped.
Oleg Šelajev from ZeroTurnaround challenged me to prove that SQL is so awesome:
Given a string, find all substrings from that string, which are palindromes. Challenge accepted! (For the moment, let’s forget about algorithmic complexity.)
Here’s the Full Query
Spoiler first.
Bear with me, I’ll explain it step by step afterwards. Here’s with PostgreSQL syntax:
WITH RECURSIVE
words (word) AS (
VALUES
('pneumonoultramicroscopicsilicovolcanoconiosis'),
('pseudopseudohypoparathyroidism'),
('floccinaucinihilipilification'),
('antidisestablishmentarianism'),
('supercalifragilisticexpialidocious'),
('incomprehensibilities'),
('honorificabilitudinitatibus'),
('tattarrattat')
),
starts (word, start) AS (
SELECT word, 1 FROM words
UNION ALL
SELECT word, start + 1 FROM starts WHERE start < length(word)
),
palindromes (word, palindrome, start, length) AS (
SELECT word, substring(word, start, x), start, x
FROM starts CROSS JOIN (VALUES(0), (1)) t(x)
UNION ALL
SELECT word, palindrome, start, length + 2
FROM (
SELECT
word,
substring(word, start - length / 2, length) AS palindrome,
start, length
FROM palindromes
) AS p
WHERE start - length / 2 > 0
AND start + (length - 1) / 2 <= length(word)
AND substring(palindrome, 1, 1) =
substring(palindrome, length(palindrome), 1)
)
SELECT DISTINCT
word,
trim(replace(word, palindrome, ' ' || upper(palindrome) || ' '))
AS palindromes
FROM palindromes
WHERE length(palindrome) > 1
ORDER BY 2
(
You can run it yourself on SQLFiddle)
The result being:
word |palindromes
----------------------------------------------|-----------------------------------------------
antidisestablishmentarianism |ant IDI sestablishmentarianism
antidisestablishmentarianism |antidi SES tablishmentarianism
floccinaucinihilipilification |flo CC inaucinihilipilification
floccinaucinihilipilification |floccinauc INI hilipilification
floccinaucinihilipilification |floccinaucin IHI lipilification
floccinaucinihilipilification |floccinaucinih ILI p ILI fication
floccinaucinihilipilification |floccinaucinih ILIPILI fication
floccinaucinihilipilification |floccinaucinihi LIPIL ification
floccinaucinihilipilification |floccinaucinihil IPI lification
floccinaucinihilipilification |floccinaucinihilipil IFI cation
honorificabilitudinitatibus |h ONO rificabilitudinitatibus
honorificabilitudinitatibus |honor IFI cabilitudinitatibus
honorificabilitudinitatibus |honorificab ILI tudinitatibus
honorificabilitudinitatibus |honorificabilitud INI tatibus
honorificabilitudinitatibus |honorificabilitudin ITATI bus
honorificabilitudinitatibus |honorificabilitudini TAT ibus
incomprehensibilities |incompr EHE nsibilities
incomprehensibilities |incomprehens IBI lities
incomprehensibilities |incomprehensib ILI ties
incomprehensibilities |incomprehensibil ITI es
pneumonoultramicroscopicsilicovolcanoconiosis |pneum ONO ultramicroscopicsilicovolcanoconios
pneumonoultramicroscopicsilicovolcanoconiosis |pneumonoultramicroscopics ILI covolcanoconios
pneumonoultramicroscopicsilicovolcanoconiosis |pneumonoultramicroscopicsilic OVO lcanoconios
pneumonoultramicroscopicsilicovolcanoconiosis |pneumonoultramicroscopicsilicovolca NOCON ios
pneumonoultramicroscopicsilicovolcanoconiosis |pneumonoultramicroscopicsilicovolcan OCO nios
pneumonoultramicroscopicsilicovolcanoconiosis |pneumonoultramicroscopicsilicovolcanoconio SIS
pseudopseudohypoparathyroidism |pseudopseudohy POP arathyroidism
pseudopseudohypoparathyroidism |pseudopseudohypop ARA thyroidism
pseudopseudohypoparathyroidism |pseudopseudohypoparathyro IDI sm
supercalifragilisticexpialidocious |supercalifrag ILI sticexpialidocious
tattarrattat |t ATTA rr ATTA t
tattarrattat |t ATTARRATTA t
tattarrattat |ta TT arra TT at
tattarrattat |ta TTARRATT at
tattarrattat |tat TARRAT tat
tattarrattat |TAT tarrat TAT
tattarrattat |tatt ARRA ttat
tattarrattat |tatta RR attat
tattarrattat |TATTARRATTAT
This query uses a couple of nice features:
Common Table Expressions
They’re the only way to declare variables in SQL – i.e. you can “store” a query in such a table expression, assign a name to it, and reuse it several times. This is done with the
WITH
clause. The nice thing about common table expressions is that they are allowed to be
RECURSIVE
(depending on the database, the
RECURSIVE
keyword may be required / optional / not available).
VALUES() clause
The
VALUES()
clause is a very handy tool to create ad-hoc data in the form of tables. We did this to create a table called
WORDS
, which contains a couple of words inside of which we’d like to look for palindromes. In some databases (including DB2 and PostgreSQL), it’s totally possible to just use the
VALUES()
clause as a standalone clause instead of
SELECT
:
VALUES
('pneumonoultramicroscopicsilicovolcanoconiosis'),
('pseudopseudohypoparathyroidism'),
('floccinaucinihilipilification'),
('antidisestablishmentarianism'),
('supercalifragilisticexpialidocious'),
('incomprehensibilities'),
('honorificabilitudinitatibus'),
('tattarrattat')
Recursive Generation of Integers
In order to look for palindromes, the algorithm used here lists, for each word, the character index of each individual character in the word:
WITH RECURSIVE
...
starts (word, start) AS (
SELECT word, 1 FROM words
UNION ALL
SELECT word, start + 1 FROM starts WHERE start < length(word)
),
...
For example, if we used the word “word”…
WITH RECURSIVE
words (word) AS (VALUES('word')),
starts (word, start) AS (
SELECT word, 1 FROM words
UNION ALL
SELECT word, start + 1 FROM starts WHERE start < length(word)
)
SELECT *
FROM starts
We’d get the following result:
word |start
-----|------
word |1
word |2
word |3
word |4
The idea of the algorithm is that we start from a given character and “fan out” in both directions of a character, again recursively. The characters to the left and to the right of such a character must be the same and so on.
If you’re interested in the syntax, stay tuned. There will be a blog post coming up soon.
Using CROSS JOIN to run the Algorithm Twice
There are two types of palindromes:
- Those with an even amount of characters: “”, “aa”, “abba”, “babbab”
- Those with an odd amount of characters: “b”, “aba”, “aabaa”
In our algorithm, we’re treating them in almost the same way by running the algorithm twice. Whenever you think “hey, I need to run this twice in SQL”,
CROSS JOIN
could be a good candidate, as we’re creating a cartesian product between:
- The previous “starts” table, giving the starting character index
- A table containing values 0 (palindromes with even amounts of letters) and 1 (palindromes with odd amounts of letters)
For more information about CROSS JOIN
, read our article about JOINs.
Run this query for illustration:
WITH RECURSIVE
words (word) AS (VALUES('word')),
starts (word, start) AS (
SELECT word, 1 FROM words
UNION ALL
SELECT word, start + 1 FROM starts WHERE start < length(word)
)
SELECT *
FROM starts CROSS JOIN (VALUES(0),(1)) AS t(x)
Output:
word |start |x
-----|------|--
word |1 |0 <-- Even length palindromes centering around position 1 length 0
word |1 |1 <-- Odd length palindromes centering around position 1 length 1
word |2 |0
word |2 |1
word |3 |0
word |3 |1
word |4 |0
word |4 |1
Trivially, a palindrome of length 0 (empty string) or length 1 (“w”, “o”, “r”, “d”) is acceptable in principle, but boring. We’ll filter them out later, but keep them in the algorithm as this simplifies the algorithm. If you want to tune the query, you could prevent generating them in the first place.
The Palindrome Algorithm
Thus far, we’ve just prepared utility data to calculate palindromes:
- The words inside of which to search for palindromes
- The individual character indexes to start fanning out from, in each individual word
- The minimum palindrome length 0 (even) or 1 (odd)
Now comes the interesting part. Recursively fanning out from a starting character to find more palindromes, stopping the fanning out as soon as the new candidate is no longer a plaindrome:
WITH RECURSIVE
...
palindromes (word, palindrome, start, length) AS (
-- This part just creates the even/odd semantics
SELECT word, substring(word, start, x), start, x
FROM starts CROSS JOIN (VALUES(0), (1)) t(x)
UNION ALL
-- This part recurses by "fanning out"
SELECT word, palindrome, start, length + 2
FROM (
SELECT
word,
substring(word, start - length / 2, length) AS palindrome,
start, length
FROM palindromes
) AS p
WHERE start - length / 2 > 0
AND start + (length - 1) / 2 <= length(word)
AND substring(palindrome, 1, 1) =
substring(palindrome, length(palindrome), 1)
)
...
It isn’t really so hard in fact. The recursion part selects recursively from the
PALINDROMES
table (the previously calculated palindromes). That table has 4 columns:
WORD
: The word we’re looking for palindromes in. This is always the same per recursion
PALINDROME
: The palindrome, i.e. the substring inside of a word. This changes per recursion
START
: The start character index from which we started fanning out. This is always the same per recursion
LENGTH
: The palindrome length. This increases by 2 per recursion
Let’s look at the result for “floccinaucinihilipilification”:
flo CC inaucinihilipilification
floccinauc INI hilipilification
floccinaucin IHI lipilification
floccinaucinih ILI p ILI fication
floccinaucinih ILIPILI fication
floccinaucinihi LIPIL ification
floccinaucinihil IPI lification
floccinaucinihilipil IFI cation
There are a total of 8 distinct palindromes contained in this word (
and believe it or not, the word exists, too. Pronunciation is another thing).
Let’s “debug” the algorithm to get to this list (remember, SQL indexes are 1 based):
Start 1-4: No palindromes
Start 5: Even palindrome [4:5]
flo CC inaucinihilipilification
Start 6-11: No palindromes (I wont' repeat this further down)
Start 12: Odd palindrome [11:13]
floccinauc INI hilipilification
Start 14: Odd palindrome [13:15]
floccinaucin IHI lipilification
Start 16: Odd palindrome [15:17]
floccinaucinih ILI p ILI fication
Start 18: Odd palindrome [17:19], [16:20], [15:21] (Fanning out 3 times)
floccinaucinihil IPI lification
floccinaucinihi LIPIL ification
floccinaucinih ILIPILI fication
Start 20: Odd palindrome [19:17] (already found this)
floccinaucinih ILI p ILI fication
Start 22: Odd palindrome [21:23]
floccinaucinihilipil IFI cation
The IPI, LIPIL, ILIPILI chain is the most interesting. We’ve succeeded to fan out 3 times adding new characters from WORD on both sides of the initial character.
When do we stop fanning out? Whenever one of these conditions hold true:
WHERE start - length / 2 > 0
AND start + (length - 1) / 2 <= length(word)
AND substring(palindrome, 1, 1) =
substring(palindrome, length(palindrome), 1)
I.e.
- When we’ve reached the beginning of WORD (no more characters to the left)
- When we’ve reached the end of WORD (no more characters to the right)
- When the letter to the left of the palindrome of the previous recursion doesn’t match the letter to the right of the palindrome
That last predicate could also simply read:
AND palindrome = reverse(palindrome)
But that might be a bit slower as it compares something that we’ve already proven to be true in the previous recursion.
Finally, formatting the result
The final part isn’t too interesting anymore:
SELECT DISTINCT
word,
trim(replace(word, palindrome, ' ' || upper(palindrome) || ' '))
AS palindromes
FROM palindromes
WHERE length(palindrome) > 1
ORDER BY 2
We’ll simply:
- Select
DISTINCT
results only, as palindromes might appear several times in a word
- Replace the palindrome substring by its upper case version and add some whitespace to better visualise it. That’s totally optional, of course
- Remove the trivial palindromes of length 0 and 1
And we’re done! Again,
feel free to play around with this on SQLFiddle, or, much better, provide a cool palindrome algorithm implementation in your favourite language in the comments section, or on Twitter:
More beautiful SQL in these articles here:
Like this:
Like Loading...
This made me chuckle …. Thanks Lukas for sharing it!
I got asked to do this as an interview test recently in PL/SQL. I did a simple nested loops solution. Here’s the rough equivalent in Oracle SQL:
Yeah, nice. Thanks for sharing. Oracle’s CONNECT BY syntax is much simpler than the CTE / WITH syntax for simple iterations.
At first, I was also going to simply iterate over all the positions and possible lengths and check for palindromes with REVERSE(). But I got nerd sniped into further optimising this and avoiding REVERSE() ;-)