SQL Multi-Substring Search in Yandex Clickhouse
Nowadays, also analytical databases are more frequently used to search for substrings or patterns in character strings. Thus, most databases provide a like operator or more sophisticated regex functions. However, especially long regex patterns or many of them can be challenging. In a recent project, I had to search for not one, but for 150 substrings at the same time. Thankfully, my database of choice (Yandex Clickhouse) offers several possibilities to handle this:
Option 1: match
The match(haystack, pattern)
function can be used with the help of the regex or-operator |
to search for different substrings at the same time.
The downside of this is, that 1) it yields to a really long regex pattern and 2) it requires to invoke a regex engine.
Option 2: multiMatchAny
An alternative is the multiMatchAny(haystack, [pattern1, pattern2, ..., patternn])
fuction. Here an array of separate patterns is passed as the second argument:
However, as the documentation points out, there is another function which can be used, whenever we search for substrings and no regex is required…
For patterns to search substrings in a string, it is better to use
multiSearchAny
since it works much faster.
Although, clickhouse uses Intel’s highly optimized Hyperscan library, regex is still expensive.
Option 3: multiSearchAny
The function multiSearchAny(haystack, [needle1, needle2, ..., needlen])
has a similar signature as multiMatchAny
, however, it is explicitly recommended in case of searching for substrings and not regex.
Given the information from the official documentation one would assume that especially multiSearchAny
is going to perform well. After all, this function was created to handle many substring searches. However, a small benchmark reveiled interesting results. I compared all three options using clickhouse-local
and a 20 million observation dataset consisting primarly of short texts (max. 280 characters).
As seen in graph above, multiMatchAny
performed the best on average. I do not want to generalize these results prematurely, but it pays off to evaluate the different possibilities for heavily queries. Finally, I have to remark, that this benchmark is based on clickhouse-local and on a file input. The results may look different in a production environment.
Final Remarks
- iPython is used to build & run the commands.
%%timeit
magic is used for benchmarks and a old Intel-Server with 32 cores as machine - substrings is a Python list which contains approx. 150 very short substrings (unicode emojis)
- 20 million observations are used for the benchmark such that caching or other side effects are avoided (at least to some extent)