SQL Multi-Substring Search in Yandex Clickhouse
Nowadays, also analytical databases are more frequently used to search for substrings or patterns in character strings. Thus, most databases provide a like operator or more sophisticated regex functions. However, especially long regex patterns or many of them can be challenging. In a recent project, I had to search for not one, but for 150 substrings at the same time. Thankfully, my database of choice (Yandex Clickhouse) offers several possibilities to handle this:
Option 1: match
The match(haystack, pattern)
function can be used with the help of the regex or-operator |
to search for different substrings at the same time.
single_regex = '|'.join(substrings)
!cat 20mtexts.jsonl| clickhouse-local \
--input-format "JSONEachRow" \
-S "author String, text String, followers String, latitude Nullable(Float64), longitude Nullable(Float64), language String, date DateTime" \
-q "select count(*) FROM table WHERE match(text, '{single_regex}')"
The downside of this is, that 1) it yields to a really long regex pattern and 2) it requires to invoke a regex engine.
Option 2: multiMatchAny
An alternative is the multiMatchAny(haystack, [pattern1, pattern2, ..., patternn])
fuction. Here an array of separate patterns is passed as the second argument:
inner_part = "','".join(substrings)
multi_match = f"multiMatchAny(text,['{inner_part}'])"
!cat 20mtexts.jsonl| clickhouse-local \
--input-format "JSONEachRow" \
-S "author String, text String, followers String, latitude Nullable(Float64), longitude Nullable(Float64), language String, date DateTime" \
-q "select count(*) FROM table WHERE {multi_match}"
However, as the documentation points out, there is another function which can be used, whenever we search for substrings and no regex is required…
For patterns to search substrings in a string, it is better to use
multiSearchAny
since it works much faster.
Although, clickhouse uses Intel’s highly optimized Hyperscan library, regex is still expensive.
Option 3: multiSearchAny
The function multiSearchAny(haystack, [needle1, needle2, ..., needlen])
has a similar signature as multiMatchAny
, however, it is explicitly recommended in case of searching for substrings and not regex.
inner_part = "','".join(substrings)
multisearch = f"multiSearchAny(text,['{inner_part}'])"
!cat 20mtexts.jsonl |clickhouse-local \
--input-format "JSONEachRow" \
-S "author String, text String, followers String, latitude Nullable(Float64), longitude Nullable(Float64), language String, date DateTime" \
-q "select count(*) FROM table WHERE {multisearch}"
Given the information from the official documentation one would assume that especially multiSearchAny
is going to perform well. After all, this function was created to handle many substring searches. However, a small benchmark reveiled interesting results. I compared all three options using clickhouse-local
and a 20 million observation dataset consisting primarly of short texts (max. 280 characters).
As seen in graph above, multiMatchAny
performed the best on average. I do not want to generalize these results prematurely, but it pays off to evaluate the different possibilities for heavily queries. Finally, I have to remark, that this benchmark is based on clickhouse-local and on a file input. The results may look different in a production environment.
Final Remarks
- iPython is used to build & run the commands.
%%timeit
magic is used for benchmarks and a old Intel-Server with 32 cores as machine - substrings is a Python list which contains approx. 150 very short substrings (unicode emojis)
- 20 million observations are used for the benchmark such that caching or other side effects are avoided (at least to some extent)