Nowadays, analytical databases are also increasingly used to search for substrings or patterns in character strings. Most databases therefore provide a LIKE operator or more sophisticated regex functions. However, long regex patterns, or many of them, can be challenging. In a recent project, I had to search not for one, but for 150 substrings at the same time. Thankfully, my database of choice (Yandex ClickHouse) offers several possibilities to handle this:

Option 1: match

The match(haystack, pattern) function can be used together with the regex OR operator | to search for several substrings at the same time.

single_regex = '|'.join(substrings)

!cat 20mtexts.jsonl | clickhouse-local \
    --input-format "JSONEachRow" \
    -S "author String, text String, followers String, latitude Nullable(Float64), longitude Nullable(Float64), language String, date DateTime" \
    -q "select count(*) FROM table WHERE match(text, '{single_regex}')"

The downside of this is that (1) it results in a really long regex pattern and (2) it requires invoking a regex engine.
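One practical caveat when joining substrings into a single pattern (not shown in the snippet above): if any substring contains a regex metacharacter such as + or ., the joined pattern no longer matches literally. A minimal sketch, using a hypothetical substring list, escapes each element with Python's re.escape before joining:

```python
import re

# Hypothetical example list; the real `substrings` holds ~150 short
# emoji strings (see Final Remarks), which are unlikely to clash with
# regex syntax -- but escaping makes the construction safe in general.
substrings = ["foo", "c++", "a.b"]

# Escape each substring so metacharacters like '+' and '.' are
# matched literally when OR-ed together.
single_regex = "|".join(re.escape(s) for s in substrings)
print(single_regex)  # foo|c\+\+|a\.b
```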

Option 2: multiMatchAny

An alternative is the multiMatchAny(haystack, [pattern1, pattern2, ..., patternN]) function. Here, an array of separate patterns is passed as the second argument:

inner_part = "','".join(substrings)
multi_match = f"multiMatchAny(text,['{inner_part}'])"

!cat 20mtexts.jsonl | clickhouse-local \
    --input-format "JSONEachRow" \
    -S "author String, text String, followers String, latitude Nullable(Float64), longitude Nullable(Float64), language String, date DateTime" \
    -q "select count(*) FROM table WHERE {multi_match}"

However, as the documentation points out, there is another function that can be used whenever we search for plain substrings and no regex is required…

For patterns to search substrings in a string, it is better to use multiSearchAny since it works much faster.

Although ClickHouse uses Intel’s highly optimized Hyperscan library, regex matching is still expensive.

Option 3: multiSearchAny

The function multiSearchAny(haystack, [needle1, needle2, ..., needleN]) has a signature similar to multiMatchAny; however, it is explicitly recommended when searching for plain substrings rather than regex patterns.
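To make the semantic difference between the two functions explicit, here is a pure-Python sketch of their behavior (an illustration of the semantics, not the actual ClickHouse implementation): multiMatchAny treats each element as a regular expression, while multiSearchAny treats each element as a literal substring.

```python
import re

def multi_match_any(haystack, patterns):
    # multiMatchAny semantics: each pattern is a regular expression
    return any(re.search(p, haystack) for p in patterns)

def multi_search_any(haystack, needles):
    # multiSearchAny semantics: each needle is a literal substring
    return any(n in haystack for n in needles)

print(multi_match_any("price is 42$", [r"\d+"]))   # True  (regex digit match)
print(multi_search_any("price is 42$", [r"\d+"]))  # False (no literal "\d+")
print(multi_search_any("price is 42$", ["42"]))    # True  (literal substring)
```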

inner_part = "','".join(substrings)
multisearch = f"multiSearchAny(text,['{inner_part}'])"

!cat 20mtexts.jsonl | clickhouse-local \
    --input-format "JSONEachRow" \
    -S "author String, text String, followers String, latitude Nullable(Float64), longitude Nullable(Float64), language String, date DateTime" \
    -q "select count(*) FROM table WHERE {multisearch}"

Given the information from the official documentation, one would assume that multiSearchAny in particular would perform well. After all, this function was created to handle many substring searches. However, a small benchmark revealed interesting results. I compared all three options using clickhouse-local and a dataset of 20 million observations consisting primarily of short texts (max. 280 characters).

As seen in the graph above, multiMatchAny performed best on average. I do not want to generalize these results prematurely, but it pays off to evaluate the different possibilities for heavy queries. Finally, I have to remark that this benchmark is based on clickhouse-local with file input; the results may look different in a production environment.

Final Remarks

  • iPython is used to build and run the commands; the %%timeit magic is used for the benchmarks, and an old Intel server with 32 cores serves as the machine
  • substrings is a Python list that contains approx. 150 very short substrings (Unicode emojis)
  • 20 million observations are used for the benchmark so that caching or other side effects are avoided (at least to some extent)
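For completeness, all three WHERE clauses compared above can be assembled from the same substrings list. A minimal sketch, using a hypothetical three-emoji stand-in for the real ~150-element list:

```python
# Hypothetical stand-in for the real substrings list (~150 emojis).
substrings = ["😀", "🎉", "🚀"]

single_regex = "|".join(substrings)       # Option 1: one big regex
inner_part = "','".join(substrings)       # Options 2 & 3: array literal

clauses = {
    "match": f"match(text, '{single_regex}')",
    "multiMatchAny": f"multiMatchAny(text, ['{inner_part}'])",
    "multiSearchAny": f"multiSearchAny(text, ['{inner_part}'])",
}

for name, clause in clauses.items():
    print(f"SELECT count(*) FROM table WHERE {clause}")
```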