Nowadays, analytical databases are increasingly used to search for substrings or patterns in character strings. Thus, most databases provide a LIKE operator or more sophisticated regex functions. However, especially long regex patterns, or many of them, can be challenging. In a recent project, I had to search not for one, but for 150 substrings at the same time. Thankfully, my database of choice (Yandex ClickHouse) offers several possibilities to handle this:

Option 1: match

The match(haystack, pattern) function can be used together with the regex or-operator | to search for several substrings at the same time.

single_regex = '|'.join(substrings)

!cat 20mtexts.jsonl | clickhouse-local \
    --input-format "JSONEachRow" \
    -S "author String, text String, followers String, latitude Nullable(Float64), longitude Nullable(Float64), language String, date DateTime" \
    -q "select count(*) FROM table WHERE match(text, '{single_regex}')"

The downsides of this approach are that 1) it results in a really long regex pattern and 2) it requires invoking a regex engine.

Option 2: multiMatchAny

An alternative is the multiMatchAny(haystack, [pattern1, pattern2, ..., patternn]) function. Here, an array of separate patterns is passed as the second argument:

inner_part = "','".join(substrings)
multi_match = f"multiMatchAny(text,['{inner_part}'])"

!cat 20mtexts.jsonl | clickhouse-local \
    --input-format "JSONEachRow" \
    -S "author String, text String, followers String, latitude Nullable(Float64), longitude Nullable(Float64), language String, date DateTime" \
    -q "select count(*) FROM table WHERE {multi_match}"

However, as the documentation points out, there is another function which can be used whenever we search for plain substrings and no regex is required…

For patterns to search substrings in a string, it is better to use multiSearchAny since it works much faster.

Although ClickHouse uses Intel’s highly optimized Hyperscan library, regex matching is still expensive.

Option 3: multiSearchAny

The function multiSearchAny(haystack, [needle1, needle2, ..., needlen]) has a similar signature to multiMatchAny; however, it is explicitly recommended when searching for plain substrings rather than regex patterns.
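
To make the contrast to multiMatchAny concrete, here is the same illustrative check as above: the needles are now taken literally, so the dot pattern no longer matches.

# The dot pattern is treated as a literal substring and does not occur in the haystack.
!clickhouse-local -q "SELECT multiSearchAny('hello world', ['wor.d', 'xyz'])"
# prints 0
# A plain substring, on the other hand, is found.
!clickhouse-local -q "SELECT multiSearchAny('hello world', ['world'])"
# prints 1

The benchmark query itself is assembled exactly as in Option 2: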

inner_part = "','".join(substrings)
multisearch = f"multiSearchAny(text,['{inner_part}'])"

!cat 20mtexts.jsonl | clickhouse-local \
    --input-format "JSONEachRow" \
    -S "author String, text String, followers String, latitude Nullable(Float64), longitude Nullable(Float64), language String, date DateTime" \
    -q "select count(*) FROM table WHERE {multisearch}"

Given the information from the official documentation, one would assume that multiSearchAny in particular is going to perform well. After all, this function was created to handle many substring searches. However, a small benchmark revealed interesting results. I compared all three options using clickhouse-local and a dataset of 20 million observations consisting primarily of short texts (max. 280 characters).

As seen in the graph above, multiMatchAny performed best on average. I do not want to generalize these results prematurely, but it pays off to evaluate the different possibilities for heavy queries. Finally, I have to remark that this benchmark is based on clickhouse-local and file input; the results may look different in a production environment.

Final Remarks

  • IPython is used to build & run the commands; the %%timeit magic is used for the benchmarks, running on an old Intel server with 32 cores (a minimal sketch of such a benchmark cell follows this list)
  • substrings is a Python list which contains approx. 150 very short substrings (Unicode emojis)
  • 20 million observations are used for the benchmark so that caching and other side effects are avoided (at least to some extent)
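
A minimal sketch of one of those benchmark cells, assuming the query expression (here {multisearch}) has already been assembled as shown above; the flags mirror the commands used earlier:

%%timeit
!cat 20mtexts.jsonl | clickhouse-local \
    --input-format "JSONEachRow" \
    -S "author String, text String, followers String, latitude Nullable(Float64), longitude Nullable(Float64), language String, date DateTime" \
    -q "select count(*) FROM table WHERE {multisearch}"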