SQL Multi-Substring Search in Yandex Clickhouse

Nowadays, also analytical databases are more frequently used to search for substrings or patterns in character strings. Thus, most databases provide a like operator or more sophisticated regex functions. However, especially long regex patterns or many of them can be challenging. In a recent project, I had to search for not one, but for 150 substrings at the same time. Thankfully, my database of choice (Yandex Clickhouse) offers several possibilities to handle this:...

January 16, 2021 · 3 min · Christian Hotz-Behofsits

Bigquery: Least but not NULL

Let’s start with a simple SQL statement that returns the smallest value from each of several columns (similar to min only not across observations, but columns): SELECT LEAST(date1, date2, date3) as min_date FROM tbl In BigQuery it works nicely till you encounter NULL values, then it will allways pick the NULL. Unfortunately it does not provide a IGNORE NULLS option such as ARRAY_AGG or others do. However, similar behaviour can be archieved with the following UDF:...

January 14, 2021 · 1 min · Christian Hotz-Behofsits

Merge large CSV files with headers

If you have ever worked with classic data warehousing tools, you may already know the problem: The CSV export splits the output into different files, often with one file containing no more than 1 million observations. However, for analysis purposes, this format is not always optimal and must be merged into a single file. In bash this is not a big problem with cat, but only if there is no header present....

January 9, 2021 · 1 min · Christian Hotz-Behofsits

Parallel JSONLines in R

JSON is becoming more and more widely used in data lakes, especially as a replacement for CSV. In particular, JSONLines where one JSON object is stored per line is to be mentioned here. Although the format has some disadvantages (i.e., overhead, supports only few data types), the advantages mostly outweigh them. For example, the individual lines are easier to read by humans and the format is much more standardized. In addition, each line has its own JSON object that can be processed independently....

January 6, 2021 · 2 min · Christian Hotz-Behofsits

Interpolate missing values in SQL

In this I demonstrate how missing values can be interpolated in SQL. The code is written for BigQuery and I provide a user defined function, however, similar solutions may exist for other databases. Although, SQL may not be the first choice to do such tasks, it is possible. Before one can replace missing values, you have to have missing values. At least in my applications, often the whole rows are missing and not just selective fields....

May 24, 2020 · 2 min · Christian Hotz-Behofsits