First steps with the Twitter Academic API

Finally, Twitter released the academic track, a new API endpoint just for researchers! The application is already open, and I am one of the lucky guys with approved access. However, it is relatively new and, therefore, it is not easy to find many tutorials or related materials online. I want to change this and start with a quick-start on loading historic tweets (incl. package suggestion and configuration). I highly recommend the python package TwitterAPI....

March 11, 2021 · 2 min · Christian Hotz-Behofsits

SQL Multi-Substring Search in Yandex Clickhouse

Nowadays, also analytical databases are more frequently used to search for substrings or patterns in character strings. Thus, most databases provide a like operator or more sophisticated regex functions. However, especially long regex patterns or many of them can be challenging. In a recent project, I had to search for not one, but for 150 substrings at the same time. Thankfully, my database of choice (Yandex Clickhouse) offers several possibilities to handle this:...

January 16, 2021 · 3 min · Christian Hotz-Behofsits

Bigquery: Least but not NULL

Let’s start with a simple SQL statement that returns the smallest value from each of several columns (similar to min only not across observations, but columns): SELECT LEAST(date1, date2, date3) as min_date FROM tbl In BigQuery it works nicely till you encounter NULL values, then it will allways pick the NULL. Unfortunately it does not provide a IGNORE NULLS option such as ARRAY_AGG or others do. However, similar behaviour can be archieved with the following UDF:...

January 14, 2021 · 1 min · Christian Hotz-Behofsits

Merge large CSV files with headers

If you have ever worked with classic data warehousing tools, you may already know the problem: The CSV export splits the output into different files, often with one file containing no more than 1 million observations. However, for analysis purposes, this format is not always optimal and must be merged into a single file. In bash this is not a big problem with cat, but only if there is no header present....

January 9, 2021 · 1 min · Christian Hotz-Behofsits

Parallel JSONLines in R

JSON is becoming more and more widely used in data lakes, especially as a replacement for CSV. In particular, JSONLines where one JSON object is stored per line is to be mentioned here. Although the format has some disadvantages (i.e., overhead, supports only few data types), the advantages mostly outweigh them. For example, the individual lines are easier to read by humans and the format is much more standardized. In addition, each line has its own JSON object that can be processed independently....

January 6, 2021 · 2 min · Christian Hotz-Behofsits