MonetDBLite as a batching backend

If you have to analyse a lot of data and need a good backend to fetch and store your results piecewise in small chunks, try MonetDB(Lite). I often used CSV or TSV files, because they are convenient. But from time to time you have to do some filtering, joins or other data operations. So I looked for a nice database, and MonetDB seemed to have some nice properties...

I had to conduct a sentiment analysis on 1.4bn text comments. This is not a lot, especially if you use the really nice fastText. But I wanted to select only English comments in a first step. So I used fastText for language detection and stored the result together with a unique comment id in a bunch of CSV files. In a second step I had to select the content of only the English comments and feed it into the next fastText model for sentiment analysis. But how do you join such a large number of CSV files? This is a perfect use case for a column-store database!
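
For reference, these intermediate language files only hold the comment id and the predicted label. A tiny, purely illustrative sketch of that layout (the column names and the path are made up; only the '__label__en' format comes from fastText):

library(readr)

# hypothetical example of one language-prediction chunk: uuid plus predicted label
lang_chunk <- data.frame(
  uuid = c("c0a1", "c0a2", "c0a3"),
  lang = c("__label__en", "__label__de", "__label__en")
)
write_tsv(lang_chunk, "data/comments_lang/chunk_001.tsv", col_names = FALSE)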

ClickHouse and Subqueries

My first pick was ClickHouse, the new kid on the block, which is fast, efficient and already provides some nice interfaces. Sadly, the query execution always got interrupted by an exception, and the Yandex guys told me that there is a bug if the result set of a subquery gets bigger than 16 GiB, which was the case in this scenario.

MonetDBLite

My second pick was MonetDBLite, because MonetDB is an old and well-tested column-store solution, it is really easy to use from R, and importing CSV files is simple. I did this in a quick and dirty, unparallelized way ...
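
The import loop below assumes an open connection and an existing target table. A minimal setup sketch (the database directory and the two-column schema are my own guesses, not taken from the original setup):

library(DBI)
library(MonetDBLite)

# embedded MonetDB database living in a local directory (the path is just an example)
con <- dbConnect(MonetDBLite::MonetDBLite(), "data/monetdblite")

# target table for the raw comments: unique id plus comment text
dbExecute(con, "CREATE TABLE comments (uuid STRING, text STRING)")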

for (rel_filepath in list.files('data/comments', full.names = TRUE)) {
  csvfile <- paste0(getwd(), '/', rel_filepath)
  # copy the contents of a CSV file into the database, using the MonetDB COPY INTO command
  dbSendQuery(con, paste0("COPY INTO comments FROM '", csvfile, "' USING DELIMITERS '\t','\n','' NULL AS ''"))
}
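
The language predictions from the first fastText step went into a second table, comments_lang, in exactly the same way, so that both sides of the later join live in the database. A sketch along the same lines (the source directory is my own guess; the table name comes from the join below):

dbExecute(con, "CREATE TABLE comments_lang (uuid STRING, lang STRING)")
for (rel_filepath in list.files('data/comments_lang', full.names = TRUE)) {
  csvfile <- paste0(getwd(), '/', rel_filepath)
  dbSendQuery(con, paste0("COPY INTO comments_lang FROM '", csvfile, "' USING DELIMITERS '\t','\n','' NULL AS ''"))
}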

... but it worked. In a second step, I joined my tables, selected only the English comments and stored the result in a new table.

CREATE TABLE comments_en AS SELECT uuid, text FROM comments_lang INNER JOIN comments USING (uuid) WHERE lang = '__label__en'
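
From R this is again just one statement sent over the DBI connection; afterwards a quick count (my own sanity check, not part of the original pipeline) shows how many comments survive the language filter:

# how many English comments made it into the new table?
dbGetQuery(con, "SELECT COUNT(*) FROM comments_en")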

Well, this worked nicely, but exporting the new table to CSV files took a really long time. After some hours I stopped the process and opened an issue in the MonetDB GitHub repository. They told me that this kind of export does not work at the moment (which was obviously the case :D ). So, never do something like this in MonetDBLite, otherwise you will grow old waiting:

DBI::dbSendQuery(con, "COPY SELECT * FROM comments_en INTO '/home/chris/Sentiment/data/comments_en.csv'")

Then I tried to export it using R, fetching the results in chunks and storing them in a CSV file.

rs <- dbSendQuery(con, "SELECT * FROM comments_en")

while (!dbHasCompleted(rs)) {
  # fetch the next 5m rows and append them to the output file
  chunk <- dbFetch(rs, n = 5000000)
  readr::write_tsv(chunk, 'some/path', col_names = FALSE, append = TRUE)
}
dbClearResult(rs)

It worked, but it also took far too long, so I stopped the process again. Then I slightly rewrote it, and now I use SQL's LIMIT and OFFSET options to split the data. But does this approach scale well? A dumb database would read every entry until it gets to the right offset, but is MonetDB dumb?

rs <- dbSendQuery(con, "SELECT * FROM comments_en LIMIT 500000")
i <- 1
while (!dbHasCompleted(rs)) {
  chunk <- dbFetch(rs)

  # the query succeeds even if there are no rows left, so stop on an empty chunk
  if (nrow(chunk) == 0) {
    break
  }

  # do something

  dbClearResult(rs)
  rs <- dbSendQuery(con, paste0("SELECT * FROM comments_en LIMIT 500000 OFFSET ", as.integer(500000 * i)))
  i <- i + 1
}
dbClearResult(rs)
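
In my case the "# do something" step simply wrote each chunk to its own file; a minimal sketch of that step (the file-name pattern is made up):

# runs inside the loop, in place of "# do something"
readr::write_tsv(chunk, sprintf("data/comments_en_chunks/chunk_%04d.tsv", i), col_names = FALSE)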

My code stored each query result as a CSV file, processed 0.5m comments at once (one chunk) and went through more than 1000 chunks in total. Surprisingly, the query execution time (with increasing offset) seemed to be more or less static after some point (an offset of around 60m) and even declined slightly afterwards.

I used the following R code to measure it (the input was the creation time of the individual CSV files):

library(readr)
library(dplyr)
library(ggplot2)

dataset <- read_csv("~/times.csv", col_names = c("time", "file"),
    col_types = cols(
      time = col_time(format = "%H:%M"),
      file = col_integer())
    )

trend <- dataset %>% mutate(
  duration = as.double(difftime(time, lag(time, n = 5), units = "mins")/5)
  ) %>% filter(!is.na(duration))

ggplot(data=trend, aes(x=file, y=duration)) + geom_smooth(method = 'loess') + 
  labs(y = "Duration of a single query\n(mean of 5 queries in min)",
       x = "number of chunk\n(increased offset)")
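
The times.csv input simply holds a wall-clock timestamp and the chunk number, one line per chunk. It could be produced, for example, by appending a log line right after each chunk file is written (purely illustrative; the exact logging is not part of the loop shown earlier):

# hypothetical logging step inside the export loop, right after a chunk is written
write(sprintf("%s,%d", format(Sys.time(), "%H:%M"), i), file = "~/times.csv", append = TRUE)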

Of course, you can argue that this benchmark is not scientific and the numbers are influenced by several things. But as you can clearly see, the query execution time does not increase with a higher offset. So MonetDB is really not dumb, and I don't miss the CSV export at all.