Parallel JSONLines in R

JSON is increasingly used in data lakes, often as a replacement for CSV. JSONLines, which stores one JSON object per line, is particularly worth mentioning. Although the format has some disadvantages (e.g., overhead and support for only a few data types), the advantages mostly outweigh them. For example, individual lines are easier for humans to read, and the format is much more standardized than CSV. In addition, each line holds a self-contained JSON object that can be processed independently. The overhead caused by the repeated field names can be managed well with compression (e.g., gzip, zstd, brotli).
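To make the "one object per line" idea concrete, here is a minimal sketch using jsonlite. The three records and their field names (id, name) are made up for illustration; the point is only that a single line can be parsed on its own, and that a whole set of lines maps onto a data frame.

```r
library(jsonlite)

# Three hypothetical records, one JSON object per line
lines <- c(
  '{"id": 1, "name": "a"}',
  '{"id": 2, "name": "b"}',
  '{"id": 3, "name": "c"}'
)

# Any single line is a complete JSON document and parses on its own
first <- fromJSON(lines[1])

# stream_in() reads a JSONLines connection line by line into a data frame
records <- stream_in(textConnection(lines), verbose = FALSE)
```

Because no line depends on another, a file can be cut at any line boundary, which is what makes splitting exports into several files straightforward.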

Some systems split larger exports into smaller parts, so one often has to deal with multiple JSONLines files. These can be read in R in parallel, which can shorten loading time significantly.

library(furrr)

# My machine has more than 30 cores and a fairly fast SSD,
# so we can use 20 workers
plan(multisession, workers = 20)

df <- future_map_dfr(
  # list.files() returns a character vector with the paths of all JSONLines files;
  # note that `pattern` is a regular expression, not a shell glob
  list.files(path = "../data/panel", pattern = "00*", full.names = TRUE),
  # each file is parsed separately
  function(f) jsonlite::stream_in(file(f))
)

This code selects all files in a folder whose names match a given pattern and loads them in parallel. The data in this example is not compressed, but compression would not be a problem: gzip-compressed files, for example, can be read directly through a gzfile(.) connection (other connection types can be found in the documentation).
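As a small sketch of the compressed case, the following writes a gzip-compressed JSONLines file and reads it back through gzfile(). The temporary path and the two-row data frame are hypothetical and exist only for this illustration.

```r
library(jsonlite)

# Hypothetical temporary file, used only for this illustration
path <- tempfile(fileext = ".jsonl.gz")

# Write two records as gzip-compressed JSONLines
stream_out(
  data.frame(id = 1:2, value = c("x", "y")),
  gzfile(path),
  verbose = FALSE
)

# Read them back; the gzfile() connection decompresses while reading
df_gz <- stream_in(gzfile(path), verbose = FALSE)
```

In the parallel call above, the same idea would presumably amount to replacing file(f) with gzfile(f), assuming all matched files are gzip-compressed.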