Parallel JSONLines in R

JSON is increasingly used in data lakes, especially as a replacement for CSV. JSONLines, which stores one JSON object per line, deserves particular mention. Although the format has some disadvantages (e.g., overhead and support for only a few data types), the advantages mostly outweigh them. For example, the individual lines are easier for humans to read and the format is much more standardized. In addition, each line holds its own JSON object that can be processed independently....
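Because every line is a self-contained JSON object, the lines can be parsed in separate worker processes. The following is a minimal sketch of that idea, not necessarily the approach taken in the post: it assumes the jsonlite package is installed and uses the parallel package that ships with R. The sample records, the temporary file and the worker count of 2 are made up for illustration.

```r
library(parallel)

# Write a small, made-up JSONLines file: one JSON object per line.
path <- tempfile(fileext = ".jsonl")
writeLines(c(
  '{"id": 1, "name": "a"}',
  '{"id": 2, "name": "b"}',
  '{"id": 3, "name": "c"}'
), path)

# Read the raw lines; each one can be parsed on its own.
lines <- readLines(path)

# Start two worker processes (an arbitrary choice for this sketch).
cl <- makeCluster(2)

# Parse every line independently on the workers.
records <- parLapply(cl, lines, jsonlite::fromJSON)

stopCluster(cl)

# Collect one field from the parsed records.
ids <- vapply(records, function(r) as.numeric(r$id), numeric(1))
```

Parsing line by line like this only pays off for large files, since starting workers and shipping data to them has its own cost.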

January 6, 2021 · 2 min · Christian Hotz-Behofsits

Learning Emoji Representations from Observed Usage

Nowadays it is hard to imagine daily communication without emojis like 😄, 😬 or ❤️. These cute pictograms are not only well suited to expressing emotions, they are also standardized, which makes them ideal for analysis. However, before classical analyses such as clustering can be performed, a numerical representation is required. A simple method would be one-hot encoding, an obvious choice given the very limited vocabulary (there are only a few thousand emojis)....

November 5, 2019 · 3 min · Christian Hotz-Behofsits

Recommender Systems in R

A former WU member, Michael Hahsler, created a really nice package called recommenderlab, which allows you to build collaborative filtering systems. But before you can use it, you have to install all required packages: install.packages(c("recommenderlab", "dplyr", "readr")) … and load them: library(recommenderlab) library(tibble) library(dplyr) library(readr) I asked my students to answer some questions. One question was about their favourite TV series, which is a good use case for a recommender system (Netflix, for example, does pretty much the same)....

October 26, 2019 · 4 min · Christian Hotz-Behofsits

Using Embeddings in R

Several file formats exist for storing distributed vector or word representations, also known as embeddings. One of the most convenient is the text format used by the original word2vec implementation. In this format, each row starts with a label (an item of the vocabulary) followed by the vector components, and the fields are separated by an ordinary space. The following function processes such *.vec-files and can be used to load them directly into R:...
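The post's own function is not reproduced in this excerpt. As an illustration of the format just described, here is a minimal sketch of a reader. The name read_vec and its skip_header argument are hypothetical. The sketch assumes fields are separated by single spaces and that the optional first line, which word2vec text files usually carry with the vocabulary size and dimension, can be dropped on request.

```r
# Hypothetical helper: load a word2vec-style text file into a numeric matrix
# with one row per vocabulary item and the labels as row names.
read_vec <- function(path, skip_header = FALSE) {
  lines <- readLines(path, encoding = "UTF-8")
  # Optionally drop the "<vocabulary size> <dimension>" header line.
  if (skip_header) lines <- lines[-1]
  # Ignore empty lines and surrounding whitespace.
  lines <- lines[nzchar(trimws(lines))]
  fields <- strsplit(trimws(lines), " ", fixed = TRUE)
  # The first field is the label, the remaining fields are the components.
  labels <- vapply(fields, function(f) f[1], character(1))
  vectors <- do.call(rbind, lapply(fields, function(f) as.numeric(f[-1])))
  rownames(vectors) <- labels
  vectors
}

# Usage with a small, made-up file.
tmp <- tempfile(fileext = ".vec")
writeLines(c("king 0.1 0.2 0.3", "queen 0.2 0.1 0.4"), tmp)
emb <- read_vec(tmp)
emb["queen", ]
```

Keeping the result as a plain matrix with row names makes it easy to look up a vector by label or to pass the whole matrix on to clustering and distance functions.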

August 8, 2019 · 3 min · Christian Hotz-Behofsits