Recommender Systems in R

A former WU-member, Michael Hahsler, created a really nice package called recommenderlab, which allows you to build collaborative filtering systems.

First you have to install all required packages:

install.packages(c("recommenderlab", "tibble", "dplyr", "readr"))

... and load them:

library(recommenderlab)
library(tibble)
library(dplyr)
library(readr)

I asked my students to answer some questions. One question was about their favourite TV series, which is a good use case for a recommender system (Netflix, for example, does pretty much the same). They filled out this survey, and I generated a table with one row for each student and one column for each TV show. I put a 1 in the cell if the user likes the show and left it empty if not. You can download and import this table (CSV format) using the following command (all columns are parsed as integers by default, except user, which is a string):

tv_shows <- read_csv("./data/serien.csv", col_types =  cols(
  .default = col_integer(),
  user = col_character()
))
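
To get a feeling for the expected file layout, here is a minimal sketch with a made-up three-user table (the shows and values are hypothetical, not the real survey data). It passes the CSV text inline via I(), which assumes readr 2.0 or newer:

```r
library(readr)

# Hypothetical miniature of the survey table: one row per user, one column
# per show, a 1 for "likes it" and an empty cell otherwise.
csv_text <- "user,simpsons,true blood,breaking bad
A,1,,1
B,,1,
C,1,1,1
"

toy_shows <- read_csv(I(csv_text), col_types = cols(
  .default = col_integer(),
  user = col_character()
))

toy_shows                  # empty cells are read as NA
colSums(is.na(toy_shows))  # number of missing entries per column
```

The empty cells come back as NA, which is why the next step replaces them before converting to logical values.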

Now we have to prepare the data by converting it to a format that recommenderlab can process. We use binary ratings, because they provide good results and are easier to handle. Interesting fact: Netflix decided to replace their 1-5 star rating with a binary one, because more people are willing to use this kind of feedback form.

# tv_shows is the data frame where the "raw" data is stored
ratings <- tv_shows %>%
  # replace all NAs by 0
  # (except in the user column, because it contains names)
  mutate(across(-user, ~ coalesce(.x, 0L))) %>%
  # convert all 0 to FALSE and 1 to TRUE
  # (again except the user column, which contains neither 0 nor 1)
  mutate(across(-user, as.logical)) %>%
  # dplyr uses its own structure called tibble, but we want a data frame,
  # so we convert it to one
  as.data.frame() %>%
  # and we turn the user column into the row names of the data frame
  column_to_rownames(var = "user")

Great, we have a data frame with boolean values (TRUE, FALSE), where each row represents a user and each column a choice (e.g., does user A like option 1). recommenderlab uses its own data structures, and this layout fits a structure known as binaryRatingMatrix. So we convert it to one (it is not possible to do this directly, therefore we convert it to a matrix first):

binaryTVShowRatings <- as.matrix(ratings) %>% as("binaryRatingMatrix")
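
To check what such an object contains, a small sketch on a hypothetical 3 x 3 toy matrix (not the survey data) can help. It uses rowCounts() and colCounts() from recommenderlab to count the likes per user and per show:

```r
library(recommenderlab)

# Hypothetical toy data: 3 users, 3 shows (TRUE = the user likes the show)
toy <- matrix(
  c(TRUE,  FALSE, TRUE,
    FALSE, TRUE,  FALSE,
    TRUE,  TRUE,  TRUE),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("A", "B", "C"),
                  c("simpsons", "true blood", "breaking bad"))
)

toy_ratings <- as(toy, "binaryRatingMatrix")

dim(toy_ratings)        # users x shows
rowCounts(toy_ratings)  # number of liked shows per user
colCounts(toy_ratings)  # number of users who like each show
```

The same three calls work on binaryTVShowRatings and are a quick way to spot users or shows without any likes.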

The package provides several similarity measures like Jaccard, cosine or Pearson similarity and different methods: UBCF is the abbreviation for User-Based Collaborative Filtering and IBCF for Item-Based Collaborative Filtering. So a model can be defined within a one-liner:

model <- Recommender(data = binaryTVShowRatings, method = "UBCF", 
                     parameter = list(method = "cosine"))
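
To make the similarity measures more tangible, here is a hand calculation in base R for two hypothetical users. It follows the textbook definitions of Jaccard and cosine similarity for 0/1 vectors; the implementation inside recommenderlab may differ in details:

```r
# Two hypothetical users and whether they like each of five shows
user_a <- c(TRUE, TRUE, FALSE, TRUE, FALSE)
user_b <- c(TRUE, FALSE, FALSE, TRUE, TRUE)

# Jaccard: shows both like, divided by shows at least one of them likes
jaccard <- sum(user_a & user_b) / sum(user_a | user_b)

# Cosine on 0/1 vectors: shows both like, divided by the geometric mean
# of the two users' like counts
cosine <- sum(user_a & user_b) / sqrt(sum(user_a) * sum(user_b))

jaccard  # 2 / 4 = 0.5
cosine   # 2 / sqrt(3 * 3), roughly 0.667
```

Both measures only look at the shows a user likes, so two users who dislike the same shows do not become more similar because of it.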

The Recommender() call builds the model, but it does not produce any recommendations yet. To predict recommendations, we call the generic predict method and provide the previously defined model, the data and the number of recommendations. In this case we train the model and predict with the same data (in-sample), which does not really make sense, but this is just an example of how to use the toolchain. Normally you would split the dataset and use different parts for training and testing (e.g. k-fold cross-validation), or train the model with all your existing data and predict results for new users. We simply do not have enough observations to do that here, but feel free to generate more data and test it in a more useful setting :)

recommendations <- predict(model, binaryTVShowRatings, n = 6)
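
The out-of-sample setting mentioned above could look roughly like the following sketch. It runs on randomly generated data, because the survey is too small for a split; all names (likes, sim_ratings, holdout_model, ...) and the sizes 40, 8 and 10 are assumptions made for illustration:

```r
library(recommenderlab)

set.seed(42)

# Hypothetical data: 40 users x 8 shows, each show liked with probability 0.4
n_users <- 40
n_shows <- 8
likes <- matrix(
  runif(n_users * n_shows) < 0.4,
  nrow = n_users,
  dimnames = list(paste0("user", 1:n_users), paste0("show", 1:n_shows))
)
# make sure every user likes at least one show
likes[cbind(1:n_users, sample(n_shows, n_users, replace = TRUE))] <- TRUE

sim_ratings <- as(likes, "binaryRatingMatrix")

# hold out 10 users for prediction and train on the remaining 30
test_idx  <- sample(n_users, 10)
train_idx <- setdiff(seq_len(n_users), test_idx)
train_set <- sim_ratings[train_idx, ]
test_set  <- sim_ratings[test_idx, ]

holdout_model <- Recommender(train_set, method = "UBCF",
                             parameter = list(method = "cosine"))

# top-3 recommendations for users the model has not seen during training
holdout_recs <- predict(holdout_model, test_set, n = 3)
getList(holdout_recs)[1:2]
```

This is only a single manual hold-out split. For a systematic comparison, recommenderlab also ships evaluationScheme() and evaluate(), which handle splitting and cross-validation.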

The prediction function returns a topNList, but we can convert it to a list. It will return a list of up to 6 TV show recommendations for each user:

getList(recommendations)

It is not possible to convert it directly to a data frame, but we can convert it to a matrix first and then to a data frame:

as(recommendations, "matrix") %>% as.data.frame()
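
The output has one row per user (A to I) and one column per show (shown here: simpsons, six feed under, true blood, breaking bad). Most cells are empty; only a few contain a score, for example in the rows of users C (0.30), D (0.40), E (0.18) and G (0.32).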
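
For further processing, a long table with one row per user and recommended show is sometimes handier than the wide matrix. A minimal base R sketch, assuming a small made-up list (rec_list) as a stand-in for the output of getList(recommendations):

```r
# Hypothetical stand-in for the output of getList(recommendations)
rec_list <- list(
  A = c("simpsons", "true blood"),
  B = c("breaking bad"),
  C = character(0)  # a user without any recommendation
)

# one row per (user, recommended show) pair
rec_long <- stack(rec_list)
names(rec_long) <- c("show", "user")
rec_long
```

Users without recommendations simply contribute no rows to the result.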