Using Embeddings in R
Distributed vector or word representations, also known as embeddings, can be stored in several file formats. One of the most convenient is the text format used by the original word2vec implementation: each row starts with a label (an item of the vocabulary), followed by the vector components, and all fields are separated by a single space. The following function processes such *.vec files and loads them directly into R:
load_embedding <- function(file_path){
  # load full file
  lines <- readLines(file_path)
  # create new environment
  embeddings_env <- new.env(hash = TRUE, parent = emptyenv())
  # this function is used to convert vectors to unit vectors
  # by dividing their components by vector length
  normalize_vector <- function(a){
    a / sqrt(sum(a^2))
  }
  # iterate through the whole file line by line
  # (seq_along also handles an empty file correctly)
  for (i in seq_along(lines)) {
    line <- lines[[i]]
    values <- strsplit(line, " ")[[1]]
    label <- values[[1]]
    embeddings_env[[label]] <- normalize_vector(as.double(values[-1]))
  }
  embeddings_env
}
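To make the expected file layout concrete, here is a minimal sketch that writes a made-up .vec file to a temporary location and parses it in the same way as the loader above. The labels, the numbers, and the helper parse_vec_file with its skip_header flag are invented for illustration. Files written by the original word2vec tool usually begin with a header line holding the vocabulary size and the dimensionality, so the sketch assumes such a header is present and skips it.

```r
# Toy file in the word2vec text layout (labels and values are made up).
vec_file <- tempfile(fileext = ".vec")
writeLines(c("3 2",           # header: vocabulary size and dimensions
             "song_a 3 4",
             "song_b 0 2",
             "song_c -1 0"), vec_file)

# Hypothetical helper: parse the file into a named list of unit vectors.
parse_vec_file <- function(file_path, skip_header = FALSE) {
  lines <- readLines(file_path)
  if (skip_header) lines <- lines[-1]
  fields <- strsplit(lines, " ", fixed = TRUE)
  vectors <- lapply(fields, function(f) {
    v <- as.double(f[-1])
    v / sqrt(sum(v^2))        # scale to unit length
  })
  # the first field of every row is the label
  names(vectors) <- vapply(fields, `[[`, character(1), 1)
  vectors
}

toy <- parse_vec_file(vec_file, skip_header = TRUE)
print(names(toy))
print(toy$song_a)             # 0.6 0.8
```

Without skipping the header, its first field would end up as just another label, so it is worth checking the first line of a file before loading it.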
For example, we trained a model on music data, where the similarity between songs is measured by their co-occurrence on playlists. All “labels” are therefore ISRC song IDs in this case. Using our function, the embeddings can be imported by running emb <- load_embedding("example.vec"). This creates a new environment, stores all vectors in it, and makes it accessible through the variable emb. To list all items in your vocabulary, call ls.str(emb), which also prints the structure of each vector; ls(emb) returns just the labels.
The $ operator extracts the vector representation for a given label. In the music example, you get the vector for the song with the ID NLB390800100 (aka Punk Rock Song by Bad Religion) by running emb$NLB390800100.
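The access patterns described above can be tried on a small hand-built environment. This is only a sketch: emb_toy stands in for a loaded embedding, and its two-dimensional values are made up.

```r
# Toy environment standing in for a loaded embedding (values are made up).
emb_toy <- new.env(hash = TRUE, parent = emptyenv())
emb_toy[["NLB390800100"]] <- c(0.6, 0.8)
emb_toy[["USUM71213718"]] <- c(1, 0)

ls(emb_toy)                 # character vector with all labels
length(ls(emb_toy))         # vocabulary size

emb_toy$NLB390800100        # $ works for syntactically valid names
song_id <- "USUM71213718"
emb_toy[[song_id]]          # [[ ]] works for any label, e.g. one held in a variable

# check whether a label is part of the vocabulary before looking it up
exists("UNKNOWN_ID", envir = emb_toy, inherits = FALSE)
```

The [[ ]] form is the safer choice when labels come from data, because $ needs the label typed literally (and wrapped in backticks if it contains characters such as hyphens).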
Calculating cosine similarity
The similarity between two items is commonly measured by the cosine of the angle between their vectors. Because we normalized all vectors to unit vectors while importing, the cosine is simply given by the dot product.
For example, we can calculate the similarity between NLB390800100 and USUM71213718 like this: sum(emb$NLB390800100*emb$USUM71213718). Alternatively, one can shorten this expression by using a single matrix multiplication (e.g., emb$NLB390800100 %*% emb$USUM71213718), which returns the same value as a 1x1 matrix.
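As a quick sanity check of that shortcut, the following sketch compares the full cosine formula with the dot product of the corresponding unit vectors. The two vectors and the helper to_unit are made up for illustration.

```r
# Two made-up raw (not yet normalized) vectors.
a <- c(3, 4)
b <- c(4, 3)

# Full cosine formula: dot product divided by the product of the lengths.
cos_full <- sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Same value from the plain dot product of the unit vectors.
to_unit <- function(v) v / sqrt(sum(v^2))
cos_unit <- sum(to_unit(a) * to_unit(b))

all.equal(cos_full, cos_unit)      # TRUE
cos_full                           # 0.96

# %*% yields the same number, but wrapped in a 1x1 matrix.
dim(to_unit(a) %*% to_unit(b))     # 1 1
```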
Get most similar elements
We can use the pairwise similarity to find the n most similar items given a reference item.
cosine_similarity <- function(a,b){
  # assuming unit vectors
  # the cosine is just the dot-product
  a %*% b
}
most_similar <- function(embeddings, ref_item, n_top = 10){
  # calculate cos similarity to ref_item for all elements
  # and flatten the result into a named numeric vector
  cos_sims <- unlist(eapply(embeddings, cosine_similarity, b = ref_item))
  # only look at cos values smaller than 1
  # this will ignore the same element; the small tolerance
  # guards against floating-point rounding
  cos_sims <- cos_sims[cos_sims < 1 - 1e-8]
  # return top elements (fewer if the vocabulary is small)
  head(sort(cos_sims, decreasing = TRUE), n_top)
}
A call to this function takes up to three arguments. First, a variable containing the embedding environment. Second, the vector of the reference item, or an artificial vector (e.g., the normalized mean of some vectors). Third and optionally, the number of most similar elements to return, which defaults to 10. A valid call could therefore look like this: most_similar(emb, emb$NLB390800100).
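The "artificial vector" case can be sketched as follows. The example is self-contained, so it builds a toy environment with made-up labels and values and ranks the items with a plain dot product instead of calling most_similar; with the function from above, the last two statements would correspond to most_similar(emb_toy, query, 2).

```r
# Toy embedding with made-up labels; all vectors already have unit length.
emb_toy <- new.env(hash = TRUE, parent = emptyenv())
emb_toy[["song_a"]] <- c(1, 0)
emb_toy[["song_b"]] <- c(0, 1)
emb_toy[["song_c"]] <- c(0.6, 0.8)
emb_toy[["song_d"]] <- c(-1, 0)

# Artificial query: mean of two vectors, rescaled to unit length.
query <- (emb_toy$song_a + emb_toy$song_b) / 2
query <- query / sqrt(sum(query^2))

# Dot product of every stored vector with the query, highest first.
sims <- vapply(ls(emb_toy),
               function(label) sum(emb_toy[[label]] * query),
               numeric(1))
head(sort(sims, decreasing = TRUE), 2)
```

Because the query is not itself part of the vocabulary, no stored item reaches a similarity of exactly 1, so the vectors that went into the mean (song_a and song_b here) stay in the ranking and may need to be removed by hand.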
You can read more about environments in Hadley Wickham's amazing Advanced R book. Furthermore, the full source code can be found in a gist right here.