Learning Emoji Representations from Observed Usage

November 2019 · 3 minute read

Nowadays it is hard to imagine daily communication without emojis like 😄, 😬 or ❤️. These cute pictograms are not only great for expressing emotions, they are also standardized, which makes them well suited for analysis. However, before classical analyses such as clustering can be performed, a numerical representation is required. A simple method would be one-hot encoding, which seems obvious given the very limited vocabulary (there are only a few thousand emojis). However, embeddings can provide not only more memory-friendly representations, but also ones that capture meaning.
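To make the contrast concrete, here is a toy sketch; the vocabulary size is only roughly right and the index is made up for illustration:

```r
# Toy comparison - the numbers are illustrative, not from the post
vocab_size <- 3000                    # roughly the number of emojis
one_hot <- integer(vocab_size)        # one slot per emoji, all zeros ...
one_hot[42] <- 1                      # ... except a single 1 for "this" emoji
dense <- runif(8, min = -1, max = 1)  # an 8-dimensional embedding instead

length(one_hot)  # 3000 numbers per emoji
length(dense)    # 8 numbers per emoji
```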

There are already several approaches to creating such embeddings - among others, Eisner et al. have tried to exploit the textual descriptions of emojis. But one can also analyze only the observed usage. The underlying assumption is that emojis are more similar if they occur in the same context - for example, alongside the same other emojis. Illendula et al. also pursued such an approach.

Some time ago, Facebook released a piece of software called StarSpace, which can be used to train all kinds of embeddings. The software is written in C++ and offers a Python interface. Fortunately, there is also a suitable R package called ruimtehol (please don’t ask me how they got the name).

After a one-time installation of the package (install.packages("ruimtehol")) it can be loaded as follows:

library(ruimtehol)

Now we only need the data - specifically, emoji sequences (e.g. ❤️😄❤️). These can be downloaded from social media platforms (in my case they come from Twitter).
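For illustration, a toy version of such data could look like this; the sequences below are made up, the real ones come from Twitter:

```r
# Made-up emoji sequences standing in for the scraped Twitter data
emo_seq <- c(
  "❤️😄❤️",
  "😂😂🔥",
  "🤷😬❤️"
)
is.character(emo_seq)  # TRUE - just a plain character vector
```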

Suppose you now have the data in a character vector called emo_seq. Then you can pass it to the function embed_wordspace:

model <- embed_wordspace(
	emo_seq, 
	dim = 8
) 

The first argument is the vector of emoji sequences and the second argument is the size of the representation. The larger the representation, the richer the information that can be captured, but the more memory is required. In addition, more data and time are needed to train larger models, and the added value diminishes at some point. A rule of thumb for embeddings is to take the fourth root of the vocabulary size. Since there are around 3000 emojis, this gives about 7.4, rounded up to 8 dimensions.
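The rule of thumb can be checked in one line:

```r
vocab_size <- 3000
vocab_size^(1/4)           # fourth root: approximately 7.4
ceiling(vocab_size^(1/4))  # rounded up: 8, the dimension used above
```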

wordvectors <- as.matrix(model)
head(wordvectors, 5)
1 2 3 4 5 6
🤷 0.00 0.00 0.00 0.00 -0.00 -0.00
🤦 0.00 -0.00 -0.00 -0.00 0.00 0.00
😂 -0.00 0.00 -0.00 0.00 -0.00 0.00
👉 0.00 -0.00 0.00 0.00 -0.00 -0.00
🔥 0.00 -0.00 -0.00 0.00 -0.00 0.00

Assessing Similarity between Emojis

Given this vector representation, it is very easy to define a similarity between elements. This is mostly done using the cosine similarity, which is ultimately based on the angle between the vectors.
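For reference, cosine similarity can be written down in a few lines; this is a from-scratch sketch, not the ruimtehol implementation:

```r
# Cosine similarity: dot product divided by the product of the vector
# lengths, i.e. the cosine of the angle between the two vectors
cosine_sim <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

cosine_sim(c(1, 0), c(1, 0))   # same direction -> 1
cosine_sim(c(1, 0), c(0, 1))   # orthogonal    -> 0
cosine_sim(c(1, 0), c(-1, 0))  # opposite      -> -1
```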

Ruimtehol offers a function to calculate the similarity between vectors, but we only want the most similar ones. Therefore we build our own function that sorts all emojis by similarity and returns the most similar ones.

most_similar <- function(e, m, top_n = 10) {
    # drop = FALSE keeps the single row as a 1-row matrix,
    # which is what embedding_similarity expects
    distances <- embedding_similarity(m, m[e, , drop = FALSE])[, 1]
    data.frame(cos_sim = head(sort(distances, decreasing = TRUE), top_n))
}
most_similar("🔥", wordvectors, top_n = 5)
cos_sim
🔥 1.00
🎥 0.95
😲 0.92
👹 0.89
⛽ 0.88

Obviously, the most similar object is the object itself (so it is usually excluded).
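A minimal way to drop the query element from such a result, shown on a hand-made named vector with ASCII stand-ins for the emoji labels (the scores are made up):

```r
# Hand-made similarity scores for illustration only
sims <- c(fire = 1.00, camera = 0.95, fuel = 0.88)

query <- "fire"
# Exclude the query itself, then sort the rest by similarity
sorted <- sort(sims[names(sims) != query], decreasing = TRUE)
sorted  # camera (0.95) first, then fuel (0.88)
```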

© Christian Hotz-Behofsits. All rights reserved.