Learning Emoji Representations from Observed Usage

It is hard to imagine everyday communication without emojis like 😄, 😬 or ❤️. These pictograms are not only well suited to expressing emotions, they are also standardized, which makes them convenient to analyze. However, before classical analyses such as clustering can be performed, a numerical representation is required. A simple option is one-hot encoding, an obvious choice given the very limited vocabulary (there are only a few thousand emojis)....

November 5, 2019 · 3 min · Christian Hotz-Behofsits

BigQuery & Embeddings

One can argue about whether it is wise to store embeddings directly in BigQuery or to calculate similarities in SQL. In some cases a library (e.g. gensim) or an approximate method (e.g. Facebook's Faiss) is certainly more appropriate. However, in our setting we wanted to use BigQuery. Therefore, arrays are used to store the word vectors, and I created SQL functions to calculate pairwise cosine similarities. ....

October 27, 2019 · 3 min · Christian Hotz-Behofsits
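
The pairwise cosine similarity mentioned in the excerpt above can be illustrated outside of SQL. The following is a minimal Python sketch, not the SQL functions from the post: the names `cosine_similarity` and `pairwise_cosine` are hypothetical, vectors are assumed to be plain lists of floats of equal length, and a zero vector is assumed to yield a similarity of 0.0.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Assumption: treat a zero vector as dissimilar to everything.
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def pairwise_cosine(vectors):
    """Similarity for every unordered pair of labels in a dict of vectors."""
    labels = sorted(vectors)
    return {
        (left, right): cosine_similarity(vectors[left], vectors[right])
        for i, left in enumerate(labels)
        for right in labels[i + 1:]
    }


if __name__ == "__main__":
    # Toy vectors, made up for illustration only.
    toy = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [2.0, 0.0]}
    for pair, score in pairwise_cosine(toy).items():
        print(pair, round(score, 3))
```

The quadratic number of pairs is the reason approximate methods become attractive for large vocabularies.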

Using Embeddings in R

There are several file formats for storing distributed vector or word representations, also known as embeddings. However, one of the most convenient is the text format used by the original word2vec implementation. In this format, each row starts with a label (an item of the vocabulary) followed by the vector components, and fields are separated by a single space. The following function processes such *.vec files and can be used to load them directly into R:...

August 8, 2019 · 3 min · Christian Hotz-Behofsits
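
The row layout described in the excerpt above (a label followed by space-separated vector components) can be sketched as a small parser. This is a Python illustration of the format only, not the R function from the post; the name `parse_vec_lines` and the `has_header` parameter are hypothetical. Files written by word2vec in text mode typically begin with a "count dimension" header line, so the sketch can optionally skip it.

```python
def parse_vec_lines(lines, has_header=False):
    """Parse word2vec-style text rows into a dict of label -> list of floats."""
    embeddings = {}
    for index, line in enumerate(lines):
        if has_header and index == 0:
            continue  # skip the "vocab_size dimension" line
        fields = line.split()  # tolerates trailing spaces and newlines
        if not fields:
            continue  # ignore blank lines
        label, components = fields[0], fields[1:]
        embeddings[label] = [float(value) for value in components]
    return embeddings


if __name__ == "__main__":
    # Toy rows, made up for illustration only.
    rows = ["2 3", "cat 0.1 0.2 0.3", "dog 0.4 0.5 0.6 "]
    print(parse_vec_lines(rows, has_header=True))
```

To read from disk, an open file handle can be passed directly, for example `parse_vec_lines(open(path, encoding="utf-8"))`, where `path` is whatever file you are loading. Labels containing spaces are not handled by this sketch.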