Applying Cosine Similarity on a Large Scale

I wrote my first blog post about cosine similarity back in 2019 when the pandemic was out of sight, and most marketing people were unaware of “representational learning.” However, times have changed, and nowadays, many quantitative marketing papers apply word2vec, prod2vec, or similar approaches.

Thanks to gensim, training such a model is straightforward. For example, one can learn the representations of tracks a within few lines of python code:

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

model = Word2Vec(
    sentences=LineSentence('./data/tmp_isrc_training.txt'), 
    vector_size=50, window=99, epochs=5, min_count=1, workers=8,
    sg=1, negative=10
)

The similarity between two songs, for example “Atemlos durch die Nacht” by Helene Fischer and ”Amerika” by Rammstein can be calculated like this:

model.wv.similarity('DEUM71303188', 'DEN120404987')

0.2481286

The cosine similarity takes a value between -1 and 1. While a small value means dissimilarity and a high one similarity, a value next to zero indicates independence. Since Helene Fischer is a pop artist and Rammstein is a metal band, a relatively low similarity of 0.25 seems reasonable.

While training is super easy, applying the embeddings is often not trivial. The problem is that similarity is defined for item pairs, and the number of item pairs is quadratic. Therefore, in most cases, it does not make sense to calculate all pair-wise similarities but to calculate the required ones “on demand.” This means we must calculate the similarity in the system we are generating our panel. In my case, this is an OLAP database management system (DBMS) named Clickhouse.

In this post, I’ll demonstrate how you can calculate cosine similarity directly in Clickhouse. Still, the same approach should be (more or less) easy to adopt for other DBMS.

The good and the bad news

When planning to implement a particular functionality in a database management system, two properties are essential: (1) supported data types and (2) supported functions.

Our learned vector representations can be seen as arrays or tuples. Tuples have a fixed length but are more resource intensive. Arrays, on the other hand, are more flexible/lightweight but have one drawback: In line with BigQuery, the order of elements within an Array is not fixed. However, I assume in the following that the imported array values are not permuted; and it seems like this claim holds in practice.

Consequently, my database schema looks as follows:

DROP TABLE IF EXISTS myproject.aux_w2v_tracks;
CREATE TABLE myproject.aux_w2v_tracks (
      isrc String,
      vec Array(Float32)
)
ENGINE=Join(ANY, LEFT, isrc);

Notably, I am using the Join engine here. This allows to directly query the table using the joinGet(tbl, col, key) function. The key is an element of the vocabulary, and you will get NULL if the key is not part of the vocab.

Regarding functions, Clickhouse already provides a pretty substantial set of pre-defined distance norms and distance measures. However, some of them (including cosine distance) only apply to tuples. Since we are using Arrays here and a conversion to tuples is not feasible, I’ll re-implement the same functionality in SQL user-defined functions (UDFs).

I’ll use JSONEachRow to serialize our Word2Vec model (trained with gensim). This format can be directly imported to Clickhouse.

import json

with open('./data/model_w2v_tracks.jsonl', 'w') as f:
    for isrc in model.wv.index_to_key:
        
        f.write(json.dumps({
            'isrc': isrc,
            'vec': model.wv.get_vector(isrc, norm=True).tolist()
        }) + '\n')

The norm=True argument normalizes the vectors to unit length. This makes our lives easier later cause the cosine calculation reduces to the dot product of the vectors. The resulting file looks like the following. It is essentially a JSON object per line.

head -n 1 ./data/model_w2v_tracks.jsonl

{“isrc": “QZES71982312”, “vec”: [-0.06215152144432068, 0.3074309825897217, -0.03476686775684357, -0.0048563010059297085, -0.22848837077617645, 0.030492885038256645, 0.2259601503610611,… ]}

I always use bash to import data to my database. The main reason is speed, but it is also quite convenient (if you are already used to bash).

cat ./data/model_w2v_tracks.jsonl|clickhouse-client  -q "INSERT INTO myproject.aux_w2v_tracks FORMAT JSONEachRow"

Now we can check if the vectors are correctly stored and if we indeed have unit vectors in place by calculating the vector length:

SELECT L2Norm(joinGet(‘myproject.aux_w2v_tracks', 'vec', 'DEUM71303188'))

This query should yield a value near one (numerical error may lead to minor deviations).

Simple Cosine Similarity

In its most simple version, we calculate the similarity between two artists in SQL using the L2Distance function. It returns the Euclidean distance, which we transform into the cosine distance.

SELECT
    1-pow(L2Distance(
        joinGet('myproject.aux_w2v_tracks', 'vec', 'DEUM71303188'),
        joinGet('myproject.aux_w2v_tracks', 'vec', 'DEN120404987')
    ), 2)/2

We can simply check the approach’s validity by comparing this function’s result with the genism equivalent. Again, the results should only deviate by a small margin. Alternatively, we can also implement the same logic ourselves. Since we have unit vectors, we only have to calculate the dot product.

SELECT 
    sumArray(arrayMap((x, y) -> x*y, 
        joinGet('myproject.aux_w2v_tracks', 'vec', 'DEUM71303188'),
        joinGet('myproject.aux_w2v_tracks', 'vec', 'DEN120404987')
    ))

In practice, we won’t calculate the similarity between two predefined tracks, and we also have to check if the tracks are part of the vocabulary.

DROP FUNCTION IF EXISTS cos_similarity;

CREATE FUNCTION cos_similarity AS (isrc1, isrc2, vec_len) -> (
    CASE
        WHEN empty(joinGet('myproject.aux_w2v_tracks', 'vec', isrc1)) THEN NULL
        WHEN empty(joinGet('myproject.aux_w2v_tracks', 'vec', isrc2)) THEN NULL
        ELSE sumArray(
            arrayMap((x, y) -> x*y, 
            IF(empty(joinGet('myproject.aux_w2v_tracks', 'vec', isrc1)), range(0, vec_len), joinGet('myproject.aux_w2v_tracks', 'vec', isrc1)), 
            IF(empty(joinGet('myproject.aux_w2v_tracks', 'vec', isrc2)), range(0, vec_len), joinGet('myproject.aux_w2v_tracks', 'vec', isrc2))
        ))
    END
);

Calculating the similarity in a UDF is challenging due to various reasons. First, UDFs are still a work in progress and subject to a diverse set of issues (e.g., see ). For example, I could not use the WITH/SELECT statements since CH has trouble finding the columns. Second, there is not an easy way to catch/handle exceptions. In particular, there is no equivalent to BQ SAFE-predicate. Consequently, I have to create “fake vectors,” replaced afterward. To this end, I have also to provide the length of the hidden space (latent factors) such that these fake arrays have the correct dimensionality.

When you try to aggregate similarities, things get even more ~~problematic~~ interesting. I don’t want to elaborate if aggregating cosine similarities make sense. Still, in case you want to do so, here is how.

The following user-defined function takes two arguments, the first is a focal track, and the second is an array of songs. The function calculates the average cosine Similarity between the focal song and the others. Since we have to check if all of these tracks are covered by the vocabulary, we have to add maaaaany conditionals.

DROP FUNCTION IF EXISTS cos_similarity_agg;

CREATE FUNCTION cos_similarity_agg AS (isrc1, arr_isrc2) -> (
    IF(
        empty(joinGet('myproject.aux_w2v_tracks', 'vec', isrc1)),
        NULL,
        arrayAvg(
            arrayMap(
                isrc2 -> assumeNotNull(
                    IF(
                        empty(joinGet('myproject.aux_w2v_tracks', 'vec', isrc1)),

                        -- HINT: there is no super-type Array(Float32) and Array(Nullable)
                        -- s.t. we add empty arrays that we remove in the next step
                        NULL,

                        arraySum(
                            arrayMap(
                                (x, y) -> x*y, 
                                joinGet('myproject.aux_w2v_tracks', 'vec', isrc1),
                                isrc2
                            )
                        )
                    )
                ),
                --// remove empty vectors (used as input for arrayMap)
                arrayFilter(
                    x -> not empty(x),

                    --// map isrcs to vectors
                    arrayMap(x -> (
                        IF(
                            empty(joinGet('myproject.aux_w2v_tracks', 'vec', x)),

                            -- HINT: there is no super-type Array(Float32) and Array(Nullable)
                            -- s.t. we add empty arrays that we remove in the next step
                            array(), 
                            joinGet('myproject.aux_w2v_tracks', 'vec', x)
                        )
                    ), arr_isrc2)
                )
            )
        )
    )
);

SELECT (cos_similarity_agg('DEUM71303188', ['DEUM71303188', 'DEN120404987', 'DEUM71303188']))

Hope this helps someone out there.