Bigquery & Embeddings

October 2019 · 3 minute read

One can argue if it is wise to store embeddings directly in bigquery or calculate the similarities in SQL. For sure, in some cases a library (e.g. gensim) or approximations (e.g. Facebook faiss) are more appropriate. However, in our setting we wanted to use BigQuery. Therefore, arrays are used to store the word vectors and I created SQL functions to calculate pairwise cosine similarities.

WARNING: One downside of using BigQuery arrays to store vectors is that there is no guarantee that the ordering of the components is preserved. However, in most situtations the order will not change.

Basic vector operations

Vector calculations are often much simplier on unit vectors, therefore, we will calcualte the length first and then introduce a method that normalizes the vector by dividing each component by the euclidean length of the vector:

$$||a|| = \sqrt(a_1^2 + … a_n^2)$$

CREATE TEMP FUNCTION vectorLength(arr ARRAY<Float64>) AS ((
  SELECT SQRT(SUM(POW(a,2))) FROM UNNEST(arr) as a
));

SELECT vectorLength([0., 8, 10.3])
13.041855696180663

The same can be achieved by calculating the dot-product between the vector and its identity $dot(arr1, arr1)$ and taking the sqare root:

$$||a|| = \sqrt{\boldsymbol{a \cdot a}}$$

.. or as BigQuery SQL function:

CREATE TEMP FUNCTION dot(arr1 ARRAY<Float64>, arr2 ARRAY<Float64>) AS ((
  SELECT
    SUM(arr1[OFFSET(idx)] * arr2[OFFSET(idx)])
    FROM UNNEST(GENERATE_ARRAY(0, ARRAY_LENGTH(arr1) - 1)) as idx
));

SELECT sqrt(dot([0., 8, 10.3],[0., 8, 10.3]))
CREATE TEMP FUNCTION normVector(arr1 ARRAY<Float64>) AS ((
  SELECT ARRAY_AGG(a/vectorLength(arr1)) FROM UNNEST(arr1) as a
));

SELECT normVector([0., 8, 10.3])
[0.0, 0.6134096394229249, 0.7897649107570158]

Cosine Similarity

The cosine similarity is one of the most common distances used for embeddings. Although, it is not a real distance (cause it is bounded) it is also used in this example.

$$cos(\pmb x, \pmb y) = \frac {\pmb x \cdot \pmb y}{||\pmb x|| \cdot ||\pmb y||}$$

As one can see, the calculation reduces to the dot product given unit vector. Therefore, we first divide by the dot product between the vector and itself to get unit vectors and then calculate the dot product between those unit vectors.

In addition to the normalization we also add some checks to assert that the arrays have the same length, no nulls are given and the vectors contain at least one component.

CREATE TEMP FUNCTION cosSimilarity(arr1 ARRAY<Float64>, arr2 ARRAY<Float64>) AS ((
  SELECT
    CASE
      WHEN ARRAY_LENGTH(arr1) != ARRAY_LENGTH(arr2) THEN NULL
      WHEN arr1 is NULL OR arr2 is NULL THEN NULL
      WHEN ARRAY_LENGTH(arr1) < 1 THEN NULL
      ELSE ROUND(dot(normVector(arr1), normVector(arr2)), 2) 
    END
));

SELECT cosSimilarity([1., 0.2],[0.8, 10.])
0.27

Those changes lead to the final calculation of a cosine. The value of 0.27 indicates that those vectors are not really similar.

© Christian Hotz-Behofsits. All rights reserved.