Interpolate missing values in SQL

May 2020 · 2 minute read

In this I demonstrate how missing values can be interpolated in SQL. The code is written for BigQuery and I provide a user defined function, however, similar solutions may exist for other databases. Although, SQL may not be the first choice to do such tasks, it is possible. Before one can replace missing values, you have to have missing values. At least in my applications, often the whole rows are missing and not just selective fields. In those cases, try to insert the rows before.

Simple average

The most simple method would be to replace the missing values with the columns average like this


SELECT 
	time, 
	IFNULL(
		value,
		AVG(value) OVER()
	) value
FROM tbl 

In this example, missing values of the column value are replaced by the column’s average value.

Linear interpolation

In the following, I’ll demonstrate a more sophisticated approach. A user defined function is used to interpolate the values as follows: It calculates the mean between the preceeding and following values. If it is the last observation, the last value is replicated and vice versa if it is the first observation. If you use the data afterwards for prediction tasks, you may change this behaviour, because it leaks feature information into the past. Nevertheless, for other use cases it should be fine.

CREATE TEMP FUNCTION arrStripNulls(arr ANY TYPE) AS ((
  SELECT ARRAY_AGG(a) FROM UNNEST(arr) a WHERE a is not NULL
));

CREATE TEMP FUNCTION linearInterpolation(val INT64, prev ARRAY<INT64>, next ARRAY<INT64>) AS ((
  SELECT 
    CASE
      WHEN val is not NULL
        THEN val
      WHEN prev[OFFSET(0)] is not NULL and next[OFFSET(0)] is not NULL
        THEN (prev[OFFSET(0)] + next[OFFSET(0)]) / 2
      ELSE 
        (arrStripNulls(prev)[SAFE_OFFSET(0)] + arrStripNulls(next)[SAFE_OFFSET(0)]) / 2
    END as x
));

These two function can be easily applied to some sample data as follows:

WITH 
  t as (SELECT [
    STRUCT(1 as t, 4 as v), 
    STRUCT(2 as t, NULL as v), 
    STRUCT(3 as t, 6 as v),
    STRUCT(4 as t, NULL as v),
    STRUCT(5 as t, NULL as v),
    STRUCT(6 as t, NULL as v),
    STRUCT(7 as t, 9 as v),
    STRUCT(8 as t, 3 as v),
    STRUCT(9 as t, NULL as v),
    STRUCT(10 as t, 4 as v)
   ] as x)
  
  
SELECT 
  tbl.t,
  tbl.v as raw,
  linearInterpolation(
    tbl.v,
    ARRAY_AGG(tbl.v) OVER (ORDER BY tbl.t desc ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING),
    ARRAY_AGG(tbl.v) OVER (ORDER BY tbl.t asc ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING)
  ) as v_int
  FROM t, unnest(x) as tbl ORDER BY t asc
© Christian Hotz-Behofsits. All rights reserved.