Interpolate missing values in SQL
In this I demonstrate how missing values can be interpolated in SQL. The code is written for BigQuery and I provide a user defined function, however, similar solutions may exist for other databases. Although, SQL may not be the first choice to do such tasks, it is possible. Before one can replace missing values, you have to have missing values. At least in my applications, often the whole rows are missing and not just selective fields. In those cases, try to insert the rows before.
Simple average
The most simple method would be to replace the missing values with the columns average like this
SELECT
time,
IFNULL(
value,
AVG(value) OVER()
) value
FROM tbl
In this example, missing values of the column value are replaced by the column’s average value.
Linear interpolation
In the following, I’ll demonstrate a more sophisticated approach. A user defined function is used to interpolate the values as follows: It calculates the mean between the preceeding and following values. If it is the last observation, the last value is replicated and vice versa if it is the first observation. If you use the data afterwards for prediction tasks, you may change this behaviour, because it leaks feature information into the past. Nevertheless, for other use cases it should be fine.
CREATE TEMP FUNCTION arrStripNulls(arr ANY TYPE) AS ((
SELECT ARRAY_AGG(a) FROM UNNEST(arr) a WHERE a is not NULL
));
CREATE TEMP FUNCTION linearInterpolation(val INT64, prev ARRAY<INT64>, next ARRAY<INT64>) AS ((
SELECT
CASE
WHEN val is not NULL
THEN val
WHEN prev[OFFSET(0)] is not NULL and next[OFFSET(0)] is not NULL
THEN (prev[OFFSET(0)] + next[OFFSET(0)]) / 2
ELSE
(arrStripNulls(prev)[SAFE_OFFSET(0)] + arrStripNulls(next)[SAFE_OFFSET(0)]) / 2
END as x
));
These two function can be easily applied to some sample data as follows:
WITH
t as (SELECT [
STRUCT(1 as t, 4 as v),
STRUCT(2 as t, NULL as v),
STRUCT(3 as t, 6 as v),
STRUCT(4 as t, NULL as v),
STRUCT(5 as t, NULL as v),
STRUCT(6 as t, NULL as v),
STRUCT(7 as t, 9 as v),
STRUCT(8 as t, 3 as v),
STRUCT(9 as t, NULL as v),
STRUCT(10 as t, 4 as v)
] as x)
SELECT
tbl.t,
tbl.v as raw,
linearInterpolation(
tbl.v,
ARRAY_AGG(tbl.v) OVER (ORDER BY tbl.t desc ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING),
ARRAY_AGG(tbl.v) OVER (ORDER BY tbl.t asc ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING)
) as v_int
FROM t, unnest(x) as tbl ORDER BY t asc