# Standard deviation numerically stable calculation for SQL

if you are intrested to receive information please write to

Despite of Standard Deviation is one of most relevant parameter of descriptive statistics, it isn't available as common aggregation function for some SQL server. Here is described how to calculate and use standard deviation in a single select query. Finally a better and numerically stable sql procedure is shown.

First of all recall common formula for sample standard deviation

Common formula is double pass because it requires of average value of dataset. Double pass formula it's hard to implement in SQL and generally it isn’t really efficient.

We can rewrite the formula like here:

Et voilà! Now we can write an SQL statement like this:

SELECT SQRT( (COUNT(x) * SUM(POWER(x,2)) - POWER(SUM(x), 2) ) / (COUNT(x) * (COUNT(x) - 1)) ) FROM data_table

Checking for num of samples and negative sqrt, it becomes:

SELECT CAST ( (CASE WHEN (COUNT(x) > 1) AND ( (COUNT(x) * SUM(POWER(x,2)) - POWER(SUM(x), 2) ) > 0) THEN SQRT((COUNT(x) * SUM(POWER(x,2)) - POWER(SUM(x), 2) ) / (COUNT(x)*(COUNT(x) - 1))) ELSE CAST(NULL AS DOUBLE PRECISION) END) AS double precision) AS stdev FROM data_table

If samples are large and their differences are small we will have a round-off error or arithmetic overflow/underflow.

Try it ! Check the above SQL with a dataset where you have `X[i]=round(0,1)` and `Xbig[i]=x[i]+10000000`. Both standard deviation must be same but you will be surprised to see 2 different values. More exactly the code returns right result for `stdver(X)` but `stdver(Xbig)` is **wrong** !!

Than the above SQL statement has not much sense due to numerically instability when `X` are big numbers. Remember that this issue is present also if you use the common (double pass) formula for standard deviation.

## The solution: running statistics calculation

There are available many algorithms for standard deviation calculation, most of them are numerically stable. In addition there are algorithms that can calculate standard deviation in incremental or recursive way.

One if this is due to Knuth & Welford. Here is a simple implementation in C:

double RunningStDev(double *dataset, int numOfSamples) { if(numOfSamples<2) return 0; int n = 0; double mean = 0; double M2 = 0; for(int i=0;i<numOfSamples;i++) { n++; double delta = x - mean; mean = mean + delta/n; M2 = M2 + delta*(x – mean); } double variance = M2/(n – 1); double stdev = sqrt( variance); return variance; }

Above algorithm can be rewritten as SQL stored procedure. Below is an implementation for Firebird SQL server:

SET TERM ^ ; CREATE OR ALTER PROCEDURE STDEV2 ( tablename varchar(100), fieldname varchar(100)) returns ( wavg double precision, wvar double precision, wstdev double precision) AS declare variable mean double precision; declare variable m2 double precision; declare variable delta double precision; declare variable n integer; declare variable x double precision; begin n = 0; mean = 0.0; m2 = 0.0; FOR execute statement 'SELECT ' || FieldName || ' FROM ' || TableName INTO :x do begin n = n + 1; delta = x - mean; mean = mean + delta/n; M2 = M2+delta*(x-mean); end -- Set the average variable wavg = 0.0; IF(n>0) then wavg = mean; -- Set the variance of sample variable wvar = 0.0; IF(n>1) then wvar = M2 / (n-1); -- Set the Standard Deviation of sample variable wstdev = 0.0; IF (wvar>0) then wstdev = sqrt(wvar); suspend; end ^ SET TERM ; ^ /* Existing privileges on this procedure */ GRANT EXECUTE ON PROCEDURE STDEV2 TO SYSDBA;

Now you can type a simply query to and get average, variance and standard deviation you want !

SELECT * FROM STDEV2('YOUR TABLE NAME', 'YOUR FIELD NAME');

#### Useful link

- Accurately computing running variance (http://www.johndcook.com/standard_deviation.html)
- Algorithms for calculating variance (http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Incremental_algorithm)

This work is property of Pk Lab. You can use it for free but you must retain author's copyright.