Interview problem: Welford's online algorithm

Problem: 
Provide an online algorithm to compute the sample mean and sample variance of stream data. 





Solution: Let $x_1, x_2, \cdots, x_n, \cdots$ be the stream data, the sample mean $\bar{x}_n \equiv\sum_{i=1}^n x_i / n$ can be computed by \begin{equation}\bar{x}_n=\bar{x}_{n-1} +\frac{x_n - \bar{x}_{n-1}}{n}\,.\tag{1}\end{equation}
For the sample variance $s^2_{n}\equiv \frac{1}{n-1}\sum_{i=1}^n\left(x_i-\bar{x}_n\right)^2$, a naive method is to compute \begin{equation} s^2_n = \frac{1}{n-1}\left(\sum_{i=1}^n x_i^2 - \frac{\left(\sum_{i=1}^n x_i\right)^2}{n}\right)\end{equation} by the statistics of the sum and sum of squared. However, such method is numerically unstable since it may involve the cancellation of two very similar numbers. Instead, as suggested in this wiki, we use Welford's algorithm: \begin{eqnarray} M_{2,n} &=& M_{2, n-1}+\left(x_n-\bar{x}_{n-1}\right)\left(x_n-\bar{x}_n\right)\,, \\ s_n^2 &=&\frac{M_{2, n}}{n-1}\,.\tag{2}\end{eqnarray}
Proof: Obviously $M_{2,n}=\sum_{i=1}^n\left(x_i-\bar{x}_n\right)^2$. Note that \begin{eqnarray}\sum_{i=1}^{n-1} \left(\bar{x}_n-\bar{x}_{n-1}\right)^2&=&(n-1)\bar{x}_n^2+(n-1)\bar{x}^2_{n-1}-2(n-1) \bar{x}_n \bar{x}_{n-1}\\&=&-\bar{x}_n^2 +\bar{x}_n\bar{x}_{n-1} + \bar{x}_n\left[n\bar{x}_n - (n-1)\bar{x}_{n-1}\right] - \bar{x}_{n-1}\left[n\bar{x}_n - (n-1)\bar{x}_{n-1}\right]\\ &=&-\bar{x}_n^2 +\bar{x}_n\bar{x}_{n-1} + \bar{x}_n x_n -\bar{x}_{n-1}x_n\,.\end{eqnarray} As a result, \begin{eqnarray}M_{2,n}&=&\sum_{i=1}^{n-1}\left(x_i-\bar{x}_{n-1} +\bar{x}_{n-1}-\bar{x}_n\right)^2 + \left(x_n-\bar{x}_n\right)^2\\&=&M_{2,n-1} + \sum_{i=1}^{n-1} \left(\bar{x}_n-\bar{x}_{n-1}\right)^2+\left(x_n-\bar{x}_n\right)^2\\&=&M_{2, n-1}+\left(x_n-\bar{x}_{n-1}\right)\left(x_n-\bar{x}_n\right)\,.\end{eqnarray}

Comments

Post a Comment

Popular posts from this blog

529 Plan

How to offset W2 tax

Retirement Accounts