Interview problem: Welford's online algorithm
Problem:
Provide an online algorithm to compute the sample mean and sample variance of stream data.
Solution:
Let $x_1, x_2, \cdots, x_n, \cdots$ be the stream data, the sample mean $\bar{x}_n \equiv\sum_{i=1}^n x_i / n$ can be computed by \begin{equation}\bar{x}_n=\bar{x}_{n-1} +\frac{x_n - \bar{x}_{n-1}}{n}\,.\tag{1}\end{equation}
For the sample variance $s^2_{n}\equiv \frac{1}{n-1}\sum_{i=1}^n\left(x_i-\bar{x}_n\right)^2$, a naive method is to compute \begin{equation} s^2_n = \frac{1}{n-1}\left(\sum_{i=1}^n x_i^2 - \frac{\left(\sum_{i=1}^n x_i\right)^2}{n}\right)\end{equation} by the statistics of the sum and sum of squared. However, such method is numerically unstable since it may involve the cancellation of two very similar numbers. Instead, as suggested in this wiki, we use Welford's algorithm: \begin{eqnarray} M_{2,n} &=& M_{2, n-1}+\left(x_n-\bar{x}_{n-1}\right)\left(x_n-\bar{x}_n\right)\,, \\ s_n^2 &=&\frac{M_{2, n}}{n-1}\,.\tag{2}\end{eqnarray}
Proof: Obviously $M_{2,n}=\sum_{i=1}^n\left(x_i-\bar{x}_n\right)^2$. Note that \begin{eqnarray}\sum_{i=1}^{n-1} \left(\bar{x}_n-\bar{x}_{n-1}\right)^2&=&(n-1)\bar{x}_n^2+(n-1)\bar{x}^2_{n-1}-2(n-1) \bar{x}_n \bar{x}_{n-1}\\&=&-\bar{x}_n^2 +\bar{x}_n\bar{x}_{n-1} + \bar{x}_n\left[n\bar{x}_n - (n-1)\bar{x}_{n-1}\right] - \bar{x}_{n-1}\left[n\bar{x}_n - (n-1)\bar{x}_{n-1}\right]\\ &=&-\bar{x}_n^2 +\bar{x}_n\bar{x}_{n-1} + \bar{x}_n x_n -\bar{x}_{n-1}x_n\,.\end{eqnarray} As a result, \begin{eqnarray}M_{2,n}&=&\sum_{i=1}^{n-1}\left(x_i-\bar{x}_{n-1} +\bar{x}_{n-1}-\bar{x}_n\right)^2 + \left(x_n-\bar{x}_n\right)^2\\&=&M_{2,n-1} + \sum_{i=1}^{n-1} \left(\bar{x}_n-\bar{x}_{n-1}\right)^2+\left(x_n-\bar{x}_n\right)^2\\&=&M_{2, n-1}+\left(x_n-\bar{x}_{n-1}\right)\left(x_n-\bar{x}_n\right)\,.\end{eqnarray}
bump!
ReplyDelete