Interview problem: Welford's online algorithm
Problem:
Provide an online algorithm to compute the sample mean and sample variance of stream data.
Solution:
Let x_1, x_2, \cdots, x_n, \cdots be the stream data, the sample mean \bar{x}_n \equiv\sum_{i=1}^n x_i / n can be computed by \begin{equation}\bar{x}_n=\bar{x}_{n-1} +\frac{x_n - \bar{x}_{n-1}}{n}\,.\tag{1}\end{equation}
For the sample variance s^2_{n}\equiv \frac{1}{n-1}\sum_{i=1}^n\left(x_i-\bar{x}_n\right)^2, a naive method is to compute \begin{equation} s^2_n = \frac{1}{n-1}\left(\sum_{i=1}^n x_i^2 - \frac{\left(\sum_{i=1}^n x_i\right)^2}{n}\right)\end{equation} by the statistics of the sum and sum of squared. However, such method is numerically unstable since it may involve the cancellation of two very similar numbers. Instead, as suggested in this wiki, we use Welford's algorithm: \begin{eqnarray} M_{2,n} &=& M_{2, n-1}+\left(x_n-\bar{x}_{n-1}\right)\left(x_n-\bar{x}_n\right)\,, \\ s_n^2 &=&\frac{M_{2, n}}{n-1}\,.\tag{2}\end{eqnarray}
Proof: Obviously M_{2,n}=\sum_{i=1}^n\left(x_i-\bar{x}_n\right)^2. Note that \begin{eqnarray}\sum_{i=1}^{n-1} \left(\bar{x}_n-\bar{x}_{n-1}\right)^2&=&(n-1)\bar{x}_n^2+(n-1)\bar{x}^2_{n-1}-2(n-1) \bar{x}_n \bar{x}_{n-1}\\&=&-\bar{x}_n^2 +\bar{x}_n\bar{x}_{n-1} + \bar{x}_n\left[n\bar{x}_n - (n-1)\bar{x}_{n-1}\right] - \bar{x}_{n-1}\left[n\bar{x}_n - (n-1)\bar{x}_{n-1}\right]\\ &=&-\bar{x}_n^2 +\bar{x}_n\bar{x}_{n-1} + \bar{x}_n x_n -\bar{x}_{n-1}x_n\,.\end{eqnarray} As a result, \begin{eqnarray}M_{2,n}&=&\sum_{i=1}^{n-1}\left(x_i-\bar{x}_{n-1} +\bar{x}_{n-1}-\bar{x}_n\right)^2 + \left(x_n-\bar{x}_n\right)^2\\&=&M_{2,n-1} + \sum_{i=1}^{n-1} \left(\bar{x}_n-\bar{x}_{n-1}\right)^2+\left(x_n-\bar{x}_n\right)^2\\&=&M_{2, n-1}+\left(x_n-\bar{x}_{n-1}\right)\left(x_n-\bar{x}_n\right)\,.\end{eqnarray}
bump!
ReplyDelete