Geometric interpretation of linear regression

Hilbert space

Given a linear space $V$ over the field $\mathbb{C}$, we define an inner product, i.e., a map $\langle \cdot, \cdot\rangle : \, V\times V \rightarrow \mathbb{C}$ that satisfies the following axioms for all vectors $\mathbf{x}, \mathbf{y}, \mathbf{z}\in V$ and all scalars $\alpha \in \mathbb{C}$:
  • $\langle \mathbf{x}, \mathbf{y}\rangle =\langle \mathbf{y}, \mathbf{x}\rangle^*$
  • $\langle \mathbf{x}+\mathbf{y}, \mathbf{z}\rangle = \langle \mathbf{x}, \mathbf{z}\rangle + \langle \mathbf{y}, \mathbf{z}\rangle$
  • $\langle \alpha\,\mathbf{x}, \mathbf{y}\rangle=\alpha \langle \mathbf{x}, \mathbf{y}\rangle$
  • $\langle \mathbf{x}, \mathbf{x}\rangle \geq 0$. Equality holds iff $\mathbf{x}=\mathbf{0}\,.$
We can then define the norm by $\left|\left|\mathbf{x}\right|\right| \equiv \sqrt{\langle \mathbf{x}, \mathbf{x}\rangle}$, i.e., $\left|\left|\mathbf{x}\right|\right|^2 = \langle \mathbf{x}, \mathbf{x}\rangle$. The norm satisfies the following two properties:
  • Cauchy-Schwarz inequality: $\left|\langle \mathbf{x}, \mathbf{y}\rangle\right|^2\leq \left|\left|\mathbf{x}\right|\right|^2\,\left|\left|\mathbf{y}\right|\right|^2$.  Proof: for $\mathbf{x}\neq\mathbf{0}$, let $\mathbf{z}\equiv \mathbf{y}-\frac{\langle \mathbf{y}, \mathbf{x}\rangle}{\left|\left|\mathbf{x}\right|\right|^2}\mathbf{x}$; expanding $\left|\left|\mathbf{z}\right|\right|^2\geq 0$ gives the inequality.
  • Triangle inequality: $\left|\left|\mathbf{x}+\mathbf{y}\right|\right|\leq \left|\left|\mathbf{x}\right|\right|+ \left|\left|\mathbf{y}\right|\right|$, which follows directly from the Cauchy-Schwarz inequality.
Finally, a Hilbert space is an inner product space which is complete with respect to the norm.
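As a quick sanity check, here is a minimal numerical sketch of these two inequalities, using NumPy and the standard inner product on $\mathbb{C}^n$; the helper names `inner` and `norm` are just for illustration.

```python
import numpy as np

# Sketch: the standard inner product on C^n, <x, y> = sum_i x_i * conj(y_i),
# which is linear in the first argument and conjugate-symmetric as above.
rng = np.random.default_rng(0)

def inner(x, y):
    return np.sum(x * np.conj(y))

def norm(x):
    return np.sqrt(inner(x, x).real)

x = rng.normal(size=5) + 1j * rng.normal(size=5)
y = rng.normal(size=5) + 1j * rng.normal(size=5)

# Cauchy-Schwarz: |<x, y>|^2 <= ||x||^2 ||y||^2
assert abs(inner(x, y))**2 <= norm(x)**2 * norm(y)**2

# Triangle inequality: ||x + y|| <= ||x|| + ||y||
assert norm(x + y) <= norm(x) + norm(y)
```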

Projection Theorem

Let $\cal{M}$ be a closed subspace of the Hilbert space $\cal{H}$ and let $\mathbf{y}\in {\cal{H}}$. Then $\mathbf{y}$ can be uniquely represented as $\mathbf{y}=\hat{\mathbf{y}}+\mathbf{e}$, where $\hat{\mathbf{y}}\in {\cal{M}}$ and $\mathbf{e}$ is orthogonal to $\cal{M}$, i.e., $\langle \mathbf{w}, \mathbf{e}\rangle=0$ for all $\mathbf{w}\in {\cal{M}}$. Moreover, $\hat{\mathbf{y}}$ is the closest point of $\cal{M}$ to $\mathbf{y}$: $\left|\left| \mathbf{y} - \mathbf{w} \right|\right|\geq \left|\left| \mathbf{y} - \hat{\mathbf{y}} \right|\right|$ for all $\mathbf{w}\in {\cal{M}}$, with equality iff $\mathbf{w}=\hat{\mathbf{y}}$.
Proof: $\left|\left| \mathbf{y} - \mathbf{w} \right|\right|^2 = \left|\left| \mathbf{y} - \hat{\mathbf{y} } +\hat{\mathbf{y} } - \mathbf{w} \right|\right|^2 = \left|\left| \mathbf{y} - \hat{\mathbf{y} } \right|\right|^2 + \left|\left| \hat{\mathbf{y} } - \mathbf{w} \right|\right|^2 \geq \left|\left| \mathbf{y} - \hat{\mathbf{y} } \right|\right|^2$, where the cross term vanishes because $\hat{\mathbf{y}} - \mathbf{w}\in{\cal M}$ is orthogonal to $\mathbf{e}=\mathbf{y} - \hat{\mathbf{y}}$.
Figure 1: Projection theorem in solid geometry
In the solid-geometry picture of Fig. 1, $AD\perp \text{plane}\,OBC$ and $AB\perp OB$. If $\mathbf{y}$ is $OA$ and $\cal{M}$ is the plane $OBC$, then $\mathbf{e}$ is $DA$ and $\hat{\mathbf{y}}$ is $OD$. If instead $\cal{M}$ is the line $OB$, then $\mathbf{e}$ is $BA$ and $\hat{\mathbf{y}}$ is $OB$.
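Here is a minimal numerical sketch of the theorem, assuming ${\cal M}$ is the column space of a random matrix in $\mathbb{R}^{4\times 2}$; the projection formula used below is the standard least-squares one, anticipating Eq. (10).

```python
import numpy as np

# Sketch: project y onto M = column space of X, then check the two claims
# of the projection theorem. X and y are arbitrary toy data.
rng = np.random.default_rng(1)
X = rng.normal(size=(4, 2))   # columns span the closed subspace M of R^4
y = rng.normal(size=4)

# Orthogonal projection of y onto M (least-squares formula, cf. Eq. (10))
y_hat = X @ np.linalg.solve(X.T @ X, X.T @ y)
e = y - y_hat

# (1) e is orthogonal to M, i.e., to every basis vector of M
assert np.allclose(X.T @ e, 0.0)

# (2) y_hat is the closest point of M to y: any other w in M is farther away
w = X @ rng.normal(size=2)
assert np.linalg.norm(y - w) >= np.linalg.norm(y - y_hat)
```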

Linear regression

Assume that $\cal{M}$ is spanned by the basis $\mathbf{x}_0, \mathbf{x}_1, \cdots, \mathbf{x}_{p}$. Linear regression is to find \begin{eqnarray}\hat{\mathbf{y}} = \sum_{j=0}^p \beta_j\, \mathbf{x}_j\tag{1}\end{eqnarray} that minimizes $\left|\left|\mathbf{y}- \hat{\mathbf{y}}\right|\right|^2$.
Let $\mathbf{e}\equiv\mathbf{y}-\hat{\mathbf{y}}$. By the projection theorem, we know that \begin{equation} \left|\left|\mathbf{y}\right|\right|^2 = \left|\left|\hat{\mathbf{y}}\right|\right|^2 + \left|\left|\mathbf{e}\right|\right|^2 \tag{2}\,.\end{equation} Furthermore, $\langle \mathbf{x}_i, \mathbf{y} - \hat{\mathbf{y}}\rangle = 0$. That is, we can solve for the $\beta_j$ from the equations \begin{eqnarray}\sum_{j=0}^p\langle \mathbf{x}_i, \mathbf{x}_j\rangle\,\beta_j  = \langle \mathbf{x}_i,\mathbf{y}\rangle\tag{3}\end{eqnarray} for $i=0, 1, \cdots, p$.
In practice we usually include an intercept term $\mathbf{x}_0\equiv\mathbf{1}$. Eq. (3) then reduces to \begin{equation} \beta_0 = \frac{\langle\mathbf{1}, \mathbf{y}\rangle-\sum_{j=1}^p\langle\mathbf{1},  \mathbf{x}_j\rangle\beta_j}{\langle\mathbf{1},\mathbf{1}\rangle}\,,\end{equation} and \begin{equation} \sum_{j=1}^p\left(\langle\mathbf{x}_i,\mathbf{x}_j\rangle - \frac{\langle\mathbf{x}_i,\mathbf{1}\rangle\langle\mathbf{1},\mathbf{x}_j\rangle}{\langle\mathbf{1},\mathbf{1}\rangle} \right)\beta_j = \langle\mathbf{x}_i,\mathbf{y}\rangle -\frac{\langle\mathbf{x}_i,\mathbf{1}\rangle\langle\mathbf{1},\mathbf{y}\rangle}{\langle\mathbf{1},\mathbf{1}\rangle} \tag{4}\end{equation} for $i=1,\cdots, p$.
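Since Eq. (3) only involves pairwise inner products, a single generic solver works for any choice of inner product. The sketch below is one possible implementation; the function `regress` and the toy data are assumptions made for illustration.

```python
import numpy as np

# Sketch of Eq. (3): build the Gram matrix <x_i, x_j> and the vector <x_i, y>,
# then solve the linear system for beta. `basis` is the list [x_0, ..., x_p]
# and `inner` is whatever inner product the Hilbert space provides.
def regress(basis, y, inner):
    G = np.array([[inner(xi, xj) for xj in basis] for xi in basis])
    b = np.array([inner(xi, y) for xi in basis])
    return np.linalg.solve(G, b)   # [beta_0, ..., beta_p]

# Toy example with the Euclidean inner product on columns of observations
# (the second Hilbert space discussed below):
rng = np.random.default_rng(2)
n = 50
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 3.0 * x2 + 0.1 * rng.normal(size=n)
print(regress([np.ones(n), x1, x2], y, inner=np.dot))   # ~ [1, 2, -3]
```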
 
Two choices of Hilbert space are common in practice:

Hilbert space of random variables

For two real random variables $X$ and $Y$ with finite second moments, an inner product can be defined as $\langle X, Y\rangle=\mathbb{E}(XY)$.
The linear regression here is to find $\hat{Y}=\beta_0+\sum_{j=1}^p\beta_j X_j$ that minimizes $\mathbb{E}(Y-\hat{Y})^2$. Applying this specific inner product of random variables to Eq. (4), we have \begin{equation} \beta_0 = \mathbb{E}(Y) - \sum_{j=1}^p \beta_j\mathbb{E}(X_j) \tag{5}\end{equation} and \begin{equation} \sum_{j=1}^p \text{Cov}(X_i, X_j)\,\beta_j = \text{Cov}(X_i, Y) \tag{6}\end{equation} for $i=1,\cdots, p$.
For simple linear regression $\hat{Y}=\beta_0+\beta_1 X$, \begin{equation}\beta_1=\frac{\text{Cov}(X, Y)}{\text{Var}(X)}\,.\tag{7}\end{equation}

Finally, let $e\equiv Y-\hat{Y}$. From Eq. (2) we have the decomposition \begin{equation}\mathbb{E}\left(Y^2\right)=\mathbb{E}\left(\hat{Y}^2\right)+\mathbb{E}\left(e^2\right)\,.\tag{8}\end{equation} With the intercept term, Eq. (5) gives $\mathbb{E}(Y)=\mathbb{E}(\hat{Y})$ and hence $\mathbb{E}(e)=0$, so Eq. (8) further reduces to \begin{equation}\text{Var}(Y)=\text{Var}(\hat{Y})+\text{Var}(e)\,.\tag{9}\end{equation}
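Here is a Monte Carlo sketch of Eqs. (5), (7) and (9), with expectations approximated by averages over a large simulated sample; the distribution and coefficients are made up for illustration.

```python
import numpy as np

# Sketch: simple linear regression Y_hat = beta0 + beta1 * X, with E, Var, Cov
# approximated by sample averages over a large simulated sample.
rng = np.random.default_rng(3)
n = 200_000
X = rng.normal(loc=1.0, scale=2.0, size=n)
Y = 3.0 + 0.5 * X + rng.normal(size=n)   # noise independent of X

beta1 = np.cov(X, Y, bias=True)[0, 1] / np.var(X)   # Eq. (7): Cov(X,Y)/Var(X)
beta0 = Y.mean() - beta1 * X.mean()                  # Eq. (5)
Y_hat = beta0 + beta1 * X
e = Y - Y_hat

print(beta1)          # close to the true slope 0.5
print(e.mean())       # ~ 0: the intercept forces the residual mean to zero
print(np.var(Y), np.var(Y_hat) + np.var(e))   # Eq. (9): the two sides agree
```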

Hilbert space of observations

For $n$ observations $\left(y^{(k)}, x_1^{(k)}, \cdots, x_p^{(k)} \right)$, we form columns $\mathbf{y}=\left[y^{(1)}, \cdots, y^{(n)}\right]^T$ and $\mathbf{x}_j=\left[x_j^{(1)}, \cdots, x_j^{(n)}\right]^T$ for $j=1,\cdots, p$. The inner product of two columns $\mathbf{x}$ and $\mathbf{y}$ is defined as $\langle \mathbf{x}, \mathbf{y}\rangle=\mathbf{x}^T\mathbf{y}$.
Let $\mathbf{x}_0\equiv [1, \cdots, 1]^T$. The linear regression here is to find $\hat{\mathbf{y}}=\sum_{j=0}^p\beta_j \mathbf{x}_j$ that minimizes $(\mathbf{y}-\hat{\mathbf{y}})^T(\mathbf{y}-\hat{\mathbf{y}})$. Applying this specific inner product of columns to Eq. (3), we have \begin{equation}\boldsymbol{\beta}=(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\,,\tag{10}\end{equation}
where $\boldsymbol{\beta}\equiv[\beta_0, \cdots, \beta_p]^T$ and $\mathbf{X}\equiv[\mathbf{x}_0, \cdots, \mathbf{x}_p]$. Applying it to Eq. (4) instead gives \begin{equation} \beta_0 = \hat{\mu}(\mathbf{y}) - \sum_{j=1}^p \beta_j\hat{\mu}(\mathbf{x}_j)\,,\quad \sum_{j=1}^p \hat{\gamma}(\mathbf{x}_i,\mathbf{x}_j)\beta_j = \hat{\gamma}(\mathbf{x}_i, \mathbf{y})\end{equation} for $i=1, \cdots, p$, where $\hat{\mu}(\mathbf{y})\equiv\frac{1}{n}\sum_{k=1}^n y^{(k)}$ is the sample mean and $\hat{\gamma}(\mathbf{x}_i,\mathbf{x}_j)\equiv \frac{1}{n}\mathbf{x}_i^T \mathbf{x}_j-\hat{\mu}(\mathbf{x}_i)\hat{\mu}(\mathbf{x}_j)$ is the sample covariance.
Compared with Eqs. (5) and (6), the results are identical once the expectation and covariance are replaced by the sample mean and sample covariance, as expected.
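The sketch below computes $\boldsymbol{\beta}$ on a synthetic data set in both ways, directly from Eq. (10) and from the sample-mean/sample-covariance form, and checks that they agree; the toy data are assumptions for illustration.

```python
import numpy as np

# Sketch on toy data: beta from the normal equations (10) versus beta from
# sample means and sample covariances.
rng = np.random.default_rng(4)
n, p = 100, 2
Xf = rng.normal(size=(n, p))                          # features x_1, ..., x_p
y = 1.0 + Xf @ np.array([2.0, -3.0]) + 0.1 * rng.normal(size=n)

# Route 1: beta = (X^T X)^{-1} X^T y with X = [1, x_1, ..., x_p]
X = np.column_stack([np.ones(n), Xf])
beta_direct = np.linalg.solve(X.T @ X, X.T @ y)

# Route 2: solve the sample-covariance system, then recover the intercept
S_xx = np.cov(Xf, rowvar=False, bias=True)            # gamma_hat(x_i, x_j)
s_xy = np.array([np.cov(Xf[:, j], y, bias=True)[0, 1] for j in range(p)])
beta_slopes = np.linalg.solve(S_xx, s_xy)
beta0 = y.mean() - Xf.mean(axis=0) @ beta_slopes

assert np.allclose(beta_direct, np.concatenate([[beta0], beta_slopes]))
```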
