Geometric interpretation of linear regression
Hilbert space
Given a linear space $V$ over the field $\mathbb{C}$, we define an inner product, i.e., a map $\langle \cdot, \cdot\rangle : \, V\times V \rightarrow \mathbb{C}$ that satisfies the following axioms for all vectors $\mathbf{x}, \mathbf{y}, \mathbf{z}\in V$ and all scalars $\alpha \in \mathbb{C}$:
- $\langle \mathbf{x}, \mathbf{y}\rangle =\langle \mathbf{y}, \mathbf{x}\rangle^*$
- $\langle \mathbf{x}+\mathbf{y}, \mathbf{z}\rangle = \langle \mathbf{x}, \mathbf{z}\rangle + \langle \mathbf{y}, \mathbf{z}\rangle$
- $\langle \alpha\,\mathbf{x}, \mathbf{y}\rangle=\alpha \langle \mathbf{x}, \mathbf{y}\rangle$
- $\langle \mathbf{x}, \mathbf{x}\rangle \geq 0$. Equality holds iff $\mathbf{x}=\mathbf{0}\,.$
- Cauchy-Schwarz inequality: with the induced norm $\left|\left|\mathbf{x}\right|\right|^2\equiv\langle \mathbf{x}, \mathbf{x}\rangle$, the axioms imply $\left|\langle \mathbf{x}, \mathbf{y}\rangle\right|^2\leq \left|\left|\mathbf{x}\right|\right|^2\,\left|\left|\mathbf{y}\right|\right|^2$. Proof: Let $\mathbf{z}\equiv \mathbf{y}-\frac{\langle \mathbf{y}, \mathbf{x}\rangle}{\left|\left|\mathbf{x}\right|\right|^2}\mathbf{x}$ and use the property $\left|\left|\mathbf{z}\right|\right|^2\geq 0$, as worked out below.
- Triangle inequality: $\left|\left|\mathbf{x}+\mathbf{y}\right|\right|\leq \left|\left|\mathbf{x}\right|\right|+ \left|\left|\mathbf{y}\right|\right|$, which follows directly from the Cauchy-Schwarz inequality.
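For completeness, the hinted computation works out as follows, assuming $\mathbf{x}\neq\mathbf{0}$ (the case $\mathbf{x}=\mathbf{0}$ is trivial). By construction $\langle \mathbf{z}, \mathbf{x}\rangle = \langle\mathbf{y},\mathbf{x}\rangle-\frac{\langle\mathbf{y},\mathbf{x}\rangle}{\left|\left|\mathbf{x}\right|\right|^2}\langle\mathbf{x},\mathbf{x}\rangle=0$, so that $\left|\left|\mathbf{z}\right|\right|^2=\langle\mathbf{z},\mathbf{y}\rangle$ and
\begin{equation} 0\leq\left|\left|\mathbf{z}\right|\right|^2 = \langle\mathbf{z},\mathbf{y}\rangle = \left|\left|\mathbf{y}\right|\right|^2 - \frac{\langle\mathbf{y},\mathbf{x}\rangle\,\langle\mathbf{x},\mathbf{y}\rangle}{\left|\left|\mathbf{x}\right|\right|^2} = \left|\left|\mathbf{y}\right|\right|^2 - \frac{\left|\langle\mathbf{x},\mathbf{y}\rangle\right|^2}{\left|\left|\mathbf{x}\right|\right|^2}\,,\end{equation}
which rearranges to the Cauchy-Schwarz inequality.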
Projection Theorem
Let $\cal{M}$ be a closed subspace of the Hilbert space $\cal{H}$ and let $\mathbf{y}\in {\cal{H}}$. Then $\mathbf{y}$ can be uniquely represented as $\mathbf{y}=\hat{\mathbf{y}}+\mathbf{e}$, where $\hat{\mathbf{y}}\in {\cal{M}}$ and $\mathbf{e}$ is orthogonal to $\cal{M}$. That is,
for all $\mathbf{w}\in {\cal{M}}$, $\langle \mathbf{w}, \mathbf{e}\rangle=0$ and $\left|\left| \mathbf{y} - \mathbf{w} \right|\right|\geq \left|\left| \mathbf{y} - \hat{\mathbf{y}} \right|\right|$,
where equality holds iff $\mathbf{w}=\hat{\mathbf{y}}$.
Proof of the minimizing property: since $\hat{\mathbf{y}}-\mathbf{w}\in {\cal{M}}$ while $\mathbf{y}-\hat{\mathbf{y}}=\mathbf{e}$ is orthogonal to ${\cal{M}}$, the cross term vanishes and $\left|\left| \mathbf{y} - \mathbf{w} \right|\right|^2 = \left|\left| \mathbf{y} - \hat{\mathbf{y}} +\hat{\mathbf{y}} - \mathbf{w} \right|\right|^2 = \left|\left| \mathbf{y} - \hat{\mathbf{y}} \right|\right|^2 + \left|\left| \hat{\mathbf{y}} - \mathbf{w} \right|\right|^2 \geq \left|\left| \mathbf{y} - \hat{\mathbf{y}} \right|\right|^2$.
Figure 1: Projection theorem in solid geometry.
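The theorem can also be checked numerically. Below is a minimal sketch (the vectors and the use of NumPy are illustrative choices of mine, not part of the original argument): it projects a vector of $\mathbb{R}^3$ onto the plane spanned by two vectors, then verifies that the residual is orthogonal to the plane and that the projection is the closest point of the plane.

```python
import numpy as np

# Basis of the plane M and the vector y to be projected (arbitrary choices).
x1 = np.array([1.0, 0.0, 1.0])
x2 = np.array([0.0, 2.0, 1.0])
y = np.array([3.0, 1.0, 4.0])

# Projection of y onto M = span{x1, x2} via the normal equations.
X = np.column_stack([x1, x2])
beta = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta          # the projection
e = y - y_hat             # the residual

print(np.allclose(X.T @ e, 0.0))   # True: e is orthogonal to x1 and x2

# Any other point w of M is at least as far from y as y_hat is.
rng = np.random.default_rng(0)
for _ in range(5):
    w = X @ rng.normal(size=2)
    assert np.linalg.norm(y - w) >= np.linalg.norm(y - y_hat)
```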
Linear regression
Assume $\cal{M}$ is spanned by the basis $\mathbf{x}_0, \mathbf{x}_1, \cdots, \mathbf{x}_{p}$. Linear regression is to find \begin{eqnarray}\hat{\mathbf{y}} = \sum_{j=0}^p \beta_j\, \mathbf{x}_j\tag{1}\end{eqnarray} that minimizes $\left|\left|\mathbf{y}- \hat{\mathbf{y}}\right|\right|^2$.
Let $\mathbf{e}\equiv\mathbf{y}-\hat{\mathbf{y}}$. By the projection theorem, we know that \begin{equation} \left|\left|\mathbf{y}\right|\right|^2 = \left|\left|\hat{\mathbf{y}}\right|\right|^2 + \left|\left|\mathbf{e}\right|\right|^2 \tag{2}\,.\end{equation} Furthermore, $\langle \mathbf{x}_i, \mathbf{y} - \hat{\mathbf{y}}\rangle = 0$. That is, we can solve for the $\beta_j$ from the equations \begin{eqnarray}\sum_{j=0}^p\langle \mathbf{x}_i, \mathbf{x}_j\rangle\,\beta_j = \langle \mathbf{x}_i,\mathbf{y}\rangle\tag{3}\end{eqnarray} for $i=0, 1, \cdots, p$.
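As a sketch of how Eq. (3) can be used directly, the snippet below (an illustration of mine; the helper `solve_normal_equations`, the choice of inner product, and the data are assumptions, not from the original) builds the Gram matrix $\langle\mathbf{x}_i,\mathbf{x}_j\rangle$ and the right-hand side $\langle\mathbf{x}_i,\mathbf{y}\rangle$, then solves the resulting linear system for the coefficients.

```python
import numpy as np

def solve_normal_equations(basis, y, inner):
    """Solve Eq. (3): sum_j <x_i, x_j> beta_j = <x_i, y> for the beta_j."""
    G = np.array([[inner(xi, xj) for xj in basis] for xi in basis])  # Gram matrix
    b = np.array([inner(xi, y) for xi in basis])                     # <x_i, y>
    return np.linalg.solve(G, b)

# Example with the Euclidean inner product on columns of 5 observations.
inner = lambda u, v: u @ v
x0 = np.ones(5)                                   # intercept basis vector
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
beta = solve_normal_equations([x0, x1], y, inner)
print(beta)                                       # [beta_0, beta_1]
```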
In practice we usually include an "intercept" term $\mathbf{x}_0\equiv\mathbf{1}$. Taking $i=0$ in Eq. (3) and solving for $\beta_0$ gives \begin{equation} \beta_0 = \frac{\langle\mathbf{1}, \mathbf{y}\rangle-\sum_{j=1}^p\langle\mathbf{1}, \mathbf{x}_j\rangle\beta_j}{\langle\mathbf{1},\mathbf{1}\rangle}\,,\end{equation} and substituting this back into the remaining equations yields \begin{equation} \sum_{j=1}^p\left(\langle\mathbf{x}_i,\mathbf{x}_j\rangle - \frac{\langle\mathbf{x}_i,\mathbf{1}\rangle\langle\mathbf{1},\mathbf{x}_j\rangle}{\langle\mathbf{1},\mathbf{1}\rangle} \right)\beta_j = \langle\mathbf{x}_i,\mathbf{y}\rangle -\frac{\langle\mathbf{x}_i,\mathbf{1}\rangle\langle\mathbf{1},\mathbf{y}\rangle}{\langle\mathbf{1},\mathbf{1}\rangle} \tag{4}\end{equation} for $i=1,\cdots, p$.
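Spelled out, the equation being substituted into is the $i$-th row of (3) with $i\geq 1$ and the intercept term written separately, \begin{equation}\langle\mathbf{x}_i,\mathbf{1}\rangle\,\beta_0+\sum_{j=1}^p\langle\mathbf{x}_i,\mathbf{x}_j\rangle\,\beta_j=\langle\mathbf{x}_i,\mathbf{y}\rangle\,,\end{equation} so replacing $\beta_0$ by the expression above and collecting the terms proportional to each $\beta_j$ gives Eq. (4).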
There are two common Hilbert spaces:
Hilbert space of random variables
For two random variables $X$ and $Y$, their inner product can be defined as $\langle X, Y\rangle=\mathbb{E}(XY)$.
The linear regression here is to find $\hat{Y}=\beta_0+\sum_{j=1}^p\beta_j X_j$ that minimizes $\mathbb{E}(Y-\hat{Y})^2$. By applying this specific inner product of random variables in (4), we have \begin{equation} \beta_0 = \mathbb{E}(Y) - \sum_{j=1}^p \beta_j\mathbb{E}(X_j) \tag{5}\end{equation} and \begin{equation} \sum_{j=1}^p \text{Cov}(X_i, X_j)\,\beta_j = \text{Cov}(X_i, Y) \tag{6}\end{equation} for $i=1,\cdots, p$.
For simple linear regression $\hat{Y}=\beta_0+\beta_1 X$, \begin{equation}\beta_1=\frac{\text{Cov}(X, Y)}{\text{Var}(X)}\,.\tag{7}\end{equation}
Finally, let $e\equiv Y-\hat{Y}$; from Eq. (2) we have the decomposition \begin{equation}\mathbb{E}\left(Y^2\right)=\mathbb{E}\left(\hat{Y}^2\right)+\mathbb{E}\left(e^2\right)\,.\tag{8}\end{equation} Only when the intercept term is included (so that the constant $1$ lies in ${\cal{M}}$) do we have $\mathbb{E}(e)=\langle 1, e\rangle=0$ and hence $\mathbb{E}(Y)=\mathbb{E}(\hat{Y})$, which is just Eq. (5); Eq. (8) then further reduces to \begin{equation}\text{Var}(Y)=\text{Var}(\hat{Y})+\text{Var}(e)\,.\tag{9}\end{equation}
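These population formulas can be illustrated numerically by replacing expectations with averages over a large synthetic sample. The sketch below is my own illustration (the data-generating process and the use of NumPy are arbitrary assumptions): it computes $\beta_1$ and $\beta_0$ from Eqs. (7) and (5) and checks the decomposition (9).

```python
import numpy as np

# Approximate the population quantities with averages over a large sample.
rng = np.random.default_rng(0)
n = 200_000
X = rng.normal(size=n)
Y = 1.5 + 2.0 * X + rng.normal(scale=0.5, size=n)   # "true" intercept 1.5, slope 2.0

# Eq. (7): beta_1 = Cov(X, Y) / Var(X); Eq. (5): beta_0 = E(Y) - beta_1 E(X).
beta1 = np.cov(X, Y, bias=True)[0, 1] / np.var(X)
beta0 = Y.mean() - beta1 * X.mean()
print(beta0, beta1)                                  # close to 1.5 and 2.0

# Eq. (9): Var(Y) = Var(Y_hat) + Var(e).
Y_hat = beta0 + beta1 * X
e = Y - Y_hat
print(np.var(Y), np.var(Y_hat) + np.var(e))          # the two values agree
```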
Hilbert space of observations
For $n$ observations $\left(y^{(k)}, x_1^{(k)}, \cdots, x_p^{(k)} \right)$, $k=1,\cdots,n$, we form the columns $\mathbf{y}=\left[y^{(1)}, \cdots, y^{(n)}\right]^T$ and $\mathbf{x}_j=\left[x_j^{(1)}, \cdots, x_j^{(n)}\right]^T$ for $j=1,\cdots, p$. The inner product of two columns $\mathbf{x}$ and $\mathbf{y}$ is defined as $\langle \mathbf{x}, \mathbf{y}\rangle=\mathbf{x}^T\cdot\mathbf{y}$.
Let $\mathbf{x}_0\equiv [1, \cdots, 1]^T$. The linear regression here is to find $\hat{\mathbf{y}}=\sum_{j=0}^p\beta_j \mathbf{x}_j$ that minimizes $(\mathbf{y}-\hat{\mathbf{y}})^T\cdot(\mathbf{y}-\hat{\mathbf{y}})$. By applying this specific inner product of columns in (3), the $p+1$ equations assemble into $\mathbf{X}^T\mathbf{X}\,\boldsymbol{\beta}=\mathbf{X}^T\mathbf{y}$, so that \begin{equation}\boldsymbol{\beta}=(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\,,\tag{10}\end{equation}
where $\boldsymbol{\beta}\equiv[\beta_0, \cdots, \beta_p]^T$ and $\mathbf{X}\equiv[\mathbf{x}_0, \cdots, \mathbf{x}_p]$. Alternatively, applying it to (4), we have \begin{equation} \beta_0 = \hat{\mu}(\mathbf{y}) - \sum_{j=1}^p \beta_j\hat{\mu}(\mathbf{x}_j)\,,\quad \sum_{j=1}^p \hat{\gamma}(\mathbf{x}_i,\mathbf{x}_j)\beta_j = \hat{\gamma}(\mathbf{x}_i, \mathbf{y})\end{equation} for $i=1, \cdots, p$, where $\hat{\mu}(\mathbf{y})\equiv\frac{1}{n}\sum_{k=1}^n y^{(k)}$ is the sample mean and $\hat{\gamma}(\mathbf{x}_i,\mathbf{x}_j)\equiv \frac{1}{n}\mathbf{x}_i^T\cdot \mathbf{x}_j-\hat{\mu}(\mathbf{x}_i)\hat{\mu}(\mathbf{x}_j)$ is the sample covariance.
Compared with Eqs. (5) and (6), the results are identical once the expectation and covariance are replaced by the sample mean and sample covariance, as expected.
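As a quick numerical cross-check (my own sketch; the synthetic data, NumPy usage, and variable names are assumptions, not from the original), the snippet below computes the coefficients both from Eq. (10) and from the sample-mean/sample-covariance form and confirms they coincide.

```python
import numpy as np

# Synthetic data: p = 3 predictors plus noise (arbitrary illustration).
rng = np.random.default_rng(1)
n, p = 100, 3
X1 = rng.normal(size=(n, p))                       # columns x_1, ..., x_p
y = 2.0 + X1 @ np.array([1.0, -0.5, 0.3]) + rng.normal(scale=0.1, size=n)

# Matrix form, Eq. (10), with an all-ones column x_0.
# (Solving the normal equations is numerically preferable to forming the inverse.)
X = np.column_stack([np.ones(n), X1])
beta_matrix = np.linalg.solve(X.T @ X, X.T @ y)

# Sample-covariance form: solve sum_j gamma(x_i, x_j) beta_j = gamma(x_i, y),
# then recover beta_0 from the sample means.
G = np.cov(X1, rowvar=False, bias=True)            # p x p sample covariances
c = np.array([np.cov(X1[:, i], y, bias=True)[0, 1] for i in range(p)])
beta_slopes = np.linalg.solve(G, c)
beta0 = y.mean() - X1.mean(axis=0) @ beta_slopes

print(np.allclose(beta_matrix, np.concatenate([[beta0], beta_slopes])))  # True
```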