R-squared

Given $n$ samples $y^{(k)}$ and the corresponding fitted values $\hat{y}^{(k)}$, one can evaluate the goodness of fit by \begin{equation} R^2 = 1 - \frac{\sum_{k=1}^n \left(y^{(k)}-\hat{y}^{(k)}\right)^2}{\sum_{k=1}^n\left(y^{(k)}-\bar{y}\right)^2}\,,\tag{1}\end{equation}
where $\bar{y}\equiv \frac{1}{n}\sum_{k=1}^n y^{(k)}$ is the sample mean.
Writing $Y=\hat{Y}+e$, where $e$ is the residual, we can express $R^2$ compactly as \begin{equation}R^2=1-\frac{\mathbb{E}(e^2)}{\text{Var}(Y)}\,.\end{equation} In the following, we discuss $R^2$ in the context of linear regression.
  1. With an intercept term, as shown in this post, we have $\mathbb{E}(e)=0$ and $\text{Var}(Y)=\text{Var}(\hat{Y})+\text{Var}(e)$. As a result, \begin{equation}R^2=1-\frac{\text{Var}(e)}{\text{Var}(Y)}=\frac{\text{Var}(\hat{Y})}{\text{Var}(Y)}\geq 0\,.\tag{2}\end{equation}
  2. Without an intercept term, as shown in this post, we only have $\mathbb{E}(Y^2)=\mathbb{E}(\hat{Y}^2)+\mathbb{E}(e^2)$. As a result, \begin{equation} R^2=\frac{\mathbb{E}\left(\hat{Y}^2\right)-\left(\mathbb{E}(Y)\right)^2}{\text{Var}(Y)}\,,\end{equation}which can be negative.
  3. In simple linear regression $\hat{Y}=\beta_0+\beta X$, as shown in this post, \begin{equation}\beta = \frac{\text{Cov}(X, Y)}{\text{Var}(X)} =\rho_{X,Y}\sqrt{\frac{\text{Var}(Y)}{\text{Var}(X)}}\,,\end{equation} and thus \begin{equation}R^2= \frac{\text{Var}(\hat{Y})}{\text{Var}(Y)}=\beta^2 \frac{\text{Var}(X)}{\text{Var}(Y)}=\rho_{X,Y}^2\,.\tag{3}\end{equation}
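The three cases above can be checked numerically. The following is a minimal sketch using NumPy; the helper names `r_squared` and `ols_fit` and the synthetic data (a large offset with a weak slope) are assumptions chosen for illustration, not part of the original post.

```python
import numpy as np


def r_squared(y, y_hat):
    """R^2 as in Eq. (1): one minus residual sum of squares over total sum of squares."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def ols_fit(x, y, intercept=True):
    """Return least-squares fitted values, with or without an intercept column."""
    X = np.asarray(x, dtype=float).reshape(len(y), -1)
    if intercept:
        X = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ coef


# Synthetic data (illustrative assumption): mean of y far from zero, weak slope.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 5.0 + 0.1 * x + rng.normal(size=200)

y_hat_int = ols_fit(x, y, intercept=True)
y_hat_no = ols_fit(x, y, intercept=False)

r2_int = r_squared(y, y_hat_int)   # case 1: equals Var(y_hat) / Var(y)
r2_no = r_squared(y, y_hat_no)     # case 2: can be negative
rho = np.corrcoef(x, y)[0, 1]      # case 3: rho**2 equals r2_int

print("with intercept:   ", r2_int, np.var(y_hat_int) / np.var(y), rho ** 2)
print("without intercept:", r2_no)
```

With an intercept, the three printed quantities on the first line should agree up to floating-point error. Without one, the fit through the origin cannot capture the offset, so it does worse than the constant predictor $\bar{y}$ and $R^2$ drops below zero.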
Problem:
If the R-squared values of the linear regressions $Y\sim X_1$ and $Y\sim X_2$ are $R^2_1$ and $R^2_2$, respectively, what is the range of $R^2$ for the linear regression $Y\sim X_1+X_2$?

Figure 1: a solid geometry picture of $Y\sim X_1+X_2$

Solution:
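Without loss of generality, assume $Y, X_1, X_2$ are centered (the intercepts absorb the means). We consider a solid geometry picture as in Fig. 1, in which $Y, X_1, X_2$ are represented by the line segments OA, OB, OC, and the variance of each variable is represented by the squared length of the corresponding segment. Let D be the foot of the perpendicular from A to the plane OBC, so that OD represents the fitted value $\hat{Y}$. In such a picture, $R_1^2=\cos^2 \angle AOB$, $R_2^2=\cos^2\angle AOC$ and $R^2=\cos^2\angle AOD$. The value of $R^2$ depends on the angle $\angle BOC$, denoted by $\theta$. Using solid geometry, with $R_1, R_2\geq 0$ and $\sin\theta\neq 0$, we can derive \begin{equation} R^2=\frac{R_1^2+R_2^2\pm 2R_1R_2\cos\theta}{\sin^2\theta}\,,\end{equation} where the sign depends on the signs of the correlations between $Y$ and $X_1, X_2$.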
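One possible route to this formula is sketched here in vector notation that is not part of the original figure. Let $\mathbf{a}, \mathbf{b}, \mathbf{c}$ be unit vectors along OA, OB, OC, and write $r_1=\mathbf{a}\cdot\mathbf{b}$, $r_2=\mathbf{a}\cdot\mathbf{c}$ (so that $R_i=|r_i|$) and $\mathbf{b}\cdot\mathbf{c}=\cos\theta$. Assuming $\sin\theta\neq 0$, the projection of $\mathbf{a}$ onto the plane OBC is $\mathbf{d}=u\,\mathbf{b}+v\,\mathbf{c}$, where $\mathbf{a}-\mathbf{d}$ is orthogonal to both $\mathbf{b}$ and $\mathbf{c}$: \begin{equation} u+v\cos\theta=r_1\,,\quad u\cos\theta+v=r_2 \quad\Longrightarrow\quad u=\frac{r_1-r_2\cos\theta}{\sin^2\theta}\,,\quad v=\frac{r_2-r_1\cos\theta}{\sin^2\theta}\,.\end{equation} Since $\mathbf{a}$ is a unit vector and $\mathbf{d}\cdot(\mathbf{a}-\mathbf{d})=0$, we have $R^2=|\mathbf{d}|^2=\mathbf{a}\cdot\mathbf{d}$, and therefore \begin{equation} R^2=u\,r_1+v\,r_2=\frac{r_1^2+r_2^2-2r_1r_2\cos\theta}{\sin^2\theta}\,.\end{equation} Replacing $r_i$ by $R_i=|r_i|$ gives the $\pm$ form: the sign is $-$ when $r_1$ and $r_2$ have the same sign and $+$ otherwise.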
Assuming $R_1\geq R_2$, we can rewrite this as \begin{equation} R^2=\left(\frac{R_2\pm R_1\cos\theta}{\sin\theta}\right)^2+R_1^2\geq R^2_1\,,\end{equation}
where equality holds when $\cos^2\theta = R_2^2/R_1^2$ (OD lies along OB). On the other hand, $R^2\leq 1$, with equality when $\cos^2\theta=\left(R_1R_2\pm \sqrt{(1-R_1^2)(1-R_2^2)}\right)^2$ (A lies in the plane OBC). In sum, $\max(R_1^2, R_2^2)\leq R^2\leq 1$.
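The result can be sanity-checked numerically. The sketch below uses the signed correlations $r_1, r_2$ and $r_{12}=\cos\theta$, so the $\pm$ becomes a single minus sign. The function names `r2_two_predictors` and `r2_ols`, the synthetic data, and the test values $R_1=0.6$, $R_2=0.3$ are illustrative assumptions.

```python
import numpy as np


def r2_two_predictors(r1, r2, r12):
    """R^2 of Y ~ X1 + X2 from signed correlations r1, r2 and r12 = cos(theta)."""
    return (r1 ** 2 + r2 ** 2 - 2.0 * r1 * r2 * r12) / (1.0 - r12 ** 2)


def r2_ols(X, y):
    """R^2 of a least-squares fit of y on the columns of X plus an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)


# Synthetic correlated predictors (illustrative assumption).
rng = np.random.default_rng(1)
n = 500
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(size=n)
y = x1 - 0.5 * x2 + rng.normal(size=n)

corr = np.corrcoef([x1, x2, y])
r12, r1, r2 = corr[0, 1], corr[0, 2], corr[1, 2]

r2_formula = r2_two_predictors(r1, r2, r12)
r2_direct = r2_ols(np.column_stack([x1, x2]), y)
print("formula vs direct fit:", r2_formula, r2_direct)
print("lower bound max(R1^2, R2^2):", max(r1 ** 2, r2 ** 2))

# Extreme cases for R1 = 0.6, R2 = 0.3 (both correlations taken positive).
lo_case = r2_two_predictors(0.6, 0.3, 0.3 / 0.6)   # cos(theta) = R2 / R1
hi_cos = 0.6 * 0.3 + np.sqrt((1 - 0.6 ** 2) * (1 - 0.3 ** 2))
hi_case = r2_two_predictors(0.6, 0.3, hi_cos)      # A in the plane OBC
print("extremes:", lo_case, hi_case)
```

The formula and the direct fit should agree up to floating-point error, and the two extreme cases should evaluate to $R_1^2=0.36$ and $1$, matching the bounds derived above.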
