R-squared

Given n samples y^{(k)} and the corresponding fitted values \hat{y}^{(k)}, one can evaluate the goodness of fit by \begin{equation} R^2 = 1 - \frac{\sum_{k=1}^n \left(y^{(k)}-\hat{y}^{(k)}\right)^2}{\sum_{k=1}^n\left(y^{(k)}-\bar{y}\right)^2}\,,\tag{1}\end{equation}
where \bar{y}\equiv \frac{1}{n}\sum_{k=1}^n y^{(k)}.
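For concreteness, here is a minimal NumPy sketch of Eq. (1); the function name r_squared and the toy data are illustrative choices of mine, not from any particular library.

```python
import numpy as np

def r_squared(y, y_hat):
    """Eq. (1): 1 - SS_res / SS_tot."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    ss_res = np.sum((y - y_hat) ** 2)        # sum of squared residuals
    ss_tot = np.sum((y - y.mean()) ** 2)     # total sum of squares
    return 1.0 - ss_res / ss_tot

y = np.array([1.0, 2.0, 4.0, 5.0])
print(r_squared(y, y))                           # perfect fit: 1.0
print(r_squared(y, np.full_like(y, y.mean())))   # constant fit at the mean: 0.0
```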
Writing Y=\hat{Y}+e and interpreting the sample moments as expectations, we can write R^2 compactly as \begin{equation}R^2=1-\frac{\mathbb{E}(e^2)}{\text{Var}(Y)}\,.\end{equation} In the following, we discuss R^2 in the context of linear regression.
  1. With an intercept term, as shown in this post, we have \mathbb{E}(e)=0 and \text{Var}(Y)=\text{Var}(\hat{Y})+\text{Var}(e). As a result, \begin{equation}R^2=1-\frac{\text{Var}(e)}{\text{Var}(Y)}=\frac{\text{Var}(\hat{Y})}{\text{Var}(Y)}\geq 0\,.\tag{2}\end{equation}
  2. Without an intercept term, as shown in this post, we can only conclude \mathbb{E}(Y^2)=\mathbb{E}(\hat{Y}^2)+\mathbb{E}(e^2), since the residuals need not have zero mean. Substituting \mathbb{E}(e^2)=\mathbb{E}(Y^2)-\mathbb{E}(\hat{Y}^2) gives \begin{equation} R^2=\frac{\mathbb{E}\left(\hat{Y}^2\right)-\left(\mathbb{E}(Y)\right)^2}{\text{Var}(Y)}\,,\end{equation} which can be negative.
  3. In simple linear regression \hat{Y}=\beta_0+\beta X, as shown in this post, \begin{equation}\beta = \frac{\text{Cov}(X, Y)}{\text{Var}(X)} =\rho_{X,Y}\sqrt{\frac{\text{Var}(Y)}{\text{Var}(X)}}\,,\end{equation} and thus \begin{equation}R^2= \frac{\text{Var}(\hat{Y})}{\text{Var}(Y)}=\beta^2 \frac{\text{Var}(X)}{\text{Var}(Y)}=\rho_{X,Y}^2\,.\tag{3}\end{equation} A numerical check of all three items is given below.
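The following sketch checks the three cases numerically; the data, seed, and helper fit_r2 are arbitrary choices, and the fit uses plain least squares rather than any particular regression library.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)

def fit_r2(X, y):
    """Least-squares fit of y on the columns of X, then R^2 per Eq. (1)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1.0 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)

# Items 1 and 3: with an intercept column, R^2 >= 0 and equals rho_{X,Y}^2.
y = 3.0 + 2.0 * x + rng.normal(size=1000)
X_with_intercept = np.column_stack([np.ones_like(x), x])
print(fit_r2(X_with_intercept, y), np.corrcoef(x, y)[0, 1] ** 2)  # the two agree

# Item 2: without an intercept, R^2 can be negative, e.g. when the best
# line through the origin fits worse than the constant fit at the mean.
y_shifted = 10.0 + 0.01 * x + rng.normal(size=1000)
print(fit_r2(x[:, None], y_shifted))  # strongly negative
```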
Problem:
If the R-squared values of the linear regressions Y\sim X_1 and Y\sim X_2 are R_1^2 and R_2^2, respectively, what is the range of the R^2 of the linear regression Y\sim X_1+X_2?

Figure 1: A solid-geometry picture of Y\sim X_1+X_2

Solution:
Without loss of generality, assume Y, X_1, X_2 are centered (the centering is absorbed by the intercept terms). We consider the solid-geometry picture in Fig. 1, in which Y, X_1, X_2 are represented by segments OA, OB, OC, with the length of each segment equal to the standard deviation of the corresponding variable. In this picture, R_1^2=\cos^2 \angle AOB, R_2^2=\cos^2\angle AOC, and R^2=\cos^2\angle AOD, where D is the projection of A onto the plane OBC. The value of R^2 depends on the angle \angle BOC, denoted by \theta. Using solid geometry, one can derive \begin{equation} R^2=\frac{R_1^2+R_2^2\pm 2R_1R_2\cos\theta}{\sin^2\theta}\,.\end{equation}
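This formula can be cross-checked without geometry. Let \rho_1, \rho_2 denote the correlations of Y with X_1, X_2 and \cos\theta the correlation of X_1 with X_2; the standard multiple-correlation identity (quoted here only as a cross-check) gives \begin{equation}R^2=\frac{\rho_1^2+\rho_2^2-2\rho_1\rho_2\cos\theta}{1-\cos^2\theta}\,,\end{equation} and substituting \rho_1=\pm R_1, \rho_2=\pm R_2 recovers the expression above, with the \pm sign accounting for the unknown signs of the two correlations.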
Assuming R_1\geq R_2, we can rewrite this as \begin{equation} R^2=\left(\frac{R_2\pm R_1\cos\theta}{\sin\theta}\right)^2+R_1^2\geq R_1^2\,,\end{equation}
where the equality holds when \cos^2\theta = R_2^2/R_1^2 (OD lies along OB, so X_2 adds nothing beyond X_1). On the other hand, R^2\leq 1, and the equality holds when \cos^2\theta=\left(R_1R_2\pm \sqrt{(1-R_1^2)(1-R_2^2)}\right)^2 (A lies in the plane OBC). In sum, \max(R_1^2, R_2^2)\leq R^2\leq 1\,.
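As a sanity check, here is a small numerical sweep over \theta, with R_1=0.8 and R_2=0.5 chosen arbitrarily. Values above 1 produced by the formula correspond to correlation patterns that cannot actually occur (the 3x3 correlation matrix would not be positive semidefinite), so they are filtered out.

```python
import numpy as np

def r2_combined(r1, r2, cos_t, sign):
    """R^2 of Y ~ X1 + X2 from the formula above; sign picks the +/- branch."""
    return (r1**2 + r2**2 + sign * 2 * r1 * r2 * cos_t) / (1 - cos_t**2)

r1, r2 = 0.8, 0.5
cos_grid = np.linspace(-0.999, 0.999, 4001)
vals = np.concatenate([r2_combined(r1, r2, cos_grid, s) for s in (+1, -1)])
vals = vals[vals <= 1.0]          # keep only feasible correlation configurations
print(vals.min(), vals.max())     # close to max(R1^2, R2^2) = 0.64 and to 1
```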
