Interview problem: linear regression (1)

Problem:
A collection of 2D points $(x_i, y_i)$ is sampled i.i.d. from the uniform distribution on the triangular region $0\leq X \leq 1, 0 \leq Y \leq 1, X+Y\geq 3/2$. If we fit a linear regression $y=kx+b$ to the sampled points, what are the expected values of $k$ and $b$?





Brute force solution:
Linear regression solves the least-squares problem $\min_{k, b} \mathbb{E}\left(Y-kX-b\right)^2$. Setting the derivatives with respect to $k$ and $b$ to zero gives $k=\frac{\text{Cov}(X, Y)}{\text{Var}(X)}$ and $b=\mathbb{E}(Y)-k\,\mathbb{E}(X)$. For the given distribution, we can compute $\mathbb{E}(X)=\mathbb{E}(Y)=5/6$, $\text{Cov}(X, Y)=-1/144$ and $\text{Var}(X)=1/72$. The final answer is $k=-1/2$ and $b=5/4$.
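The moments quoted above can be checked by direct integration. This is a sketch, assuming the density is the constant $8$ on the triangle (its area is $1/8$), so that the marginal density of $X$ is $f_X(x)=8(x-1/2)$ on $[1/2, 1]$; $\mathbb{E}(Y)=\mathbb{E}(X)$ follows from the symmetry of the region in $X$ and $Y$.

$$
\begin{aligned}
\mathbb{E}(X) &= \int_{1/2}^{1} 8x\left(x-\tfrac{1}{2}\right)dx = \tfrac{5}{6},\\
\mathbb{E}(X^2) &= \int_{1/2}^{1} 8x^2\left(x-\tfrac{1}{2}\right)dx = \tfrac{17}{24},\\
\mathbb{E}(XY) &= \int_{1/2}^{1}\int_{3/2-x}^{1} 8xy\,dy\,dx = \tfrac{11}{16},\\
\text{Var}(X) &= \tfrac{17}{24}-\tfrac{25}{36} = \tfrac{1}{72},\\
\text{Cov}(X,Y) &= \tfrac{11}{16}-\tfrac{25}{36} = -\tfrac{1}{144}.
\end{aligned}
$$

Plugging in, $k=\frac{-1/144}{1/72}=-\frac{1}{2}$ and $b=\frac{5}{6}+\frac{1}{2}\cdot\frac{5}{6}=\frac{5}{4}$.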

Quick solution:
As shown in Section 2.4 of the ESL book (The Elements of Statistical Learning), the solution to the optimization problem $\min_f \mathbb{E}(Y-f(X))^2$ over all functions $f$ is $f(X)=\mathbb{E}(Y|X)$. For the given distribution, $Y|X\sim \text{Uniform}[3/2-X, 1]$ for $1/2\leq X\leq 1$. As a result, $\mathbb{E}(Y|X)=\frac{1}{2}(3/2-X+1)=-\frac{1}{2}X + 5/4$. Since this conditional mean is already linear in $X$, it is also the best linear fit, with $k=-1/2$ and $b=5/4$.
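Both solutions can be sanity-checked with a quick simulation. The sketch below is not part of the original derivation: the helper names `sample_triangle` and `fit_line` are made up for illustration, the sampling uses simple rejection from the square $[1/2, 1]^2$ (which contains the triangle), and the fit uses the same moment formulas as the brute force solution.

```python
import random


def sample_triangle(n, seed=0):
    """Draw n points uniformly from {0<=x<=1, 0<=y<=1, x+y>=3/2} by rejection."""
    rng = random.Random(seed)
    points = []
    while len(points) < n:
        # Propose from the square [1/2, 1]^2; half of the proposals are accepted.
        x = rng.uniform(0.5, 1.0)
        y = rng.uniform(0.5, 1.0)
        if x + y >= 1.5:
            points.append((x, y))
    return points


def fit_line(points):
    """Least-squares fit of y = k*x + b using sample moments."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in points) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in points) / n
    k = cov_xy / var_x          # slope = Cov(X, Y) / Var(X)
    b = mean_y - k * mean_x     # intercept = E(Y) - k * E(X)
    return k, b


k, b = fit_line(sample_triangle(200_000))
print(f"k = {k:.3f}, b = {b:.3f}")
```

With a few hundred thousand samples, the printed values should land close to $k=-1/2$ and $b=5/4$, up to sampling noise.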
