Interview problem: linear regression (1)

Problem:
A collection of 2D points (x_i, y_i) is sampled i.i.d. from the uniform distribution on the triangular region 0\leq X \leq 1, 0 \leq Y \leq 1, X+Y\geq 3/2. If we fit a linear regression y=kx+b to the sampled points, what are the expected values of k and b?





Brute force solution:
In the large-sample limit, linear regression solves the optimization problem \min_{k, b} \mathbb{E}\left(Y-kX-b\right)^2. Setting the derivatives with respect to k and b to zero gives k=\frac{\text{Cov}(X, Y)}{\text{Var}(X)} and b=\mathbb{E}(Y)-k\,\mathbb{E}(X). For the given distribution, we can compute \mathbb{E}(X)=\mathbb{E}(Y)=5/6, \text{Cov}(X, Y)=-1/144 and \text{Var}(X)=1/72. The final answer is k=-1/2 and b=5/4.
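The moments quoted above can be checked by direct integration. The following is a sketch of one way to do it, not necessarily the route the author took. The shifted variables U = X - 1/2 and V = Y - 1/2 are notation introduced here for convenience; they leave variances and covariances unchanged.

```latex
% Joint density is 8 on the triangle (its area is 1/8).
% Shift to U = X - 1/2, V = Y - 1/2, so 0 <= U, V <= 1/2 and U + V >= 1/2.
\begin{align*}
f_U(u) &= \int_{1/2-u}^{1/2} 8\,dv = 8u, \qquad 0 \le u \le 1/2, \\
\mathbb{E}(U) &= \int_0^{1/2} 8u^2\,du = \frac{1}{3}
  \;\Rightarrow\; \mathbb{E}(X) = \mathbb{E}(Y) = \frac{1}{2} + \frac{1}{3} = \frac{5}{6}, \\
\mathbb{E}(U^2) &= \int_0^{1/2} 8u^3\,du = \frac{1}{8}
  \;\Rightarrow\; \text{Var}(X) = \frac{1}{8} - \frac{1}{9} = \frac{1}{72}, \\
\mathbb{E}(UV) &= \int_0^{1/2}\!\int_{1/2-u}^{1/2} 8uv\,dv\,du
  = \int_0^{1/2} 4u\,(u - u^2)\,du = \frac{1}{6} - \frac{1}{16} = \frac{5}{48}, \\
\text{Cov}(X, Y) &= \frac{5}{48} - \frac{1}{9} = -\frac{1}{144}, \\
k &= \frac{-1/144}{1/72} = -\frac{1}{2}, \qquad
b = \frac{5}{6} + \frac{1}{2}\cdot\frac{5}{6} = \frac{5}{4}.
\end{align*}
```

Here \mathbb{E}(Y)=\mathbb{E}(X) follows from the symmetry of the region under swapping X and Y.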

Quick solution:
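As shown in Section 2.4 of The Elements of Statistical Learning (ESL), the solution to the optimization problem \min_f \mathbb{E}(Y-f(X))^2 is f(X)=\mathbb{E}(Y|X). For the given distribution, Y|X\sim \text{Uniform}[3/2-X, 1] for 1/2\leq X\leq 1. As a result, \mathbb{E}(Y|X)=\frac{1}{2}(3/2-X+1)=-\frac{1}{2}X + 5/4, which is linear with k=-1/2 and b=5/4. Because the conditional mean is exactly linear, the least-squares estimates are unbiased for any sample with at least two distinct x values, so these are also the expected values of the fitted k and b.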
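Both solutions can be sanity-checked numerically. The snippet below is a minimal Monte Carlo sketch, assuming NumPy is available. The helper names sample_triangle and fit_line, the sample size, and the seed are illustrative choices made here and are not part of the original problem. It draws points by rejection sampling from the square [1/2, 1]^2, which contains the triangle, and then fits a least-squares line.

```python
import numpy as np


def sample_triangle(n, rng):
    """Draw n points uniformly from {0<=x<=1, 0<=y<=1, x+y>=3/2} by rejection."""
    points = np.empty((0, 2))
    while len(points) < n:
        # Propose in the square [1/2, 1]^2; about half the proposals are accepted.
        cand = rng.uniform(0.5, 1.0, size=(2 * n, 2))
        keep = cand[cand.sum(axis=1) >= 1.5]
        points = np.vstack([points, keep])
    return points[:n]


def fit_line(points):
    """Least-squares slope k and intercept b of y = k*x + b."""
    # np.polyfit returns coefficients from highest degree to lowest.
    k, b = np.polyfit(points[:, 0], points[:, 1], deg=1)
    return k, b


rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily
pts = sample_triangle(200_000, rng)
k_hat, b_hat = fit_line(pts)
print(f"k = {k_hat:.4f}, b = {b_hat:.4f}")
```

With a sample this large, the printed estimates should land close to k=-1/2 and b=5/4, up to sampling noise.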
