Generalized linear model
A generalized linear model (GLM) is said to have three components:
- A random component: $Y\sim \mathbb{P}(Y|\theta)$ in exponential family
- A linear predictor: $\eta=\mathbf{x}\cdot\boldsymbol{\beta}$
- A link function $f$: $f(\mathbb{E}[Y|\theta])=\eta$
- Given $\mathbf{X}$, GLM predicts the conditional expectation of $Y$ instead of $Y$ itself.
- In GLM, the link function directly relates the conditional expectation of $Y$ to the linear predictor $\mathbf{x}\cdot\boldsymbol{\beta}$ as $f(\mathbb{E}[Y|\theta]) = \mathbf{x}\cdot\boldsymbol{\beta}$. In contrast, one may transform $Y$ and then do ordinary linear regression, which implies $\mathbb{E}[f(Y)|\theta] = \mathbf{x}\cdot\boldsymbol{\beta}$.
- From $\mathbb{P}(Y|\theta)$, we can compute $\mathbb{E}[Y|\theta]=\mu(\theta)$. The linear predictor $\eta$ relates to $\theta$ by $f\left(\mu(\theta)\right)=\eta$.
- If we set $\eta=\theta$, ${\mu}^{-1}$ is a link function, called canonical link function. For example, the canonical link function of Bernoulli(p) is logit: $\log\frac{p}{1-p}=\mathbf{x}\cdot\boldsymbol{\beta}$.
- The canonical link function is by default because of its many good statistical properties. But we may use other link functions as well (probit for example).
Iterative reweighted linear regression (IRLS)
IRLS is to used to find the maximum likelihood estimates of GLM. As shown below, It is from Newton method for canonical link case and Fisher scoring for non-canonical link case.
- Given data $(\mathbf{x}^{(i)}, y^{(i)})$, consider the canonical link case $\theta^{(i)}=\mathbf{x}^{(i)}\cdot\boldsymbol{\beta}$ first. The log-likelihood for the exponential family $\mathbb{P}(Y|\theta)=g(Y)e^{\theta Y}/Z(\theta)$ is \begin{equation} l_n(\boldsymbol{\beta}) = \sum_{i=1}^n l_i(\theta^{(i)})= \sum_{i=1}^n \left[\theta^{(i)} y^{(i)} -\log Z(\theta^{(i)})\right]\,.\end{equation} Denote $\mu^{(i)} \equiv \mathbb{E}(Y|\theta^{(i)})$ and $\sigma^{2\,{(i)}} \equiv \text{Var}(Y|\theta^{(i)})$, then $l_i'(\theta^{(i)})=y^{(i)}-\mu^{(i)}$ and $l_i''(\theta^{(i)})=-\sigma^{2\,{(i)}}$. As a result, we have $\nabla l_n(\boldsymbol{\beta})=\mathbf{X}^T(\mathbf{y}-\boldsymbol{\mu})$ and $\text{Hessian}[l_n(\boldsymbol{\beta})]=-\mathbf{X}^T\mathbf{W}\mathbf{X}$, where $\boldsymbol{\mu}\equiv\left[\mu^{(1)},\cdots, \mu^{(n)}\right]^T$ and $\mathbf{W}\equiv\text{diag}\left[\sigma^{2\,(1)},\cdots,\sigma^{2\,(n)})\right]$. We can update $\boldsymbol{\beta}$ using Newton method: \begin{eqnarray}\boldsymbol{\beta}^{\text{new}}=\boldsymbol{\beta}^{\text{old}}+\left(\mathbf{X}^T\mathbf{W}\mathbf{X}\right)^{-1}\mathbf{X}^T(\mathbf{y}-\boldsymbol{\mu})=\left(\mathbf{X}^T\mathbf{W}\mathbf{X}\right)^{-1}\mathbf{X}^T\mathbf{W}\mathbf{z}\,,\end{eqnarray} where \begin{equation}\mathbf{z}^{\text{old}}\equiv \mathbf{X}\boldsymbol{\beta}^{\text{old}}+\mathbf{W}^{-1}(\mathbf{y}-\boldsymbol{\mu})\,.\end{equation} Therefore, for the canonical link case, Newton method leads to IRLS: weighted least squares to $\mathbf{z}^{\text{old}}$.
- For non-canonical link case $f(\mu(\theta^{(i)}))=\eta^{(i)}$, we optimize GLM by Fisher scoring instead of Newton method, i.e, to use the expectation of Hessian instead of Hessian itself. We start by computing \begin{equation} \frac{\partial l_n}{\partial \eta^{(i)}}=\frac{\partial l_n}{\partial \theta^{(i)}}\frac{d\theta^{(i)}}{d\eta^{(i)}}=\frac{y^{(i)}-\mu^{(i)}}{f'(\mu^{(i)})\, \sigma^{2\,(i)}}\,.\end{equation} Then it is easy to see $\mathbb{E}\left(\left.\frac{\partial l_n}{\partial \eta^{(i)}\partial \eta^{(j)}}\right|\theta\right)=0$ for $i\neq j$, and \begin{equation}\mathbb{E}\left(\left.\frac{\partial^2 l_n}{\partial \eta^{(i)\,2}}\right|\theta\right)=-\frac{{d\mu^{(i)}}/{d\eta^{(i)}}}{f'(\mu^{(i)})\, \sigma^{2\,(i)}}=-\frac{1}{{f'(\mu^{(i)})}^2 \sigma^{2\,{(i)}} }\,.\end{equation} Therefore, Fisher scoring still leads to IRLS: \begin{equation} \boldsymbol{\beta}^{\text{new}}=\left(\mathbf{X}^T\mathbf{W}\mathbf{X}\right)^{-1}\mathbf{X}^T\mathbf{W}\mathbf{z}\,,\end{equation} but with $\mathbf{W}=\text{diag}\left[\frac{1}{{f'(\mu^{(1)})}^2 \sigma^{2\,{(1)}} }, \cdots, \frac{1}{{f'(\mu^{(n)})}^2 \sigma^{2\,{(n)}} }\right]$ and \begin{equation}\mathbf{z}_i^{\text{old}}\equiv \mathbf{x}^{(i)}\cdot\boldsymbol{\beta}^{\text{old}}+ {f'(\mu^{(i)})}^2 \sigma^{2\,(i)}\frac{y^{(i)}-\mu^{(i)}}{f'(\mu^{(i)})\, \sigma^{2\,(i)}}=\mathbf{x}^{(i)}\cdot\boldsymbol{\beta}^{\text{old}}+ \left(y^{(i)}-\mu^{(i)}\right)f'(\mu^{(i)})\,.\end{equation} The equations here include the canonical link case in which $f'(\mu^{(i)})={1}/{\sigma^{2\,(i)}}$.
Comments
Post a Comment