Karush-Kuhn-Tucker (KKT) conditions
When I first took Andrew Ng's CS229 machine learning course, I did not really understand the KKT conditions in his lecture note 3, where they are presented without proof and mixed in with the material on Lagrangian duality. Recently I came across a blog post that gives a clearer and more intuitive illustration of the KKT conditions. Let me try to rephrase the argument in that blog here.
Lagrange Multiplier
KKT conditions concern optimization under inequality constraints. So before jumping into the KKT conditions, we should first recall optimization under equality constraints. Consider the optimization problem \begin{eqnarray} \begin{split} \min_{x}&&\quad f(x) \\ \text{s.t.}&& \quad g(x)=0 \end{split}\tag{1}\end{eqnarray} The key insight is that if $x^*$ is the optimal solution of problem (1), then \begin{eqnarray} \nabla f(x^*)&=&-\lambda\nabla g(x^*)\tag{2}\\ g(x^*)&=&0\tag{3}\end{eqnarray} for some $\lambda \in \mathbb{R}$. Eq. (2) has a simple geometric interpretation: note that $\nabla g$ is the normal vector of the hypersurface $g(x)=0$. Consider the family of level hypersurfaces $f(x)=c$; the minimum $c$ is attained when the hypersurface $f(x)=c$ becomes tangent to the hypersurface $g(x)=0$ at the point $x^*$, so that their normal vectors are parallel at $x^*$. Otherwise, we could move a small step along the hypersurface $g(x)=0$ away from $x^*$ and reach an even smaller $f(x)$. A rigorous proof is as follows:
Suppose $\nabla f(x^*)$ and $\nabla g(x^*)$ are not parallel. Then we have the decomposition $\nabla f(x^*)=-\lambda \nabla g(x^*) + \mathbf{t}$ with $\mathbf{t}\neq 0$ and $\mathbf{t}^T\cdot \nabla g(x^*)=0$, i.e., $\mathbf{t}$ is a tangent vector of $g(x)=0$ at $x^*$. As a result, to first order in a small $\epsilon>0$, \begin{eqnarray}&&f(x^*-\epsilon\,\mathbf{t})\approx f(x^*)-\epsilon\,\mathbf{t}^T\cdot \nabla f(x^*)=f(x^*)-\epsilon\,\left|\left|\mathbf{t}\right|\right|_2^2 < f(x^*)\,,\\&&g(x^*-\epsilon\,\mathbf{t})\approx g(x^*)-\epsilon\,\mathbf{t}^T\cdot \nabla g(x^*)=g(x^*)=0\,,\end{eqnarray} which contradicts the assumption that $f(x^*)$ is the minimum. Q.E.D.
In practice, we usually solve the optimization problem (1) with a Lagrange multiplier: optimize the Lagrangian \begin{eqnarray}L(x,\lambda)=f(x)+\lambda\,g(x)\tag{4}\end{eqnarray} with no constraints. The conditions $\frac{\partial}{\partial x} L(x, \lambda)=0$ and $\frac{\partial}{\partial \lambda} L(x, \lambda)=0$ are exactly conditions (2)-(3), justifying the approach of optimizing the Lagrangian (4).
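As a concrete sketch (a toy example of my own, not from the lecture notes), consider minimizing $f(x,y)=x^2+y^2$ subject to $g(x,y)=x+y-1=0$. Setting the gradient of the Lagrangian (4) to zero gives a linear system in $(x,y,\lambda)$, which we can solve numerically and then check condition (2):

```python
import numpy as np

# Minimize f(x, y) = x^2 + y^2 subject to g(x, y) = x + y - 1 = 0.
# Lagrangian: L(x, y, lam) = x^2 + y^2 + lam * (x + y - 1).
# Stationarity of L gives a linear system in (x, y, lam):
#   dL/dx:   2x + lam = 0
#   dL/dy:   2y + lam = 0
#   dL/dlam: x + y - 1 = 0   (the constraint itself)
A = np.array([[2.0, 0.0, 1.0],
              [0.0, 2.0, 1.0],
              [1.0, 1.0, 0.0]])
b = np.array([0.0, 0.0, 1.0])
x, y, lam = np.linalg.solve(A, b)
print(x, y, lam)  # x = y = 0.5, lam = -1

# Sanity check of Eq. (2): grad f = -lam * grad g at the optimum.
grad_f = np.array([2 * x, 2 * y])
grad_g = np.array([1.0, 1.0])
assert np.allclose(grad_f, -lam * grad_g)
```

For a quadratic objective with a linear constraint the stationarity system is linear, so a single `np.linalg.solve` suffices; in general the system is nonlinear and needs a root finder.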
KKT conditions
KKT conditions generalize the conditions (2)-(3) to inequality constraints. Consider the optimization problem \begin{eqnarray} \begin{split} \min_{x}&&\quad f(x) \\ \text{s.t.}&& \quad g(x)\leq 0 \end{split}\tag{5}\end{eqnarray} There are two cases for the optimal solution $x^*$ of problem (5):
- Boundary solution: $g(x^*) = 0$. In this case the constraint is active and problem (5) reduces to problem (1) with an equality constraint. As a result, we have $\nabla f(x^*)=-\lambda \nabla g(x^*)$, as proved in the previous section. Moreover, we have an additional condition $\lambda > 0$: if we move from $x^*$ into the interior region $g(x)<0$ along the inward normal $-\nabla g(x^*)$, the function $f(x)$ should become larger than $f(x^*)$. A rigorous proof of $\lambda > 0$ is as follows. To first order in a small $\epsilon>0$, the point $x^*-\epsilon \nabla g(x^*)$ is an interior point: $g(x^*-\epsilon \nabla g(x^*))\approx g(x^*)-\epsilon \nabla g(x^*)^T\cdot \nabla g(x^*)=-\epsilon \left|\left|\nabla g(x^*)\right|\right|^2_2 <0$. Since the optimum $x^*$ lies on the boundary, $f(x^*) < f(x^*-\epsilon \nabla g(x^*))\approx f(x^*)-\epsilon \nabla g(x^*)^T \cdot \nabla f(x^*)=f(x^*)+\epsilon \lambda\left|\left|\nabla g(x^*)\right|\right|^2_2$, which requires $\lambda >0$. In sum, for the boundary solution, we have \begin{eqnarray}g(x^*) = 0\,,\quad \nabla f(x^*)=-\lambda \nabla g(x^*)\,,\quad \lambda>0\,.\end{eqnarray}
- Interior solution: $g(x^*) < 0$. In this case the constraint is inactive and problem (5) reduces to an unconstrained problem: $\nabla f(x^*)=0$ and $\lambda=0$ (from the viewpoint of the Lagrange multiplier). In sum, for the interior solution, we have \begin{eqnarray}g(x^*) < 0\,,\quad \nabla f(x^*)=0\,,\quad \lambda=0\,.\end{eqnarray}
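The two-case analysis can be sketched on a one-dimensional toy problem of my own: minimize $f(x)=(x-a)^2$ subject to $g(x)=x\leq 0$. Whether the solution is interior or boundary depends on where the unconstrained minimum $x=a$ sits:

```python
def kkt_solve(a):
    """Minimize f(x) = (x - a)^2 subject to g(x) = x <= 0.

    Returns (x_star, lam) by the two-case analysis above:
    interior solution if the unconstrained minimum x = a is feasible,
    boundary solution x = 0 otherwise.
    """
    if a <= 0:             # interior: constraint inactive
        return a, 0.0      # grad f(a) = 0 and lam = 0
    else:                  # boundary: constraint active, g(x*) = 0
        # grad f(0) = -2a = -lam * grad g(0) = -lam  =>  lam = 2a > 0
        return 0.0, 2.0 * a

# Verify the conditions of both cases numerically.
for a in (-1.0, 3.0):
    x_star, lam = kkt_solve(a)
    grad_f = 2 * (x_star - a)   # f'(x*)
    grad_g = 1.0                # g'(x) = 1
    assert x_star <= 0 and lam >= 0           # feasibility
    assert grad_f + lam * grad_g == 0         # stationarity
    assert lam * x_star == 0                  # one factor always vanishes
```

For $a=-1$ the solver returns the interior solution $(-1, 0)$; for $a=3$ it returns the boundary solution $(0, 6)$ with a strictly positive multiplier.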
Combining the two cases, the optimal solution $x^*$ of problem (5) satisfies the KKT conditions: \begin{eqnarray} g(x^*)&\leq& 0\,,\tag{6}\\ \nabla f(x^*)+\lambda\,\nabla g(x^*)&=&0\,,\tag{7}\\ \lambda &\geq& 0\,,\tag{8}\\ \lambda\, g(x^*)&=&0\,.\tag{9}\end{eqnarray}
- The condition (6) is called "primal feasibility". It is simply the inequality constraint in problem (5).
- The condition (7) is called "stationarity". It is the zero gradient of the Lagrangian with respect to $x$.
- The condition (8) is called "dual feasibility". This name comes from Lagrangian duality, as discussed in the Lagrangian duality section below.
- The condition (9) is called "complementary slackness": $\lambda=0$ corresponds to the interior solution and $g(x^*)=0$ corresponds to the boundary solution. It is the most nontrivial one among all KKT conditions.
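The four conditions are straightforward to verify numerically at a candidate point. Below is a hypothetical helper of my own, `check_kkt`, assuming a single scalar constraint $g(x)\leq 0$:

```python
import numpy as np

def check_kkt(grad_f, grad_g, g, x, lam, tol=1e-8):
    """Verify the four KKT conditions at a candidate pair (x, lam)
    for min f(x) s.t. g(x) <= 0 with a single constraint."""
    primal = g(x) <= tol                                      # (6) primal feasibility
    stationary = np.allclose(grad_f(x) + lam * grad_g(x),
                             0.0, atol=tol)                   # (7) stationarity
    dual = lam >= -tol                                        # (8) dual feasibility
    slack = abs(lam * g(x)) <= tol                            # (9) complementary slackness
    return primal and stationary and dual and slack

# Toy usage: min ||x||^2 s.t. 1 - x_0 <= 0 (i.e. x_0 >= 1).
# The optimum is x* = (1, 0) with lam = 2.
grad_f = lambda x: 2 * np.asarray(x)
grad_g = lambda x: np.array([-1.0, 0.0])
g = lambda x: 1 - x[0]
print(check_kkt(grad_f, grad_g, g, np.array([1.0, 0.0]), 2.0))  # True
```

With multiple constraints the same checks apply per constraint, with the stationarity condition summing $\lambda_i \nabla g_i$ over all of them.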
Example
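As a worked example of my own (the original example from the blog is not reproduced here), minimize $f(x,y)=(x-2)^2+(y-2)^2$ subject to $g(x,y)=x+y-2\leq 0$. The unconstrained minimum $(2,2)$ is infeasible, so the solution is a boundary one: the active constraint $x+y=2$ together with stationarity $(2(x-2),\,2(y-2))+\lambda\,(1,1)=0$ gives $x^*=y^*=1$ and $\lambda=2>0$, consistent with all four KKT conditions. We can cross-check with SciPy's general-purpose constrained solver:

```python
import numpy as np
from scipy.optimize import minimize

# Minimize f(x, y) = (x-2)^2 + (y-2)^2 subject to x + y - 2 <= 0.
# Hand-solved KKT system: x* = y* = 1, lam = 2 > 0.
f = lambda z: (z[0] - 2) ** 2 + (z[1] - 2) ** 2

# SciPy's "ineq" convention is fun(z) >= 0, so we pass 2 - x - y.
cons = {"type": "ineq", "fun": lambda z: 2 - z[0] - z[1]}
res = minimize(f, x0=[0.0, 0.0], constraints=[cons])
print(res.x)  # approximately [1, 1]
```

Note the sign flip in the constraint: this post writes constraints as $g(x)\leq 0$, while `scipy.optimize.minimize` expects inequality constraint functions to be nonnegative at feasible points.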
Lagrangian duality
Assuming strong duality holds, the primal optimum equals the dual optimum: $f(x^*)=\min_x \left[f(x)+\lambda^*\,g(x)\right]\leq f(x^*)+\lambda^*\,g(x^*)\leq f(x^*)$, where the last inequality uses $\lambda^*\geq 0$ and $g(x^*)\leq 0$. Both inequalities are therefore saturated, implying:
- $x^*$ minimizes $f(x)+\lambda^*\,g(x)$, which is the stationarity in KKT conditions.
- $\lambda^*\,g(x^*)=0$, which is the complementary slackness in KKT conditions.
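A small numerical check of this duality picture, on a toy problem of my own: minimize $x^2$ subject to $g(x)=1-x\leq 0$. The Lagrangian $L(x,\lambda)=x^2+\lambda(1-x)$ is minimized over $x$ at $x(\lambda)=\lambda/2$, giving the dual function $q(\lambda)=\lambda-\lambda^2/4$:

```python
import numpy as np

# Primal: min x^2 s.t. 1 - x <= 0, so x* = 1 and f(x*) = 1.
# Dual function: q(lam) = min_x [x^2 + lam * (1 - x)] = lam - lam^2 / 4,
# obtained at the minimizer x(lam) = lam / 2.
lams = np.linspace(0.0, 4.0, 4001)      # grid over the dual-feasible region
q = lams - lams ** 2 / 4
lam_star = lams[np.argmax(q)]           # maximizer of the dual
x_star = lam_star / 2                   # x* minimizes L(x, lam*)
print(lam_star, x_star, q.max())        # lam* = 2, x* = 1, dual optimum = 1

# Strong duality: dual optimum equals primal optimum f(x*) = (x*)^2,
# and complementary slackness holds: lam* * g(x*) = 2 * 0 = 0.
assert np.isclose(q.max(), x_star ** 2)
assert np.isclose(lam_star * (1 - x_star), 0.0)
```

The grid search is only for illustration; for this concave quadratic dual, $q'(\lambda)=1-\lambda/2=0$ gives $\lambda^*=2$ in closed form.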