Karush-Kuhn-Tucker (KKT) conditions

When I first took Andrew Ng's CS229 machine learning course, I did not really understand the KKT conditions in his lecture note 3, where they are presented without proof and mixed in with the material on Lagrangian duality. Recently, I came across a blog post that gives a clearer and more intuitive illustration of the KKT conditions. Let me try to rephrase its argument here.

Lagrange Multiplier

KKT conditions concern optimization under inequality constraints. Before jumping into the KKT conditions, we should first recall optimization under equality constraints. Consider the optimization problem \begin{eqnarray} \begin{split} \min_{x}&&\quad f(x) \\ \text{s.t.}&& \quad g(x)=0 \end{split}\tag{1}\end{eqnarray} The key insight is that if $x^*$ is the optimal solution of problem (1), then \begin{eqnarray} \nabla f(x^*)&=&-\lambda\nabla g(x^*)\tag{2}\\ g(x^*)&=&0\tag{3}\end{eqnarray} for some $\lambda \in \mathbb{R}$. Eq. (2) has a simple geometric interpretation: note that $\nabla g$ is the normal vector of the hypersurface $g(x)=0$. Consider the family of hypersurfaces $f(x)=c$; the minimum $c$ arises when the hypersurface $f(x)=c$ becomes tangent to the hypersurface $g(x)=0$ at the point $x^*$, so that their normal vectors are parallel at $x^*$. Otherwise, we could move a small step along the hypersurface $g(x)=0$ away from $x^*$ and reach an even smaller $f(x)$. The rigorous proof is as follows:

Suppose $\nabla f(x^*)$ and $\nabla g(x^*)$ are not parallel. Then we have the decomposition $\nabla f(x^*)=-\lambda \nabla g(x^*) + \mathbf{t}$ with $\mathbf{t}\neq \mathbf{0}$ and $\mathbf{t}^T\cdot \nabla g(x^*)=0$, i.e., $\mathbf{t}$ is a tangent vector of $g(x)=0$ at $x^*$. As a result, to first order in a small $\epsilon>0$, \begin{eqnarray}&&f(x^*-\epsilon\,\mathbf{t})=f(x^*)-\epsilon\,\mathbf{t}^T\cdot \nabla f(x^*)=f(x^*)-\epsilon\,\left|\left|\mathbf{t}\right|\right|_2^2 < f(x^*)\,,\\&&g(x^*-\epsilon\,\mathbf{t})=g(x^*)-\epsilon\,\mathbf{t}^T\cdot \nabla g(x^*)=g(x^*)=0\,,\end{eqnarray} which contradicts the assumption that $f(x^*)$ is the minimum. Q.E.D.

In practice, we usually solve the optimization problem (1) via the Lagrange multiplier method: optimize the Lagrangian \begin{eqnarray}L(x,\lambda)=f(x)+\lambda\,g(x)\tag{4}\end{eqnarray} with no constraints. The stationarity conditions $\frac{\partial}{\partial x} L(x, \lambda)=0$ and $\frac{\partial}{\partial \lambda} L(x, \lambda)=0$ are exactly the conditions (2)-(3), justifying the approach of optimizing the Lagrangian (4).
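As a quick sanity check, here is a minimal sketch in Python (using sympy) that applies this recipe to a toy instance of problem (1) of my own choosing: minimize $f(x,y)=x^2+y^2$ subject to $g(x,y)=x+y-1=0$.

```python
# Solve min f(x, y) = x^2 + y^2  s.t.  g(x, y) = x + y - 1 = 0
# by setting all partial derivatives of the Lagrangian (4) to zero.
import sympy as sp

x, y, lam = sp.symbols("x y lam", real=True)
f = x**2 + y**2
g = x + y - 1

L = f + lam * g  # the Lagrangian, Eq. (4)

# Stationarity in (x, y) reproduces Eq. (2); stationarity in lam reproduces Eq. (3).
solutions = sp.solve([sp.diff(L, x), sp.diff(L, y), sp.diff(L, lam)],
                     [x, y, lam], dict=True)
print(solutions)  # [{x: 1/2, y: 1/2, lam: -1}]
```

Note that $\lambda=-1$ here: for an equality constraint, the multiplier can take either sign, consistent with $\lambda\in\mathbb{R}$ above.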

KKT conditions

KKT conditions generalize the conditions (2)-(3) to inequality constraints. Consider the optimization problem \begin{eqnarray} \begin{split} \min_{x}&&\quad f(x) \\ \text{s.t.}&& \quad g(x)\leq 0 \end{split}\tag{5}\end{eqnarray} There are two cases for optimal solution $x^*$ of the problem (5):

  1. Boundary solution: $g(x^*) = 0$. In this case, the constraint is active and the problem (5) reduces to the problem (1) under an equality constraint. As a result, we have $\nabla f(x^*)=-\lambda \nabla g(x^*)$, as proved in the previous section. Moreover, we have an additional condition $\lambda > 0$: if we move from $x^*$ into the interior region $g(x)<0$ along the inward normal $-\nabla g(x^*)$, the function $f(x)$ should become larger than $f(x^*)$. A rigorous proof of $\lambda > 0$ is as follows: to first order in a small $\epsilon>0$, $x^*-\epsilon \nabla g(x^*)$ is an interior point, since $g(x^*-\epsilon \nabla g(x^*))=g(x^*)-\epsilon \nabla g(x^*)^T\cdot \nabla g(x^*)=-\epsilon \left|\left|\nabla g(x^*)\right|\right|^2_2 <0$; since the optimum is attained on the boundary rather than in the interior, $f(x^*) < f(x^*-\epsilon \nabla g(x^*))=f(x^*)-\epsilon \nabla g(x^*)^T \cdot \nabla f(x^*)=f(x^*)+\epsilon\, \lambda\left|\left|\nabla g(x^*)\right|\right|^2_2$, which requires $\lambda >0$. In sum, for the boundary solution, we have \begin{eqnarray}g(x^*) = 0\,,\quad \nabla f(x^*)=-\lambda \nabla g(x^*)\,,\quad \lambda>0\,.\end{eqnarray}
  2. Interior solution: $g(x^*) < 0$. In this case, the constraint is inactive and the problem (5) reduces to an unconstrained problem: $\nabla f(x^*)=0$ and $\lambda=0$ (from the Lagrange multiplier point of view). In sum, for the interior solution, we have \begin{eqnarray}g(x^*) < 0\,,\quad \nabla f(x^*)=0\,,\quad \lambda=0\,.\end{eqnarray}
Combining the above two cases leads to the so-called KKT conditions: \begin{eqnarray}&& g(x^*)\leq 0\,, \tag{6}\\ && \nabla f(x^*)=-\lambda \nabla g(x^*)\,,\tag{7} \\ && \lambda \geq 0 \,,\tag{8} \\ && \lambda\, g(x^*)=0 \,.\tag{9}\end{eqnarray}
Note:
  • The condition (6) is called "primal feasibility". It is simply the inequality constraint in the problem (5).
  • The condition (7) is called "stationarity". It is the zero gradient of the Lagrangian with respect to $x$.
  • The condition (8) is called "dual feasibility". The name comes from the Lagrangian duality discussed in the final section below.
  • The condition (9) is called "complementary slackness": $\lambda=0$ corresponds to the interior solution and $g(x^*)=0$ to the boundary solution. It is the most nontrivial one among all the KKT conditions.
Note that the KKT conditions are necessary conditions for the problem (5), not sufficient ones. They become sufficient when the problem (5) is convex.
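To make the conditions (6)-(9) concrete, here is a small numerical check in Python (scipy) on a toy instance of problem (5) of my own making, with the unconstrained minimum deliberately placed outside the feasible disk so that the boundary case applies.

```python
# Verify the KKT conditions (6)-(9) numerically on the toy instance
# min (x0 - 2)^2 + (x1 - 1)^2  s.t.  x0^2 + x1^2 - 1 <= 0.
import numpy as np
from scipy.optimize import minimize

f = lambda x: (x[0] - 2) ** 2 + (x[1] - 1) ** 2
g = lambda x: x[0] ** 2 + x[1] ** 2 - 1

# SLSQP expects inequality constraints in the form fun(x) >= 0, hence -g.
res = minimize(f, x0=np.zeros(2), method="SLSQP",
               constraints=[{"type": "ineq", "fun": lambda x: -g(x)}])
x_star = res.x

grad_f = 2 * (x_star - np.array([2.0, 1.0]))
grad_g = 2 * x_star
lam = -grad_f[0] / grad_g[0]     # recover lambda from stationarity (7)

print(x_star)                    # ~ [2, 1] / sqrt(5): a boundary solution
print(g(x_star))                 # ~ 0: primal feasibility (6), constraint active
print(lam)                       # ~ sqrt(5) - 1 > 0: dual feasibility (8)
print(grad_f + lam * grad_g)     # ~ [0, 0]: stationarity (7)
print(lam * g(x_star))           # ~ 0: complementary slackness (9)
```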

Example

The ridge regression \begin{eqnarray}\hat{\boldsymbol{\beta}}(C)=\text{argmin}_{\boldsymbol{\beta}}\,\left|\left|\mathbf{Y}-\mathbf{X}\boldsymbol{\beta}\right|\right|_2^2+C\left|\left|\boldsymbol{\beta}\right|\right|_2^2\tag{10}\end{eqnarray} is claimed to be equivalent to the following constrained problem \begin{eqnarray} \begin{split} \min_{\boldsymbol{\beta}} &&\quad \left|\left|\mathbf{Y}-\mathbf{X}\boldsymbol{\beta}\right|\right|_2^2 \\ \text{s.t.}&&\quad \left|\left|\boldsymbol{\beta}\right|\right|_2^2\leq t \end{split}\tag{11}\end{eqnarray} We can prove this equivalence and find the relation between $C$ and $t$ from the KKT conditions of the optimization problem (11): \begin{eqnarray}&& \left|\left|\boldsymbol{\beta}\right|\right|_2^2\leq t \,,\\ && \frac{\partial}{\partial \boldsymbol{\beta}} \left|\left|\mathbf{Y}-\mathbf{X}\boldsymbol{\beta}\right|\right|_2^2=-\lambda \frac{\partial}{\partial \boldsymbol{\beta}}\left|\left|\boldsymbol{\beta}\right|\right|_2^2\,, \\ && \lambda \geq 0 \,, \\ && \lambda\, \left( \left|\left|\boldsymbol{\beta}\right|\right|_2^2 - t \right) = 0\,.\end{eqnarray} Since the problem (11) is convex, a solution of the above KKT conditions is also an optimal solution of the problem (11). As a result, by the KKT conditions, problems (10) and (11) are equivalent if and only if $C=\lambda$ and $t=\left|\left|\hat{\boldsymbol{\beta}}(C)\right|\right|_2^2$.
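As a sanity check of this equivalence, here is a sketch in Python (numpy/scipy) on random data of my own: solve the penalized form (10) in closed form, set $t=\left|\left|\hat{\boldsymbol{\beta}}(C)\right|\right|_2^2$, and verify that the constrained form (11) returns the same coefficients.

```python
# Check the equivalence of the penalized form (10) and constrained form (11)
# of ridge regression on random data.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
Y = rng.normal(size=50)
C = 2.0

# Penalized form (10): closed-form solution (X^T X + C I)^{-1} X^T Y.
beta_ridge = np.linalg.solve(X.T @ X + C * np.eye(3), X.T @ Y)
t = beta_ridge @ beta_ridge      # the matching constraint level

# Constrained form (11) with ||beta||_2^2 <= t (SLSQP wants fun(x) >= 0).
res = minimize(lambda b: np.sum((Y - X @ b) ** 2), x0=np.zeros(3),
               method="SLSQP",
               constraints=[{"type": "ineq", "fun": lambda b: t - b @ b}])

print(np.allclose(res.x, beta_ridge, atol=1e-4))  # True, up to solver tolerance
```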

Lagrangian duality

Finally, we introduce Lagrangian duality for completeness. For the problem (5), we define the Lagrangian dual function \begin{eqnarray} d(\lambda) \equiv \inf_x \left(f(x)+\lambda\,g(x)\right)\,. \end{eqnarray}
For $\lambda \geq 0$, we have $d(\lambda) \leq f(x^*)+\lambda\,g(x^*) \leq f(x^*)$, and thus the solution $\lambda^*$ of the following Lagrangian dual problem \begin{eqnarray} \begin{split} \max_{\lambda}&&\quad d(\lambda) \\ \text{s.t.}&& \quad \lambda\geq 0 \end{split}\end{eqnarray} provides a lower bound on the optimal value of the primal problem (5): \begin{equation} d(\lambda ^*)\leq f(x^*)\,.\end{equation}
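For concreteness, here is a one-dimensional example of my own, worked symbolically in Python (sympy): minimize $x^2$ subject to $1-x\leq 0$, whose primal optimum is $x^*=1$ with $f(x^*)=1$.

```python
# Dual function and dual problem for: min x^2  s.t.  g(x) = 1 - x <= 0.
import sympy as sp

x, lam = sp.symbols("x lam", real=True)
f = x**2
g = 1 - x

L = f + lam * g
x_min = sp.solve(sp.diff(L, x), x)[0]    # unconstrained minimizer of L over x
d = sp.simplify(L.subs(x, x_min))        # the dual function d(lam)

lam_star = sp.solve(sp.diff(d, lam), lam)[0]   # maximize d over lam >= 0
print(d)                                 # -lam**2/4 + lam
print(lam_star, d.subs(lam, lam_star))   # 2, 1
```

Here $d(\lambda^*)=1=f(x^*)$: the lower bound is tight for this example, a preview of the strong duality discussed next.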
The equality $d(\lambda ^*)=f(x^*)$ (strong duality) holds only under additional conditions. For example, for convex primal problems, strong duality holds if there exists a feasible point at which every non-affine inequality constraint is strictly satisfied (Slater's condition).

When strong duality holds, $x^*$ and $\lambda^*$ are primal and dual solutions if and only if they satisfy KKT conditions. 
Proof: ($\Rightarrow$) Because of strong duality, we have \begin{eqnarray} f(x^*)=d(\lambda^*)\equiv \inf_x \left(f(x)+\lambda^*\,g(x)\right) \leq f(x^*)+\lambda^*\,g(x^*)\leq f(x^*)\,,\end{eqnarray} so all the inequalities above are actually equalities, which means:
  • $x^*$ minimizes $f(x)+\lambda^*\,g(x)$, which is stationarity in the KKT conditions.
  • $\lambda^*\,g(x^*)=0$, which is complementary slackness in KKT conditions.
($\Leftarrow$) If there exist $x^*$ and $\lambda^*$ satisfying the KKT conditions, then stationarity (together with convexity of the problem) implies that $x^*$ minimizes $f(x)+\lambda^*\,g(x)$, and complementary slackness gives \begin{eqnarray} d(\lambda^*) = f(x^*)+\lambda^*\,g(x^*) = f(x^*)\,,\end{eqnarray} i.e., $x^*$ and $\lambda^*$ make the duality gap zero and thus are the primal and dual solutions.
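Continuing the one-dimensional example from the previous section, a few lines of arithmetic confirm that $x^*=1$ and $\lambda^*=2$ satisfy all four KKT conditions, consistent with the zero duality gap found there.

```python
# For min x^2 s.t. 1 - x <= 0: check KKT at x* = 1, lam* = 2.
x_star, lam_star = 1.0, 2.0

grad_f = 2 * x_star    # d/dx x^2
grad_g = -1.0          # d/dx (1 - x)

print(1 - x_star <= 0, lam_star >= 0)   # True True: primal (6) and dual (8) feasibility
print(grad_f + lam_star * grad_g)       # 0.0: stationarity (7)
print(lam_star * (1 - x_star))          # 0.0: complementary slackness (9)
```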
