Support Vector Machine (SVM)

The training dataset of an SVM consists of $n$ training examples of the form $\left(x^{(i)}, y^{(i)}\right)$ for $i=1, \cdots, n$, where each ${x}^{(i)}$ is a $p$-dimensional input vector and $y^{(i)}$ is the corresponding label. Furthermore, there is a fixed feature map $\phi$ that maps an input vector $x$ to a feature vector $\phi(x)$; such a feature map is implicitly specified by a kernel in the dual formulation of SVM. The task of SVM is to find two parallel linear boundaries (hyperplanes) in the feature space that maximize the margin between them, for either binary classification or regression.

SVM as a binary classifier

For binary classification, $y^{(i)}\in \{1, -1\}$ labels the class to which ${x}^{(i)}$ belongs. The task of SVM is to find two parallel hyperplanes in the feature space that separate the two classes. 
Fig. 1: Two parallel hyperplanes in the feature space with the maximum margin for binary classification. (a) Linearly separable data (figure from Wikipedia). (b) Non-linearly separable data (figure from the OpenCV documentation).

Robust linear discrimination

Suppose the data of the two classes are linearly separable, as shown in Fig. 1 (a). One can then find two parallel hyperplanes such that $w^T\,\phi(x^{(i)})+b\geq 1$ for the class $y^{(i)}=1$ and $w^T \,\phi(x^{(i)})+b\leq -1$ for the class $y^{(i)}=-1$. Note that the distance, or margin, between the two hyperplanes is ${2}/{||w||}$; SVM maximizes this margin, leading to the primal optimization problem\begin{eqnarray}\begin{split} \min_{w,b}\quad && \frac{1}{2}||w||^2\nonumber\\ \text{s.t.}\quad && y^{(i)}\left[w^T \,\phi(x^{(i)})+b\right]\geq 1, \,\,\,\,i=1,\cdots,n \end{split}\tag{1}\end{eqnarray} Although the optimization problem (1) can be solved by quadratic programming, in practice one prefers to solve its dual problem via Lagrangian duality:

  1. Write the Lagrangian of the problem (1) as \begin{eqnarray} L(w, b, \alpha)=\frac{1}{2}||w||^2 + \sum_{i=1}^n \alpha_i\left\{1-y^{(i)}\left[w^T \,\phi(x^{(i)})+b\right]\right\}\tag{1.1}\end{eqnarray} with $\alpha_i\geq 0$. 
  2. Setting $\frac{\partial}{\partial w}L(w, b, \alpha) =0$ and $\frac{\partial}{\partial b}L(w, b, \alpha) =0$ gives \begin{eqnarray} &&w=\sum_{i=1}^n\alpha_i y^{(i)}\phi(x^{(i)})\,,\tag{1.2}\\ &&\sum_{i=1}^n \alpha_i y^{(i)}=0\,. \tag{1.3}\end{eqnarray} 
  3. Substitute (1.2) into the Lagrangian (1.1) and apply the constraint (1.3), leading to the dual function \begin{eqnarray} d(\alpha) = \sum_{i=1}^n \alpha_i-\frac{1}{2}\sum_{i,j=1}^n y^{(i)}y^{(j)} \alpha_i\alpha_j\phi(x^{(i)})^T \phi(x^{(j)})\,.\tag{1.4}\end{eqnarray}
Finally, we arrive at the dual problem: \begin{eqnarray} \begin{split}\max_{\alpha}\quad && \sum_{i=1}^n \alpha_i-\frac{1}{2}\sum_{i,j=1}^n y^{(i)}y^{(j)}\alpha_i\alpha_j K_{ij}\nonumber\\ \text{s.t.}\quad&& \alpha_i\geq 0, \,\,\,\,i=1,\cdots,n\nonumber\\ && \sum_{i=1}^n \alpha_i y^{(i)}=0 \end{split}\tag{2}\end{eqnarray} where $K_{ij}\equiv\phi(x^{(i)})^T \phi(x^{(j)})$, the inner product between the feature vectors $\phi(x^{(i)})$ and $\phi(x^{(j)})$, is called the "kernel".

Strong duality between (1) and (2) holds since the problem (1) is convex and all of its inequality constraints are affine. As a result, once we solve for the optimal $\alpha_i^*$ in the dual problem (2), we can recover the optimal $w^*$ and $b^*$ of the primal problem (1) from the KKT conditions.
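As a concrete illustration, the dual (2) with a linear kernel can be handed to an off-the-shelf convex solver. The sketch below is my own, assuming the cvxpy and NumPy packages and a linearly separable dataset `X, y` (otherwise the hard-margin dual is unbounded); it then recovers $w^*$ via (1.2) and $b^*$ from a support vector. It is a minimal sketch, not a production solver.

```python
# Minimal sketch: solve the hard-margin dual (2) with a linear kernel.
# Assumes cvxpy is installed, X has shape (n, p), y has entries +/-1,
# and the data are linearly separable (otherwise the dual is unbounded).
import numpy as np
import cvxpy as cp

def hard_margin_svm_dual(X, y):
    n = X.shape[0]
    alpha = cp.Variable(n)
    # alpha^T Q alpha with Q_ij = y_i y_j x_i^T x_j equals ||X^T (alpha * y)||^2
    obj = cp.Maximize(cp.sum(alpha) - 0.5 * cp.sum_squares(X.T @ cp.multiply(alpha, y)))
    constraints = [alpha >= 0, y @ alpha == 0]
    cp.Problem(obj, constraints).solve()
    a = alpha.value
    w = (a * y) @ X              # KKT stationarity (1.2)
    sv = int(np.argmax(a))       # any example with alpha_i > 0 is a support vector
    b = y[sv] - X[sv] @ w        # complementary slackness: y_sv (w^T x_sv + b) = 1
    return w, b
```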

Approximate linear separation of non-separable sets

As shown in Fig. 1 (b), to make the algorithm work for non-linearly separable datasets and be less sensitive to outliers, we reformulate the primal problem (1) by introducing non-negative slack variables $\xi_i$ with an ${\ell}_1$ penalty: \begin{eqnarray}\begin{split}\min_{w,b, \xi}\quad && \frac{1}{2}||w||^2+C\sum_{i=1}^n \xi_i\nonumber\\ \text{s.t.}\quad&& y^{(i)}\left[w^T \phi(x^{(i)})+b\right]\geq 1-\xi_i, \,\,\,\,i=1,\cdots,n\nonumber\\ &&\xi_i\geq 0,\,\,\,\,i=1,\cdots,n \end{split}\tag{3}\end{eqnarray} Following the same Lagrangian steps as in (1.1)-(1.4), we can derive the dual of the primal problem (3) as \begin{eqnarray} \begin{split}\max_{\alpha}\quad && \sum_{i=1}^n \alpha_i-\frac{1}{2}\sum_{i,j=1}^n y^{(i)}y^{(j)}\alpha_i\alpha_jK_{ij}\nonumber\\ \text{s.t.}\quad && 0\leq\alpha_i\leq C, \,\,\,\,i=1,\cdots,n\nonumber\\ && \sum_{i=1}^n \alpha_i y^{(i)}=0 \end{split}\tag{4}\end{eqnarray} where $K_{ij}\equiv\phi(x^{(i)})^T \phi(x^{(j)})$ is the kernel. 

Note:

  • For the linearly and non-linearly separable cases, although the primal forms (1) and (3) look different, the dual forms (2) and (4) are very similar: the only difference is the additional constraint $\alpha_i\leq C$ in problem (4) compared to problem (2).
  • The primal problem (3) is equivalent to an unconstrained optimization problem with the hinge loss: \begin{eqnarray} \min_{w, b}\quad \sum_{i=1}^n \max\left\{0, 1-y^{(i)}[w^T \phi(x^{(i)})+b]\right\}+\frac{1}{2C}||w||^2\,,\end{eqnarray} where $1/2C$ is the strength of the ${\ell}_2$ regularization on $w$. This form does not directly cover the hard-margin problem (1), which corresponds to the limit $C\to\infty$ in (3). A subgradient-descent sketch of this hinge-loss form is given after this list.
  • The dual problem (4) can be solved by the sequential minimal optimization (SMO) algorithm.
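Here is the subgradient-descent sketch promised above for the hinge-loss form, written in plain NumPy with $\phi(x)=x$ (i.e., a linear kernel). It is my own illustration only, not the SMO algorithm mentioned in the last bullet.

```python
# Minimal sketch: subgradient descent on
#   sum_i max(0, 1 - y_i (w^T x_i + b)) + ||w||^2 / (2C),
# with phi(x) = x. X has shape (n, p); y has entries +/-1.
import numpy as np

def hinge_svm_subgradient(X, y, C=1.0, lr=1e-3, epochs=1000):
    n, p = X.shape
    w, b = np.zeros(p), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        active = margins < 1                       # examples with nonzero hinge loss
        grad_w = w / C - (y[active, None] * X[active]).sum(axis=0)
        grad_b = -y[active].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```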

Kernel trick

In the dual forms (2) and (4), the feature vectors $\phi(x)$ appear only through the kernel $K_{ij}\equiv\phi(x^{(i)})^T \phi(x^{(j)})$. Therefore, in practice, we can specify the kernel $K_{ij}$ directly, with no need for an explicit form of $\phi$ (see the sketch after this list): 
  • Computational efficiency: computing a high-dimensional feature $\phi(x)$ can be very expensive, whereas the kernel $K$ is often much cheaper to evaluate.
  • Feasibility: Mercer's theorem states that a symmetric function $K(x, z)$ can be expressed as $\phi(x)^T\phi(z)$ for some feature map $\phi$ if and only if $K$ is positive semidefinite.
  • Nonlinear boundaries: the boundaries are hyperplanes in the feature space but nonlinear in the input space. 
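The sketch referred to above: a kernel SVM fitted on a toy two-ring dataset, using scikit-learn's SVC with an RBF kernel (the dataset and library choice are my own assumptions for illustration). The RBF kernel $K(x,z)=\exp(-\gamma\|x-z\|^2)$ is evaluated directly; the corresponding $\phi$ is infinite-dimensional and never formed.

```python
# Minimal sketch: kernel trick via scikit-learn's SVC on a toy two-ring dataset.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
theta = rng.uniform(0.0, 2.0 * np.pi, 200)
radius = np.where(np.arange(200) < 100, 1.0, 2.5) + 0.1 * rng.standard_normal(200)
X = np.c_[radius * np.cos(theta), radius * np.sin(theta)]   # inner ring vs. outer ring
y = np.where(np.arange(200) < 100, 1, -1)

# The RBF kernel is evaluated internally; phi(x) is never computed explicitly.
clf = SVC(kernel="rbf", C=1.0, gamma=1.0).fit(X, y)
print("training accuracy:", clf.score(X, y))
```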

Sparsity

In the solutions of the dual problems (2) and (4), most examples have $\alpha_i=0$ and only a few have $\alpha_i > 0$. The examples with $\alpha_i > 0$ are called "support vectors"; in the separable case they lie exactly on the two hyperplanes, while in the soft-margin case they may also sit inside or beyond the margin. The sparsity follows from complementary slackness on the inequality constraints of the primal problems (1) and (3).
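As a small check of this sparsity (the tooling choice is mine), a fitted scikit-learn SVC stores only the support vectors and their dual coefficients:

```python
# Minimal sketch: inspect the sparsity of the dual solution with scikit-learn.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=200, centers=2, random_state=0)
clf = SVC(kernel="linear", C=1.0).fit(X, y)

print(len(clf.support_), "support vectors out of", X.shape[0])  # typically only a few
print(clf.support_vectors_.shape)   # only the x_i with alpha_i > 0 are stored
print(clf.dual_coef_)               # signed duals y_i * alpha_i for the support vectors
```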

SVM for regression

For regression, we predict $y^{(i)}$ by the linear model $\hat{y}^{(i)}=w^T\phi(x^{(i)})+b$. The optimization is formulated with the $\epsilon$-insensitive loss function as \begin{eqnarray}\min_{w, b}\quad \sum_{i=1}^n \max\left\{0, \left|y^{(i)}-\left[w^T \phi(x^{(i)})+b\right]\right|-\epsilon\right\}+\frac{1}{2C}||w||^2\,.\end{eqnarray} Unlike ordinary linear regression, there is no loss penalty if the error lies within the margin defined by $\epsilon$, i.e., no loss when $\left|y^{(i)}-\left[w^T \phi(x^{(i)})+b\right]\right|\leq \epsilon$ and an ${\ell}_1$ loss when $\left|y^{(i)}-\left[w^T \phi(x^{(i)})+b\right]\right|>\epsilon$. 
We can write the optimization into a constrained form as \begin{eqnarray}\begin{split}\min_{w,b, \xi, \eta}\quad && \frac{1}{2}||w||^2+C\sum_{i=1}^n (\xi_i+\eta_i)\nonumber\\ \text{s.t.}\quad&& y^{(i)}-\left(w^T \phi(x^{(i)})+b\right)\leq \epsilon +\xi_i, \,\,\,\,i=1,\cdots,n\nonumber\\ && y^{(i)}-\left(w^T \phi(x^{(i)})+b\right)\geq -\epsilon -\eta_i, \,\,\,\,i=1,\cdots,n\nonumber\\ &&\xi_i\geq 0,\,\,\,\,i=1,\cdots,n \\ &&\eta_i\geq 0,\,\,\,\,i=1,\cdots,n\end{split}\tag{5}\end{eqnarray}
Similar to SVM for binary classification, we can derive a dual problem, and the same concepts of kernel, sparsity, and support vectors carry over; a brief usage sketch follows. 
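A brief sketch of support vector regression with scikit-learn's SVR (my choice of tooling; the parameters C and epsilon play the roles of $C$ and $\epsilon$ above, up to the library's conventions):

```python
# Minimal sketch: epsilon-insensitive regression with scikit-learn's SVR.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0.0, 2.0 * np.pi, 100))[:, None]
y = np.sin(X).ravel() + 0.1 * rng.standard_normal(100)

reg = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, y)
print(len(reg.support_), "support vectors")   # points on or outside the epsilon-tube
y_hat = reg.predict(X)
```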

Least-Square SVM (LS-SVM)

LS-SVM treats the non-linearly separable case in a different way from SVM. Instead of using the ${\ell}_1$ penalty on slack variables and inequality constraints as in (3) and (5), LS-SVM introduces an ${\ell}_2$ penalty (squared slacks) and equality constraints:

  • No sparsity because of equality constraints.
  • Kernel trick is still applicable.
  • The solutions are in a closed form.

LS-SVM for binary classification

The optimization problem of LS-SVM for binary classification modifies problem (3) to \begin{eqnarray} \min_{w,b,\xi}&& \frac{1}{2}||w||^2+C\sum_{i=1}^n \xi_i^2\nonumber\tag{6}\\ \text{s.t.}&& y^{(i)}\left[w^T \phi(x^{(i)})+b\right]= 1-\xi_i, \,\,\,\,i=1,\cdots,n\end{eqnarray}

Because of the equality constraints, it can be solved by the method of Lagrange multipliers:

  1. Write the Lagrangian as \begin{eqnarray}L(w, b, \xi, \alpha)=\frac{1}{2}||w||^2+C\sum_{i=1}^n \xi_i^2-\sum_{i=1}^n\alpha_i\left\{y^{(i)}\left[w^T \phi(x^{(i)})+b\right] - 1+\xi_i\right\}\,.\end{eqnarray} 
  2. Set the derivatives to zero: \begin{eqnarray}\frac{\partial L}{\partial w}&=&w - \sum_{i=1}^n\alpha_i y^{(i)} \phi(x^{(i)})=0\,,\\ -\frac{\partial L}{\partial b}&=&\sum_{i=1}^n\alpha_iy^{(i)}=0\,,\\\frac{\partial L}{\partial \xi_i}&=&2C\xi_i-\alpha_i=0\,,\\ -\frac{\partial L}{\partial \alpha_i}&=& y^{(i)}\left[w^T \phi(x^{(i)})+b\right] - 1+\xi_i=0\,.\end{eqnarray} 
  3. Eliminate $w,\,\xi$ and obtain the linear system for $b, \,\boldsymbol{\alpha}$: \begin{eqnarray}\left[\begin{array}{cc} 0 & \mathbf{y}^T\\ \mathbf{y} & \boldsymbol{\Omega} + \mathbf{I}/2C\end{array}\right] \left[\begin{array}{c} b\\ \boldsymbol{\alpha} \end{array}\right] = \left[\begin{array}{c} 0\\ \mathbf{1}\end{array}\right]\,,\tag{6.1}\end{eqnarray} with $\Omega_{ij}=y^{(i)} y^{(j)} K_{ij}$, $\mathbf{y}=\left(y^{(1)},\cdots,y^{(n)}\right)^T$, and $\mathbf{1}$ the all-ones vector.
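Because (6.1) is a dense linear system, the classifier can be obtained in closed form with a single linear solve. The NumPy sketch below is my own illustration, assuming an RBF kernel; it returns the decision rule $\hat y(x)=\mathrm{sign}\left[\sum_i \alpha_i y^{(i)} K(x^{(i)}, x)+b\right]$, which follows from $w=\sum_i\alpha_i y^{(i)}\phi(x^{(i)})$.

```python
# Minimal sketch: solve the LS-SVM classification system (6.1) with NumPy.
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq)

def lssvm_classifier(X, y, C=1.0, gamma=1.0):
    """X has shape (n, p); y has entries +/-1."""
    n = X.shape[0]
    K = rbf_kernel(X, X, gamma)
    Omega = np.outer(y, y) * K
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = y
    A[1:, 0] = y
    A[1:, 1:] = Omega + np.eye(n) / (2.0 * C)
    rhs = np.concatenate(([0.0], np.ones(n)))
    sol = np.linalg.solve(A, rhs)
    b, alpha = sol[0], sol[1:]
    # Decision rule follows from w = sum_i alpha_i y_i phi(x_i):
    return lambda Xnew: np.sign(rbf_kernel(Xnew, X, gamma) @ (alpha * y) + b)
```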

LS-SVM for regression

The optimization problem of LS-SVM for regression modifies (5) to \begin{eqnarray} \min_{w,b,\xi}&& \frac{1}{2}||w||^2+C\sum_{i=1}^n \xi_i^2\nonumber\tag{7}\\ \text{s.t.}&& y^{(i)}=w^T \phi(x^{(i)})+b+\xi_i, \,\,\,\,i=1,\cdots,n\end{eqnarray} whose unconstrained form is exactly linear regression in the feature space with ridge (${\ell}_2$) regularization. Because of the equality constraints, it can also be solved by the method of Lagrange multipliers:

  1. Write the Lagrangian as \begin{eqnarray}L(w, b, \xi, \alpha)=\frac{1}{2}||w||^2+C\sum_{i=1}^n \xi_i^2+\sum_{i=1}^n\alpha_i\left[y^{(i)}-w^T \phi(x^{(i)})-b-\xi_i\right]\,.\end{eqnarray} 
  2. Set the derivatives to zero: \begin{eqnarray}\frac{\partial L}{\partial w}&=&w - \sum_{i=1}^n\alpha_i \phi(x^{(i)})=0\,,\\ -\frac{\partial L}{\partial b}&=&\sum_{i=1}^n\alpha_i=0\,,\\\frac{\partial L}{\partial \xi_i}&=&2C\xi_i-\alpha_i=0\,,\\ \frac{\partial L}{\partial \alpha_i}&=& y^{(i)}-w^T \phi(x^{(i)})-b-\xi_i=0\,.\end{eqnarray} 
  3. Eliminate $w,\,\xi$ and obtain the linear system for $b, \,\boldsymbol{\alpha}$: \begin{eqnarray}\left[\begin{array}{cc} 0 & \mathbf{1}^T\\ \mathbf{1} & \mathbf{K} + \mathbf{I}/2C\end{array}\right] \left[\begin{array}{c} b\\ \boldsymbol{\alpha} \end{array}\right] = \left[\begin{array}{c} 0\\ \mathbf{y}\end{array}\right]\,.\tag{7.1}\end{eqnarray}
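The analogous closed-form solve for (7.1), again a NumPy sketch of my own with an assumed RBF kernel; relative to (6.1), $\boldsymbol{\Omega}$ becomes $\mathbf{K}$, the label vector $\mathbf{y}$ moves to the right-hand side, and the prediction is $\hat y(x)=\sum_i \alpha_i K(x^{(i)}, x)+b$.

```python
# Minimal sketch: solve the LS-SVM regression system (7.1) with NumPy.
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq)

def lssvm_regressor(X, y, C=1.0, gamma=1.0):
    """X has shape (n, p); y is a real-valued target vector of length n."""
    n = X.shape[0]
    K = rbf_kernel(X, X, gamma)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = K + np.eye(n) / (2.0 * C)
    rhs = np.concatenate(([0.0], y))
    sol = np.linalg.solve(A, rhs)
    b, alpha = sol[0], sol[1:]
    # Prediction follows from w = sum_i alpha_i phi(x_i):
    return lambda Xnew: rbf_kernel(Xnew, X, gamma) @ alpha + b
```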
