Support Vector Machine (SVM)
The training dataset of SVM consists of $n$ training examples of the form $\left(x^{(i)}, y^{(i)}\right)$ for $i=1, \cdots, n$, where each ${x}^{(i)}$ is a $p$-dimensional input vector and $y^{(i)}$ is the corresponding label. Furthermore, there is a fixed feature map $\phi$ in SVM that maps an input vector $x$ to a feature vector $\phi(x)$. Such a feature map is implicitly specified by a kernel in the dual formulation of SVM. The task of SVM is to find two parallel linear boundaries (hyperplanes) in the feature space by maximizing their margin, for either binary classification or regression.
SVM as binary classifier
Fig. 1: Two parallel hyperplanes in the feature space with the maximum margin for binary classification: (a) linearly separable data (taken from Wikipedia); (b) non-linearly separable data (taken from the OpenCV documentation).
Robust linear discrimination
Suppose the data of the two classes are linearly separable, as shown in Fig. 1 (a). Then one can find two parallel hyperplanes such that $w^T\,\phi(x^{(i)})+b\geq 1$ for the class $y^{(i)}=1$ and $w^T \,\phi(x^{(i)})+b\leq -1$ for the class $y^{(i)}=-1$. Note that the distance, or margin, between the two hyperplanes $w^T z+b=\pm 1$ is ${2}/{||w||}$. SVM finds the maximum margin, leading to the primal optimization problem\begin{eqnarray}\begin{split} \min_{w,b}\quad && \frac{1}{2}||w||^2\nonumber\\ \text{s.t.}\quad && y^{(i)}\left[w^T \,\phi(x^{(i)})+b\right]\geq 1, \,\,\,\,i=1,\cdots,n \end{split}\tag{1}\end{eqnarray} Although the optimization problem (1) can be solved directly by quadratic programming, in practice one prefers to solve its dual problem via Lagrangian duality:
- Write the Lagrangian of the problem (1) as \begin{eqnarray} L(w, b, \alpha)=\frac{1}{2}||w||^2 + \sum_{i=1}^n \alpha_i\left\{1-y^{(i)}\left[w^T \,\phi(x^{(i)})+b\right]\right\}\tag{1.1}\end{eqnarray} with $\alpha_i\geq 0$.
- Set the derivatives $\frac{\partial}{\partial w}L(w, b, \alpha) =0$ and $\frac{\partial}{\partial b}L(w, b, \alpha) =0$, which give \begin{eqnarray} &&w=\sum_{i=1}^n\alpha_i y^{(i)}\phi(x^{(i)})\,,\tag{1.2}\\ &&\sum_{i=1}^n \alpha_i y^{(i)}=0\,. \tag{1.3}\end{eqnarray}
- Substitute (1.2) into the Lagrangian (1.1) and apply the constraint (1.3), leading to the dual function \begin{eqnarray} d(\alpha) = \sum_{i=1}^n \alpha_i-\frac{1}{2}\sum_{i,j=1}^n y^{(i)}y^{(j)} \alpha_i\alpha_j\phi(x^{(i)})^T \phi(x^{(j)})\,.\tag{1.4}\end{eqnarray} Maximizing $d(\alpha)$ over the feasible multipliers gives the dual problem \begin{eqnarray}\begin{split}\max_{\alpha}\quad && \sum_{i=1}^n \alpha_i-\frac{1}{2}\sum_{i,j=1}^n y^{(i)}y^{(j)}\alpha_i\alpha_j\phi(x^{(i)})^T \phi(x^{(j)})\nonumber\\ \text{s.t.}\quad && \alpha_i\geq 0, \,\,\,\,i=1,\cdots,n\nonumber\\ && \sum_{i=1}^n \alpha_i y^{(i)}=0 \end{split}\tag{2}\end{eqnarray}
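As a concrete illustration, here is a minimal numerical sketch of solving the dual (2) as a standard quadratic program. The use of `cvxopt`, the linear kernel, and the threshold for identifying support vectors are assumptions for illustration; the post itself does not prescribe a solver.

```python
# Minimal sketch: hard-margin dual (2) as a QP, assuming cvxopt and a linear kernel.
import numpy as np
from cvxopt import matrix, solvers

solvers.options['show_progress'] = False

def fit_hard_margin_dual(X, y):
    """X: (n, p) inputs, y: (n,) labels in {-1, +1}; data assumed separable."""
    n = len(y)
    K = X @ X.T                                     # K_ij = x_i^T x_j (phi = identity)
    P = matrix((np.outer(y, y) * K).astype(float))  # quadratic term y_i y_j K_ij
    q = matrix(-np.ones(n))                         # linear term: -sum_i alpha_i
    G = matrix(-np.eye(n))                          # -alpha_i <= 0, i.e. alpha_i >= 0
    h = matrix(np.zeros(n))
    A = matrix(y.reshape(1, -1).astype(float))      # equality: sum_i alpha_i y_i = 0
    b = matrix(0.0)
    alpha = np.ravel(solvers.qp(P, q, G, h, A, b)['x'])
    # Recover w from (1.2) and b from any support vector (alpha_i > 0).
    w = (alpha * y) @ X
    sv = alpha > 1e-6
    b0 = np.mean(y[sv] - X[sv] @ w)
    return w, b0, alpha
```

For a nonlinear kernel, $w$ is not formed explicitly; predictions instead use $\sum_i \alpha_i y^{(i)} K(x^{(i)}, x) + b$.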
Approximate linear separation of non-separable sets
As shown in Fig. 1 (b), to make the algorithm work for non-linearly separable datasets and be less sensitive to outliers, we reformulate the primal problem (1) by introducing slack variables $\xi_i$ with an ${\ell}_1$ penalty: \begin{eqnarray}\begin{split}\min_{w,b, \xi}\quad && \frac{1}{2}||w||^2+C\sum_{i=1}^n \xi_i\nonumber\\ \text{s.t.}\quad&& y^{(i)}\left[w^T \phi(x^{(i)})+b\right]\geq 1-\xi_i, \,\,\,\,i=1,\cdots,n\nonumber\\ &&\xi_i\geq 0,\,\,\,\,i=1,\cdots,n \end{split}\tag{3}\end{eqnarray} Following the same Lagrangian-duality steps as in (1.1)-(1.4), we can derive the dual problem of the primal problem (3) as \begin{eqnarray} \begin{split}\max_{\alpha}\quad && \sum_{i=1}^n \alpha_i-\frac{1}{2}\sum_{i,j=1}^n y^{(i)}y^{(j)}\alpha_i\alpha_jK_{ij}\nonumber\\ \text{s.t.}\quad && 0\leq\alpha_i\leq C, \,\,\,\,i=1,\cdots,n\nonumber\\ && \sum_{i=1}^n \alpha_i y^{(i)}=0 \end{split}\tag{4}\end{eqnarray} where $K_{ij}\equiv\phi(x^{(i)})^T \phi(x^{(j)})$ is the kernel.
Note:
- For the linearly and non-linearly separable cases, although their primal forms (1) and (3) look different, their dual forms (2) and (4) are very similar: the only difference is the additional constraint $\alpha_i\leq C$ in problem (4), which comes from stationarity with respect to $\xi_i$, namely $C-\alpha_i-\mu_i=0$ with $\mu_i\geq 0$ the multiplier of the constraint $\xi_i\geq 0$.
- The primal problem (3) is equivalent to an unconstrained optimization problem with the hinge loss: \begin{eqnarray} \min_{w, b}\quad \sum_{i=1}^n \max\left\{0, 1-y^{(i)}[w^T \phi(x^{(i)})+b]\right\}+\frac{1}{2C}||w||^2\,,\end{eqnarray} where $1/2C$ is the strength of the ${\ell}_2$ regularization on $w$. Such a form does not cover the hard-margin case (1), which corresponds to the limit $C\to\infty$ where the ${\ell}_2$ term vanishes. A minimal subgradient sketch of this form is given after this list.
- The dual problem (4) is typically solved by the sequential minimal optimization (SMO) algorithm.
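To illustrate the hinge-loss form in the note above, here is a minimal subgradient-descent sketch for a linear kernel ($\phi(x)=x$). The learning rate, iteration count, and the plain (unaveraged) subgradient update are illustrative assumptions, not part of the original formulation.

```python
# Minimal sketch: subgradient descent on the hinge-loss form of problem (3), linear kernel.
import numpy as np

def fit_hinge(X, y, C=1.0, lr=1e-3, n_iter=1000):
    """X: (n, p) inputs, y: (n,) labels in {-1, +1}."""
    n, p = X.shape
    w, b = np.zeros(p), 0.0
    for _ in range(n_iter):
        margins = y * (X @ w + b)
        viol = margins < 1                      # points inside the margin or misclassified
        # Subgradient of sum_i max(0, 1 - y_i (w^T x_i + b)) + ||w||^2 / (2C)
        grad_w = w / C - (y[viol, None] * X[viol]).sum(axis=0)
        grad_b = -y[viol].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```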
Kernel trick
In the dual forms (2) and (4), the feature vectors $\phi(x)$ enter only through the kernel $K_{ij}\equiv\phi(x^{(i)})^T \phi(x^{(j)})$. Therefore, in practice, we can specify the kernel $K$ directly, with no need for an explicit form of $\phi$ (a small numerical check follows this list):
- Computational efficiency: computing a high-dimensional feature $\phi(x)$ can be very expensive, while the kernel $K$ is often much cheaper to compute.
- Feasibility: Mercer's theorem states that any symmetric function $K(x, z)$ can be expressed in the form of $\phi(x)^T\phi(z)$ for some $\phi$ if and only if $K(x, z)$ is positive semidefinite.
- Nonlinear boundaries: the boundaries are hyperplanes in the feature space but nonlinear in the input space.
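As a small sanity check of the kernel trick, the sketch below compares the Gram matrix of the quadratic kernel $K(x,z)=(x^Tz)^2$ in two dimensions, computed via its explicit feature map $\phi(x)=(x_1^2,\sqrt{2}\,x_1x_2,x_2^2)$, against the kernel evaluated directly. The random data are illustrative.

```python
# Minimal check: explicit feature map vs. direct kernel evaluation for K(x, z) = (x^T z)^2.
import numpy as np

def phi(x):
    # Explicit feature map of the quadratic kernel in 2 dimensions.
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))

K_explicit = np.array([[phi(a) @ phi(b) for b in X] for a in X])  # via feature map
K_kernel   = (X @ X.T) ** 2                                       # via kernel only
print(np.allclose(K_explicit, K_kernel))                          # True
```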
Sparsity
By the KKT complementary slackness conditions, $\alpha_i>0$ only for the training examples that lie on or inside the margin (the support vectors); all other $\alpha_i$ vanish. The solution (1.2) is therefore a sparse combination of the training data, and predictions depend only on the support vectors.
SVM for regression
In SVM regression ($\epsilon$-SVR), the labels $y^{(i)}$ are real-valued and the goal is to fit the data within an $\epsilon$-insensitive tube, penalizing only points that fall outside it. With the same ${\ell}_1$ penalty on slack variables and inequality constraints as in (3), the primal problem is \begin{eqnarray}\begin{split}\min_{w,b,\xi,\xi^*}\quad && \frac{1}{2}||w||^2+C\sum_{i=1}^n \left(\xi_i+\xi_i^*\right)\nonumber\\ \text{s.t.}\quad && y^{(i)}-w^T \phi(x^{(i)})-b\leq \epsilon+\xi_i, \,\,\,\,i=1,\cdots,n\nonumber\\ && w^T \phi(x^{(i)})+b-y^{(i)}\leq \epsilon+\xi_i^*, \,\,\,\,i=1,\cdots,n\nonumber\\ && \xi_i\geq 0,\,\,\xi_i^*\geq 0, \,\,\,\,i=1,\cdots,n \end{split}\tag{5}\end{eqnarray} Its dual again involves the data only through the kernel $K_{ij}$, so the kernel trick and the sparsity of the solution carry over.
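As a usage sketch (assuming scikit-learn, which the post itself does not name), $\epsilon$-SVR exposes exactly the two knobs in (5): $C$ for the slack penalty and $\epsilon$ for the tube width. The toy data below are illustrative.

```python
# Minimal usage sketch of epsilon-SVR, assuming scikit-learn and toy sine data.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X).ravel() + 0.1 * rng.normal(size=100)

# C penalizes points outside the epsilon-tube; epsilon sets the tube half-width.
model = SVR(kernel='rbf', C=1.0, epsilon=0.1).fit(X, y)
y_pred = model.predict(X)
```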
Least-Square SVM (LS-SVM)
LS-SVM treats the non-linearly separable case in a different way from SVM. Instead of using the ${\ell}_1$ regularization and inequality constraints as in (3) and (5), LS-SVM introduces ${\ell}_2$ regularization of the errors and equality constraints:
- No sparsity because of equality constraints.
- Kernel trick is still applicable.
- The solutions are in a closed form.
LS-SVM for binary classification
The optimization problem of LS-SVM for binary classification differs from problem (3) in that the slack variables are squared and the constraints are equalities: \begin{eqnarray} \min_{w,b,\xi}&& \frac{1}{2}||w||^2+C\sum_{i=1}^n \xi_i^2\nonumber\tag{6}\\ \text{s.t.}&& y^{(i)}\left[w^T \phi(x^{(i)})+b\right]= 1-\xi_i, \,\,\,\,i=1,\cdots,n\end{eqnarray}
Because the constraints are equalities, the problem can be solved by the method of Lagrange multipliers:
- Write the Lagrangian as \begin{eqnarray}L(w, b, \xi, \alpha)=\frac{1}{2}||w||^2+C\sum_{i=1}^n \xi_i^2-\sum_{i=1}^n\alpha_i\left\{y^{(i)}\left[w^T \phi(x^{(i)})+b\right] - 1+\xi_i\right\}\,.\end{eqnarray}
- Set the derivatives to zero: \begin{eqnarray}\frac{\partial L}{\partial w}&=&w - \sum_{i=1}^n\alpha_i y^{(i)} \phi(x^{(i)})=0\,,\\ \frac{\partial L}{\partial b}&=&-\sum_{i=1}^n\alpha_iy^{(i)}=0\,,\\\frac{\partial L}{\partial \xi_i}&=&2C\xi_i-\alpha_i=0\,,\\ \frac{\partial L}{\partial \alpha_i}&=& -\left\{y^{(i)}\left[w^T \phi(x^{(i)})+b\right] - 1+\xi_i\right\}=0\,.\end{eqnarray}
- Eliminate $w$ and $\xi$ to obtain a linear system in $b$ and $\alpha$: \begin{eqnarray}\left[\begin{array}{cc} 0 & \mathbf{y}^T\\ \mathbf{y} & \boldsymbol{\Omega} + \mathbf{I}/2C\end{array}\right] \left[\begin{array}{c} b\\ \boldsymbol{\alpha} \end{array}\right] = \left[\begin{array}{c} 0\\ \mathbf{1}\end{array}\right]\,,\tag{6.1}\end{eqnarray} with $\Omega_{ij}=y^{(i)} y^{(j)} K_{ij}$.
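Since (6.1) is just a linear system, it can be solved in closed form with a single linear solve. Below is a minimal numpy sketch; the RBF kernel, its width, and the value of $C$ are illustrative assumptions.

```python
# Minimal sketch: solve the LS-SVM classification system (6.1), assuming an RBF kernel.
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def fit_lssvm_classifier(X, y, C=1.0):
    """X: (n, p) inputs, y: (n,) labels in {-1, +1}."""
    n = len(y)
    K = rbf_kernel(X, X)
    Omega = np.outer(y, y) * K                      # Omega_ij = y_i y_j K_ij
    M = np.zeros((n + 1, n + 1))                    # block matrix of (6.1)
    M[0, 1:] = y
    M[1:, 0] = y
    M[1:, 1:] = Omega + np.eye(n) / (2 * C)
    rhs = np.concatenate(([0.0], np.ones(n)))       # right-hand side [0; 1]
    sol = np.linalg.solve(M, rhs)
    b, alpha = sol[0], sol[1:]
    return b, alpha

def predict_class(Xnew, X, y, alpha, b):
    # Decision function from (1.2): sign(sum_i alpha_i y_i K(x_i, x) + b)
    return np.sign(rbf_kernel(Xnew, X) @ (alpha * y) + b)
```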
LS-SVM for regression
The optimization problem of LS-SVM for regression differs from (5) in the same way: \begin{eqnarray} \min_{w,b,\xi}&& \frac{1}{2}||w||^2+C\sum_{i=1}^n \xi_i^2\nonumber\tag{7}\\ \text{s.t.}&& y^{(i)}=w^T \phi(x^{(i)})+b+\xi_i, \,\,\,\,i=1,\cdots,n\end{eqnarray} whose unconstrained form is exactly linear regression in the feature space with ridge regularization. Because the constraints are equalities, it can also be solved by the method of Lagrange multipliers:
- Write the Lagrangian as \begin{eqnarray}L(w, b, \xi, \alpha)=\frac{1}{2}||w||^2+C\sum_{i=1}^n \xi_i^2+\sum_{i=1}^n\alpha_i\left[y^{(i)}-w^T \phi(x^{(i)})-b-\xi_i\right]\,.\end{eqnarray}
- Set the derivatives to zero: \begin{eqnarray}\frac{\partial L}{\partial w}&=&w - \sum_{i=1}^n\alpha_i \phi(x^{(i)})=0\,,\\ \frac{\partial L}{\partial b}&=&-\sum_{i=1}^n\alpha_i=0\,,\\\frac{\partial L}{\partial \xi_i}&=&2C\xi_i-\alpha_i=0\,,\\ \frac{\partial L}{\partial \alpha_i}&=& y^{(i)}-w^T \phi(x^{(i)})-b-\xi_i=0\,.\end{eqnarray}
- Eliminate $w$ and $\xi$ to obtain a linear system in $b$ and $\alpha$: \begin{eqnarray}\left[\begin{array}{cc} 0 & \mathbf{1}^T\\ \mathbf{1} & \mathbf{K} + \mathbf{I}/2C\end{array}\right] \left[\begin{array}{c} b\\ \boldsymbol{\alpha} \end{array}\right] = \left[\begin{array}{c} 0\\ \mathbf{y}\end{array}\right]\,.\tag{7.1}\end{eqnarray}
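Similarly, here is a minimal numpy sketch of solving (7.1) and predicting with $f(x)=\sum_i \alpha_i K(x^{(i)}, x)+b$; again the RBF kernel and the value of $C$ are illustrative assumptions.

```python
# Minimal sketch: solve the LS-SVM regression system (7.1), assuming an RBF kernel.
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def fit_lssvm_regressor(X, y, C=1.0):
    """X: (n, p) inputs, y: (n,) real-valued targets."""
    n = len(y)
    K = rbf_kernel(X, X)
    M = np.zeros((n + 1, n + 1))                    # block matrix of (7.1)
    M[0, 1:] = 1.0
    M[1:, 0] = 1.0
    M[1:, 1:] = K + np.eye(n) / (2 * C)
    rhs = np.concatenate(([0.0], y))                # right-hand side [0; y]
    sol = np.linalg.solve(M, rhs)
    b, alpha = sol[0], sol[1:]
    return b, alpha

def predict_regression(Xnew, X, alpha, b):
    # f(x) = sum_i alpha_i K(x_i, x) + b
    return rbf_kernel(Xnew, X) @ alpha + b
```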