Support Vector Machine (SVM): Theory and Application in R and Python
Caleb / 2020-04-05
Introduction
Support vector machine (SVM) is a supervised learning model with associated learning algorithms that analyze data for classification and regression analysis.1 The basic SVM is a binary classifier, and it can perform non-linear classification when kernel functions are introduced. There are basically three kinds of SVMs, each solving a different problem.
Linearly separable SVM (hard margin SVM): when the training data is linearly separable, the separating hyperplane with the largest margin, together with the corresponding classification decision function, is called the linearly separable SVM.
Linear SVM (soft margin SVM): when the training data is approximately (but not strictly) linearly separable, a soft margin is introduced, and the resulting maximum soft-margin classifier is called the linear SVM.
Nonlinear SVM: when the training data is not linearly separable, kernel functions together with the soft margin are used to learn a nonlinear SVM.
Linearly separable SVM
Given a training dataset in the feature space \(\mathscr{T} = \{({\bf x}_1,y_1),({\bf x}_2,y_2),\ldots, ({\bf x}_n,y_n)\}\), where \({\bf x}_i = (x_i^{(1)}, x_i^{(2)},\ldots, x_i^{(d)})^{{\top}}\in\mathbb{R}^d, y_i\in\mathscr{Y} = \{+1,-1\}, i = 1,2,\ldots,n\), \({\bf x}_i\) is the \(i\)th instance and \(y_i\) is the class label of \({\bf x}_i\). The instance \({\bf x}_i\) is called a positive (negative) instance when \(y_i=+1\) \((y_i=-1)\).
The target of the linearly separable SVM is to find a hyperplane that separates all positive instances onto one side and all negative instances onto the other side. The hyperplane is expressed by the following equation: \[\begin{eqnarray} {\bf w}{\bf x}+ b = 0 \tag{1} \end{eqnarray}\] where \({\bf w}\) is a weight vector and \(b\) is an intercept. Eq. (1) can be denoted by \(({\bf w},b)\), which is what we aim to find.
When the training dataset is linearly separable, such a hyperplane is not unique. In Fig. 1, for example, many hyperplanes (solid lines) can be obtained by shifting the line between the two dashed lines or tilting it slightly, and they all still separate the green and blue points completely. Therefore, the linearly separable SVM imposes a maximum-margin constraint to pin down a unique hyperplane.
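To make this non-uniqueness concrete, here is a small Python sketch (the toy data and the candidate hyperplanes below are made up purely for illustration) showing that several different pairs \(({\bf w},b)\) all classify the same linearly separable points perfectly:

```python
import numpy as np

# A tiny made-up linearly separable dataset: 2-d points with labels +1 / -1.
X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0], [0.5, 2.0]])
y = np.array([1, 1, -1, -1])

# Several candidate hyperplanes (w, b); each of them separates the data,
# illustrating that a separating hyperplane is not unique.
candidates = [
    (np.array([1.0, 1.0]), -5.0),
    (np.array([1.0, 2.0]), -7.5),
    (np.array([2.0, 1.0]), -7.0),
]

for w, b in candidates:
    predictions = np.sign(X @ w + b)
    print(w, b, "separates the data:", bool(np.all(predictions == y)))
```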
Functional Margin
Relative distance. For a hyperplane, \({\bf w}{\bf x}+b=0\), the relative distance (not geometric distance) from a point \({\bf x}_i\) to this hyperplane is \(|{\bf w}{\bf x}_i+b|\).
When \({\bf w}\cdot{\bf x}_i+b>0\), the instance \({\bf x}_i\) is above the hyperplane and is classified as positive. If \(y_i=+1\), then the classification is correct. See the blue points.
When \({\bf w}\cdot{\bf x}_i+b<0\), the instance \({\bf x}_i\) is below the hyperplane and is classified as negative. If \(y_i=-1\), then the classification is correct. See the green points.
Whether the sign of \({\bf w}{\bf x}_i+b\) agrees with the sign of the class label \(y_i\) therefore tells us whether the classification is correct.
Functional Margin. The functional margin of a sample point \(({\bf x}_i,y_i)\) with respect to a hyperplane \(({\bf w},b)\) is defined as \(\hat\gamma_i({\bf w},b) = y_i({\bf w}{\bf x}_i+b), i = 1,2,\ldots, n\). The functional margin of the training dataset \(\mathscr{T}\) with respect to a hyperplane \(({\bf w},b)\) is defined as \(\hat\gamma({\bf w},b) = \min_i\hat\gamma_i({\bf w},b)\).
Reliability. For a linearly separable SVM, the distance from an instance to the hyperplane indicates the reliability of the classification prediction: the farther an instance is from the hyperplane, the more reliable its classification, and vice versa.
Shortcoming of the functional margin. When \(({\bf w},b)\) is scaled proportionally, say to \((10{\bf w},10b)\), the hyperplane \(10{\bf w}{\bf x}+10b=0\) stays the same, yet the functional margin becomes 10 times larger. Thus, we introduce the geometric margin, which leads to a unique maximum-margin hyperplane (the solid line in Fig. 1).
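To make the functional margin and its scaling problem concrete, here is a minimal Python sketch (reusing the made-up toy data from the earlier snippet):

```python
import numpy as np

def functional_margin(X, y, w, b):
    """Functional margin of the dataset: min_i y_i * (w . x_i + b)."""
    return np.min(y * (X @ w + b))

# The same made-up linearly separable toy data as before.
X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0], [0.5, 2.0]])
y = np.array([1, 1, -1, -1])
w, b = np.array([1.0, 1.0]), -5.0

print(functional_margin(X, y, w, b))            # gamma_hat for (w, b)
print(functional_margin(X, y, 10 * w, 10 * b))  # 10 * gamma_hat: same hyperplane, inflated functional margin
```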
Geometric Margin
The geometric distance between \({\bf x}_i\) and the hyperplane \({\bf w}{\bf x}+ b = 0\) is given by
\[ \gamma_i = \frac{|{\bf w}{\bf x}_i+b|}{\|{\bf w}\|}. \] As before, we choose \(({\bf w}, b)\) so that \({\bf w}{\bf x}+ b < 0\) for negative instances and \({\bf w}{\bf x}+ b > 0\) for positive instances. Therefore, when a sample point \(({\bf x}_i, y_i)\) is correctly classified, its geometric distance from the hyperplane is \(\gamma_i = \frac{y_i({\bf w}{\bf x}_i+b)}{\|{\bf w}\|}.\) Thus, the geometric margin between \(({\bf x}_i, y_i)\) and a hyperplane \(({\bf w}, b)\) is defined as \[ \gamma_i({\bf w},b) = \frac{y_i({\bf w}{\bf x}_i+b)}{\|{\bf w}\|},i=1,2,\ldots,n \] and the geometric margin between a training dataset \(\mathscr{T}\) and a hyperplane \(({\bf w}, b)\) is defined as
\[ \gamma({\bf w},b)=\min_{{\bf x}_i\in\mathscr{T}}\gamma_i({\bf w},b). \] Note that the geometric margin is scale invariant and it is a function of \(({\bf w},b)\).
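Continuing the sketch above, dividing by \(\|{\bf w}\|\) gives the geometric margin, and rescaling \(({\bf w},b)\) no longer changes anything:

```python
import numpy as np

def geometric_margin(X, y, w, b):
    """Geometric margin of the dataset: min_i y_i * (w . x_i + b) / ||w||."""
    return np.min(y * (X @ w + b)) / np.linalg.norm(w)

# The same made-up toy data as before.
X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0], [0.5, 2.0]])
y = np.array([1, 1, -1, -1])
w, b = np.array([1.0, 1.0]), -5.0

print(geometric_margin(X, y, w, b))            # gamma for (w, b)
print(geometric_margin(X, y, 10 * w, 10 * b))  # identical value: the geometric margin is scale invariant
```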
Linearly Separable SVM as An Optimization Problem
The target of the SVM is to find a hyperplane with the maximum geometric margin that correctly separates the training dataset; maximizing the geometric margin here is also called hard margin maximization. This goal can be written as a constrained optimization problem:
\[\begin{eqnarray} &&\max_{{\bf w},b}\gamma\\ && s.t.\quad \frac{y_i({\bf w}{\bf x}_i+b)}{\|{\bf w}\|}\geq \gamma,i=1,2,\ldots,n \tag{2} \end{eqnarray}\] The inequality in Eq. (2) holds because \(\frac{y_i({\bf w}{\bf x}_i+b)}{\|{\bf w}\|} = \gamma_i({\bf w},b) \geq \min_{{\bf x}_i\in\mathscr{T}}\gamma_i({\bf w},b) =\gamma\).
According to the connection between the functional margin and the geometric margin (\(\hat\gamma = \gamma\|{\bf w}\|\)), the above problem can be converted to: \[\begin{eqnarray} &&\max_{{\bf w},b}\frac{\hat\gamma}{\|{\bf w}\|}\\ && s.t.\quad y_i({\bf w}{\bf x}_i+b)\geq \gamma\|{\bf w}\|,i=1,2,\ldots,n. \end{eqnarray}\] Note that for all \(i\), \[ y_i({\bf w}{\bf x}_i+b)\geq \gamma\|{\bf w}\| \iff y_i\left(\frac{{\bf w}}{\gamma\|{\bf w}\|}{\bf x}_i+\frac{b}{\gamma\|{\bf w}\|}\right) \geq 1. \] Reparameterizing, \[ \frac{{\bf w}}{\gamma\|{\bf w}\|} \triangleq \tilde{\bf w},\quad \frac{b}{\gamma\|{\bf w}\|}\triangleq\tilde b. \] Therefore, the new hyperplane is \((\tilde {\bf w}, \tilde b)\). Suppose a point \(({\bf x}_0,y_0)\) lies on the line \(\tilde {\bf w}{\bf x}+\tilde b = \pm1\); then the geometric margin is \(\gamma = \frac{|\tilde {\bf w}{\bf x}_0+\tilde b|}{\|\tilde {\bf w}\|} = \frac{1}{\|\tilde {\bf w}\|}\), and the functional margin is \(\hat\gamma = 1\).
Therefore, Eq. (2) is equivalent to \[\begin{eqnarray} &&\max_{{\bf w},b}\frac{1}{\|{\bf w}\|}\\ && s.t.\quad y_i({\bf w}{\bf x}_i+ b)\geq 1,i=1,2,\ldots,n. \end{eqnarray}\] Note that \(\max\frac{1}{\|{\bf w}\|}\iff\min\frac{1}{2}\|{\bf w}\|^2\). Therefore, the SVM for a linearly separable dataset is the solution of the following constrained convex optimization problem: \[\begin{eqnarray} &&\min_{{\bf w},b}\frac{1}{2}\|{\bf w}\|^2\\ && s.t.\quad y_i({\bf w}{\bf x}_i+ b) - 1\geq 0,i=1,2,\ldots,n, \end{eqnarray}\] which is a quadratic convex optimization problem.
The complete SVM algorithm can be described as follows:
Input: Linearly separable training dataset \(\mathscr{T} = \{({\bf x}_1,y_1),({\bf x}_2,y_2),\ldots, ({\bf x}_n,y_n)\}\), where \({\bf x}_i = (x_i^{(1)}, x_i^{(2)},\ldots, x_i^{(d)})^{{\top}}\in\mathbb{R}^d, y_i\in\mathscr{Y} = \{+1,-1\}, i = 1,2,\ldots,n\).
Output: The separating hyperplane \({\bf w}{\bf x}+ b = 0\) and the classification decision function.
- Find the solution \(({\bf w}^*,b^*)\) of the following constrained optimization problem:
\[\begin{eqnarray} \begin{cases} \min_{{\bf w},b}\frac{1}{2}\|{\bf w}\|^2\\ s.t.\quad y_i({\bf w}{\bf x}_i+ b) - 1\geq 0,i=1,2,\ldots,n. \end{cases} \end{eqnarray}\]
- The separating hyperplane is \[ {\bf w}^*{\bf x}+ b^* = 0, \] and the classification decision function is \[ f({\bf x}) = \text{sign}({\bf w}^*{\bf x}+ b^*). \]
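The quadratic program above can be handed to a generic convex solver. The following sketch uses the third-party cvxpy package (an assumption; any QP solver would do) on the made-up toy data from earlier to find \(({\bf w}^*,b^*)\):

```python
import numpy as np
import cvxpy as cp

# The same made-up linearly separable toy data as before.
X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0], [0.5, 2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n, d = X.shape

w = cp.Variable(d)
b = cp.Variable()

# min (1/2)||w||^2  s.t.  y_i (w . x_i + b) >= 1 for all i
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

w_star, b_star = w.value, b.value
print("w* =", w_star, "b* =", b_star)
print("decision function:", np.sign(X @ w_star + b_star))  # should reproduce the labels y
```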
Existence and Uniqueness of SVM Hyperplane
Proof of Existence: Note that the linear separability of \(\mathscr{T}\) implies that there exists a feasible point \((\tilde {\bf w}, \tilde b)\), so the optimization problem is equivalent to
\[\begin{eqnarray} \begin{cases} \min_{{\bf w},b}\frac{1}{2}\|{\bf w}\|^2 \\ s.t. \quad y_i({\bf w}{\bf x}_i+b)-1\geq 0, i = 1,2,\ldots,n,\\ \|{\bf w}\|\leq\|\tilde{\bf w}\|. \end{cases} \end{eqnarray}\] Therefore, the feasible region of this optimization problem is a nonempty, bounded, closed set. Since any continuous function attains its minimum on a nonempty, bounded, closed set, the solution to the optimization problem exists. Also, because both positive and negative instances are present in \(\mathscr{T}\), no point of the form \(({\bf w},b) = (0,b)\) can be feasible, which implies the solution must satisfy \({\bf w}^*\neq0.\quad\Box\)
Proof of Uniqueness:
- Before showing the uniqueness, we prove the following two statements first:
There exists \(j\in\{i:y_i = 1\}\), such that \({\bf w}^*{\bf x}_j + b^*=1\);
There exists \(j\in\{i:y_i = -1\}\), such that \({\bf w}^*{\bf x}_j + b^*=-1\);
We prove the first statement by contradiction; the second one is analogous. Suppose the first statement is not true, then
\[{\bf w}^*{\bf x}_j + b^*>1\quad \text{for all } j\in\{i:y_i = 1\}.\] Since \(({\bf w}^*, b^*)\) satisfies the constraints, \[{\bf w}^*{\bf x}_j + b^*\leq -1\quad \text{for all } j\in\{i:y_i = -1\}.\] For any \(\alpha\in(0,1)\), let \(\tilde{\bf w}= \alpha{\bf w}^*, \tilde b = (b^*+1)\alpha - 1\). Then for any \(j\in\{i:y_i = -1\}\), \[\tilde{\bf w}{\bf x}_j + \tilde b = \alpha{\bf w}^*{\bf x}_j + (b^*+1)\alpha - 1 = \alpha({\bf w}^*{\bf x}_j + b^*)+\alpha -1\leq-\alpha+\alpha-1=-1.\] For any \(j\in\{i:y_i = 1\}\), \[\lim_{\alpha\rightarrow 1^-}(\tilde{\bf w}{\bf x}_j + \tilde b) = \lim_{\alpha\rightarrow 1^-}[\alpha({\bf w}^*{\bf x}_j + b^*)+\alpha -1] = {\bf w}^*{\bf x}_j + b^*> 1.\] Since there are only finitely many such \(j\), we can choose a suitable \(\alpha\in(0,1)\) such that \[\tilde{\bf w}{\bf x}_j + \tilde b>1\quad \text{for all } j\in\{i:y_i = 1\}.\] This implies that \((\tilde{\bf w},\tilde b)\) is a feasible point. However, \(\|\tilde{\bf w}\|^2 = \alpha^2\|{\bf w}^*\|^2<\|{\bf w}^*\|^2\) (recall \({\bf w}^*\neq0\)), which contradicts the optimality of \(({\bf w}^*,b^*)\).\(\quad\Box\)
- Now let’s show the uniqueness of SVM hyperplane.
Suppose we have optimal solutions \(({\bf w}^*_1,b^*_1)\) and \(({\bf w}^*_2,b^*_2)\).
First we show that \({\bf w}^*_1 = {\bf w}^*_2\). Since both solutions are optimal, we must have \(\|{\bf w}^*_1\| = \|{\bf w}^*_2\|=c>0\) (recall that \({\bf w}^*\neq0\)). Let \({\bf w}= ({\bf w}^*_1+{\bf w}^*_2)/2\) and \(b = (b^*_1+b^*_2)/2\); since the constraints are linear in \(({\bf w},b)\), it is easy to verify that \(({\bf w},b)\) is a feasible point. Then, because \(c\) is the minimal norm over feasible points, \[c\leq\|{\bf w}\| = \frac{1}{2}\|{\bf w}_1^*+{\bf w}^*_2\|\leq\frac{1}{2}(\|{\bf w}_1^*\|+\|{\bf w}^*_2\|)=c,\] so equality holds throughout. Hence, \(c=\|{\bf w}\|=\frac{1}{2}\sqrt{\|{\bf w}^*_1\|^2 + \|{\bf w}^*_2\|^2 + 2\|{\bf w}^*_1\|\|{\bf w}^*_2\|\cos\theta} = \frac{1}{2}\sqrt{2c^2(\cos\theta+1)}\), where \(\theta\) is the angle between \({\bf w}^*_1\) and \({\bf w}^*_2\). Since \(c>0\), this forces \(\cos\theta=1\), i.e., \(\theta = 0\) (note that \(\theta = \pi\) would give \({\bf w}= 0\), which is impossible because \(({\bf w},b)\) is a feasible point). Together with \(\|{\bf w}^*_1\| = \|{\bf w}^*_2\|\), this implies \({\bf w}^*_1 = {\bf w}^*_2\).
Finally, we show that \(b^*_1 = b^*_2\). Write \({\bf w}^*\) for the common vector \({\bf w}^*_1 = {\bf w}^*_2\). By the first statement proved above, applied to each solution, there exist positive instances \({\bf x}_i\) and \({\bf x}_j\) (i.e., \(y_i = y_j = 1\)) such that
\[ {\bf w}^*{\bf x}_i + b^*_1 = 1, \quad {\bf w}^*{\bf x}_j + b^*_2 = 1, \] while feasibility of both solutions gives \[ {\bf w}^*{\bf x}_j + b^*_1 \geq 1, \quad {\bf w}^*{\bf x}_i + b^*_2 \geq 1. \] Combining these, \(b^*_1 \geq 1-{\bf w}^*{\bf x}_j = b^*_2\) and \(b^*_2 \geq 1-{\bf w}^*{\bf x}_i = b^*_1\), so \(b^*_1 = b^*_2.\quad\blacksquare\)
Support Vectors
When the training dataset \(\mathscr{T}\) is linearly separable, the instances \({\bf x}_i\) that satisfy
\[{\bf w}^*{\bf x}_i+b^*=\pm1,\] i.e., those for which the constraint \(y_i({\bf w}^*{\bf x}_i+b^*)\geq 1\) holds with equality, are called support vectors.
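In practice one rarely solves this QP by hand. As a sketch, scikit-learn's SVC with a linear kernel and a very large C (so that the soft margin behaves almost like the hard margin) recovers the hyperplane and exposes the support vectors directly; the toy data is again the made-up set from earlier:

```python
import numpy as np
from sklearn.svm import SVC

# The same made-up linearly separable toy data as before.
X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0], [0.5, 2.0]])
y = np.array([1, 1, -1, -1])

# A huge C makes the soft-margin SVM behave (almost) like the hard-margin SVM.
clf = SVC(kernel="linear", C=1e10).fit(X, y)

w_star, b_star = clf.coef_[0], clf.intercept_[0]
print("support vectors:\n", clf.support_vectors_)
# For the support vectors, w* . x_i + b* should be close to +1 or -1.
print(clf.support_vectors_ @ w_star + b_star)
```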
Dual Problem
We regard the optimization problem of the linearly separable SVM as the primal optimization problem: \[\begin{eqnarray} &&\min_{{\bf w},b}\frac{1}{2}\|{\bf w}\|^2 \\ &&s.t. \quad y_i({\bf w}{\bf x}_i+b)-1\geq 0, i = 1,2,\ldots,n. \end{eqnarray}\]
We construct the Lagrange function
\[L({\bf w},b,{\boldsymbol \alpha}) = \frac{1}{2}\|{\bf w}\|^2 - \sum_{i=1}^n\alpha_i[y_i({\bf w}{\bf x}_i+b)-1],\] where \(\alpha_i\geq 0\) and \({\boldsymbol \alpha}= (\alpha_1,\ldots,\alpha_n)^{\top}\) is the Lagrange multiplier vector.
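As a brief sketch of where this leads (the standard derivation, not worked out in detail here): the dual is obtained by first minimizing \(L\) over \(({\bf w},b)\). Setting the corresponding partial derivatives to zero gives
\[\begin{eqnarray} \nabla_{{\bf w}}L = {\bf w}- \sum_{i=1}^n\alpha_iy_i{\bf x}_i = 0 &\Longrightarrow& {\bf w}= \sum_{i=1}^n\alpha_iy_i{\bf x}_i,\\ \frac{\partial L}{\partial b} = -\sum_{i=1}^n\alpha_iy_i = 0 &\Longrightarrow& \sum_{i=1}^n\alpha_iy_i = 0. \end{eqnarray}\] Substituting these back into \(L\) yields the dual problem, a maximization over \({\boldsymbol \alpha}\) alone: \[\begin{eqnarray} &&\max_{{\boldsymbol \alpha}}\sum_{i=1}^n\alpha_i - \frac{1}{2}\sum_{i=1}^n\sum_{j=1}^n\alpha_i\alpha_jy_iy_j({\bf x}_i\cdot{\bf x}_j)\\ && s.t.\quad \sum_{i=1}^n\alpha_iy_i = 0,\quad \alpha_i\geq 0, i=1,2,\ldots,n. \end{eqnarray}\]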
This SVM note is mainly based on the course Stat 950: Machine Learning-Theory and Application taught by Dr. Weixing Song in Spring 2020, and the book Python 大战机器学习. I'd like to thank Dr. Song and the author of that book.