Support Vector Machine (SVM) - theory and application in R and Python
Caleb / 2020-04-05
Introduction
Support vector machine (SVM) is a supervised learning model with associated learning algorithms that analyze data for classification and regression analysis.1 The basic SVM is a binary classifier, and it can handle non-linear classification by introducing kernel functions. Basically, there are three kinds of SVMs, each solving a different problem:
- Linearly separable SVM (hard margin SVM): when the training data is linearly separable, the separating hyperplane with the largest margin and the corresponding classification decision function are called the linearly separable SVM.
- Linear SVM (soft margin SVM): when the training data is approximately linearly separable, a soft margin is maximized so that a few misclassified points are tolerated.
- Nonlinear SVM: when the training data is not linearly separable, kernel functions together with soft margin maximization are used to learn a nonlinear SVM.
Linearly separable SVM
Given a training dataset $T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$ in the feature space, where $x_i \in \mathcal{X} = \mathbb{R}^n$ and $y_i \in \mathcal{Y} = \{+1, -1\}$, $i = 1, 2, \ldots, N$. $x_i$ is the $i$th instance; $y_i$ is the class label of $x_i$. The instance $x_i$ is called a positive (negative) instance when $y_i = +1$ ($y_i = -1$).
The target of the linearly separable SVM is to find a hyperplane that separates all positive instances onto one side and all negative ones onto the other side. The hyperplane is expressed by the following equation:
$$w \cdot x + b = 0, \tag{1}$$
where $w$ is a normal vector and $b$ is an intercept. The Eq. (1) can be denoted by $(w, b)$, which is our goal to find.
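To make the notation concrete, here is a tiny sketch in Python with a hypothetical hyperplane $(w, b)$: the sign of $w \cdot x + b$ tells which side of the hyperplane a point $x$ falls on.

```python
import numpy as np

w = np.array([0.5, 0.5])   # hypothetical normal vector
b = -2.0                   # hypothetical intercept

def side(x):
    """Return sign(w . x + b): +1 above the hyperplane, -1 below it."""
    return np.sign(np.dot(w, x) + b)

print(side(np.array([3.0, 3.0])))   # +1, classified as positive
print(side(np.array([1.0, 1.0])))   # -1, classified as negative
```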
When the training dataset is linearly separable, such a hyperplane is not unique. Look at Fig. 1, for example: many solid lines (hyperplanes) can be obtained by shifting the solid line between the two dashed lines or by tilting it slightly, and they are still able to separate the green and blue points completely. Therefore, the linearly separable SVM imposes a maximum-margin constraint to find a unique hyperplane.
![linearly separable SVM^[https://www.javatpoint.com/machine-learning-support-vector-machine-algorithm]](https://live.staticflickr.com/65535/49740264526_9fef526f15.jpg)
Figure 1: linearly separable SVM2
Functional Margin
Relative distance. For a hyperplane $w \cdot x + b = 0$, the relative distance (not geometric distance) from a point $x_i$ to this hyperplane is $|w \cdot x_i + b|$.
When $w \cdot x_i + b > 0$, the instance $x_i$ is above the hyperplane and is classified as positive. If $y_i = +1$, then the classification is correct. See the blue points.
When $w \cdot x_i + b < 0$, the instance $x_i$ is below the hyperplane and is classified as negative. If $y_i = -1$, then the classification is correct. See the green points.
Consistency of the sign of $w \cdot x_i + b$ with the sign of the class label $y_i$ tells the correctness of the classification.
Functional Margin. The functional margin of a sample point $(x_i, y_i)$ and a hyperplane $(w, b)$ is defined as $\hat{\gamma}_i = y_i(w \cdot x_i + b)$. The functional margin of the training dataset $T$ and a hyperplane $(w, b)$ is defined as $\hat{\gamma} = \min_{i = 1, \ldots, N} \hat{\gamma}_i$.
Reliability. For a linearly separable SVM, the distance from an instance to the hyperplane indicates the reliability of the classification prediction: the farther an instance is from the hyperplane, the more reliable the classification is, and vice versa.
Shortcoming of the functional margin. When $(w, b)$ changes proportionally, for example to $(2w, 2b)$, the hyperplane stays the same while the functional margin doubles. Thus, we introduce the geometric margin, which yields a unique maximum-margin hyperplane (the solid line in Fig. 1).
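As a quick worked example with hypothetical numbers, take $w = (1, 1)^T$, $b = -3$ and the positive point $x_i = (3, 3)^T$, $y_i = +1$:
$$\hat{\gamma}_i = y_i(w \cdot x_i + b) = 1 \times (3 + 3 - 3) = 3, \qquad y_i\big((2w) \cdot x_i + 2b\big) = 1 \times (6 + 6 - 6) = 6,$$
so rescaling $(w, b)$ to $(2w, 2b)$ leaves the hyperplane unchanged but doubles the functional margin.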
Geometric Margin
The geometric distance between $x_i$ and the hyperplane $(w, b)$ is given by
$$\frac{|w \cdot x_i + b|}{\|w\|}.$$
Similarly, we will choose $(w, b)$ to make $w \cdot x_i + b < 0$ for negative instances, and $w \cdot x_i + b > 0$ for positive instances. Therefore, when a sample point $(x_i, y_i)$ is correctly classified, the geometric distance from the hyperplane is
$$\gamma_i = \frac{y_i(w \cdot x_i + b)}{\|w\|}.$$
Thus, the geometric margin between $(x_i, y_i)$ and a hyperplane $(w, b)$ is defined as $\gamma_i = \frac{y_i(w \cdot x_i + b)}{\|w\|}$, and the geometric margin between a training dataset $T$ and a hyperplane $(w, b)$ is defined as $\gamma = \min_{i = 1, \ldots, N} \gamma_i$.
Note that the geometric margin is scale invariant and it is a function of $(w, b)$: indeed, $\gamma = \hat{\gamma} / \|w\|$.
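The scale invariance is easy to check numerically. Below is a minimal sketch in Python (the three points and the hyperplane $(w, b)$ are hypothetical): rescaling $(w, b)$ changes the functional margin but not the geometric margin.

```python
import numpy as np

X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])   # toy instances
y = np.array([1.0, 1.0, -1.0])                        # class labels
w, b = np.array([1.0, 1.0]), -3.0                     # hypothetical hyperplane

def functional_margin(w, b):
    return np.min(y * (X @ w + b))                    # min_i y_i (w . x_i + b)

def geometric_margin(w, b):
    return functional_margin(w, b) / np.linalg.norm(w)

for scale in (1.0, 2.0, 10.0):
    print(scale,
          functional_margin(scale * w, scale * b),    # grows with the scale
          geometric_margin(scale * w, scale * b))     # stays the same
```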
Linearly Separable SVM as an Optimization Problem
The target of SVM is to find a hyperplane with the maximum geometric margin that correctly separates the training dataset. The maximum geometric margin is also called the maximum hard margin. This goal falls into the description of constrained optimization problems:
$$\max_{w, b}\ \gamma \quad \text{s.t.} \quad \frac{y_i(w \cdot x_i + b)}{\|w\|} \geq \gamma, \quad i = 1, 2, \ldots, N. \tag{2}$$
The inequality in Eq. (2) holds because $\gamma = \min_{i} \gamma_i \leq \gamma_i$ for every $i$.
According to the connection between the functional margin and the geometric margin, $\gamma = \hat{\gamma} / \|w\|$, the above problem can be converted to
$$\max_{w, b}\ \frac{\hat{\gamma}}{\|w\|} \quad \text{s.t.} \quad y_i(w \cdot x_i + b) \geq \hat{\gamma}, \quad i = 1, 2, \ldots, N.$$
Note that for all $\lambda > 0$, $(\lambda w, \lambda b)$ describes the same hyperplane while the functional margin becomes $\lambda \hat{\gamma}$, so the value of $\hat{\gamma}$ does not affect the problem. Reparameterizing, $w \leftarrow w / \hat{\gamma}$ and $b \leftarrow b / \hat{\gamma}$, so that the functional margin becomes $1$. Therefore, the new hyperplane is still $w \cdot x + b = 0$. Suppose a point $x_0$ is on the line $w \cdot x_0 + b = 1$; then the geometric margin is $1 / \|w\|$, and the functional margin is $1$.
Therefore, Eq. (2) is equivalent to
$$\max_{w, b}\ \frac{1}{\|w\|} \quad \text{s.t.} \quad y_i(w \cdot x_i + b) \geq 1, \quad i = 1, 2, \ldots, N.$$
Note that maximizing $\frac{1}{\|w\|}$ is equivalent to minimizing $\frac{1}{2}\|w\|^2$. Therefore, the SVM for a linearly separable data set is the solution of the following restricted convex optimization problem:
$$\min_{w, b}\ \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i(w \cdot x_i + b) - 1 \geq 0, \quad i = 1, 2, \ldots, N,$$
which is a quadratic convex optimization problem.
The complete SVM algorithm can be described as follows:
Input: Linearly separable training dataset $T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$, where $x_i \in \mathbb{R}^n$, $y_i \in \{+1, -1\}$, $i = 1, 2, \ldots, N$.
Output: The separating hyperplane and the classification decision function.
- Find the solution $(w^*, b^*)$ of the following restricted optimization problem:
$$\min_{w, b}\ \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i(w \cdot x_i + b) - 1 \geq 0, \quad i = 1, 2, \ldots, N.$$
- The separating hyperplane is $w^* \cdot x + b^* = 0$, and the classification decision function is $f(x) = \operatorname{sign}(w^* \cdot x + b^*)$.
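As an illustration, the primal problem above can be solved with a general-purpose solver. The sketch below uses Python with `scipy.optimize.minimize` (SLSQP) on a hypothetical three-point dataset; the data and variable names are assumptions for illustration only. For this toy data the optimum is approximately $w^* = (0.5, 0.5)^T$, $b^* = -2$.

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data (hypothetical): two positive points, one negative
X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])
y = np.array([1.0, 1.0, -1.0])

n = X.shape[1]               # dimension of w
theta0 = np.zeros(n + 1)     # optimization variable theta = (w, b)

def objective(theta):
    w = theta[:n]
    return 0.5 * np.dot(w, w)            # (1/2) ||w||^2

# One inequality constraint per training point: y_i (w . x_i + b) - 1 >= 0
constraints = [
    {"type": "ineq",
     "fun": lambda theta, xi=xi, yi=yi: yi * (np.dot(theta[:n], xi) + theta[n]) - 1.0}
    for xi, yi in zip(X, y)
]

res = minimize(objective, theta0, constraints=constraints, method="SLSQP")
w, b = res.x[:n], res.x[n]
print("w =", w, "b =", b)                 # separating hyperplane w . x + b = 0
print("decisions:", np.sign(X @ w + b))   # classification decision function
```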
Existence and Uniqueness of SVM Hyperplane
Proof of Existence: Note that the linear separability of $T$ implies there exists a feasible point $(\tilde{w}, \tilde{b})$, so the optimization problem is equivalent to
$$\min_{w, b}\ \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i(w \cdot x_i + b) - 1 \geq 0, \quad i = 1, 2, \ldots, N, \quad \|w\| \leq \|\tilde{w}\|.$$
Therefore, the feasible region of this optimization problem is a nonempty, bounded, and closed set (with $\|w\|$ bounded and both classes present, the constraints also bound $b$). Note that any continuous function achieves its minimum on a nonempty, bounded, and closed set, so the solution to the optimization problem exists. Also, because both positive and negative instances are present in $T$, any $(w, b)$ with $w = 0$ cannot be a feasible point, which implies the solution $(w^*, b^*)$ must satisfy $w^* \neq 0$.
Proof of Uniqueness:
- Before showing the uniqueness, we prove the following two statements first:
There exists a positive instance $x_j$ (i.e., $y_j = +1$), such that $w^* \cdot x_j + b^* = 1$;
There exists a negative instance $x_k$ (i.e., $y_k = -1$), such that $w^* \cdot x_k + b^* = -1$.
Use contradiction. Suppose the first statement is not true; then $w^* \cdot x_i + b^* > 1$ for every positive instance, so $\delta := \min_{i: y_i = +1}(w^* \cdot x_i + b^*) - 1 > 0$.
Since $(w^*, b^*)$ satisfies the constraints, for every negative instance we have $w^* \cdot x_i + b^* \leq -1$. For any $\alpha \in (0, 1)$, let $(w, b) = (\alpha w^*, \alpha b^* - (1 - \alpha))$. Then for any negative instance, $-(w \cdot x_i + b) = -\alpha(w^* \cdot x_i + b^*) + (1 - \alpha) \geq \alpha + (1 - \alpha) = 1$. For any positive instance, $w \cdot x_i + b = \alpha(w^* \cdot x_i + b^*) - (1 - \alpha) \geq \alpha(2 + \delta) - 1$. So, we can choose a suitable $\alpha \in \left[\frac{2}{2 + \delta}, 1\right)$ such that $w \cdot x_i + b \geq 1$ for all positive instances as well. This implies that $(w, b)$ is a feasible point. However, $\|w\| = \alpha\|w^*\| < \|w^*\|$ (recall $w^* \neq 0$), which implies $(w^*, b^*)$ is not a solution, a contradiction. The second statement can be shown by a symmetric argument.
- Now let's show the uniqueness of the SVM hyperplane.
Suppose we have two optimal solutions $(w_1^*, b_1^*)$ and $(w_2^*, b_2^*)$.
First we show that $w_1^* = w_2^*$. It is easy to verify that $\left(\frac{w_1^* + w_2^*}{2}, \frac{b_1^* + b_2^*}{2}\right)$ is a feasible point. Note that we must have $\|w_1^*\| = \|w_2^*\| = c$, where $c$ denotes the optimal value of $\|w\|$. Let $w = \frac{w_1^* + w_2^*}{2}$ and $b = \frac{b_1^* + b_2^*}{2}$. Then
$$c \leq \|w\| \leq \frac{1}{2}\|w_1^*\| + \frac{1}{2}\|w_2^*\| = c.$$
So equality holds. Hence, $|\cos\theta| = 1$, where $\theta$ is the angle between $w_1^*$ and $w_2^*$. Thus we have $w_1^* = w_2^*$ or $w_1^* = -w_2^*$. If $w_1^* = -w_2^*$, then $w = 0$, which is impossible because $(w, b)$ is a feasible point. Then we must have $w_1^* = w_2^*$, which implies both solutions share the same $w^* := w_1^* = w_2^*$.
Finally, we show that $b_1^* = b_2^*$. From the aforementioned statements, there exist positive instances $x_1'$ and $x_2'$ such that $w^* \cdot x_1' + b_1^* = 1$ and $w^* \cdot x_2' + b_2^* = 1$. Thus we have
$$w^* \cdot x_2' + b_1^* \geq 1 = w^* \cdot x_2' + b_2^*$$
and
$$w^* \cdot x_1' + b_2^* \geq 1 = w^* \cdot x_1' + b_1^*.$$
This implies $b_1^* \geq b_2^*$ and $b_2^* \geq b_1^*$, i.e., $b_1^* = b_2^*$.
Support Vectors
When the training data set is linearly separable, the points $(x_i, y_i)$ which satisfy
$$y_i(w^* \cdot x_i + b^*) - 1 = 0$$
are called support vectors.
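In other words, the support vectors are exactly the training points whose constraint is active at the optimum. As a small check in Python (assuming scikit-learn is installed, and reusing the hypothetical toy data from the sketch above), a linear `SVC` with a very large `C` approximates the hard-margin SVM and exposes the support vectors directly:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: two positive points, one negative
X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])
y = np.array([1, 1, -1])

clf = SVC(kernel="linear", C=1e6)   # very large C approximates the hard margin
clf.fit(X, y)

print("w =", clf.coef_[0], "b =", clf.intercept_[0])   # roughly (0.5, 0.5) and -2
print("support vectors:\n", clf.support_vectors_)      # here (1, 1) and (3, 3)
```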
Dual Problem
We regard the optimization problem of the linearly separable SVM as the primal optimization problem.
We construct the Lagrange function
$$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{N} \alpha_i y_i (w \cdot x_i + b) + \sum_{i=1}^{N} \alpha_i,$$
where $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_N)^T$ with $\alpha_i \geq 0$ is the Lagrange multiplier vector.
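For reference, the standard first step toward the dual problem is to set the partial derivatives of $L$ with respect to $w$ and $b$ to zero, which gives
$$\nabla_w L = w - \sum_{i=1}^{N} \alpha_i y_i x_i = 0 \;\Longrightarrow\; w = \sum_{i=1}^{N} \alpha_i y_i x_i, \qquad \frac{\partial L}{\partial b} = -\sum_{i=1}^{N} \alpha_i y_i = 0.$$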
This SVM note is mainly based on the course Stat 950: Machine Learning-Theory and Application taught by Dr. Weixing Song in Spring 2020, and a book called Python 大战机器学习. I'd like to thank Dr. Song and the author of that book.