Caleb

Stats Ph.D.

Support Vector Machine (SVM): theory and application in R and Python

Caleb / 2020-04-05


Introduction

Support vector machine (SVM) is a supervised learning model with associated learning algorithms that analyze data used for classification and regression analysis [1]. The basic SVM is a binary classifier, and it can handle non-linear classification when kernel functions are introduced. Broadly, there are three kinds of SVMs, each solving a different problem: the linearly separable (hard-margin) SVM, the linear (soft-margin) SVM, and the non-linear (kernel) SVM.
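As a quick, hedged illustration of these two uses, here is a minimal scikit-learn sketch; the library choice and the synthetic dataset are assumptions for illustration, not part of the original note:

```python
# Minimal sketch: binary SVM classification with a linear and an RBF kernel.
# scikit-learn and the synthetic "moons" data are assumptions for illustration.
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.15, random_state=0)  # two classes, non-linear boundary

linear_svm = SVC(kernel="linear").fit(X, y)  # the basic binary (linear) classifier
rbf_svm = SVC(kernel="rbf").fit(X, y)        # kernel function for non-linear classification

print("linear kernel accuracy:", linear_svm.score(X, y))
print("RBF kernel accuracy:   ", rbf_svm.score(X, y))
```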

Linearly separable SVM

Given a training dataset in the feature space $T=\{(x_1,y_1),(x_2,y_2),\dots,(x_n,y_n)\}$, where $x_i=(x_i^{(1)},x_i^{(2)},\dots,x_i^{(d)})\in\mathbb{R}^d$, $y_i\in\mathcal{Y}=\{+1,-1\}$, $i=1,2,\dots,n$. Here $x_i$ is the $i$th instance and $y_i$ is the class label of $x_i$. The instance $x_i$ is called a positive (negative) instance when $y_i=+1$ ($y_i=-1$).

The target of the linearly separable SVM is to find a hyperplane that separates all positive instances to one side and all negative instances to the other side. The hyperplane is expressed by the following equation:
$$w\cdot x+b=0,\tag{1}$$
where $w$ is a weight vector and $b$ is an intercept. Eq. (1) can be denoted by $(w,b)$, which is what we aim to find.

When the training dataset is linearly separable, the separating hyperplane is not unique. Looking at Fig. 1, for example, many solid lines (hyperplanes) can be obtained by shifting between the two dashed lines or by tilting slightly, and they all still separate the green and blue points completely. Therefore, the linearly separable SVM imposes a maximum-margin constraint to obtain a unique hyperplane.


Figure 1: Linearly separable SVM [2]
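To reproduce a picture in the spirit of Fig. 1, here is a minimal matplotlib sketch; the toy data, the chosen hyperplane $w=(1,1)$, $b=-4$, and the $\pm 1$ offset lines are all illustrative assumptions:

```python
# Sketch of a linearly separable toy dataset with one separating hyperplane
# w . x + b = 0 (solid) and its offset lines w . x + b = +/-1 (dashed).
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
pos = rng.normal(loc=[3.0, 3.0], scale=0.4, size=(20, 2))  # positive instances
neg = rng.normal(loc=[1.0, 1.0], scale=0.4, size=(20, 2))  # negative instances

w, b = np.array([1.0, 1.0]), -4.0          # one separating hyperplane (assumed)
xs = np.linspace(-0.5, 4.5, 100)

plt.scatter(pos[:, 0], pos[:, 1], c="tab:blue", label="y = +1")
plt.scatter(neg[:, 0], neg[:, 1], c="tab:green", label="y = -1")
for offset, style in [(0, "-"), (1, "--"), (-1, "--")]:
    # solve w[0]*x + w[1]*y + b = offset for y
    plt.plot(xs, (offset - b - w[0] * xs) / w[1], "k" + style)
plt.legend()
plt.show()
```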

Functional Margin

  • Relative distance. For a hyperplane $w\cdot x+b=0$, the relative distance (not the geometric distance) from a point $x_i$ to this hyperplane is $|w\cdot x_i+b|$.

    • When $w\cdot x_i+b>0$, the instance $x_i$ is above the hyperplane and is classified as positive. If $y_i=+1$, then the classification is correct. See the blue points in Fig. 1.

    • When $w\cdot x_i+b<0$, the instance $x_i$ is below the hyperplane and is classified as negative. If $y_i=-1$, then the classification is correct. See the green points in Fig. 1.

    • Consistency of the sign of $w\cdot x_i+b$ with the sign of the class label $y_i$ tells us whether the classification is correct.

  • Functional margin. The functional margin of a sample point $(x_i,y_i)$ with respect to a hyperplane $(w,b)$ is defined as $\hat{\gamma}_i(w,b)=y_i(w\cdot x_i+b)$, $i=1,2,\dots,n$. The functional margin of the training dataset $T$ with respect to a hyperplane $(w,b)$ is defined as $\hat{\gamma}(w,b)=\min_i \hat{\gamma}_i(w,b)$.

  • Reliability. For a linearly separable SVM, the distance from an instance to the hyperplane indicates the reliability of the classification prediction: the farther an instance is from the hyperplane, the more reliable the classification, and vice versa.

  • Shortcoming of the functional margin. When $(w,b)$ is scaled proportionally, for example to $(10w,10b)$, the hyperplane $10w\cdot x+10b=0$ stays the same, but the functional margin is multiplied by $10$ (see the sketch below). Thus, we introduce the geometric margin, which yields a unique maximum-margin hyperplane (the solid line in Fig. 1).
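A minimal NumPy sketch of the functional margin and its scaling problem (the toy data and the hyperplane $(w,b)$ are made up for illustration):

```python
# Functional margins and the scaling shortcoming: (10w, 10b) describes the same
# hyperplane as (w, b) but multiplies every functional margin by 10.
import numpy as np

X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0], [0.0, 2.0]])  # toy instances
y = np.array([1, 1, -1, -1])                                     # class labels

def functional_margin(w, b, X, y):
    """Per-instance functional margins y_i * (w . x_i + b)."""
    return y * (X @ w + b)

w, b = np.array([1.0, 1.0]), -5.0               # a hyperplane separating the toy data
print(functional_margin(w, b, X, y))            # [1. 2. 3. 3.]
print(functional_margin(10 * w, 10 * b, X, y))  # [10. 20. 30. 30.] -- same hyperplane
```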

Geometric Margin

The geometric distance between $x_i$ and the hyperplane $w\cdot x+b=0$ is given by

$$\gamma_i=\frac{|w\cdot x_i+b|}{\|w\|}.$$ Similarly, we will choose $(w,b)$ to make $w\cdot x+b<0$ for negative instances and $>0$ for positive instances. Therefore, when a sample point $(x_i,y_i)$ is correctly classified, its geometric distance from the hyperplane is $\gamma_i=\frac{y_i(w\cdot x_i+b)}{\|w\|}$. Thus, the geometric margin between $(x_i,y_i)$ and a hyperplane $(w,b)$ is defined as $$\gamma_i(w,b)=\frac{y_i(w\cdot x_i+b)}{\|w\|},\quad i=1,2,\dots,n,$$ and the geometric margin between a training dataset $T$ and a hyperplane $(w,b)$ is defined as

$$\gamma(w,b)=\min_{x_i\in T}\gamma_i(w,b).$$ Note that the geometric margin is scale invariant and is a function of $(w,b)$.
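The scale invariance is easy to check numerically; this sketch reuses the made-up toy data and hyperplane from the functional-margin example above:

```python
# Geometric margin of a dataset: min_i y_i (w . x_i + b) / ||w||.
# Unlike the functional margin, it does not change when (w, b) is rescaled.
import numpy as np

X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0], [0.0, 2.0]])
y = np.array([1, 1, -1, -1])

def geometric_margin(w, b, X, y):
    return np.min(y * (X @ w + b)) / np.linalg.norm(w)

w, b = np.array([1.0, 1.0]), -5.0
print(geometric_margin(w, b, X, y))            # ~0.7071
print(geometric_margin(10 * w, 10 * b, X, y))  # same value: scale invariant
```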

Linearly Separable SVM as An Optimization Problem

The target of SVM is to find the hyperplane with the maximum geometric margin that correctly separates the training dataset. The maximum geometric margin is also called the maximum hard margin. This goal can be written as a constrained optimization problem:

$$\max_{w,b}\ \gamma\qquad\text{s.t.}\quad \frac{y_i(w\cdot x_i+b)}{\|w\|}\ge\gamma,\ i=1,2,\dots,n.\tag{2}$$ The inequality in Eq. (2) holds because $\frac{y_i(w\cdot x_i+b)}{\|w\|}=\gamma_i(w,b)\ge\min_{x_i\in T}\gamma_i(w,b)=\gamma$.

Using the connection between the functional margin and the geometric margin, the above problem can be converted to $$\max_{w,b}\ \frac{\hat{\gamma}}{\|w\|}\qquad\text{s.t.}\quad y_i(w\cdot x_i+b)\ge\gamma\|w\|,\ i=1,2,\dots,n.$$ Note that for all $i$, $y_i(w\cdot x_i+b)\ge\gamma\|w\| \iff y_i\left(\frac{w}{\gamma\|w\|}\cdot x_i+\frac{b}{\gamma\|w\|}\right)\ge 1$. Reparameterizing, $\frac{w}{\gamma\|w\|}\to\tilde{w}$ and $\frac{b}{\gamma\|w\|}\to\tilde{b}$, the new hyperplane is $(\tilde{w},\tilde{b})$. Suppose a point $(x_0,y_0)$ lies on one of the lines $\tilde{w}\cdot x+\tilde{b}=\pm 1$; then the geometric margin is $\gamma=\frac{|\tilde{w}\cdot x_0+\tilde{b}|}{\|\tilde{w}\|}=\frac{1}{\|\tilde{w}\|}$, and the functional margin is $\hat{\gamma}=1$.

Therefore, Eq. (2) is equivalent to $$\max_{w,b}\ \frac{1}{\|w\|}\qquad\text{s.t.}\quad y_i(w\cdot x_i+b)\ge 1,\ i=1,2,\dots,n.$$ Note that $\max\frac{1}{\|w\|}\iff\min\frac{1}{2}\|w\|^2$. Therefore, the SVM for a linearly separable dataset is the solution of the following constrained convex optimization problem: $$\min_{w,b}\ \frac{1}{2}\|w\|^2\qquad\text{s.t.}\quad y_i(w\cdot x_i+b)\ge 1,\ i=1,2,\dots,n,$$ which is a convex quadratic optimization problem.

The complete SVM algorithm can be described as follows:

  • Input: a linearly separable training dataset $T=\{(x_1,y_1),(x_2,y_2),\dots,(x_n,y_n)\}$, where $x_i=(x_i^{(1)},x_i^{(2)},\dots,x_i^{(d)})\in\mathbb{R}^d$, $y_i\in\mathcal{Y}=\{+1,-1\}$, $i=1,2,\dots,n$.

  • Output: the separating hyperplane $w\cdot x+b=0$ and the classification decision function.

  1. Find the solution $(w,b)$ of the following constrained optimization problem:

$$\min_{w,b}\ \frac{1}{2}\|w\|^2\qquad\text{s.t.}\quad y_i(w\cdot x_i+b)-1\ge 0,\ i=1,2,\dots,n.$$

  2. The separating hyperplane is $w\cdot x+b=0$, and the classification decision function is $f(x)=\operatorname{sign}(w\cdot x+b)$.
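Below is a minimal sketch of this two-step algorithm using the cvxpy package as a generic QP solver; cvxpy and the toy data are assumptions for illustration and are not part of the original algorithm statement:

```python
# Step 1: solve min (1/2)||w||^2  s.t.  y_i (w . x_i + b) - 1 >= 0.
# Step 2: build the decision function f(x) = sign(w . x + b).
import numpy as np
import cvxpy as cp

X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0], [0.0, 2.0]])  # toy, separable
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

w_hat, b_hat = w.value, b.value
f = lambda x: np.sign(x @ w_hat + b_hat)   # classification decision function
print(w_hat, b_hat, f(X))
```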

Existence and Uniqueness of SVM Hyperplane

Theorem 1 (Existence and Uniqueness of the SVM Hyperplane). If the training dataset $T$ is linearly separable, then the SVM hyperplane exists and is unique.

Proof of Existence: Note that the linear separability of $T$ implies there exists a feasible point $(\tilde{w},\tilde{b})$, so the optimization problem is equivalent to

$$\min_{w,b}\ \frac{1}{2}\|w\|^2\qquad\text{s.t.}\quad y_i(w\cdot x_i+b)-1\ge 0,\ i=1,2,\dots,n,\ \text{and}\ \|w\|\le\|\tilde{w}\|.$$ Therefore, the feasible region of this optimization problem is a nonempty, bounded, closed set. Since any continuous function attains its minimum on a nonempty, bounded, closed set, the solution to the optimization problem exists. Also, because both positive and negative instances are present in $T$, no point of the form $(w,b)=(0,b)$ can be feasible, which implies the solution must satisfy $w\ne 0$.

Proof of Uniqueness:

  1. Before showing the uniqueness, we first prove the following two statements:

  • There exists $j\in\{i:y_i=+1\}$ such that $w\cdot x_j+b=1$;

  • There exists $j\in\{i:y_i=-1\}$ such that $w\cdot x_j+b=-1$.

We use contradiction. Suppose the first statement is not true; then

$$w\cdot x_j+b>1\quad\text{for } j\in\{i:y_i=+1\}.$$ Since $(w,b)$ satisfies the constraints, $$w\cdot x_j+b\le -1\quad\text{for } j\in\{i:y_i=-1\}.$$ For any $\alpha\in(0,1)$, let $\tilde{w}=\alpha w$ and $\tilde{b}=(b+1)\alpha-1$. Then for any $j\in\{i:y_i=-1\}$, $$\tilde{w}\cdot x_j+\tilde{b}=\alpha w\cdot x_j+(b+1)\alpha-1=\alpha(w\cdot x_j+b)+\alpha-1\le-\alpha+\alpha-1=-1.$$ For any $j\in\{i:y_i=+1\}$, $$\lim_{\alpha\to 1^-}(\tilde{w}\cdot x_j+\tilde{b})=\lim_{\alpha\to 1^-}\left[\alpha(w\cdot x_j+b)+\alpha-1\right]=w\cdot x_j+b>1.$$ So we can choose a suitable $\alpha\in(0,1)$ such that $$\tilde{w}\cdot x_j+\tilde{b}>1\quad\text{for } j\in\{i:y_i=+1\}.$$ This implies that $(\tilde{w},\tilde{b})$ is a feasible point. However, $\|\tilde{w}\|^2=\alpha^2\|w\|^2<\|w\|^2$, which implies $(w,b)$ is not a solution, a contradiction. The second statement follows from a symmetric argument.

  2. Now let's show the uniqueness of the SVM hyperplane.

Suppose we have two optimal solutions $(w_1,b_1)$ and $(w_2,b_2)$.

First we show that $w_1=w_2$. Since both solutions attain the same optimal objective value, we must have $\|w_1\|=\|w_2\|=c>0$. Let $w=\frac{w_1+w_2}{2}$ and $b=\frac{b_1+b_2}{2}$. It is easy to verify that $(w,b)$ is a feasible point, so, by the optimality of $(w_1,b_1)$, $$c\le\|w\|=\frac{1}{2}\|w_1+w_2\|\le\frac{1}{2}\left(\|w_1\|+\|w_2\|\right)=c,$$ and equality holds throughout. Hence $$c=\|w\|=\frac{1}{2}\sqrt{\|w_1\|^2+\|w_2\|^2+2\|w_1\|\|w_2\|\cos\theta}=\frac{1}{2}\sqrt{2c^2(\cos\theta+1)},$$ where $\theta$ is the angle between $w_1$ and $w_2$. This forces $\cos\theta=1$, i.e., $\theta=0$; note that $\theta=\pi$ would give $w=0$, which is impossible because $(w,b)$ is a feasible point. Since $\|w_1\|=\|w_2\|$ and $\theta=0$, we conclude $w_1=w_2$.

Finally, we show that $b_1=b_2$. Write $w=w_1=w_2$. By the two statements above applied to the solution $(w,b_1)$, there exist $i,j$ such that $y_i=+1$ and $y_j=-1$ with

$$w\cdot x_i+b_1=1,\qquad w\cdot x_j+b_1=-1,$$ while the feasibility of $(w,b_2)$ gives $$w\cdot x_i+b_2\ge 1,\qquad w\cdot x_j+b_2\le -1.$$ The first pair implies $b_2\ge b_1$ and the second pair implies $b_2\le b_1$. This implies $b_1=b_2$.

Support Vectors

When the training dataset $T$ is linearly separable, the points $x_i$ which satisfy

$$w\cdot x_i+b\pm 1=0$$ are called support vectors.
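As a hedged illustration with scikit-learn (an assumption, as is the toy data): a linear SVC with a very large $C$ approximates the hard-margin SVM, and its `support_vectors_` attribute should coincide, up to numerical tolerance, with the points satisfying $w\cdot x_i+b=\pm 1$:

```python
# Identify support vectors: the training points with w . x_i + b = +1 or -1.
import numpy as np
from sklearn.svm import SVC

X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0], [0.0, 2.0]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C approximates the hard margin
w, b = clf.coef_[0], clf.intercept_[0]

print(clf.support_vectors_)                   # the support vectors
print(np.round(X @ w + b, 3))                 # w . x_i + b; support vectors sit at +/-1
```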

Dual Problem

We regard the optimization problem of the linearly separable SVM as the primal optimization problem: $$\min_{w,b}\ \frac{1}{2}\|w\|^2\qquad\text{s.t.}\quad y_i(w\cdot x_i+b)-1\ge 0,\ i=1,2,\dots,n.$$

We construct the Lagrange function

$$L(w,b,\alpha)=\frac{1}{2}\|w\|^2-\sum_{i=1}^{n}\alpha_i\left[y_i(w\cdot x_i+b)-1\right],$$ where $\alpha_i\ge 0$ and $\alpha=(\alpha_1,\dots,\alpha_n)$ is the Lagrange multiplier vector.
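As a sketch of where this construction leads (the remainder of the dual derivation is not included in this excerpt), the standard next step minimizes $L$ over $(w,b)$ by setting the corresponding gradients to zero:

$$\nabla_w L(w,b,\alpha)=w-\sum_{i=1}^{n}\alpha_i y_i x_i=0\ \Longrightarrow\ w=\sum_{i=1}^{n}\alpha_i y_i x_i,\qquad \frac{\partial L(w,b,\alpha)}{\partial b}=-\sum_{i=1}^{n}\alpha_i y_i=0.$$

Substituting these conditions back into $L$ yields the dual problem in the multipliers $\alpha$ alone.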

This SVM note is mainly based on the course Stat 950: Machine Learning-Theory and Application taught by Dr. Weixing Song in Spring 2020, and on the book Python 大战机器学习. I'd like to thank Dr. Song and the author of that book.


  1. https://en.wikipedia.org/wiki/Support-vector_machine

  2. https://www.javatpoint.com/machine-learning-support-vector-machine-algorithm