
SVMs

Support Vector Machines

Historically, Support Vector Machines (SVMs) were motivated by the notion of the maximum margin separating hyperplane. Before we explore this motivation, which is a bit of a MacGuffin, let's relate SVMs to the other linear classification algorithms. We have seen that Logistic Regression and Boosting each minimize a convex upper bound on the 0/1 loss for binary classification: the logistic loss in one case and the exponential loss in the other. Support Vector Machines minimize the hinge loss, $\max(0, 1 - y f(x))$, which is also a convex upper bound on the 0/1 loss.


A few things to note about the hinge loss.

  • It's convex like the exponential and logistic loss: optimization is fairly easy
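  • It grows only linearly as $y f(x)$ decreases below 1: the penalty for misclassification is milder than under the exponential loss
  • It is exactly zero for $y f(x) \ge 1$: once an example is classified correctly and confidently enough, there is no bonus for driving $y f(x)$ higher
  • It's non-differentiable at the hinge point $y f(x) = 1$: plain gradient descent does not apply directly, though subgradient methods do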
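The properties listed above can be checked numerically. The following is a minimal sketch, assuming each loss is written as a function of the margin value m = y f(x); the function names are illustrative, and the logistic loss is taken in base 2 (an assumption made here so that it equals 1 at m = 0 and therefore upper-bounds the 0/1 loss).

```python
import math

def zero_one_loss(m):
    """0/1 loss as a function of the margin m = y * f(x)."""
    return 1.0 if m <= 0 else 0.0

def hinge_loss(m):
    """Hinge loss: zero once m >= 1, linear below that."""
    return max(0.0, 1.0 - m)

def exp_loss(m):
    """Exponential loss (the Boosting surrogate)."""
    return math.exp(-m)

def logistic_loss(m):
    """Logistic loss in base 2, so that its value at m = 0 is exactly 1."""
    return math.log2(1.0 + math.exp(-m))

margins = [-2.0, -1.0, 0.0, 0.5, 1.0, 2.0]
for m in margins:
    print(m, zero_one_loss(m), hinge_loss(m), exp_loss(m), logistic_loss(m))
```

On these sample margins, all three surrogates sit at or above the 0/1 loss, the hinge loss is exactly zero from m = 1 onward, and at m = -2 the hinge penalty (3) is well below the exponential penalty.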

Why all the excitement about this loss? First, it has a nice geometric interpretation (at least in the linearly separable case) as leading to a maximum margin hyperplane separating positive from negative examples. Second, more importantly, it fits particularly well with kernels because many examples will not be a part of the solution, leading to a much more efficient dual representation and algorithms.

Hyperplanes and Margins

Consider the case of linearly separable data. In the toy 2D example below, there are many hyperplanes (lines) that separate the negative examples (blue) from the positive ones (green). If we define the margin of a hyperplane as the distance from the decision boundary to the closest training point, the maximum margin hyperplane is the one shown below.


We will denote our linear classifier as (note the constant term b is separated out)

$$h(x) = \mathrm{sign}(w^\top x + b)$$

Since rescaling the classifier by a positive constant $\gamma$ does not change the decision boundary, $\mathrm{sign}(w^\top x + b) = \mathrm{sign}(\gamma (w^\top x + b))$, we will set the scale so that at the points closest to the boundary, $w^\top x + b = \pm 1$, as shown in the figure below.


Now we will compute the margin, which is the distance from the boundary to the points closest to it, using one such point from each class, $x_1$ and $x_2$.

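$$w^\top x_1 + b = -1 \quad \text{and} \quad w^\top x_2 + b = 1$$

Hence,

$$w^\top (x_2 - x_1) = 2 \quad \Rightarrow \quad \frac{w^\top}{2\|w\|_2}(x_2 - x_1) = \frac{1}{\|w\|_2}$$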

Note that $\frac{w^\top}{2\|w\|_2}(x_2 - x_1)$ is precisely the distance to the hyperplane: the vector $\frac{1}{2}(x_2 - x_1)$ projected onto the unit vector $\frac{w}{\|w\|_2}$. Let's restate this more precisely. A rescaled hyperplane classifier satisfies:

$$y_i(w^\top x_i + b) \ge 1, \quad i = 1, \ldots, n$$

(with equality for the closest points) and has margin $\frac{1}{\|w\|_2}$.

To find the maximal margin hyperplane, we can simply maximize $\frac{1}{\|w\|_2}$, which is equivalent to minimizing $\frac{1}{2} w^\top w$ subject to the (scaled) correct classification constraints. The SVM optimization problem (in the separable case) is:

$$\textbf{Separable SVM primal:} \quad \min_{w,b} \; \frac{1}{2} w^\top w \quad \text{s.t.} \quad y_i(w^\top x_i + b) \ge 1, \quad i = 1, \ldots, n$$

Note on notation: I will use min/max instead of inf/sup since the continuous nature of the optimization will be obvious from context. This optimization problem has a quadratic objective and linear inequality constraints. Note that only the points closest to the boundary participate in defining the SVM hyperplane. These points are called the support vectors, and we will see that the solution can be expressed using only these points.
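The rescaling argument and the margin formula can be checked on a small example. This is a minimal sketch, assuming made-up toy data and a hand-picked separating hyperplane (not the maximum margin one); the helper names `rescale_to_canonical` and `margin` are illustrative.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def rescale_to_canonical(w, b, X, y):
    """Scale (w, b) by a positive constant so that min_i y_i (w.x_i + b) = 1."""
    smallest = min(yi * (dot(w, xi) + b) for xi, yi in zip(X, y))
    if smallest <= 0:
        raise ValueError("hyperplane does not separate the data")
    return [wk / smallest for wk in w], b / smallest

def margin(w):
    """Margin of a canonically scaled hyperplane: 1 / ||w||_2."""
    return 1.0 / math.sqrt(dot(w, w))

# Toy data (assumed for illustration): two positives, two negatives.
X = [[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, 0.0]]
y = [1, 1, -1, -1]

# Hand-picked separating hyperplane x1 + x2 - 1 = 0, before rescaling.
w, b = rescale_to_canonical([1.0, 1.0], -1.0, X, y)
print("rescaled w, b:", w, b)
print("margin:", margin(w))
```

After rescaling, every constraint $y_i(w^\top x_i + b) \ge 1$ holds with equality at the closest points, and the reported margin matches the geometric distance from those points to the line.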

The Dual View

In order to get more intuition about the problem, we can look at it from the dual perspective. Lagrangian duality provides the right tool here. We formulate the Lagrangian by introducing a non-negative multiplier $\alpha_i$ (the Lagrange multiplier, which we called $\lambda$ before) for each inequality constraint:

$$L(w,b,\alpha) = \frac{1}{2} w^\top w + \sum_i \alpha_i \left(1 - y_i(w^\top x_i + b)\right)$$

Note that if $w, b$ are feasible (i.e., satisfy all the constraints), then:

$$\max_{\alpha \ge 0} L(w,b,\alpha) = \frac{1}{2} w^\top w$$

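since all the coefficients $(1 - y_i(w^\top x_i + b))$ are negative or zero, so the maximum is attained by setting $\alpha_i = 0$ wherever the coefficient is strictly negative. When $w, b$ are not feasible, $\max_{\alpha \ge 0} L(w,b,\alpha) = \infty$. Assuming that feasible $w, b$ exist, $\min_{w,b} \left( \max_{\alpha \ge 0} L(w,b,\alpha) \right)$ solves the original problem. Since the original problem is convex, with a quadratic objective and linear constraints (and assuming it is feasible), we can change the order of min/max to obtain a dual problem that has the same optimal value: $\max_{\alpha \ge 0} \left( \min_{w,b} L(w,b,\alpha) \right)$. Taking derivatives with respect to $w$ and $b$ and setting them to zero, we get: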

$$\frac{\partial L}{\partial b} = 0 \quad \Rightarrow \quad \sum_i \alpha_i y_i = 0$$

and

$$\frac{\partial L}{\partial w} = 0 \quad \Rightarrow \quad w = \sum_i \alpha_i y_i x_i$$

Note the dual representation of $w$: it is a linear combination of the examples, weighted by the dual variables $\alpha_i$. Plugging in the expression for $w$, we get:

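$$\begin{aligned}
L(b,\alpha) &= \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j x_i^\top x_j + \sum_i \alpha_i \Big(1 - y_i \Big(\sum_j \alpha_j y_j x_j^\top x_i + b\Big)\Big) \\
&= -\frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j x_j^\top x_i + \sum_i \alpha_i - \sum_i \alpha_i y_i b
\end{aligned}$$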

Since $\sum_i \alpha_i y_i = 0$, the dual problem is:

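$$\textbf{Separable SVM dual:} \quad \max_{\alpha \ge 0} \; \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j x_i^\top x_j \quad \text{s.t.} \quad \sum_i \alpha_i y_i = 0$$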

This dual optimization problem, like the primal, has a quadratic objective and linear constraints (non-negativity of each $\alpha_i$ plus a single equality). The KKT optimality conditions include complementary slackness, $\alpha_i \left(1 - y_i(w^\top x_i + b)\right) = 0$, so when $x_i$ is not among the points closest to the boundary, $\alpha_i$ is zero. So only the points that support the hyperplane, called support vectors, participate in the solution. Given a solution $\alpha$ to the problem, we can recover the primal via $w = \sum_i \alpha_i y_i x_i$. To recover $b$, note that for any support vector $i$, where $\alpha_i > 0$, we have $y_i(w^\top x_i + b) = 1$, so, using $y_i^2 = 1$,

$$b = y_i - w^\top x_i, \quad \text{for any } i \text{ with } \alpha_i > 0.$$
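A two-point problem makes the dual concrete, because it can be solved by hand. This is a minimal sketch, assuming a made-up dataset with one positive point at (1, 1) and one negative point at (-1, -1). The equality constraint forces both multipliers to share a value a, the dual objective reduces to 2a - 4a^2, and its maximizer is a = 1/4. The helper names are illustrative.

```python
def dot(u, v):
    return sum(p * q for p, q in zip(u, v))

def dual_objective(alpha, X, y):
    """sum_i alpha_i - 1/2 sum_{i,j} alpha_i alpha_j y_i y_j x_i.x_j"""
    n = len(alpha)
    quad = sum(alpha[i] * alpha[j] * y[i] * y[j] * dot(X[i], X[j])
               for i in range(n) for j in range(n))
    return sum(alpha) - 0.5 * quad

def recover_primal(alpha, X, y):
    """w = sum_i alpha_i y_i x_i, and b = y_i - w.x_i for a support vector i."""
    d = len(X[0])
    w = [sum(alpha[i] * y[i] * X[i][k] for i in range(len(X))) for k in range(d)]
    sv = next(i for i, a in enumerate(alpha) if a > 0)
    b = y[sv] - dot(w, X[sv])
    return w, b

# Toy data (assumed for illustration).
X = [[1.0, 1.0], [-1.0, -1.0]]
y = [1, -1]
alpha = [0.25, 0.25]  # hand-derived maximizer of 2a - 4a^2

w, b = recover_primal(alpha, X, y)
primal_value = 0.5 * dot(w, w)
dual_value = dual_objective(alpha, X, y)
print("w =", w, "b =", b)
print("primal =", primal_value, "dual =", dual_value)
```

Both objective values come out equal, as strong duality predicts, and both points satisfy the margin constraint with equality, so both are support vectors.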

Kernels

We can use the kernel trick again, by assuming a feature map $\phi(x)$ such that $k(x_i, x_j) = \phi(x_i)^\top \phi(x_j)$. The kernelized dual optimization problem is:

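$$\textbf{Kernelized separable dual:} \quad \max_{\alpha \ge 0} \; \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j k(x_i, x_j) \quad \text{s.t.} \quad \sum_i \alpha_i y_i = 0$$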

Given a solution α to the problem, we can recover

$$b = y_i - \sum_j \alpha_j y_j k(x_j, x_i), \quad \text{for any } i \text{ with } \alpha_i > 0$$

and the prediction:

$$w^\top \phi(x) + b = \sum_i \alpha_i y_i k(x_i, x) + b$$
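The kernelized prediction never forms $w$ explicitly; it only needs kernel evaluations against the support vectors. This is a minimal sketch, assuming an RBF kernel with an arbitrarily chosen width and a made-up two-point dataset. For two points with opposite labels the dual can again be solved by hand: both multipliers equal a, the objective is 2a - a^2 (1 - k12) where k12 is the kernel value between the two points, and the maximizer is a = 1 / (1 - k12). All function names are illustrative.

```python
import math

def rbf_kernel(u, v, gamma=0.5):
    """k(u, v) = exp(-gamma * ||u - v||^2); gamma = 0.5 is an arbitrary choice."""
    return math.exp(-gamma * sum((p - q) ** 2 for p, q in zip(u, v)))

def bias_from_support_vector(alpha, X, y, kernel):
    """b = y_i - sum_j alpha_j y_j k(x_j, x_i) for some i with alpha_i > 0."""
    i = next(idx for idx, a in enumerate(alpha) if a > 0)
    return y[i] - sum(alpha[j] * y[j] * kernel(X[j], X[i]) for j in range(len(X)))

def decision_value(x, alpha, X, y, b, kernel):
    """sum_i alpha_i y_i k(x_i, x) + b, skipping points with alpha_i = 0."""
    return sum(a * yi * kernel(xi, x)
               for a, yi, xi in zip(alpha, y, X) if a > 0) + b

# Toy data (assumed for illustration).
X = [[1.0, 1.0], [-1.0, -1.0]]
y = [1, -1]

k12 = rbf_kernel(X[0], X[1])
a = 1.0 / (1.0 - k12)          # hand-derived dual solution for this toy case
alpha = [a, a]

b = bias_from_support_vector(alpha, X, y, rbf_kernel)
for xi, yi in zip(X, y):
    print(xi, yi, decision_value(xi, alpha, X, y, b, rbf_kernel))
print("new point:", decision_value([0.5, 2.0], alpha, X, y, b, rbf_kernel))
```

Both training points land exactly on the margin (decision values of +1 and -1), which is what the recovery formula for $b$ relies on.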

Non-separable case

When the data is not separable, the above optimization will fail: no $w, b$ can satisfy all the constraints. Consider the case in the figure below, where the yellow point in the center of the blue ones cannot be separated from the blue points and grouped with the other yellow points by any linear separator.


In general, if the data is non-separable, then for every choice of $w$ and $b$ there is at least one data point, say the $r$th point, for which the $y_r(w^\top x_r + b)$ term fails to reach the threshold value 1. This means the $\max_{\alpha} L(w,b,\alpha)$ optimization can achieve value $\infty$ by setting $\alpha_r = \infty$. The resulting classifier $h(x) = \mathrm{sign}(\alpha_r y_r x_r^\top x)$ will then depend only on comparison to the $r$th point. In practice, this means we'll be classifying all new data based on an outlier from the training set. This will likely give catastrophically low accuracy.

Although we might perhaps like to minimize the number of misclassified points, this is an NP-hard problem. Instead, to handle the case of points on the wrong side of the fence, SVMs introduce slack variables. These variables allow all the constraints to be satisfied, so that no $\alpha_i$ gets set to $\infty$. A corresponding slack variable penalty term is added to the objective, such that the more the slack variables are relied upon, the worse the value of the objective.

Specifically, SVMs add a slack variable $\xi_i$ for each example so that the margin constraints become $y_i(w^\top x_i + b) \ge 1 - \xi_i$. For $\xi_i > 0$ this means we allow points to be within the margin (or even on the wrong side of the linear separator for large enough $\xi_i$). The $\xi_i$ are then added as a penalty term in the objective, weighted by the positive slack penalty constant $C$:

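$$\textbf{Hinge primal:} \quad \min_{w,b,\xi \ge 0} \; \frac{1}{2} w^\top w + C \sum_i \xi_i \quad \text{s.t.} \quad y_i(w^\top x_i + b) \ge 1 - \xi_i, \quad i = 1, \ldots, n$$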

The reason we call the primal for the non-separable case the hinge primal is that, at the optimum, the value of $\xi_i$ can be written:

$$\xi_i = \max\left(0, 1 - y_i(w^\top x_i + b)\right)$$

which is exactly the form of the hinge loss (see the figure at the top of these lecture notes for a reminder of what the hinge function looks like). That is, using these $\xi_i$ we are penalizing margin violations, including misclassifications, linearly.
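To see what the slack values look like in practice, the formula for $\xi_i$ can be evaluated directly for a fixed classifier. This is a minimal sketch, assuming made-up 2D data and a hand-picked $w, b$ (not an optimized solution); the names `slacks` and `describe` are illustrative.

```python
def dot(u, v):
    return sum(p * q for p, q in zip(u, v))

def slacks(w, b, X, y):
    """xi_i = max(0, 1 - y_i (w.x_i + b)) for every example."""
    return [max(0.0, 1.0 - yi * (dot(w, xi) + b)) for xi, yi in zip(X, y)]

def describe(slack):
    """Rough reading of a slack value relative to the margin."""
    if slack == 0.0:
        return "on or outside the margin"
    if slack < 1.0:
        return "inside the margin, correct side"
    return "on or beyond the decision boundary"

# Toy data (assumed for illustration); the last point is a mislabeled outlier.
X = [[2.0, 0.0], [0.5, 0.0], [-2.0, 0.0], [1.0, 0.0]]
y = [1, 1, -1, -1]
w, b = [1.0, 0.0], 0.0   # hand-picked classifier

xi = slacks(w, b, X, y)
for point, label, s in zip(X, y, xi):
    print(point, label, s, describe(s))
```

Only the second and fourth points contribute to the penalty term $C \sum_i \xi_i$: the second sits inside the margin, and the fourth is on the wrong side of the separator.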

Taking the dual, we get:

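$$\textbf{Hinge dual:} \quad \max_{\alpha \ge 0} \; \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j x_i^\top x_j \quad \text{s.t.} \quad \sum_i \alpha_i y_i = 0, \quad \alpha_i \le C, \quad i = 1, \ldots, n$$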

Notice that this is exactly the same as the separable dual, but with the added constraints $\alpha_i \le C$. So, adding the slack variables $\xi_i$ in the primal amounts to putting a cap on $\alpha_i$ in the dual. Effectively, the cap limits the extent to which any one point can influence the final classifier, $h(x) = \mathrm{sign}\left(\sum_i \alpha_i y_i x_i^\top x + b\right)$.

Here's a short derivation of the dual. In addition to the $\alpha_i$, we introduce a non-negative Lagrange multiplier $\lambda_i$ for each $\xi_i \ge 0$ constraint.

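$$\begin{aligned}
L(w,b,\xi,\alpha,\lambda) &= \frac{1}{2} w^\top w + C \sum_i \xi_i + \sum_i \alpha_i \left(1 - \xi_i - y_i(w^\top x_i + b)\right) - \sum_i \lambda_i \xi_i \\
\frac{\partial L}{\partial b} = 0 \quad &\Rightarrow \quad \sum_i \alpha_i y_i = 0 \\
\frac{\partial L}{\partial w} = 0 \quad &\Rightarrow \quad w = \sum_i \alpha_i y_i x_i \\
\frac{\partial L}{\partial \xi_i} = 0 \quad &\Rightarrow \quad C - \alpha_i - \lambda_i = 0
\end{aligned}$$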

Plugging in the expression for $w$ and using the condition $\sum_i \alpha_i y_i = 0$ as before, we have

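$$L(\xi,\alpha,\lambda) = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j x_i^\top x_j + \sum_i \xi_i (C - \alpha_i - \lambda_i)$$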

Now using the condition $C - \alpha_i - \lambda_i = 0$, the last sum vanishes, so we are left with the dual:

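$$\max_{\alpha \ge 0, \, \lambda \ge 0} \; \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j x_i^\top x_j \quad \text{s.t.} \quad \sum_i \alpha_i y_i = 0, \quad C - \alpha_i - \lambda_i = 0$$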

We can get rid of the $\lambda_i$ by noting that they only appear in the constraint $C - \alpha_i - \lambda_i = 0$, and since they must be non-negative, we can just replace the constraint with $C \ge \alpha_i$, as in the dual stated above. That is, the original constraint tells us $C - \alpha_i = \lambda_i$, and the only restriction on $\lambda_i$ is that it must be non-negative, so its entire functionality can be captured by swapping in the constraint that $C - \alpha_i$ must be non-negative.

The kernelized version of the dual, analogous to the separable SVM case, is just:

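$$\textbf{Hinge kernelized dual:} \quad \max_{\alpha \ge 0} \; \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j k(x_i, x_j) \quad \text{s.t.} \quad \sum_i \alpha_i y_i = 0, \quad \alpha_i \le C, \quad i = 1, \ldots, n$$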

where $k(x_i, x_j)$ is substituted for $x_i^\top x_j$.

Given a solution α to the problem, we can recover

$$b = y_i - \sum_j \alpha_j y_j k(x_j, x_i), \quad \text{for any } i \text{ with } C > \alpha_i > 0$$

and the prediction:

$$w^\top \phi(x) + b = \sum_i \alpha_i y_i k(x_i, x) + b$$

Solving SVMs

We will not spend too much time on how to optimize quadratic programs with linear constraints. There is a very large literature on the topic of general convex optimization and SVM optimization in particular, and many off-the-shelf efficient tools have been developed for both. In recent years, simple algorithms (e.g., SMO, subgradient descent) that find approximate answers to the SVM problem have been shown to work quite well. Here is an applet for kernel SVMs in 2D.
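As an illustration of how simple an approximate solver can be, here is a minimal sketch of batch subgradient descent on the hinge primal, written with the slack variables eliminated: $\frac{1}{2} w^\top w + C \sum_i \max(0, 1 - y_i(w^\top x_i + b))$. The decaying step-size schedule, the epoch count, and the toy data are all assumptions chosen for this example, not tuned or recommended settings, and this is not SMO.

```python
def dot(u, v):
    return sum(p * q for p, q in zip(u, v))

def train_svm_subgradient(X, y, C=1.0, epochs=2000, step0=0.1):
    """Batch subgradient descent on 1/2 w.w + C * sum_i hinge(y_i (w.x_i + b))."""
    d = len(X[0])
    w = [0.0] * d
    b = 0.0
    for t in range(1, epochs + 1):
        step = step0 / (1.0 + 0.01 * t)   # assumed decaying schedule
        grad_w = list(w)                  # gradient of the 1/2 w.w term
        grad_b = 0.0
        for xi, yi in zip(X, y):
            # Only examples violating the margin contribute a hinge subgradient.
            if yi * (dot(w, xi) + b) < 1.0:
                for k in range(d):
                    grad_w[k] -= C * yi * xi[k]
                grad_b -= C * yi
        w = [wk - step * gk for wk, gk in zip(w, grad_w)]
        b -= step * grad_b
    return w, b

# Toy separable data (assumed for illustration).
X = [[2.0, 2.0], [3.0, 1.0], [-2.0, -2.0], [-1.0, -3.0]]
y = [1, 1, -1, -1]

w, b = train_svm_subgradient(X, y)
predictions = [1 if dot(w, xi) + b > 0 else -1 for xi in X]
print("w =", w, "b =", b)
print("predictions:", predictions)
```

On this data the maximum margin solution is w = (0.25, 0.25), b = 0, and the iterates hover near it rather than converging exactly, which is the sense in which such methods find approximate answers.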

SVM Notation for different Objective Functions

As seen above, we have expressed the hinge primal as:

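$$\textbf{Hinge primal:} \quad \min_{w,b,\xi \ge 0} \; \frac{1}{2} w^\top w + C \sum_i \xi_i \quad \text{s.t.} \quad y_i(w^\top x_i + b) \ge 1 - \xi_i, \quad i = 1, \ldots, n$$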

To generalize this slightly, we can write:

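$$\textbf{Hinge primal:} \quad \min_{w,b,\xi \ge 0} \; \frac{1}{2} \|w\|_p^p + C \|\xi\|_q^q \quad \text{s.t.} \quad y_i(w^\top x_i + b) \ge 1 - \xi_i, \quad i = 1, \ldots, n$$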

Here, $\|w\|_p^p$ is the $p$-norm of $w$ raised to the $p$th power, and likewise for $\|\xi\|_q^q$. We often describe this as the objective for the $L_p$-regularized, $L_q$-loss SVM. We can think of the $w$ term as the regularization term (penalizes large model weights) and the $\xi$ term as the loss term (penalizes large losses). The original hinge primal we derived is therefore the $L_2$-regularized, $L_1$-loss SVM.
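The generalized objective is easy to evaluate for a fixed classifier, which makes the effect of p and q visible. This is a minimal sketch, assuming the slack variables take their hinge values $\max(0, 1 - y_i(w^\top x_i + b))$ and using made-up data with a hand-picked $w, b$; the function name `svm_objective` is illustrative.

```python
def dot(u, v):
    return sum(a * c for a, c in zip(u, v))

def svm_objective(w, b, X, y, C=1.0, p=2, q=1):
    """1/2 ||w||_p^p + C ||xi||_q^q with xi_i = max(0, 1 - y_i (w.x_i + b))."""
    regularizer = 0.5 * sum(abs(wk) ** p for wk in w)
    xi = [max(0.0, 1.0 - yi * (dot(w, x_i) + b)) for x_i, yi in zip(X, y)]
    loss = C * sum(s ** q for s in xi)
    return regularizer + loss

# Toy data and classifier (assumed for illustration).
X = [[2.0, 0.0], [0.5, 0.0], [-2.0, 0.0], [1.0, 0.0]]
y = [1, 1, -1, -1]
w, b = [1.0, 0.5], 0.0

l2_l1 = svm_objective(w, b, X, y, p=2, q=1)   # the original hinge primal
l2_l2 = svm_objective(w, b, X, y, p=2, q=2)   # squared hinge loss
l1_l1 = svm_objective(w, b, X, y, p=1, q=1)   # L1 regularization
print(l2_l1, l2_l2, l1_l1)
```

With q = 2 the misclassified point (slack 2) dominates the loss, while the point just inside the margin (slack 0.5) counts for less than it does with q = 1.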
