Support Vector Machines
Carlos C. Rodríguez
https://omega0.xyz/omega8008/
Finding the Hyperplane with Largest Margin
Let us assume that we have $n$ labeled examples $(x_1,y_1),\ldots,(x_n,y_n)$ with labels $y_i \in \{1,-1\}$. We want to find the hyperplane $\langle w,x\rangle + b = 0$ (i.e. with parameters $(w,b)$) satisfying the following three conditions:
- The scale of $(w,b)$ is fixed so that the plane is in canonical position w.r.t. $\{x_1,\ldots,x_n\}$, i.e.,
\[ \min_{i \le n} |\langle w,x_i\rangle + b| = 1. \]
- The plane with parameters $(w,b)$ separates the $+1$'s from the $-1$'s, i.e.,
\[ y_i(\langle w,x_i\rangle + b) \ge 0 \quad\text{for all } i \le n. \]
- The plane has maximum margin $r = 1/|w|$, i.e., minimum $|w|^2$.
Of course there may not be a separating plane for the observed
data. Let us assume, for the time being, that the data is in fact
linearly separable and we'll take care of the general (more realistic)
case later.
Clearly 1 and 2 combine into just one condition:
\[ y_i(\langle w,x_i\rangle + b) \ge 1 \quad\text{for all } i \le n. \]
Thus, we want to solve the following optimization problem: minimize $\frac{1}{2}|w|^2$ over all $w \in \mathbb{R}^d$ and $b \in \mathbb{R}$, subject to
\[ y_i(\langle w,x_i\rangle + b) - 1 \ge 0 \quad\text{for all } i \le n. \]
This is a very simple quadratic programming problem. There are readily available algorithms of complexity $O(n^3)$ that can be used for solving this problem. For example, the so-called interior point algorithms that are variations of the Karmarkar algorithm for linear programming will do. But when $n$ and $d$ are large (tens of thousands), even the best QP methods will fail. A very desirable characteristic of SVMs is that most of the data ends up being irrelevant. The relevant data are only the points that end up exactly on the margin of the optimal classifier, and these are often a very small fraction of $n$.
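To make the primal problem concrete, here is a minimal sketch of how it could be handed to an off-the-shelf QP solver. The toy data, the choice of the cvxopt package, and the tiny ridge added to keep the quadratic term positive definite are illustrative assumptions, not part of these notes.

# Sketch: hard-margin primal SVM as a quadratic program (assumes cvxopt is installed).
# Variables z = (w_1, ..., w_d, b); minimize (1/2)|w|^2 subject to y_i(<w, x_i> + b) >= 1.
import numpy as np
from cvxopt import matrix, solvers

# Tiny linearly separable toy data (an assumption, not from the notes).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n, d = X.shape

# Quadratic term: (1/2) z^T P z with P = diag(1,...,1,0); a tiny ridge keeps P positive definite.
P = np.zeros((d + 1, d + 1))
P[:d, :d] = np.eye(d)
P[d, d] = 1e-8
q = np.zeros(d + 1)

# Constraints y_i(<w, x_i> + b) - 1 >= 0 rewritten in the solver's form G z <= h.
G = -np.column_stack([y[:, None] * X, y])
h = -np.ones(n)

solvers.options["show_progress"] = False
sol = solvers.qp(matrix(P), matrix(q), matrix(G), matrix(h))
z = np.array(sol["x"]).ravel()
w, b = z[:d], z[d]
print("w =", w, "b =", b, "margin =", 1.0 / np.linalg.norm(w))

The design mirrors the formulation above: the variable is $z=(w,b)$, the quadratic term penalizes only $w$, and each example contributes one linear inequality.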
KKT-Theory
The problem that we need to solve is a special case of the general problem of minimizing a convex function $f(x)$ subject to $n$ inequality constraints $g_j(x) \ge 0$ for $j=1,2,\ldots,n$, where the functions $g_j$ are also convex. Let's call this problem (CO). Notice that in our case $x=(w,b) \in \mathbb{R}^{d+1}$ and the constraints are linear in the unknowns $x$. Don't get confused with our previous $x_i$'s.
The characterization of the solution to the convex optimization (CO) problem is given by the so-called Karush-Kuhn-Tucker conditions.
- Theorem (KKT-Conditions)
- $\bar{x}$ solves the (CO) problem if and only if there exists $\bar{\lambda} = (\bar{\lambda}_1,\ldots,\bar{\lambda}_n) \ge 0$, a vector of non-negative Lagrange multipliers, so that $(\bar{x},\bar{\lambda})$ is a saddle point of the Lagrangean,
\[ L(x,\lambda) = f(x) - \sum_{j=1}^{n} \lambda_j\, g_j(x), \]
i.e., for all $x$ and for all $\lambda \ge 0$ we have,
\[ L(\bar{x},\lambda) \le L(\bar{x},\bar{\lambda}) \le L(x,\bar{\lambda}). \]
Before proving (half of) this theorem, notice that there is an easy-to-understand intuitive reason behind this result. Think of the term added (subtracted, actually) to $f(x)$ to form the Lagrangean $L$ as a penalty for an $x$ that violates the constraints. In fact, if $g_j(x) < 0$, the term $-\lambda_j g_j(x) > 0$ can be made arbitrarily large by increasing $\lambda_j$. Thus, the minimizer of $L(x,\lambda)$ over $x$ will have to make $g_j(x) \ge 0$. On the other hand, if $g_j(x) > 0$ then it is best to take $\lambda_j = 0$ to maximize $L(x,\lambda)$ as a function of $\lambda$. It is possible to show that the saddle point condition is equivalent to,
\[ \min_{x}\ \max_{\lambda \ge 0}\ L(x,\lambda) \;=\; L(\bar{x},\bar{\lambda}) \;=\; \max_{\lambda \ge 0}\ \min_{x}\ L(x,\lambda). \]
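A one-dimensional toy problem (an illustrative assumption, not from the notes) makes the saddle point concrete: minimize $f(x)=x^2$ subject to $g(x)=x-1 \ge 0$. The solution is $\bar{x}=1$ with $\bar{\lambda}=2$, and the sketch below checks the saddle inequalities numerically.

# Toy check of the saddle point condition for: min x^2 subject to x - 1 >= 0
# (the problem and all numbers are assumptions for illustration, not from the notes).
import numpy as np

f = lambda x: x**2
g = lambda x: x - 1.0
L = lambda x, lam: f(x) - lam * g(x)

x_bar, lam_bar = 1.0, 2.0           # analytic saddle point: x_bar = 1, lam_bar = 2

xs = np.linspace(-3, 3, 601)
lams = np.linspace(0, 10, 601)      # only lam >= 0 enters the saddle condition

# L(x_bar, lam) <= L(x_bar, lam_bar) <= L(x, lam_bar) for all x and all lam >= 0.
assert np.all(L(x_bar, lams) <= L(x_bar, lam_bar) + 1e-12)
assert np.all(L(x_bar, lam_bar) <= L(xs, lam_bar) + 1e-12)

# Complementarity and stationarity: lam_bar * g(x_bar) = 0 and f'(x_bar) = lam_bar * g'(x_bar).
print(lam_bar * g(x_bar), 2 * x_bar - lam_bar * 1.0)   # both print 0.0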
Proof: Let us show that the saddle point condition is in fact
sufficient for solving the (CO) problem. That it is also necessary
depends on Farkas's Lemma and it is much more difficult to prove.
We need to show that the saddle point condition implies,
- $g_j(\bar{x}) \ge 0$ for all $j \le n$,
and,
- $f(\bar{x}) \le f(x)$ for all $x$ that satisfies the constraints.
To show 1, suppose that the $i$th constraint is violated, i.e., $g_i(\bar{x}) < 0$. Then by taking $\lambda_j = \bar{\lambda}_j$ for $j \ne i$ and $\lambda_i$ sufficiently large, we get,
\[ L(\bar{x},\lambda) > L(\bar{x},\bar{\lambda}), \]
which contradicts the saddle point condition.
To show 2, take $\lambda = 0$ on the left hand side of the saddle point condition and take $x$ satisfying the constraints on the right. Then,
\[ f(\bar{x}) = L(\bar{x},0) \le L(\bar{x},\bar{\lambda}) \le L(x,\bar{\lambda}) \le f(x), \]
where the last inequality uses $\bar{\lambda} \ge 0$ and $g_j(x) \ge 0$. Which proves 2. $\blacksquare$
When the objective function $f(x)$ and the constraining functions $g_j(x)$ are all differentiable (they are infinitely differentiable in the case of SVMs), the condition for a saddle point is simply that at that point the tangent plane to the surface $z = L(x,\lambda)$ is parallel to the $(x,\lambda)$ plane. The saddle point of $L$ can be obtained by solving the system of equations,
\[ \nabla_x L(x,\lambda) = 0, \quad\text{i.e.,}\quad \nabla f(x) = \sum_{j} \lambda_j \nabla g_j(x), \]
\[ \lambda_j\, g_j(x) = 0 \ \text{ for all } j \le n, \quad\text{i.e.,}\quad \lambda_j\,\frac{\partial L(x,\lambda)}{\partial \lambda_j} = 0. \]
The second set of equations are known as complementarity conditions and are a consequence of the constraint that $\lambda \ge 0$.
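For a small two-dimensional check of these equations (again an illustrative assumption, not from the notes), take $f(x)=x_1^2+x_2^2$ with the single constraint $g(x)=x_1+x_2-2 \ge 0$; the minimizer is $\bar{x}=(1,1)$ with multiplier $\bar{\lambda}=2$.

# Numeric check of stationarity and complementarity on a 2-D toy problem
# (min x1^2 + x2^2 s.t. x1 + x2 - 2 >= 0 is an assumption for illustration).
import numpy as np

x_bar = np.array([1.0, 1.0])      # analytic minimizer
lam_bar = 2.0                     # its Lagrange multiplier

grad_f = 2 * x_bar                # gradient of f(x) = |x|^2
grad_g = np.array([1.0, 1.0])     # gradient of g(x) = x1 + x2 - 2

print(np.allclose(grad_f, lam_bar * grad_g))   # True: grad f = lam * grad g
print(lam_bar * (x_bar.sum() - 2.0))           # 0.0: complementarity lam * g(x_bar) = 0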
The Dual
The minmax = maxmin characterization of the saddle point of the Lagrangean $L$ provides an alternative way to find the solution of the (CO) problem. Instead of minimizing $f(x)$ subject to the $g_j(x) \ge 0$ constraints, one can equivalently maximize $W(\lambda)$, where
\[ W(\lambda) = \min_{x} L(x,\lambda), \]
subject to the constraint that $\lambda \ge 0$. This provides an alternative route to the same saddle point of $L$.
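For the one-dimensional toy problem used earlier, the dual route can be traced explicitly: $W(\lambda)=\min_x\,[x^2-\lambda(x-1)]$ is attained at $x=\lambda/2$, so $W(\lambda)=\lambda-\lambda^2/4$, and maximizing over $\lambda \ge 0$ recovers $\bar{\lambda}=2$ and the primal value $1$. A short numerical sketch (illustrative assumption, not from the notes):

# The dual route for the 1-D toy problem: max over lam >= 0 of W(lam) = min_x L(x, lam).
import numpy as np

lams = np.linspace(0.0, 10.0, 100001)
# For L(x, lam) = x^2 - lam*(x - 1), the inner minimum is at x = lam/2, so W(lam) = lam - lam^2/4.
W = lams - lams**2 / 4.0
k = np.argmax(W)
print(lams[k], W[k])   # approx 2.0 and 1.0: the same saddle value as the primal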
The Support Vectors of SVMs
Let us apply the KKT-conditions to our original problem of finding the
separating hyperplane with maximum margin. The Lagrangean in this case
is,
\[ L(w,b,\lambda) = \frac{1}{2}\sum_{i=1}^{d} w_i^2 \;-\; \sum_{j=1}^{n} \lambda_j\left\{ y_j(\langle w,x_j\rangle + b) - 1 \right\}, \]
and the KKT-conditions for optimality are,
\[ \nabla_w L = 0, \quad\text{i.e.,}\quad w = \sum_{j=1}^{n} \lambda_j y_j x_j, \]
\[ \frac{\partial L}{\partial b} = 0, \quad\text{i.e.,}\quad \sum_{j=1}^{n} \lambda_j y_j = 0, \]
\[ \lambda_j\left\{ y_j(\langle w,x_j\rangle + b) - 1 \right\} = 0 \quad\text{for all } j \le n. \]
These provide a complete characterization of the optimal plane. The normal $w$ must be a linear combination of the observed vectors $x_j$; that's the first set of equations. The coefficients $\lambda_j y_j$ of this linear combination must add up to $0$; that's the second equation. Finally, the complementarity conditions tell us that the only non-zero Lagrange multipliers $\lambda_j$ are those associated with the vectors $x_j$ right on the margin, i.e., such that,
\[ y_j(\langle w,x_j\rangle + b) = 1. \]
These are called support vectors and they are the only ones needed, since
\[ w = \sum_{j \in J_0} \lambda_j y_j x_j, \]
where $J_0 = \{ j : x_j \text{ is a s.v.} \}$. The support vectors are the observations $x_j$ at the exact distance $r = 1/|w|$ from the separating plane. The number of such vectors is usually much smaller than $n$, and that makes it possible to consider very large numbers of examples with $x_j$ having many coordinates.
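One way to see the support-vector expansion at work is to fit a linear SVM on toy data and check that $w$ is recovered from the support vectors alone. The sketch below uses scikit-learn's SVC with a very large C to approximate the hard-margin problem; the data, the value of C, and the use of scikit-learn are assumptions for illustration.

# Sketch: checking the support-vector expansion with scikit-learn's linear SVC
# (a large C approximates the hard-margin problem; the toy data is an assumption).
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)

# dual_coef_ holds y_j * lambda_j for the support vectors only.
w_from_sv = clf.dual_coef_ @ clf.support_vectors_
print(np.allclose(w_from_sv, clf.coef_))          # True: w is a l.c. of the support vectors
print(clf.support_)                                # the indices j in J_0
# Support vectors sit on the margin: y_j(<w, x_j> + b) = 1.
print(y[clf.support_] * clf.decision_function(X[clf.support_]))  # approx 1.0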
The Dual Problem for SVMs
The dual problem for SVMs turns out to be even simpler than the primal, and its formulation shows the way to a magnificent non-linear generalization. For a given vector $\lambda$ of Lagrange multipliers, the minimizer of $L(w,b,\lambda)$ w.r.t. $(w,b)$ must satisfy the optimality conditions obtained above, i.e., $w$ is a l.c. of the $x_j$'s with coefficients $\lambda_j y_j$ that must add up to zero. Hence, replacing these conditions into $L(w,b,\lambda)$ we obtain the dual formulation, where,
\[ W(\lambda) = \sum_{j=1}^{n} \lambda_j \;-\; \frac{1}{2}\sum_{i,j} \lambda_i\lambda_j\, y_i y_j \langle x_i, x_j\rangle \]
and the $\lambda \ge 0$ satisfying,
\[ \sum_{j=1}^{n} \lambda_j y_j = 0. \]
Maximizing $W(\lambda)$ over $\lambda \ge 0$ s.t. the above simple linear constraint is satisfied is the preferred form to feed a QP algorithm. Once an optimal $\lambda$ is obtained, we find $w$ as the l.c. of the $x_j$ as above, and we find $b$ by recalling that the plane must be in canonical position, so
\[ \min_{i \le n} y_i(\langle w,x_i\rangle + b) = 1 = y_j(\langle w,x_j\rangle + b) \quad\text{for all } j \in J_0 \]
and we get,
\[ b = y_j - \langle w,x_j\rangle \quad\text{for any } j \in J_0. \]
Multiplying through by $\lambda_j$ and adding over $j$ (using $\sum_j \lambda_j y_j = 0$ and $w = \sum_i \lambda_i y_i x_i$) we find,
\[ b = -\,\frac{\sum_{i,j} \lambda_i\lambda_j\, y_i \langle x_i, x_j\rangle}{\sum_{j} \lambda_j}, \]
and it can be readily checked that this value coincides with the value of the Lagrange multiplier $b$ associated with the constraint $\sum_i \lambda_i y_i = 0$ (just find $\nabla_\lambda L = 0$ for the Lagrangean associated with the dual, i.e. $L = W(\lambda) - b\sum_i \lambda_i y_i$). The optimal values of $b$ and of $\lambda$ are often returned by the modern QP solvers based on interior point algorithms.
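Putting the pieces together, here is a sketch of the dual fed to a QP solver in the form just described, with $w$ and $b$ recovered from the support vectors afterwards. As before, the toy data, the cvxopt solver, the small ridge on the quadratic term, and the threshold used to detect non-zero multipliers are illustrative assumptions, not part of these notes.

# Sketch: the SVM dual as a QP in cvxopt's standard form (toy data is an assumption).
# minimize (1/2) lam^T Q lam - 1^T lam  s.t.  lam >= 0 and sum_i y_i lam_i = 0,
# where Q_ij = y_i y_j <x_i, x_j>.
import numpy as np
from cvxopt import matrix, solvers

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)

Q = (y[:, None] * X) @ (y[:, None] * X).T      # Q_ij = y_i y_j <x_i, x_j>
P = matrix(Q + 1e-10 * np.eye(n))              # tiny ridge for numerical safety (assumption)
q = matrix(-np.ones(n))
G, h = matrix(-np.eye(n)), matrix(np.zeros(n)) # lam >= 0
A, b_eq = matrix(y[None, :]), matrix(0.0)      # sum_i y_i lam_i = 0

solvers.options["show_progress"] = False
sol = solvers.qp(P, q, G, h, A, b_eq)
lam = np.array(sol["x"]).ravel()

sv = lam > 1e-6                                # support vectors: strictly positive multipliers
w = (lam * y)[sv] @ X[sv]                      # w = sum_{j in J_0} lam_j y_j x_j
b = np.mean(y[sv] - X[sv] @ w)                 # canonical position: y_j(<w, x_j> + b) = 1
print("support vectors:", np.where(sv)[0], "w =", w, "b =", b)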