Support Vector Machines
Carlos C. Rodríguez
https://omega0.xyz/omega8008/
Finding the Hyperplane with Largest Margin
Let us assume that we have $n$ labeled examples $(x_1,y_1),\ldots,(x_n,y_n)$ with labels $y_i \in \{1,-1\}$. We want to find the hyperplane $\langle w,x\rangle + b = 0$ (i.e. with parameters $(w,b)$) satisfying the following three conditions:
- The scale of $(w,b)$ is fixed so that the plane is in canonical position w.r.t. $\{x_1,\ldots,x_n\}$, i.e.,
\[ \min_{i \le n} |\langle w,x_i\rangle + b| = 1. \]
- The plane with parameters $(w,b)$ separates the $+1$'s from the $-1$'s, i.e.,
\[ y_i(\langle w,x_i\rangle + b) \ge 0 \quad\text{for all } i \le n. \]
- The plane has maximum margin $r = 1/|w|$, i.e., minimum $|w|^2$.
Of course there may not be a separating plane for the observed
data. Let us assume, for the time being, that the data is in fact
linearly separable and we'll take care of the general (more realistic)
case later.
Clearly 1 and 2 combine into just one condition:
\[ y_i(\langle w,x_i\rangle + b) \ge 1 \quad\text{for all } i \le n. \]
Thus, we want to solve the following optimization problem: minimize $\frac{1}{2}|w|^2$ over all $w \in \mathbb{R}^d$ and $b \in \mathbb{R}$, subject to
\[ y_i(\langle w,x_i\rangle + b) - 1 \ge 0 \quad\text{for all } i \le n. \]
This is a very simple quadratic programming problem. There are readily available algorithms of complexity $O(n^3)$ that can be used for solving this problem. For example, the so-called interior point algorithms that are variations of the Karmarkar algorithm for linear programming will do. But when $n$ and $d$ are large (tens of thousands), even the best QP methods will fail. A very desirable characteristic of SVMs is that most of the data ends up being irrelevant. The relevant data are only the points that end up exactly on the margin of the optimal classifier, and these are often a very small fraction of $n$.
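To make the primal problem concrete, here is a minimal sketch of how it could be handed to an off-the-shelf QP solver. The toy data, the choice of the cvxopt package, and the tiny ridge added to keep the quadratic term positive definite are illustrative assumptions, not part of these notes.

# Sketch: hard-margin primal SVM as a quadratic program (assumes cvxopt is installed).
# Variables z = (w_1, ..., w_d, b); minimize (1/2)|w|^2 subject to y_i(<w, x_i> + b) >= 1.
import numpy as np
from cvxopt import matrix, solvers

# Tiny linearly separable toy data (an assumption, not from the notes).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n, d = X.shape

# Quadratic term: (1/2) z^T P z with P = diag(1,...,1,0); a tiny ridge keeps P positive definite.
P = np.zeros((d + 1, d + 1))
P[:d, :d] = np.eye(d)
P[d, d] = 1e-8
q = np.zeros(d + 1)

# Constraints y_i(<w, x_i> + b) - 1 >= 0 rewritten in the solver's form G z <= h.
G = -np.column_stack([y[:, None] * X, y])
h = -np.ones(n)

solvers.options["show_progress"] = False
sol = solvers.qp(matrix(P), matrix(q), matrix(G), matrix(h))
z = np.array(sol["x"]).ravel()
w, b = z[:d], z[d]
print("w =", w, "b =", b, "margin =", 1.0 / np.linalg.norm(w))

The design mirrors the formulation above: the variable is $z=(w,b)$, the quadratic term penalizes only $w$, and each example contributes one linear inequality.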
KKT-Theory
The problem that we need to solve is a special case of the general problem of minimizing a convex function $f(x)$ subject to $n$ inequality constraints $g_j(x) \ge 0$ for $j=1,2,\ldots,n$, where the functions $g_j$ are also convex. Let's call this problem (CO). Notice that in our case $x=(w,b) \in \mathbb{R}^{d+1}$ and the constraints are linear in the unknowns $x$. Don't get confused with our previous $x_i$'s.
The characterization of the solution to the convex optimization (CO) problem is given by the so-called Karush-Kuhn-Tucker conditions.
- Theorem (KKT-Conditions)
- $\bar{x}$ solves the (CO) problem if and only if there exists $\bar{\lambda} = (\bar{\lambda}_1,\ldots,\bar{\lambda}_n) \ge 0$, a vector of non-negative Lagrange multipliers, so that $(\bar{x},\bar{\lambda})$ is a saddle point of the Lagrangean,
\[ L(x,\lambda) = f(x) - \sum_{j=1}^{n} \lambda_j\, g_j(x), \]
i.e., for all $x$ and for all $\lambda \ge 0$ we have,
\[ L(\bar{x},\lambda) \le L(\bar{x},\bar{\lambda}) \le L(x,\bar{\lambda}). \]
Before proving (half of) this theorem, notice that there is an easy-to-understand intuitive reason behind this result. Think of the term added (subtracted, actually) to $f(x)$ to form the Lagrangean $L$ as a penalty for an $x$ that violates the constraints. In fact, if $g_j(x) < 0$, the term $-\lambda_j g_j(x) > 0$ can be made arbitrarily large by increasing $\lambda_j$. Thus, the minimizer of $L(x,\lambda)$ over $x$ will have to make $g_j(x) \ge 0$. On the other hand, if $g_j(x) > 0$ then it is best to take $\lambda_j = 0$ to maximize $L(x,\lambda)$ as a function of $\lambda$. It is possible to show that the saddle point condition is equivalent to,
\[ \min_{x}\ \max_{\lambda \ge 0}\ L(x,\lambda) \;=\; L(\bar{x},\bar{\lambda}) \;=\; \max_{\lambda \ge 0}\ \min_{x}\ L(x,\lambda). \]
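A one-dimensional toy problem (an illustrative assumption, not from the notes) makes the saddle point concrete: minimize $f(x)=x^2$ subject to $g(x)=x-1 \ge 0$. The solution is $\bar{x}=1$ with $\bar{\lambda}=2$, and the sketch below checks the saddle inequalities numerically.

# Toy check of the saddle point condition for: min x^2 subject to x - 1 >= 0
# (the problem and all numbers are assumptions for illustration, not from the notes).
import numpy as np

f = lambda x: x**2
g = lambda x: x - 1.0
L = lambda x, lam: f(x) - lam * g(x)

x_bar, lam_bar = 1.0, 2.0           # analytic saddle point: x_bar = 1, lam_bar = 2

xs = np.linspace(-3, 3, 601)
lams = np.linspace(0, 10, 601)      # only lam >= 0 enters the saddle condition

# L(x_bar, lam) <= L(x_bar, lam_bar) <= L(x, lam_bar) for all x and all lam >= 0.
assert np.all(L(x_bar, lams) <= L(x_bar, lam_bar) + 1e-12)
assert np.all(L(x_bar, lam_bar) <= L(xs, lam_bar) + 1e-12)

# Complementarity and stationarity: lam_bar * g(x_bar) = 0 and f'(x_bar) = lam_bar * g'(x_bar).
print(lam_bar * g(x_bar), 2 * x_bar - lam_bar * 1.0)   # both print 0.0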
Proof: Let us show that the saddle point condition is in fact
sufficient for solving the (CO) problem. That it is also necessary
depends on Farkas's Lemma and it is much more difficult to prove.
We need to show that the saddle point condition implies,
- $g_j(\bar{x}) \ge 0$ for all $j \le n$,
and,
- $f(\bar{x}) \le f(x)$ for all $x$ that satisfies the constraints.
To show 1, suppose that the $i$th constraint is violated, i.e., $g_i(\bar{x}) < 0$. Then by taking $\lambda_j = \bar{\lambda}_j$ for $j \ne i$ and $\lambda_i$ sufficiently large, we get,
\[ L(\bar{x},\lambda) > L(\bar{x},\bar{\lambda}), \]
which contradicts the saddle point condition.
To show 2, take $\lambda = 0$ on the left hand side of the saddle point condition and take $x$ satisfying the constraints on the right. Then,
\[ f(\bar{x}) = L(\bar{x},0) \le L(\bar{x},\bar{\lambda}) \le L(x,\bar{\lambda}) \le f(x), \]
where the last inequality uses $\bar{\lambda} \ge 0$ and $g_j(x) \ge 0$. Which proves 2. $\blacksquare$
When the objective function $f(x)$ and the constraining functions $g_j(x)$ are all differentiable (they are infinitely differentiable in the case of SVMs), the condition for a saddle point is simply that at that point the tangent plane to the surface $z = L(x,\lambda)$ is parallel to the $(x,\lambda)$ plane. The saddle point of $L$ can be obtained by solving the system of equations,
\[ \nabla_x L(x,\lambda) = 0, \quad\text{i.e.,}\quad \nabla f(x) = \sum_{j} \lambda_j \nabla g_j(x), \]
\[ \lambda_j\, g_j(x) = 0 \ \text{ for all } j \le n, \quad\text{i.e.,}\quad \lambda_j\,\frac{\partial L(x,\lambda)}{\partial \lambda_j} = 0. \]
The second set of equations are known as complementarity conditions and are a consequence of the constraint that $\lambda \ge 0$.
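For a small two-dimensional check of these equations (again an illustrative assumption, not from the notes), take $f(x)=x_1^2+x_2^2$ with the single constraint $g(x)=x_1+x_2-2 \ge 0$; the minimizer is $\bar{x}=(1,1)$ with multiplier $\bar{\lambda}=2$.

# Numeric check of stationarity and complementarity on a 2-D toy problem
# (min x1^2 + x2^2 s.t. x1 + x2 - 2 >= 0 is an assumption for illustration).
import numpy as np

x_bar = np.array([1.0, 1.0])      # analytic minimizer
lam_bar = 2.0                     # its Lagrange multiplier

grad_f = 2 * x_bar                # gradient of f(x) = |x|^2
grad_g = np.array([1.0, 1.0])     # gradient of g(x) = x1 + x2 - 2

print(np.allclose(grad_f, lam_bar * grad_g))   # True: grad f = lam * grad g
print(lam_bar * (x_bar.sum() - 2.0))           # 0.0: complementarity lam * g(x_bar) = 0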
The Dual
The minmax = maxmin characterization of the saddle point of the Lagrangean $L$ provides an alternative way to find the solution of the (CO) problem. Instead of minimizing $f(x)$ subject to the $g_j(x) \ge 0$ constraints, one can equivalently maximize $W(\lambda)$, where
\[ W(\lambda) = \min_{x} L(x,\lambda), \]
subject to the constraint that $\lambda \ge 0$. This provides an alternative route to the same saddle point of $L$.
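For the one-dimensional toy problem used earlier, the dual route can be traced explicitly: $W(\lambda)=\min_x\,[x^2-\lambda(x-1)]$ is attained at $x=\lambda/2$, so $W(\lambda)=\lambda-\lambda^2/4$, and maximizing over $\lambda \ge 0$ recovers $\bar{\lambda}=2$ and the primal value $1$. A short numerical sketch (illustrative assumption, not from the notes):

# The dual route for the 1-D toy problem: max over lam >= 0 of W(lam) = min_x L(x, lam).
import numpy as np

lams = np.linspace(0.0, 10.0, 100001)
# For L(x, lam) = x^2 - lam*(x - 1), the inner minimum is at x = lam/2, so W(lam) = lam - lam^2/4.
W = lams - lams**2 / 4.0
k = np.argmax(W)
print(lams[k], W[k])   # approx 2.0 and 1.0: the same saddle value as the primal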
The Support Vectors of SVMs
Let us apply the KKT-conditions to our original problem of finding the
separating hyperplane with maximum margin. The Lagrangean in this case
is,
\[ L(w,b,\lambda) = \frac{1}{2}\sum_{i=1}^{d} w_i^2 \;-\; \sum_{j=1}^{n} \lambda_j\left\{ y_j(\langle w,x_j\rangle + b) - 1 \right\}, \]
and the KKT-conditions for optimality are,
\[ \nabla_w L = 0, \quad\text{i.e.,}\quad w = \sum_{j=1}^{n} \lambda_j y_j x_j, \]
\[ \frac{\partial L}{\partial b} = 0, \quad\text{i.e.,}\quad \sum_{j=1}^{n} \lambda_j y_j = 0, \]
\[ \lambda_j\left\{ y_j(\langle w,x_j\rangle + b) - 1 \right\} = 0 \quad\text{for all } j \le n. \]
These provide a complete characterization of the optimal plane. The normal $w$ must be a linear combination of the observed vectors $x_j$; that's the first set of equations. The coefficients $\lambda_j y_j$ of this linear combination must add up to $0$; that's the second equation. Finally, the complementarity conditions tell us that the only non-zero Lagrange multipliers $\lambda_j$ are those associated with the vectors $x_j$ right on the margin, i.e., such that,
\[ y_j(\langle w,x_j\rangle + b) = 1. \]
These are called support vectors and they are the only ones needed, since
\[ w = \sum_{j \in J_0} \lambda_j y_j x_j, \]
where $J_0 = \{ j : x_j \text{ is a s.v.} \}$. The support vectors are the observations $x_j$ at the exact distance $r = 1/|w|$ from the separating plane. The number of such vectors is usually much smaller than $n$, and that makes it possible to consider very large numbers of examples with $x_j$ having many coordinates.
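One way to see the support-vector expansion at work is to fit a linear SVM on toy data and check that $w$ is recovered from the support vectors alone. The sketch below uses scikit-learn's SVC with a very large C to approximate the hard-margin problem; the data, the value of C, and the use of scikit-learn are assumptions for illustration.

# Sketch: checking the support-vector expansion with scikit-learn's linear SVC
# (a large C approximates the hard-margin problem; the toy data is an assumption).
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)

# dual_coef_ holds y_j * lambda_j for the support vectors only.
w_from_sv = clf.dual_coef_ @ clf.support_vectors_
print(np.allclose(w_from_sv, clf.coef_))          # True: w is a l.c. of the support vectors
print(clf.support_)                                # the indices j in J_0
# Support vectors sit on the margin: y_j(<w, x_j> + b) = 1.
print(y[clf.support_] * clf.decision_function(X[clf.support_]))  # approx 1.0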
The Dual Problem for SVMs
The dual problem for SVMs turns out to be even simpler than the primal, and its formulation shows the way to a magnificent non-linear generalization. For a given vector $\lambda$ of Lagrange multipliers, the minimizer of $L(w,b,\lambda)$ w.r.t. $(w,b)$ must satisfy the optimality conditions obtained above, i.e., $w$ is a l.c. of the $x_j$'s with coefficients $\lambda_j y_j$ that must add up to zero. Hence, replacing these conditions into $L(w,b,\lambda)$ we obtain the dual formulation, where,
\[ W(\lambda) = \sum_{j=1}^{n} \lambda_j \;-\; \frac{1}{2}\sum_{i,j} \lambda_i\lambda_j\, y_i y_j \langle x_i, x_j\rangle \]
and the $\lambda \ge 0$ satisfying,
\[ \sum_{j=1}^{n} \lambda_j y_j = 0. \]
Maximizing $W(\lambda)$ over $\lambda \ge 0$ s.t. the above simple linear constraint is satisfied is the preferred form to feed a QP algorithm. Once an optimal $\lambda$ is obtained, we find $w$ as the l.c. of the $x_j$ as above, and we find $b$ by recalling that the plane must be in canonical position, so
\[ \min_{i \le n} y_i(\langle w,x_i\rangle + b) = 1 = y_j(\langle w,x_j\rangle + b) \quad\text{for all } j \in J_0 \]
and we get,
\[ b = y_j - \langle w,x_j\rangle \quad\text{for any } j \in J_0. \]
Multiplying through by $\lambda_j$ and adding over $j$ (using $\sum_j \lambda_j y_j = 0$ and $w = \sum_i \lambda_i y_i x_i$) we find,
\[ b = -\,\frac{\sum_{i,j} \lambda_i\lambda_j\, y_i \langle x_i, x_j\rangle}{\sum_{j} \lambda_j}, \]
and it can be readily checked that this value coincides with the value of the Lagrange multiplier $b$ associated with the constraint $\sum_i \lambda_i y_i = 0$ (just find $\nabla_\lambda L = 0$ for the Lagrangean associated with the dual, i.e. $L = W(\lambda) - b\sum_i \lambda_i y_i$). The optimal values of $b$ and of $\lambda$ are often returned by the modern QP solvers based on interior point algorithms.
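Putting the pieces together, here is a sketch of the dual fed to a QP solver in the form just described, with $w$ and $b$ recovered from the support vectors afterwards. As before, the toy data, the cvxopt solver, the small ridge on the quadratic term, and the threshold used to detect non-zero multipliers are illustrative assumptions, not part of these notes.

# Sketch: the SVM dual as a QP in cvxopt's standard form (toy data is an assumption).
# minimize (1/2) lam^T Q lam - 1^T lam  s.t.  lam >= 0 and sum_i y_i lam_i = 0,
# where Q_ij = y_i y_j <x_i, x_j>.
import numpy as np
from cvxopt import matrix, solvers

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)

Q = (y[:, None] * X) @ (y[:, None] * X).T      # Q_ij = y_i y_j <x_i, x_j>
P = matrix(Q + 1e-10 * np.eye(n))              # tiny ridge for numerical safety (assumption)
q = matrix(-np.ones(n))
G, h = matrix(-np.eye(n)), matrix(np.zeros(n)) # lam >= 0
A, b_eq = matrix(y[None, :]), matrix(0.0)      # sum_i y_i lam_i = 0

solvers.options["show_progress"] = False
sol = solvers.qp(P, q, G, h, A, b_eq)
lam = np.array(sol["x"]).ravel()

sv = lam > 1e-6                                # support vectors: strictly positive multipliers
w = (lam * y)[sv] @ X[sv]                      # w = sum_{j in J_0} lam_j y_j x_j
b = np.mean(y[sv] - X[sv] @ w)                 # canonical position: y_j(<w, x_j> + b) = 1
print("support vectors:", np.where(sv)[0], "w =", w, "b =", b)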