Linear Classifiers
Carlos C. Rodríguez
https://omega0.xyz/omega8008/
Classification with Hyperplanes
Assume we have observed $n$ examples of labeled data
$(X_1,Y_1),\ldots,(X_n,Y_n)$, but now let us take the labels
$Y \in \{1,-1\}$. This convention simplifies some of the formulas below
and it is the standard for support vector machines.
We'll be interested in classification rules of the form
$$ g(x) = \mathrm{sgn}\left( \langle w, x \rangle + b \right) $$
where $\langle\cdot,\cdot\rangle$ denotes the (Euclidean) inner product in $\mathbb{R}^d$, and $w \in \mathbb{R}^d$ and $b \in \mathbb{R}$ parametrize all the rules of this type. These
rules are called linear. The class $\mathcal{G}$ of linear classifiers is
about the simplest possible way of separating two kinds of objects in
$\mathbb{R}^d$. A classifier in $\mathcal{G}$ assigns the label $+1$ or $-1$ to a data
point $x \in \mathbb{R}^d$ depending on whether we reach $x$ from the plane
by going in the direction of the normal $w$ or of its opposite $-w$.
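As a concrete illustration, here is a minimal NumPy sketch of such a rule; the parameters w, b and the sample points are made-up values, not taken from the text, and the convention sgn(0) = +1 is an arbitrary choice.

import numpy as np

def linear_classifier(w, b):
    """Return the rule g(x) = sgn(<w, x> + b) for the given parameters."""
    def g(x):
        # sgn(0) is taken to be +1 here; any fixed convention works.
        return 1 if np.dot(w, x) + b >= 0 else -1
    return g

# Hypothetical parameters in R^2 and two sample points.
w = np.array([2.0, -1.0])
b = 0.5
g = linear_classifier(w, b)
print(g(np.array([1.0, 1.0])))    # +1: reached by going along the normal w
print(g(np.array([-1.0, 2.0])))   # -1: reached by going along -w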
Recall that the set of vectors
$$ \{ x \in \mathbb{R}^d : \langle w, x \rangle + b = 0 \} $$
corresponds to the hyperplane perpendicular to $w$ that goes through every
point $x_0$ such that $\langle w, x_0 \rangle = -b$. For example, the following
picture shows such a hyperplane that perfectly separates the O's (with
label $+1$) from the X's (with label $-1$).
The hyperplane with parameters $(w,b)$ is the same as the hyperplane with
parameters $(cw,cb)$ for any non-zero scalar $c$. To fix the scale we say
that a plane is in canonical position with respect to a set of points
$X = \{x_1,\ldots,x_n\}$ if
$$ \min_{i \le n} \left| \langle w, x_i \rangle + b \right| = 1. $$
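A minimal sketch of this rescaling, with made-up points and an arbitrary starting pair (w, b): dividing both by $\min_{i \le n} |\langle w, x_i \rangle + b|$ leaves the hyperplane unchanged and puts it in canonical position.

import numpy as np

def to_canonical(w, b, X):
    """Rescale (w, b) so that min_i |<w, x_i> + b| = 1."""
    values = np.abs(X @ w + b)      # |<w, x_i> + b| for every point
    c = 1.0 / values.min()          # assumes no point lies exactly on the plane
    return c * w, c * b

# Hypothetical points in R^2 and a non-canonical hyperplane.
X = np.array([[2.0, 1.0], [3.0, -1.0], [-1.0, -2.0]])
w, b = np.array([4.0, 2.0]), -2.0
w_c, b_c = to_canonical(w, b, X)
print(np.abs(X @ w_c + b_c).min())  # prints 1.0 by construction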
Clearly, the perpendicular distance from $x_i$ to the hyperplane
with parameters $(w,b)$ is the magnitude of the projection of
$(x_i - x_0)$ onto the unit normal direction $w/|w|$. It is given by
$$ d_i = \frac{\left| \langle w, x_i - x_0 \rangle \right|}{|w|} = \frac{\left| \langle w, x_i \rangle + b \right|}{|w|} $$
as can be seen from the above picture. Thus, if we assume that the
plane is in canonical position w.r.t. $\{x_i : i \le n\}$, then
$$ r = \min_{i \le n} d_i = \frac{1}{|w|}. $$
This minimum distance $r$ is known as the margin between a separating
hyperplane and the points w.r.t. which that plane is in canonical position.
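Continuing the hypothetical example above, a short check that, once (w, b) is canonical, the smallest perpendicular distance to the points is exactly $1/|w|$:

import numpy as np

def margin(w, b, X):
    """Smallest perpendicular distance from the points in X to the hyperplane (w, b)."""
    return (np.abs(X @ w + b) / np.linalg.norm(w)).min()

# Same made-up points as before, with (w_c, b_c) already in canonical position.
X = np.array([[2.0, 1.0], [3.0, -1.0], [-1.0, -2.0]])
w_c, b_c = np.array([0.5, 0.25]), -0.25
print(margin(w_c, b_c, X))         # r = min_i d_i
print(1.0 / np.linalg.norm(w_c))   # equals r since the plane is canonical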
Maximum Margin Classifiers
When there is a hyperplane with
parameters $(w,b)$ that separates the points $\{x_i : y_i = 1,\ i \le n\}$
from the points $\{x_i : y_i = -1,\ i \le n\}$, then, if we take the
plane in canonical position w.r.t. $X = \{x_1,\ldots,x_n\}$, we have
$$ y_i \left( \langle w, x_i \rangle + b \right) \ge 1 \quad \text{for all } i \le n. $$
There could be many such planes, but it is geometrically obvious that
the one farthest from the points should be preferred over all the
others. The intuition behind this is that the farther the
separating boundary is from the observations, the more likely it seems
that future data will end up being correctly classified by this plane. In
other words, we expect better generalization power for boundaries with
larger margin, i.e., for planes with small $|w|$ (recall that $r = 1/|w|$). This intuition is confirmed by the theorem below.
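First, a small numerical sketch of fitting such a maximum margin classifier; it uses scikit-learn's linear SVM with a very large penalty C (an illustrative choice) so that the soft-margin solution is essentially the hard-margin one, on made-up, linearly separable data.

import numpy as np
from sklearn.svm import SVC

# Hypothetical separable data: O's (label +1) and X's (label -1).
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],
              [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
y = np.array([1, 1, 1, -1, -1, -1])

# A very large C approximates the hard-margin (maximum margin) solution.
clf = SVC(kernel="linear", C=1e6).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]

print("w =", w, " b =", b)
print("margin r = 1/|w| =", 1.0 / np.linalg.norm(w))
print("min_i y_i(<w,x_i>+b) =", (y * (X @ w + b)).min())  # close to 1 (canonical position)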
Theorem
Consider hyperplanes through the origin in canonical position
w.r.t. $X = \{x_1,\ldots,x_n\}$. Then, the set of linear classifiers
$g_w(x) = \mathrm{sgn}\,\langle w, x \rangle$ with $|w| \le L$ has VC dimension
$$ V \le \min\left\{ (RL)^d,\ (RL)^2 \right\}, $$
where $R$ is the radius of the smallest sphere around the origin containing $X$.
Before proving it, notice that the length of $w$ controls the VC
dimension, so the larger the margin (the smaller $|w|$), the smaller the VC
dimension and the better the expected generalization.
Proof:
We need to show both $V \le (RL)^d$ and $V \le (RL)^2$.
The first inequality follows from the fact that the maximum number
of spheres of radius $1/L$ that still fit inside a sphere of radius $R$
is at most $\mathrm{vol}(R)/\mathrm{vol}(1/L) = (RL)^d$, where
$\mathrm{vol}(r) = C_d r^d$ is the volume of the sphere of radius $r$ in $\mathbb{R}^d$.
More than this number of points is impossible to shatter with a margin of
at least $1/L$. Thus, $V \le (RL)^d$. For the other inequality
we again show that the number of points that can be shattered with margin $1/L$
is at most $(RL)^2$. If $n$ points can be shattered, then for all possible
choices of $y_i \in \{1,-1\}$ there is a hyperplane with parameter $w$, with
$|w| \le L$, in canonical position w.r.t. $\{x_1,\ldots,x_n\}$, separating the
$+1$'s from the $-1$'s, i.e., satisfying
$$ y_i \langle w, x_i \rangle \ge 1 \quad \text{for all } i = 1,\ldots,n. $$
But this, together with the Cauchy-Schwarz inequality and the fact that $|w| \le L$,
implies
$$ n \le \left\langle w, \sum_{i=1}^n y_i x_i \right\rangle \le |w|\,\left| \sum_{i=1}^n y_i x_i \right| \le L\,\left| \sum_{i=1}^n y_i x_i \right|, $$
where the first inequality follows by summing the $n$ constraints above.
To bound the sum, consider the case when the labels $y_i$ are Rademacher variables, i.e.,
iid with $P\{y_i = 1\} = 1 - P\{y_i = -1\} = 1/2$. From the fact that $E\{y_i y_j\} = 0$
when $i \ne j$ and that $y_i^2 = 1$ for all $i \le n$, we obtain
$$ E\left| \sum_{i=1}^n y_i x_i \right|^2 = \sum_{i=1}^n E\left\{ \left\langle y_i x_i, \sum_{j=1}^n y_j x_j \right\rangle \right\} = \sum_{i=1}^n \left\{ \sum_{j \ne i} E\left[ \langle y_i x_i, y_j x_j \rangle \right] + E\left[ \langle y_i x_i, y_i x_i \rangle \right] \right\} = \sum_{i=1}^n |x_i|^2 \le nR^2. $$
If the bound is true for the expectation when Rademacher variables are used, then there
must exist $y_i$'s for which
$$ \left| \sum_{i=1}^n y_i x_i \right|^2 \le nR^2. $$
Squaring the initial inequality and using the above bound, we obtain
$$ n^2 \le L^2 \left| \sum_{i=1}^n y_i x_i \right|^2 \le n R^2 L^2, $$
and the result follows after dividing through by $n$. $\square$
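As a quick sanity check of the key identity $E\bigl|\sum_i y_i x_i\bigr|^2 = \sum_i |x_i|^2 \le nR^2$ used above, here is a small Monte Carlo sketch with made-up points; the number of trials is an arbitrary choice.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical points in R^2, all inside the sphere of radius R around the origin.
X = np.array([[2.0, 1.0], [0.5, -1.5], [-1.0, 2.0], [1.5, 0.5]])
n, d = X.shape
R = np.linalg.norm(X, axis=1).max()

# Draw Rademacher labels and average |sum_i y_i x_i|^2 over many samples.
trials = 100_000
Y = rng.choice([-1.0, 1.0], size=(trials, n))
sums = Y @ X                                  # each row is sum_i y_i x_i
estimate = (sums ** 2).sum(axis=1).mean()     # Monte Carlo estimate of E|sum|^2

print("estimate:", estimate)
print("sum_i |x_i|^2:", (X ** 2).sum())       # exact expectation
print("upper bound n R^2:", n * R ** 2)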