The Kernel Trick
Carlos C. Rodríguez
https://omega0.xyz/omega8008/
Why don't we do it in higher dimensions?
If SVMs were able to handle only linearly separable data, their usefulness would
be quite limited. But someone around 1992 (correct?) had a moment of
true enlightenment and realized that if, instead of the Euclidean inner product
$\langle x_i, x_j\rangle$, one fed the QP solver a function $K(x_i,x_j)$,
the boundary between the two classes would then be
$$\sum_i \alpha_i y_i\, K(x_i, x) + b = 0,$$
and the set of $x \in \mathbb{R}^d$ on that boundary becomes a curved surface embedded
in $\mathbb{R}^d$ when the function $K(x,w)$ is non-linear. I don't really know
how the history went (Vladimir?), but it could have perfectly well been the case that
a courageous soul just fed the QP solver with $K(x,w) = \exp(-\|x-w\|^2)$ and
waited for the picture of the classification boundary to appear on the screen,
and saw the first non-linear boundary computed with just linear methods.
Then, depending on that person's mathematical training, s/he either said
"aha! $K(x,w)$ is a kernel in a reproducing kernel Hilbert space" or rushed to
the library or to the guy next door to find out, and probably very soon after
that said "aha! $K(x,w)$ is the kernel of an RKHS." Once you know about RKHSs,
that conclusion is inescapable, but that, I believe, does not diminish the
importance of this discovery. No. Mathematicians did not know about the
kernel trick then, and most of them don't know about the kernel trick now.
Let's go back to the function $K(x,w)$ and try to understand why, even when
it is non-linear in its arguments, it still makes sense as a proxy for $\langle x,w\rangle$.
The way to think about it is to consider $K(x,w)$ to be the inner product
not of the coordinate vectors $x$ and $w$ in $\mathbb{R}^d$ but of vectors
$\phi(x)$ and $\phi(w)$ in higher dimensions. The map,
$$\phi: \mathcal{X} \to \mathcal{H}, \qquad x \mapsto \phi(x),$$
is called a feature map from the data space $\mathcal{X}$ into the feature space $\mathcal{H}$.
The feature space is assumed to be a Hilbert space of real-valued functions defined
on $\mathcal{X}$. The data space is often $\mathbb{R}^d$ but most of the interesting results
hold when $\mathcal{X}$ is a compact Riemannian manifold.
A particularly simple example is the feature map
$\phi(x_1,x_2) = (x_1^2,\ \sqrt{2}\,x_1 x_2,\ x_2^2)$,
which maps data in $\mathbb{R}^2$ into $\mathbb{R}^3$.
There are many ways of mapping points in $\mathbb{R}^2$ to points in $\mathbb{R}^3$,
but this one has the extremely useful property of allowing the
computation of the inner products of feature vectors $\phi(x)$ and
$\phi(w)$ in $\mathbb{R}^3$ by just squaring the inner product of the
data vectors $x$ and $w$ in $\mathbb{R}^2$! (To appreciate the exclamation
point just replace 3 by 10 or by infinity!)
i.e., in this case
$$\langle \phi(x), \phi(w)\rangle \;=\; x_1^2 w_1^2 + 2\, x_1 x_2\, w_1 w_2 + x_2^2 w_2^2 \;=\; (x_1 w_1 + x_2 w_2)^2 \;=\; \langle x, w\rangle^2.$$
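For those who like to see the numbers agree, here is a quick numerical check of this identity, a minimal sketch in Python with NumPy; the particular points $x$ and $w$ are of course arbitrary:

```python
import numpy as np

def phi(x):
    # Explicit feature map from R^2 to R^3:
    # phi(x1, x2) = (x1^2, sqrt(2)*x1*x2, x2^2)
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

x = np.array([1.0, 2.0])
w = np.array([3.0, -1.0])

lhs = phi(x) @ phi(w)   # inner product of the feature vectors in R^3
rhs = (x @ w) ** 2      # squared inner product of the data vectors in R^2
print(lhs, rhs)         # both equal 1.0 for these points
```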
OK. But are we really entitled to just Humpty Dumpty move up to higher, even
infinite dimensions?
Sure, why not? Just remember that even the coordinate vectors $x$ and
$w$ are just that, coordinates, arbitrary labels, that stand as proxies
for the objects that they label, perhaps images or speech. What is
important is the interrelation of these objects as measured, for
example, by the Gram matrix of abstract inner products
$(\langle x_i,x_j\rangle)$, and not the labels themselves with the Euclidean dot
product. If we think of the $x_i$ as abstract labels (coordinate-free),
as pointers to the objects that they represent, then we are
free to choose values for $\langle x_i,x_j\rangle$ representing the similarity
between $x_i$ and $x_j$, and that can be provided with a non-linear
function $K(x,w)$ producing the $n^2$ numbers
$k_{ij}=K(x_i,x_j)$ when the abstract labels $x_i$ are replaced
with the values of the observed data. In fact, the QP solver sees
the data only through the $n^2$ numbers $k_{ij}$. These could,
in principle, be chosen without even a kernel function and the QP solver
will deliver $(w,b)$. The kernel function becomes useful for computing
the classification boundary, but even that could be empirically
approximated. Now, of course, arbitrary $n^2$ numbers $k_{ij}$ that
disregard the observed data completely will not be of much help, unless
they happen to contain cogent prior information about the problem.
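To make the "the QP solver sees the data only through the $k_{ij}$" point concrete, here is a minimal sketch (assuming NumPy and scikit-learn are available; the Gaussian kernel, the toy ring-shaped data, and the parameter values are illustrative choices, not anything prescribed above) in which the SVM is trained from a precomputed Gram matrix alone:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Toy two-class problem in R^2: inner blob vs. outer ring (not linearly separable).
X = rng.normal(size=(200, 2))
y = (np.linalg.norm(X, axis=1) > 1.0).astype(int)

def gram(A, B, gamma=1.0):
    # k_ij = exp(-gamma * |a_i - b_j|^2): the n^2 (or n_test x n_train) numbers k_ij.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)

K = gram(X, X)                               # all the QP solver ever gets to see
clf = SVC(kernel="precomputed").fit(K, y)    # train from the Gram matrix alone
print(clf.score(gram(X, X), y))              # accuracy on the training data
```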
So, what's the point that I am trying to make? The point is that
it is obvious that a choice of kernel function is an ad-hoc way of
sweeping prior information into the problem under the rug,
indutransductibly (!) ducking the holy Bayes Theorem. There is a kosher
path that flies under the banner of Gaussian processes and RVMs,
but they are not cheap. We'll look at the way of the Laplacians
(formerly a.k.a. Bayesians) later, but let's stick with the SVMs for now.
The question is:
What are the constraints on the function $K(x,w)$ so that there exists
a Hilbert space $\mathcal{H}$ of abstract labels $\phi(x) \in \mathcal{H}$ such that
$\langle \phi(x_i),\phi(x_j)\rangle = k_{ij}$?
The answer: it is sufficient for $K(x,w)$ to be continuous, symmetric
and positive definite, i.e., for $\mathcal{X}$ either $\mathbb{R}^d$ or a $d$-dimensional
compact Riemannian manifold, $K:\mathcal{X}\times\mathcal{X}\to\mathbb{R}$
satisfies:
- $K(x,w)$ is continuous.
- $K(x,w) = K(w,x)$ (symmetric).
- $K(x,w)$ is p.d. (positive definite), i.e., for any set
$\{z_1,\ldots,z_m\} \subset \mathcal{X}$ the $m \times m$ matrix
$K[z] = (K(z_i,z_j))_{ij}$ is positive definite.
Such a function is said to be a Mercer kernel.
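These conditions can at least be sanity-checked numerically on a finite sample: build the matrix $K[z]$ for some points $z_1,\ldots,z_m$ and verify symmetry and nonnegative eigenvalues. A minimal sketch (assuming NumPy; the Gaussian kernel and the random points are illustrative, and passing this check on one sample is of course not a proof that $K$ is a Mercer kernel):

```python
import numpy as np

def K(x, w, gamma=1.0):
    # Candidate kernel K(x, w) = exp(-gamma * |x - w|^2)
    return np.exp(-gamma * np.sum((x - w) ** 2))

rng = np.random.default_rng(1)
Z = rng.normal(size=(50, 3))                       # a finite set {z_1, ..., z_m}
Kz = np.array([[K(zi, zj) for zj in Z] for zi in Z])

print(np.allclose(Kz, Kz.T))                       # symmetry: K(z_i, z_j) = K(z_j, z_i)
print(np.linalg.eigvalsh(Kz).min() > -1e-10)       # eigenvalues >= 0 (up to round-off)
```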
We have:
- Theorem: If $K(x,w)$ is a Mercer kernel then there exists a
Hilbert space $\mathcal{H}_K$ of real-valued functions defined on $\mathcal{X}$ and
a feature map $\phi:\mathcal{X} \to \mathcal{H}_K$ such that
$$K(x,w) = \langle \phi(x), \phi(w)\rangle_K$$
where $\langle\cdot,\cdot\rangle_K$ is the inner product on $\mathcal{H}_K$.
Proof: Let $L$ be the vector space containing all the real-valued
functions $f$ defined on $\mathcal{X}$ of the form
$$f(\cdot) = \sum_{j=1}^{m} a_j K(x_j,\cdot)$$
where $m$ is a positive integer, the $a_j$'s are real numbers and
$\{x_1,\ldots,x_m\} \subset \mathcal{X}$. Define in $L$ the inner product
$$\left\langle\, \sum_{i=1}^{m} a_i K(x_i,\cdot),\ \sum_{j=1}^{n} b_j K(w_j,\cdot)\, \right\rangle \;=\; \sum_{i=1}^{m}\sum_{j=1}^{n} a_i b_j\, K(x_i,w_j).$$
Since $K$ is a Mercer kernel, the above definition makes $L$ a well-defined
inner product space. The Hilbert space $\mathcal{H}_K$ is obtained
when we add to $L$ the limits of all the Cauchy sequences (w.r.t. the
norm induced by $\langle\cdot,\cdot\rangle$) in $L$.
Notice that the inner product in $\mathcal{H}_K$ was defined so that
$$\langle K(x,\cdot),\, K(w,\cdot)\rangle_K = K(x,w).$$
We can then take the feature map to be
$$\phi(x) = K(x,\cdot).$$
RKHS
The space $\mathcal{H}_K$ is said to be a Reproducing
Kernel Hilbert Space (RKHS). Moreover, for all $f \in \mathcal{H}_K$,
$$f(x) = \langle f,\, K(x,\cdot)\rangle_K.$$
This follows from the definition of the inner product when $f \in L$ and
by continuity for all $f \in \mathcal{H}_K$. It is also easy to show that the
reproducing property is equivalent to the continuity of the evaluation
functionals $\delta_x(f) = f(x)$.
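The reproducing property can also be checked numerically for elements of $L$, using nothing but the inner product defined in the proof. A minimal sketch (assuming NumPy; the Gaussian kernel, the centers and the coefficients are arbitrary illustrative choices):

```python
import numpy as np

def K(x, w, gamma=1.0):
    # A Mercer kernel (Gaussian), used only through its values.
    return np.exp(-gamma * np.sum((x - w) ** 2))

rng = np.random.default_rng(3)
centers = rng.normal(size=(5, 2))   # x_1, ..., x_m
a = rng.normal(size=5)              # coefficients a_1, ..., a_m

def f(x):
    # An element of L: f = sum_i a_i K(x_i, .)
    return sum(ai * K(xi, x) for ai, xi in zip(a, centers))

def inner(a1, c1, a2, c2):
    # <sum_i a_i K(x_i,.), sum_j b_j K(w_j,.)> = sum_ij a_i b_j K(x_i, w_j)
    return sum(ai * bj * K(xi, wj)
               for ai, xi in zip(a1, c1)
               for bj, wj in zip(a2, c2))

x = rng.normal(size=2)
# Reproducing property: f(x) = <f, K(x, .)>_K
print(np.isclose(f(x), inner(a, centers, [1.0], [x])))
```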
The Mercer kernel $K(x,w)$ naturally defines an
integral linear operator that (abusing notation a bit) we also
denote by $K : \mathcal{H}_K \to \mathcal{H}_K$, where
$$(Kf)(x) = \int K(x,y)\, f(y)\, dy$$
and since $K$ is symmetric and positive definite, it is orthogonally
diagonalizable (just as in the finite-dimensional case).
Thus, there is an ordered orthonormal basis $\{\phi_1,\phi_2,\ldots\}$
of eigenvectors of $K$, i.e., for $j=1,2,\ldots$
$$K\phi_j = \lambda_j \phi_j$$
with $\langle\phi_i,\phi_j\rangle_K = \delta_{ij}$,
$\lambda_1 \ge \lambda_2 \ge \cdots > 0$, and such that
$$K(x,w) = \sum_{j=1}^{\infty} \lambda_j\, \phi_j(x)\, \phi_j(w)$$
from where it follows that the feature map is
$$\phi(x) = \sum_{j=1}^{\infty} \lambda_j^{1/2}\, \phi_j(x)\, \phi_j$$
producing
$$K(x,w) = \langle \phi(x), \phi(w)\rangle_K.$$
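A finite-sample analogue of this eigen-expansion is the eigendecomposition of the kernel matrix on $n$ data points: the columns of $U\,\mathrm{diag}(\lambda)^{1/2}$ play the role of the feature vectors, and their inner products recover $K$. A minimal sketch (assuming NumPy; the Gaussian kernel and the random data are illustrative choices):

```python
import numpy as np

def K(x, w, gamma=1.0):
    # Gaussian Mercer kernel (illustrative choice).
    return np.exp(-gamma * np.sum((x - w) ** 2))

rng = np.random.default_rng(4)
Z = rng.normal(size=(100, 2))
G = np.array([[K(zi, zj) for zj in Z] for zi in Z])

# Orthogonal diagonalization of the kernel matrix: G = U diag(lam) U^T,
# the finite-sample stand-in for K(x,w) = sum_j lam_j phi_j(x) phi_j(w).
lam, U = np.linalg.eigh(G)
lam, U = lam[::-1], U[:, ::-1]                # decreasing eigenvalues

# Feature vectors Phi_i = (sqrt(lam_j) * U_ij)_j, analogous to
# phi(x) = sum_j lam_j^{1/2} phi_j(x) phi_j.  Their inner products recover G.
Phi = U * np.sqrt(np.clip(lam, 0.0, None))
print(np.allclose(Phi @ Phi.T, G))
```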
Notice also that any continuous map $\phi: \mathcal{X} \to \mathcal{H}$, where
$\mathcal{H}$ is a Hilbert space, defines $K(x,w) = \langle\phi(x),\phi(w)\rangle$, which is
a Mercer kernel.
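And in that direction a one-line check: pick any continuous feature map, define $K$ as the inner product of the features, and the resulting kernel matrix comes out symmetric and positive semi-definite. A minimal sketch (assuming NumPy; the particular map is an arbitrary illustrative choice):

```python
import numpy as np

def feature_map(x):
    # An arbitrary continuous map from R^2 into R^4 (illustrative choice).
    x1, x2 = x
    return np.array([x1, x2, x1 * x2, np.sin(x1)])

def K(x, w):
    # The induced kernel K(x, w) = <phi(x), phi(w)>.
    return feature_map(x) @ feature_map(w)

rng = np.random.default_rng(2)
Z = rng.normal(size=(30, 2))
G = np.array([[K(zi, zj) for zj in Z] for zi in Z])
print(np.allclose(G, G.T), np.linalg.eigvalsh(G).min() >= -1e-10)  # symmetric and p.s.d.
```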