
Learning Theory Notes


*****

I hope you find the material on this site helpful in your quest to learn. My notes are not copyrighted in any way. In fact, you are encouraged to copy them!

Introduction to Probability

Pattern Recognition

Learning Patterns
Basic definitions, The Bayes Rule, An example, Learning with a teacher, Definition of consistency, Empirical Risk Minimization and the need for Uniform Laws of Large Numbers.
Reference:
Mostly from "A Probabilistic Theory of Pattern Recognition" by Devroye, Gyorfi, and Lugosi. 1996. Springer.
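As a concrete companion to these topics, here is a minimal Python sketch (my own illustration, not taken from the notes) of empirical risk minimization over a finite class of threshold rules, on a toy distribution whose Bayes rule is the threshold at 1/2 and whose Bayes risk equals the label-noise level. All names and parameter choices below are mine.

    # Illustrative sketch: empirical risk minimization over the finite class
    # of threshold rules h_t(x) = sign(x - t) on a toy noisy problem.
    import numpy as np

    rng = np.random.default_rng(0)

    # Toy data: x ~ Uniform(0,1); the label agrees with sign(x - 1/2) except
    # for 10% label noise, so the Bayes rule is t = 1/2 and the Bayes risk 0.1.
    n = 200
    x = rng.uniform(0, 1, n)
    y = np.where(x > 0.5, 1, -1)
    flip = rng.uniform(0, 1, n) < 0.1
    y[flip] *= -1

    def empirical_risk(t, x, y):
        """Fraction of training points misclassified by h_t(x) = sign(x - t)."""
        pred = np.where(x > t, 1, -1)
        return np.mean(pred != y)

    # ERM: pick the rule in the (finite) class with smallest empirical risk.
    thresholds = np.linspace(0, 1, 101)
    risks = [empirical_risk(t, x, y) for t in thresholds]
    t_erm = thresholds[int(np.argmin(risks))]
    print("ERM threshold:", t_erm, "empirical risk:", min(risks))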

Laws of Large Numbers

Uniform Laws
What's a Law of Large Numbers again?, The Weak Law, Markov's and Tchebychev's inequalities, Chernoff's Method, The Strong Law, Hoeffding's inequality, The modern Strong Uniform Laws, McDiarmid's inequality, Enter Combinatorics.
References: Mostly from "Combinatorial Methods in Density Estimation" by Devroye and Lugosi. 2001. Springer.
Classics: Chernoff 1952, where his method appears. To know more about the powerful new-age concentration of measure phenomenon, see Talagrand's 1996 survey and the references therein.
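A quick numerical sanity check (an illustration of mine, not from the notes): compare the deviation probability of a Bernoulli sample mean with Hoeffding's bound 2*exp(-2*n*eps^2). The choices n = 100 and eps = 0.1 are arbitrary.

    # Illustrative sketch: the sample mean of n Bernoulli(1/2) variables versus
    # Hoeffding's tail bound P(|mean - 1/2| >= eps) <= 2*exp(-2*n*eps^2).
    import numpy as np

    rng = np.random.default_rng(1)
    n, eps, trials = 100, 0.1, 20000

    means = rng.binomial(n, 0.5, size=trials) / n
    empirical = np.mean(np.abs(means - 0.5) >= eps)
    hoeffding = 2 * np.exp(-2 * n * eps**2)
    print(f"empirical deviation prob. ~ {empirical:.4f}, Hoeffding bound = {hoeffding:.4f}")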

VC Dimension

Examples and Tools
Direct computation of VC dimension for simple classes: half-lines and intervals. General properties of shatter coefficients. South-West sets and rectangles in dimension d. The simplest (as far as I know) proof of Sauer's Lemma. Upper bounds on shatter coefficients in terms of VC dimension. The Steele-Dudley general class.
References: The ones above plus a bit from Pollard's 1984 book: "Convergence of Stochastic Processes". Springer.
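Intervals on the line can be checked by brute force. The Python sketch below (mine, not the notes') counts the subsets of n points picked out by intervals and compares the count with Sauer's bound sum_{i<=d} C(n,i) for VC dimension d = 2, which intervals happen to attain with equality.

    # Illustrative sketch: shatter coefficient of intervals [a, b] on n points
    # versus Sauer's bound with VC dimension d = 2.
    from itertools import combinations
    from math import comb

    def shatter_coefficient_intervals(n):
        pts = list(range(n))                       # n distinct points on the line
        subsets = {frozenset()}                    # the empty set (empty interval)
        for i, j in combinations(range(n), 2):
            subsets.add(frozenset(pts[i:j + 1]))   # consecutive runs of length >= 2
        for i in range(n):
            subsets.add(frozenset([pts[i]]))       # singletons
        return len(subsets)

    for n in range(1, 8):
        sauer = sum(comb(n, i) for i in range(3))  # sum_{i<=2} C(n, i)
        print(n, shatter_coefficient_intervals(n), sauer)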

Back to the Beginning: What have we learned?

PAC Bounds
Decomposition of errors. Probably Approximately Correct bounds. Inversion. PAC bounds for the true risk of an estimator. The finite case. Interpretation of VC dimension and shatter coefficients as measures of the effective size of an infinite class.
References: The ones above plus Boucheron-Bousquet-Lugosi 2004.
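For a finite class, the chain Hoeffding -> union bound -> inversion fits in a few lines. The sketch below (mine, with arbitrary example numbers) computes the resulting uniform deviation and the sample size needed for a target accuracy eps and confidence 1 - delta.

    # Illustrative sketch: the finite-class PAC bound from Hoeffding's
    # inequality plus a union bound over |H| classifiers, and its inversion
    # into a sample-size requirement.
    import math

    def risk_deviation(n, num_hypotheses, delta):
        """With prob. >= 1 - delta, sup_h |R(h) - R_hat(h)| <= this value."""
        return math.sqrt(math.log(2 * num_hypotheses / delta) / (2 * n))

    def sample_size(eps, num_hypotheses, delta):
        """Smallest n making the uniform deviation above at most eps."""
        return math.ceil(math.log(2 * num_hypotheses / delta) / (2 * eps**2))

    print(risk_deviation(n=1000, num_hypotheses=100, delta=0.05))
    print(sample_size(eps=0.05, num_hypotheses=100, delta=0.05))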

Linear Rules and Fat Margins

Classification with Hyperplanes
Here is the line: the points on this side are +1, the points on the other side are -1. Fat lines have better generalization power. Definition of the margin and of a hyperplane in canonical position.
References: SVM and Kernel Methods. Slides by Bernhard Schölkopf. learning-with-kernels.org
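A small illustration (mine, not from the slides) of the canonical-position convention: rescale (w, b) so that the closest training point satisfies |w.x + b| = 1, after which the geometric margin is 1/||w||. The data and the starting hyperplane below are arbitrary.

    # Illustrative sketch: canonical position and margin of a hyperplane.
    import numpy as np

    def canonical_rescale(w, b, X):
        """Rescale (w, b) so that min_i |w.x_i + b| = 1."""
        s = np.min(np.abs(X @ w + b))
        return w / s, b / s

    def margin(w):
        """Distance from the canonical hyperplane to the closest point."""
        return 1.0 / np.linalg.norm(w)

    X = np.array([[2.0, 2.0], [1.0, 1.0], [-1.0, -1.0], [-2.0, -1.0]])
    w, b = canonical_rescale(np.array([1.0, 1.0]), 0.0, X)
    print("canonical w, b:", w, b, "margin:", margin(w))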

Support Vector Machines

SVMs
Finding the hyperplane with largest margin. The Karush-Kuhn-Tucker (KKT) conditions. The Dual Problem. The Support Vectors of SVMs. The Quadratic Programming problem of SVMs.
References: Bernhard Schölkopf's slides above and Tamas Terlaky's notes on the KKT conditions. Examples in Octave.
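Since the worked examples referenced above are in Octave, here is an independent Python sketch instead (mine, not the notes' code): it solves the dual quadratic program on a tiny separable data set with scipy's general-purpose SLSQP solver (a dedicated QP solver would be used in practice) and recovers w and b from the KKT conditions. The data set and the value of C are arbitrary.

    # Illustrative sketch: the SVM dual
    #   max_a  sum_i a_i - (1/2) sum_ij a_i a_j y_i y_j <x_i, x_j>
    #   s.t.   0 <= a_i <= C,  sum_i a_i y_i = 0
    # solved with a generic constrained optimizer on a toy data set.
    import numpy as np
    from scipy.optimize import minimize

    X = np.array([[2.0, 2.0], [1.5, 0.5], [-1.0, -1.0], [-0.5, -1.5]])
    y = np.array([1.0, 1.0, -1.0, -1.0])
    C = 10.0

    Q = (y[:, None] * y[None, :]) * (X @ X.T)        # Q_ij = y_i y_j <x_i, x_j>
    dual = lambda a: 0.5 * a @ Q @ a - np.sum(a)     # negated dual, to minimize
    res = minimize(dual, np.zeros(len(y)), method="SLSQP",
                   bounds=[(0, C)] * len(y),
                   constraints=[{"type": "eq", "fun": lambda a: a @ y}])

    a = res.x
    w = (a * y) @ X                                  # primal weights via KKT
    sv = np.where(a > 1e-6)[0]                       # support vectors: a_i > 0
    b = np.mean(y[sv] - X[sv] @ w)                   # margin condition at the SVs
    print("alpha:", np.round(a, 4), "w:", w, "b:", b)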

The Kernel Trick

RKHS
Introduction. Feature Space. Simple Example. Mercer Kernel.
References: Bernhard Schölkopf's slides above, the Aronszajn 1950 classic, and the Cucker/Smale 2001 classic.
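A small worked illustration (mine) of the kernel trick: for k(x, z) = (<x, z>)^2 on R^2, the kernel equals an ordinary inner product after the explicit feature map phi(x) = (x1^2, x2^2, sqrt(2)*x1*x2), and a kernel matrix built from such a Mercer kernel is positive semidefinite.

    # Illustrative sketch: explicit feature space for the kernel k(x,z) = (<x,z>)^2.
    import numpy as np

    def k(x, z):
        return (x @ z) ** 2

    def phi(x):
        return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

    x, z = np.array([1.0, 2.0]), np.array([3.0, -1.0])
    print(k(x, z), phi(x) @ phi(z))              # the two numbers agree

    # Mercer condition on a sample: the kernel matrix has no negative eigenvalues.
    X = np.random.default_rng(2).normal(size=(5, 2))
    K = (X @ X.T) ** 2
    print(np.linalg.eigvalsh(K).min() >= -1e-10)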

Regression

Regression in Hilbert Space
Classical Regression. Kernel Regression. SV Regression.
References: Bernhard Schölkopf's slides above, Maple code with an example of how to use it, and Alex Smola's 1998 thesis.
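The notes point to Maple code; as an independent Python sketch (mine), here is kernel ridge regression, one standard form of regression in an RKHS: fit coefficients c = (K + lambda*I)^{-1} y and predict with f(x) = sum_i c_i k(x, x_i). The Gaussian kernel width and the regularization parameter are arbitrary choices.

    # Illustrative sketch: kernel ridge regression with a Gaussian kernel.
    import numpy as np

    def gauss_kernel(A, B, sigma=0.5):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma**2))

    rng = np.random.default_rng(3)
    x = np.linspace(0, 2 * np.pi, 30)[:, None]
    y = np.sin(x[:, 0]) + 0.1 * rng.normal(size=30)   # noisy samples of sin

    lam = 1e-2
    K = gauss_kernel(x, x)
    c = np.linalg.solve(K + lam * np.eye(len(x)), y)  # c = (K + lam*I)^{-1} y

    x_test = np.array([[np.pi / 2]])
    print("f(pi/2) =", gauss_kernel(x_test, x) @ c, "  sin(pi/2) =", 1.0)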

Kernel PCA

Any algorithm that depends on the observed data only through its Gram matrix (i.e., its matrix of inner products) can be kernelized!
Classical example: Principal Component Analysis.
The original paper of Schölkopf, Smola, and Müller still says it best.
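A compact Python sketch of the idea (mine, not the paper's code): center the kernel matrix in feature space, take its leading eigenvectors, normalize them, and read off the projections of the training points. The polynomial kernel and the number of components below are arbitrary choices.

    # Illustrative sketch: kernel PCA via the centered kernel matrix.
    import numpy as np

    def kernel_pca(K, num_components=2):
        n = K.shape[0]
        one = np.ones((n, n)) / n
        Kc = K - one @ K - K @ one + one @ K @ one   # centering in feature space
        vals, vecs = np.linalg.eigh(Kc)              # eigenvalues in ascending order
        vals, vecs = vals[::-1], vecs[:, ::-1]       # reorder: largest first
        alphas = vecs[:, :num_components] / np.sqrt(
            np.maximum(vals[:num_components], 1e-12))
        return Kc @ alphas                           # projections of training points

    X = np.random.default_rng(4).normal(size=(50, 3))
    K = (1.0 + X @ X.T) ** 2                         # inhomogeneous polynomial kernel
    print(kernel_pca(K).shape)                       # (50, 2)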

Carlos Rodriguez <carlos@math.albany.edu>
Last modified: 02-08-28