
Regression

Carlos C. Rodríguez
https://omega0.xyz/omega8008/

Classical Regression

Recall the classical regression problem. Given observed data (x_1,y_1),…,(x_n,y_n), iid as a vector (X,Y) ∈ R^{d+1}, we want to estimate f*(x) = E(Y|X=x), i.e., the regression function of Y on X. When the vector (X,Y) is multivariate gaussian the regression function is f*(x) = a + L(x) with L(x) linear, and the Ordinary (yak!) Least Squares (OLS) estimator coincides with the MLE (Maximum Likelihood Estimator). Very often the distribution of (X,Y) is not explicitly known beyond the n observations and the available prior information about the meaning of these data. A typical assumption is to think of the y_j as the result of sampling the regression function at the x_j with gaussian measurement error. The model is that, conditionally on x_1,…,x_n, the values y_1,…,y_n are independent, with Y_j depending on X_j only and Y_j|X_j=x_j distributed as N(f*(x_j), σ²). Thus, for j = 1,2,…,n,

\[
y_j = f^*(x_j) + \epsilon_j
\]
where ε_1, ε_2, …, ε_n are iid N(0, σ²). In this way the distribution of (X,Y) is modeled semiparametrically by (f, σ), where f ∈ F is a function in some space of functions F and σ > 0 is a scalar parameter. When F is taken to be the m-dimensional space F_m generated by functions g_1(x),…,g_m(x), estimation of the regression function reduces to the finite-dimensional least squares problem,

\[
\hat f = \arg\min_{f \in F_m} \sum_{i=1}^n \big( y_i - f(x_i) \big)^2 .
\]
The solution is then given by the orthogonal (euclidean) projection of the observed vector y^T = (y_1, y_2, …, y_n) onto the space generated by the columns of the matrix G ∈ R^{n×m} with entries G_{ij} = g_j(x_i). In fact, the above optimization problem can be written as,

\[
\| y - \hat y \|^2 = \min_{w \in R^m} \| y - Gw \|^2 .
\]
[Figure regplot.gif: orthogonal projection of y onto the column space of G.]
As the picture shows, the residual vector (i.e., y minus its projection) must be orthogonal to the linear space generated by the columns of G, and in particular (equivalently) to each of these columns, which yields the standard set of normal equations,

\[
0 = G^T (y - \hat y) = G^T (y - G \hat w)
\]
with solution,

\[
\hat w = (G^T G)^{-1} G^T y .
\]
The more general case of Weighted Least Squares (WLS), corresponding to the inner product ⟨x, z⟩ = x^T A z generated by a symmetric positive definite matrix A, is just,

\[
\hat w = (G^T A G)^{-1} G^T A\, y .
\]
The matrix A encodes a covariance structure for the measurement errors ε_1,…,ε_n; typically A is taken to be the inverse of their covariance matrix.
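As a concrete illustration (not part of the original notes), here is a minimal numpy sketch of the OLS/WLS fit via the normal equations; the basis functions, the optional weight matrix A, and the toy data are all assumptions made up for this example.

```python
import numpy as np

def least_squares_fit(x, y, basis, A=None):
    """Fit f(x) = sum_j w_j g_j(x) by (weighted) least squares.

    basis : list of functions g_1, ..., g_m
    A     : optional symmetric positive definite weight matrix (WLS);
            if None, ordinary least squares is used.
    Returns the coefficient vector w_hat = (G^T A G)^{-1} G^T A y.
    """
    G = np.column_stack([g(x) for g in basis])   # design matrix, G[i, j] = g_j(x_i)
    if A is None:
        A = np.eye(len(y))
    return np.linalg.solve(G.T @ A @ G, G.T @ A @ y)

# Toy usage: fit a quadratic to noisy samples
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 30)
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(scale=0.5, size=x.size)
w_hat = least_squares_fit(x, y, [lambda t: np.ones_like(t), lambda t: t, lambda t: t**2])
print(w_hat)   # roughly [1, 2, -3]
```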

Overfitting and Kernel Regression

How should F be chosen? On the one hand, we would like F to be big, so as not to constrain the form of the true regression function too much. On the other hand, big F's make the task of searching for the best f ∈ F more difficult and, more importantly, without a constraint on the explanatory capacity of F the solution will show no power of generalization. A big enough F will always have at least one member f able to fit all the observations perfectly, without error, but this f provides no assurance that f(x) is not as bad as it can possibly be at a point x outside the training set. To be able to guarantee that the size of the mistake on future data will not exceed a given value with high probability (i.e., to have PAC bounds), we must constrain the capacity of F somehow. Over the years, statisticians and numerical analysts have invented all kinds of ad-hoc devices for achieving this goal. These are known as regularization methods. They boil down to adding a penalty term to the OLS empirical term, often of the form W(||f||) where W is an increasing function and F is assumed to be a space with a norm. In constrained form, the problem to be solved becomes,


\[
\min_{f \in F} \sum_{i=1}^n \big( y_i - f(x_i) \big)^2
\quad \text{subject to} \quad \|f\| \le r_n
\]
where the sequence of radii r_n → ∞ as n → ∞, but not too quickly (at a rate that depends on F), so that some form of asymptotic stochastic convergence of the solution f_n towards the projection of the true regression function f* onto F is achieved.
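For intuition (my illustration, not from the notes): if we take the penalty form with W(t) = λ_n t² and identify ||f|| with the euclidean norm of the coefficient vector w over the basis of F_m, the regularized problem is ordinary ridge regression, which has a closed form. The symbol λ_n and the helper below are assumptions for this sketch.

```python
import numpy as np

def ridge_fit(G, y, lam):
    """Penalized least squares: minimize ||y - G w||^2 + lam * ||w||^2.

    Closed form: w_hat = (G^T G + lam I)^{-1} G^T y.
    A larger lam plays the role of a smaller radius r_n in the
    constrained formulation above.
    """
    m = G.shape[1]
    return np.linalg.solve(G.T @ G + lam * np.eye(m), G.T @ y)
```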

Kernel Regression

Reproducing Kernel Hilbert Spaces (RKHS) provide convenient choices for F.
Theorem:
Let K be a Mercer kernel and let H_K be the associated RKHS. If
\[
C\big( (x_1, y_1, f(x_1)), \dots, (x_n, y_n, f(x_n)) \big)
\]
is any cost function that depends on f ∈ H_K only through the values of f at the observed x_j, then the minimum of
\[
U(f) = C\big( (x_1, y_1, f(x_1)), \dots, (x_n, y_n, f(x_n)) \big) + W(\| f \|),
\]
where W is a strictly increasing function, is always achieved at a point f_n ∈ H_K of the form,
\[
f_n(x) = \sum_{j=1}^n w_j\, K(x_j, x).
\]
Thus, when F = H_K, a big fat infinite dimensional space, the regularized empirical cost U = C + W is minimized by solving a classical regression problem over F_n = span{K(x_1,·),…,K(x_n,·)}.
Proof: The proof is surprisingly simple. Every f ∈ H_K can be written as f = g + h where g ∈ F_n and h ∈ F_n^⊥. We show that,
\[
U(f) = U(g+h) \ge U(g)
\]
with equality if and only if h = 0. This follows easily from the reproducing property of kernel spaces. For all j ≤ n,
\[
f(x_j) = \langle K(x_j, \cdot),\, g+h \rangle = \langle K(x_j, \cdot),\, g \rangle = g(x_j)
\]
since h ⊥ F_n by construction. Thus, C(f) = C(g). On the other hand, since W is strictly increasing and g ⊥ h, by the Pythagorean theorem we have,

\[
W(\|f\|) = W\big( (\|g\|^2 + \|h\|^2)^{1/2} \big) \ge W(\|g\|)
\]
with equality if and only if h = 0. Hence, U(f) = C(g) + W(||g+h||) ≥ C(g) + W(||g||) = U(g). ∎
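For instance (my illustration, not from the notes), taking the squared-error cost and W(t) = (λ/2)t² gives kernel ridge regression: by the theorem, the infinite dimensional problem collapses to the linear system (K + λI)w = y for the coefficients of f_n. A minimal numpy sketch, where the gaussian kernel, the bandwidth, λ, and the toy data are all assumptions:

```python
import numpy as np

def gaussian_kernel(x1, x2, bandwidth=0.3):
    """Mercer kernel K(x, x') = exp(-(x - x')^2 / (2 * bandwidth^2)) for 1-d inputs."""
    d2 = (x1[:, None] - x2[None, :]) ** 2
    return np.exp(-d2 / (2.0 * bandwidth**2))

def kernel_ridge_fit(x, y, lam=0.1, bandwidth=0.3):
    """Minimize sum_i (y_i - f(x_i))^2 + lam * ||f||_K^2 over the RKHS.

    By the representer theorem f_n(x) = sum_j w_j K(x_j, x), and the
    coefficients solve (K + lam * I) w = y.
    """
    K = gaussian_kernel(x, x, bandwidth)
    w = np.linalg.solve(K + lam * np.eye(len(x)), y)
    return lambda t: gaussian_kernel(np.atleast_1d(t), x, bandwidth) @ w

# Toy usage with synthetic 1-d data
rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0.0, 1.0, 30))
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)
f_hat = kernel_ridge_fit(x, y)
print(f_hat(np.array([0.25, 0.75])))
```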

Support Vector Regression

For given values a > 0 and ε > 0, define the empirical cost function,

\[
C = a \sum_{i=1}^n | y_i - f(x_i) |_\epsilon
\]
where
\[
|z|_\epsilon = \max\{\, 0,\; |z| - \epsilon \,\}
\]
is known as the ε-insensitive loss function, and take,

\[
W(\|f\|) = \frac{1}{2} \|f\|^2 .
\]
With these choices, kernel regression becomes support vector regression. The parameter ε controls the sparseness of the solution. The smoothing parameter a controls the importance of the empirical cost C relative to the complexity penalty W.
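In code the two ingredients are one-liners. The sketch below (variable names are mine; the defaults a = 1.5 and ε = 0.4 are the values used in the example at the end) evaluates the ε-insensitive loss and the regularized objective U = C + W for a kernel expansion f(x) = Σ_j w_j K(x_j, x) + b, for which ||f||² = w^T K w when the unpenalized offset b is excluded.

```python
import numpy as np

def eps_insensitive(z, eps=0.4):
    """|z|_eps = max(0, |z| - eps): residuals inside the eps-tube cost nothing."""
    return np.maximum(0.0, np.abs(z) - eps)

def svr_objective(w, b, K, y, a=1.5, eps=0.4):
    """U = a * sum_i |y_i - f(x_i)|_eps + (1/2) ||f||^2
    with f(x_i) = (K w)_i + b and ||f||^2 = w^T K w."""
    residuals = y - (K @ w + b)
    return a * np.sum(eps_insensitive(residuals, eps)) + 0.5 * w @ K @ w
```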
The derivation of the support vector regression problem follows closely the derivation of support vector machines for classification. We first set up a primal optimization problem for minimizing the above ε-insensitive regularized empirical cost over functions f(x) = ⟨w, x⟩ + b, for the euclidean inner product. Then we consider the dual problem. This turns out to be a simple quadratic programming problem that depends on the observed data only through the values of ⟨x_i, x_j⟩. Just as in the classification case, we can apply the kernel trick and reap the benefits of nonlinear kernel regression at the linear regression cost!

The Primal Problem for SV Regression

We seek the solution of,

\[
\text{minimize} \quad a \sum_{i=1}^n | y_i - f(x_i) |_\epsilon + \frac{1}{2} \|w\|^2
\]
over b and w, where f(x) = ⟨w, x⟩ + b. This is equivalent to,

\[
\text{minimize} \quad a \sum_{i=1}^n u_i + \frac{1}{2} \|w\|^2
\]
over u_i, w, b, subject to u_i ≥ |y_i − f(x_i)|_ε for i ≤ n. Each of the last n inequalities corresponds to three inequalities,

\[
u_i \ge 0, \qquad u_i \ge y_i - f(x_i) - \epsilon, \qquad u_i \ge f(x_i) - y_i - \epsilon .
\]
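To see the equivalence (a small step the notes leave implicit): since |z| − ε = max{z − ε, −z − ε}, dominating the maximum in |y_i − f(x_i)|_ε = max{0, |y_i − f(x_i)| − ε} is the same as dominating each term separately,

\[
u_i \ge | y_i - f(x_i) |_\epsilon
\iff
u_i \ge 0, \quad u_i \ge y_i - f(x_i) - \epsilon, \quad u_i \ge f(x_i) - y_i - \epsilon .
\]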
Applying the standard trick of adding non-negative slack variables ξ_i and ξ*_i, we soften the inequalities and allow small violations. So we replace the above constrained optimization problem with,

\[
\text{minimize} \quad a \sum_{i=1}^n u_i + \frac{a}{2} \sum_{i=1}^n (\xi_i + \xi_i^*) + \frac{1}{2} \|w\|^2
\]
subject to, for i ≤ n,
\[
\begin{gathered}
y_i - f(x_i) - \epsilon \le u_i + \xi_i, \qquad
f(x_i) - y_i - \epsilon \le u_i + \xi_i^*, \\
u_i \ge 0, \qquad \xi_i \ge 0, \qquad \xi_i^* \ge 0 .
\end{gathered}
\]
The objective function was chosen so that we can factor out a/2 and write,
\[
\frac{a}{2} \sum_{i=1}^n \big( \{ u_i + \xi_i \} + \{ u_i + \xi_i^* \} \big) + \frac{1}{2} \|w\|^2 .
\]
In this way we can get rid of the u_i by simply replacing u_i + ξ_i by ξ_i and u_i + ξ*_i by ξ*_i everywhere. Also, replace a/2 by a new a to obtain the problem:

(Primal)
\[
\text{minimize} \quad a \sum_{i=1}^n (\xi_i + \xi_i^*) + \frac{1}{2} \|w\|^2
\]
over w, b, ξ_i, ξ*_i, subject to, for i ≤ n,
\[
y_i - f(x_i) - \epsilon \le \xi_i, \qquad
f(x_i) - y_i - \epsilon \le \xi_i^*, \qquad
\xi_i \ge 0, \qquad \xi_i^* \ge 0,
\]
and f(x) = ⟨w, x⟩ + b.
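As a sanity check (my sketch, not the authors' code), the primal objective and constraints transcribe directly; the function names, the tolerance, and the defaults a = 1.5, ε = 0.4 are assumptions.

```python
import numpy as np

def primal_objective(w, xi, xi_star, a=1.5):
    """a * sum_i (xi_i + xi*_i) + (1/2) * ||w||^2."""
    return a * np.sum(xi + xi_star) + 0.5 * np.dot(w, w)

def primal_feasible(w, b, xi, xi_star, X, y, eps=0.4, tol=1e-9):
    """Check the primal constraints for f(x) = <w, x> + b, where X has rows x_i."""
    f = X @ w + b
    return (np.all(y - f - eps <= xi + tol)
            and np.all(f - y - eps <= xi_star + tol)
            and np.all(xi >= -tol)
            and np.all(xi_star >= -tol))
```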

The Dual Problem for SV Regression

The Lagrangian, in terms of non-negative Lagrange multipliers λ_i, λ*_i, β_i, β*_i, is
\[
\begin{aligned}
\mathcal{L} \;=\;& a \sum_i (\xi_i + \xi_i^*) + \frac{1}{2} \|w\|^2 \\
&+ \sum_i \lambda_i \,\{\, y_i - f(x_i) - \epsilon - \xi_i \,\}
 + \sum_i \lambda_i^* \,\{\, f(x_i) - y_i - \epsilon - \xi_i^* \,\} \\
&- \sum_i \beta_i\, \xi_i - \sum_i \beta_i^*\, \xi_i^* .
\end{aligned}
\]
To compute the dual we need to find,
\[
W(\lambda, \lambda^*, \beta, \beta^*) = \min_{w,\, b,\, \xi,\, \xi^*} \mathcal{L} .
\]
The values of w, b, ξ, ξ* where the minimum is achieved must satisfy,
\[
\begin{aligned}
\nabla_w \mathcal{L} = 0 \;&\Leftrightarrow\; w = \sum_i (\lambda_i - \lambda_i^*)\, x_i \\
\nabla_b \mathcal{L} = 0 \;&\Leftrightarrow\; \sum_i \lambda_i = \sum_i \lambda_i^* \\
\nabla_\xi \mathcal{L} = 0 \;&\Leftrightarrow\; \lambda_j + \beta_j = a \quad \text{for } j \le n \\
\nabla_{\xi^*} \mathcal{L} = 0 \;&\Leftrightarrow\; \lambda_j^* + \beta_j^* = a \quad \text{for } j \le n .
\end{aligned}
\]
Substituting these equations into the Lagrangian, all the terms involving ξ_i and ξ*_i disappear, and with them the β and β*. Therefore, W is only a function of λ and λ*. Writing K(x_i, x_j) for the inner products ⟨x_i, x_j⟩ (the kernel trick!), we get,

\[
\begin{aligned}
W(\lambda, \lambda^*) \;=\;& \frac{1}{2} \sum_{i,j} (\lambda_i - \lambda_i^*)(\lambda_j - \lambda_j^*)\, K(x_i, x_j) \\
&+ \sum_i (\lambda_i - \lambda_i^*)\, y_i \;-\; \epsilon \sum_i (\lambda_i + \lambda_i^*) \\
&- \sum_{i,j} (\lambda_i - \lambda_i^*)(\lambda_j - \lambda_j^*)\, K(x_i, x_j) .
\end{aligned}
\]
The first and last terms simplify to produce,

\[
W(\lambda, \lambda^*) = -\,\epsilon \sum_i (\lambda_i + \lambda_i^*)
\;-\; \frac{1}{2} \sum_{i,j} (\lambda_i - \lambda_i^*)(\lambda_j - \lambda_j^*)\, K(x_i, x_j)
\;+\; \sum_i (\lambda_i - \lambda_i^*)\, y_i .
\]
The dual problem becomes,
\[
\text{(Dual)} \qquad \max_{\lambda, \lambda^*} W(\lambda, \lambda^*)
\]
subject to:
\[
\sum_j \lambda_j = \sum_j \lambda_j^*, \qquad
0 \le \lambda_j \le a, \qquad 0 \le \lambda_j^* \le a,
\]
where we have replaced the equalities λ_j + β_j = a, involving λ_j ≥ 0 and β_j ≥ 0, by the equivalent inequalities shown above, which do not involve the βs.
As was the case for classification, the dual problem is a simple quadratic programming problem that can be solved with efficient, publicly available algorithms.
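The notes use Maple's QPsolve for this step (see the example at the end). As an alternative, hedged sketch, the dual can also be handed to a general-purpose solver; below, scipy's SLSQP method is used, the variable z stacks (λ, λ*), and the defaults a = 1.5, ε = 0.4 are the values from the example, while everything else is my own naming.

```python
import numpy as np
from scipy.optimize import minimize

def svr_dual_solve(K, y, a=1.5, eps=0.4):
    """Maximize W(lambda, lambda*) subject to sum_j lambda_j = sum_j lambda*_j
    and 0 <= lambda_j, lambda*_j <= a.  K is the n x n kernel matrix."""
    n = len(y)

    def neg_W(z):
        lam, lam_s = z[:n], z[n:]
        d = lam - lam_s
        # -W = (1/2) d^T K d + eps * sum(lam + lam*) - d^T y
        return 0.5 * d @ K @ d + eps * np.sum(lam + lam_s) - d @ y

    constraints = [{"type": "eq", "fun": lambda z: np.sum(z[:n]) - np.sum(z[n:])}]
    bounds = [(0.0, a)] * (2 * n)
    res = minimize(neg_W, np.zeros(2 * n), method="SLSQP",
                   bounds=bounds, constraints=constraints)
    return res.x[:n], res.x[n:]          # lambda, lambda*
```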
The solution from the QP solver is then used to produce the estimate,

\[
\hat w = \sum_i (\lambda_i - \lambda_i^*)\, x_i .
\]
The KKT complementarity conditions for the slack variables are of the type ξ_j(λ_j − a) = 0, so we can write the other complementarity conditions as follows,

\[
\begin{aligned}
\lambda_j \,\{\, y_j - \hat f(x_j) - \epsilon \,\} &= 0 \quad \text{provided } \lambda_j < a \\
\lambda_j^* \,\{\, \hat f(x_j) - y_j - \epsilon \,\} &= 0 \quad \text{provided } \lambda_j^* < a
\end{aligned}
\]
These are valid (and non-trivial) for all j ∈ J_0 and all j ∈ J*_0 respectively, where
\[
J_0 = \{\, j : j \le n \text{ and } 0 < \lambda_j < a \,\}
\]
with J*_0 defined analogously.
These, together with the complementarity conditions associated with the inequalities λ_j ≤ a and λ*_j ≤ a, force many of the λ_j and λ*_j to be either 0 or a, producing a sparse solution. The value of b can be obtained from any one of the above complementarity conditions, but a more accurate value is obtained by combining them. Substituting the estimated values of the regression function at the training points,

\[
\hat f(x_j) = \sum_i (\lambda_i - \lambda_i^*)\, K(x_i, x_j) + b
\]
into the complementarity conditions, solving for b, multiplying through by λ_j and λ*_j respectively, and adding over j ∈ J with J = J_0 ∪ J*_0, we get,


\[
\begin{aligned}
\sum_{j \in J} \lambda_j\, b &= \sum_{j \in J} \lambda_j \Big\{\, y_j - \sum_i (\lambda_i - \lambda_i^*)\, K(x_i, x_j) - \epsilon \,\Big\} \\
\sum_{j \in J} \lambda_j^*\, b &= \sum_{j \in J} \lambda_j^* \Big\{\, y_j - \sum_i (\lambda_i - \lambda_i^*)\, K(x_i, x_j) + \epsilon \,\Big\}
\end{aligned}
\]
adding the two equations, we finally obtain the estimate

\[
b = \frac{\displaystyle \sum_{j \in J} (\lambda_j + \lambda_j^*) \Big\{\, y_j - \sum_i (\lambda_i - \lambda_i^*)\, K(x_i, x_j) \,\Big\} \;+\; \epsilon \sum_{j \in J} (\lambda_j^* - \lambda_j)}{\displaystyle \sum_{j \in J} (\lambda_j + \lambda_j^*)} .
\]
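Continuing the scipy sketch above (my code, not the Maple program from the notes), the following helper recovers b with the averaged formula and assembles the kernel-expansion estimate; the function names and the tolerance used to detect 0 < λ_j < a numerically are assumptions.

```python
import numpy as np

def svr_recover_b_and_predict(K, y, lam, lam_s, a=1.5, eps=0.4, tol=1e-6):
    """Recover b from the complementarity conditions and build f_hat."""
    d = lam - lam_s                     # coefficients (lambda_i - lambda*_i)
    J = ((lam > tol) & (lam < a - tol)) | ((lam_s > tol) & (lam_s < a - tol))
    resid = y - K @ d                   # y_j - sum_i (lambda_i - lambda*_i) K(x_i, x_j)
    num = np.sum((lam[J] + lam_s[J]) * resid[J]) + eps * np.sum(lam_s[J] - lam[J])
    b = num / np.sum(lam[J] + lam_s[J])

    def f_hat(K_new):
        """K_new[t, i] = K(x_i, x_new_t); returns the estimated regression values."""
        return K_new @ d + b

    return b, f_hat
```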

Example: SV Regression in Action

The following picture shows n = 30 samples (the circles) from the true regression curve (the red curve) with gaussian error and σ = 0.5. The green curve is the estimated regression function computed using a gaussian kernel. The blue curves show the ε = 0.4 insensitive tube around the estimate. The support vectors are marked with plus signs, and a value of a = 1.5 was used. The Maple (9.5) code is available from this site and uses the QPsolve program, in efficient matrix form, from the new Maple Optimization package.
[Figure svreg.gif: SV regression example described above.]


