Uniform Laws of Large Numbers
Carlos C. Rodríguez
https://omega0.xyz/omega8008/
What is a Law of Large Numbers?
I am glad you asked!
The Laws of Large Numbers, or LLNs for short, come in three basic flavors:
Weak, Strong and Uniform. They all state that the observed frequencies
of events tend to approach the actual probabilities as the number of observations
increases. Put another way, the LLNs show that, under certain conditions,
we can asymptotically learn the probabilities of events from their
observed frequencies. To add some drama we could say that if God is not
cheating and S/he doesn't change the initial standard probabilistic model
too much then, in principle, we (or other machines, or even the
universe as a whole) could eventually find out the Truth, the whole
Truth, and nothing but the Truth.
Bull! The Devil is in the details.
I suspect that for reasons not too different in spirit to the ones above,
famous minds of the past took the slippery slope of defining probabilities
as the limits of relative frequencies. They became known as "frequentists".
They wrote the books and indoctrinated generations of confused students.
As we shall see below, all the LLNs follow from the addition and product
rules of probability theory. So, no matter what interpretation is ascribed
to the concept of probability, if the numerical values of the events
under consideration follow the addition and product rules then the LLNs
are just an inevitable logical consequence. In other words, you don't
have to be a frequentist to enjoy the LLNs. In fact, due to the very existence
of the LLNs, it is not possible to define probabilities with the limit
frequencies in a consistent way. This is simply because all LLNs state only
probabilistic convergence of frequencies to probabilities (the convergence
is either in probability or with probability 1). The concept that we want
to interpret (namely probability) is needed to define the very concept
(namely the LLNs) that is supposed to explain it. The frequentist concept of
probability eats its own tail!
The Weak Law
The Weak Law of Large Numbers (WLLN) goes back to the beginnings of
probability theory. It was discovered for the case of random coin
flips by James Bernoulli around 1700, but it only appeared in print
posthumously in his Ars Conjectandi in 1713. Later on, in the 1830s,
Poisson generalized the result to independent coin flips with possibly different biases.
After that, Tchebychev in 1866 discovered his inequality and generalized the law
for arbitrary sequences of independent random variables with second moments.
Finally, his student Markov extended it to some classes of dependent
random variables. Markov's inequality is almost a triviality but it has found
innumerable applications.
Theorem 1 [Markov's inequality]
If $X$ is nonnegative and $t > 0$,
$$P\{X \ge t\} \le \frac{EX}{t}.$$
Proof:
For $t > 0$,
$$X \ge X\,1[X \ge t] \ge t\,1[X \ge t]$$
and by the monotonicity of expectations we find that,
$$EX \ge t\,E\,1[X \ge t] = t\,P\{X \ge t\}. \qquad \blacksquare$$
Two important consequences of Markov's inequality are:
- Tchebychev's inequality
If $V(X)$ denotes the variance of $X$ then,
$$P\{|X - EX| \ge t\} = P\{|X - EX|^2 \ge t^2\} \le \frac{V(X)}{t^2}.$$
- Chernoff's method
For $t > 0$, find the best $s > 0$ in,
$$P\{X \ge t\} = P\{e^{sX} \ge e^{st}\} \le \frac{E\,e^{sX}}{e^{st}}.$$
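As a quick numerical illustration (a sketch in Python with NumPy, not part of the original argument; the distribution and the values of $t$ are arbitrary choices), the snippet below compares the three bounds with the exact tail of an Exp(1) variable, for which $EX = 1$, $V(X) = 1$ and $E\,e^{sX} = 1/(1-s)$ for $s < 1$, so everything is available in closed form:

```python
import numpy as np

# X ~ Exp(1): EX = 1, V(X) = 1, E e^{sX} = 1/(1-s) for s < 1,
# and the exact tail is P{X >= t} = e^{-t}.
ts = np.array([2.0, 4.0, 8.0])          # needs t > 1 below

exact = np.exp(-ts)
markov = 1.0 / ts                       # Markov: EX / t
tcheby = 1.0 / (ts - 1.0) ** 2          # Tchebychev: P{X >= t} <= P{|X-1| >= t-1} <= 1/(t-1)^2
s = 1.0 - 1.0 / ts                      # the best s in Chernoff's method for Exp(1)
chernoff = np.exp(-s * ts) / (1.0 - s)  # equals e * t * e^{-t}

for row in zip(ts, exact, markov, tcheby, chernoff):
    print("t={:4.1f}  exact={:.2e}  Markov={:.2e}  Tchebychev={:.2e}  Chernoff={:.2e}".format(*row))
```

Already at $t = 8$ the Chernoff bound is within a factor of $et$ of the exact tail, while the Markov and Tchebychev bounds are off by orders of magnitude.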
Now, when $X_1, X_2, \ldots, X_n$ are independent and identically distributed (iid) as $X$, the sample mean,
$$\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i$$
has mean $EX$ and variance $V(X)/n$, so by Tchebychev, for any $\epsilon > 0$,
$$P\left\{\left|\bar{X}_n - EX\right| \ge \epsilon\right\} \le \frac{V(X)}{n\epsilon^2}$$
and it immediately follows that,
$$\lim_{n\to\infty} P\left\{\left|\bar{X}_n - EX\right| \ge \epsilon\right\} = 0$$
which is what is meant by the sentence "the sample mean converges in
probability to the expected value". That's the WLLN. For the special case
of coin flips, i.e. for binary r.v.'s Bin(p), with P{X=1} = 1-P{X=0}=p
the Tchebychev bound gives,
$$P\left\{\left|\bar{X}_n - p\right| \ge \epsilon\right\} \le \frac{p(1-p)}{n\epsilon^2}$$
showing that the observed frequency of ones converges in probability to the
true probability p of observing a 1.
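To watch the weak law in action, here is a small Monte Carlo sketch in Python with NumPy (illustrative only; the values of $p$, $\epsilon$ and the number of repetitions are arbitrary choices). It estimates the deviation probability for a $p$-coin and compares it with the Tchebychev bound above:

```python
import numpy as np

rng = np.random.default_rng(0)
p, eps, reps = 0.3, 0.05, 100_000

for n in (100, 1_000, 10_000):
    # observed frequency of ones in n flips of a p-coin, repeated `reps` times
    freq = rng.binomial(n, p, size=reps) / n
    estimate = np.mean(np.abs(freq - p) >= eps)
    bound = p * (1 - p) / (n * eps**2)
    print(f"n={n:6d}  P(|freq - p| >= eps) ~ {estimate:.4f}   Tchebychev bound: {min(bound, 1.0):.4f}")
```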
The Strong Law
The bounds above obtained from Tchebychev's inequality are very poor.
By using Chernoff's method an exponential bound can be obtained. In fact we
have,
- Hoeffding's inequality
$$P\left\{\left|\bar{X}_n - p\right| \ge \epsilon\right\} \le 2\,e^{-2n\epsilon^2}$$
and by the classic Borel-Cantelli lemma it follows that,
$$P\left\{\omega : \lim_{n\to\infty} \bar{X}_n(\omega) = p\right\} = 1$$
which says that the observed frequency of ones converges with probability one (or a.s., for almost surely) to the true probability $p$ of observing a 1.
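Numerically, the gap between the polynomial and the exponential bounds is dramatic. The sketch below (same hypothetical Monte Carlo setup as before, now with a fair coin) prints the estimated deviation probability next to both bounds; Tchebychev's shrinks like $1/n$ while Hoeffding's collapses exponentially:

```python
import numpy as np

rng = np.random.default_rng(1)
p, eps, reps = 0.5, 0.05, 100_000

for n in (100, 500, 1_000, 2_000):
    freq = rng.binomial(n, p, size=reps) / n
    estimate = np.mean(np.abs(freq - p) >= eps)
    tcheby = min(p * (1 - p) / (n * eps**2), 1.0)      # decays like 1/n
    hoeffding = min(2 * np.exp(-2 * n * eps**2), 1.0)  # decays exponentially in n
    print(f"n={n:5d}  estimate={estimate:.5f}  Tchebychev={tcheby:.5f}  Hoeffding={hoeffding:.5f}")
```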
The proof of Hoeffding's inequality uses the following result for bounded r.v.'s with zero mean.
Lemma 1
If $EX = 0$ and $a \le X \le b$, then for any $s > 0$,
$$E\,e^{sX} \le e^{s^2(b-a)^2/8}.$$
Proof:
Let $a \le x \le b$ and define $\lambda \in [0,1]$ by
$$\lambda = \frac{x-a}{b-a}, \qquad \text{so that } x = \lambda b + (1-\lambda)a.$$
Notice that for any $s > 0$ we have,
$$sx = \lambda\, sb + (1-\lambda)\, sa.$$
Thus, the convexity of $\exp(\cdot)$ implies,
$$e^{sx} \le \frac{b-x}{b-a}\,e^{sa} + \frac{x-a}{b-a}\,e^{sb}.$$
Replacing $x$ with the r.v. $X$, taking expectations, and letting $p = -a/(b-a)$ (notice that $EX = 0$ implies $p \in [0,1]$) we can write,
$$E\,e^{sX} \le \frac{b}{b-a}\,e^{sa} - \frac{a}{b-a}\,e^{sb} = \left(1 - p + p\,e^{s(b-a)}\right)e^{-ps(b-a)} = e^{f(u)}$$
where $u = s(b-a)$ and,
$$f(u) = -pu + \log\left(1 - p + p\,e^{u}\right).$$
The lemma will follow from the last inequality above by showing that,
$$f(u) \le \frac{u^2}{8} = \frac{s^2(b-a)^2}{8}.$$
To see that this is true just expand $f(u)$ about zero,
$$f(u) = f(0) + u f'(0) + \frac{1}{2}\,u^2 f''(\theta)$$
where $\theta \in [0,u]$ exists by Taylor's theorem, and notice that $f(0) = f'(0) = 0$ and
$$f''(u) = \frac{p(1-p)e^{-u}}{\left(p + (1-p)e^{-u}\right)^2} \le \frac{1}{4};$$
this is just a special case of $z(1-z) \le 1/4$ for $z = p/(p + (1-p)e^{-u})$. Alternatively, just set the derivative equal to 0 to find that the max ($1/4$) is achieved when $e^{-u} = p/(1-p)$. $\blacksquare$
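Here is a tiny numerical confirmation of the key bound $f(u) \le u^2/8$ over a grid of values of $p$ and $u$ (a sanity check in Python with NumPy, not a proof; the grid is an arbitrary choice):

```python
import numpy as np

p = np.linspace(0.01, 0.99, 99)[:, None]    # p = -a/(b-a) ranges over (0,1)
u = np.linspace(0.0, 5.0, 501)[None, :]     # u = s(b-a) with s > 0
f = -p * u + np.log(1 - p + p * np.exp(u))  # f(u) as in the proof above
gap = (u**2 / 8 - f).min()
print(f"min of u^2/8 - f(u) over the grid: {gap:.6f}  (nonnegative, as claimed)")
```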
Notice that for the special case of $X \in \{1,-1\}$, with equal probability $1/2$ for each value, the result follows at once from,
$$E\,e^{sX} = \cosh(s) = \sum_{k=0}^{\infty}\frac{s^{2k}}{(2k)!} \le \sum_{k=0}^{\infty}\frac{(s^2/2)^k}{k!} = e^{s^2/2}$$
by comparing the two series term by term (since $(2k)! \ge 2^k\,k!$). It is just this case that is needed in the main VC theorem below.
We are now ready to prove Hoeffding's inequality.
Proof: [Hoeffding's inequality]
We actually show a more general version for $X_1,\ldots,X_n$ independent with $a_i \le X_i \le b_i$. Let $Z_i = X_i - EX_i$; for any $s > 0$ we have,
$$P\left\{\left|\sum_{i=1}^n Z_i\right| \ge t\right\} = P\left\{\sum_{i=1}^n Z_i \ge t\right\} + P\left\{\sum_{i=1}^n (-Z_i) \ge t\right\} \le 2\,e^{-st}\prod_{i=1}^n e^{s^2(b_i-a_i)^2/8} = 2\exp\left\{\frac{s^2}{8}\sum_{i=1}^n (b_i-a_i)^2 - st\right\}$$
where we are using Chernoff's method and the previous lemma. The upper bound is optimized when $s = 4t/\sum_{i=1}^n(b_i-a_i)^2$, producing,
$$P\left\{\left|\sum_{i=1}^n Z_i\right| \ge t\right\} \le 2\,e^{-2t^2/\sum_{i=1}^n(b_i-a_i)^2}$$
which implies the claimed bound for the special case of coin flips: just replace $t = n\epsilon$ and notice that for binary variables $\sum_{i=1}^n(b_i-a_i)^2 = n$. $\blacksquare$
The Modern Strong Uniform Laws
The historical evolution of the laws of large numbers has coincided with important paradigm shifts in the theory of probability. The weak law of Bernoulli and Poisson, with the later refinements of Tchebychev and Markov, is characteristic of the early era of probability. Then came the strong laws of Borel, Cantelli, Kolmogorov and others. These characterized the time of the axiomatic formalization of probability as part of measure theory during the first part of the twentieth century. The latest addition to this saga is what we'll concentrate on here: the so-called strong uniform laws, which have a combinatorial flavor and were discovered by Vapnik and Chervonenkis in the 1970s in connection with statistical learning.
We start with a powerful generalization of Hoeffding's inequality for general functions of independent r.v.'s satisfying the bounded difference assumption. Let $S \subset \mathbb{R}^n$ and denote by $e_i \in \mathbb{R}^n$ the $i$th canonical vector, with all zeros except for a 1 in the $i$th position. We say that a function $h : S \to \mathbb{R}$ has bounded differences in $S$ if for all $1 \le i \le n$,
$$|h(x) - h(x + t e_i)| \le c_i$$
for all $x \in S$ and all $t \in \mathbb{R}$ such that $(x + t e_i) \in S$. This means that the function does not change by more than $c_i$ along the $i$th direction. We have,
- McDiarmid's inequality
Let $h$ have bounded differences. For all $t > 0$,
$$P\left\{\left|h(X_1,\ldots,X_n) - Eh\right| \ge t\right\} \le 2\,e^{-2t^2/\sum_i c_i^2}$$
Notice that when $h = \sum_i X_i$ (so that $c_i = b_i - a_i$) we recover Hoeffding's inequality.
Proof: [McDiarmid's inequality]
The idea is to write,
$$h - Eh = \sum_{i=1}^n Z_i$$
by using,
$$Z_i = Z_i(X_1,\ldots,X_i) = E\{h\,|\,X_1,\ldots,X_i\} - E\{h\,|\,X_1,\ldots,X_{i-1}\}.$$
These $Z_i$ have zero mean and are bounded a.s. within an interval $[L_i, U_i]$, with the lower and upper limits given by the inf and sup over $X_i = u$ of $Z_i$. Thus, $L_i$ and $U_i$ depend only on $X_1,\ldots,X_{i-1}$, and $U_i - L_i \le c_i$ is inherited from the bounded difference assumption about $h$.
Therefore, using Chernoff's method and the previous lemma we have that for all $s > 0$,
$$P\left\{h - Eh \ge t\right\} \le e^{-st}\,E\,e^{s\sum_{i=1}^n Z_i} = e^{-st}\,E\left\{e^{s\sum_{i=1}^{n-1} Z_i}\,E\{e^{sZ_n}\,|\,X_1,\ldots,X_{n-1}\}\right\} \le \cdots \le e^{-st}\prod_{i=1}^n e^{s^2 c_i^2/8}$$
where the lemma was used $n$ times. Now optimize $s$ and copy the steps used for the proof of Hoeffding's inequality to obtain the result. $\blacksquare$
- Corollary
Let $\nu_n$ be the empirical probability measure based on the iid sample $X_1, X_2, \ldots, X_n$ with common distribution $\nu$. The function,
$$h_n = h_n(X_1,\ldots,X_n) = \sup_{A\in\mathcal{A}}\left|\nu_n\{A\} - \nu\{A\}\right|$$
has bounded differences for any class of sets $\mathcal{A}$.
Proof: By changing only one of the $X_i$, the function $h_n$ changes by at most $c_i = 1/n$. $\blacksquare$
It then follows immediately from McDiarmid's inequality (here $\sum_i c_i^2 = n\,(1/n)^2 = 1/n$) that,
$$P\left\{|h_n - Eh_n| \ge t\right\} \le 2\,e^{-2nt^2}.$$
Thus, if we can show that $Eh_n \to 0$ as $n\to\infty$, we can deduce from the above inequality that, for any $t > 0$ and any $n$ large enough that $Eh_n \le t/2$ (so that $h_n \ge t$ implies $|h_n - Eh_n| \ge t/2$),
$$P\left\{\sup_{A\in\mathcal{A}}\left|\nu_n\{A\} - \nu\{A\}\right| \ge t\right\} \le 2\,e^{-nt^2/2}$$
and by the Borel-Cantelli lemma we would have obtained that,
$$\sup_{A\in\mathcal{A}}\left|\nu_n\{A\} - \nu\{A\}\right| \to 0 \quad \text{a.s.}$$
as $n\to\infty$, i.e. we'll have a uniform strong law of large numbers over the class $\mathcal{A}$.
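To make this concrete, take $\mathcal{A}$ to be the class of half-lines $(-\infty, a]$ and let the $X_i$ be uniform on $[0,1]$, so that $\nu\{(-\infty,a]\} = a$ for $a \in [0,1]$. For this class $h_n$ is the classical Kolmogorov-Smirnov statistic and can be computed exactly from the order statistics. The simulation below (an illustrative Python sketch; sample sizes and seed are arbitrary choices) shows the supremum shrinking as $n$ grows, just as the uniform strong law promises for this class:

```python
import numpy as np

rng = np.random.default_rng(2)

def sup_deviation(x):
    """sup over half-lines (-inf, a] of |nu_n{A} - nu{A}| for Uniform(0,1) data."""
    x = np.sort(x)
    n = len(x)
    i = np.arange(1, n + 1)
    # the sup of |F_n(a) - a| is attained at (or just before) a data point
    return max((i / n - x).max(), (x - (i - 1) / n).max())

for n in (100, 1_000, 10_000, 100_000):
    print(f"n={n:7d}   sup_A |nu_n(A) - nu(A)| = {sup_deviation(rng.random(n)):.5f}")
```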
Enter Combinatorics
If $\mathcal{A}$ is a collection of subsets of $\mathbb{R}^d$, we define the shatter coefficients associated to the class $\mathcal{A}$ as,
$$S(n,\mathcal{A}) = \max_{x_1,\ldots,x_n \in \mathbb{R}^d}\left|\left\{A\cap\{x_1,\ldots,x_n\} : A \in \mathcal{A}\right\}\right|.$$
The integer $S(n,\mathcal{A})$ is the maximum number of subsets of a set of $n$ points that appear in elements of $\mathcal{A}$.
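For simple classes the shatter coefficients can be computed by brute force. The sketch below (illustrative Python code, not from the original text) counts the subsets of $n$ collinear points picked out by closed intervals $[a,b]$, and checks the count against the exact formula $n(n+1)/2 + 1$; since $\log S(n,\mathcal{A})$ then grows only like $\log n$, intervals will satisfy the hypothesis of the inequality that follows:

```python
def shatter_intervals(points):
    """Number of distinct subsets of `points` picked out by closed intervals [a, b]."""
    pts = sorted(points)
    picked = {frozenset()}  # an interval missing all the points picks the empty set
    for i in range(len(pts)):
        for j in range(i, len(pts)):
            # [pts[i], pts[j]] picks exactly the points pts[i], ..., pts[j]
            picked.add(frozenset(pts[i:j + 1]))
    return len(picked)

for n in range(1, 7):
    print(f"n={n}: S(n, intervals) = {shatter_intervals(range(n))}"
          f"  vs  n(n+1)/2 + 1 = {n * (n + 1) // 2 + 1}")
```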
Here is a post-modern version of the Vapnik-Chervonenkis inequality due to Devroye and Lugosi.
- Theorem: [VC inequality]
$$E\left\{\sup_{A\in\mathcal{A}}\left|\nu_n\{A\} - \nu\{A\}\right|\right\} \le 2\left(\frac{\log 2S(n,\mathcal{A})}{n}\right)^{1/2}.$$
Before proving this, notice that classes $\mathcal{A}$ for which the rhs of the above inequality goes to zero allow strong uniform laws of large numbers. In other words, the class $\mathcal{A}$ must not be too rich: the logarithm of its shatter coefficients must increase at a rate slower than $n$. The proof uses the following lemma, which is also of independent interest.
- Lemma
If $E\,e^{sZ_i} \le e^{s^2c^2/2}$ for all $s > 0$ and all $i \le n$, then,
$$E\left\{\max_{i\le n} Z_i\right\} \le c\,(2\log n)^{1/2}.$$
Proof:
For any $s > 0$, Jensen's inequality and the hypothesis give,
$$e^{s\,E\{\max_{i\le n} Z_i\}} \le E\,e^{s\max_{i\le n} Z_i} = E\left\{\max_{i\le n} e^{sZ_i}\right\} \le \sum_{i=1}^n E\,e^{sZ_i} \le n\,e^{s^2c^2/2}.$$
Hence, taking logarithms,
$$E\left\{\max_{i\le n} Z_i\right\} \le \frac{\log n}{s} + \frac{sc^2}{2}$$
is valid for any $s > 0$. The best bound, claimed by the lemma, is obtained at $s = c^{-1}(2\log n)^{1/2}$. $\blacksquare$
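Gaussians give a clean test case for this lemma: $Z_i \sim N(0, c^2)$ satisfies $E\,e^{sZ_i} = e^{s^2c^2/2}$ with equality, and the lemma does not even require independence, although the sketch below (illustrative Python with arbitrary parameters) uses independent draws for convenience:

```python
import numpy as np

rng = np.random.default_rng(3)
c, reps = 1.0, 10_000

for n in (10, 100, 1_000):
    z = rng.normal(0.0, c, size=(reps, n))  # each Z_i ~ N(0, c^2)
    emax = z.max(axis=1).mean()             # Monte Carlo estimate of E max_{i<=n} Z_i
    bound = c * np.sqrt(2 * np.log(n))
    print(f"n={n:5d}  E max ~ {emax:.3f}   c*sqrt(2 log n) = {bound:.3f}")
```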
Proof: [VC inequality]
We divide the proof into three simple parts. First we show,
First symmetrization
$$E\left\{\sup_{A\in\mathcal{A}}\left|\nu_n\{A\} - \nu\{A\}\right|\right\} \le E\left\{\sup_{A\in\mathcal{A}}\left|\nu_n\{A\} - \nu_n'\{A\}\right|\right\}$$
where $\nu_n'$ denotes the empirical measure associated to an independent copy $X_1',\ldots,X_n'$ of the original sample $X_1,\ldots,X_n$. This is just a simple fact that follows from two applications of Jensen's inequality and the fact that the unconditional expectation is the expectation of the expectation conditional on the original sample,
$$E\left\{\sup_{A\in\mathcal{A}}\left|\nu_n\{A\} - \nu\{A\}\right|\right\} = E\left\{\sup_{A\in\mathcal{A}}\left|E\left\{\nu_n\{A\} - \nu_n'\{A\}\,\middle|\,X_1,\ldots,X_n\right\}\right|\right\}$$
$$\le E\left\{\sup_{A\in\mathcal{A}} E\left\{\left|\nu_n\{A\} - \nu_n'\{A\}\right|\,\middle|\,X_1,\ldots,X_n\right\}\right\} \le E\left\{E\left\{\sup_{A\in\mathcal{A}}\left|\nu_n\{A\} - \nu_n'\{A\}\right|\,\middle|\,X_1,\ldots,X_n\right\}\right\} = E\left\{\sup_{A\in\mathcal{A}}\left|\nu_n\{A\} - \nu_n'\{A\}\right|\right\}.$$
The second step is,
Second symmetrization
Introduce, independently of the two samples, $n$ independent random signs $\epsilon_1,\ldots,\epsilon_n$, i.e. $P\{\epsilon_i = 1\} = P\{\epsilon_i = -1\} = 1/2$, and notice that if the $Z_i$ are any independent r.v.'s symmetric about 0 then the joint distribution of $\epsilon_1 Z_1,\ldots,\epsilon_n Z_n$ is the same as the joint distribution of $Z_1,\ldots,Z_n$. Hence,
$$E\left\{\sup_{A\in\mathcal{A}}\left|\nu_n\{A\} - \nu\{A\}\right|\right\} \le \frac{1}{n}\,E\left\{\sup_{A\in\mathcal{A}}\left|\sum_{i=1}^n \epsilon_i\left(1[X_i \in A] - 1[X_i' \in A]\right)\right|\right\}$$
where we used $Z_i = 1[X_i \in A] - 1[X_i' \in A]$. Finally, the third step,
Counting and bounding
Here is where combinatorics gets into the picture. To compute the sup over the class $\mathcal{A}$ we only need to check a finite number of sets $A \in \mathcal{A}$, namely those that pick different subsets of the $2n$ values $\{x_1, x_1',\ldots,x_n,x_n'\}$. Thus, we only need to check at most $m = S(2n,\mathcal{A})$ sets in $\mathcal{A}$ to find the sup. Let's denote these sets by $A_1, A_2,\ldots,A_m$ and let,
$$Y_j = \sum_{i=1}^n \epsilon_i\left(1[X_i \in A_j] - 1[X_i' \in A_j]\right);$$
we can then write,
$$E\left\{\sup_{A\in\mathcal{A}}\left|\nu_n\{A\} - \nu\{A\}\right|\right\} \le \frac{1}{n}\,E\left\{\max_{j\le m}|Y_j|\right\} = \frac{1}{n}\,E\left\{\max\{Y_1,-Y_1,\ldots,Y_m,-Y_m\}\right\}.$$
Now we apply the previous lemma by noticing that,
$$E\,e^{sY_j} = E\,e^{-sY_j} \le \prod_{i=1}^n e^{s^2/2} = e^{ns^2/2}$$
and, applying the lemma with $c = \sqrt{n}$ to the $2m$ variables $\pm Y_j$, obtain,
$$E\left\{\sup_{A\in\mathcal{A}}\left|\nu_n\{A\} - \nu\{A\}\right|\right\} \le \frac{\sqrt{n}}{n}\,\left(2\log 2m\right)^{1/2};$$
the result follows by noticing that $m = S(2n,\mathcal{A}) \le S(n,\mathcal{A})^2$. $\blacksquare$
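As a closing illustration (a Python sketch under the same hypothetical half-line setup used earlier, with arbitrary sample sizes), note that for the class of half-lines $(-\infty, a]$ we have $S(n,\mathcal{A}) = n + 1$, so the theorem gives $E\,h_n \le 2\left(\log(2(n+1))/n\right)^{1/2}$. The snippet compares a Monte Carlo estimate of $E\,h_n$ with this bound:

```python
import numpy as np

rng = np.random.default_rng(4)

def sup_deviation(x):
    # sup over half-lines (-inf, a] of |nu_n{A} - nu{A}| for Uniform(0,1) data
    x = np.sort(x)
    n = len(x)
    i = np.arange(1, n + 1)
    return max((i / n - x).max(), (x - (i - 1) / n).max())

for n in (100, 1_000, 10_000):
    est = np.mean([sup_deviation(rng.random(n)) for _ in range(200)])
    vc_bound = 2 * np.sqrt(np.log(2 * (n + 1)) / n)
    print(f"n={n:6d}  E sup (Monte Carlo) ~ {est:.4f}   VC bound = {vc_bound:.4f}")
```

The bound is conservative for this class, but it goes to zero, which is all the uniform strong law needs.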