Uniform Laws of Large Numbers
Carlos C. Rodríguez
https://omega0.xyz/omega8008/
What is a Law of Large Numbers?
I am glad you asked!
The Laws of Large Numbers, or LLNs for short, come in three basic flavors:
Weak, Strong and Uniform. They all state that the observed frequencies
of events tend to approach the actual probabilities as the number of observations
increases. Put another way, the LLNs show that, under certain conditions,
we can asymptotically learn the probabilities of events from their
observed frequencies. To add some drama we could say that if God is not
cheating and S/he doesn't change the initial standard probabilistic model
too much then, in principle, we (or other machines, or even the
universe as a whole) could eventually find out the Truth, the whole
Truth, and nothing but the Truth.
Bull! The Devil is in the details.
I suspect that for reasons not too different in spirit to the ones above,
famous minds of the past took the slippery slope of defining probabilities
as the limits of relative frequencies. They became known as "frequentists".
They wrote the books and indoctrinated generations of confused students.
As we shall see below, all the LLNs follow from the addition and product
rules of probability theory. So, no matter what interpretation is ascribed
to the concept of probability, if the numerical values of the events
under consideration follow the addition and product rules then the LLNs
are just an inevitable logical consequence. In other words, you don't
have to be a frequentist to enjoy the LLNs. In fact, due to the very existence
of the LLNs, it is not possible to define probabilities with the limit
frequencies in a consistent way. This is simply because all LLNs state only
probabilistic convergence of frequencies to probabilities (the convergence
is either in probability or with probability 1). The concept that we want
to interpret (namely probability) is needed to define the very concept
(namely the LLNs) that is supposed to explain it. The frequentist concept of
probability eats its own tail!
The Weak Law
The Weak Law of Large Numbers (WLLN) goes back to the beginnings of
probability theory. It was discovered for the case of random coin
flips by James Bernoulli around 1700, but it only appeared in print
posthumously in his Ars Conjectandi in 1713. Later on, in the 1830s,
Poisson generalized the result to independent coin flips with possibly different biases.
After that, Tchebychev in 1866 discovered his inequality and generalized the law
for arbitrary sequences of independent random variables with second moments.
Finally, his student Markov extended it to some classes of dependent
random variables. Markov's inequality is almost a triviality but it has found
innumerable applications.
Theorem 1 [Markov's inequality]
If $X$ is nonnegative and $t > 0$,
$$P\{X \ge t\} \le \frac{EX}{t}.$$
Proof:
For $t > 0$,
$$X \ge X\,1[X \ge t] \ge t\,1[X \ge t]$$
and by the monotonicity of expectations we find that,
$$EX \ge t\,E\,1[X \ge t] = t\,P\{X \ge t\}. \qquad \blacksquare$$
Two important consequences of Markov's inequality are:
- Tchebychev's inequality
If $V(X)$ denotes the variance of $X$ then,
$$P\{|X - EX| \ge t\} = P\{|X - EX|^2 \ge t^2\} \le \frac{V(X)}{t^2}.$$
- Chernoff's method
For $t > 0$, find the best $s > 0$ in,
$$P\{X \ge t\} = P\{e^{sX} \ge e^{st}\} \le \frac{E\,e^{sX}}{e^{st}}.$$
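As a quick numerical illustration (a sketch in Python with NumPy, not part of the original argument; the distribution and the values of $t$ are arbitrary choices), the snippet below compares the three bounds with the exact tail of an Exp(1) variable, for which $EX = 1$, $V(X) = 1$ and $E\,e^{sX} = 1/(1-s)$ for $s < 1$, so everything is available in closed form:

```python
import numpy as np

# X ~ Exp(1): EX = 1, V(X) = 1, E e^{sX} = 1/(1-s) for s < 1,
# and the exact tail is P{X >= t} = e^{-t}.
ts = np.array([2.0, 4.0, 8.0])          # needs t > 1 below

exact = np.exp(-ts)
markov = 1.0 / ts                       # Markov: EX / t
tcheby = 1.0 / (ts - 1.0) ** 2          # Tchebychev: P{X >= t} <= P{|X-1| >= t-1} <= 1/(t-1)^2
s = 1.0 - 1.0 / ts                      # the best s in Chernoff's method for Exp(1)
chernoff = np.exp(-s * ts) / (1.0 - s)  # equals e * t * e^{-t}

for row in zip(ts, exact, markov, tcheby, chernoff):
    print("t={:4.1f}  exact={:.2e}  Markov={:.2e}  Tchebychev={:.2e}  Chernoff={:.2e}".format(*row))
```

Already at $t = 8$ the Chernoff bound is within a factor of $et$ of the exact tail, while the Markov and Tchebychev bounds are off by orders of magnitude.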
Now, when $X_1, X_2, \ldots, X_n$ are independent and identically distributed (iid) as $X$, the sample mean,
$$\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i$$
has mean $EX$ and variance $V(X)/n$, so by Tchebychev, for any $\epsilon > 0$,
$$P\left\{\left|\bar{X}_n - EX\right| \ge \epsilon\right\} \le \frac{V(X)}{n\epsilon^2}$$
and it immediately follows that,
$$\lim_{n\to\infty} P\left\{\left|\bar{X}_n - EX\right| \ge \epsilon\right\} = 0$$
which is what is meant by the sentence "the sample mean converges in
probability to the expected value". That's the WLLN. For the special case
of coin flips, i.e. for binary r.v.'s Bin(p), with P{X=1} = 1-P{X=0}=p
the Tchebychev bound gives,
$$P\left\{\left|\bar{X}_n - p\right| \ge \epsilon\right\} \le \frac{p(1-p)}{n\epsilon^2}$$
showing that the observed frequency of ones converges in probability to the
true probability p of observing a 1.
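To watch the weak law in action, here is a small Monte Carlo sketch in Python with NumPy (illustrative only; the values of $p$, $\epsilon$ and the number of repetitions are arbitrary choices). It estimates the deviation probability for a $p$-coin and compares it with the Tchebychev bound above:

```python
import numpy as np

rng = np.random.default_rng(0)
p, eps, reps = 0.3, 0.05, 100_000

for n in (100, 1_000, 10_000):
    # observed frequency of ones in n flips of a p-coin, repeated `reps` times
    freq = rng.binomial(n, p, size=reps) / n
    estimate = np.mean(np.abs(freq - p) >= eps)
    bound = p * (1 - p) / (n * eps**2)
    print(f"n={n:6d}  P(|freq - p| >= eps) ~ {estimate:.4f}   Tchebychev bound: {min(bound, 1.0):.4f}")
```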
The Strong Law
The bounds above obtained from Tchebychev's inequality are very poor.
By using Chernoff's method an exponential bound can be obtained. In fact we
have,
- Hoeffding's inequality
$$P\left\{\left|\bar{X}_n - p\right| \ge \epsilon\right\} \le 2\,e^{-2n\epsilon^2}$$
and by the classic Borel-Cantelli lemma it follows that,
$$P\left\{\omega : \lim_{n\to\infty} \bar{X}_n(\omega) = p\right\} = 1$$
which says that the observed frequency of ones converges with probability one (or a.s., for almost surely) to the true probability $p$ of observing a 1.
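Numerically, the gap between the polynomial and the exponential bounds is dramatic. The sketch below (same hypothetical Monte Carlo setup as before, now with a fair coin) prints the estimated deviation probability next to both bounds; Tchebychev's shrinks like $1/n$ while Hoeffding's collapses exponentially:

```python
import numpy as np

rng = np.random.default_rng(1)
p, eps, reps = 0.5, 0.05, 100_000

for n in (100, 500, 1_000, 2_000):
    freq = rng.binomial(n, p, size=reps) / n
    estimate = np.mean(np.abs(freq - p) >= eps)
    tcheby = min(p * (1 - p) / (n * eps**2), 1.0)      # decays like 1/n
    hoeffding = min(2 * np.exp(-2 * n * eps**2), 1.0)  # decays exponentially in n
    print(f"n={n:5d}  estimate={estimate:.5f}  Tchebychev={tcheby:.5f}  Hoeffding={hoeffding:.5f}")
```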
The proof of Hoeffding's inequality uses the following result for bounded r.v.'s with zero mean.
Lemma 1
If $EX = 0$ and $a \le X \le b$, then for any $s > 0$,
$$E\,e^{sX} \le e^{s^2(b-a)^2/8}.$$
Proof:
Let $a \le x \le b$ and define $\lambda \in [0,1]$ by
$$\lambda = \frac{x-a}{b-a}, \qquad \text{so that } x = \lambda b + (1-\lambda)a.$$
Notice that for any $s > 0$ we have,
$$sx = \lambda\, sb + (1-\lambda)\, sa.$$
Thus, the convexity of $\exp(\cdot)$ implies,
$$e^{sx} \le \frac{b-x}{b-a}\,e^{sa} + \frac{x-a}{b-a}\,e^{sb}.$$
Replacing $x$ with the r.v. $X$, taking expectations, and letting $p = -a/(b-a)$ (notice that $EX = 0$ implies $p \in [0,1]$) we can write,
$$E\,e^{sX} \le \frac{b}{b-a}\,e^{sa} - \frac{a}{b-a}\,e^{sb} = \left(1 - p + p\,e^{s(b-a)}\right)e^{-ps(b-a)} = e^{f(u)}$$
where $u = s(b-a)$ and,
$$f(u) = -pu + \log\left(1 - p + p\,e^{u}\right).$$
The lemma will follow from the last inequality above by showing that,
$$f(u) \le \frac{u^2}{8} = \frac{s^2(b-a)^2}{8}.$$
To see that this is true just expand $f(u)$ about zero,
$$f(u) = f(0) + u f'(0) + \frac{1}{2}\,u^2 f''(\theta)$$
where $\theta \in [0,u]$ exists by Taylor's theorem, and notice that $f(0) = f'(0) = 0$ and
$$f''(u) = \frac{p(1-p)e^{-u}}{\left(p + (1-p)e^{-u}\right)^2} \le \frac{1}{4};$$
this is just a special case of $z(1-z) \le 1/4$ for $z = p/(p + (1-p)e^{-u})$. Alternatively, just set the derivative equal to 0 to find that the max ($1/4$) is achieved when $e^{-u} = p/(1-p)$. $\blacksquare$
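Here is a tiny numerical confirmation of the key bound $f(u) \le u^2/8$ over a grid of values of $p$ and $u$ (a sanity check in Python with NumPy, not a proof; the grid is an arbitrary choice):

```python
import numpy as np

p = np.linspace(0.01, 0.99, 99)[:, None]    # p = -a/(b-a) ranges over (0,1)
u = np.linspace(0.0, 5.0, 501)[None, :]     # u = s(b-a) with s > 0
f = -p * u + np.log(1 - p + p * np.exp(u))  # f(u) as in the proof above
gap = (u**2 / 8 - f).min()
print(f"min of u^2/8 - f(u) over the grid: {gap:.6f}  (nonnegative, as claimed)")
```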
Notice that for the special case of $X \in \{1,-1\}$, with equal probability $1/2$ for each value, the result follows at once from,
$$E\,e^{sX} = \cosh(s) = \sum_{k=0}^{\infty}\frac{s^{2k}}{(2k)!} \le \sum_{k=0}^{\infty}\frac{(s^2/2)^k}{k!} = e^{s^2/2}$$
by comparing the two series term by term (since $(2k)! \ge 2^k\,k!$). It is just this case that is needed in the main VC theorem below.
We are now ready to prove Hoeffding's inequality.
Proof: [Hoeffding's inequality]
We actually show a more general version for $X_1,\ldots,X_n$ independent with $a_i \le X_i \le b_i$. Let $Z_i = X_i - EX_i$; for any $s > 0$ we have,
$$P\left\{\left|\sum_{i=1}^n Z_i\right| \ge t\right\} = P\left\{\sum_{i=1}^n Z_i \ge t\right\} + P\left\{\sum_{i=1}^n (-Z_i) \ge t\right\} \le 2\,e^{-st}\prod_{i=1}^n e^{s^2(b_i-a_i)^2/8} = 2\exp\left\{\frac{s^2}{8}\sum_{i=1}^n (b_i-a_i)^2 - st\right\}$$
where we are using Chernoff's method and the previous lemma. The upper bound is optimized when $s = 4t/\sum_{i=1}^n(b_i-a_i)^2$, producing,
$$P\left\{\left|\sum_{i=1}^n Z_i\right| \ge t\right\} \le 2\,e^{-2t^2/\sum_{i=1}^n(b_i-a_i)^2}$$
which implies the claimed bound for the special case of coin flips: just replace $t = n\epsilon$ and notice that for binary variables $\sum_{i=1}^n(b_i-a_i)^2 = n$. $\blacksquare$
The Modern Strong Uniform Laws
The historical evolution of the laws of large numbers has coincided with important paradigm shifts in the theory of probability. The weak law of Bernoulli and Poisson, with the later refinements of Tchebychev and Markov, is characteristic of the early era of probability. Then came the strong laws of Borel, Cantelli, Kolmogorov and others. These characterized the time of the axiomatic formalization of probability as part of measure theory during the first part of the twentieth century. The latest addition to this saga is what we'll concentrate on here: the so-called strong uniform laws, which have a combinatorial flavor and were discovered by Vapnik and Chervonenkis in the 1970s in connection with statistical learning.
We start with a powerful generalization of Hoeffding's inequality for general functions of independent r.v.'s satisfying the bounded difference assumption. Let $S \subset \mathbb{R}^n$ and denote by $e_i \in \mathbb{R}^n$ the $i$th canonical vector, with all zeros except for a 1 in the $i$th position. We say that a function $h : S \to \mathbb{R}$ has bounded differences in $S$ if for all $1 \le i \le n$,
$$|h(x) - h(x + t e_i)| \le c_i$$
for all $x \in S$ and all $t \in \mathbb{R}$ such that $(x + t e_i) \in S$. This means that the function does not change by more than $c_i$ along the $i$th direction. We have,
- McDiarmid's inequality
Let $h$ have bounded differences. For all $t > 0$,
$$P\left\{\left|h(X_1,\ldots,X_n) - Eh\right| \ge t\right\} \le 2\,e^{-2t^2/\sum_i c_i^2}$$
Notice that when $h = \sum_i X_i$ (so that $c_i = b_i - a_i$) we recover Hoeffding's inequality.
Proof: [McDiarmid's inequality]
The idea is to write,
$$h - Eh = \sum_{i=1}^n Z_i$$
by using,
$$Z_i = Z_i(X_1,\ldots,X_i) = E\{h\,|\,X_1,\ldots,X_i\} - E\{h\,|\,X_1,\ldots,X_{i-1}\}.$$
These $Z_i$ have zero mean and are bounded a.s. within an interval $[L_i, U_i]$, with the lower and upper limits given by the inf and sup over $X_i = u$ of $Z_i$. Thus, $L_i$ and $U_i$ depend only on $X_1,\ldots,X_{i-1}$, and $U_i - L_i \le c_i$ is inherited from the bounded difference assumption about $h$.
Therefore, using Chernoff's method and the previous lemma we have that for all $s > 0$,
$$P\left\{h - Eh \ge t\right\} \le e^{-st}\,E\,e^{s\sum_{i=1}^n Z_i} = e^{-st}\,E\left\{e^{s\sum_{i=1}^{n-1} Z_i}\,E\{e^{sZ_n}\,|\,X_1,\ldots,X_{n-1}\}\right\} \le \cdots \le e^{-st}\prod_{i=1}^n e^{s^2 c_i^2/8}$$
where the lemma was used $n$ times. Now optimize $s$ and copy the steps used for the proof of Hoeffding's inequality to obtain the result. $\blacksquare$
- Corollary
Let $\nu_n$ be the empirical probability measure based on the iid sample $X_1, X_2, \ldots, X_n$ with common distribution $\nu$. The function,
$$h_n = h_n(X_1,\ldots,X_n) = \sup_{A\in\mathcal{A}}\left|\nu_n\{A\} - \nu\{A\}\right|$$
has bounded differences for any class of sets $\mathcal{A}$.
Proof: By changing only one of the $X_i$, the function $h_n$ changes by at most $c_i = 1/n$. $\blacksquare$
It then follows immediately from McDiarmid's inequality (here $\sum_i c_i^2 = n\,(1/n)^2 = 1/n$) that,
$$P\left\{|h_n - Eh_n| \ge t\right\} \le 2\,e^{-2nt^2}.$$
Thus, if we can show that $Eh_n \to 0$ as $n\to\infty$, we can deduce from the above inequality that, for any $t > 0$ and any $n$ large enough that $Eh_n \le t/2$ (so that $h_n \ge t$ implies $|h_n - Eh_n| \ge t/2$),
$$P\left\{\sup_{A\in\mathcal{A}}\left|\nu_n\{A\} - \nu\{A\}\right| \ge t\right\} \le 2\,e^{-nt^2/2}$$
and by the Borel-Cantelli lemma we would have obtained that,
$$\sup_{A\in\mathcal{A}}\left|\nu_n\{A\} - \nu\{A\}\right| \to 0 \quad \text{a.s.}$$
as $n\to\infty$, i.e. we'll have a uniform strong law of large numbers over the class $\mathcal{A}$.
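To make this concrete, take $\mathcal{A}$ to be the class of half-lines $(-\infty, a]$ and let the $X_i$ be uniform on $[0,1]$, so that $\nu\{(-\infty,a]\} = a$ for $a \in [0,1]$. For this class $h_n$ is the classical Kolmogorov-Smirnov statistic and can be computed exactly from the order statistics. The simulation below (an illustrative Python sketch; sample sizes and seed are arbitrary choices) shows the supremum shrinking as $n$ grows, just as the uniform strong law promises for this class:

```python
import numpy as np

rng = np.random.default_rng(2)

def sup_deviation(x):
    """sup over half-lines (-inf, a] of |nu_n{A} - nu{A}| for Uniform(0,1) data."""
    x = np.sort(x)
    n = len(x)
    i = np.arange(1, n + 1)
    # the sup of |F_n(a) - a| is attained at (or just before) a data point
    return max((i / n - x).max(), (x - (i - 1) / n).max())

for n in (100, 1_000, 10_000, 100_000):
    print(f"n={n:7d}   sup_A |nu_n(A) - nu(A)| = {sup_deviation(rng.random(n)):.5f}")
```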
Enter Combinatorics
If $\mathcal{A}$ is a collection of subsets of $\mathbb{R}^d$, we define the shatter coefficients associated to the class $\mathcal{A}$ as,
$$S(n,\mathcal{A}) = \max_{x_1,\ldots,x_n \in \mathbb{R}^d}\left|\left\{A\cap\{x_1,\ldots,x_n\} : A \in \mathcal{A}\right\}\right|.$$
The integer $S(n,\mathcal{A})$ is the maximum number of subsets of a set of $n$ points that appear in elements of $\mathcal{A}$.
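For simple classes the shatter coefficients can be computed by brute force. The sketch below (illustrative Python code, not from the original text) counts the subsets of $n$ collinear points picked out by closed intervals $[a,b]$, and checks the count against the exact formula $n(n+1)/2 + 1$; since $\log S(n,\mathcal{A})$ then grows only like $\log n$, intervals will satisfy the hypothesis of the inequality that follows:

```python
def shatter_intervals(points):
    """Number of distinct subsets of `points` picked out by closed intervals [a, b]."""
    pts = sorted(points)
    picked = {frozenset()}  # an interval missing all the points picks the empty set
    for i in range(len(pts)):
        for j in range(i, len(pts)):
            # [pts[i], pts[j]] picks exactly the points pts[i], ..., pts[j]
            picked.add(frozenset(pts[i:j + 1]))
    return len(picked)

for n in range(1, 7):
    print(f"n={n}: S(n, intervals) = {shatter_intervals(range(n))}"
          f"  vs  n(n+1)/2 + 1 = {n * (n + 1) // 2 + 1}")
```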
Here is a post-modern version of the Vapnik-Chervonenkis inequality due to Devroye and Lugosi.
- Theorem: [VC inequality]
$$E\left\{\sup_{A\in\mathcal{A}}\left|\nu_n\{A\} - \nu\{A\}\right|\right\} \le 2\left(\frac{\log 2S(n,\mathcal{A})}{n}\right)^{1/2}.$$
Before proving this, notice that classes $\mathcal{A}$ for which the rhs of the above inequality goes to zero allow strong uniform laws of large numbers. In other words, the class $\mathcal{A}$ must not be too rich: the logarithm of its shatter coefficients must increase at a rate slower than $n$. The proof uses the following lemma, which is also of independent interest.
- Lemma
If $E\,e^{sZ_i} \le e^{s^2c^2/2}$ for all $s > 0$ and all $i \le n$, then,
$$E\left\{\max_{i\le n} Z_i\right\} \le c\,(2\log n)^{1/2}.$$
Proof:
For any $s > 0$, Jensen's inequality and the hypothesis give,
$$e^{s\,E\{\max_{i\le n} Z_i\}} \le E\,e^{s\max_{i\le n} Z_i} = E\left\{\max_{i\le n} e^{sZ_i}\right\} \le \sum_{i=1}^n E\,e^{sZ_i} \le n\,e^{s^2c^2/2}.$$
Hence, taking logarithms,
$$E\left\{\max_{i\le n} Z_i\right\} \le \frac{\log n}{s} + \frac{sc^2}{2}$$
is valid for any $s > 0$. The best bound, claimed by the lemma, is obtained at $s = c^{-1}(2\log n)^{1/2}$. $\blacksquare$
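Gaussians give a clean test case for this lemma: $Z_i \sim N(0, c^2)$ satisfies $E\,e^{sZ_i} = e^{s^2c^2/2}$ with equality, and the lemma does not even require independence, although the sketch below (illustrative Python with arbitrary parameters) uses independent draws for convenience:

```python
import numpy as np

rng = np.random.default_rng(3)
c, reps = 1.0, 10_000

for n in (10, 100, 1_000):
    z = rng.normal(0.0, c, size=(reps, n))  # each Z_i ~ N(0, c^2)
    emax = z.max(axis=1).mean()             # Monte Carlo estimate of E max_{i<=n} Z_i
    bound = c * np.sqrt(2 * np.log(n))
    print(f"n={n:5d}  E max ~ {emax:.3f}   c*sqrt(2 log n) = {bound:.3f}")
```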
Proof: [VC inequality]
We divide the proof into three simple parts. First we show,
First symmetrization
$$E\left\{\sup_{A\in\mathcal{A}}\left|\nu_n\{A\} - \nu\{A\}\right|\right\} \le E\left\{\sup_{A\in\mathcal{A}}\left|\nu_n\{A\} - \nu_n'\{A\}\right|\right\}$$
where $\nu_n'$ denotes the empirical measure associated to an independent copy $X_1',\ldots,X_n'$ of the original sample $X_1,\ldots,X_n$. This is just a simple fact that follows from two applications of Jensen's inequality and the fact that the unconditional expectation is the expectation of the expectation conditional on the original sample,
$$E\left\{\sup_{A\in\mathcal{A}}\left|\nu_n\{A\} - \nu\{A\}\right|\right\} = E\left\{\sup_{A\in\mathcal{A}}\left|E\left\{\nu_n\{A\} - \nu_n'\{A\}\,\middle|\,X_1,\ldots,X_n\right\}\right|\right\}$$
$$\le E\left\{\sup_{A\in\mathcal{A}} E\left\{\left|\nu_n\{A\} - \nu_n'\{A\}\right|\,\middle|\,X_1,\ldots,X_n\right\}\right\} \le E\left\{E\left\{\sup_{A\in\mathcal{A}}\left|\nu_n\{A\} - \nu_n'\{A\}\right|\,\middle|\,X_1,\ldots,X_n\right\}\right\} = E\left\{\sup_{A\in\mathcal{A}}\left|\nu_n\{A\} - \nu_n'\{A\}\right|\right\}.$$
The second step is,
Second symmetrization
Introduce, independently of the two samples, $n$ independent random signs $\epsilon_1,\ldots,\epsilon_n$, i.e. $P\{\epsilon_i = 1\} = P\{\epsilon_i = -1\} = 1/2$, and notice that if the $Z_i$ are any independent r.v.'s symmetric about 0 then the joint distribution of $\epsilon_1 Z_1,\ldots,\epsilon_n Z_n$ is the same as the joint distribution of $Z_1,\ldots,Z_n$. Hence,
$$E\left\{\sup_{A\in\mathcal{A}}\left|\nu_n\{A\} - \nu\{A\}\right|\right\} \le \frac{1}{n}\,E\left\{\sup_{A\in\mathcal{A}}\left|\sum_{i=1}^n \epsilon_i\left(1[X_i \in A] - 1[X_i' \in A]\right)\right|\right\}$$
where we used $Z_i = 1[X_i \in A] - 1[X_i' \in A]$. Finally, the third step,
Counting and bounding
Here is where combinatorics gets into the picture. To compute the sup over the class $\mathcal{A}$ we only need to check a finite number of sets $A \in \mathcal{A}$, namely those that pick different subsets of the $2n$ values $\{x_1, x_1',\ldots,x_n,x_n'\}$. Thus, we only need to check at most $m = S(2n,\mathcal{A})$ sets in $\mathcal{A}$ to find the sup. Let's denote these sets by $A_1, A_2,\ldots,A_m$ and let,
$$Y_j = \sum_{i=1}^n \epsilon_i\left(1[X_i \in A_j] - 1[X_i' \in A_j]\right);$$
we can then write,
$$E\left\{\sup_{A\in\mathcal{A}}\left|\nu_n\{A\} - \nu\{A\}\right|\right\} \le \frac{1}{n}\,E\left\{\max_{j\le m}|Y_j|\right\} = \frac{1}{n}\,E\left\{\max\{Y_1,-Y_1,\ldots,Y_m,-Y_m\}\right\}.$$
Now we apply the previous lemma by noticing that,
$$E\,e^{sY_j} = E\,e^{-sY_j} \le \prod_{i=1}^n e^{s^2/2} = e^{ns^2/2}$$
and, applying the lemma with $c = \sqrt{n}$ to the $2m$ variables $\pm Y_j$, obtain,
$$E\left\{\sup_{A\in\mathcal{A}}\left|\nu_n\{A\} - \nu\{A\}\right|\right\} \le \frac{\sqrt{n}}{n}\,\left(2\log 2m\right)^{1/2};$$
the result follows by noticing that $m = S(2n,\mathcal{A}) \le S(n,\mathcal{A})^2$. $\blacksquare$
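As a closing illustration (a Python sketch under the same hypothetical half-line setup used earlier, with arbitrary sample sizes), note that for the class of half-lines $(-\infty, a]$ we have $S(n,\mathcal{A}) = n + 1$, so the theorem gives $E\,h_n \le 2\left(\log(2(n+1))/n\right)^{1/2}$. The snippet compares a Monte Carlo estimate of $E\,h_n$ with this bound:

```python
import numpy as np

rng = np.random.default_rng(4)

def sup_deviation(x):
    # sup over half-lines (-inf, a] of |nu_n{A} - nu{A}| for Uniform(0,1) data
    x = np.sort(x)
    n = len(x)
    i = np.arange(1, n + 1)
    return max((i / n - x).max(), (x - (i - 1) / n).max())

for n in (100, 1_000, 10_000):
    est = np.mean([sup_deviation(rng.random(n)) for _ in range(200)])
    vc_bound = 2 * np.sqrt(np.log(2 * (n + 1)) / n)
    print(f"n={n:6d}  E sup (Monte Carlo) ~ {est:.4f}   VC bound = {vc_bound:.4f}")
```

The bound is conservative for this class, but it goes to zero, which is all the uniform strong law needs.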