Are We Cruising a Hypothesis Space?
C. C. RODRIGUEZ
Abstract
This paper is about Information Geometry, a relatively new subject
within mathematical statistics that attempts to study the problem of
inference by using tools from modern differential geometry. This paper
provides an overview of some of the achievements and possible future
applications of this subject to physics.
1 Introduction
It is not surprising that geometry should play a fundamental role in
the theory of inference. The very idea of what constitutes a good
model cannot be stated clearly without reference to geometric concepts
such as size and form of the model as well as distance between
probability distributions. Recall that a statistical model (hypothesis
space) is a collection of probability distributions for the data.
Therefore, a good model should be big enough to include a
close approximation to the true distribution of the data, but small
enough to facilitate the task of identifying this approximation. As
Occam said: as simple as possible but not too simple.
Moreover, regular statistical models have a natural
Riemannian structure. Parameterizations correspond to choices of
coordinate systems and the Fisher information matrix in a given
parameterization provides the metric in the corresponding coordinate
system [,]. By thinking of statistical
models as manifolds, hypothesis spaces become places and it
doesn't take much to imagine some of these places as models for
the only physical place there is out there, namely:
spacetime. In Section 2 of the paper we apply the
techniques of information geometry to show that the space of radially
symmetric distributions admits a foliation into pseudo-spheres of
increasing radius. If we think of a radially symmetric distribution as
describing an uncertain physical position we discover a hypothesis
space that in many ways resembles an expanding spacetime: an
isotropic, homogeneous space with pseudo-spherical symmetries, in which
time increases as the curvature radius decreases. This
admittedly simple toy model of reality already suggests a number of
truly remarkable consequences for the nature of spacetime. Here are
three of them:
- The appearance of time is a consequence of uncertainty.
- Space is infinite dimensional and only on the average appears as
four dimensional.
- Spin is a property of space and not of a particle, so that all truly
fundamental particles must have spin.
I must emphasize that, at the time of writing, there is no direct
experimental evidence in favor of any of the above statements.
Nevertheless there is indirect evidence that they should not be too
quickly dismissed as nonsense.
With respect to the first statement: recall that the appearance of
the axis of time, in standard general relativity, is a consequence of
specifying an initial and a final 3-geometry on two spacelike
hypersurfaces, plus evolution according to the field
equation []. Time is therefore a consequence of
3-space geometry and the field equation of general relativity, which
in turn seems to be of a statistical nature (see []
and Section 5 below). These, I believe, are facts
that support, at least in spirit, the first claim above.
There is absolutely no evidence that space has infinitely many
dimensions but, were this true, it would explain why we observe only
four of them. It also seems a priori desirable to have a model that
produces observed space as a macroscopic object, not unlike
pressure or temperature.
With respect to the third statement: Hestenes [] shows
that many of the rules for manipulating spin have nothing to do with
quantum mechanics but are just general expressions for the geometry of
space. It is also worth noticing that the standard model allows the
existence of elementary particles without spin, but these have not yet
been observed.
But there is still more. Think of the different roles that the
concepts of entropy, curvature, and local hyperbolicity play in
statistics and in physics, and you will realize that the link is a
useful bridge for transporting ideas from physics to statistics and
vice versa. The following sections (3, 4 and 5) of this paper do exactly
that: they examine the meaning of each of these concepts
(entropy, curvature and local hyperbolicity) keeping the proposed link
between inference and physics in mind.
The link between information geometry and general relativity promises
big rewards for both statistical inference and physics. For example,
statisticians may look at the field equations of general relativity as
a procedure for generating statistical models from prior information
encoded in the distribution of energy and matter fields. On the other
hand, physicists may see information geometry as a possible language
for the elusive theory of quantum gravity since it is a language
already made out of the right ingredients: uncertainty and
differential geometry.
2 The Hypothesis Space of Radially Symmetric Distributions
Let $\mathcal{R}^3$ be the collection of all radially symmetric distributions of
three dimensional euclidean space. The probability assigned by a general
element of $\mathcal{R}^3$ to an infinitesimal region of volume $d^3x$ around the
point $x \in \mathbb{R}^3$ is given by,

$$
P(d^3x \mid \psi,\theta,\sigma) \;=\; \frac{1}{\sigma^{3}}
\left| \psi\!\left( \left| \frac{x-\theta}{\sigma} \right|^{2} \right) \right|^{2} d^3x
\qquad (1)
$$
where $\theta \in \mathbb{R}^3$ is a location parameter, $\sigma > 0$ is a
scale parameter and $\psi$ is an arbitrary differentiable function of
$r^2 > 0$ satisfying the normalization condition:

$$
\int_0^{\infty} r^{2}\, |\psi(r^{2})|^{2}\, dr \;=\; \frac{1}{4\pi}
\qquad (2)
$$
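For concreteness, here is a small numerical sketch (my own illustration; the paper singles out no particular $\psi$) using the Gaussian choice $\psi(u) = (2\pi)^{-3/4} e^{-u/4}$, for which (1) becomes the spherical normal density on $\mathbb{R}^3$. It checks (2) and the unit total probability by quadrature.

\begin{verbatim}
import numpy as np
from scipy.integrate import quad

# Illustrative choice of psi (an assumption, not prescribed by the paper):
# psi(u) = (2*pi)**(-3/4) * exp(-u/4) turns (1) into the spherical normal.
def psi(u):
    return (2 * np.pi) ** (-0.75) * np.exp(-u / 4.0)

# Normalization condition (2): int_0^oo r^2 |psi(r^2)|^2 dr = 1/(4*pi).
lhs, _ = quad(lambda r: r**2 * psi(r**2) ** 2, 0, np.inf)
print(lhs, 1.0 / (4.0 * np.pi))           # ~0.0795775 vs ~0.0795775

# Total probability of (1), in spherical coordinates around theta:
# int_0^oo 4*pi*r^2 * sigma**-3 * |psi((r/sigma)**2)|**2 dr = 1.
sigma = 2.5
total, _ = quad(lambda r: 4 * np.pi * r**2 * psi((r / sigma) ** 2) ** 2
                / sigma**3, 0, np.inf)
print(total)                               # ~1.0
\end{verbatim}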
Equation (2) assures that the probability assigned to the
whole space by (1) is in fact 1. The derivative
$\psi'$ must also decrease to 0 sufficiently fast so that the
integrals below exist. Since $\psi$ is an infinite
dimensional parameter, $\mathcal{R}^3$ is also an infinite dimensional manifold, but
the space $\mathcal{R}^3(\psi)$ of radially symmetric distributions for a
given function $\psi$ is a four dimensional submanifold of $\mathcal{R}^3$
parameterized by
$(\theta^0,\theta^1,\theta^2,\theta^3) = (\sigma,\theta)$. The metric in $\mathcal{R}^3(\psi)$ is given by the
$4 \times 4$ Fisher information matrix (see [] p. 63) with entries:
$$
g_{\mu\nu} \;=\; 4 \int (\partial_{\mu} f)(\partial_{\nu} f)\, d^3x
\qquad (3)
$$
where $\mu,\nu = 0,\ldots,3$ and the function $f$ is the square root of
the density given in (1), i.e.,

$$
f(x \mid \theta,\sigma) \;=\; \sigma^{-3/2}\,
\psi\!\left( \left| \frac{x-\theta}{\sigma} \right|^{2} \right)
\qquad (4)
$$
and $\partial_{\mu}$ denotes the derivative with respect to $\theta^{\mu}$.
Let us separate the computation of the metric tensor terms into three
parts: the entries $g_{ij}$, the entries $g_{0i}$ for $i,j = 1,2,3$, and
the element $g_{00}$. Substituting (4) into (3),
making the change of variables $x = \theta + \sigma y$ and using the fact
that
$$
\partial_i \psi(y^2) \;=\; -2\, y_i\, \psi'(y^2)/\sigma
$$
we get,

$$
g_{ij} \;=\; \frac{16}{\sigma^{2}} \int y_i\, y_j\, \big(\psi'(y^{2})\big)^{2}\, d^3y
\qquad (5)
$$
where $y^2 = |y|^2$ is the Clifford product of the vector $y$ with
itself. Carrying out the integration in spherical coordinates we obtain
$g_{ij} = 0$ for $i \neq j$ and,

$$
g_{ii} \;=\; \frac{64\pi}{3\sigma^{2}} \int r^{4}\, |\psi'(r^{2})|^{2}\, dr
\qquad (6)
$$
The derivative with respect to $\sigma$ of the function given
in (4) is,

$$
\partial_0 f \;=\; -\frac{\sigma^{-5/2}}{2}\, \big[\, 3\psi + 4 y^{2} \psi' \,\big]
\qquad (7)
$$
and therefore, computing as in (5), we have,

$$
g_{0i} \;=\; 4 \int (\partial_0 f)(\partial_i f)\, d^3x
\;\propto\; \int \big[ 3\psi + 4 y^{2}\psi' \big]\, y_i\, \psi'\, d^3y \;=\; 0
\qquad (8)
$$
where the value of 0 for the last integral follows by performing the
integration in spherical coordinates, or simply by symmetry, after
noticing that the integrand is odd. Finally, from (7)
we get,

$$
g_{00} \;=\; \frac{4\pi}{\sigma^{2}} \int
\big[\, 3\psi(r^{2}) + 4 r^{2} \psi'(r^{2}) \,\big]^{2}\, r^{2}\, dr
\qquad (9)
$$
Expanding the square and integrating the cross term by parts shows
that,

$$
\int r^{4}\, \psi(r^{2})\, \psi'(r^{2})\, dr \;=\; -\frac{3}{4} \left( \frac{1}{4\pi} \right)
\qquad (10)
$$

where we took $u = \psi r^{3}/2$ and $v' = 2 r \psi'$ for the
integration by parts and we have used (2). We obtain,
$$
g_{00} \;=\; \frac{4\pi}{\sigma^{2}}
\left[\, \frac{-9}{4\pi} + 16 \int r^{6}\, |\psi'(r^{2})|^{2}\, dr \,\right]
\qquad (11)
$$
The full metric tensor is then,

$$
(g) \;=\; \frac{1}{\sigma^{2}}\,\mathrm{diag}\big( J(\psi),\, K(\psi),\, K(\psi),\, K(\psi) \big),
$$

where $J(\psi)$ and $K(\psi)$ are just shorthand notations for the
factors of $1/\sigma^{2}$ in (11) and
(6). These functions are always positive and they depend
only on $\psi$. Straightforward calculations, best done with a
symbolic manipulator like MAPLE, show that a space with this metric
has constant negative curvature $-1/J(\psi)$ (i.e., constant sectional
curvature $-1/J(\psi)$; the Ricci scalar is $-12/J(\psi)$). It
follows that, for a fixed value of the function $\psi$, the hypothesis
space of radially symmetric distributions $\mathcal{R}^3(\psi)$ is the
pseudo-sphere of radius $J^{1/2}(\psi)$. We have therefore
shown that the space of radially symmetric distributions has a
foliation (i.e., a partition into submanifolds) of pseudo-spheres of
increasing radius.
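Since the curvature claim is purely computational, it can be rechecked with any symbolic package. Below is a minimal sympy sketch (my stand-in for the MAPLE computation mentioned above) that treats $J$ and $K$ as positive constants; it returns a Ricci scalar of $-12/J$, constant and independent of $\sigma$, $\theta$ and $K$, i.e. constant sectional curvature $-1/J$, as expected for a pseudo-sphere of radius $J^{1/2}$.

\begin{verbatim}
import sympy as sp

# Coordinates (sigma, theta1, theta2, theta3); J, K treated as constants.
sig, t1, t2, t3 = sp.symbols('sigma theta1 theta2 theta3', positive=True)
J, K = sp.symbols('J K', positive=True)
x = [sig, t1, t2, t3]
n = 4

# Metric (g) = sigma**-2 * diag(J, K, K, K), from (6) and (11).
g = sp.diag(J, K, K, K) / sig**2
ginv = g.inv()

# Christoffel symbols Gamma^a_{bc}.
Gam = [[[sum(ginv[a, d] * (sp.diff(g[d, b], x[c]) + sp.diff(g[d, c], x[b])
             - sp.diff(g[b, c], x[d])) for d in range(n)) / 2
         for c in range(n)] for b in range(n)] for a in range(n)]

# Riemann tensor R^a_{bcd}, Ricci tensor, and scalar curvature.
def riem(a, b, c, d):
    out = sp.diff(Gam[a][b][d], x[c]) - sp.diff(Gam[a][b][c], x[d])
    return out + sum(Gam[a][c][e] * Gam[e][b][d]
                     - Gam[a][d][e] * Gam[e][b][c] for e in range(n))

Ric = sp.Matrix(n, n, lambda b, d: sum(riem(a, b, a, d) for a in range(n)))
R = sp.simplify(sum(ginv[b, d] * Ric[b, d]
                    for b in range(n) for d in range(n)))
print(R)   # -> -12/J
\end{verbatim}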
This is a mathematical theorem. There can be nothing controversial about it.
What may be disputed, however, is my belief that the hypothesis
space of radially symmetric distributions may be telling us something new
about the nature of real physical spacetime. What I find interesting is
the fact that if we think of position subject to radially symmetric
uncertainty then the mathematical object describing the positions (i.e.
the space of its distributions) has all the symmetries of space plus time.
It seems that time, or something like time, pops out automatically when
we have uncertain positions. I like to state this hypothesis with
the phrase:
there is no time, only uncertainty
2.1 Uncertain Spinning Space?
The hypothesis space of radially symmetric distributions is the
space of distributions for a random vector $y \in \mathbb{R}^3$ of
the form,

$$
y \;=\; x + \epsilon
\qquad (12)
$$

where $x \in \mathbb{R}^3$ is a non random location vector, and
$\epsilon \in \mathbb{R}^3$ is a random vector with a distribution radially
symmetric about the origin and with standard deviation $\sigma > 0$ in
all directions. It turns out that exactly the same hypothesis space
is obtained if instead of (12) we use,

$$
y \;=\; x + i\epsilon
\qquad (13)
$$

where $i$ is the constant unit pseudoscalar of the Clifford algebra
of $\mathbb{R}^3$. The pseudoscalar $i$ has unit magnitude, commutes with
all the elements of the algebra, squares to $-1$ and represents the
oriented unit volume of $\mathbb{R}^3$ []. By taking
expectations with the probability measure indexed by $(x,\sigma,\psi)$
we obtain that,

$$
E(y \mid x,\sigma,\psi) \;=\; x
$$

and,

$$
E(y^{2} \mid x,\sigma,\psi) \;=\; |x|^{2} - 3\sigma^{2}
\qquad (14)
$$

Equation (14) shows that, even though the space of radially
symmetric distributions is infinite dimensional, on the average
the intervals look like the usual spacetime intervals.
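Equation (14) is easy to check by simulation. In the Clifford algebra of $\mathbb{R}^3$ one has $y^2 = (x + i\epsilon)^2 = (|x|^2 - |\epsilon|^2) + 2(x \cdot \epsilon)\, i$, since $i$ commutes with everything and $i^2 = -1$; the scalar part averages to $|x|^2 - 3\sigma^2$ and the pseudoscalar part to zero. A Monte Carlo sketch (my own illustration, taking a Gaussian $\epsilon$ as one admissible radially symmetric choice):

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
x = np.array([1.0, 2.0, 2.0])       # fixed location, |x|^2 = 9
sigma, n = 0.7, 1_000_000

# eps: radially symmetric about the origin, std sigma in each direction.
eps = rng.normal(0.0, sigma, size=(n, 3))

# y = x + i*eps, so y^2 = (|x|^2 - |eps|^2) + 2*(x . eps)*i.
scalar_part = x @ x - np.einsum('ij,ij->i', eps, eps)
pseudo_part = 2.0 * (eps @ x)

print(scalar_part.mean(), x @ x - 3 * sigma**2)   # ~7.53 vs 7.53
print(pseudo_part.mean())                         # ~0.0
\end{verbatim}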
We may think of $y$ in (13) as encoding a position in 3-space
together with an uncertain degree of orientation given by the bivector
part of $y$, i.e. $i\epsilon$. In other words, we assign to the point
$x$ an intrinsic orientation of direction $\hat{\epsilon}$ and
magnitude $|\epsilon|$. In this model the uncertainty is not directly
about the location $x$ (as in (12)) but about its postulated
degree of orientation (or spinning).
3 Entropy and Ignorance
The notion of statistical entropy is not only related to the
corresponding notion in physics; it is exactly the same thing, as
demonstrated long ago by Jaynes []. Entropy appears
indisputably as the central quantity of information geometry. In
particular, from the Kullback number (relative entropy) between two
distributions in the model we obtain the metric, the volume element, a
large class of connections [], and a notion of
ignorance within the model given by the so called entropic priors
[]. In this section I present a simple argument,
inspired by the work of Zellner on MDIP priors [],
showing that entropic priors are the statistical representation of the
vacuum of information in a given hypothesis space.
Let $\mathcal{H} = \{ f(x \mid \theta) : \theta \in \Theta \}$ be a general regular
hypothesis space of probability density functions $f(x \mid \theta)$ for a
vector of observations $x$ conditional on a vector of parameters
$\theta = (\theta^{\mu})$. Let us denote by $f(x,\theta)$ the joint
density of $x$ and $\theta$, and by $f(x)$ and $\pi(\theta)$ the
marginal density of $x$ and the prior on $\theta$ respectively. We
have,

$$
f(x,\theta) \;=\; f(x \mid \theta)\, \pi(\theta).
$$
Since $\mathcal{H}$ is regular, the Fisher information matrix,

$$
g_{\mu\nu}(\theta) \;=\; 4 \int (\partial_{\mu} f^{1/2})(\partial_{\nu} f^{1/2})\, dx
\qquad (15)
$$

exists and is continuous and positive definite (thus non singular)
at every $\theta$. As in (3), $\partial_{\mu}$ denotes the
partial derivative with respect to $\theta^{\mu}$. The space $\mathcal{H}$ with
the metric $g = (g_{\mu\nu})$ given in (15) forms a
Riemannian manifold. Therefore, the invariant element of volume is
given by,

$$
\eta(d\theta) \;\propto\; \sqrt{\det g(\theta)}\; d\theta.
$$
This is in fact a differential form [] that
provides a notion of surface area for the manifold $\mathcal{H}$, and it is
naturally interpreted as the uniform distribution over $\mathcal{H}$. This
formula, known as Jeffreys rule, is often used as a universal
method for building total ignorance priors. However, Jeffreys rule
does not take into account the fact that a truly ignorant prior for
$\theta$ should contain as little information as possible about the
data $x$. The entropic prior in $\mathcal{H}$ demands that the joint
distribution of $x$ and $\theta$, $f(x,\theta)$, be as difficult as
possible to discriminate from the independent model
$h(x) \sqrt{\det g(\theta)}$, where $h(x)$ is an initial guess for $f(x)$. That is, we
are looking for the prior that minimizes the Kullback number between
$f(x,\theta)$ and the independent model, or in other words, the
prior that makes the joint distribution of $x$ and $\theta$ have
maximum entropy relative to the measure
$h(x)\sqrt{\det g(\theta)}\, dx\, d\theta$. Thus, the entropic prior is the density $\pi(\theta)$ that
solves the variational problem,
$$
\min_{\pi} \int f(x,\theta)\, \log \frac{f(x,\theta)}{h(x)\sqrt{\det g(\theta)}}\; dx\, d\theta
\qquad (16)
$$
Replacing $f(x,\theta) = f(x \mid \theta)\, \pi(\theta)$
into (16), simplifying, and using a
Lagrange multiplier, $\lambda$, for the normalization constraint
$\int \pi(\theta)\, d\theta = 1$, we find that $\pi$ must minimize,

$$
\int \pi(\theta)\, I(\theta:h)\, d\theta
\;+\; \int \pi(\theta)\, \log \frac{\pi(\theta)}{\sqrt{\det g(\theta)}}\; d\theta
\;+\; \lambda \int \pi(\theta)\, d\theta
\qquad (17)
$$
where $I(\theta:h)$ denotes the Kullback number between $f(x \mid \theta)$ and
$h(x)$, i.e.,

$$
I(\theta:h) \;=\; \int f(x \mid \theta)\, \log \frac{f(x \mid \theta)}{h(x)}\; dx
\qquad (18)
$$
The Lagrangian $\mathcal{L}$ is given by the sum of the integrands in
(17), and the Euler-Lagrange equation is then,

$$
\frac{\partial \mathcal{L}}{\partial \pi}
\;=\; I(\theta:h) + \log \frac{\pi(\theta)}{\sqrt{\det g(\theta)}} + 1 + \lambda
\;=\; 0
\qquad (19)
$$
from which we obtain,

$$
\pi(\theta) \;\propto\; e^{-I(\theta:h)}\, \sqrt{\det g(\theta)}
\qquad (20)
$$

The numerical values of the probabilities obtained with the formula
(20) depend on the base of the logarithm used in
(16). However, the base of the logarithm that appears in the
definition of the Kullback number is arbitrary (entropy is defined
only up to a proportionality constant). Thus, (20) is not
just one density, but a family of densities,

$$
\pi(\theta \mid \alpha, h) \;\propto\; e^{-\alpha I(\theta:h)}\, \sqrt{\det g(\theta)}
\qquad (21)
$$

indexed by the parameter $\alpha > 0$ and the function $h$. Equation
(21) is the family of entropic priors introduced in
[] and studied in more detail in
[], [] and [].
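To see how (21) produces the Gaussian example of the next subsection, the Kullback number (18) can be evaluated in closed form. A small symbolic sketch (my own check; for the normal family the Fisher metric is $\mathrm{diag}(1/\sigma^2, 2/\sigma^2)$, so $\sqrt{\det g} \propto 1/\sigma^2$):

\begin{verbatim}
import sympy as sp

t = sp.symbols('t', real=True)                  # t = x - mu
mu, mu0 = sp.symbols('mu mu0', real=True)
s, s0, alpha = sp.symbols('sigma sigma0 alpha', positive=True)

# f = N(mu, sigma^2) written in t = x - mu; h = N(mu0, sigma0^2).
f = sp.exp(-t**2 / (2 * s**2)) / (s * sp.sqrt(2 * sp.pi))
log_ratio = (sp.log(s0 / s) + (t + mu - mu0)**2 / (2 * s0**2)
             - t**2 / (2 * s**2))               # log f - log h

# Kullback number (18): I(theta:h) = E_f[log f - log h].
I = sp.simplify(sp.integrate(f * log_ratio, (t, -sp.oo, sp.oo)))
print(I)  # log(sigma0/sigma) + ((mu-mu0)**2 + sigma**2)/(2*sigma0**2) - 1/2

# Entropic prior (21): exp(-alpha*I)*sqrt(det g), with sqrt(det g) ~ 1/sigma^2,
# is proportional to
#   sigma**(alpha-2) * exp(-alpha*((mu-mu0)**2 + sigma**2)/(2*sigma0**2)),
# i.e. exactly the form (23) below.
prior = sp.simplify(sp.exp(-alpha * I) / s**2)
print(prior)
\end{verbatim}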
It was shown in [] that the parameter $\alpha$ should
be interpreted as the number of virtual observations supporting
$h(x)$ as a guess for the distribution of $x$. Large values
of $\alpha$ should go with reliable guesses for $h(x)$ but, as
was shown in [], the inferences are then less robust. This
indicates that ignorant priors should be entropic priors with the
smallest possible value for $\alpha$, i.e., with,

$$
\alpha^{*} \;=\; \inf \left\{ \alpha > 0 \;:\;
\int e^{-\alpha I(\theta:h)}\; \eta(d\theta) \;<\; \infty \right\}
\qquad (22)
$$
Here is the canonical example.
3.1 Example: The Gaussians
Consider the hypothesis space of one dimensional gaussians parameterized by
the mean $\mu$ and the standard deviation $\sigma$.
When $h$ is an arbitrary gaussian with parameters $\mu_0$ and $\sigma_0$,
straightforward computations show that the entropic prior is given by,
$$
\pi(\mu,\sigma \mid \alpha,\mu_0,\sigma_0) \;=\; \frac{1}{Z}\;
\sigma^{\alpha-2}\, \exp\!\left[ \frac{-\alpha\, \big( (\mu-\mu_0)^{2} + \sigma^{2} \big)}{2\sigma_0^{2}} \right]
\qquad (23)
$$
where the normalization constant $Z$ is defined for $\alpha > 1$
and is given by,

$$
Z^{-1} \;=\; \frac{2}{\sqrt{\pi}} \left( \frac{\alpha}{2} \right)^{\alpha/2}
\Gamma\!\left( \frac{\alpha-1}{2} \right)^{-1} \sigma_0^{-\alpha}
\qquad (24)
$$
Thus, in this case $\alpha^{*} = 1$ and the most ignorant prior is
obtained by taking the limits $\alpha \to 1$ and
$\sigma_0 \to \infty$ in (23), obtaining, in the limit, an
improper density proportional to $1/\sigma$, which makes everybody
happy, frequentists and Bayesians alike.
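A quick numerical sketch (my own check) confirms the normalization and the limit just described: quadrature over $(\mu,\sigma)$ reproduces the closed form implied by (24), and for $\alpha$ near 1 with large $\sigma_0$ the density (23) is proportional to $1/\sigma$ over any bounded range of $\sigma$.

\begin{verbatim}
import numpy as np
from scipy.integrate import dblquad
from scipy.special import gammaln

alpha, mu0, s0 = 2.5, 0.0, 1.3

# Unnormalized entropic prior (23) for the one dimensional gaussians.
def body(mu, s):
    return s**(alpha - 2) * np.exp(-alpha * ((mu - mu0)**2 + s**2)
                                   / (2 * s0**2))

# Z by quadrature: mu over R (inner), sigma over (0, oo) (outer).
Z_num, _ = dblquad(body, 0, np.inf, -np.inf, np.inf)

# Closed form from (24): Z = (sqrt(pi)/2) * (2/alpha)**(alpha/2)
#                            * Gamma((alpha-1)/2) * sigma0**alpha.
Z_form = (np.sqrt(np.pi) / 2 * (2 / alpha)**(alpha / 2)
          * np.exp(gammaln((alpha - 1) / 2)) * s0**alpha)
print(Z_num, Z_form)        # the two values agree

# alpha -> 1, sigma0 -> oo: sigma * density is flat, i.e. density ~ 1/sigma.
a, S0 = 1.0 + 1e-9, 1e9
s = np.linspace(0.5, 5.0, 5)
flat = s * s**(a - 2) * np.exp(-a * s**2 / (2 * S0**2))
print(flat / flat[0])       # ~[1. 1. 1. 1. 1.]
\end{verbatim}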
4 Curvature and Information
Curvature seems to be well understood only in physics, especially from
the point of view of gauge theories, where the curvature form
associated with a connection has been shown to encode field strengths
for all four fundamental forces of nature []. In
statistics, on the other hand, the only thing we know (so far) about
the role of curvature is that the higher the scalar curvature is at a
given point of the model, the more difficult it is to do estimation at
that point. This already agrees nicely with the idea of black holes,
for if in a given model there is a curvature $R_0$ beyond which
estimation is essentially impossible, then the space is partitioned
into three regions with curvatures $R < R_0$, $R = R_0$ and
$R > R_0$ that correspond to regular points, horizon points and points
inside black holes. Nobody has found an example of a hypothesis
space with this kind of inferential black holes yet, but nobody has
tried to look for one either. Before rushing into a hunt it seems
necessary to clarify what exactly is meant by the words:
estimation is essentially impossible at a point.
I believe that one of the most promising areas for research in the
field of information geometry is the clarification of the role of
curvature in statistical inference. If indeed physical spacetime can
be best modeled as a hypothesis space then, what is to be learned from
the research on statistical curvature will have direct implications
for the nature of physical space. On the other hand, it also seems
promising to re-evaluate what is already known in physics about
curvature in the light of the proposed link with inference. Even a
naive first look will show indications of what to expect for the role
of curvature in inference. Here is an attempt at that first look.
- From the classic statement: Mass-energy is the source of
gravity and the strength of the gravity field is measured by the
curvature of spacetime.
We guess: Information is the source of the curvature of
hypothesis spaces. That is, prior information is the source of the form of
the model.
- From: The dynamics of how mass-energy curves spacetime are
controlled by the field equation,
$$
G \;=\; \kappa T
$$
where $G$ is the Einstein tensor, $T$ is the
stress-energy tensor and $\kappa$ is a proportionality factor.
We guess: The field equation controls the dynamics of how
prior information produces models.
- From: The field equation for empty space is the
Euler-Lagrange equation that characterizes the extremum of the Hilbert
action with respect to the choice of geometry. That is, it extremizes

$$
S_g \;=\; \int_{\Omega} R\, d\Omega, \qquad d\Omega \;=\; \sqrt{-g}\; d^4x
\qquad (25)
$$

where the integral is taken over the interior of a
four-dimensional region $\Omega$, $R$ is the scalar curvature and $g$
is the determinant of the metric.
Guess 1: The form of hypothesis spaces based on no prior
information must satisfy

$$
R_{ij} - \frac{1}{2} R\, g_{ij} \;=\; 0
\qquad (26)
$$

where $g_{ij}$ is the Fisher information matrix, $R_{ij}$ is the
Ricci tensor and $R$ is the scalar curvature as above.
Guess 2: Given a hypothesis space with Fisher information
matrix $g(\theta)$, the Einstein tensor $G$, i.e. the left hand side
of (26), quantifies the amount of prior information locally
contained in the model at each point $\theta$.
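As a hedged illustration of Guess 2 (my reading, not a result from the text), take the only hypothesis space whose metric was computed in this paper: a leaf $\mathcal{R}^3(\psi)$ of Section 2, which has constant sectional curvature $-1/J(\psi)$. For such a constant-curvature four dimensional space,

$$
R_{ij} \;=\; -\frac{3}{J(\psi)}\, g_{ij}, \qquad
R \;=\; -\frac{12}{J(\psi)}, \qquad
G_{ij} \;=\; R_{ij} - \frac{1}{2} R\, g_{ij} \;=\; \frac{3}{J(\psi)}\, g_{ij},
$$

so under Guess 2 the constant $3/J(\psi)$ would measure the prior information carried by the choice of $\psi$: the smaller the pseudo-sphere, the more informative the model.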
5 Hyperbolicity
What seems most intriguing about the link between
information geometry and general relativity is the role of
hyperbolicity. We know from general relativity that physical
spacetimes are pseudo-Riemannian manifolds which are locally Lorentzian.
That is, at each point, the space looks locally like Minkowski space;
or, in other words, the symmetries of the tangent space at each point
are those of hyperbolic space. On the other hand, in information
geometry, hyperbolicity appears at two very basic levels. First,
hyperbolicity appears connected to the notion of regularity through
the property of local asymptotic normality (LAN for short, see
[]). This is in close agreement with what happens in
physics. The LAN property says that the manifold of distributions of
$n$ independent and identically distributed observations from a regular
model can be locally approximated by gaussians for large $n$, and since the
gaussians are known to form hyperbolic spaces, the correspondence with
physics is perfect. Second, in statistical inference hyperbolicity
also appears mysteriously connected to entropy and Bayes' theorem!
(see my From Euclid to Entropy []) and, by
following the link back to general relativity, we obtain a completely
new and unexpected result: entropy and Bayes' theorem are the
source of the local hyperbolicity of spacetime! That entropy and
thermodynamics are related to general relativity may have seemed
outrageous in the past, but not today. It does not seem outrageous at
all when we consider that Bekenstein found that the entropy of a
black hole is proportional to its surface area [],
that Hawking discovered that black holes have a
temperature [], and especially that
Jacobson showed that the field equation is like an equation of state
in thermodynamics [].