Are We Cruising a Hypothesis Space?

C. C. RODRIGUEZ

Abstract

This paper is about Information Geometry, a relatively new subject within mathematical statistics that attempts to study the problem of inference by using tools from modern differential geometry. It provides an overview of some of the achievements of this subject and of its possible future applications to physics.

1  Introduction

It is not surprising that geometry should play a fundamental role in the theory of inference. The very idea of what constitutes a good model cannot be stated clearly without reference to geometric concepts such as the size and form of the model, as well as the distance between probability distributions. Recall that a statistical model (hypothesis space) is a collection of probability distributions for the data. A good model should therefore be big enough to include a close approximation to the true distribution of the data, but small enough to facilitate the task of identifying this approximation. As Occam said: as simple as possible but not too simple. Moreover, regular statistical models have a natural Riemannian structure. Parameterizations correspond to choices of coordinate systems, and the Fisher information matrix in a given parameterization provides the metric in the corresponding coordinate system [,]. By thinking of statistical models as manifolds, hypothesis spaces become places, and it doesn't take much to imagine some of these places as models for the only physical place there is out there, namely: spacetime. In section 2 of the paper we apply the techniques of information geometry to show that the space of radially symmetric distributions admits a foliation into pseudo-spheres of increasing radius. If we think of a radially symmetric distribution as describing an uncertain physical position, we discover a hypothesis space that in many ways resembles an expanding spacetime: an isotropic, homogeneous space with pseudo-spherical symmetries and with time increasing with decreasing curvature radius. This admittedly simple toy model of reality already suggests a number of truly remarkable consequences for the nature of spacetime. Here are three of them:

  1. The appearance of time is a consequence of uncertainty.
  2. Space is infinite dimensional and only on average appears four dimensional.

  3. Spin is a property of space and not of a particle so that all truly fundamental particles must have spin.

I must emphasize that, at the time of writing, there is no direct experimental evidence in favor of any of the above statements. Nevertheless there is indirect evidence that they should not be too quickly dismissed as nonsense.

With respect to the first statement: recall that the appearance of the axis of time, in standard general relativity, is a consequence of specifying an initial and a final 3-geometry on two spacelike hypersurfaces plus evolution according to the field equation []. Time is therefore a consequence of 3-space geometry and the field equation of general relativity, which in turn seems to be of a statistical nature (see [] and section 4 below). These, I believe, are facts that support, at least in spirit, the first claim above.

There is absolutely no evidence that space has infinitely many dimensions, but if this were true, it would explain why we observe only four of them. It also seems a priori desirable to have a model that produces observed space as a macroscopic object, not unlike pressure or temperature.

With respect to the third statement: Hestenes [] shows that many of the rules for manipulating spin have nothing to do with quantum mechanics but are just general expressions for the geometry of space. It is also worth noticing that the standard model allows the existence of elementary particles without spin, but these have not yet been observed.

But there is still more. Think of the different roles that the concepts of entropy, curvature, and local hyperbolicity play in statistics and in physics, and you will realize that the link is a useful bridge for transporting ideas from physics to statistics and vice versa. The following sections (3, 4, and 5) of this paper do exactly that. That is, they examine the meaning of each of these concepts (entropy, curvature and local hyperbolicity) keeping the proposed link between inference and physics in mind.

The link between information geometry and general relativity promises big rewards for both statistical inference and physics. For example, statisticians may look at the field equations of general relativity as a procedure for generating statistical models from prior information encoded in the distribution of energy and matter fields. On the other hand, physicists may see information geometry as a possible language for the elusive theory of quantum gravity, since it is a language already made out of the right ingredients: uncertainty and differential geometry.

2  The Hypothesis Space of Radially Symmetric Distributions

Let $\Re^3$ be the collection of all radially symmetric distributions of three-dimensional euclidean space. The probability of an infinitesimal region around the point $x \in \Re^3$, of volume $d^3x$, that is assigned by a general element of $\Re^3$ is given by,

\[
P(d^3x \mid \psi,\theta,\sigma) \;=\; \frac{1}{\sigma^{3}}\,
\left|\, \psi\!\left( \left| \frac{x-\theta}{\sigma} \right|^{2} \right) \right|^{2} d^3x
\qquad (1)
\]
where $\theta \in \Re^3$ is a location parameter, $\sigma > 0$ is a scale parameter and $\psi$ is an arbitrary differentiable function of $r^2 > 0$ satisfying the normalization condition:

\[
\int_{0}^{\infty} r^{2}\, |\psi(r^{2})|^{2}\, dr \;=\; \frac{1}{4\pi}
\qquad (2)
\]
Equation (2) assures that the probability assigned to the whole space by (1) is in fact 1. The derivative $\psi'$ must also decrease to 0 sufficiently fast so that the integrals below exist. Since $\psi$ is an infinite dimensional parameter, $\Re^3$ is also an infinite dimensional manifold, but the space $\Re^3(\psi)$ of radially symmetric distributions for a given function $\psi$ is a four dimensional submanifold of $\Re^3$ parameterized by $(\theta^{0},\theta^{1},\theta^{2},\theta^{3}) = (\sigma,\theta)$. The metric in $\Re^3(\psi)$ is given by the $4\times 4$ Fisher information matrix (see [] p. 63) with entries:

\[
g_{\mu\nu} \;=\; 4 \int (\partial_{\mu} f)(\partial_{\nu} f)\, d^3x
\qquad (3)
\]
where $\mu,\nu = 0,\ldots,3$ and the function $f$ is the square root of the density given in (1), i.e.,

\[
f(x \mid \theta,\sigma) \;=\; \sigma^{-3/2}\, \psi\!\left( \left| \frac{x-\theta}{\sigma} \right|^{2} \right)
\qquad (4)
\]
and $\partial_{\mu}$ denotes the derivative with respect to $\theta^{\mu}$. Let us separate the computation of the metric tensor terms into three parts: the entries $g_{ij}$, the entries $g_{0i}$, for $i,j = 1,2,3$, and the element $g_{00}$. Substituting (4) into (3), making the change of variables $x = \theta + \sigma y$ and using the fact that $\partial_{i}\psi(y^{2}) = -2\, y_{i}\, \psi'(y^{2})/\sigma$, we get,

\[
g_{ij} \;=\; \frac{16}{\sigma^{2}} \int y_{i}\, y_{j}\, \bigl(\psi'(y^{2})\bigr)^{2}\, d^3y
\qquad (5)
\]
where $y^{2} = |y|^{2}$ is the Clifford product of the vector $y$ with itself. Carrying out the integration in spherical coordinates we obtain $g_{ij} = 0$ for $i \neq j$ and,

\[
g_{ii} \;=\; \frac{64\pi}{3\sigma^{2}} \int r^{4}\, |\psi'(r^{2})|^{2}\, dr
\qquad (6)
\]
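As a quick sanity check of (2) and (6), here is a minimal numerical sketch (my own illustration, not part of the paper; the numpy and scipy packages are assumptions of the sketch). It takes the Gaussian choice $\psi(r^{2}) = (2\pi)^{-3/4} e^{-r^{2}/4}$, for which the density (1) becomes the isotropic normal with scale $\sigma$, and checks that (2) holds and that (6) reduces to the familiar Fisher information $1/\sigma^{2}$ for each location coordinate.

\begin{verbatim}
# Sketch (assumed packages: numpy, scipy). Gaussian choice of psi:
# psi(r^2) = (2*pi)^(-3/4) * exp(-r^2/4), so that (1) is the isotropic normal.
import numpy as np
from scipy.integrate import quad

def psi(r2):
    return (2 * np.pi) ** (-0.75) * np.exp(-r2 / 4)

def dpsi(r2):                      # derivative with respect to the argument r^2
    return -0.25 * psi(r2)

# Normalization condition (2): the integral should equal 1/(4*pi).
lhs = quad(lambda r: r**2 * psi(r**2)**2, 0, np.inf)[0]
print(lhs, 1 / (4 * np.pi))

# Metric entry (6): for the Gaussian psi it should equal 1/sigma^2.
sigma = 1.7
g_ii = (64 * np.pi / (3 * sigma**2)) * quad(lambda r: r**4 * dpsi(r**2)**2, 0, np.inf)[0]
print(g_ii, 1 / sigma**2)
\end{verbatim}
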
The derivative with respect to $\sigma$ of the function given in (4) is,

\[
\partial_{0} f \;=\; -\frac{\sigma^{-5/2}}{2}\, \bigl[\, 3\psi + 4\, y^{2}\, \psi' \,\bigr]
\qquad (7)
\]
and therefore, computing as in (5), we have,

\[
g_{0i} \;=\; 4 \int (\partial_{0} f)(\partial_{i} f)\, d^3x \;\propto\;
\int \bigl[\, 3\psi + 4 y^{2}\psi' \,\bigr]\, y_{i}\, \psi'\, d^3y \;=\; 0
\qquad (8)
\]
where the value of 0 for the last integral follows by performing the integration in spherical coordinates, or simply by symmetry, after noticing that the integrand is odd. Finally, from (7) we get,

\[
g_{00} \;=\; \frac{4\pi}{\sigma^{2}} \int \bigl[\, 3\psi(r^{2}) + 4 r^{2}\psi'(r^{2}) \,\bigr]^{2}\, r^{2}\, dr
\qquad (9)
\]
Expanding the square and integrating the cross term by parts, we find that,

\[
\int r^{4}\, \psi(r^{2})\, \psi'(r^{2})\, dr \;=\; -\frac{3}{4}\left(\frac{1}{4\pi}\right)
\qquad (10)
\]

where we took $u = \psi r^{3}/2$ and $v' = 2 r \psi'$ for the integration by parts and used (2). We obtain,

\[
g_{00} \;=\; \frac{4\pi}{\sigma^{2}} \left[\, \frac{-9}{4\pi} \;+\; 16 \int r^{6}\, |\psi'(r^{2})|^{2}\, dr \,\right]
\qquad (11)
\]
The full metric tensor is then $(g) = \frac{1}{\sigma^{2}}\,\mathrm{diag}\bigl(J(\psi),K(\psi),K(\psi),K(\psi)\bigr)$, where $J(\psi)$ and $K(\psi)$ are just shorthand notations for the factors of $1/\sigma^{2}$ in (11) and (6). These functions are always positive and they depend only on $\psi$. Straightforward calculations, best done with a symbolic manipulator like MAPLE, show that a space with this metric has constant negative scalar curvature given by $-1/J(\psi)$. It follows that, for a fixed value of the function $\psi$, the hypothesis space of radially symmetric distributions $\Re^3(\psi)$ is the pseudo-sphere of radius $J^{1/2}(\psi)$. We have therefore shown that the space of radially symmetric distributions admits a foliation (i.e. a partition into submanifolds) by pseudo-spheres of increasing radius.
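The curvature claim can be checked without MAPLE. The following sketch (my own verification; the sympy package and the variable names are assumptions of the sketch) computes the Riemann scalar curvature of the metric $\frac{1}{\sigma^{2}}\mathrm{diag}(J,K,K,K)$, treating $J$ and $K$ as positive constants, and finds a negative constant that depends only on $J$. It is proportional to $-1/J$; the exact numerical factor depends on the normalization convention adopted for the scalar curvature, in agreement with the constant negative curvature stated above.

\begin{verbatim}
# Sketch (assumed package: sympy): scalar curvature of g = diag(J,K,K,K)/sigma^2.
import sympy as sp

sigma = sp.symbols('sigma', positive=True)
t1, t2, t3 = sp.symbols('theta1 theta2 theta3', real=True)
J, K = sp.symbols('J K', positive=True)
x = [sigma, t1, t2, t3]
g = sp.diag(J, K, K, K) / sigma**2          # the Fisher metric found in (6) and (11)
ginv = g.inv()
n = 4

# Christoffel symbols Gamma^a_{bc} of the Levi-Civita connection
Gamma = [[[sum(ginv[a, d] * (sp.diff(g[d, b], x[c]) + sp.diff(g[d, c], x[b])
                             - sp.diff(g[b, c], x[d])) for d in range(n)) / 2
           for c in range(n)] for b in range(n)] for a in range(n)]

def ricci(b, c):
    # R_{bc} = d_a Gamma^a_{bc} - d_c Gamma^a_{ab}
    #          + Gamma^a_{ad} Gamma^d_{bc} - Gamma^a_{cd} Gamma^d_{ab}
    expr = 0
    for a in range(n):
        expr += sp.diff(Gamma[a][b][c], x[a]) - sp.diff(Gamma[a][a][b], x[c])
        for d in range(n):
            expr += Gamma[a][a][d] * Gamma[d][b][c] - Gamma[a][c][d] * Gamma[d][a][b]
    return sp.simplify(expr)

R = sp.simplify(sum(ginv[b, c] * ricci(b, c) for b in range(n) for c in range(n)))
print(R)   # a negative constant, independent of sigma and theta, proportional to 1/J
\end{verbatim}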

This is a mathematical theorem. There can be nothing controversial about it. What may be disputed, however, is my belief that the hypothesis space of radially symmetric distributions may be telling us something new about the nature of real physical spacetime. What I find interesting is the fact that if we think of position as subject to radially symmetric uncertainty, then the mathematical object describing those positions (i.e. the space of their distributions) has all the symmetries of space plus time. It seems that time, or something like time, pops out automatically when we have uncertain positions. I like to state this hypothesis with the phrase: there is no time, only uncertainty.

2.1  Uncertain Spinning Space?

The hypothesis space of radially symmetric distributions is the space of distributions for a random vector $y \in \Re^3$ of the form,

\[
y \;=\; x + \epsilon
\qquad (12)
\]
where $x \in \Re^3$ is a non-random location vector, and $\epsilon \in \Re^3$ is a random vector with a distribution radially symmetric about the origin and with standard deviation $\sigma > 0$ in all directions. It turns out that exactly the same hypothesis space is obtained if instead of (12) we use,

\[
y \;=\; x + i\,\epsilon
\qquad (13)
\]
where $i$ is the constant unit pseudoscalar of the Clifford algebra of $\Re^3$. The pseudoscalar $i$ has unit magnitude, commutes with all the elements of the algebra, squares to $-1$, and represents the oriented unit volume of $\Re^3$ []. By taking expectations with the probability measure indexed by $(x,\sigma,\psi)$ we obtain that $E(y \mid x,\sigma,\psi) = x$ and,

\[
E(y^{2} \mid x,\sigma,\psi) \;=\; x^{2} - \sigma^{2}
\qquad (14)
\]
Equation (14) shows that, even though the space of radially symmetric distributions is infinite dimensional, on average the intervals look like the usual spacetime intervals.

We may think of $y$ in (13) as encoding a position in 3-space together with an uncertain degree of orientation given by the bivector part of $y$, i.e. $i\epsilon$. In other words, we assign to the point $x$ an intrinsic orientation of direction $\hat{\epsilon}$ and magnitude $|\epsilon|$. In this model the uncertainty is not directly about the location $x$ (as in (12)) but about its postulated degree of orientation (or spinning).
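To see (14) in action, here is a small Monte Carlo sketch (my own illustration, not from the paper; the numpy package is an assumption). It represents the Clifford algebra of $\Re^3$ by the Pauli matrices, so that the unit pseudoscalar becomes the imaginary unit times the identity, which indeed commutes with everything and squares to $-1$. It also assumes, as (14) suggests, that $\sigma^{2}$ stands for the total variance $E|\epsilon|^{2}$ of the radially symmetric error.

\begin{verbatim}
# Sketch (assumed package: numpy). Pauli-matrix model of Cl(3): e_k -> sigma_k,
# unit pseudoscalar i -> 1j * identity. Checks E(y^2) = x^2 - sigma^2 for y = x + i*eps.
import numpy as np

s1 = np.array([[0, 1], [1, 0]], dtype=complex)
s2 = np.array([[0, -1j], [1j, 0]], dtype=complex)
s3 = np.array([[1, 0], [0, -1]], dtype=complex)
pseudo = 1j * np.eye(2)                       # commutes with everything, squares to -1

def vec(v):                                   # a vector of R^3 as a Pauli matrix
    return v[0] * s1 + v[1] * s2 + v[2] * s3

def scalar_part(m):                           # scalar part of a Cl(3) element
    return np.real(np.trace(m)) / 2

rng = np.random.default_rng(0)
x = np.array([0.5, -1.0, 2.0])
sigma = 0.7                                   # assumption: sigma^2 = E|eps|^2
eps = rng.normal(scale=sigma / np.sqrt(3), size=(20000, 3))

ysq = [scalar_part((vec(x) + pseudo @ vec(e)) @ (vec(x) + pseudo @ vec(e))) for e in eps]
print(np.mean(ysq), x @ x - sigma**2)         # the two numbers should agree
\end{verbatim}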

3  Entropy and Ignorance

The notion of statistical entropy is not only related to the corresponding notion in physics; it is exactly the same thing, as demonstrated long ago by Jaynes []. Entropy appears indisputably as the central quantity of information geometry. In particular, from the Kullback number (relative entropy) between two distributions in the model we obtain the metric, the volume element, a large class of connections [], and a notion of ignorance within the model given by the so-called entropic priors []. In this section I present a simple argument, inspired by the work of Zellner on MDIP priors [], showing that entropic priors are the statistical representation of the vacuum of information in a given hypothesis space.

Let $\mathcal{H} = \{ f(x \mid \theta) : \theta \in \Theta \}$ be a general regular hypothesis space of probability density functions $f(x \mid \theta)$ for a vector of observations $x$ conditional on a vector of parameters $\theta = (\theta^{\mu})$. Let us denote by $f(x,\theta)$ the joint density of $x$ and $\theta$, and by $f(x)$ and $\pi(\theta)$ the marginal density of $x$ and the prior on $\theta$ respectively. We have $f(x,\theta) = f(x \mid \theta)\, \pi(\theta)$. Since $\mathcal{H}$ is regular, the Fisher information matrix,

\[
g_{\mu\nu}(\theta) \;=\; 4 \int (\partial_{\mu} f^{1/2})(\partial_{\nu} f^{1/2})\, dx
\qquad (15)
\]
exists and is continuous and positive definite (thus non-singular) at every $\theta$. As in (3), $\partial_{\mu}$ denotes the partial derivative with respect to $\theta^{\mu}$. The space $\mathcal{H}$ with the metric $g = (g_{\mu\nu})$ given in (15) forms a Riemannian manifold. Therefore, the invariant element of volume is given by $\eta(d\theta) \propto \sqrt{\det g(\theta)}\; d\theta$. This is in fact a differential form [] that provides a notion of surface area for the manifold $\mathcal{H}$ and it is naturally interpreted as the uniform distribution over $\mathcal{H}$. This formula, known as Jeffreys' rule, is often used as a universal method for building total ignorance priors. However, Jeffreys' rule does not take into account the fact that a truly ignorant prior for $\theta$ should contain as little information as possible about the data $x$. The entropic prior in $\mathcal{H}$ demands that the joint distribution of $x$ and $\theta$, $f(x,\theta)$, be as difficult as possible to discriminate from the independent model $h(x)\sqrt{\det g(\theta)}$, where $h(x)$ is an initial guess for $f(x)$. That is, we are looking for the prior that minimizes the Kullback number between $f(x,\theta)$ and the independent model or, in other words, the prior that makes the joint distribution of $x$ and $\theta$ have maximum entropy relative to the measure $h(x)\sqrt{\det g(\theta)}\, dx\, d\theta$. Thus, the entropic prior is the density $\pi(\theta)$ that solves the variational problem,

\[
\min_{\pi} \int f(x,\theta)\, \log \frac{f(x,\theta)}{h(x)\, \sqrt{\det g(\theta)}}\; dx\, d\theta
\qquad (16)
\]
Substituting $f(x,\theta) = f(x \mid \theta)\, \pi(\theta)$ into (16), simplifying, and using a Lagrange multiplier $\lambda$ for the normalization constraint $\int \pi(\theta)\, d\theta = 1$, we find that $\pi$ must minimize,

\[
\int \pi(\theta)\, I(\theta:h)\, d\theta \;+\; \int \pi(\theta)\, \log \frac{\pi(\theta)}{\sqrt{\det g(\theta)}}\; d\theta \;+\; \lambda \int \pi(\theta)\, d\theta
\qquad (17)
\]
where $I(\theta:h)$ denotes the Kullback number between $f(x \mid \theta)$ and $h(x)$, i.e.,

\[
I(\theta:h) \;=\; \int f(x \mid \theta)\, \log \frac{f(x \mid \theta)}{h(x)}\; dx
\qquad (18)
\]
The Lagrangian $\mathcal{L}$ is given by the sum of the integrands in (17) and the Euler-Lagrange equation is then,

\[
\frac{\partial \mathcal{L}}{\partial \pi} \;=\; I(\theta:h) \;+\; \log \frac{\pi(\theta)}{\sqrt{\det g(\theta)}} \;+\; 1 \;+\; \lambda \;=\; 0
\qquad (19)
\]
from which we obtain,

\[
\pi(\theta) \;\propto\; e^{-I(\theta:h)}\, \sqrt{\det g(\theta)}
\qquad (20)
\]
The numerical values of the probabilities obtained with formula (20) depend on the base of the logarithm used in (16). However, the base of the logarithm that appears in the definition of the Kullback number is arbitrary (entropy is defined only up to a proportionality constant). Thus, (20) is not just one density, but a family of densities,

\[
\pi(\theta \mid \alpha, h) \;\propto\; e^{-\alpha I(\theta:h)}\, \sqrt{\det g(\theta)}
\qquad (21)
\]
indexed by the parameter $\alpha > 0$ and the function $h$. Equation (21) is the family of entropic priors introduced in [] and studied in more detail in [], [] and [].

It was shown in [] that the parameter $\alpha$ should be interpreted as the number of virtual observations supporting $h(x)$ as a guess for the distribution of $x$. Large values of $\alpha$ should go with reliable guesses for $h(x)$ but, as was shown in [], the inferences are then less robust. This indicates that ignorant priors should be entropic priors with the smallest possible value for $\alpha$, i.e., with,

\[
\alpha^{*} \;=\; \inf \left\{ \alpha > 0 \;:\; \int e^{-\alpha I(\theta:h)}\; \eta(d\theta) \;<\; \infty \right\}
\qquad (22)
\]
Here is the canonical example.

3.1  Example: The Gaussians

Consider the hypothesis space of one-dimensional gaussians parameterized by the mean $\mu$ and the standard deviation $\sigma$. When $h$ is an arbitrary gaussian with parameters $\mu_{0}$ and $\sigma_{0}$, straightforward computations show that the entropic prior is given by,

\[
\pi(\mu,\sigma \mid \alpha,\mu_{0},\sigma_{0}) \;=\; \frac{1}{Z}\, \sigma^{\alpha-2}\,
\exp\!\left[ -\,\frac{\alpha\, \bigl( (\mu-\mu_{0})^{2} + \sigma^{2} \bigr)}{2\sigma_{0}^{2}} \right]
\qquad (23)
\]
where the normalization constant $Z$ is defined for $\alpha > 1$ and is given by,

\[
Z \;=\; \frac{\sqrt{\pi}}{2} \left( \frac{2}{\alpha} \right)^{\alpha/2} \Gamma\!\left( \frac{\alpha-1}{2} \right) \sigma_{0}^{\alpha}
\qquad (24)
\]
Thus, in this case $\alpha^{*} = 1$ and the most ignorant prior is obtained by taking the limits $\alpha \to 1$ and $\sigma_{0} \to \infty$ in (23), obtaining in the limit an improper density proportional to $1/\sigma$, which makes everybody happy, frequentists and bayesians alike.
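The computations behind (23) are easy to reproduce numerically. The sketch below (my own check; the numpy package is an assumption) builds the unnormalized entropic prior $e^{-\alpha I(\theta:h)}\sqrt{\det g(\theta)}$ of (21) from the Kullback number between two gaussians and from the Fisher metric $\mathrm{diag}(1/\sigma^{2}, 2/\sigma^{2})$ of the family $N(\mu,\sigma^{2})$, and verifies that it is proportional to formula (23).

\begin{verbatim}
# Sketch (assumed package: numpy): entropic prior of the 1-D gaussian family.
import numpy as np

mu0, s0, alpha = 1.0, 2.0, 3.0           # guess h = N(mu0, s0^2) and some alpha > 1

def kullback(mu, s):
    # Kullback number I(theta:h) between N(mu, s^2) and N(mu0, s0^2)
    return np.log(s0 / s) + (s**2 + (mu - mu0)**2) / (2 * s0**2) - 0.5

def entropic_unnormalized(mu, s):
    sqrt_det_g = np.sqrt(2.0) / s**2     # Fisher metric of N(mu, s^2): diag(1/s^2, 2/s^2)
    return np.exp(-alpha * kullback(mu, s)) * sqrt_det_g

def formula_23_unnormalized(mu, s):
    return s**(alpha - 2) * np.exp(-alpha * ((mu - mu0)**2 + s**2) / (2 * s0**2))

# The ratio of the two expressions should be the same constant at every (mu, sigma).
points = [(0.3, 0.5), (1.0, 1.0), (2.5, 4.0)]
print([entropic_unnormalized(m, s) / formula_23_unnormalized(m, s) for m, s in points])
\end{verbatim}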

4  Curvature and Information

Curvature seems to be well understood only in physics, especially from the point of view of gauge theories, where the curvature form associated with a connection has been shown to encode the field strengths of all four fundamental forces of nature []. In statistics, on the other hand, the only thing we know (so far) about the role of curvature is that the higher the scalar curvature is at a given point of the model, the more difficult it is to do estimation at that point. This already agrees nicely with the idea of black holes, for if in a given model there is a curvature $R_{0}$ beyond which estimation is essentially impossible, then the space is partitioned into three regions with curvatures $R < R_{0}$, $R = R_{0}$ and $R > R_{0}$ that correspond to regular points, horizon points and points inside black holes. Nobody has found an example of a hypothesis space with this kind of inferential black hole yet, but nobody has tried to look for one either. Before rushing into a hunt it seems necessary to clarify what exactly is meant by the words: estimation is essentially impossible at a point.

I believe that one of the most promising areas for research in the field of information geometry is the clarification of the role of curvature in statistical inference. If indeed physical spacetime can be best modeled as a hypothesis space, then what is to be learned from the research on statistical curvature will have direct implications for the nature of physical space. On the other hand, it also seems promising to re-evaluate what is already known in physics about curvature in the light of the proposed link with inference. Even a naive first look shows indications of what to expect for the role of curvature in inference. Here is an attempt at that first look.

From the classic statement:
Mass-energy is the source of gravity and the strength of the gravity field is measured by the curvature of spacetime

We guess:
Information is the source of the curvature of hypothesis spaces. That is, prior information is the source of the form of the model

From:
The dynamics of how mass-energy curves spacetime are controlled by the field equation $G = \kappa T$, where $G$ is the Einstein tensor, $T$ is the stress-energy tensor and $\kappa$ is a proportionality factor

guess:
The field equation controls the dynamics of how prior information produces models

From:
The field equation for empty space is the Euler-Lagrange equation that characterizes the extremum of the Hilbert action with respect to the choice of geometry. That is, it extremizes

\[
S_{g} \;=\; \int R\; d\Omega, \qquad d\Omega \;=\; \sqrt{\,|\det g|\,}\; d^{4}x,
\qquad (25)
\]
where the integral is taken over the interior of a four-dimensional region $\Omega$, $R$ is the scalar curvature and $g$ is the metric

guess1:
The form of hypothesis spaces based on no prior information must satisfy

\[
R_{ij} \;-\; \frac{1}{2}\, R\, g_{ij} \;=\; 0
\qquad (26)
\]
where $g_{ij}$ is the Fisher information matrix, $R_{ij}$ is the Ricci tensor and $R$ is the scalar curvature as above.

guess2:
Given a hypothesis space with Fisher information matrix $g(\theta)$, the Einstein tensor $G$, i.e. the left hand side of (26), quantifies the amount of prior information locally contained in the model at each point $\theta$.

5  Hyperbolicity

What seems most intriguing with respect to the link between information geometry and general relativity is the role of hyperbolicity. We know from general relativity that physical spacetimes are Riemannian manifolds which are locally Lorentzian. That is, at each point, the space looks locally like Minkowski space; or, in other words, the symmetries of the tangent space at each point are those of hyperbolic space. On the other hand, in information geometry, hyperbolicity appears at two very basic levels. First, hyperbolicity appears connected to the notion of regularity through the property of local asymptotic normality (LAN for short, see []). This is in close agreement with what happens in physics. The LAN property says that the manifold of distributions of $n$ independent and identically regularly distributed observations can be locally approximated by gaussians for large $n$, and since the gaussians are known to form hyperbolic spaces, the correspondence with physics is perfect. Second, in statistical inference hyperbolicity also appears mysteriously connected to entropy and Bayes' theorem (see my From Euclid to Entropy []), and by following the link back to general relativity we obtain a completely new and unexpected result: entropy and Bayes' theorem are the source of the local hyperbolicity of spacetime! That entropy and thermodynamics are related to general relativity may have seemed outrageous in the past, but not today. It does not seem outrageous at all when we consider that Bekenstein found that the entropy of a black hole is proportional to its surface area [], when we consider that Hawking discovered that black holes have a temperature [], and especially when we consider that Jacobson showed that the field equation is like an equation of state in thermodynamics [].
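The claim that the gaussians form hyperbolic spaces is easy to verify in the simplest case. The following sketch (my own illustration; the sympy package is an assumption) computes the Gaussian curvature of the Fisher metric $\mathrm{diag}(1/\sigma^{2}, 2/\sigma^{2})$ of the one-dimensional family $N(\mu,\sigma^{2})$, using the standard curvature formula for orthogonal coordinates, and finds the constant negative value $-1/2$.

\begin{verbatim}
# Sketch (assumed package: sympy): curvature of the Fisher metric of N(mu, sigma^2).
import sympy as sp

mu = sp.symbols('mu', real=True)
sigma = sp.symbols('sigma', positive=True)
E, G = 1 / sigma**2, 2 / sigma**2     # ds^2 = E dmu^2 + G dsigma^2 (Fisher metric)

# Gaussian curvature for an orthogonal metric whose coefficients depend on sigma only:
# K = -(1/sqrt(E*G)) * d/dsigma( (d sqrt(E)/dsigma) / sqrt(G) )
Kcurv = -sp.diff(sp.diff(sp.sqrt(E), sigma) / sp.sqrt(G), sigma) / sp.sqrt(E * G)
print(sp.simplify(Kcurv))             # -1/2: constant negative curvature (hyperbolic)
\end{verbatim}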

