
Entropic Priors

*****

1  Introduction

This essay is about a new method for building a prior model on the parameters of a regular family of distributions. Parts of this method have been known for some time, but the full generality of entropic priors is a recent discovery. With this single procedure it is possible to reproduce useful and well-known results in diverse areas such as time series, image processing and empirical Bayes. A deeper understanding of the basic ideas underlying the method could lead to better solutions for these applications, as well as to a new and more objective theory of inference.

The method is summarized by the following formula: "Given a regular parametric model containing probability measures P_θ, the entropic priors on θ are given by

    Pr(dθ | α, θ₀) ∝ exp{ -α I(θ:θ₀) } g^{1/2} dθ        (1)

where α is a positive scalar parameter, θ₀ is an initial guess for θ, I(θ:θ₀) denotes the Kullback number between the probability measures P_θ and P_θ₀, and g is the determinant of the Fisher information matrix at θ".

Equation (1) has a number of desirable properties: it is of general applicability, it is invariant under smooth changes in the coordinate systems of both the parameter and the data spaces, and it contains the methods of maximum entropy and Jeffreys' noninformative priors as two opposite poles (α → ∞, α → 0). Also, entropic priors are essentially implied by the assumptions of invariance and no a priori correlations (see Skilling ([],[]) and also Shore and Johnson ([]), Rodríguez ([],[])).

There is a very simple rationale behind equation (1). The amount of probability assigned to the parameter θ decreases exponentially with the Kullback distance from the initial guess θ₀. That is to say, the more difficult it is to discriminate the measure P_θ from P_θ₀, the more prior probability is given to it. The parameter α controls the sensitivity to changes in the distance. Thus, reliable θ₀'s should go with large α while unreliable guesses should be given small α. In the latter case the dependence of prior probabilities on θ is controlled not so much by the Kullback number (and therefore by the initial guess θ₀) but by the surface area of the model given by the invariant measure g^{1/2} dθ.

The entropic priors assign prior probabilities by combining in an intricate way the information-theoretic concept of entropy with the underlying Riemannian structure of the parametric model. This combination of entropy and geometry is even more intricate than suggested by the notation of equation (1), for the information metric, and therefore also its determinant g, arises from infinitesimal variations of the Kullback number (see equation (9) below).
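
To make formula (1) concrete, here is a minimal numerical sketch (in Python; all names and settings are invented for illustration) for the one-dimensional model N(θ, σ²) with known σ. In that model I(θ:θ₀) = (θ-θ₀)²/(2σ²) and the Fisher information is the constant 1/σ², so the entropic prior (1) is simply the Gaussian N(θ₀, σ²/α).

    import numpy as np

    def entropic_prior_density(theta, theta0=0.0, alpha=2.0, sigma=1.0):
        """Entropic prior (1) for the model N(theta, sigma^2) with known sigma.

        Here I(theta:theta0) = (theta - theta0)^2 / (2 sigma^2) and the Fisher
        information g = 1/sigma^2 is constant, so (1) reduces to the Gaussian
        density N(theta0, sigma^2/alpha) in theta.
        """
        kullback = (theta - theta0) ** 2 / (2.0 * sigma ** 2)
        unnormalized = np.exp(-alpha * kullback)           # exp{-alpha I(theta:theta0)}
        normalizer = np.sqrt(alpha / (2.0 * np.pi * sigma ** 2))
        return normalizer * unnormalized

    # Large alpha concentrates the prior near the initial guess theta0;
    # alpha -> 0 recovers the flat (Jeffreys) prior for this location parameter.
    thetas = np.linspace(-3.0, 3.0, 7)
    print(entropic_prior_density(thetas, theta0=0.0, alpha=5.0))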

1.1  Background: Entropy, Geometry, and Priors

Underlying the problem of inference we find the following fundamental question: How should the data and the prior information about a physical process be used to generate a manifold of distributions and a measure of prior uncertainty? The relevance of this question is clear, for once we fix the model and the prior, Bayesian inference is reduced to the computational problem of posterior probabilities. Needless to say, unless the meaning of prior information is explicitly defined, there is no general solution to this problem. However, if a parametric model is available for the data and all the prior information (besides the knowledge of this parametric model) is contained in the initial value of the parameter, then the entropic priors, as defined by equation (1), provide invariant assignments of prior probabilities with desirable properties.

Entropic priors are, therefore, part of the answer to the fundamental question stated above. Although they solve only half of the problem, they solve the important half. Due to the lack of general methods for transforming prior information into assignments of prior probabilities, there is greater agreement among statisticians about models than about priors. The maximum entropy formalism (see Jaynes []) and the total ignorance priors obtained from invariance (see Jeffreys [], and Jaynes []) are two isolated exceptions. It is remarkable that entropic priors include both methods as extreme special cases.

An overview of the main quantities that define formula (1) is presented below. To this aim it is necessary to introduce some standard definitions and results.

1.1.1  The Kullback Number

Let the data space be an arbitrary Hausdorff space X endowed with the Borel sigma field B. In most applications, X is just a measurable piece of ℝ^d but it is equally easy to deal with the more general case. The central quantity in the theory is the Kullback number I(P:Q) between a probability measure P and a σ-finite measure Q, both defined on the measurable data space (X, B). Define

    I(P:Q) = ∫ P(dD) log (dP/dQ)(D)        (2)
when P is absolutely continuous with respect to Q and the integral exists, and define it as ∞ otherwise. When both P and Q are probability measures absolutely continuous with respect to each other, the above integral always exists even though it may be infinite (see Kullback [] p. 5). Moreover, in this case, a simple application of Jensen's inequality shows that I(P:Q) is non-negative and that it is zero only when P ≡ Q. This last property makes it very similar to a metric, but of course it is not symmetric and it does not satisfy the usual triangle inequality.

In statistical language, the Kullback number is the expected log-likelihood ratio of P to Q when P is the "true" model. In information theoretic terms, it is said to measure the mean information per observation for discrimination in favor of P and against Q when sampling from P. In stronger information theoretic language, it is the expected amount of information transmitted by the message: "The information source has been moved from Q to P". All these well established interpretations convey the same idea of separation of P from Q. But, the large deviation property reproduced below explicitly quantifies this separation:

Let X = {1, 2, ..., k} and let q be the probability measure that assigns probability q_j to the j-th element of X. We can imagine an urn with known proportions q_j of balls of type j for j = 1,2,...,k. If we draw n balls (with replacement) from this urn and we denote by n_j the number of observed balls of type j, then the probability of observing a frequency distribution p_1, p_2,..., p_k with p_j = n_j / n is given by the multinomial

    Pr(p | q) = Pr(n_1,...,n_k | q) = ( n! / (n_1! n_2! ... n_k!) ) q_1^{n_1} ... q_k^{n_k}        (3)
i.e. the chance of seeing a  p-distribution when sampling n times from a  q-distribution. Taking the logarithm on both sides of equation (3) and using Stirling's formula up to first order i.e.

    log m! = m log m - m + o(m)        (4)
we obtain after a little simplification that,

    log Pr(p|q) = -n Σ_{j=1}^{k} p_j log( p_j / q_j ) + o(n)        (5)

Therefore, when all the n_j's are large, this last equation together with (2) implies

    Pr(p|q) ∝ e^{-n I(p:q)}        (6)
which has the exact form of (1). Notice that since n is large but finite, there are only a finite number of p-distributions, and that is why the g in (6) is constant; there is no underlying continuously parametrized manifold of distributions, only a finite number of them. Also, we obtain once more the parameter α associated with the amount of information about the initial guess θ₀ (whose role here is played by the q-distribution). In this case, however, the term "amount of information" takes the explicit form of the number of observations from the q-distribution.
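
The large-deviation statement (6) is easy to check numerically. The sketch below (Python; the particular p and q are arbitrary) compares the exact multinomial log-probability (3), per observation, with -I(p:q) as n grows.

    from math import lgamma, log

    def log_multinomial(counts, q):
        """Exact log Pr(n_1,...,n_k | q) from equation (3)."""
        n = sum(counts)
        return (lgamma(n + 1) - sum(lgamma(c + 1) for c in counts)
                + sum(c * log(qj) for c, qj in zip(counts, q)))

    def kullback(p, q):
        """Kullback number I(p:q) for discrete distributions, equation (2)."""
        return sum(pj * log(pj / qj) for pj, qj in zip(p, q) if pj > 0)

    q = [0.2, 0.3, 0.5]        # urn proportions (the q-distribution)
    p = [0.4, 0.4, 0.2]        # observed frequency distribution
    for n in (10, 100, 1000, 10000):
        counts = [round(n * pj) for pj in p]
        print(n, log_multinomial(counts, q) / n, -kullback(p, q))
    # The per-observation log-probability approaches -I(p:q) as n grows,
    # which is the content of equation (6).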

1.1.2  Fisher Information Metric

The connection between α and the amount of information, in the sense of number of observations, holds in general, as is shown below. Before we prove it we notice the classic relationship between entropy and geometry, i.e., the natural Riemannian metric on a regular model appears from second variations of the entropy. More explicitly, the Riemannian length of a tangent vector v attached to the point θ is proportional to I(P_{θ+εv} : P_θ) when ε is small. To see this, just consider (for fixed θ and v) the function u(ε) = I(P_{θ+εv} : P_θ), which has a global minimum of zero at ε = 0. Expanding up to second order terms in ε we have,

    u(ε) = u(0) + ε u'(0) + (1/2) ε² u''(0) + o(ε²)        (7)
By straightforward computation we find u(0) = u'(0) = 0 and

    u''(0) = Σ_{i,j} v_i [ ∫ (1/p_θ)(∂p_θ/∂θ_i) (1/p_θ)(∂p_θ/∂θ_j) P_θ(dD) ] v_j        (8)
where p_θ = p_θ(D) denotes the density of P_θ with respect to an arbitrary, but fixed, dominating measure. The last formula defines a quadratic form on tangent vectors v. Notice that the quadratic form must be positive definite because u(ε) has a global minimum at ε = 0. The matrix of this quadratic form is known as the Fisher information matrix and the induced Riemannian metric is known as the information metric^1 (see [],[] or []). Thus,

    I(P_{θ+εv} : P_θ) = (ε²/2) Σ_{i,j} v_i g_ij(θ) v_j + o(ε²)        (9)

where g_ij(θ) are the elements of Fisher's matrix.
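
Equation (9) can also be checked numerically. The following sketch (Python; the Gaussian location-scale family is chosen only for illustration, with coordinates θ = (μ,σ) and Fisher matrix diag(1/σ², 2/σ²)) compares I(P_{θ+εv} : P_θ) with its quadratic approximation (ε²/2) vᵀ g(θ) v.

    import numpy as np

    def kl_gauss(mu1, s1, mu2, s2):
        """Kullback number I(P1:P2) between N(mu1, s1^2) and N(mu2, s2^2)."""
        return np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2.0 * s2**2) - 0.5

    def fisher(mu, s):
        """Fisher information matrix of N(mu, s^2) in the coordinates (mu, s)."""
        return np.diag([1.0 / s**2, 2.0 / s**2])

    theta = np.array([0.0, 1.0])     # the point (mu, sigma)
    v = np.array([1.0, -0.5])        # an arbitrary tangent vector
    g = fisher(*theta)
    for eps in (0.1, 0.01, 0.001):
        shifted = theta + eps * v
        lhs = kl_gauss(shifted[0], shifted[1], theta[0], theta[1])
        rhs = 0.5 * eps**2 * (v @ g @ v)    # quadratic approximation from (9)
        print(eps, lhs, rhs)
    # The ratio lhs/rhs tends to 1 as eps -> 0: the information metric is the
    # second variation of the Kullback number.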

1.1.3  Prior Information is More Data

An interesting question suggested by (1) that turns out to have a surprisingly simple and useful answer is the following: What are the entropic priors when the data space consists of vectors D = (x1,...,xn) of independent and identically distributed components?

To answer the above question let P_θ^(n) and g_ij^(n)(θ) denote the corresponding quantities in the data space X^n. The components are iid so,

    P_θ^(n)(dD) = Π_{i=1}^{n} P_θ(dx_i).        (10)

Therefore, from this, (2), and Fubini's theorem we have

    I(P_θ^(n) : P_θ₀^(n)) = n I(P_θ : P_θ₀).        (11)

From equations (9) and (11) we can write

    g_ij^(n)(θ) = n g_ij(θ)        (12)
and taking the determinant on both sides of this matrix equation we obtain g^(n) ∝ g, which together with (1) and (11) implies

    Pr^(n)(dθ | α, θ₀) = Pr(dθ | nα, θ₀)        (13)

Equation (13) links the positive scalar parameter α to the number of observations. It suggests regarding α as a kind of continuous virtual number of independent observations supporting the choice of θ₀ as the initial guess. This interpretation squares nicely with our previous results. Moreover, from equation (13) we can write the posterior as

    Pr^(n)(dθ | D, α, θ₀) ∝ Π_{i=1}^{n} { P_θ(dx_i) e^{-α I(θ:θ₀)} } g^{1/2}(θ) dθ        (14)
and therefore the posterior densities with respect to the invariant measure on the manifold satisfy

    p^(n)(θ | x_1,...,x_n, α, θ₀) ∝ Π_{i=1}^{n} p(θ | x_i, α, θ₀)        (15)

The last equation not only provides a recursive procedure that simplifies the computation of the posteriors when new data are available, but it also reinforces the above interpretation for α. Given the classic connection between the Kullback number and the exponential family, a more detailed examination of the form of the posterior in the exponential family case (especially for θ close to θ₀) should produce interesting and informative results.
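
For the Gaussian location model N(θ, σ²) with known σ (treated in the empirical Bayes section below), this recursive reading can be written in closed form: the entropic prior is N(θ₀, σ²/α), and updating with one observation at a time while increasing α by one reproduces the all-at-once posterior, which is the content of (13) and (15) in this conjugate case. A small sketch (Python) under that assumed model:

    import numpy as np

    def posterior_params(xs, theta0, alpha, sigma):
        """Posterior N(mean, var) for the mean of N(theta, sigma^2) data under
        the entropic prior N(theta0, sigma^2/alpha) of equation (17) below."""
        n = len(xs)
        mean = (alpha * theta0 + np.sum(xs)) / (alpha + n)
        var = sigma**2 / (alpha + n)
        return mean, var

    rng = np.random.default_rng(0)
    sigma, theta0, alpha = 1.0, 0.0, 3.0
    xs = rng.normal(2.0, sigma, size=10)

    # All-at-once update ...
    m_all, v_all = posterior_params(xs, theta0, alpha, sigma)

    # ... equals processing one observation at a time, feeding each posterior
    # back in as an entropic prior whose alpha has grown by one observation.
    m, a = theta0, alpha
    for x in xs:
        m, _ = posterior_params([x], m, a, sigma)
        a += 1
    print(m_all, v_all, m, sigma**2 / a)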

1.2  Applications

1.2.1  Empirical Bayes

The empirical Bayes approach to statistics is due to H. Robbins, who has demonstrated the usefulness of this technique in both the parametric and nonparametric cases. However, in the parametric case a prior over the parameters is needed and, as is shown here with an example, equation (1) supplies one that makes sense.

Efron and Morris were the first to provide an empirical Bayes context for Stein-type estimators, but their priors and unbiased method of estimation were chosen in an ad-hoc manner explicitly designed to obtain the result (see Robbins' collection of selected papers [] and the list of references on pages 1-5). The simplest non-trivial application of entropic priors generates the celebrated Stein shrinking phenomenon without the ad-hockeries and at the same time provides a rationale for it.

Let x = (x_1,...,x_k) be a single observation from a k-variate Gaussian distribution with unknown mean vector θ but known variance matrix σ²I with σ > 0, i.e. P_θ ≡ N(θ, σ²I). In this case the Kullback number is given by

    I(θ:θ₀) = (1/(2σ²)) E_θ{ |x - θ₀|² - |x - θ|² } = (1/(2σ²)) |θ - θ₀|²        (16)
i.e. proportional to the square of the Euclidean distance between θ and θ₀. Thus, it follows directly from equation (9) that the information matrix is proportional to the identity and that the invariant measure is proportional to the Lebesgue measure on ℝ^k. Therefore, substituting in (1) we obtain the entropic prior

    Pr(dθ | α, θ₀) ∝ exp( -(α/(2σ²)) |θ - θ₀|² ) dθ.        (17)

The prior is then N(θ₀, (σ²/α) I). Let f(x|α,θ₀) denote the marginal density of the data vector with respect to the Lebesgue measure on ℝ^k. By conditioning on θ and using the above entropic prior we have

    f(x|α,θ₀) ∝ ∫ exp{ -(1/(2σ²)) [ |x-θ|² + α|θ-θ₀|² ] } dθ        (18)
thus, after using the identity

    |x-θ|² = |x-θ₀|² + |θ₀-θ|² - 2⟨x-θ₀, θ-θ₀⟩        (19)
and completing the square inside the exponential, it follows that

    f(x|α,θ₀) ∝ e^{ -α|x-θ₀|² / (2(α+1)σ²) } ∫ e^{ -(α+1) |θ - (θ₀ - (1/(α+1))(θ₀ - x))|² / (2σ²) } dθ
from which we can immediately deduce that the marginal distribution of x is given by

    f(x|α,θ₀) ≡ N( θ₀, ((α+1)/α) σ² I )        (20)
and that the posterior distribution of θ after observing a single x is proportional to the expression inside the integral, i.e.,

    p(θ | x, α, θ₀) ≡ N( x - (α/(α+1))(x - θ₀), (σ²/(α+1)) I ).        (21)

Hence, for quadratic loss, the Bayes estimate of θ is not the single observed vector x but the mean of the posterior, which shows a shrinkage towards the guessed value θ₀. The amount of shrinkage can be estimated by extracting the information about α contained in the observed vector x. To see this, let λ = α/(α+1) and notice that from (20) and (21) it follows that:

    f(x | λ, θ₀) ≡ Π_{j=1}^{k} N( θ₀j, σ²/λ )        (22)
and,
    E(θ | x, λ, θ₀) = x - λ(x - θ₀).        (23)
The idea is to replace the unknown λ in (23) by the most conservative estimator that can be obtained from the inference problem defined by (22). Guided again by equation (1), it follows that in each independent component of x the total ignorance prior for the parameters of the j-th problem is given by:
    h(dθ₀j, dλ) ∝ dλ dθ₀j / λ².        (24)
Thus, the ignorance prior for all the parameters is:
    h(dθ₀, dλ) ∝ dλ dθ₀ / λ²        (25)
This prior makes λ ignorant not only of θ₀ but also of the dimensionality k. From the prior (25) and the likelihood (20) it follows that,
    Pr(dλ | θ₀, x) ∝ λ^{k/2 - 2} exp{ -λ|x-θ₀|² / (2σ²) } dλ
and this is a gamma distribution with mean
    E(λ | θ₀, x) = (k-2)σ² / |x-θ₀|².        (26)
Replacing the unknown λ in (23) by the right-hand side of (26) we obtain the estimator:
    θ* = x - ( (k-2)σ² / |x-θ₀|² ) (x - θ₀)        (27)
which is the original Stein estimator, having uniformly better risk than x for all choices of θ₀ when k ≥ 3. Thus, in dimensions greater than two, x is inadmissible (see for example []). These are all classic results of the theory of estimation, produced effortlessly from equation (1).

Stein's estimator exemplifies the general (Entropic) Empirical Bayes approach: first, reduce the problem of estimation of the (multi)parameter θ to the estimation of the positive scalar parameter α, as in (20). Second, replace in the posterior distribution of θ the unknown α by its estimator, as in (21) and (27).
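
A quick Monte Carlo sketch (Python; the dimension, the true mean and the number of replications are arbitrary choices) of the estimator (27), comparing its quadratic risk with that of the raw observation x:

    import numpy as np

    def stein_estimate(x, theta0, sigma):
        """Estimator (27): shrink the observation x towards the guess theta0."""
        k = x.size
        shrink = (k - 2) * sigma**2 / np.sum((x - theta0) ** 2)
        return x - shrink * (x - theta0)

    rng = np.random.default_rng(1)
    k, sigma = 10, 1.0
    theta = rng.normal(0.0, 2.0, size=k)     # "true" mean vector
    theta0 = np.zeros(k)                     # initial guess
    n_rep = 20000

    risk_x, risk_stein = 0.0, 0.0
    for _ in range(n_rep):
        x = theta + rng.normal(0.0, sigma, size=k)
        risk_x += np.sum((x - theta) ** 2)
        risk_stein += np.sum((stein_estimate(x, theta0, sigma) - theta) ** 2)
    print("risk of x:", risk_x / n_rep, "  risk of (27):", risk_stein / n_rep)
    # For k >= 3 the estimator (27) beats x on average, whatever theta0 is.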

1.2.2  Time Series

It is shown in this section that the general formalism of entropic priors can be applied to time series problems. Preliminary results for the model of a parametric signal plus white noise are shown.

The Signal Manifold

Let us assume that the vector of observations x = (x_1,...,x_N), collected at times t_1, t_2,...,t_N, not necessarily equally spaced, can be modeled by:

    x_l = f(t_l, θ) + e_l,     for l = 1,2,...,N        (28)
where θ ∈ Θ ⊂ ℝ^k, and the set

    S = { r(θ) = (f(t_1,θ),...,f(t_N,θ)) ∈ ℝ^N : θ ∈ Θ }        (29)
is a smooth k-dimensional surface in ℝ^N. The intrinsic geometry of S is characterized by its metric tensor:


    ḡ_ij(θ) = ⟨ ∂r/∂θ_i , ∂r/∂θ_j ⟩ = Σ_{l=1}^{N} f_i(t_l,θ) f_j(t_l,θ)        (30)
where f_i(t_l,θ) denotes the partial derivative of f(t_l,θ) with respect to θ_i. The errors e_l are assumed^2 to be independent for different times and Gaussian with mean 0 and variance σ² for all times. Hence, the joint likelihood function for θ and σ, denoted by L(θ,σ), is given by,

    L(θ,σ) ∝ σ^{-N} exp{ -(1/(2σ²)) Σ_{l=1}^{N} (x_l - f(t_l,θ))² }        (31)
that defines the hypothesis space of all the N-variate Gaussian probability measures with mean vector on the surface S and variance matrix σ²I. The Riemannian geometry on this hypothesis space is intimately related to the geometry on S. The metric tensor (or Fisher information matrix [g(θ,σ)]) in this space can be obtained by taking the derivatives of the log-likelihood (see equation (8) or Amari ([]), Rodríguez ([])). Hence, the matrices [g(θ,σ)] and [ḡ(θ)] satisfy,

    [g(θ,σ)] = (1/σ²) ( [ḡ(θ)]    0  )
                      (    0     2N  ).

The total ignorance prior, h(dθ,dσ), can be written in terms of the square root of the determinant of the previous matrix. We have,

    h(dθ,dσ) ∝ σ^{-(k+1)} |ḡ(θ)|^{1/2} dθ dσ.        (32)

Notice that when σ is assumed to be known, the non-informative prior for θ coincides with the k-dimensional surface area measure of S and the Kullback number is similar to (16). By looking at equation (21) it is therefore reasonable to estimate θ by θ* such that,

    r(θ*) = x - (α/(α+1)) ( x - r(θ₀) ).        (33)

It is natural to expect θ* to show properties similar to Stein's estimator, but only when the surface S is sufficiently flat between θ* and θ₀. The exact solution could be computed (or at least approximated in specific examples) and it would provide a new way for recovering a signal buried in white noise.
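
The following sketch (Python with scipy; the sinusoidal signal, the noise level and the use of a least-squares fit to map the shrunk vector back onto S are all assumptions of the sketch, not part of the exact solution) implements the shrinkage rule (33) for a toy signal:

    import numpy as np
    from scipy.optimize import curve_fit

    def signal(t, amp, freq):
        """A two-parameter signal f(t, theta) with theta = (amp, freq)."""
        return amp * np.sin(freq * t)

    rng = np.random.default_rng(2)
    t = np.linspace(0.0, 10.0, 200)
    x = signal(t, 1.0, 2.0) + rng.normal(0.0, 0.5, size=t.size)   # data as in (28)

    theta0 = (0.8, 2.1)    # initial guess, supported by alpha virtual observations
    alpha = 5.0

    # Right-hand side of (33): shrink the data towards the guessed signal r(theta0).
    target = x - (alpha / (alpha + 1.0)) * (x - signal(t, *theta0))

    # theta* is taken here as the point of S closest (in least squares) to the
    # shrunk vector; this is an approximation, not the exact entropic-prior answer.
    theta_star, _ = curve_fit(signal, t, target, p0=theta0)
    print("theta* =", theta_star)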

Separating Frequencies from Amplitudes

In the rest of this section we analyze the time series estimation problem when a little more structure than just S is assumed for the signal. Computations are significantly simplified if linear (usually nuisance) parameters are separated from non-linear (frequency-like) parameters. Following Bretthorst ([]) let us assume that the signal can be modeled by:

    f(t,θ) = Σ_{j=1}^{m} B_j G_j(t,ω)        (34)
where θ = (B_1,...,B_m, ω_1,...,ω_{k-m}) ≡ (B,ω). Using (30) we can write:


    ḡ_ij = Σ_{l=1}^{N} G_i(t_l,ω) G_j(t_l,ω)     for i,j ≤ m        (35)
which are independent of B. The other components are given by,


    ḡ_ia = Σ_{j=1}^{m} B_j ( Σ_{l=1}^{N} G_i(t_l,ω) G_ja(t_l,ω) )     for i ≤ m, a > m        (36)
where G_ja denotes the partial derivative of G_j with respect to θ_a ≡ ω_{a-m}. These components are linear in B. Finally we can write:

    ḡ_ab = Σ_{i,j} B_i B_j ( Σ_{l=1}^{N} G_ia(t_l,ω) G_jb(t_l,ω) )     for a,b > m.

These components are quadratic in B. The whole matrix [ḡ(B,ω)] can be split into four blocks:

    [ḡ(B,ω)] = ( [ḡ_{B,B}]   [ḡ_{B,ω}] )   =   ( [Indep. of B]   [Linear in B]    )
               ( [ḡ_{B,ω}]   [ḡ_{ω,ω}] )       ( [Linear in B]   [Quadratic in B] ).
The determinant of this matrix is a sum of products of elements, one taken from each row and column, all having the form:

    ±u(ω) B_1^{a_1} B_2^{a_2} ... B_m^{a_m}     with a_j ≥ 0 and Σ_{j=1}^{m} a_j = 2(k-m).        (37)

Hence, the determinant shows the following homogeneity property:

    |ḡ(λB,ω)| = λ^{2(k-m)} |ḡ(B,ω)|.        (38)

Moreover, [ḡ_{B,B}] is a symmetric positive definite matrix that also appears in the likelihood:

    L(B,ω,σ) ∝ σ^{-N} exp{ -(1/(2σ²)) Σ_{l=1}^{N} ( x_l - Σ_{j=1}^{m} B_j G_j(t_l,ω) )² } ≡ σ^{-N} exp{ -NQ/(2σ²) }
where Q is defined by,

    Q = d̄² - (2/N) Σ_j h_j B_j + (1/N) Σ_{i,j} B_i B_j ḡ_ij        (39)
also,


    d̄² = (1/N) Σ_{l=1}^{N} (x_l)²     and     h_j = Σ_{l=1}^{N} x_l G_j(t_l,ω).        (40)

Thus, by diagonalizing [ḡ_{B,B}] we simplify not only the ignorance prior but also the likelihood function. It is, therefore, straightforward to rewrite the analysis in [] using equation (1) as the prior. For α = 0 (the total ignorance case) we obtain the posterior distribution for the frequencies given by:

    Pr(dω | x, α = 0) ∝ √|ḡ(h,ω)| ( 1 - m h̄² / (N d̄²) )^{-(N+k-m)/2} dω        (41)
which is different from the result obtained in []. However, when N is large, the generalized periodogram h̄² peaks about the "true" ω, but from equation (38) the term √|ḡ(h,ω)| increases only polynomially in h, which is negligible compared to the other term, which increases exponentially in h. Hence, for large N equation (41) coincides with the result in Bretthorst. We have computed both posteriors on simulated data containing one or two harmonics, observing always that the two curves become indistinguishable very rapidly for N around 20.

Hence, non-informative entropic priors reproduce the results in the Bayesian spectral analysis of Bretthorst. Moreover, entropic priors make it possible to incorporate definite prior knowledge (e.g. about the frequencies ω), improving the estimate of the signal.
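
A minimal numerical sketch of (41) for a single harmonic is given below (Python; the identification of m h̄² with the squared norm of the projection of the data onto the span of the model functions, and the finite-difference evaluation of ḡ(h,ω), are assumptions of this sketch).

    import numpy as np

    def model_matrix(t, w):
        """Columns are the model functions G_j(t, w) for one harmonic (m = 2)."""
        return np.column_stack([np.cos(w * t), np.sin(w * t)])

    def log_posterior_41(x, t, w, k=3, m=2):
        """Unnormalized log of (41); the amplitude block is evaluated at the
        least-squares amplitudes and gbar is obtained by finite differences."""
        N = x.size
        G = model_matrix(t, w)
        gBB = G.T @ G                      # equation (35)
        h = G.T @ x                        # equation (40)
        d2 = np.mean(x**2)                 # equation (40)
        # m*hbar^2 is taken here as the squared norm of the projection of x
        # onto the span of the model functions (an assumption of this sketch).
        mh2 = h @ np.linalg.solve(gBB, h)
        B = np.linalg.solve(gBB, h)        # least-squares amplitudes
        def r(theta):                      # point of the signal manifold S
            return model_matrix(t, theta[2]) @ theta[:2]
        theta = np.array([B[0], B[1], w])
        eps = 1e-6
        J = np.column_stack([(r(theta + eps * np.eye(3)[i]) - r(theta)) / eps
                             for i in range(3)])
        gbar = J.T @ J                     # equation (30) at theta = (B, w)
        return (0.5 * np.linalg.slogdet(gbar)[1]
                - 0.5 * (N + k - m) * np.log(1.0 - mh2 / (N * d2)))

    rng = np.random.default_rng(3)
    t = np.arange(50.0)
    x = np.cos(0.7 * t) + 0.5 * np.sin(0.7 * t) + rng.normal(0.0, 1.0, t.size)
    ws = np.linspace(0.5, 0.9, 81)
    lp = np.array([log_posterior_41(x, t, w) for w in ws])
    print("posterior mode near", ws[np.argmax(lp)])   # expected near 0.7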

1.2.3  Image Reconstruction

All image reconstruction problems are inverse problems. The mathematics of image reconstruction can be reduced, in theory, to the inversion of a linear functional operator. The (approximate) linear functional relation between the input (object) and the output (data) is fixed by the physics of the particular situation, but the basic idea common to most methods is very simple (see []). The object f to be reconstructed is hit with some kind of radiation, and a characteristic g of the scattered output field is recorded. Typical examples include: radio frequency in NMR, x-rays in computed tomography, or sound waves in ultrasound imaging. Usually, g is linearly related to f through

    g = A f        (42)
where A is a linear functional from the hypothesis space F (also known as image space or solution space) into the data space G. The functional A is typically a Fredholm integral operator that can be specified in terms of some kernel K(x,y) in the form:
    (Af)(y) = ∫ K(x,y) f(x) dx.        (43)
Hence, in theory the object f is obtained from g by,
    f = A^{-1} g        (44)
but in practice, the value of g is only known with noise at N discrete points y_1, y_2,...,y_N. What is observed are N noisy data,
    D_j = g(y_j) + e_j = ∫ K(x, y_j) f(x) dx + e_j     for j = 1,...,N.        (45)
Therefore, in reality what needs to be solved is not the algebraic deterministic problem (44) but the inference problem: guess f from the N noisy data. Notice that (45) is equivalent to (28) with a very large (possibly infinite) dimensional θ.

Digital Imaging

A discretization of x into pixels transforms problem (45) into a regression problem but with more parameters than data. The data alone is not enough for ranking all the possible pictures in F, and a prior distribution over F is needed.

Prior knowledge about the set of possible images can be used to reduce the dimensionality of the hypothesis space F, providing an encoding of the image with fewer parameters than pixels. However, if nothing is known about the picture, the space F can be identified with the set of all the non-degenerate probability measures over the finite set of pixels. The Kullback number and the invariant measure in this space of discrete distributions are easily computed from their respective definitions. Substituting in (1), it follows that the log of the entropic prior density is

    log{ Pr(df|α,m) / df } ∝ -α Σ_i f_i log( f_i / m_i ) + (1/2) log( 1 / Π_i f_i )        (46)
where f_i = f(x_i), i.e. the proportion of luminosity assigned to the i-th pixel, and m denotes the initial guess for f.

The likelihood of f is fixed by choosing a distribution for the errors in (45); hence, the most likely posterior choice for f is the solution of:


    max_{f ∈ F} { L(f) - α Σ_i f_i log( f_i / m_i ) + (1/2) log( 1 / Π_i f_i ) }.        (47)
where L(f) denotes the logarithm of the probability of the data given f. The MAP defined by the solution of (47) is what is computed by the celebrated MEMSYS algorithm of Skilling, Gull and their co-workers (see []). It is interesting to notice that the last term in (47), i.e. the invariant measure on F, was discovered experimentally by testing different functions of f to penalize values close to zero (Skilling, personal communication; see also []).
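
The objective (47) can be coded directly. The sketch below (Python with scipy; the one-dimensional "image", the blurring kernel and all settings are invented, and this is not the MEMSYS algorithm) computes the MAP reconstruction for a tiny problem with Gaussian errors in (45).

    import numpy as np
    from scipy.optimize import minimize

    def neg_objective(f, data, K, m, alpha, sigma):
        """Negative of the MAP objective (47), with Gaussian errors in (45)."""
        f = np.clip(f, 1e-12, None)
        log_lik = -np.sum((data - K @ f) ** 2) / (2.0 * sigma**2)    # L(f)
        entropy_term = -alpha * np.sum(f * np.log(f / m))
        invariant_term = -0.5 * np.sum(np.log(f))    # (1/2) log(1/prod f_i)
        return -(log_lik + entropy_term + invariant_term)

    n_pix = 16
    rng = np.random.default_rng(4)
    true_f = np.exp(-0.5 * ((np.arange(n_pix) - 8) / 2.0) ** 2)
    true_f /= true_f.sum()
    K = np.array([[np.exp(-0.3 * abs(i - j)) for j in range(n_pix)]
                  for i in range(n_pix)])            # a made-up blurring kernel
    sigma = 0.01
    data = K @ true_f + rng.normal(0.0, sigma, n_pix)

    m = np.full(n_pix, 1.0 / n_pix)                  # flat initial guess
    res = minimize(neg_objective, m.copy(), args=(data, K, m, 1.0, sigma),
                   method="SLSQP", bounds=[(1e-9, 1.0)] * n_pix,
                   constraints=[{"type": "eq", "fun": lambda f: f.sum() - 1.0}])
    print("reconstruction error:", np.linalg.norm(res.x - true_f))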

An Example with Binary Images

To illustrate the use of the main formula (1) when specific prior information is available, consider the following simple problem: on a square of M pixels (e.g. the terminal screen) a few rectangles of random lengths and widths are allocated at random (e.g. by turning on the pixels of the rectangles). The image is then destroyed as follows: each of the M pixels is independently reversed with probability 0 < q < 0.5. It is required to recover the original image as closely as possible from these data.

Let's denote by θ = (θ_1,...,θ_M) the unknown image, where

    θ_j = 1 if the j-th pixel was originally on, and θ_j = 0 if it was originally off;
similarly, let z = (z_1,...,z_M) be the observed pixels (i.e. the pixels of the destroyed image). Thus, the likelihood is given by:
    Pr(z | θ, q) = Π_{j=1}^{M} q^{|z_j - θ_j|} (1-q)^{1-|z_j - θ_j|}        (48)
Let's denote by P_{θ,q} the probability measure in (48). The Kullback number for a given initial guess P_{m,r} simplifies to,
    I(P_{θ,q} : P_{m,r}) = M I(q:r) - (1-2q) log( r/(1-r) ) Σ_{j=1}^{M} 1(θ_j ≠ m_j)        (49)
where 1(A) is 1 if A is true and zero otherwise and,
    I(q:r) = q log(q/r) + (1-q) log( (1-q)/(1-r) ).
The entropic prior density is then given by
    p(θ,q | α,m,r) ∝ exp{ -α I(P_{θ,q} : P_{m,r}) }        (50)
but relative to the invariant measure on the hypothesis space. The information given in the problem is not sufficiently precise to fix this measure from first principles. However, it is clear that it should decrease rapidly with the number of rectangles of the image θ, to agree with the information that θ contains only a few rectangles. A useful choice is
    h(dθ,dq) ∝ exp{ -κ NR(θ) } dθ dq        (51)
where κ is a positive scalar parameter (like α) and NR(θ) denotes the number of rectangles in the image θ. By using equations (48), (49), (50) and (51) it follows that the posterior mode is obtained by solving:

    max_{θ,q} { M log(1-q) + log( q/(1-q) ) Σ_{j=1}^{M} 1(θ_j ≠ z_j)
                + α(1-2q) log( r/(1-r) ) Σ_{j=1}^{M} 1(θ_j ≠ m_j) - αM I(q:r) - κ NR(θ) }.        (52)
An algorithm to approximate the solution of problem (52) could be designed, but only after deciding what to do with the parameters α, κ, r, m. A fast and simple smoothing of the data will fix r and m; then (assuming enough computational muscle) solve (52) for different values of α and κ. Although (52) is the problem to solve, when the proportion of pixels that are on in the original image θ is small (i.e. few rectangles in θ), then q can be closely approximated from the data. In this case the optimization problem (52) simplifies to

    min_θ { Σ_{j=1}^{M} 1(θ_j ≠ z_j) + γ Σ_{j=1}^{M} 1(θ_j ≠ m_j) + β NR(θ) }.        (53)
where the positive parameters γ and β are defined by:
    γ = α(1-2q)     and     β = -κ / log( q/(1-q) ).
A further simplification is obtained by writing,

    Σ_j [ 1(θ_j ≠ z_j) + γ 1(θ_j ≠ m_j) ] = (γ+1) Σ_{z_j = m_j} 1(θ_j ≠ z_j) + Σ_j 1(z_j ≠ m_j)
Hence, replacing in (53), dividing through by γ+1 (which is positive), and redefining β accordingly, the optimization problem to be solved reduces to:

    min_θ { Σ_{z_j = m_j} 1(θ_j ≠ z_j) + β NR(θ) }.        (54)
The exact solution of the above combinatorial optimization problem is out of the question with a regular desktop computer; the search space of θ's has 2^M, i.e. more than 10^3000 elements, for a square of 100×100 pixels. However, a stochastic optimization algorithm (like simulated annealing) rapidly produces high-quality solutions^3 (see []).
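
A sketch of such an annealing scheme for (54) is given below (Python with scipy; counting 4-connected components of "on" pixels is used as a stand-in for the number of rectangles NR(θ), and the cooling schedule and all settings are arbitrary).

    import numpy as np
    from scipy.ndimage import label

    def objective_54(theta, z, m, beta):
        """Objective of (54); NR(theta) is approximated by the number of
        4-connected components of 'on' pixels (a stand-in for rectangles)."""
        mismatch = np.sum((theta != z) & (z == m))
        n_components = label(theta)[1]
        return mismatch + beta * n_components

    def anneal(z, m, beta, n_steps=20000, t0=2.0, seed=5):
        """Plain simulated annealing over binary images, one pixel flip at a time."""
        rng = np.random.default_rng(seed)
        theta = z.copy()
        cost = objective_54(theta, z, m, beta)
        for step in range(n_steps):
            temp = t0 * (1.0 - step / n_steps) + 1e-3
            i, j = rng.integers(z.shape[0]), rng.integers(z.shape[1])
            theta[i, j] ^= 1
            new_cost = objective_54(theta, z, m, beta)
            if new_cost <= cost or rng.random() < np.exp((cost - new_cost) / temp):
                cost = new_cost          # accept the flip
            else:
                theta[i, j] ^= 1         # reject: undo the flip
        return theta

    # Tiny example: one rectangle, corrupted by 10% random pixel reversals.
    rng = np.random.default_rng(6)
    original = np.zeros((20, 20), dtype=int)
    original[5:12, 4:15] = 1
    z = original ^ (rng.random(original.shape) < 0.10).astype(int)
    restored = anneal(z, m=z.copy(), beta=3.0)       # m: crude guess from the data
    print("wrong pixels before:", np.sum(z != original),
          " after:", np.sum(restored != original))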

It seems profitable to exploit the use of entropic priors in image processing problems of this kind.

1.3  Towards a Bayesian Theory of Information Geometry

The main advantage of having a geometric theory of inference is the enhancement of imagination that it produces. Geometry allows us to see by relating familiar images from our everyday three-dimensional space to otherwise abstract mathematical objects living in obscure spaces. By thinking of the hypothesis space (i.e. parametric model) as a Riemannian manifold, we can picture the possible probability measures for our data living as points on a curved surface. With this imagery in mind, and the paraphernalia of modern geometry, the theory of inference is rapidly developing into a new science that Amari has baptized as information geometry [].

1.3.1  Robustness and the Lie Derivative

This section contributes to the ongoing development of information geometry by showing how the geometric concept of the Lie derivative is naturally related to the statistical concept of robustness.

Statistical robustness is a desirable property of an inferential procedure. It can be defined, qualitatively, as the stability of the procedure against small changes in the assumptions. The pictures that come to mind, almost simultaneously with this definition, are those of solids able to relax back to their original form after feeling the effects of a small deformation or strain. It is therefore not surprising that ideas developed in the physical theories of elasticity and the mechanics of continuous media are found useful for quantifying statistical robustness.

One aspect of the intrinsic robustness of a hypothesis space can be quantified by the strain tensor (see for example []). Recall that the strain tensor is a covariant tensor of order two (i.e. a bilinear form on vectors) that quantifies how the metric of the space changes under deformations defined by a vector field. Evidently, this definition can only encode one aspect of robustness: the robustness of the metric. However, remember that the strain tensor is obtained as the Lie derivative of the metric with respect to a vector field, and that the Lie derivative is defined not only for metrics but for arbitrary tensors on the manifold. Hence, by taking Lie derivatives, it is possible to quantify the rate of change, under small deformations of the space, of many statistically meaningful quantities defined on the model. A non-exhaustive list of quantities for which it would be desirable to know the Lie derivative and its properties includes:

  1. Information matrix (metric tensor), g_ij(θ)
  2. Total ignorance prior, √|g(θ)| dθ ≡ √|g(θ)| dθ^1 ∧ dθ^2 ∧ ... ∧ dθ^k
  3. Entropic prior, exp{-α I(θ:θ₀)} g^{1/2} dθ
  4. Posterior distribution, P_θ(dx) exp{-α I(θ:θ₀)} g^{1/2} dθ
  5. A parameter (scalar function), ν = ν(P_θ)
  6. The gradient of a scalar function (influence function)
  7. The Riemann curvature tensor.

The total ignorance prior has been written with the wedge product of differential forms (or Clifford algebra) to emphasize the fact that measures on the (oriented) manifold are really totally antisymmetric tensors and therefore their Lie derivatives are well-defined.

There is no technical problem in computing the Lie derivatives of the above list. In fact, ready-made formulas for items 1, 2, 5, 6 and 7 can be found in any good book on modern geometry. The derivatives of item 4, and a naïve version of item 5, can be computed directly from the defining formula for the Lie derivative of a tensor. However, before rushing to write down the formulas, it would be convenient to subject the posterior to a more detailed examination. One is dealing here not only with the manifold of the model, but also with the manifold of the data space, and a deeper analysis should consider deformations of both of them. It would be very convenient to know how the posterior changes with small kicks applied to the model and the data space.

By looking at statistical robustness with this geometric eye, new avenues for the imagination open up, and previous approaches can be seen from a different perspective. Consider, for example, the approach based on the influence function, i.e. the Gâteaux derivative of a parameter evaluated at a point mass distribution. The only deformations of the underlying space that can be encoded with this approach are those characterized by vector fields of central forces. Gâteaux derivatives apply to the model a very small class of local kicks: only deformations of the space that send (locally) every probability measure closer to a fixed point are considered.

From the more global perspective of information geometry, the Lie derivative, as a technical tool for quantifying robustness, is only one example of the many possibilities that Lie theory can bring to statistical inference. To illustrate the use of some Lie theory in statistics, the Gaussians and the discrete distributions are introduced below as examples of Lie groups.

1.4  Lie Theory and the Gaussians

The one-dimensional Gaussians, i.e., the space of all the probability measures with Lebesgue densities given by Gaussian curves, not only has the manifold structure of the Lobachevskian plane, but also the algebraic structure of an affine group. The group operation is defined by

    N(m,s²) ∘ N(μ,σ²) = N(m + sμ, s²σ²)     for m, μ ∈ ℝ and s, σ > 0.
The identity and inverse elements are given by:
    e = N(0,1),        N(m,s²)^{-1} = N( -m/s, 1/s² )        (55)
and it is not difficult to show that the group operation and the operation of taking the inverse are C^∞ maps.
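
A tiny sketch (Python; Gaussians are represented simply by the pair (m, s)) that verifies the group axioms for this operation and for the inverse (55):

    def compose(g1, g2):
        """Group operation on Gaussians coded as pairs (m, s) with s > 0:
        N(m, s^2) o N(mu, sigma^2) = N(m + s*mu, (s*sigma)^2)."""
        m, s = g1
        mu, sigma = g2
        return (m + s * mu, s * sigma)

    def inverse(g):
        """Inverse element from (55): N(m, s^2)^{-1} = N(-m/s, 1/s^2)."""
        m, s = g
        return (-m / s, 1.0 / s)

    e = (0.0, 1.0)                    # the standard normal is the identity
    a, b, c = (1.5, 2.0), (-0.7, 0.5), (3.0, 4.0)
    print(compose(a, inverse(a)))                                 # -> (0.0, 1.0)
    print(compose(a, e), compose(e, a))                           # -> a, a
    print(compose(compose(a, b), c), compose(a, compose(b, c)))   # associativity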

It is interesting to note that, by decomposing the group of isometries into one-parameter subgroups, the familiar statistical transformations of changing location and scale are automatically generated. To see this, recall that the group of direct isometries of the Lobachevskian plane is the connected component of the identity of SO(1,2), which is isomorphic to SL(2,ℝ)/{±1}, which is isomorphic to the group of linear-fractional transformations with coefficients given by the entries of the matrices in SL(2,ℝ). It is worth noting, in passing, that it follows from here that the group of symmetries of the hypothesis space of one-dimensional Gaussians is nothing but the orthochronous Lorentz transformations of ℝ^3_1, i.e. three-dimensional space-time!

The Lie algebra of the Gaussians is then given by sl(2,ℝ) and the one-parameter subgroups of isometries are obtained from the exponential function exp(tX) with t ∈ ℝ and X ∈ sl(2,ℝ). The matrices X have zero trace and real entries, i.e.

    X = ( a   b )     and     X² = λI
        ( c  -a )

with a, b, c ∈ ℝ and λ = a² + bc. Therefore, it follows that,

    e^{tX} = ( cosh(√λ t) + (a/√λ) sinh(√λ t)        (b/√λ) sinh(√λ t)              )
             ( (c/√λ) sinh(√λ t)                     cosh(√λ t) - (a/√λ) sinh(√λ t) ).        (56)
Substituting different values of a, b and c in (56), the following four one-parameter subgroups are obtained:

*  I. The Group of Location Transformations. The generator is the matrix

    X = ( 0  1 )
        ( 0  0 )

obtained from (56) when a = c = 0, b = 1 and λ = 0. The group elements are:

    exp(tX) = ( 1  t )
              ( 0  1 )

with corresponding linear-fractional transformation:

    z' = (1·z + t) / (0·z + 1) = z + t

or, in terms of the components z = (m,s),

    m' = m + t
    s' = s

with t ∈ ℝ.

*  II. The Group of Scale Transformations. The generator is the matrix

    X = ( 1   0 )
        ( 0  -1 )

obtained from (56) when a = 1, b = c = 0 and λ = 1. The group elements are:

    exp(tX) = ( e^t     0    )
              ( 0     e^{-t} )

with associated linear-fractional transformation:

    z' = (e^t z + 0) / (0·z + e^{-t}) = e^{2t} z

or, in terms of the components z = (m,s),

    m' = a m
    s' = a s

where a = e^{2t} > 0.

*  III. The Group of Hyperbolic Rotations. The generator is the matrix

    X = ( 0  1 )
        ( 1  0 )

obtained from (56) when a = 0, b = c = 1 and λ = 1. The group elements are:

    exp(tX) = ( cosh t   sinh t )
              ( sinh t   cosh t )

with associated linear-fractional transformation:

    z' = ( cosh(t) z + sinh(t) ) / ( sinh(t) z + cosh(t) )

*  IV. The Group of Euclidean Rotations. The generator is the matrix

    X = (  0  1 )
        ( -1  0 )

obtained from (56) when a = 0, b = 1, c = -1 and λ = -1. The group elements are:

    exp(tX) = ( cos t    sin t )
              ( -sin t   cos t ).
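
These four cases can be checked against a numerical matrix exponential; the sketch below (Python with scipy) evaluates the closed form (56) for each generator and compares it with expm(tX).

    import numpy as np
    from scipy.linalg import expm

    def exp_tX(t, a, b, c):
        """Closed form (56) for exp(tX), X = [[a, b], [c, -a]], lam = a^2 + b*c."""
        X = np.array([[a, b], [c, -a]], dtype=float)
        lam = a * a + b * c
        if lam == 0:
            return np.eye(2) + t * X          # degenerate case, e.g. group I
        root = np.sqrt(complex(lam))
        return np.real(np.cosh(root * t) * np.eye(2) + np.sinh(root * t) / root * X)

    t = 0.7
    for (a, b, c) in [(0, 1, 0),      # I:   location shifts
                      (1, 0, 0),      # II:  scale changes
                      (0, 1, 1),      # III: hyperbolic rotations
                      (0, 1, -1)]:    # IV:  euclidean rotations
        closed = exp_tX(t, a, b, c)
        numeric = expm(t * np.array([[a, b], [c, -a]], dtype=float))
        print(np.allclose(closed, numeric))   # True for all four generators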

1.5  The Discrete Distributions as a Lie Group

Consider the hypothesis space of all the non-degenerate probability measures over a finite set. The elements of this set can be parametrized by vectors p = (p_1, p_2,...,p_k) with p_j > 0 and Σ_j p_j = 1. It is not difficult to see (e.g. using equation (9)) that this hypothesis space is topologically equivalent to a half sphere. A group structure can be defined by the operation:

    (p_1, p_2,...,p_k) ∘ (q_1, q_2,...,q_k) = ( 1 / Σ_{j=1}^{k} p_j q_j ) (p_1 q_1, p_2 q_2,...,p_k q_k)        (57)
with identity:
    e = ( 1/k, 1/k,..., 1/k )        (58)
and inverse element,
    p^{-1} = ( Σ_j 1/p_j )^{-1} ( 1/p_1, 1/p_2,..., 1/p_k )
and the operations of multiplication and inverse are C^∞ (see footnote 4). Thus, the discrete model with the operation (57) is a Lie group.

This Lie group, let us call it G_k, is isomorphic to the abelian group of positive diagonal matrices with proportional matrices identified, i.e.,

    G_k ≅ D_k^+ / ∝
where D_k^+ is the group of k by k diagonal matrices with strictly positive entries, and ∝ is the equivalence relation:

    A ∝ B     iff     ∃ c > 0 with A B^{-1} = c I.
What is most remarkable about this is that (57) is also Bayes' theorem^5: the normalized prior p, operated with the normalized likelihood q, produces the normalized posterior r given by the right-hand side of equation (57).
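
This identification is easy to see in a couple of lines (Python; the numbers are arbitrary): the operation (57) applied to a prior and a normalized likelihood is exactly Bayes' rule, the uniform distribution (58) acts as an uninformative identity, and composing with the inverse returns the uniform.

    import numpy as np

    def compose(p, q):
        """Group operation (57) on discrete distributions."""
        r = np.asarray(p) * np.asarray(q)
        return r / r.sum()

    prior = np.array([0.5, 0.3, 0.2])
    likelihood = np.array([0.2, 0.5, 0.3])    # normalized likelihood of the data
    print(compose(prior, likelihood))         # Bayes' theorem: prior o likelihood

    identity = np.full(3, 1.0 / 3.0)          # the uniform distribution, (58)
    print(compose(prior, identity))           # -> prior (uniform likelihood is uninformative)

    inv = (1.0 / prior) / np.sum(1.0 / prior)
    print(compose(prior, inv))                # -> the uniform identity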

It is not clear what all this implies but it seems worth it to find out.

1.6  Summary of Goals and Specific Research Aims

In this section the specific goals that appear scattered throughout the body of the proposal are collected for easy reference and supplemented with more concrete intended approaches. The main goals follow closely the three main categories of the proposal: Basics, applications, and information geometry.

The Basics

It is clear, at least in the subjects presented above, that the entropic priors are producing either useful well-known results or compelling new ones. However, a more informative example would be obtained with an application where entropic priors just give the wrong answer. It would show how to modify them and what is missing.

On the positive side, what is most important at this time is to find general principles for generating the two parameters that appear in the definition of an entropic prior: the initial value θ₀ and the information parameter α. The total ignorance prior (i.e. α = 0) is always available but not always desirable because it is often improper.

An important question concerning θ₀ is the following: given a regular parametric model and no other information, what single value for the parameter should be selected? Geometry may suggest picking a value as close as possible to the center, but then one runs into the problem of defining the center. If the model is a Lie group, as many regular models are, the identity is immediately singled out as an interesting candidate. In fact, as shown by equations (55) and (58), in the Gaussian case the standard normal is the identity and in the discrete case the uniform is the identity, and it is nice that they are so. But what if there is no natural Lie group structure? Could a Gaussian approximation or a discretization be useful in this case?

The positive scalar parameter α that quantifies the amount of prior information supporting the initial choice θ₀ is the other quantity that needs to be fixed in order to compute with entropic priors. There are several promising avenues to explore: first, this is a parameter defining a regular model, therefore use an entropic prior for it, and then an entropic prior for that entropic prior, and so on (see []). Second, stop at level one and use total ignorance. Third, use empirical Bayes. Fourth, use the smallest α that makes the prior proper, etc. There are many possibilities and at this point it is not clear if there is a best approach. Specific examples need to be studied in detail to gain insight into general principles.

Another interesting open question at the moment is the asymptotic consistency of the posterior in the nonparametric generalization of entropic priors. Contrary to the current underlying myth, priors were not all created equal, and in fact most of them misbehave in nonparametric settings. It would be very surprising to find inconsistency in posteriors computed from entropic priors, but this needs to be checked.

Applications

Empirical Bayes, although listed in the main body of the proposal as part of the applications, is really a technique of inference that needs entropic priors. Entropic priors can be applied to empirical Bayes, which in turn can be used for estimation. The terrain is fertile for investigating the performance of Stein-type estimators in the problem of recovering a signal buried in white noise (see equation (33)). This is only an approximation that needs to be compared with the exact solution. Special attention should be paid to the change in performance with changes in the curvature of the signal manifold. Different estimators for α could be tried and compared. Insight could be gained with simulations, the resulting procedure could be applied to estimating frequencies, and this method should be compared with the plain posterior given by equation (41).

In the image processing application it is desired to extend the binary image example in several ways. First, a design and implementation of a simulated annealing algorithm based on the full formula (52) should be developed and its performance compared with that based on formula (53), especially when q cannot be easily estimated from the raw data. Second, to study the parallel implementation of these algorithms on a Boltzmann machine. Third, to look for hypothesis spaces with more structure than the rectangles, where the full range of possibilities of entropic priors could be successfully exploited; e.g. consider the space of images to be created from a finite number of symbols that could be continuously deformed.

Information Geometry

This subject is virtually untouched, and almost every idea seems like a good one. A method for classifying hypothesis spaces in terms of robustness would be very desirable. This could be investigated in several ways, e.g., by considering the set of vector fields for which the Lie derivative of a given quantity is either zero, bounded or small. As was mentioned in the main body of the grant, the problem of computing the rate of change of the (entropic) posterior when small deformations (or kicks) are applied to both the model and the data space seems very fundamental and promising.


Footnotes:

1 Regular models are smooth Riemannian manifolds relative to the Hausdorff topology of the Hellinger distance.

2 This assumption follows from (1): let θ denote an arbitrary distribution for the errors with a given finite variance, take a flat initial guess, and let α be large.

3 I have implemented the algorithm in Quick-C and it takes a few minutes running on a 386/33MHz to obtain solutions comparable to those in [].

4 Regular models are smooth Riemannian manifolds relative to the Hausdorff topology of the Hellinger distance.

5 I recently learned about this from an unpublished manuscript by Michael Hardy, a graduate student in statistics at the University of Minnesota.


Carlos Rodriguez <carlos@math.albany.edu>
Last modified: Tue Jul 13 19:50:18 EDT 1999