This is a summary of a thread in the Newsgroup sci.stat.math at:
What follows is my last attempt at convincing Radford Neal (and other present and future
watchers out there) of the importance of Theorem1 in,
Fact1: Among all possible distributions on the parameters of a regular
parametric model, the one that is most difficult to discriminate from an
independent model on (parameters,data) is the Entropic Prior.
Fact2: The Entropic Prior for the parameters of a Gaussian mixture turns
out to be very similar to the popular conjugate prior, except that the
uncertainty on the parameters of each component depends on the weight
assigned to that component: the smaller the weight, the larger the
uncertainty.
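The weight-dependence in Fact2 can be sketched numerically. The scaling rule below is an illustrative assumption (not a formula from the thread): treat each component as informed by roughly alpha*w_k "virtual" observations, so the prior std of component k's mean goes like sigma/sqrt(alpha*w_k).

```python
import math

# Hedged sketch, not from the thread: assume the entropic prior gives each
# mixture component an effective sample size of alpha * w_k virtual
# observations, so the prior std of component k's mean scales like
# sigma / sqrt(alpha * w_k). Smaller weight => larger prior uncertainty.
def prior_std_per_component(weights, sigma=1.0, alpha=10.0):
    # this function and its scaling rule are illustrative assumptions
    return [sigma / math.sqrt(alpha * w) for w in weights]

weights = [0.9, 0.1]                     # populous vs rare component
stds = prior_std_per_component(weights)  # rare component gets the larger std
```

Under this assumed scaling, the component with weight 0.1 ends up with three times the prior uncertainty of the component with weight 0.9.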
Radford Neal's position:
Forget Fact1. I find your specific Fact2 counter-intuitive and even
irrational. Ergo, we can happily forget about the Entropic Prior.
Whatever intuition you may have about what a most ignorant prior for
the data should look like, if your intuition doesn't agree with the
Facts above then, provided the Facts above are correct, we can happily
forget about your intuition.
- Why should anyone care about the prior that YOU say is the most
difficult to discriminate from an independent model? That looks to me
like a cooked-up manipulation of symbols to get what you want.
- And look, your specific Fact2 is clearly crazy, for I can find
lots of real-life examples where it appears to encode prior
information that doesn't exist or, even worse, that runs contrary to
what we know for that problem.
Here is an example: Suppose we want to study how far the
population of beetles from the north of Sri Lanka are able to travel.
Suppose that we know that the beetles from Sri Lanka are of one of two
kinds: one populous species and one rare species. We naturally model the
observed data of traveled distances as a two-component mixture of
Gaussians. Your entropic prior will assign A PRIORI more uncertainty to
the average distance traveled by the rare species. That's SILLY! I could
have all kinds of bio info against that!
- The math behind Fact1 is standard and, though subtle, clear once understood.
The Kullback number between two probability measures P and Q, denoted
I(P:Q), is defined as

  I(P:Q) = Integral log( dP/dQ ) dP

(*** where for us:

  P = p(data|params)*p(params)  (i.e. likelihood times prior) and
  Q = h(data)*g(params)         (i.e. an independent model, some arbitrary
                                 but fixed density h() for data and the
                                 local uniform g() on the (manifold)
                                 parameters)

both defined on the same measurable space (data,parameters) ****)

I(P:Q) is the universally accepted information-theoretic-probabilistic
measure of how easy it is to discriminate Q from P. It is nothing but the
mean information for discrimination in favor of P and against Q when
sampling from P. Look at the first chapter of Kullback's book or ask your
gurus or search the net or whatever.
or search the net or whatever. Just in case you still have issues with
I(P:Q), let me remind anyone watching that a simple monotone increasing
function of I(P:Q) is an upper bound on the total variation distance
between P and Q (the Bretagnolle-Huber inequality).
TRANSLATION: If I(P:Q)
is small (close to 0) then P and Q are close in total variation, i.e.
close in the most natural way for probability measures.
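The Bretagnolle-Huber bound can be checked numerically. Below is a small self-contained check for two Bernoulli distributions, where both the Kullback number and the total variation distance have simple closed forms:

```python
import math

def kl_bernoulli(p, q):
    # I(P:Q) for P = Bernoulli(p), Q = Bernoulli(q), in nats
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def bh_bound(kl):
    # Bretagnolle-Huber: TV(P,Q) <= sqrt(1 - exp(-I(P:Q)))
    # monotone increasing in kl, and 0 when kl is 0
    return math.sqrt(1.0 - math.exp(-kl))

p, q = 0.5, 0.6
tv = abs(p - q)          # total variation distance for two Bernoullis
kl = kl_bernoulli(p, q)  # about 0.020 nats here
```

As claimed: when I(P:Q) goes to 0 the bound goes to 0, forcing P and Q together in total variation.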
The proper prior p(params) that minimizes I(P:Q), when the data consists
of alpha independent observations of the model, is the Entropic Prior with
parameters h and alpha. It is only natural to call this prior most
ignorant about the data, since Q is an independent (product) model
"h(data)*g(params)" in which the params are statistically independent of
the data.
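For a finite parameter space the minimization can be verified directly. In the toy check below (my own construction, not from the thread), the parameter takes two values, d[k] stands in for the per-parameter discrimination number alpha*I(f_theta_k : h), and I(P:Q) reduces to sum_k p_k*d_k + KL(p||g); brute-force minimization recovers the entropic form p_k proportional to g_k*exp(-d_k):

```python
import math

# Assumed toy numbers: d[k] plays the role of alpha * I(f_theta_k : h)
d = [0.3, 1.2]   # the second parameter value is easier to tell apart from h
g = [0.5, 0.5]   # local uniform g() on the two parameter values

def objective(t):
    # I(P:Q) as a function of the prior p = (t, 1-t):
    #   sum_k p_k * d_k  +  KL(p || g)
    p = [t, 1.0 - t]
    return (sum(pk * dk for pk, dk in zip(p, d))
            + sum(pk * math.log(pk / gk) for pk, gk in zip(p, g)))

# brute-force minimization over a fine grid (endpoints excluded: log(0))
grid = [i / 10000.0 for i in range(1, 10000)]
t_best = min(grid, key=objective)

# closed-form entropic prior for this discrete problem: p_k ~ g_k * exp(-d_k)
w = [gk * math.exp(-dk) for gk, dk in zip(g, d)]
t_entropic = w[0] / sum(w)
```

The grid minimizer and the closed form agree, and the parameter value that is harder to discriminate from h (smaller d) gets the larger prior mass.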
There is nothing fishy or unnatural about Fact1. The true power of Fact1
comes from its generality. It holds for ANY regular hypothesis space in
any number of dimensions. Even in infinite dimensional (i.e. for
stochastic processes) hypothesis spaces. However, there is no room in the
electronic margins of this note to show you the proof...
*** I remind whoever is listening that once you allow Fact1 to get
"jeegee" (As in Austin Powers "Get jeegee with it") with your mind, you
become pregnant and there is no need to bother with answering (2). Your
baby-to-be will give you the answer! For those virtuous minds still out
there, here is a way:
- The only prior information that we assume we have about the
beetles is what is in the likelihood and in the parameters of the
entropic prior (h and alpha). NOTHING ELSE. If there is extra prior
info, biological or whatever, that info must be EXPLICITLY included in
the problem: either in the likelihood, in h or alpha, or as a constraint
for the minimization of I(P:Q). Only after including ALL the info that
we want to consider, only after that, do we maximize honesty and take
the most ignorant prior consistent with what we know. Fact2, as it is
presented here, applies only to that ignorant case.
When all we assume we know is the likelihood, Fact2 is not only sane but
obvious. Of course the parameters of the rare components of the mixture
are A PRIORI more uncertain. There is always less info coming from there
and we know that A PRIORI even BEFORE we collect any data. Another way
to state this is:
THE ONLY way to be able to assume equal uncertainty for all the
components regardless of their mixture weights is to ASSUME a source of
information OTHER than the likelihood. Q.E.D.
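The "less info coming from there" point can be seen in a simulation. The setup below is an assumed idealization (two well-separated components, so every draw is trivially assignable by sign): with mixture weight w for the rare species, only about n*w observations ever come from it, so the sampling spread of its estimated mean is larger, and that is known before any particular dataset arrives:

```python
import random
import statistics

random.seed(0)

# Assumed idealized mixture: populous component at mean -10, rare at +10,
# unit variances, so each observation identifies its component by sign.
def one_dataset(n, w_rare):
    xs_common, xs_rare = [], []
    for _ in range(n):
        if random.random() < w_rare:
            xs_rare.append(random.gauss(10.0, 1.0))    # rare species
        else:
            xs_common.append(random.gauss(-10.0, 1.0))  # populous species
    return xs_common, xs_rare

common_means, rare_means = [], []
for _ in range(300):
    c, r = one_dataset(200, 0.1)
    if c and r:  # skip the (astronomically unlikely) empty-component case
        common_means.append(statistics.fmean(c))
        rare_means.append(statistics.fmean(r))

sd_common = statistics.stdev(common_means)  # roughly 1/sqrt(200 * 0.9)
sd_rare = statistics.stdev(rare_means)      # roughly 1/sqrt(200 * 0.1)
```

The rare component's mean estimate spreads about three times wider, matching the sqrt(w_common/w_rare) ratio that the weights alone predict.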
NOTE: The above argument opens the gates of uncertainty to
all the MCMC simulations based on the standard conjugate prior for
mixtures of Gaussians.