Neal Vs. ME

This is a summary of a thread in the Newsgroup sci.stat.math at: www.google.com

What follows is my last attempt at convincing Radford Neal (and other present and future watchers out there) of the importance of Theorem 1 in https://omega0.xyz/omega8008/0201016.pdf

General Fact1:

Among all possible distributions on the parameters of a regular parametric model, the one that is most difficult to discriminate from an independent model on (parameters,data) is the Entropic Prior.

Specific Fact2:

The Entropic Prior for the parameters of a Gaussian mixture turns out to be very similar to the popular conjugate prior except that the uncertainty on the parameters of each component depends on the weight assigned to that component. The smaller the weight the larger the uncertainty.

Radford Neal's position:

Forget Fact1. I find your Specific Fact2 counter-intuitive and even irrational. Ergo, we can happily forget about the Entropic Prior business.

My position:

Whatever intuition you may have about what a most ignorant prior about the data should look like, if your intuition doesn't agree with the Facts above then, provided the Facts above are correct, we can happily forget about your intuition.

Neal's Argument:

Why should anyone care about the prior that YOU say is the most difficult to discriminate from an independent model? That looks to me like a cooked-up manipulation of symbols to get what you want.
And look, your Specific Fact2 is clearly crazy, for I can find lots of real-life examples where it appears to encode prior information that doesn't exist or, even worse, that runs contrary to what we know for that problem.
Here is an example: Suppose we want to study how far the population of beetles from the north of Sri Lanka is able to travel. Suppose that we know that the beetles from Sri Lanka are of one of two kinds: one populous species and one rare species. We naturally model the observed data of traveled distances as a two-component mixture of Gaussians. Your entropic prior will assign A PRIORI more uncertainty to the average distance traveled by the rare species. That's SILLY! I could have all kinds of bio info against that!

My Argument:

The math behind Fact1 is standard and (subtle, but once understood) trivial. Take two probability measures P and Q defined on the same measurable space (data,parameters), where for us:
P=f(data|params)*p(params)
(i.e. likelihood times prior) and
Q=h(data)*g(params)
(i.e. an independent model: some arbitrary but fixed density h() for the data and the locally uniform g() on the parameter manifold).
The Kullback number between P and Q, denoted I(P:Q), is the universally accepted information-theoretic-probabilistic measure of how easy it is to discriminate Q from P. It is nothing but the mean information for discrimination in favor of P and against Q when sampling from P. Look at the first chapter of Kullback's book or ask your gurus or search the net or whatever. Just in case you still have issues with I(P:Q), let me remind anyone watching that a simple monotone increasing function of I(P:Q) is an upper bound on the total variation distance between P and Q (Bretagnolle-Huber inequality).
TRANSLATION: If I(P:Q) is small (close to 0) then P and Q are close in total variation i.e. close in the most natural way for probability measures.
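For anyone who wants to see the TRANSLATION in numbers, here is a minimal sketch on a toy three-point space. It assumes the usual form of the Bretagnolle-Huber bound, TV(P,Q) <= sqrt(1 - exp(-I(P:Q))), with total variation taken as half the L1 distance; the distributions p and qs and the function names are made up for illustration and have nothing to do with the mixture problem.

```python
import math

def kl(p, q):
    """Kullback number I(P:Q) for discrete distributions, in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def total_variation(p, q):
    """Total variation distance: half the L1 distance between p and q."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def bh_bound(kl_value):
    """Bretagnolle-Huber upper bound on total variation, monotone in I(P:Q)."""
    return math.sqrt(1.0 - math.exp(-kl_value))

# Hypothetical toy distributions: each q is progressively closer to p.
p = [0.5, 0.3, 0.2]
qs = [[0.2, 0.3, 0.5], [0.4, 0.3, 0.3], [0.49, 0.3, 0.21]]

for q in qs:
    i_pq = kl(p, q)
    print(f"I(P:Q)={i_pq:.5f}  TV={total_variation(p, q):.5f}  bound={bh_bound(i_pq):.5f}")
```

As I(P:Q) shrinks toward 0 the bound shrinks with it, and the total variation distance is squeezed underneath.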
FACT1 (again):
The proper prior p(params) that minimizes I(P:Q), when data consists of alpha independent observations of the model, is the Entropic Prior with parameters h and alpha. It is only natural to call this prior most ignorant about the data since Q is an independent (product) model "h(data)*g(params)" where params are statistically independent of the data.
There is nothing fishy or unnatural about Fact1. The true power of Fact1 comes from its generality. It holds for ANY regular hypothesis space in any number of dimensions. Even in infinite dimensional (i.e. for stochastic processes) hypothesis spaces. However, there is no room in the electronic margins of this note to show you the proof...
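Not the proof, but a heuristic sketch of why a prior of exponential form comes out of the minimization. It assumes the setup above with the data x = (x_1,...,x_alpha) independent, so that f and h enter as alpha-fold products, and it takes regularity and normalizability for granted. The symbol I(theta:h), for the Kullback number between f(.|theta) and h for a single observation, and the constant Z are notation introduced here for illustration; the linked paper is the authoritative statement.

```latex
\begin{align*}
I(P:Q)
  &= \int p(\theta) \int f(x\mid\theta)\,
     \log\frac{f(x\mid\theta)\,p(\theta)}{h(x)\,g(\theta)}\,dx\,d\theta,
     \qquad f(x\mid\theta)=\prod_{i=1}^{\alpha} f(x_i\mid\theta),\quad
            h(x)=\prod_{i=1}^{\alpha} h(x_i) \\
  &= \alpha \int p(\theta)\, I(\theta:h)\, d\theta
     + \int p(\theta)\,\log\frac{p(\theta)}{g(\theta)}\,d\theta \\
  &= \int p(\theta)\,
     \log\frac{p(\theta)}{g(\theta)\,e^{-\alpha I(\theta:h)}}\,d\theta \\
  &= \int p(\theta)\,\log\frac{p(\theta)}{p^{*}(\theta)}\,d\theta \;-\; \log Z,
     \qquad p^{*}(\theta)=\frac{1}{Z}\,g(\theta)\,e^{-\alpha I(\theta:h)},\quad
     Z=\int g(\theta)\,e^{-\alpha I(\theta:h)}\,d\theta .
\end{align*}
```

The first term of the last line is itself a Kullback number, so it is non-negative and vanishes only when p = p*. Under these assumptions the minimizer is therefore p*, which depends on exactly h and alpha.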
*** I remind whoever is listening that once you allow Fact1 to get "jeegee" (As in Austin Powers "Get jeegee with it") with your mind, you become pregnant and there is no need to bother with answering (2). Your baby-to-be will give you the answer! For those virtuous minds still out there, here is a way:
The only prior information that we assume that we have about the beetles is the one in the likelihood and the parameters of the entropic prior (h and alpha). NOTHING ELSE. If there is extra prior info, biological or whatever, that info must be EXPLICITLY included in the problem: either in the likelihood, in h and alpha, or as a constraint for the minimization of I(P:Q). Only after including ALL the info that we want to consider, only after that, do we maximize honesty and take the most ignorant prior consistent with what we know. Fact2, as it is presented here, applies only to that ignorant case. When all we assume we know is the likelihood, Fact2 is not only sane but obvious. Of course the parameters of the rare components of the mixture are A PRIORI more uncertain. There is always less info coming from there and we know that A PRIORI even BEFORE we collect any data. Another way to state this is: THE ONLY way to be able to assume equal uncertainty for all the components regardless of their mixture weights is to ASSUME a source of information OTHER than the likelihood. Q.E.D.
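The "less info coming from there" remark can be checked with a small simulation. This is only a sketch under simplifying assumptions that are not part of the argument above: the component labels are treated as known, each component is Gaussian with known sigma, and the weights 0.9 and 0.1, the sample size and the function name are hypothetical. It does not compute the entropic prior; it only shows that, for a fixed number of observations, the mean of a rare component is pinned down less tightly than the mean of a populous one.

```python
import random
import statistics

def component_mean_spread(weight, n_obs=200, n_rep=500, sigma=1.0, seed=0):
    """Std. dev., across repeated datasets, of the sample mean of one
    mixture component whose mixture weight is `weight` (labels known)."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_rep):
        # How many of the n_obs observations fall in this component.
        k = sum(rng.random() < weight for _ in range(n_obs))
        if k == 0:
            continue  # no data at all from this component in this dataset
        xs = [rng.gauss(0.0, sigma) for _ in range(k)]
        means.append(sum(xs) / k)
    return statistics.pstdev(means)

common = component_mean_spread(0.9)  # populous species
rare = component_mean_spread(0.1)    # rare species
print(f"spread of mean, populous component: {common:.4f}")
print(f"spread of mean, rare component:     {rare:.4f}")
```

The spread scales roughly like sigma/sqrt(n_obs*weight), so the rare component comes out noticeably more uncertain from the sampling scheme alone, before any biology enters.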

NOTE:
The above argument opens the gates of uncertainty to all the MCMC simulations based on the standard conjugate prior for mixtures of Gaussians.

Carlos Rodriguez <carlos@math.albany.edu>

Last modified: Mon Apr 8 20:13:34 EDT 2002