Bayes factors with (overly) informative priors
Richard A Lockhart

Richard Lockhart¹
Department of Statistics & Actuarial Science
Simon Fraser University
Burnaby, BC V5A 1S6 CANADA
¹ The author thanks Michael Stephens for many useful conversations on the topics discussed here. The author acknowledges grant support from the Natural Sciences and Engineering Research Council of Canada.
Abstract
Priors in which a large number of parameters are specified to be independent are dangerous; they make it hard to learn from data. I present a couple of examples from the literature and work through a bit of large sample theory to show what happens.
Keywords: Teaching examples, hierarchical priors, independence priors, pitfalls in prior specification.
MSC Classification Code: 62F15, 62E20
1 Introduction
Bélisle et al. (2002) fit five interesting models to a data set on progression of Alzheimer’s disease. They use Bayes factors (both prior and posterior) for model selection. Most of the Bayes factors reported are numbers of the form with on the order of plus or minus tens to hundreds; that is, in most cases when two models are compared one of the two is overwhelmingly preferred. This is not too surprising from a Bayes perspective but the results reveal some comparisons which are surprising for a frequentist.
In some of these models there is a parameter which varies from individual to individual. In other models this parameter is constant across individuals. For a frequentist the latter model is a submodel of the former. A frequentist would, I submit, be surprised to be told that in the face of the data the submodel is overwhelmingly (like time more likely) to be preferred to the richer model. In the results presented by Bélisle et al. (2002) this is exactly what happens: their model 5 is strongly preferred to their model 4 of which model 5 is a submodel. They report prior Bayes factors of and in favour of the submodel in two different data sets. They also compute posterior Bayes factors, known to be computationally easier, of and in favour of the submodel in the same two datasets.
Their modelling suggests, in fact, that neither of these models is as good as their models 1, 2 or 3 which are more complex. It seems nevertheless worthwhile to understand the potential for Bayes factors strongly to prefer submodels to full models. In this note I use the one way layout model to give an example in which a wrong submodel is overwhelmingly preferred to the full model.
Priors in which a large number of parameters are specified as being independent can easily cause problems. For a second example I consider a stylized survey sampling problem from Wasserman (2004), who presents it as a problem in which Bayes methods struggle.
In Section 2 I present the mathematical details of the one way layout problem and discuss briefly the relation to the situation in Bélisle et al. (2002). In Section 3 I simplify Wasserman’s problem and present the standard Bayesian analysis. I finish the paper with a short discussion arguing that:
1. Priors in which many parameters are independent can be too informative to be safely used in data analysis.
2. Hierarchical priors, as Bayesians know well, avoid these pitfalls.
3. In order to learn from what happens in one measurement about another measurement a Bayesian must, before making the first measurement, regard the two outcomes as dependent.
I hope the examples here might be useful for pedagogy and for highlighting dangers of careless use of priors.
2 Example 1: the one way layout
Consider the sample problem with known variance. We suppose we have data . We will be considering the standard analysis of variance model. We assume the to be independent with having a normal distribution with mean and variance 1. This will be Model 1. Model 2 is the nested submodel with all ; that is, for Model 2 the data are iid standard normal.
This is usually treated by frequentists as a hypothesis testing problem but we will look at it here as a model selection problem. We will consider a Bayesian approach. Conditional on Model 2 holding there are no further parameters on which to put a prior distribution. Conditional on Model 1 holding we need a prior for the . We consider the simple conjugate prior under which the means are independent random variables.
If the 2 models are given prior probabilities and then the posterior probability that model 2 holds given the data is
[TABLE]
where is the prior Bayes factor given by
[TABLE]
with being the marginal density of the data under model . (That is, is the model joint density averaged with respect to the prior on the and is the joint density of iid standard normals.)
Elementary normal distribution theory calculations show that
[TABLE]
where we let be the vector of all the values. It follows that
[TABLE]
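As a rough sketch of the calculation just described, the following Python functions evaluate the log Bayes factor in favour of Model 2 and the resulting posterior probability. The sketch assumes unit error variance and, under Model 1, independent N(0, tau2) priors on the sample means. The function names and the symbol `tau2` are labels chosen for this illustration, not notation taken from the text.

```python
import math

def log_bayes_factor(samples, tau2=1.0):
    """Log of (marginal density under Model 2) / (marginal density under Model 1).

    samples: list of lists, one list of observations per sample.
    tau2:    assumed prior variance of each sample mean under Model 1.
    """
    total = 0.0
    for x in samples:
        n = len(x)
        xbar = sum(x) / n
        # Determinant term: |I + tau2 * J| = 1 + n * tau2.
        total += 0.5 * math.log(1.0 + n * tau2)
        # Quadratic-form term left over after the sum of squares cancels.
        total -= 0.5 * tau2 * (n * xbar) ** 2 / (1.0 + n * tau2)
    return total

def posterior_prob_model2(log_b, prior2=0.5):
    """Posterior probability of Model 2 from the log Bayes factor (stable logistic)."""
    log_odds = log_b + math.log(prior2) - math.log(1.0 - prior2)
    if log_odds >= 0:
        return 1.0 / (1.0 + math.exp(-log_odds))
    e = math.exp(log_odds)
    return e / (1.0 + e)

# Two samples of ten observations each, all exactly zero (a toy input).
toy = [[0.0] * 10, [0.0] * 10]
print(log_bayes_factor(toy), posterior_prob_model2(log_bayes_factor(toy)))
```

Working on the log scale avoids overflow when the number of samples is large, which is the regime of interest here.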
Now we want to consider the frequentist properties of by imagining a sequence . I will take these values to be deterministic and try to see how behaves when there are many samples, that is, when is large. I begin by computing the first few moments of . For simplicity I will take all the to be equal and use to denote their common value. In this case, for each we find that
[TABLE]
and these sums, , are independent over . Thus
[TABLE]
Thus the mean of is
[TABLE]
We can now check that if is small enough then can easily be positive. A simpler computation ensues if the are in fact generated in an iid way from a Normal distribution; I now make this assumption. Then holding and fixed and letting , the number of samples, tend to , we see that
[TABLE]
where . Define and check that and . The term is no more than which converges to 0 as . So for each pair there is an so small that the limit of is positive. For this sequence — a sequence for which the null hypothesis is clearly false — the Bayes factor in favour of the null model, grows exponentially fast. If and , for instance, then . Numerically we find that with the limit of is . In other words if the actual variability in the means is 30% of the variability predicted by the prior then the submodel is roughly times as credible as the full model.
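A quick numerical check of this limit can be sketched as follows, in assumed notation: `n` observations per sample, prior variance `tau2` for each mean, and true variance `sigma2` of the means (all labels chosen here). Under those assumptions the log Bayes factor divided by the number of samples converges to (1/2) log(1 + n tau2) minus (1/2) (n tau2 / (1 + n tau2)) (1 + n sigma2). The values n = 10, tau2 = 1 and a true standard deviation of 0.3 are the ones quoted in the discussion of this example.

```python
import math

def limiting_log_bf_rate(n, tau2, sigma2):
    """Limit of log(B)/k as k grows, when the true means are iid N(0, sigma2)."""
    shrink = n * tau2 / (1.0 + n * tau2)
    return 0.5 * math.log(1.0 + n * tau2) - 0.5 * shrink * (1.0 + n * sigma2)

def critical_sigma2(n, tau2):
    """True variance of the means below which the limiting rate is positive."""
    return ((1.0 + n * tau2) * math.log(1.0 + n * tau2) / (n * tau2) - 1.0) / n

rate = limiting_log_bf_rate(n=10, tau2=1.0, sigma2=0.3 ** 2)
print(rate)                                  # approximately 0.335 per sample
print(math.sqrt(critical_sigma2(10, 1.0)))   # approximately 0.40
```

A positive rate means the Bayes factor in favour of the (false) submodel grows like exp(rate × number of samples).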
A reasonable form of the analytic law of is available when the are iid . In this case
[TABLE]
has a normal distribution with mean 0 and variance
[TABLE]
Since the are iid it follows that
[TABLE]
Thus
[TABLE]
so that
[TABLE]
Using this form I plot, in Figure 1, the median of against the number of samples for , and . I use a logarithmic scale for on the -axis and the nearly perfect linearity is evident. Note, too, that the actual median values are very large by the time and vast for .
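The near-linear growth of the median can be reproduced with a small simulation. This is a sketch under assumptions, not the code behind Figure 1: it takes the log Bayes factor to be a constant minus a scaled chi-square variable with one degree of freedom per sample, and it uses 10 observations per sample, unit prior variance and a true standard deviation of 0.3 for the means. The function and argument names are hypothetical.

```python
import math
import random
import statistics

def simulate_median_log_bf(k, n=10, tau2=1.0, sigma2=0.09, reps=2000, seed=0):
    """Monte Carlo median of log B for k samples of size n."""
    rng = random.Random(seed)
    shrink = n * tau2 / (1.0 + n * tau2)
    const = 0.5 * k * math.log(1.0 + n * tau2)
    scale = 0.5 * shrink * (1.0 + n * sigma2)
    draws = []
    for _ in range(reps):
        # Chi-square with k degrees of freedom is Gamma(shape=k/2, scale=2).
        chi2_k = rng.gammavariate(k / 2.0, 2.0)
        draws.append(const - scale * chi2_k)
    return statistics.median(draws)

for k in (10, 50, 100, 200):
    print(k, round(simulate_median_log_bf(k), 2))
```

The printed medians should increase roughly in proportion to `k`, consistent with the straight line seen on a logarithmic scale.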
The situation here is relatively simple. In the model above the value of is estimated from 10 observations. The prior specified iid with variance 1 so that the standard deviation of the arising was a priori expected to be 1. In fact that standard deviation was smaller, 0.3 only. As a consequence when is moderately large the prior for Model 1, the less restrictive model, is contradicted by the data. Model 2 is also contradicted (and false). For or so, as with and held fixed, we find that tends exponentially fast to and we erroneously conclude that the means are all 0.
In Bélisle et al. (2002), Model 5 differs from Model 4 in allowing a certain variance parameter to be different from subject to subject. These variance parameters were taken to be independent across individuals and to have expectation 2; they were given inverse Gaussian distributions with this expectation and with standard deviation 2. There were subjects in one of the two data sets and in the other. I do not have access to the data (honestly, I haven’t tried to get access to the data) so I cannot check, but I assume that it is likely that a fit with some smaller mean and/or some other standard deviation for these variance parameters would yield more natural Bayes factors.
The phenomenon described above arises in other model selection problems. For instance, suppose we regress on predictors and assume that each has a non-zero coefficient independently of all the other covariates and that is the common value. Then if we do model selection by Bayesian methods the posterior will be concentrated on the event that the number of active predictors is in the range .
3 Example 2: Survey Sampling
A related phenomenon occurs in Larry Wasserman’s book All of Statistics. Example 11.19 in the book is a description of a missing data problem but it has the flavour of survey sampling. There are parameters , each a probability in . We draw a sample with replacement from the index set ; let be the indices selected. It is assumed that these are iid and uniform on . Having selected, on the th draw, unit , we may or may not observe an observation which is Bernoulli with parameter . The example given supposes that ‘may or may not’ is determined by a Bernoulli variable with parameter which is known to the analyst. In the text it is argued that the Horvitz-Thompson estimator of
[TABLE]
is a good one and that Bayesian methods fail. Here I simplify the model, eliminating the possibility of missing data and making the sampling law a bit more in line with survey sampling work.
I will assume that we generate a set , the sample, of indices in with design probability ; I assume the design is not informative; that is, the are known to the surveyor in advance and unrelated to . Then for each we observe with a Bernoulli distribution independently of all other values. I let denote the vector of observed values.
The likelihood is the probability of observing the data we observed:
[TABLE]
This likelihood is observed to depend on only of the entries in the vector of all . Wasserman now argues that “for most the posterior distribution is equal to the prior distribution since those do not appear in the likelihood.” The problem, however, is that the prior is not clearly specified. If I specify that the are a priori iid with density then the posterior for unobserved is still iid with density . This appears to me to be the argument intended by Wasserman. But this prior specifies that if I measure one I do not learn about any other . This sort of prior is exactly the sort discussed above and has the same weakness. The unobserved parameters will average, essentially, to . If, as the text assumes, is very large compared to then of course I will not learn anything important about ; a priori I knew that was close to .
I would argue that the correct message here is as above; independence priors about many parameters are priors which deny learning. If I had to guess the mass of a flea I would struggle but if you let me weigh one of a group of many I would suddenly know far more about the average weight of the group of fleas. I need a hierarchical prior.
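The contrast between the two kinds of prior can be illustrated by simulating the prior distribution of the population average. This is only a sketch: the iid Uniform(0, 1) prior, the uniform hyperprior on the shared mean `m`, and the Beta concentration of 2 are arbitrary choices made here for illustration, and the function name is hypothetical.

```python
import random
import statistics

def prior_sd_of_average(N, hierarchical, reps=300, seed=1):
    """Prior standard deviation of the average of N success probabilities."""
    rng = random.Random(seed)
    averages = []
    for _ in range(reps):
        if hierarchical:
            # One shared hyperparameter per draw: the population mean.
            m = rng.uniform(0.05, 0.95)
            thetas = [rng.betavariate(2.0 * m, 2.0 * (1.0 - m)) for _ in range(N)]
        else:
            # Independence prior: each parameter drawn separately.
            thetas = [rng.random() for _ in range(N)]
        averages.append(sum(thetas) / N)
    return statistics.pstdev(averages)

sd_indep = prior_sd_of_average(200, hierarchical=False)
sd_hier = prior_sd_of_average(200, hierarchical=True)
print(sd_indep, sd_hier)
```

Under the independence prior the average is pinned near 1/2 before any data are seen, with a spread that shrinks like one over the square root of `N`. Under the hierarchical prior the average remains genuinely uncertain, so observing a few units can move it.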
In the Wasserman example (as modified here) one might specify such a hierarchical prior by taking the unit-level probabilities to be iid with a Beta($\alpha,\beta$) distribution and then putting a prior on $(\alpha,\beta)$. For the discussion which follows I reparametrize the Beta distribution in terms of its mean
$$\mu = \frac{\alpha}{\alpha+\beta}$$
and its variance
$$\sigma^2 = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}.$$
The parameters and may be expressed in terms of and by noting
[TABLE]
and
[TABLE]
This gives
[TABLE]
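The inversion from mean and variance back to the two Beta shape parameters can be sketched in a few lines. The function name is hypothetical; the algebra uses only the standard Beta moment formulas, which give the sum of the shape parameters as mean × (1 − mean) / variance − 1.

```python
def beta_shape_from_moments(mean, variance):
    """Return the Beta shape parameters (alpha, beta) with the given mean and variance."""
    if not 0.0 < mean < 1.0:
        raise ValueError("mean must lie strictly between 0 and 1")
    if not 0.0 < variance < mean * (1.0 - mean):
        raise ValueError("variance must lie strictly between 0 and mean * (1 - mean)")
    total = mean * (1.0 - mean) / variance - 1.0   # alpha + beta
    return mean * total, (1.0 - mean) * total

print(beta_shape_from_moments(0.4, 0.04))
```

The second check in the function reflects the fact that a Beta distribution with a given mean cannot have variance as large as mean × (1 − mean).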
A simple prior specification might then be to give some Beta() density denoted on and to give some conditional density given . Notice that so this conditional density must be supported on this interval. One simple conditional prior would be to take to be uniform on ; I use for some generic conditional prior for given and then specialize where needed to the uniform case where
[TABLE]
A priori we have .
The joint law of the data, the parameters and the hyperparameters and is
[TABLE]
From this I now deduce: the joint law of the data and the hyperparameters and ; the usual Bayes estimate of , namely ; and the Bayes estimate of , namely .
For the first of these I simply integrate away all the . For we have
[TABLE]
For the integral needed becomes
[TABLE]
Considering the two cases and separately we find
[TABLE]
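Written out in assumed notation, with $\theta$ the success probability of a sampled unit, $y \in \{0,1\}$ its observed response, and $f(\theta \mid \alpha, \beta)$ the Beta density (labels chosen here for illustration), the two cases reduce to first moments of the Beta distribution:

```latex
\int_0^1 \theta^{y}(1-\theta)^{1-y}\, f(\theta \mid \alpha,\beta)\, d\theta
  = \begin{cases}
      \dfrac{\alpha}{\alpha+\beta} = \mu, & y = 1, \\[1.5ex]
      \dfrac{\beta}{\alpha+\beta} = 1-\mu, & y = 0,
    \end{cases}
  \qquad\text{that is,}\qquad \mu^{y}(1-\mu)^{1-y}.
```

In this reading each sampled unit contributes a Bernoulli factor in the Beta mean alone, so the variance of the Beta distribution drops out of the marginal law of the data.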
Thus the joint law of is
[TABLE]
where and we use to denote the cardinality of .
The conditional density, , of given and is then
[TABLE]
Now integrate out to find the conditional density of given and , namely,
[TABLE]
If is the Beta() then this posterior is a Beta density; the Bayes estimate of is the mean of this density, namely,
[TABLE]
I remark that the improper prior arises in the limit and . This leads to the Horvitz-Thompson estimator
[TABLE]
suggested by Wasserman. Of course this is our estimate of while Wasserman is estimating .
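The conjugate update behind this estimate can be sketched as follows, assuming a Beta(a, b) prior on the mean of the unit-level probabilities and a non-informative design, so that the sampled responses act like Bernoulli draws with that mean. The names `y`, `a` and `b` are labels chosen here.

```python
def posterior_mean_mu(y, a, b):
    """Posterior mean of the Beta mean given 0/1 responses y and a Beta(a, b) prior."""
    n = len(y)
    successes = sum(y)
    return (a + successes) / (a + b + n)

sample = [1, 0, 1, 1]
print(posterior_mean_mu(sample, 1.0, 1.0))      # proper Beta(1, 1) prior
print(posterior_mean_mu(sample, 1e-12, 1e-12))  # close to the improper limit
```

As `a` and `b` tend to 0 the value approaches the sample proportion of successes.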
Finally I compute the Bayes estimate of which is
[TABLE]
I compute separately according to or .
Notice that
[TABLE]
In order to compute the inside conditional expectation I return to joint law (1). For this joint law may be written in the form
[TABLE]
where the quantity does not depend on . It follows that given the conditional law of is, for , Beta with parameters and . Thus
[TABLE]
Finally we see that
[TABLE]
When the kernel of (1) is
[TABLE]
which is the kernel of the Beta density with parameters and . The mean of this density is
[TABLE]
and we need to compute
[TABLE]
This is
[TABLE]
The required conditional density is given above but we must write the inner conditional mean in terms of and . First we get
[TABLE]
We also find
[TABLE]
Our integral becomes
[TABLE]
I now use the conditional uniform distribution of to do the inside integral to get
[TABLE]
In summary we have found that
[TABLE]
From this we find that the Bayes estimate of is
[TABLE]
When is small compared to the second term is negligible. In fact, for the special choice of the non-informative improper prior the second term vanishes exactly and the Bayes estimate is just the Horvitz-Thompson estimator. Finally, I remark that the second term is
[TABLE]
This quantity is easily seen to be bounded, in absolute value, by
[TABLE]
Thus the correction to is negligible if is large.
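One reading of the derivation above can be sketched in code. It assumes a Beta(a, b) prior on the Beta mean and a conditional uniform prior on the Beta variance. Under those assumptions the posterior mean of a sampled unit's probability works out to the average of the posterior mean of the Beta mean and that unit's response, while unsampled units are estimated by the posterior mean of the Beta mean alone. The names `y`, `N`, `a` and `b` are labels chosen here, and the function is an illustration rather than the author's computation.

```python
def bayes_estimate_average(y, N, a, b):
    """Bayes estimate of the population average of N success probabilities.

    y:    0/1 responses for the sampled units.
    N:    population size (assumed at least len(y)).
    a, b: parameters of the assumed Beta prior on the Beta mean.
    """
    n = len(y)
    mu_hat = (a + sum(y)) / (a + b + n)
    # Sampled units: shrink each response halfway toward mu_hat.
    sampled = sum(0.5 * (mu_hat + yi) for yi in y)
    # Unsampled units: nothing observed, so use mu_hat.
    unsampled = (N - n) * mu_hat
    return (sampled + unsampled) / N

sample = [1, 0, 1, 1]
print(bayes_estimate_average(sample, N=100, a=1.0, b=1.0))
```

Equivalently, the estimate is `mu_hat` plus `n / (2 * N)` times the difference between the sample proportion and `mu_hat`, which makes the size of the correction easy to see: it shrinks as the population grows relative to the sample.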
As a practical matter this Horvitz-Thompson estimator has a variance which is acceptably small. Treating the as fixed parameters the variance is
[TABLE]
I do not think this quantity admits a useful estimate.
4 Discussion
When many parameters are specified as a priori independent, the prior is very informative about averages of functions of those parameters. The result is that estimates can depend little on the data and model selection methods can have very poor frequency theory properties. Analysts using Bayesian methods must be careful of such priors and ought, in many cases, to make the prior hierarchical, inducing dependence between the parameters and permitting learning about some parameters by measuring others.
References
Bélisle, P.; Joseph, L.; Wolfson, D.; and Zhou, X. (2002). Bayesian estimation of cognitive decline in patients with Alzheimer’s disease. The Canadian Journal of Statistics, 30, 37–54.
Wasserman, L. (2004). All of Statistics. Springer, New York.
