Bayes factors with (overly) informative priors

Richard A Lockhart

arXiv:1907.02473·math.ST·July 9, 2019


TL;DR

This paper examines the pitfalls of using overly informative priors with many independent parameters, demonstrating through examples and large sample theory how they hinder learning from data.

Contribution

It provides a theoretical analysis of the issues caused by overly informative priors and illustrates these problems with practical examples.

Findings

1. Overly informative priors can impede learning from data.
2. Large sample theory reveals the detrimental effects of such priors.
3. Examples demonstrate the practical implications of the theoretical findings.

Abstract

Priors in which a large number of parameters are specified to be independent are dangerous; they make it hard to learn from data. I present a couple of examples from the literature and work through a bit of large sample theory to show what happens.

Equations

$$\frac{\pi_{2}F}{\pi_{1}+\pi_{2}F}$$

$$F=\frac{f_{2}(x)}{f_{1}(x)}$$

$$f_{1}(x)=(2\pi)^{-n/2}\prod_{i=1}^{k}(1+n_{i}\tau^{2})^{-1/2}\exp\left\{-\frac{1}{2}\sum_{ij}x_{ij}^{2}+\sum_{i}\frac{\tau^{2}\left(\sum_{j}x_{ij}\right)^{2}}{2(1+n_{i}\tau^{2})}\right\}$$

$$\log(F)=-\sum_{i}\frac{\tau^{2}\left(\sum_{j}X_{ij}\right)^{2}}{2(1+n_{i}\tau^{2})}+\frac{1}{2}\sum_{i}\log(1+n_{i}\tau^{2}).$$

$$T_{i}\equiv\sum_{j}X_{ij}\sim\text{Normal}(n\mu_{i},n)$$

$${\rm E}(T_{i}^{2})=n+n^{2}\mu_{i}^{2}.$$

$${\rm E}[\log(F)]=-\frac{\tau^{2}}{2(1+n\tau^{2})}\left(nk+n^{2}\sum_{i}\mu_{i}^{2}\right)+\frac{1}{2}k\log(1+n\tau^{2}).$$

$$2\frac{\log(F)}{k}\to\log(1+a)-\frac{a(1+n\epsilon^{2})}{1+a}$$

$$T_{i}=n\mu_{i}+\sum_{j}(X_{ij}-\mu_{i})$$

$$\sigma^{2}=n^{2}\epsilon^{2}+n.$$

$$\sum_{i}T_{i}^{2}/\sigma^{2}\sim\chi_{k}^{2}.$$

$$\log(F)=\frac{k\log(1+n\tau^{2})}{2}-\frac{\tau^{2}\sigma^{2}}{2(1+n\tau^{2})}\sum_{i}T_{i}^{2}/\sigma^{2}$$

$$P(\log(F)>kt)=P\left[\frac{\chi_{k}^{2}}{k}<(1+n\tau^{2})\frac{\log(1+n\tau^{2})-2t}{\sigma^{2}\tau^{2}}\right]$$

$$\psi_{B}\equiv\frac{\sum_{j=1}^{B}\theta_{j}}{B}$$

$$\pi_{J}\prod_{j\in J}\theta_{j}^{Y_{j}}(1-\theta_{j})^{1-Y_{j}}.$$

$$\psi=\frac{\alpha}{\alpha+\beta}$$

$$\eta\equiv\frac{\psi(1-\psi)}{\alpha+\beta+1}.$$

$$\alpha+\beta=\frac{\psi(1-\psi)}{\eta}-1$$

$$\alpha=(\alpha+\beta)\psi=\frac{\psi^{2}(1-\psi)}{\eta}-\psi$$

$$\beta=\frac{\psi(1-\psi)}{\eta}-1-\left(\frac{\psi^{2}(1-\psi)}{\eta}-\psi\right)=\frac{(1-\psi)(\psi-\psi^{2}-\eta)}{\eta}.$$

$$h(\eta|\psi)=\frac{1(0\leq\eta\leq\psi(1-\psi))}{\psi(1-\psi)}.$$

$$\pi_{J}\prod_{j\in J}\theta_{j}^{Y_{j}}(1-\theta_{j})^{1-Y_{j}}\prod_{j}\frac{\theta_{j}^{\alpha-1}(1-\theta_{j})^{\beta-1}}{\text{Beta}(\alpha,\beta)}h(\eta|\psi)h_{0}(\psi).$$

$$\int_{0}^{1}\frac{\theta_{j}^{\alpha-1}(1-\theta_{j})^{\beta-1}}{\text{Beta}(\alpha,\beta)}d\theta_{j}=1.$$

$$\int_{0}^{1}\theta_{j}^{Y_{j}}(1-\theta_{j})^{1-Y_{j}}\frac{\theta_{j}^{\alpha-1}(1-\theta_{j})^{\beta-1}}{\text{Beta}(\alpha,\beta)}d\theta_{j}=\frac{\text{Beta}(\alpha+Y_{j},\beta+1-Y_{j})}{\text{Beta}(\alpha,\beta)}.$$

$$\int_{0}^{1}\theta_{j}^{Y_{j}}(1-\theta_{j})^{1-Y_{j}}\frac{\theta_{j}^{\alpha-1}(1-\theta_{j})^{\beta-1}}{\text{Beta}(\alpha,\beta)}d\theta_{j}=\psi^{Y_{j}}(1-\psi)^{1-Y_{j}}.$$

$$\pi_{J}\prod_{j\in J}\psi^{Y_{j}}(1-\psi)^{1-Y_{j}}h(\eta|\psi)h_{0}(\psi)=\pi_{J}\psi^{S}(1-\psi)^{|J|-S}h(\eta|\psi)h_{0}(\psi)$$

$$\frac{\psi^{S}(1-\psi)^{|J|-S}h(\eta|\psi)h_{0}(\psi)}{\text{Beta}(S+1,|J|-S+1)}.$$

$$\frac{\psi^{S}(1-\psi)^{|J|-S}}{\text{Beta}(S+1,|J|-S+1)}h_{0}(\psi).$$

$$\hat{\psi}=\frac{S+\alpha_{0}}{|J|+\alpha_{0}+\beta_{0}}$$

$$\hat{\psi}_{HT}=\frac{S}{|J|}$$


Taxonomy

Topics: Bayesian Modeling and Causal Inference · Advanced Statistical Process Monitoring · Statistical Methods and Bayesian Inference

Full text

**Bayes factors with (overly) informative priors**

Richard Lockhart¹

¹ The author thanks Michael Stephens for many useful conversations on the topics discussed here. The author acknowledges grant support from the Natural Sciences and Engineering Research Council of Canada.

Department of Statistics & Actuarial Science

Simon Fraser University

Burnaby, BC V5A 1S6 CANADA


Keywords: Teaching examples, hierarchical priors, independence priors, pitfalls in prior specification.

MSC Classification Code: 62F15, 62E20

1 Introduction

Bélisle et al. (2002) fit five interesting models to a data set on progression of Alzheimer’s disease. They use Bayes factors (both prior and posterior) for model selection. Most of the Bayes factors reported are numbers of the form $10^{x}$ with $x$ on the order of plus or minus tens to hundreds; that is, in most cases when two models are compared one of the two is overwhelmingly preferred. This is not too surprising from a Bayes perspective but the results reveal some comparisons which are surprising for a frequentist.

In some of these models there is a parameter which varies from individual to individual. In other models this parameter is constant across individuals. For a frequentist the latter model is a submodel of the former. A frequentist would, I submit, be surprised to be told that in the face of the data the submodel is overwhelmingly (on the order of $10^{12}$ times more likely) to be preferred to the richer model. In the results presented by Bélisle et al. (2002) this is exactly what happens: their model 5 is strongly preferred to their model 4, of which model 5 is a submodel. They report prior Bayes factors of $10^{22.7}$ and $10^{12}$ in favour of the submodel in two different data sets. They also compute posterior Bayes factors, known to be computationally easier, of $10^{26.5}$ and $10^{16.3}$ in favour of the submodel in the same two data sets.

Their modelling suggests, in fact, that neither of these models is as good as their models 1, 2 or 3 which are more complex. It seems nevertheless worthwhile to understand the potential for Bayes factors strongly to prefer submodels to full models. In this note I use the one way layout model to give an example in which a wrong submodel is overwhelmingly preferred to the full model.

Priors in which a large number of parameters are specified as being independent can easily cause problems. For a second example I consider a stylized survey sampling problem from Wasserman (2004), who presents it as a problem in which Bayes methods struggle.

In Section 2 I present the mathematical details of the one way layout problem and discuss briefly the relation to the situation in Bélisle et al. (2002). In Section 3 I simplify Wasserman’s problem and present the standard Bayesian analysis. I finish the paper with a short discussion arguing that:

1. Priors in which many parameters are independent can be too informative to be safely used in data analysis.
2. Hierarchical priors, as Bayesians know well, avoid these pitfalls.
3. In order to learn from what happens in one measurement about another measurement, a Bayesian must, before making the first measurement, regard the two outcomes as dependent.

I hope the examples here might be useful for pedagogy and for highlighting dangers of careless use of priors.

2 Example 1: the one way layout

Consider the $k$ sample problem with known variance. We suppose we have data $X_{ij};j=1,\ldots,n_{i};i=1,\ldots,k$ . We will be considering the standard analysis of variance model. We assume the $X_{ij}$ to be independent with $X_{ij}$ having a normal distribution with mean $\mu_{i}$ and variance 1. This will be Model 1. Model 2 is the nested submodel with all $\mu_{i}=0$ ; that is, for Model 2 the data are iid standard normal.

This is usually treated by frequentists as a hypothesis testing problem but we will look at it here as a model selection problem. We will consider a Bayesian approach. Conditional on Model 2 holding there are no further parameters on which to put a prior distribution. Conditional on Model 1 holding we need a prior for the $\mu_{i}$ . We consider the simple conjugate prior under which the means $\mu_{i}$ are independent $N(0,\tau^{2})$ random variables.

If the two models are given prior probabilities $\pi_{1}$ and $\pi_{2}=1-\pi_{1}$ then the posterior probability that model 2 holds given the data is

$$\frac{\pi_{2}F}{\pi_{1}+\pi_{2}F}$$

where $F$ is the prior Bayes factor given by

$$F=\frac{f_{2}(x)}{f_{1}(x)}$$

with $f_{i}$ being the marginal density of the data under model $i$ . (That is, $f_{1}$ is the model joint density averaged with respect to the prior on the $\mu_{i}$ and $f_{2}$ is the joint density of iid standard normals.)
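A minimal Python sketch of how the posterior probability of Model 2, $\pi_{2}f_{2}/(\pi_{1}f_{1}+\pi_{2}f_{2})$, can be evaluated from $\log(F)$ with $F=f_{2}/f_{1}$. The function name `posterior_prob_model2` and the choice to work on the log-odds scale are illustrative assumptions, not part of the paper.

```python
import math


def posterior_prob_model2(log_F, pi1=0.5):
    """P(Model 2 | data) = pi2*f2 / (pi1*f1 + pi2*f2), where F = f2/f1.

    Hypothetical helper: pi1 must lie strictly between 0 and 1.
    """
    pi2 = 1.0 - pi1
    # Posterior log-odds of Model 2 against Model 1.
    log_odds = math.log(pi2) - math.log(pi1) + log_F
    # Logistic transform, arranged so that exp() never overflows.
    if log_odds >= 0:
        return 1.0 / (1.0 + math.exp(-log_odds))
    e = math.exp(log_odds)
    return e / (1.0 + e)
```

Working with $\log(F)$ rather than $F$ matters here because the Bayes factors discussed in this note are of the order of $10^{12}$ or more.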

Elementary normal distribution theory calculations show that

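$$f_{1}(x)=(2\pi)^{-n/2}\prod_{i=1}^{k}(1+n_{i}\tau^{2})^{-1/2}\exp\left\{-\frac{1}{2}\sum_{ij}x_{ij}^{2}+\sum_{i}\frac{\tau^{2}\left(\sum_{j}x_{ij}\right)^{2}}{2(1+n_{i}\tau^{2})}\right\}$$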

where we let $x$ be the vector of all the $x_{ij}$ values. It follows that

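$$\log(F)=-\sum_{i}\frac{\tau^{2}\left(\sum_{j}X_{ij}\right)^{2}}{2(1+n_{i}\tau^{2})}+\frac{1}{2}\sum_{i}\log(1+n_{i}\tau^{2}).$$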
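As a sketch of how this quantity could be computed from data, the following Python function evaluates $\log(F)=\frac{1}{2}\sum_{i}\log(1+n_{i}\tau^{2})-\sum_{i}\tau^{2}T_{i}^{2}/\{2(1+n_{i}\tau^{2})\}$ with $T_{i}=\sum_{j}x_{ij}$. The name `log_bayes_factor` and the list-of-lists data layout are assumptions made for illustration.

```python
import math


def log_bayes_factor(samples, tau):
    """log F (Model 2 against Model 1) for the one way layout with unit variance.

    samples: list of k lists, samples[i] holding the n_i observations of sample i.
    tau:     prior standard deviation of the means under Model 1.
    """
    log_F = 0.0
    for x in samples:
        n_i = len(x)
        t_i = sum(x)  # T_i = sum_j x_ij
        shrink = 1.0 + n_i * tau ** 2
        log_F += 0.5 * math.log(shrink) - tau ** 2 * t_i ** 2 / (2.0 * shrink)
    return log_F
```

Each sample contributes a fixed positive term $\frac{1}{2}\log(1+n_{i}\tau^{2})$ and a data-dependent penalty, so samples whose sums $T_{i}$ are small relative to what the prior predicts push $\log(F)$ upward.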

Now we want to consider the frequentist properties of $\log(F)$ by imagining a sequence $\mu_{1},\mu_{2},\ldots$ . I will take these values to be deterministic and try to see how $\log(F)$ behaves when there are many samples, that is, when $k$ is large. I begin by computing the first few moments of $\log(F)$ . For simplicity I will take all the $n_{i}$ to be equal and use $n$ to denote their common value. In this case, for each $i$ we find that

$$T_{i}\equiv\sum_{j}X_{ij}\sim\text{Normal}(n\mu_{i},n)$$

and these sums, $T_{i}$ , are independent over $i$ . Thus

$${\rm E}(T_{i}^{2})=n+n^{2}\mu_{i}^{2}.$$

Thus the mean of $\log(F)$ is

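$${\rm E}[\log(F)]=-\frac{\tau^{2}}{2(1+n\tau^{2})}\left(nk+n^{2}\sum_{i}\mu_{i}^{2}\right)+\frac{1}{2}k\log(1+n\tau^{2}).$$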

We can now check that if $\overline{\mu^{2}}=\sum_{i}\mu_{i}^{2}/k$ is small enough then $\log(F)$ can easily be positive. A simpler computation ensues if the $\mu_{i}$ are in fact generated in an iid way from a Normal $(0,\epsilon^{2})$ distribution; I now make this assumption. Then holding $n$ and $\tau$ fixed and letting $k$ , the number of samples, tend to $\infty$ , we see that

$$2\frac{\log(F)}{k}\to\log(1+a)-\frac{a(1+n\epsilon^{2})}{1+a}$$

where $a=n\tau^{2}$. Define $f(a)=\log(1+a)-a/(1+a)$ and check that $f(0)=0$ and $f^{\prime}(a)=a/(1+a)^{2}>0$, so that $f(a)>0$ for every $a>0$. The term $an\epsilon^{2}/(1+a)$ is no more than $n\epsilon^{2}$, which converges to 0 as $\epsilon\to 0$. So for each $n,\tau$ pair there is an $\epsilon$ so small that the limit of $2\log(F)/k$ is positive. For this sequence, a sequence for which the null hypothesis is clearly false, the Bayes factor in favour of the null model grows exponentially fast. If $n=10$ and $\tau=1$, for instance, then $a=10$. Numerically we find that with $\epsilon=0.3$ the limit of $2\log(F)/k$ is $0.67\approx 2/3$, so that $\log(F)/k$ tends to roughly $1/3$. In other words, if the actual variability in the means is 30% of the variability predicted by the prior then the submodel is roughly $\exp(k/3)$ times as credible as the full model.
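The numbers in this paragraph can be checked with a few lines of Python. The sketch below evaluates the limiting expression $\log(1+a)-a(1+n\epsilon^{2})/(1+a)$ with $a=n\tau^{2}$, and solves for the value of $\epsilon$ at which it changes sign. Both function names are hypothetical.

```python
import math


def limit_two_log_F_over_k(n, tau, eps):
    """Limit of 2*log(F)/k as k grows, for mu_i iid N(0, eps^2)."""
    a = n * tau ** 2
    return math.log(1.0 + a) - a * (1.0 + n * eps ** 2) / (1.0 + a)


def critical_eps(n, tau):
    """Value of eps at which the limit above is exactly zero.

    Solves log(1+a) = a*(1 + n*eps^2)/(1+a) for eps, with a = n*tau^2.
    """
    a = n * tau ** 2
    eps_squared = ((1.0 + a) * math.log(1.0 + a) / a - 1.0) / n
    return math.sqrt(eps_squared)
```

With $n=10$ and $\tau=1$, `limit_two_log_F_over_k(10, 1.0, 0.3)` is approximately 0.67 and `critical_eps(10, 1.0)` is approximately 0.405, so for any smaller $\epsilon$ the Bayes factor favours the false submodel as $k$ grows.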

A reasonable form of the analytic law of $\log(F)$ is available when the $\mu_{i}$ are iid $N(0,\epsilon^{2})$ . In this case

$$T_{i}=n\mu_{i}+\sum_{j}(X_{ij}-\mu_{i})$$

has a normal distribution with mean 0 and variance

$$\sigma^{2}=n^{2}\epsilon^{2}+n.$$

Since the $T_{i}$ are iid it follows that

$$\sum_{i}T_{i}^{2}/\sigma^{2}\sim\chi_{k}^{2}.$$

Thus

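$$\log(F)=\frac{k\log(1+n\tau^{2})}{2}-\frac{\tau^{2}\sigma^{2}}{2(1+n\tau^{2})}\sum_{i}T_{i}^{2}/\sigma^{2}$$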

so that

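$$P(\log(F)>kt)=P\left[\frac{\chi_{k}^{2}}{k}<(1+n\tau^{2})\frac{\log(1+n\tau^{2})-2t}{\sigma^{2}\tau^{2}}\right]$$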

Using this form I plot, in Figure 1, the median of $F$ against the number $k$ of samples for $n=10$ , $\tau=1$ and $\epsilon=0.3$ . I use a logarithmic scale for $F$ on the $y$ -axis and the nearly perfect linearity is evident. Note, too, that the actual median values are very large by the time $k=50$ and vast for $k=200$ .
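The near-linear growth of the median of $\log(F)$ in $k$ can also be seen by simulation. The following Python sketch is a Monte Carlo stand-in for the exact chi-squared calculation, not the method used to draw the figure: it draws $\mu_{i}$ from $N(0,\epsilon^{2})$, then $T_{i}$ from $N(n\mu_{i},n)$, and reports the sample median of $\log(F)$. Function names, the number of replications and the seed are all assumptions.

```python
import math
import random
import statistics


def simulate_log_F(k, n=10, tau=1.0, eps=0.3, rng=random):
    """One draw of log F with mu_i ~ N(0, eps^2) and T_i | mu_i ~ N(n*mu_i, n)."""
    a = n * tau ** 2
    sum_t_squared = 0.0
    for _ in range(k):
        mu = rng.gauss(0.0, eps)
        t = rng.gauss(n * mu, math.sqrt(n))  # sufficient statistic T_i
        sum_t_squared += t * t
    return 0.5 * k * math.log(1.0 + a) - tau ** 2 * sum_t_squared / (2.0 * (1.0 + a))


def median_log_F(k, reps=2000, seed=0):
    """Monte Carlo median of log F over `reps` simulated data sets."""
    rng = random.Random(seed)
    return statistics.median(simulate_log_F(k, rng=rng) for _ in range(reps))


if __name__ == "__main__":
    for k in (10, 50, 100, 200):
        # Median of log10(F): close to linear in k.
        print(k, median_log_F(k) / math.log(10.0))
```

Only the sums $T_{i}$ are simulated because $\log(F)$ depends on the data through them alone.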

The situation here is relatively simple. In the model above the value of $\mu_{i}$ is estimated from 10 observations. The prior specified iid $\mu_{i}$ with variance 1, so that the standard deviation of the $\mu_{i}$ arising was a priori expected to be 1. In fact that standard deviation was smaller, 0.3 only. As a consequence, when $k$ is moderately large the prior for Model 1, the less restrictive model, is contradicted by the data. Model 2 is also contradicted (and false). For $\epsilon<0.404$ or so, as $k\to\infty$ with $n$ and $\tau$ held fixed, we find that $F$ tends exponentially fast to $\infty$ and we erroneously conclude that the means are all 0.

In Bélisle et al. (2002), Model 4 differs from Model 5 in allowing a certain variance parameter to be different from subject to subject. These variance parameters were taken to be independent across individuals and to have expectation 2; they were given inverse Gaussian distributions with this expectation and with standard deviation 2. There were $n=65$ subjects in one of the two data sets and $n=55$ in the other. I do not have access to the data (honestly, I haven’t tried to get access to the data) so I cannot check, but I think it likely that a fit with some smaller mean and/or some other standard deviation for these variance parameters would yield more natural Bayes factors.

The phenomenon described above arises in other model selection problems. For instance, suppose we regress $Y$ on $p$ predictors $X_{1},\ldots,X_{p}$ and assume a priori that each $X_{i}$ has a non-zero coefficient independently of all the other covariates, with $\theta\in(0,1)$ the common probability of a non-zero coefficient. Then if we do model selection by Bayesian methods the posterior will be concentrated on the event that the number of active predictors is in the range $p\theta\pm 2\sqrt{p\theta(1-\theta)}$.
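To see how sharply such a prior pins down the number of active predictors, note that under independence that number is Binomial$(p,\theta)$ a priori. The Python sketch below computes the prior mass of the range $p\theta\pm 2\sqrt{p\theta(1-\theta)}$. It illustrates the prior only, not the posterior, and the function name is hypothetical.

```python
import math


def prior_mass_near_mean(p, theta):
    """P(|K - p*theta| <= 2*sd) for K ~ Binomial(p, theta).

    K is the number of active predictors under the independence prior.
    Plain floating point is used, so this is meant for moderate p
    (about a thousand or fewer).
    """
    mean = p * theta
    sd = math.sqrt(p * theta * (1.0 - theta))
    lo = max(math.ceil(mean - 2.0 * sd), 0)
    hi = min(math.floor(mean + 2.0 * sd), p)
    return sum(
        math.comb(p, k) * theta ** k * (1.0 - theta) ** (p - k)
        for k in range(lo, hi + 1)
    )
```

For $p=1000$ and $\theta=0.1$ the range is roughly 81 to 119 predictors, and the prior already puts about 95% of its mass there before any data are seen.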

3 Example 2: Survey Sampling

A related phenomenon occurs in Larry Wasserman’s book All of Statistics. Example 11.19 in the book is a description of a missing data problem but it has the flavour of survey sampling. There are $B$ parameters $\theta_{1},\ldots,\theta_{B}$, each a probability in $[0,1]$. We draw a sample with replacement from the index set $\{1,\cdots,B\}$; let $X_{1},\ldots,X_{n}$ be the indices selected. It is assumed that these are iid and uniform on $\{1,\cdots,B\}$. Having selected, on the $i$th draw, unit $X_{i}$, we may or may not observe an observation $Y_{i}$ which is Bernoulli with parameter $\theta_{X_{i}}$. The example given supposes that ‘may or may not’ is determined by a Bernoulli variable $R_{i}$ with parameter $\xi_{X_{i}}$ which is known to the analyst. In the text it is argued that the Horvitz-Thompson estimator of

$$\psi_{B}\equiv\frac{\sum_{j=1}^{B}\theta_{j}}{B}$$

is a good one and that Bayesian methods fail. Here I simplify the model, eliminating the possibility of missing data and make the sampling law a bit more in line with survey sampling work.

I will assume that we generate a set $J$, the sample, of indices in $\{1,\ldots,B\}$ with design probability $\pi_{J}$. I assume the design is not informative; that is, the $\pi_{J}$ are known to the surveyor in advance and unrelated to $\psi$. Then for each $j\in J$ we observe $Y_{j}$ with a Bernoulli$(\theta_{j})$ distribution independently of all other $Y$ values. I let $Y$ denote the vector of the $|J|$ observed values.

The likelihood is the probability of observing the data we observed:

$$\pi_{J}\prod_{j\in J}\theta_{j}^{Y_{j}}(1-\theta_{j})^{1-Y_{j}}.$$

This likelihood is observed to depend on only $n$ of the $B$ entries in the vector $\theta$ of all $\theta_{i}$ . Wasserman now argues that “for most $\theta_{j}$ the posterior distribution is equal to the prior distribution since those $\theta_{j}$ do not appear in the likelihood.” The problem, however, is that the prior is not clearly specified. If I specify that the $\theta_{i}$ are a priori iid with density $\pi(\cdot)$ then the posterior for unobserved $\theta_{i}$ is still iid with density $\pi(\cdot)$ . This appears to me to be the argument intended by Wasserman. But this prior specifies that if I measure one $\theta_{i}$ I do not learn about any other $\theta_{i}$ . This sort of prior is exactly the sort discussed above and has the same weakness. The unobserved $B-n$ parameters $\theta_{j}$ will average, essentially, to $\psi\equiv\int_{0}^{1}\theta\pi(\theta)d\theta$ . If, as the text assumes, $B-n$ is very large compared to $n$ then of course I will not learn anything important about $\psi_{B}$ ; a priori I knew that $\psi_{B}$ was close to $\psi$ .

I would argue that the correct message here is as above; independence priors about many parameters are priors which deny learning. If I had to guess the mass of a flea I would struggle but if you let me weigh one of a group of many I would suddenly know far more about the average weight of the group of fleas. I need a hierarchical prior.
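A small Python sketch makes the point quantitative. If the $\theta_{j}$ are a priori iid, the prior standard deviation of $\psi_{B}$ is $\sqrt{{\rm Var}(\theta)/B}$, which is tiny when $B$ is large. The sketch assumes, purely for illustration, an iid Beta$(\alpha,\beta)$ prior; the function name is hypothetical.

```python
import math


def prior_sd_of_psi_B(B, alpha=1.0, beta=1.0):
    """Prior standard deviation of psi_B = mean(theta_1..theta_B)
    when the theta_j are iid Beta(alpha, beta) (an assumed prior)."""
    var_theta = alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1.0))
    return math.sqrt(var_theta / B)
```

With a uniform prior on each $\theta_{j}$ and $B=10^{6}$ this standard deviation is below $0.0003$: before seeing any data the analyst already claims to know $\psi_{B}$ to about three decimal places, so a sample that is small relative to $B$ cannot change that belief much.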

In the Wasserman example (as modified here) one might specify such a hierarchical prior by taking the $\theta_{i}$ to be iid with a Beta( $\alpha,\beta$ ) distribution and then putting a prior on $(\alpha,\beta)\in(0,\infty)^{2}$ . For the discussion which follows I reparametrize the Beta distribution in terms of its mean

$$\psi=\frac{\alpha}{\alpha+\beta}$$

and its variance

$$\eta\equiv\frac{\psi(1-\psi)}{\alpha+\beta+1}.$$

The parameters $\alpha$ and $\beta$ may be expressed in terms of $\psi$ and $\eta$ by noting

$$\alpha+\beta=\frac{\psi(1-\psi)}{\eta}-1$$

and

$$\alpha=(\alpha+\beta)\psi=\frac{\psi^{2}(1-\psi)}{\eta}-\psi$$

This gives

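$$\beta=\frac{\psi(1-\psi)}{\eta}-1-\left(\frac{\psi^{2}(1-\psi)}{\eta}-\psi\right)=\frac{(1-\psi)(\psi-\psi^{2}-\eta)}{\eta}.$$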
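The reparametrization between $(\alpha,\beta)$ and the mean and variance $(\psi,\eta)$ of the Beta distribution can be sketched in Python as follows. The function names are hypothetical, and the input check reflects the constraint $0<\eta<\psi(1-\psi)$ needed for $\alpha$ and $\beta$ to be positive.

```python
def beta_params_from_mean_var(psi, eta):
    """Map the Beta mean psi and variance eta to (alpha, beta)."""
    if not (0.0 < psi < 1.0 and 0.0 < eta < psi * (1.0 - psi)):
        raise ValueError("need 0 < psi < 1 and 0 < eta < psi*(1 - psi)")
    total = psi * (1.0 - psi) / eta - 1.0  # alpha + beta
    return psi * total, (1.0 - psi) * total


def mean_var_from_beta_params(alpha, beta):
    """Inverse map: (alpha, beta) to the Beta mean psi and variance eta."""
    psi = alpha / (alpha + beta)
    eta = psi * (1.0 - psi) / (alpha + beta + 1.0)
    return psi, eta
```

For example, $\alpha=2$ and $\beta=3$ give $\psi=0.4$ and $\eta=0.04$, and the first function maps these values back to $(2,3)$ up to rounding error.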

A simple prior specification might then be to give $\psi$ some Beta($\alpha_{0},\beta_{0}$) density denoted $h_{0}(\psi)$ on $[0,1]$ and to give $\eta$ some conditional density given $\psi$. Notice that $0\leq\eta\leq\psi(1-\psi)$ so this conditional density must be supported on this interval. One simple conditional prior would be to take $\eta$ to be uniform on $[0,\psi(1-\psi)]$; I use $h(\eta|\psi)$ for some generic conditional prior for $\eta$ given $\psi$ and then specialize where needed to the uniform case where

$$h(\eta|\psi)=\frac{1(0\leq\eta\leq\psi(1-\psi))}{\psi(1-\psi)}.$$

A priori we have ${\rm E}\left(\psi_{B}|\psi,\eta\right)=\psi$ .

The joint law of the data, the parameters $\theta_{i}$ and the hyperparameters $\psi$ and $\eta$ is

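$$\pi_{J}\prod_{j\in J}\theta_{j}^{Y_{j}}(1-\theta_{j})^{1-Y_{j}}\prod_{j}\frac{\theta_{j}^{\alpha-1}(1-\theta_{j})^{\beta-1}}{\text{Beta}(\alpha,\beta)}h(\eta|\psi)h_{0}(\psi).\tag{1}$$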

From this I now deduce: the joint law of the data $J,Y$ and the hyperparameters $\psi$ and $\eta$ ; the usual Bayes estimate of $\psi$ , namely ${\rm E}(\psi|Y)$ ; and the Bayes estimate of $\psi_{B}$ , namely ${\rm E}(\psi_{B}|Y)$ .

For the first of these I simply integrate away all the $\theta_{j}$ . For $j\not\in J$ we have

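$$\int_{0}^{1}\frac{\theta_{j}^{\alpha-1}(1-\theta_{j})^{\beta-1}}{\text{Beta}(\alpha,\beta)}d\theta_{j}=1.$$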

For $j\in J$ the integral needed becomes

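$$\int_{0}^{1}\theta_{j}^{Y_{j}}(1-\theta_{j})^{1-Y_{j}}\frac{\theta_{j}^{\alpha-1}(1-\theta_{j})^{\beta-1}}{\text{Beta}(\alpha,\beta)}d\theta_{j}=\frac{\text{Beta}(\alpha+Y_{j},\beta+1-Y_{j})}{\text{Beta}(\alpha,\beta)}.$$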

Considering the two cases $Y_{j}=1$ and $Y_{j}=0$ separately we find

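$$\int_{0}^{1}\theta_{j}^{Y_{j}}(1-\theta_{j})^{1-Y_{j}}\frac{\theta_{j}^{\alpha-1}(1-\theta_{j})^{\beta-1}}{\text{Beta}(\alpha,\beta)}d\theta_{j}=\psi^{Y_{j}}(1-\psi)^{1-Y_{j}}.$$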
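The identity used here, that Beta$(\alpha+Y_{j},\beta+1-Y_{j})/$Beta$(\alpha,\beta)$ equals $\psi=\alpha/(\alpha+\beta)$ when $Y_{j}=1$ and $1-\psi$ when $Y_{j}=0$, is easy to confirm numerically. The Python sketch below does so with log-gamma functions; the helper names are hypothetical.

```python
import math


def log_beta(a, b):
    """Logarithm of the Beta function, via log-gamma."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def marginal_prob(y, alpha, beta):
    """Beta(alpha + y, beta + 1 - y) / Beta(alpha, beta) for y in {0, 1}.

    This is P(Y_j = y) after integrating theta_j against its
    Beta(alpha, beta) density.
    """
    return math.exp(log_beta(alpha + y, beta + 1 - y) - log_beta(alpha, beta))
```

For instance, with $\alpha=2.5$ and $\beta=4$ the two values are $2.5/6.5$ and $4/6.5$, which sum to one as they must.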

Thus the joint law of $J,Y,\psi,\eta$ is

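$$\pi_{J}\prod_{j\in J}\psi^{Y_{j}}(1-\psi)^{1-Y_{j}}h(\eta|\psi)h_{0}(\psi)=\pi_{J}\psi^{S}(1-\psi)^{|J|-S}h(\eta|\psi)h_{0}(\psi)$$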

where $S=\sum_{J}Y_{j}$ and we use $|J|$ to denote the cardinality of $J$ .

The conditional density, $f(\psi,\eta|Y,J)$ , of $\psi,\eta$ given $Y$ and $J$ is then

$$\frac{\psi^{S}(1-\psi)^{|J|-S}h(\eta|\psi)h_{0}(\psi)}{\text{Beta}(S+1,|J|-S+1)}.$$

Now integrate out $\eta$ to find the conditional density of $\psi$ given $Y$ and $J$ , namely,

$$\frac{\psi^{S}(1-\psi)^{|J|-S}}{\text{Beta}(S+1,|J|-S+1)}h_{0}(\psi).$$

If $h_{0}$ is the Beta( $\alpha_{0},\beta_{0}$ ) then this posterior is a Beta density; the Bayes estimate of $\psi$ is the mean of this density, namely,

$$\hat{\psi}=\frac{S+\alpha_{0}}{|J|+\alpha_{0}+\beta_{0}}$$

I remark that the improper prior $\psi^{-1}(1-\psi)^{-1}$ arises in the limit $\alpha_{0}\to 0$ and $\beta_{0}\to 0$. This leads to the Horvitz-Thompson estimator

$$\hat{\psi}_{HT}=\frac{S}{|J|}$$

suggested by Wasserman. Of course this is our estimate of $\psi$ while Wasserman is estimating $\psi_{B}$ .
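The relation between the two estimates of $\psi$ can be sketched in Python. The first function is the posterior mean $(S+\alpha_{0})/(|J|+\alpha_{0}+\beta_{0})$ under a Beta$(\alpha_{0},\beta_{0})$ prior for $\psi$; the second is the sample proportion $S/|J|$. The function names and the argument `m`, standing for $|J|$, are illustrative assumptions.

```python
def bayes_estimate_psi(S, m, alpha0, beta0):
    """Posterior mean of psi: S successes among m = |J| sampled units,
    with a Beta(alpha0, beta0) prior on psi."""
    return (S + alpha0) / (m + alpha0 + beta0)


def horvitz_thompson(S, m):
    """Sample proportion S/|J|, the limit as alpha0 and beta0 tend to 0."""
    return S / m
```

With $S=7$ and $|J|=20$, a uniform prior ($\alpha_{0}=\beta_{0}=1$) gives $8/22$, while shrinking $\alpha_{0}$ and $\beta_{0}$ toward zero moves the estimate to $7/20$.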

Finally I compute the Bayes estimate of $\psi_{B}$ which is

$${\rm E}(\psi_{B}|Y)=\frac{1}{B}\sum_{j=1}^{B}{\rm E}(\theta_{j}|Y).$$

I compute ${\rm E}(\theta_{j}|Y)$ separately according to $j\not\in J$ or $j\in J$ .

Notice that

$${\rm E}(\theta_{j}|Y)={\rm E}\left\{{\rm E}(\theta_{j}|Y,J,\psi,\eta)\,|\,Y\right\}.$$

In order to compute the inside conditional expectation I return to joint law (1). For $j\not\in J$ this joint law may be written in the form

[TABLE]

where the quantity $C$ does not depend on $\theta_{j}$ . It follows that given $Y,J,\psi,\eta$ the conditional law of $\theta_{j}$ is, for $j\not\in J$ , Beta with parameters $\psi$ and $\eta$ . Thus

$${\rm E}(\theta_{j}|Y,J,\psi,\eta)=\psi.$$

Finally we see that

[TABLE]

When $j\in J$ the kernel of (1) is

$$\theta_{j}^{\alpha+Y_{j}-1}(1-\theta_{j})^{\beta-Y_{j}}$$

which is the kernel of the Beta density with parameters $\alpha+Y_{j}$ and $\beta+1-Y_{j}$ . The mean of this density is

$$\frac{\alpha+Y_{j}}{\alpha+\beta+1}$$

and we need to compute

[TABLE]

This is

[TABLE]

The required conditional density is given above but we must write the inner conditional mean in terms of $\psi$ and $\eta$ . First we get

[TABLE]

We also find

[TABLE]

Our integral becomes

[TABLE]

I now use the conditional uniform distribution of $\eta$ to do the inside integral to get

[TABLE]

In summary we have found that

[TABLE]

From this we find that the Bayes estimate of $\psi_{B}$ is

[TABLE]

When $|J|$ is small compared to $B$ the second term is negligible. In fact, for the special choice of the non-informative improper prior $\psi^{-1}(1-\psi)^{-1}$ the second term vanishes exactly and the Bayes estimate is just the Horvitz-Thompson estimator. Finally, I remark that the second term is

[TABLE]

This quantity is easily seen to be bounded, in absolute value, by

[TABLE]

Thus the correction to $\hat{\psi}$ is negligible if $B$ is large.

As a practical matter, this Horvitz-Thompson estimator has a variance which is acceptably small. Treating the $\theta_{j}$ as fixed parameters, the variance is

[TABLE]

I do not think this quantity admits a useful estimate.

4 Discussion

When many parameters are specified as a priori independent, the prior is very informative about averages of functions of those parameters. The result is that estimates can depend little on the data and model selection methods can have very poor frequency theory properties. Analysts using Bayesian methods must be careful of such priors and ought, in many cases, to make the prior hierarchical, inducing dependence between the parameters and permitting learning about some parameters by measuring others.

References

Bélisle, P., Joseph, L., Wolfson, D. and Zhou, X. (2002). Bayesian estimation of cognitive decline in patients with Alzheimer’s disease. The Canadian Journal of Statistics, 30, 37–54.

Wasserman, L. (2004). All of Statistics. Springer, New York.