A Note on Bayesian Model Selection for Discrete Data Using Proper Scoring Rules

A. Philip Dawid; Monica Musio; Silvia Columbu

arXiv:1703.03353·math.ST·April 28, 2020

TL;DR

This paper proposes a Bayesian model selection method for discrete data based on homogeneous proper scoring rules, which remains well-defined under improper priors. Simulations indicate that, applied prequentially, it consistently selects the true model when choosing between Poisson and Negative Binomial models.

Contribution

It introduces a scoring rule-based Bayesian approach for model selection with improper priors, specifically applied to discrete distributions like Poisson and Negative Binomial.

Findings

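1. In simulations, the excess cumulative prequential score of the wrong model over the correct model shows a clear upward trend, supporting consistent selection of the true model.

2. Homogeneous scoring rules do not involve the arbitrary normalising constant of an improper prior.

3. Applied prequentially, the method discriminates between the two models, although even with 1000 observations there is a non-negligible probability of preferring the wrong one.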

Abstract

We consider the problem of choosing between parametric models for a discrete observable, taking a Bayesian approach in which the within-model prior distributions are allowed to be improper. In order to avoid the ambiguity in the marginal likelihood function in such a case, we apply a homogeneous scoring rule. For the particular case of distinguishing between Poisson and Negative Binomial models, we conduct simulations that indicate that, applied prequentially, the method will consistently select the true model.

Equations

$$S(x,P)=G_{x-1}^{\prime}\left(\frac{p(x)}{p(x-1)}\right)+G_{x}\left(\frac{p(x+1)}{p(x)}\right)-\frac{p(x+1)}{p(x)}\,G_{x}^{\prime}\left(\frac{p(x+1)}{p(x)}\right)\qquad(x=0,1,\dots)$$

$$\sum_{y=0}^{\infty}\left\{f_{y}\,G_{y}(v_{y})+(f_{y+1}-f_{y}v_{y})\,G_{y}^{\prime}(v_{y})\right\}$$

$$\sum_{y=0}^{\infty}\left\{f_{y}\,G_{y}\left(\frac{\theta}{y+1}\right)+\left(f_{y+1}-\frac{f_{y}}{y+1}\,\theta\right)G_{y}^{\prime}\left(\frac{\theta}{y+1}\right)\right\}$$

$$G_{x}(v)=-\frac{(x+1)^{a}\,v^{m}}{m(m-1)}\qquad(m>0,\ m\neq 1)$$

$$S(x,P)=\begin{cases}m^{-1}\left\{\dfrac{p(1)}{p(0)}\right\}^{m}&(x=0)\\[2ex]\{m(m-1)\}^{-1}\left[(m-1)(x+1)^{a}\left\{\dfrac{p(x+1)}{p(x)}\right\}^{m}-m\,x^{a}\left\{\dfrac{p(x)}{p(x-1)}\right\}^{m-1}\right]&(x>0)\end{cases}$$

$$p_{M}(x)=\int_{\Theta_{M}}p_{M}(x\,|\,\theta_{M})\,\pi_{M}(\theta_{M})\,d\theta_{M}$$

$$\pi_{M}(\theta_{M})=c_{M}\,f_{M}(\theta_{M})$$

$$L_{M}\propto c_{M}\int_{\Theta_{M}}p_{M}(x\,|\,\theta_{M})\,f_{M}(\theta_{M})\,d\theta_{M}$$

$$p(x\,|\,\lambda)=e^{-k\lambda}\,\frac{(k\lambda)^{x}}{x!}\qquad(x=0,1,\dots)$$

$$\pi(\lambda)=\frac{\beta^{\alpha}}{\Gamma(\alpha)}\,\lambda^{\alpha-1}e^{-\beta\lambda}$$

$$p(x)=\frac{\Gamma(\alpha+x)}{\Gamma(\alpha)\,x!}\,(1-\phi)^{\alpha}\phi^{x}$$

$$\Lambda\,|\,\mbox{\boldmath$X$}^{n-1}=\mbox{\boldmath$x$}^{n-1}\sim\Gamma\left\{\alpha+t_{n-1},\beta+(n-1)k\right\}$$

$$p(x\,|\,\theta)=\frac{(s+x-1)!}{x!\,(s-1)!}\,(1-\theta)^{s}\theta^{x}\qquad(x=0,1,\dots)$$

$$\pi(\theta)=\frac{\Gamma(p+q)}{\Gamma(p)\,\Gamma(q)}\,\theta^{p-1}(1-\theta)^{q-1}$$

$$p(x)=\frac{\Gamma(p+q)}{\Gamma(p)\,\Gamma(q)}\,\frac{(s+x-1)!}{x!\,(s-1)!}\,\frac{\Gamma(p+x)\,\Gamma(q+s)}{\Gamma(p+q+s+x)}$$

$$\frac{p(x+1)}{p(x)}=\frac{(x+s)(x+p)}{(x+1)(x+p+q+s)}$$

$$\Theta\,|\,\mbox{\boldmath$X$}^{n-1}=\mbox{\boldmath$x$}^{n-1}\sim\beta\left\{p+t_{n-1},q+(n-1)s\right\}$$

The scores in the text are labelled $S(0,P)$ and $S(x,P)$ (single observation), $S_{N}(\mbox{\boldmath$0$},P)$ and $S_{N}(\mbox{\boldmath$x$},P)$ (sufficient statistic), and $S_{n}^{*}(0,P)$ and $S_{n}^{*}(x_{n},P)$ (prequential).

Full text

A Note on Bayesian Model Selection for Discrete Data Using Proper Scoring Rules

A. Philip Dawid, Department of Pure Mathematics and Mathematical Statistics, University of Cambridge, U.K.

Monica Musio, Department of Mathematics, University of Cagliari, Italy

Silvia Columbu, Department of Mathematics, University of Cagliari, Italy

Keywords: Consistent model selection; homogeneous score; discrete data; prequential

1 Introduction

It is well known that Bayesian model selection with improper within-model prior distributions is not well-defined, owing to the presence of an arbitrary multiplicative constant in each term of the marginal likelihood function. Recently (Dawid and Musio, 2015) it has been shown how this problem can be overcome if one replaces negative log-likelihood (the log score) by another, homogeneous, proper scoring rule (Parry et al., 2012)—since then the arbitrary constants do not enter into the formulae. That paper considered the case of continuous variables and, in particular, the Hyvärinen scoring rule (Hyvärinen, 2005), and showed that this approach will generally lead to consistent selection of the correct model.

The above approach can not be applied directly when the data are discrete, since then we need to use scoring rules specifically adapted to the discrete case, as characterised in Dawid et al. (2012). Here we investigate, by example, such a discrete data problem. In particular we consider the problem of distinguishing between the Poisson and the Negative Binomial distributions. Simulations indicate that the method will again deliver consistent selection of the true model.

2 Local scoring rules

Let ${\cal X}$ be a discrete sample space endowed with a structure whereby with each $x\in{\cal X}$ is associated a neighbourhood $N_{x}\subseteq{\cal X}$, containing $x$. In Dawid et al. (2012) it was shown how to define a proper local scoring rule $S(x,P)$ on ${\cal X}$, where $x\in{\cal X}$, and $P$ is a distribution over ${\cal X}$. The rule is proper if, for all $P$, $S(P,Q):={\mbox{E}}_{X\sim P}S(X,Q)$ is minimised for $Q=P$, and local if $S(x,P)$ depends on $P$ only through the probabilities it assigns to points in $N_{x}$. Under a condition on the neighbourhoods, we can define an undirected graph ${\cal G}$ on ${\cal X}$ such that we can take $y\in N_{x}$ just when $x$ and $y$ are identical or are adjacent in ${\cal G}$. Then all proper local scoring rules can be characterised, and (on excluding the log score, yielding what are termed key local proper scoring rules) any of these will be homogeneous in the sense that its value is unchanged when all probabilities in $N_{x}$ are scaled by the same positive constant.

In particular, suppose the sample space ${\cal X}$ is the set $\mathbb{Z}_{\geq 0}$ of non-negative integers, and we regard $x$ and $y$ as neighbours if and only if they differ by at most 1. It is shown in Dawid et al. (2012) that a key local scoring rule adapted to this structure has the form

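$$S(x,P)=G_{x-1}^{\prime}\left(\frac{p(x)}{p(x-1)}\right)+G_{x}\left(\frac{p(x+1)}{p(x)}\right)-\frac{p(x+1)}{p(x)}\,G_{x}^{\prime}\left(\frac{p(x+1)}{p(x)}\right)\qquad(x=0,1,\dots) \tag{1}$$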

where, for each $x\in\mathbb{Z}_{\geq 0}$ , $p(x)=P(X=x)$ , $G_{x}$ is a concave function on $\mathbb{R}^{+}$ , and the first term in (1) is absent if $x=0$ . It is clear from the way in which ratios enter (1) that such a scoring rule is homogeneous.

The cumulative score (1) based on an independent and identically distributed sample $(x_{1},\ldots,x_{n})$ in which the frequency of $y$ is $f_{y}$ $(y=0,1,\ldots)$ is

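$$\sum_{y=0}^{\infty}\left\{f_{y}\,G_{y}(v_{y})+(f_{y+1}-f_{y}v_{y})\,G_{y}^{\prime}(v_{y})\right\} \tag{2}$$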

with $v_{y}:=p(y+1)/p(y)$ . If for example we wished to fit the Poisson model $p(x)\propto\theta^{x}/x!$ , we might estimate $\theta$ by minimising the total empirical score

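$$\sum_{y=0}^{\infty}\left\{f_{y}\,G_{y}\left(\frac{\theta}{y+1}\right)+\left(f_{y+1}-\frac{f_{y}}{y+1}\,\theta\right)G_{y}^{\prime}\left(\frac{\theta}{y+1}\right)\right\}. \tag{3}$$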
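As a minimal sketch of this minimisation, the Python below evaluates the total empirical score for the Poisson model and minimises it over $\theta$ by golden-section search. It assumes, purely for illustration, the power-type choice $G_{y}(v)=-(y+1)^{a}v^{m}/\{m(m-1)\}$ with $a=m=2$; the function names, the search interval and the toy data are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def empirical_score(theta, data, a=2.0, m=2.0):
    """Total empirical score for the Poisson model p(x) ∝ theta^x / x!.

    Assumes G_y(v) = -(y+1)^a v^m / {m(m-1)}; for this model v_y = theta/(y+1).
    """
    f = Counter(data)  # f[y] is the frequency of y (0 if y was never observed)
    total = 0.0
    for y in range(max(data) + 1):
        v = theta / (y + 1)
        g = -((y + 1) ** a) * v ** m / (m * (m - 1))   # G_y(v_y)
        dg = -((y + 1) ** a) * v ** (m - 1) / (m - 1)  # G_y'(v_y)
        total += f[y] * g + (f[y + 1] - f[y] * v) * dg
    return total


def golden_section_min(fun, lo, hi, tol=1e-9):
    """Minimise a unimodal function on [lo, hi] by golden-section search."""
    ratio = (math.sqrt(5) - 1) / 2
    while hi - lo > tol:
        c = hi - ratio * (hi - lo)
        d = lo + ratio * (hi - lo)
        if fun(c) < fun(d):
            hi = d  # the minimum lies in [lo, d]
        else:
            lo = c  # the minimum lies in [c, hi]
    return (lo + hi) / 2


data = [0, 1, 1, 2, 3, 5]  # hypothetical toy sample
theta_hat = golden_section_min(lambda th: empirical_score(th, data), 0.01, 20.0)
```

With $a=m=2$ the total reduces to $n\theta^{2}/2-\theta\sum_{i}x_{i}$, so in this special case the minimiser is the sample mean, which the search should recover up to numerical tolerance.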

In the sequel we shall use the special case of (1) with

$$G_{x}(v)=-\frac{(x+1)^{a}\,v^{m}}{m(m-1)}\qquad(m>0,\ m\neq 1). \tag{4}$$

This gives the scoring rule

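$$S(x,P)=\begin{cases}m^{-1}\left\{\dfrac{p(1)}{p(0)}\right\}^{m}&(x=0)\\[2ex]\{m(m-1)\}^{-1}\left[(m-1)(x+1)^{a}\left\{\dfrac{p(x+1)}{p(x)}\right\}^{m}-m\,x^{a}\left\{\dfrac{p(x)}{p(x-1)}\right\}^{m-1}\right]&(x>0).\end{cases} \tag{5}$$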
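The scoring rule (5) depends on $P$ only through ratios of neighbouring probabilities, so it can be evaluated from an unnormalised mass function. The sketch below is one way this might be coded; the helper names `key_local_score`, `poisson_pmf` and `scaled_poisson` are hypothetical, and the defaults $a=m=2$ are chosen only for illustration.

```python
import math


def key_local_score(x, p, a=2.0, m=2.0):
    """Score (5) at outcome x for a mass function p, which may be unnormalised."""
    v_x = p(x + 1) / p(x)  # ratio p(x+1)/p(x)
    if x == 0:
        return v_x ** m / m
    v_prev = p(x) / p(x - 1)  # ratio p(x)/p(x-1)
    return ((m - 1) * (x + 1) ** a * v_x ** m
            - m * x ** a * v_prev ** (m - 1)) / (m * (m - 1))


def poisson_pmf(theta):
    """Normalised Poisson mass function with mean theta."""
    return lambda x: math.exp(-theta) * theta ** x / math.factorial(x)


def scaled_poisson(theta, c):
    """The same shape multiplied by an arbitrary positive constant c."""
    return lambda x: c * theta ** x / math.factorial(x)


score_proper = key_local_score(3, poisson_pmf(2.0))
score_scaled = key_local_score(3, scaled_poisson(2.0, 7.3))
```

The two values agree up to rounding, which illustrates the homogeneity property: the multiplicative constant cancels in every ratio.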

3 Bayesian Model Selection

Let ${\cal M}$ be a finite or countable class of statistical models for the same observable $X\in{\cal X}$ . Each $M\in{\cal M}$ is a parametric family, with parameter $\theta_{M}\in\Theta_{M}$ , a $d_{M}$ -dimensional Euclidean space; when $M$ obtains, with parameter value $\theta_{M}$ , then $X$ has distribution $P_{\theta_{M}}$ , with density function (probability mass function) $p_{M}(x\,|\,\theta_{M})$ . Having observed data $X=x$ , we wish to make inference about which model $M\in{\cal M}$ (and possibly which parameter-value $\theta_{M}$ ) actually generated the data.

The Bayesian approach assigns, within each model $M$ , a prior distribution $\Pi_{M}$ , with density $\pi_{M}(\cdot)$ say, for its parameter $\theta_{M}$ . The associated predictive distribution $P_{M}$ of $X$ (given only the validity of model $M$ , but no information on its parameter) has density function

$$p_{M}(x)=\int_{\Theta_{M}}p_{M}(x\,|\,\theta_{M})\,\pi_{M}(\theta_{M})\,d\theta_{M}.$$

Any function over ${\cal M}$ proportional to $p_{M}(x)$ (considered as a function of $M$ , for fixed $x$ ) supplies the marginal likelihood function, $L(M)$ , based on data $X=x$ . In typical asymptotic scenarios, selection of the model maximising $L(M)$ , or, equivalently, minimising the log score $S_{L}(x,P_{M}):=-\log p_{M}(x)$ , will consistently select the true model (Dawid, 2011).

“Objective” Bayesian inference attempts to use standardised within-model priors $\Pi_{M}$ intended to represent “prior ignorance”. In many applications, such an “ignorance prior” for $\theta_{M}$ is not a genuine distribution, but rather an “improper” $\sigma$ -finite but not finite measure, with a “density” $\pi_{M}(\cdot)$ that does not have a finite integral and so can not be normalised to be a proper probability density. Typically one writes $\pi_{M}(\theta_{M})\propto f_{M}(\theta_{M})$ , where $f_{M}$ is a given non-integrable function and the constant of proportionality is not specified. Even without that specification, this allows mechanical computation of a formal within-model- $M$ posterior density $\pi_{M}(\theta_{M}\,|\,x)$ , by application of Bayes’s formula: $\pi_{M}(\theta_{M}\,|\,x)\propto p_{M}(x\,|\,\theta_{M})\,\pi_{M}(\theta_{M})\propto p_{M}(x\,|\,\theta_{M})\,f_{M}(\theta_{M})$ . This will often yield an integrable function and hence the possibility of normalisation to supply a genuine probability density.

However things do not work out so well when we turn to model selection. We have, for each model $M$ ,

$$\pi_{M}(\theta_{M})=c_{M}\,f_{M}(\theta_{M}),$$

where $c_{M}$ is the unspecified proportionality constant. This formally leads to the marginal likelihood function

$$L_{M}\propto c_{M}\int_{\Theta_{M}}p_{M}(x\,|\,\theta_{M})\,f_{M}(\theta_{M})\,d\theta_{M}.$$

But since this involves the unspecified constants $c_{M}$ , which could vary arbitrarily with $M$ , it is no longer meaningful to compare models by means of their marginal likelihoods.

A way round this problem was proposed in Dawid and Musio (2015): instead of attempting to minimise the log score $S_{L}(x,P_{M}):=-\log p_{M}(x)$ , we replace that with another proper scoring rule $S(x,P_{M})$ . And if that scoring rule is homogeneous, it will simply not involve the unspecified constant $c_{M}$ . In Dawid and Musio (2015) a detailed analysis of this approach was conducted for the case of continuous data and the Hyvärinen scoring rule, and it was shown that it will typically deliver consistent selection of the true model.

4 Discrete model selection

We shall investigate empirically, for a simple example, the validity of the above results when generalised to the case of discrete data. We shall use the scoring rule (5), and apply this to the choice between a Poisson and a Negative Binomial model. For this purpose we first need to compute, for each of these models separately, the appropriate score.

5 Poisson model

Consider the Poisson model $X\sim{\cal P}(k\Lambda)$ :

$$p(x\,|\,\lambda)=e^{-k\lambda}\,\frac{(k\lambda)^{x}}{x!}\qquad(x=0,1,\dots), \tag{7}$$

with conjugate prior $\Lambda\sim\Gamma(\alpha,\beta)$ :

$$\pi(\lambda)=\frac{\beta^{\alpha}}{\Gamma(\alpha)}\,\lambda^{\alpha-1}e^{-\beta\lambda}. \tag{8}$$

For propriety we require $\alpha>0$ , $\beta>0$ .

The predictive distribution $P$ has density function

$$p(x)=\frac{\Gamma(\alpha+x)}{\Gamma(\alpha)\,x!}\,(1-\phi)^{\alpha}\phi^{x} \tag{9}$$

with $\phi:=k/(\beta+k)$ .

Then $p(x+1)/p(x)=\phi(x+\alpha)/(x+1)$ , and so

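$$S(0,P)=\frac{(\alpha\phi)^{m}}{m}, \tag{10}$$

$$S(x,P)=\frac{1}{m(m-1)}\left[(m-1)(x+1)^{a}\left\{\frac{\phi(x+\alpha)}{x+1}\right\}^{m}-m\,x^{a}\left\{\frac{\phi(x+\alpha-1)}{x}\right\}^{m-1}\right]\qquad(x>0). \tag{11}$$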

5.1 Multiple observations

Suppose now we have $N$ independent and identically distributed observations $\mbox{\boldmath$ X $}_{N}=(X_{1},\ldots,X_{N})$ from the above Poisson distribution. We can apply the above score in two different ways:

(a) Apply directly to the sufficient statistic.

(b) Apply prequentially to all observations.

5.1.1 Sufficient statistic

The sufficient statistic is $T_{N}=\sum_{i=1}^{N}X_{i}$ , with distribution ${\cal P}(Nk\Lambda)$ . So the score computed this way is simply obtained from (10) and (11) on replacing $x$ by $t_{N}$ and $k$ by $Nk$ . This gives

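$$S_{N}(\mbox{\boldmath$0$},P)=\frac{(\alpha\phi_{N})^{m}}{m}, \tag{12}$$

$$S_{N}(\mbox{\boldmath$x$},P)=\frac{1}{m(m-1)}\left[(m-1)(t_{N}+1)^{a}\left\{\frac{\phi_{N}(t_{N}+\alpha)}{t_{N}+1}\right\}^{m}-m\,t_{N}^{a}\left\{\frac{\phi_{N}(t_{N}+\alpha-1)}{t_{N}}\right\}^{m-1}\right]\qquad(t_{N}>0), \tag{13}$$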

where $\phi_{N}:=Nk/(\beta+Nk)$ .

5.1.2 Prequential

Now suppose we have already observed $\mbox{\boldmath$ X $}^{n-1}=\mbox{\boldmath$ x $}^{n-1}$ . The posterior distribution of $\Lambda$ is

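$$\Lambda\,|\,\mbox{\boldmath$X$}^{n-1}=\mbox{\boldmath$x$}^{n-1}\sim\Gamma\left\{\alpha+t_{n-1},\beta+(n-1)k\right\}.$$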

So the predictive distribution of $X_{n}$ , given the previous observations $\mbox{\boldmath$ X $}^{n-1}=\mbox{\boldmath$ x $}^{n-1}$ , is obtained from (10) and (11) on replacing $x$ with $x_{n}$ , $\alpha$ with $\alpha+t_{n-1}$ , and $\beta$ with $\beta+(n-1)k$ . The incremental contribution to the prequential score is thus given by:

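$$S_{n}^{*}(0,P)=\frac{1}{m}\left\{\phi_{n}^{*}(\alpha+t_{n-1})\right\}^{m}, \tag{14}$$

$$S_{n}^{*}(x_{n},P)=\frac{1}{m(m-1)}\left[(m-1)(x_{n}+1)^{a}\left\{\frac{\phi_{n}^{*}(\alpha+t_{n-1}+x_{n})}{x_{n}+1}\right\}^{m}-m\,x_{n}^{a}\left\{\frac{\phi_{n}^{*}(\alpha+t_{n-1}+x_{n}-1)}{x_{n}}\right\}^{m-1}\right]\qquad(x_{n}>0), \tag{15}$$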

with $\phi_{n}^{*}:=k/(\beta+nk)$ .

The total prequential score is obtained by summing this from $n=1$ to $N$ .
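The prequential recursion for the Poisson model can be sketched as follows. This is an illustrative implementation under stated assumptions rather than the authors' code: the function names are hypothetical, the defaults $a=m=2$ are chosen for illustration, and the predictive ratio $p(y+1)/p(y)$ at step $n$ is taken to be $\phi_{n}^{*}(y+\alpha+t_{n-1})/(y+1)$ with $\phi_{n}^{*}=k/(\beta+nk)$, as obtained from the substitutions described above.

```python
def score_from_ratio(x, v, a=2.0, m=2.0):
    """Scoring rule (5), given a function v(y) = p(y+1)/p(y)."""
    if x == 0:
        return v(0) ** m / m
    return ((m - 1) * (x + 1) ** a * v(x) ** m
            - m * x ** a * v(x - 1) ** (m - 1)) / (m * (m - 1))


def poisson_prequential(xs, alpha, beta, k=1.0, a=2.0, m=2.0):
    """Total prequential score of the Poisson model with a Gamma(alpha, beta) prior."""
    total, t_prev = 0.0, 0
    for n, x in enumerate(xs, start=1):
        phi = k / (beta + n * k)  # phi_n^* after n - 1 observations

        def ratio(y, phi=phi, t_prev=t_prev):
            # one-step-ahead predictive ratio p(y+1)/p(y)
            return phi * (y + alpha + t_prev) / (y + 1)

        total += score_from_ratio(x, ratio, a, m)
        t_prev += x  # running sufficient statistic t_n
    return total


total_proper = poisson_prequential([2, 0], alpha=1.0, beta=1.0)
```

Because only ratios of predictive probabilities are used, the same function can be called with `alpha=0.0, beta=0.0`, which corresponds to taking the formal limit $\alpha,\beta\downarrow 0$.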

5.2 Improper prior

The usual improper prior is the formal limit with $\alpha,\beta\downarrow 0$ . In this case (12) and (13) become:

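$$S_{N}(\mbox{\boldmath$0$},P)=0, \tag{16}$$

$$S_{N}(\mbox{\boldmath$x$},P)=\frac{1}{m(m-1)}\left[(m-1)(t_{N}+1)^{a}\left\{\frac{t_{N}}{t_{N}+1}\right\}^{m}-m\,t_{N}^{a}\left\{\frac{t_{N}-1}{t_{N}}\right\}^{m-1}\right]\qquad(t_{N}>0). \tag{17}$$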

Note that the score is well-defined even when all observations are $0$, in which case the posterior is improper.

For the prequential version, we obtain, from (14) and (15):

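$$S_{n}^{*}(0,P)=\frac{1}{m}\left(\frac{t_{n-1}}{n}\right)^{m}, \tag{18}$$

$$S_{n}^{*}(x_{n},P)=\frac{1}{m(m-1)}\left[(m-1)(x_{n}+1)^{a}\left\{\frac{t_{n-1}+x_{n}}{n(x_{n}+1)}\right\}^{m}-m\,x_{n}^{a}\left\{\frac{t_{n-1}+x_{n}-1}{n\,x_{n}}\right\}^{m-1}\right]\qquad(x_{n}>0). \tag{19}$$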

An alternative improper prior is the Jeffreys prior, having $\alpha=1/2$ , $\beta\downarrow 0$ , which is easily handled similarly.

6 Negative Binomial model

Now we consider an alternative model, the Negative Binomial $X\sim{\cal NB}(s;\Theta)$ , having

$$p(x\,|\,\theta)=\frac{(s+x-1)!}{x!\,(s-1)!}\,(1-\theta)^{s}\theta^{x}\qquad(x=0,1,\dots), \tag{20}$$

with conjugate prior $\Theta\sim\beta(p,q)$ :

$$\pi(\theta)=\frac{\Gamma(p+q)}{\Gamma(p)\,\Gamma(q)}\,\theta^{p-1}(1-\theta)^{q-1}.$$

For propriety we require $p>0$ , $q>0$ .

The predictive density is

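$$p(x)=\frac{\Gamma(p+q)}{\Gamma(p)\,\Gamma(q)}\,\frac{(s+x-1)!}{x!\,(s-1)!}\,\frac{\Gamma(p+x)\,\Gamma(q+s)}{\Gamma(p+q+s+x)}.$$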

Then

$$\frac{p(x+1)}{p(x)}=\frac{(x+s)(x+p)}{(x+1)(x+p+q+s)},$$

and so we have:

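$$S(0,P)=\frac{1}{m}\left\{\frac{sp}{p+q+s}\right\}^{m}, \tag{23}$$

$$S(x,P)=\frac{1}{m(m-1)}\left[(m-1)(x+1)^{a}\left\{\frac{(x+s)(x+p)}{(x+1)(x+p+q+s)}\right\}^{m}-m\,x^{a}\left\{\frac{(x+s-1)(x+p-1)}{x\,(x+p+q+s-1)}\right\}^{m-1}\right]\qquad(x>0). \tag{24}$$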

6.1 Multiple observations

Again, we can handle multiple observations either by restricting to the sufficient statistic, or by cumulating the prequential score.

6.1.1 Sufficient statistic

The sufficient statistic is $T_{N}=\sum_{i=1}^{N}X_{i}$ , with distribution ${\cal NB}(Ns,\Theta)$ . So the score computed this way is simply obtained from (23) and (24) on replacing $x$ by $t_{N}$ and $s$ by $Ns$ . This gives

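$$S_{N}(\mbox{\boldmath$0$},P)=\frac{1}{m}\left\{\frac{Nsp}{p+q+Ns}\right\}^{m}, \tag{25}$$

$$S_{N}(\mbox{\boldmath$x$},P)=\frac{1}{m(m-1)}\left[(m-1)(t_{N}+1)^{a}\left\{\frac{(t_{N}+Ns)(t_{N}+p)}{(t_{N}+1)(t_{N}+p+q+Ns)}\right\}^{m}-m\,t_{N}^{a}\left\{\frac{(t_{N}+Ns-1)(t_{N}+p-1)}{t_{N}\,(t_{N}+p+q+Ns-1)}\right\}^{m-1}\right]\qquad(t_{N}>0). \tag{26}$$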

6.1.2 Prequential

Now suppose we have already observed $\mbox{\boldmath$ X $}^{n-1}=\mbox{\boldmath$ x $}^{n-1}$ . The posterior distribution of $\Theta$ is

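$$\Theta\,|\,\mbox{\boldmath$X$}^{n-1}=\mbox{\boldmath$x$}^{n-1}\sim\beta\left\{p+t_{n-1},q+(n-1)s\right\}.$$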

So the predictive density of $X_{n}$ , given the previous observations $\mbox{\boldmath$ X $}^{n-1}=\mbox{\boldmath$ x $}^{n-1}$ , is obtained from (23) and (24) on replacing $x$ with $x_{n}$ , $p$ with $p+t_{n-1}$ , and $q$ with $q+(n-1)s$ . The incremental contribution to the prequential score is thus given by:

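$$S_{n}^{*}(0,P)=\frac{1}{m}\left\{\frac{s\,(p+t_{n-1})}{p+q+t_{n-1}+ns}\right\}^{m}, \tag{27}$$

$$S_{n}^{*}(x_{n},P)=\frac{1}{m(m-1)}\left[(m-1)(x_{n}+1)^{a}\left\{\frac{(x_{n}+s)(x_{n}+p+t_{n-1})}{(x_{n}+1)(x_{n}+p+q+t_{n-1}+ns)}\right\}^{m}-m\,x_{n}^{a}\left\{\frac{(x_{n}+s-1)(x_{n}+p+t_{n-1}-1)}{x_{n}\,(x_{n}+p+q+t_{n-1}+ns-1)}\right\}^{m-1}\right]\qquad(x_{n}>0). \tag{28}$$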

The total prequential score is obtained by summing this from $n=1$ to $N$ .
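A corresponding sketch for the Negative Binomial model is given below. As before, this is an illustration under assumptions, not the authors' implementation: the function names are hypothetical, $a=m=2$ is used by default, and the predictive ratio at step $n$ is taken to be $(y+s)(y+p+t_{n-1})/\{(y+1)(y+p+q+t_{n-1}+ns)\}$, which follows from the substitutions $p\to p+t_{n-1}$ and $q\to q+(n-1)s$ in the ratio $p(x+1)/p(x)$.

```python
def score_from_ratio(x, v, a=2.0, m=2.0):
    """Scoring rule (5), given a function v(y) = p(y+1)/p(y)."""
    if x == 0:
        return v(0) ** m / m
    return ((m - 1) * (x + 1) ** a * v(x) ** m
            - m * x ** a * v(x - 1) ** (m - 1)) / (m * (m - 1))


def negbin_prequential(xs, p, q, s, a=2.0, m=2.0):
    """Total prequential score of the NB(s; theta) model with a beta(p, q) prior."""
    total, t_prev = 0.0, 0
    for n, x in enumerate(xs, start=1):

        def ratio(y, n=n, t_prev=t_prev):
            # one-step-ahead predictive ratio p(y+1)/p(y) after n - 1 observations
            return ((y + s) * (y + p + t_prev)
                    / ((y + 1) * (y + p + t_prev + q + n * s)))

        total += score_from_ratio(x, ratio, a, m)
        t_prev += x  # running sufficient statistic t_n
    return total


total_nb = negbin_prequential([2, 0], p=1.0, q=1.0, s=2)
```

Calling the function with `p=0.0, q=0.0` corresponds to taking the formal limit $p,q\downarrow 0$; the denominators remain positive because $s>0$.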

6.2 Improper prior

The usual improper prior is the formal limit with $p,q\downarrow 0$ . In this case (25) and (26) become:

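$$S_{N}(\mbox{\boldmath$0$},P)=0,$$

$$S_{N}(\mbox{\boldmath$x$},P)=\frac{1}{m(m-1)}\left[(m-1)(t_{N}+1)^{a}\left\{\frac{t_{N}}{t_{N}+1}\right\}^{m}-m\,t_{N}^{a}\left\{\frac{t_{N}-1}{t_{N}}\right\}^{m-1}\right]\qquad(t_{N}>0).$$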

The score is well-defined even when all observations are $0$, in which case the posterior is improper.

For the prequential version, we obtain, from (27) and (28):

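$$S_{n}^{*}(0,P)=\frac{1}{m}\left\{\frac{s\,t_{n-1}}{t_{n-1}+ns}\right\}^{m},$$

$$S_{n}^{*}(x_{n},P)=\frac{1}{m(m-1)}\left[(m-1)(x_{n}+1)^{a}\left\{\frac{(x_{n}+s)(x_{n}+t_{n-1})}{(x_{n}+1)(x_{n}+t_{n-1}+ns)}\right\}^{m}-m\,x_{n}^{a}\left\{\frac{(x_{n}+s-1)(x_{n}+t_{n-1}-1)}{x_{n}\,(x_{n}+t_{n-1}+ns-1)}\right\}^{m-1}\right]\qquad(x_{n}>0).$$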

The total prequential score is obtained by summing this from $n=1$ to $N$ .

Again, similar expressions can be found using the improper Jeffreys prior, which has $p\downarrow 0$ , $q=1/2$ .

7 Simulations

We generated observations from either the Poisson distribution (7) with $k=1$ , $\lambda=10$ , or the Negative Binomial distribution (20) with $s=81$ , $\theta=0.1$ . These both have variance $10$ , the former having mean $10$ , and the latter mean $9$ . We used, as the scoring rule, the special case of (5) having $a=m=2$ , namely

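$$S(x,P)=\begin{cases}\dfrac{1}{2}\left\{\dfrac{p(1)}{p(0)}\right\}^{2}&(x=0)\\[2ex]\dfrac{1}{2}(x+1)^{2}\left\{\dfrac{p(x+1)}{p(x)}\right\}^{2}-x^{2}\,\dfrac{p(x)}{p(x-1)}&(x>0).\end{cases}$$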

For each generating distribution we computed the excess of the cumulative prequential score for the wrong model over that for the correct model. These differences are shown, as a function of increasing data, in Figures 1 and 2 respectively. Each figure displays 10 sample sequences generated from the indicated distribution, as well as the average taken over a sample of 100 sequences.

In each case we see a clear linear upward trend, supporting the expectation of consistent model selection, although even with 1000 observations there is a non-negligible probability of a negative value, giving a misleading preference for the wrong model.
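An experiment of this kind might be sketched as below. Several choices here are assumptions made for illustration and are not stated in the text: the improper priors $\alpha,\beta\downarrow 0$ and $p,q\downarrow 0$ are used for both models, Poisson variates are drawn by Knuth's multiplication method, Negative Binomial variates are drawn as a Gamma mixture of Poissons, and the default numbers of observations and sequences are kept small. All function names are hypothetical.

```python
import math
import random


def score(x, v, a=2.0, m=2.0):
    """Scoring rule (5) with a = m = 2 by default; v(y) = p(y+1)/p(y)."""
    if x == 0:
        return v(0) ** m / m
    return ((m - 1) * (x + 1) ** a * v(x) ** m
            - m * x ** a * v(x - 1) ** (m - 1)) / (m * (m - 1))


def poisson_ratio(n, t_prev):
    # Predictive ratio under the improper prior alpha, beta -> 0 (assumption).
    return lambda y: (y + t_prev) / (n * (y + 1))


def negbin_ratio(n, t_prev, s):
    # Predictive ratio under the improper prior p, q -> 0 (assumption).
    return lambda y: (y + s) * (y + t_prev) / ((y + 1) * (y + t_prev + n * s))


def excess_path(xs, truth, s=81):
    """Cumulative prequential score of the wrong model minus the correct one."""
    path, diff, t_prev = [], 0.0, 0
    for n, x in enumerate(xs, start=1):
        s_pois = score(x, poisson_ratio(n, t_prev))
        s_nb = score(x, negbin_ratio(n, t_prev, s))
        diff += (s_nb - s_pois) if truth == "poisson" else (s_pois - s_nb)
        t_prev += x
        path.append(diff)
    return path


def draw_poisson(rng, lam):
    """Knuth's method: count uniforms until their product drops below exp(-lam)."""
    limit, count, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:
        count += 1
        prod *= rng.random()
    return count


def draw_negbin(rng, s, theta):
    """NB(s; theta) as a Poisson whose mean is Gamma(shape s, scale theta/(1-theta))."""
    return draw_poisson(rng, rng.gammavariate(s, theta / (1.0 - theta)))


def simulate(truth, n_obs=200, n_seq=20, seed=0):
    """Average excess-score path over n_seq simulated sequences."""
    rng = random.Random(seed)
    paths = []
    for _ in range(n_seq):
        if truth == "poisson":
            xs = [draw_poisson(rng, 10.0) for _ in range(n_obs)]
        else:
            xs = [draw_negbin(rng, 81, 0.1) for _ in range(n_obs)]
        paths.append(excess_path(xs, truth))
    return [sum(col) / n_seq for col in zip(*paths)]


average_path = simulate("poisson", n_obs=50, n_seq=3, seed=1)
```

Plotting `simulate("poisson")` and `simulate("negbin")` against the observation index gives curves of the kind described above. No particular output is claimed here, since individual paths can be negative over long stretches.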

8 Conclusions

We have extended the Bayesian model selection methodology of Dawid and Musio (2015) to apply to problems with discrete data. We have conducted a simulation study to compare Poisson and Negative Binomial distributions. The results suggest that the method will consistently select the correct model as the number of data points increases.

Acknowledgements

Philip Dawid’s research was supported through an Emeritus Fellowship from the Leverhulme Trust.

Bibliography

The reference list from the paper itself.

Dawid, A. P. (2011). Posterior model probabilities. In Philosophy of Statistics (ed. P. S. Bandyopadhyay and M. Forster), pp. 607–630. Elsevier, New York.
Dawid, A. P., Lauritzen, S. L. and Parry, M. (2012). Proper local scoring rules on discrete sample spaces. Annals of Statistics 40, 593–608.
Dawid, A. P. and Musio, M. (2015). Bayesian model selection based on proper scoring rules (with Discussion). Bayesian Analysis 10, 479–521.
Hyvärinen, A. (2005). Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research 6, 695–709.
Parry, M. F., Dawid, A. P. and Lauritzen, S. L. (2012). Proper local scoring rules. Annals of Statistics 40, 561–592.