A Note on Bayesian Model Selection for Discrete Data Using Proper Scoring Rules
A. Philip Dawid, Monica Musio, Silvia Columbu

TL;DR
This paper proposes a Bayesian model selection method for discrete data using proper scoring rules, enabling the use of improper priors and demonstrating consistent model choice between Poisson and Negative Binomial models through simulations.
Contribution
It introduces a scoring rule-based Bayesian approach for model selection with improper priors, specifically applied to discrete distributions like Poisson and Negative Binomial.
Findings
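In simulations, the excess prequential score of the wrong model over the correct model trends upward, which supports consistent selection of the true model.
Homogeneous scoring rules do not involve the arbitrary constant of an improper prior.
Even with 1000 observations there remains a non-negligible probability of preferring the wrong model.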
Abstract
We consider the problem of choosing between parametric models for a discrete observable, taking a Bayesian approach in which the within-model prior distributions are allowed to be improper. In order to avoid the ambiguity in the marginal likelihood function in such a case, we apply a homogeneous scoring rule. For the particular case of distinguishing between Poisson and Negative Binomial models, we conduct simulations that indicate that, applied prequentially, the method will consistently select the true model.
A. Philip Dawid, Department of Pure Mathematics and Mathematical Statistics, University of Cambridge, U.K.
Monica Musio, Department of Mathematics, University of Cagliari, Italy
Silvia Columbu, Department of Mathematics, University of Cagliari, Italy
Keywords: Consistent model selection; homogeneous score; discrete data; prequential
1 Introduction
It is well known that Bayesian model selection with improper within-model prior distributions is not well-defined, owing to the presence of an arbitrary multiplicative constant in each term of the marginal likelihood function. Recently (Dawid and Musio, 2015) it has been shown how this problem can be overcome if one replaces negative log-likelihood (the log score) by another, homogeneous, proper scoring rule (Parry et al., 2012)—since then the arbitrary constants do not enter into the formulae. That paper considered the case of continuous variables and, in particular, the Hyvärinen scoring rule (Hyvärinen, 2005), and showed that this approach will generally lead to consistent selection of the correct model.
The above approach cannot be applied directly when the data are discrete, since we then need scoring rules specifically adapted to the discrete case, as characterised in Dawid et al. (2012). Here we investigate, by example, such a discrete data problem. In particular, we consider the problem of distinguishing between the Poisson and the Negative Binomial distributions. Simulations indicate that the method will again deliver consistent selection of the true model.
2 Local scoring rules
Let $\mathcal{X}$ be a discrete sample space endowed with a structure whereby with each $x \in \mathcal{X}$ is associated a neighbourhood $N_x \subseteq \mathcal{X}$, containing $x$. In Dawid et al. (2012) it was shown how to define a proper local scoring rule $S(x, Q)$ on $\mathcal{X}$, where $x \in \mathcal{X}$ and $Q$ is a distribution over $\mathcal{X}$. The rule is proper if, for all $P$, the expected score $S(P, Q) = E_{X \sim P}\, S(X, Q)$ is minimised for $Q = P$, and local if $S(x, Q)$ depends on $Q$ only through the probabilities it assigns to points in $N_x$. Under a condition on the neighbourhoods, we can define an undirected graph $\mathcal{G}$ on $\mathcal{X}$ such that we can take $y \in N_x$ just when $x$ and $y$ are identical or are adjacent in $\mathcal{G}$. Then all proper local scoring rules can be characterised, and (on excluding the log score, yielding what are termed key local proper scoring rules) any of these will be homogeneous in the sense that its value is unchanged when all probabilities in $N_x$ are scaled by the same positive constant.
In particular, suppose the sample space is the set of non-negative integers, and we regard $x$ and $y$ as neighbours if and only if they differ by at most 1. It is shown in Dawid et al. (2012) that a key local scoring rule adapted to this structure has the form
[TABLE]
where, for each $x$, $G_x$ is a concave function on $(0, \infty)$, and the first term in (1) is absent if $x = 0$. It is clear from the way in which ratios of probabilities enter (1) that such a scoring rule is homogeneous.
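The homogeneity property can be checked numerically. The sketch below assumes a ratio-based local score of the form $G'(q(x)/q(x-1)) + G(u) - u\,G'(u)$ with $u = q(x+1)/q(x)$, built from a single concave function $G(u) = -u^2/2$ for every $x$. This choice of $G$ and the helper names `local_score`, `poisson_pmf` and `scaled` are ours for illustration and are not taken from the paper.

```python
import math


def G(u):
    """Illustrative concave function (an assumption, not the paper's choice)."""
    return -0.5 * u * u


def G_prime(u):
    """Derivative of the illustrative G."""
    return -u


def local_score(x, q):
    """Ratio-based local score at x for a (possibly unnormalised) mass function q."""
    up = q(x + 1) / q(x)
    score = G(up) - up * G_prime(up)
    if x > 0:  # the first term is absent when x = 0
        score += G_prime(q(x) / q(x - 1))
    return score


def poisson_pmf(theta):
    """Poisson probability mass function with mean theta."""
    return lambda x: math.exp(-theta + x * math.log(theta) - math.lgamma(x + 1))


def scaled(q, c):
    """The same mass function multiplied by an arbitrary positive constant c."""
    return lambda x: c * q(x)


q = poisson_pmf(2.5)
for x in range(4):
    # The two printed scores agree: only ratios of q enter the score.
    print(x, local_score(x, q), local_score(x, scaled(q, 1000.0)))
```

Because only ratios of neighbouring probabilities appear, any constant multiplying `q` cancels, which is the property exploited for improper priors.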
The cumulative score (1) based on an independent and identically distributed sample in which the frequency of $x$ is $n_x$ is
[TABLE]
with . If for example we wished to fit the Poisson model , we might estimate by minimising the total empirical score
[TABLE]
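Minimum-score estimation of a Poisson mean can be sketched as follows. The code assumes the illustrative choice $G(u) = -u^2/2$ (ours, not necessarily the paper's), for which the Poisson ratios $q(x+1)/q(x) = \theta/(x+1)$ make the total empirical score quadratic in $\theta$, so the minimiser has a closed form. The function names and the toy data are hypothetical.

```python
from collections import Counter


def poisson_ratio_score(x, theta):
    """Illustrative score of Poisson(theta) at x, using G(u) = -u^2/2 (an assumption)."""
    up = theta / (x + 1)                 # q(x+1)/q(x) for a Poisson mass function
    down = theta / x if x > 0 else 0.0   # q(x)/q(x-1); this term is absent when x = 0
    return 0.5 * up * up - down


def total_score(freqs, theta):
    """Total empirical score, with freqs mapping each value x to its frequency."""
    return sum(n * poisson_ratio_score(x, theta) for x, n in freqs.items())


def min_score_estimate(freqs):
    """Minimiser in theta of the total score, which is quadratic in theta."""
    num = sum(n / x for x, n in freqs.items() if x > 0)
    den = sum(n / (x + 1) ** 2 for x, n in freqs.items())
    return num / den


data = [0, 2, 3, 1, 4, 2, 2, 5, 1, 3]   # toy sample, for illustration only
freqs = Counter(data)
theta_hat = min_score_estimate(freqs)
print(theta_hat, total_score(freqs, theta_hat))
```

With a different concave function the minimiser would generally need a numerical search instead of a closed form.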
In the sequel we shall use the special case of (1) with
[TABLE]
This gives the scoring rule
[TABLE]
3 Bayesian Model Selection
Let $\mathcal{M}$ be a finite or countable class of statistical models for the same observable $X$. Each $M \in \mathcal{M}$ is a parametric family, with parameter $\theta_M \in \Theta_M$, a $d_M$-dimensional Euclidean space; when $M$ obtains, with parameter value $\theta_M$, then $X$ has distribution $P_{\theta_M}$, with density function (probability mass function) $p_M(x \mid \theta_M)$. Having observed data $X = x$, we wish to make inference about which model $M$ (and possibly which parameter value $\theta_M$) actually generated the data.
The Bayesian approach assigns, within each model $M$, a prior distribution $\Pi_M$, with density $\pi_M(\cdot)$ say, for its parameter $\theta_M$. The associated predictive distribution $P_M$ of $X$ (given only the validity of model $M$, but no information on its parameter) has density function
$$p_M(x) = \int_{\Theta_M} p_M(x \mid \theta_M)\, \pi_M(\theta_M)\, d\theta_M.$$
Any function over $\mathcal{M}$ proportional to $p_M(x)$ (considered as a function of $M$, for fixed $x$) supplies the marginal likelihood function, $L(M)$, based on data $x$. In typical asymptotic scenarios, selection of the model maximising $L(M)$ or, equivalently, minimising the log score $-\log p_M(x)$, will consistently select the true model (Dawid, 2011).
“Objective” Bayesian inference attempts to use standardised within-model priors intended to represent “prior ignorance”. In many applications, such an “ignorance prior” for $\theta_M$ is not a genuine distribution, but rather an “improper” measure, $\sigma$-finite but not finite, with a “density” that does not have a finite integral and so cannot be normalised to be a proper probability density. Typically one writes $\pi_M(\theta_M) \propto h_M(\theta_M)$, where $h_M$ is a given non-integrable function and the constant of proportionality is not specified. Even without that specification, this allows mechanical computation of a formal within-model-$M$ posterior density $\pi_M(\theta_M \mid x)$, by application of Bayes’s formula: $\pi_M(\theta_M \mid x) \propto p_M(x \mid \theta_M)\, h_M(\theta_M)$. This will often yield an integrable function and hence the possibility of normalisation to supply a genuine probability density.
However, things do not work out so well when we turn to model selection. We have, for each model $M$,
$$p_M(x) = c_M \int_{\Theta_M} p_M(x \mid \theta_M)\, h_M(\theta_M)\, d\theta_M,$$
where $c_M$ is the unspecified proportionality constant. This formally leads to the marginal likelihood function
$$L(M) \propto c_M \int_{\Theta_M} p_M(x \mid \theta_M)\, h_M(\theta_M)\, d\theta_M.$$
But since this involves the unspecified constants $c_M$, which could vary arbitrarily with $M$, it is no longer meaningful to compare models by means of their marginal likelihoods.
A way round this problem was proposed in Dawid and Musio (2015): instead of attempting to minimise the log score $-\log p_M(x)$, we replace it with another proper scoring rule $S(x, P_M)$. If that scoring rule is homogeneous, it will simply not involve the unspecified constant $c_M$. In Dawid and Musio (2015) a detailed analysis of this approach was conducted for the case of continuous data and the Hyvärinen scoring rule, and it was shown that it will typically deliver consistent selection of the true model.
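A small numerical illustration of the ambiguity, and of why ratios escape it, is sketched below. It assumes a Poisson likelihood with the improper prior density $c/\theta$, chosen here purely as an example, for which the formal marginal at $x \geq 1$ is $c\,\Gamma(x)/x! = c/x$. The function name `formal_marginal` is hypothetical.

```python
import math


def formal_marginal(x, c):
    """Formal Poisson marginal at x >= 1 under the improper prior c / theta.

    integral_0^inf e^{-theta} theta^x / x! * (c / theta) d theta
        = c * Gamma(x) / x! = c / x
    """
    return c * math.exp(math.lgamma(x) - math.lgamma(x + 1))


for c in (1.0, 10.0, 0.01):
    log_score = -math.log(formal_marginal(3, c))            # shifts by -log(c)
    ratio = formal_marginal(4, c) / formal_marginal(3, c)   # free of c
    print(c, log_score, ratio)
```

The log score moves by an arbitrary amount as `c` changes, so it cannot be compared across models, whereas the ratio of neighbouring formal probabilities, which is all a homogeneous rule uses, stays fixed.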
4 Discrete model selection
We shall investigate empirically, for a simple example, the validity of the above results when generalised to the case of discrete data. We shall use the scoring rule (5), and apply this to the choice between a Poisson and a Negative Binomial model. For this purpose we first need to compute, for each of these models separately, the appropriate score.
5 Poisson model
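Consider the Poisson model with mean $\theta$:
$$p(x \mid \theta) = e^{-\theta}\, \frac{\theta^{x}}{x!} \qquad (x = 0, 1, \ldots), \tag{7}$$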
with conjugate prior :
[TABLE]
For propriety we require , .
The predictive distribution has density function
[TABLE]
with .
Then , and so
[TABLE]
5.1 Multiple observations
Suppose now we have independent and identically distributed observations $\mathbf{X}_{N}=(X_{1},\ldots,X_{N})$ from the above Poisson distribution. We can apply the above score in two different ways:
(a) Apply it directly to the sufficient statistic.
(b) Apply it prequentially to all observations.
5.1.1 Sufficient statistic
The sufficient statistic is , with distribution . So the score computed this way is simply obtained from (10) and (11) on replacing by and by . This gives
[TABLE]
where .
5.1.2 Prequential
Now suppose we have already observed $\mathbf{X}^{n-1}=\mathbf{x}^{n-1}$. The posterior distribution of $\theta$ is
[TABLE]
So the predictive distribution of $X_n$, given the previous observations $\mathbf{X}^{n-1}=\mathbf{x}^{n-1}$, is obtained from (10) and (11) on replacing with , with , and with . The incremental contribution to the prequential score is thus given by:
[TABLE]
with .
The total prequential score is obtained by summing this from to .
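The prequential recipe can be sketched in code. The example assumes a proper Gamma prior with shape `a` and rate `b` (our parametrisation), so that after observing values with sum `s` over `n` observations the posterior is Gamma with shape `a + s` and rate `b + n`, and the predictive ratio is $q(x+1)/q(x) = (a + s + x)/\{(x+1)(b + n + 1)\}$. It scores each predictive with the illustrative choice $G(u) = -u^2/2$, which is an assumption of this sketch and not necessarily the rule used in the paper.

```python
def poisson_ratio(shape, rate):
    """Return y -> q(y+1)/q(y) for the Gamma-Poisson predictive distribution."""
    return lambda y: (shape + y) / ((y + 1) * (rate + 1.0))


def ratio_score(x, ratio):
    """Illustrative homogeneous score with G(u) = -u^2/2 (an assumption)."""
    up = ratio(x)
    score = 0.5 * up * up          # G(u) - u G'(u) for the illustrative G
    if x > 0:                      # the first term is absent when x = 0
        score -= ratio(x - 1)      # G'(q(x)/q(x-1))
    return score


def prequential_poisson_score(xs, a=1.0, b=1.0):
    """Sum of one-step-ahead scores, updating the Gamma posterior as data arrive."""
    total, s = 0.0, 0
    for n, x in enumerate(xs):     # n earlier observations, with sum s
        total += ratio_score(x, poisson_ratio(a + s, b + n))
        s += x
    return total


print(prequential_poisson_score([2, 0, 3, 1]))
```

Each term uses only the data seen so far, so the cumulative score can be tracked as a running total while observations accumulate.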
5.2 Improper prior
The usual improper prior is the formal limit with . In this case (12) and (13) become:
[TABLE]
Note that the score is well-defined even when all observations are 0, in which case the posterior is improper.
For the prequential version, we obtain, from (14) and (15):
[TABLE]
An alternative improper prior is the Jeffreys prior, having , , which is easily handled similarly.
6 Negative Binomial model
Now we consider an alternative model, the Negative Binomial , having
[TABLE]
with conjugate prior :
[TABLE]
For propriety we require , .
The predictive density is
[TABLE]
Then
[TABLE]
and so we have:
[TABLE]
6.1 Multiple observations
Again, we can handle multiple observations either by restricting to the sufficient statistic, or by cumulating the prequential score.
6.1.1 Sufficient statistic
The sufficient statistic is , with distribution . So the score computed this way is simply obtained from (23) and (24) on replacing by and by . This gives
[TABLE]
6.1.2 Prequential
Now suppose we have already observed $\mathbf{X}^{n-1}=\mathbf{x}^{n-1}$. The posterior distribution of $\theta$ is
[TABLE]
So the predictive density of $X_n$, given the previous observations $\mathbf{X}^{n-1}=\mathbf{x}^{n-1}$, is obtained from (23) and (24) on replacing with , with , and with . The incremental contribution to the prequential score is thus given by:
[TABLE]
The total prequential score is obtained by summing this from to .
6.2 Improper prior
The usual improper prior is the formal limit with . In this case (25) and (26) become:
[TABLE]
The score is well-defined even when all observations are 0, in which case the posterior is improper.
For the prequential version, we obtain, from (27) and (28):
[TABLE]
The total prequential score is obtained by summing this from to .
Again, similar expressions can be found using the improper Jeffreys prior, which has , .
7 Simulations
We generated observations from either the Poisson distribution (7) with , , or the Negative Binomial distribution (20) with , . These both have variance , the former having mean , and the latter mean . We used, as the scoring rule, the special case of (5) having , namely
[TABLE]
For each generating distribution we computed the excess of the cumulative prequential score for the wrong model over that for the correct model. These differences are shown, as a function of increasing data, in Figures 1 and 2 respectively. Each figure displays 10 sample sequences generated from the indicated distribution, as well as the average taken over a sample of 100 sequences.
In each case we see a clear linear upward trend, supporting the expectation of consistent model selection, although even with 1000 observations there is a non-negligible probability of a negative value, giving a misleading preference for the wrong model.
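A simulation of this kind can be sketched as follows. Everything specific here is an assumption made for illustration: the score uses $G(u) = -u^2/2$, the priors are proper (Gamma(1, 1) for the Poisson mean, Beta(1, 1) for the Negative Binomial $\theta$), the Negative Binomial has known integer `r` with mass function $\binom{x+r-1}{x}\theta^x(1-\theta)^r$, and the generating values (Poisson mean 3; `r = 1`, $\theta = 0.75$) are our own, not the settings used in the paper.

```python
import math
import random


def ratio_score(x, ratio):
    """Illustrative homogeneous score with G(u) = -u^2/2 (an assumption)."""
    up = ratio(x)
    return 0.5 * up * up - (ratio(x - 1) if x > 0 else 0.0)


def poisson_ratio(shape, rate):
    """q(y+1)/q(y) for the Gamma-Poisson predictive."""
    return lambda y: (shape + y) / ((y + 1) * (rate + 1.0))


def negbin_ratio(r, alpha, beta):
    """q(y+1)/q(y) for the Beta-Negative-Binomial predictive with known r."""
    return lambda y: (y + r) / (y + 1) * (alpha + y) / (alpha + beta + r + y)


def excess_path(xs, r, truth, a=1.0, b=1.0):
    """Cumulative prequential score of the wrong model minus that of the true one."""
    s, diff, path = 0, 0.0, []
    for n, x in enumerate(xs):  # n earlier observations, with sum s
        pois = ratio_score(x, poisson_ratio(a + s, b + n))
        nb = ratio_score(x, negbin_ratio(r, a + s, b + n * r))
        diff += (nb - pois) if truth == "poisson" else (pois - nb)
        s += x
        path.append(diff)
    return path


def sample_poisson(rng, lam):
    """Knuth's multiplication method; adequate for small lam."""
    limit, k, p = math.exp(-lam), 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k


def sample_negbin(rng, r, theta):
    """Count successes (probability theta) before the r-th failure."""
    x = 0
    for _ in range(r):
        while rng.random() < theta:
            x += 1
    return x


def mean_final_excess(truth, n_obs=500, n_seq=20, seed=0):
    """Average final excess score over n_seq simulated sequences."""
    rng = random.Random(seed)
    finals = []
    for _ in range(n_seq):
        if truth == "poisson":
            xs = [sample_poisson(rng, 3.0) for _ in range(n_obs)]
        else:
            xs = [sample_negbin(rng, 1, 0.75) for _ in range(n_obs)]
        finals.append(excess_path(xs, 1, truth)[-1])
    return sum(finals) / n_seq


if __name__ == "__main__":
    print(mean_final_excess("poisson"), mean_final_excess("negbin"))
```

A positive excess means the true model has the smaller cumulative score. Plotting the output of `excess_path` for individual sequences gives curves of the same kind as those described for Figures 1 and 2, although the numbers will differ because the settings and the scoring rule here are our own.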
8 Conclusions
We have extended the Bayesian model selection methodology of Dawid and Musio (2015) to apply to problems with discrete data. We have conducted a simulation study to compare Poisson and Negative Binomial distributions. The results suggest that the method will consistently select the correct model as the number of data points increases.
Acknowledgements
Philip Dawid’s research was supported through an Emeritus Fellowship from the Leverhulme Trust.
References
- Dawid, A. P. (2011). Posterior model probabilities. In Philosophy of Statistics (ed. P. S. Bandyopadhyay and M. Forster), pp. 607–630. Elsevier, New York.
- Dawid, A. P., Lauritzen, S. L. and Parry, M. (2012). Proper local scoring rules on discrete sample spaces. Annals of Statistics 40, 593–608.
- Dawid, A. P. and Musio, M. (2015). Bayesian model selection based on proper scoring rules (with Discussion). Bayesian Analysis 10, 479–521.
- Hyvärinen, A. (2005). Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research 6, 695–709.
- Parry, M. F., Dawid, A. P. and Lauritzen, S. L. (2012). Proper local scoring rules. Annals of Statistics 40, 561–592.
