Bayesian Model Selection for Misspecified Models in Linear Regression

MB de Kock; HC Eggers

arXiv:1706.03343·stat.ME·December 15, 2017


TL;DR

This paper develops a unified Bayesian approach that combines the strengths of BIC and AIC for linear regression, enhancing robustness against model misspecification and low signal-to-noise scenarios.

Contribution

It introduces a novel prior in an augmented model-plus-noise space that unifies BIC and AIC assumptions, improving model selection under misspecification.

Findings

1. Unified prior inherits properties of BIC and AIC

2. Enhanced robustness to model misspecification

3. Applicable in low signal-to-noise ratio conditions

Abstract

While the Bayesian Information Criterion (BIC) and Akaike Information Criterion (AIC) are powerful tools for model selection in linear regression, they are built on different prior assumptions and thereby apply to different data generation scenarios. We show that in the finite-dimensional case their respective assumptions can be unified within an augmented model-plus-noise space and construct a prior in this space which inherits the beneficial properties of both AIC and BIC. This allows us to adapt the BIC to be robust against misspecified models where the signal-to-noise ratio is low.



Taxonomy

Topics: Statistical Methods and Inference · Blind Source Separation Techniques · Bayesian Methods and Mixture Models

Full text

**Bayesian Model Selection for Misspecified Models in Linear Regression**

M.B. de Kock¹ and H.C. Eggers¹,²

¹ *Institute of Theoretical Physics and Department of Physics, Stellenbosch University, P/Bag X1, 7602 Matieland, Stellenbosch, South Africa*

² *National Institute for Theoretical Physics, P/Bag X1, 7602 Matieland, South Africa*


1 Introduction

The selection of a model between multiple competing models is a well-established tool of data analysis in a wide spectrum of fields ranging from ecology to psychology [Burnham and Anderson, 2003, Vrieze, 2012, Aho et al., 2014]. If the true model is one of the candidates, then Bayesian methods are consistent in that they will select the true model with probability one as the sample size increases [Nishii et al., 1984]. On the other hand, if the underlying data-generating process is nonparametric, then estimating the underlying regression function requires a minimax-rate optimal rule [Shao, 1997]. This dichotomy is closely related to the competition between aic and bic, where bic represents the Bayesian methods and aic the loss-optimal rules. In general, it is thought to be impossible to combine the properties of both; see [Yang, 2005].

Our goal is not to address the difference between the parametric and nonparametric cases but to give a Bayesian construction for linear regression that is robust to model misspecification. Our approach is similar to that of [Müller, 2013], but rather than introducing an artificial posterior it augments the likelihood with a larger parameter space. This extends the idea of Akaike [Akaike, 1978], which appeared after Akaike [Akaike, 1974] and Schwarz [Schwarz, 1978] introduced their information criteria, aic and bic, respectively. Akaike, in the limited context of linear regression, considers the model selection problem in a parameter space expanded from the $K$-dimensional model space into a larger one and then shows how different Bayesian priors in this space give rise to the two different information criteria. This reveals that the bic implicitly includes a prior which fixes the extra non-model “noise parameters” to be exactly zero, while the aic allows them to vary to some degree around zero. This confirms our own experience that the bic fails completely as the noise-to-signal ratio approaches unity. Statistically speaking, the bic assumes that one of the candidate models is the true model, while the aic does not. It is this property of the aic that we wish to extend to the bic.

Armed with this insight, we shall construct priors not only for the $K$ model parameters, but for an additional set of $L=N-K$ noise parameters in the spirit of Akaike. Unlike their bic and aic predecessors, however, these new priors will take into account the crucial information that the modes of signal parameter priors should be located some distance away from the origin of parameter space. This allows us to adapt the Bayesian Information Criterion so that it does not assume that one of the candidate models is true.

The paper is organised as follows. In Section 2, we review the framework of Bayesian model comparison and linear regression, augmenting in Section 3 the model space by a noise space in preparation for the new priors. Reconsideration in Section 4 of the priors underlying the bic and aic forms the basis and motivation for the construction of spherically symmetric priors for both model and noise parameters in Section 5. The results are tested numerically and compared to the traditional information criteria in Section 6, followed by a brief summary and discussion in Section 7.

2 Bayesian linear regression

2.1 Bayesian model selection

Given data $\mathcal{D}$ , Bayesian model selection is based on the evidence or marginal likelihood for the model ${\mathcal{H}}_{\scriptscriptstyle K}$ which has $K$ parameters ${\bm{\alpha}}_{\scriptscriptstyle K}=(\alpha_{1},\ldots,\alpha_{\scriptscriptstyle K})$ . The evidence is an average over the likelihood $p(\mathcal{D}\,|\,{\bm{\alpha}}_{\scriptscriptstyle K},{\mathcal{H}}_{\scriptscriptstyle K})$ weighted by the parameter prior $p({\bm{\alpha}}_{\scriptscriptstyle K}|{\mathcal{H}}_{\scriptscriptstyle K})$ ,

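$$p(\mathcal{D}\,|\,{\mathcal{H}}_{\scriptscriptstyle K})=\int\mathrm{d}{\bm{\alpha}}_{\scriptscriptstyle K}\;p(\mathcal{D}\,|\,{\bm{\alpha}}_{\scriptscriptstyle K},{\mathcal{H}}_{\scriptscriptstyle K})\,p({\bm{\alpha}}_{\scriptscriptstyle K}|{\mathcal{H}}_{\scriptscriptstyle K}),$$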

where $p({\bm{\alpha}}_{\scriptscriptstyle K}|{\mathcal{H}}_{\scriptscriptstyle K})$ may contain hyperparameters as necessary. Bayes’ theorem used twice for competing models ${\mathcal{H}}_{\scriptscriptstyle K}$ , ${\mathcal{H}}_{{\scriptscriptstyle K}^{\prime}}$ relates the ratio of model posteriors $p({\mathcal{H}}_{\scriptscriptstyle K}\,|\,\mathcal{D})$ to the corresponding model evidences by

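$$\frac{p({\mathcal{H}}_{\scriptscriptstyle K}\,|\,\mathcal{D})}{p({\mathcal{H}}_{{\scriptscriptstyle K}^{\prime}}\,|\,\mathcal{D})}=\frac{p(\mathcal{D}\,|\,{\mathcal{H}}_{\scriptscriptstyle K})}{p(\mathcal{D}\,|\,{\mathcal{H}}_{{\scriptscriptstyle K}^{\prime}})}\;\frac{p({\mathcal{H}}_{\scriptscriptstyle K})}{p({\mathcal{H}}_{{\scriptscriptstyle K}^{\prime}})}.$$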

Barring good reasons to deviate from the Principle of Indifference, model priors would normally be set equal, $p({\mathcal{H}}_{\scriptscriptstyle K})=p({\mathcal{H}}_{{\scriptscriptstyle K}^{\prime}})=\tfrac{1}{2}$ , in which case the posterior odds equal the Bayes Factor [Kass and Raftery, 1995]

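$$\mathrm{BF}[K,K^{\prime}]=\frac{p(\mathcal{D}\,|\,{\mathcal{H}}_{\scriptscriptstyle K})}{p(\mathcal{D}\,|\,{\mathcal{H}}_{{\scriptscriptstyle K}^{\prime}})}.$$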

When more than two models are to be compared, it is convenient to define a reference model against which all others are measured. In this paper, we use as reference model ${\mathcal{H}}_{\scriptscriptstyle N}$ , the case where $N$ data points are modelled by $K=N$ free parameters, implying of course an exact fit and no noise. Among the $N$ competing models with $K=1,2,\ldots N$ parameters, the best model is the one with maximal Bayes Factor or equivalently minimum information criterion $\mathrm{IC}=-2\log\mathrm{BF}$ .
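The selection rule just described can be sketched in a few lines. This is only an illustration: the log-evidence values are made-up numbers, and the function names `information_criteria` and `posterior_odds` are hypothetical helpers, not part of the paper.

```python
import math

def information_criteria(log_evidence, log_evidence_ref):
    """Return IC_K = -2 log BF[K, N] for each candidate K, where the
    Bayes Factor is taken relative to the reference model H_N."""
    return {K: -2.0 * (log_p - log_evidence_ref)
            for K, log_p in log_evidence.items()}

def posterior_odds(log_ev_a, log_ev_b, prior_a=0.5, prior_b=0.5):
    """Posterior odds of model a over model b: evidence ratio times prior ratio."""
    return math.exp(log_ev_a - log_ev_b) * prior_a / prior_b

# Hypothetical log-evidences log p(D | H_K) for K = 1..4 (illustrative only).
log_evidence = {1: -60.2, 2: -48.7, 3: -49.9, 4: -51.3}
log_evidence_ref = -70.0  # hypothetical log-evidence of the reference model H_N

ic = information_criteria(log_evidence, log_evidence_ref)
best_K = min(ic, key=ic.get)  # minimum IC is the same as maximum Bayes Factor
odds_2_vs_3 = posterior_odds(log_evidence[2], log_evidence[3])
print(best_K, round(odds_2_vs_3, 3))
```

With equal model priors the posterior odds reduce to the Bayes Factor, so ranking by minimum IC and ranking by maximum evidence give the same answer.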

2.2 Linear Regression

We briefly review the canonical formalism for linear regression and introduce the language and notation to be used in later sections. By assumption the data $\mathcal{D}$ comes in the form of $N$ data points $y_{n}\in{\bm{y}}=(y_{1},\ldots,y_{\scriptscriptstyle N})$ measured at locations $x_{n}\in{\bm{x}}=(x_{1},\ldots,x_{\scriptscriptstyle N})$ with fixed experimental uncertainties $\sigma_{n}\in{\bm{\sigma}}$ . The immediate aim is to find joint distributions (posterior, evidence etc) of the $K$ coefficients $\alpha_{k}\in{\bm{\alpha}}_{\scriptscriptstyle K}=(\alpha_{1},\ldots,\alpha_{\scriptscriptstyle K})$ of a linear model function

$$y(x\,|\,{\bm{\alpha}}_{\scriptscriptstyle K},K)=\sum_{k=1}^{K}\alpha_{k}\,f_{k}(x),$$

where the choice of basis functions $f_{k}(x),k=1,\ldots,K$ forms part of the model specification and we subscript model-dependent quantities by $K$ in preparation for the extensions of Section 3. By assumption, the differences $\varepsilon_{n}\equiv y_{n}-y(x_{n}\,|\,{\bm{\alpha}}_{\scriptscriptstyle K},K)$ between each data point and the corresponding model point are normally distributed,

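$$p(\varepsilon_{n}\,|\,\sigma_{n})=\frac{1}{\sqrt{2\pi}\,\sigma_{n}}\exp\left[-\frac{\varepsilon_{n}^{2}}{2\sigma_{n}^{2}}\right],$$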

resulting in a joint likelihood

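$$L[{\bm{\alpha}}_{\scriptscriptstyle K}]=p({\bm{y}}\,|\,{\bm{\alpha}}_{\scriptscriptstyle K},K)=\prod_{n=1}^{N}\frac{1}{\sqrt{2\pi}\,\sigma_{n}}\exp\left[-\frac{\left(y_{n}-y(x_{n}\,|\,{\bm{\alpha}}_{\scriptscriptstyle K},K)\right)^{2}}{2\sigma_{n}^{2}}\right].$$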

The $K$ -dimensional model space ${\mathcal{A}}_{\scriptscriptstyle K}$ is spanned by $N$ -dimensional basis vectors

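$${\bm{v}}_{k}^{\scriptscriptstyle\sf T}=\left[f_{k}(x_{1})/\sigma_{1},\ldots,f_{k}(x_{\scriptscriptstyle N})/\sigma_{\scriptscriptstyle N}\right],\qquad k=1,\ldots,K,$$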

which together constitute the $(N{\times}K)$ -dimensioned design matrix $\mathbb{X}_{\scriptscriptstyle K}=[{\bm{v}}_{1},\ldots,{\bm{v}}_{\scriptscriptstyle K}]$ . In terms of the standardised data vector ${\bm{z}}=[y_{1}/\sigma_{1},\ldots,y_{\scriptscriptstyle N}/\sigma_{\scriptscriptstyle N}]$ and collecting constants into $C=(2\pi)^{-N/2}[\textstyle\prod_{n}\sigma_{n}]^{-1}$ , the likelihood can be written in three ways,

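$$L[{\bm{\alpha}}_{\scriptscriptstyle K}]=C\exp\left[-\tfrac{1}{2}({\bm{z}}-\mathbb{X}_{\scriptscriptstyle K}{\bm{\alpha}}_{\scriptscriptstyle K})^{\scriptscriptstyle\sf T}({\bm{z}}-\mathbb{X}_{\scriptscriptstyle K}{\bm{\alpha}}_{\scriptscriptstyle K})\right]=C\exp\left[-\tfrac{1}{2}({\bm{z}}-{\bm{f}}_{\scriptscriptstyle K})^{\scriptscriptstyle\sf T}({\bm{z}}-{\bm{f}}_{\scriptscriptstyle K})\right]=C\exp\left[-\tfrac{1}{2}{\bm{\chi}}^{\scriptscriptstyle\sf T}{\bm{\chi}}\right],$$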

where ${\bm{f}}_{\scriptscriptstyle K}=\mathbb{X}_{\scriptscriptstyle K}{\bm{\alpha}}_{\scriptscriptstyle K}=\sum_{k=1}^{K}{\bm{v}}_{k}\alpha_{k}$ is the model-dependent vector aspiring to approximate the data vector ${\bm{z}}$ and ${\bm{\chi}}={\bm{z}}-{\bm{f}}_{\scriptscriptstyle K}$ is the discrepancy between data and model. Description of the data ${\bm{z}}$ is thereby decomposed into a “noise” component ${\bm{\chi}}$ and a “signal” component ${\bm{f}}_{\scriptscriptstyle K}$. As illustrated in Figure 1, ${\bm{\chi}}$ and ${\bm{f}}_{\scriptscriptstyle K}$ are in general not orthogonal. The length of the minimum-chisquared vector $\hat{{\bm{\chi}}}$ represents the minimum distance between the data vector ${\bm{z}}$ and model space ${\mathcal{A}}_{\scriptscriptstyle K}$, so that $\hat{{\bm{\chi}}}$ is orthogonal to model space, ${\bm{f}}_{\scriptscriptstyle K}^{\scriptscriptstyle\sf T}\hat{{\bm{\chi}}}=0$ for all ${\bm{f}}_{\scriptscriptstyle K}$ in ${\mathcal{A}}_{\scriptscriptstyle K}$. The resulting maximum-signal vector can be found directly from

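$$\hat{{\bm{f}}}_{\scriptscriptstyle K}=\mathbb{X}_{\scriptscriptstyle K}\bm{\hat{\alpha}}_{\scriptscriptstyle K},$$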

where the maximum-likelihood parameter vector is determined by the usual Moore-Penrose inverse

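$$\bm{\hat{\alpha}}_{\scriptscriptstyle K}=\mathbb{H}_{\scriptscriptstyle K}^{-1}\mathbb{X}_{\scriptscriptstyle K}^{\scriptscriptstyle\sf T}{\bm{z}},$$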

with $\mathbb{H}_{\scriptscriptstyle K}=\mathbb{X}_{\scriptscriptstyle K}^{\scriptscriptstyle\sf T}\mathbb{X}_{\scriptscriptstyle K}$ the Hessian with elements $(\mathbb{H})_{kk^{\prime}}={\bm{v}}_{k}^{\scriptscriptstyle\sf T}{\bm{v}}_{k^{\prime}}=\sum_{n}f_{k}(x_{n})f_{k^{\prime}}(x_{n})/\sigma_{n}^{2}$ . Since ${\bm{z}}=\hat{{\bm{\chi}}}+\hat{{\bm{f}}}_{\scriptscriptstyle K}$ , the squared data vector $z^{2}={\bm{z}}^{\scriptscriptstyle\sf T}{\bm{z}}$ can hence be written as the Pythagorean sum of the usual minimum chisquared $\chi^{2}=\hat{{\bm{\chi}}}^{\scriptscriptstyle\sf T}\hat{{\bm{\chi}}}$ and the squared signal vector $F_{\scriptscriptstyle K}^{2}=\hat{{\bm{f}}}_{\scriptscriptstyle K}^{\scriptscriptstyle\sf T}\hat{{\bm{f}}}_{\scriptscriptstyle K}$ ,

$$z^{2}=\chi^{2}+F_{\scriptscriptstyle K}^{2}.$$
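As a quick numerical check of this decomposition, the following sketch fits a small cosine basis through the normal equations and verifies that $z^{2}=\chi^{2}+F_{\scriptscriptstyle K}^{2}$. The basis $f_{k}(x)=\cos(kx)$, the evenly spaced grid, the synthetic data and all variable names are illustrative assumptions rather than the setup used elsewhere in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 32, 4
x = np.linspace(0.0, np.pi, N)   # assumed sampling grid
sigma = np.ones(N)               # fixed experimental uncertainties
y = 2.0 * np.cos(x) - 1.0 * np.cos(2 * x) + rng.normal(0.0, sigma)

# Design matrix: columns are v_k = f_k(x_n) / sigma_n (assumed cosine basis).
X = np.column_stack([np.cos(k * x) / sigma for k in range(K)])
z = y / sigma                    # standardised data vector

H = X.T @ X                                  # Hessian H_K = X^T X
alpha_hat = np.linalg.solve(H, X.T @ z)      # maximum-likelihood coefficients
f_hat = X @ alpha_hat                        # maximum-signal vector
chi_hat = z - f_hat                          # minimum-chisquared vector

chi2 = chi_hat @ chi_hat
F2 = f_hat @ f_hat
z2 = z @ z
print(z2, chi2 + F2)             # the two numbers agree to rounding error
print(np.abs(X.T @ chi_hat).max())  # residual is orthogonal to model space
```

Solving the normal equations with `np.linalg.solve` avoids forming the inverse of the Hessian explicitly; for ill-conditioned bases a least-squares routine would be the safer choice.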

Following the usual diagonalisation by orthonormal eigenvector matrix $\mathbb{S}_{\scriptscriptstyle K}$ and rescaling by diagonal eigenvalue matrix $\mathbb{L}_{\scriptscriptstyle K}$ and transforming to hyperspherical parameters ${\bm{\beta}}_{\scriptscriptstyle K}=\mathbb{L}_{\scriptscriptstyle K}^{\!\!1/2}\mathbb{S}_{\scriptscriptstyle K}^{\scriptscriptstyle\sf T}{\bm{\alpha}}_{\scriptscriptstyle K}$ , and corresponding modes ${\bm{\hat{\beta}}}_{\scriptscriptstyle K}=\mathbb{L}_{\scriptscriptstyle K}^{\!\!1/2}\mathbb{S}_{\scriptscriptstyle K}^{\scriptscriptstyle\sf T}\bm{\hat{\alpha}}_{\scriptscriptstyle K}$ , the likelihood becomes

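$$L[{\bm{\beta}}_{\scriptscriptstyle K}]=C\exp\left[-\tfrac{1}{2}\chi^{2}-\tfrac{1}{2}({\bm{\beta}}_{\scriptscriptstyle K}-{\bm{\hat{\beta}}}_{\scriptscriptstyle K})^{\scriptscriptstyle\sf T}({\bm{\beta}}_{\scriptscriptstyle K}-{\bm{\hat{\beta}}}_{\scriptscriptstyle K})\right]$$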

and the squared signal vector transforms to

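$$F_{\scriptscriptstyle K}^{2}=\hat{{\bm{f}}}_{\scriptscriptstyle K}^{\scriptscriptstyle\sf T}\hat{{\bm{f}}}_{\scriptscriptstyle K}={\bm{\hat{\beta}}}_{\scriptscriptstyle K}^{\scriptscriptstyle\sf T}{\bm{\hat{\beta}}}_{\scriptscriptstyle K}=\sum_{k=1}^{K}\hat{\beta}_{k}^{2}.$$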

All of the above is the standard fare of linear regression.
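The change of variables from ${\bm{\alpha}}_{\scriptscriptstyle K}$ to ${\bm{\beta}}_{\scriptscriptstyle K}$ can be illustrated numerically. The sketch below assumes a deliberately non-orthogonal monomial basis, $\sigma_{n}=1$ and synthetic data (all illustrative choices), and checks that the quadratic form in the likelihood becomes a plain Euclidean distance in ${\bm{\beta}}$ coordinates and that $F_{\scriptscriptstyle K}^{2}={\bm{\hat{\beta}}}_{\scriptscriptstyle K}^{\scriptscriptstyle\sf T}{\bm{\hat{\beta}}}_{\scriptscriptstyle K}$.

```python
import numpy as np

rng = np.random.default_rng(1)
N, K = 32, 3
x = np.linspace(0.0, np.pi, N)
X = np.column_stack([x**k for k in range(K)])   # assumed monomial basis, sigma_n = 1
z = 1.0 + 0.5 * x + rng.normal(size=N)          # synthetic standardised data

H = X.T @ X
alpha_hat = np.linalg.solve(H, X.T @ z)

# Diagonalise H = S diag(lam) S^T and rescale: beta = diag(lam)^{1/2} S^T alpha.
lam, S = np.linalg.eigh(H)
T = np.diag(np.sqrt(lam)) @ S.T
beta_hat = T @ alpha_hat

# Squared signal vector in both coordinate systems.
f_hat = X @ alpha_hat
F2 = f_hat @ f_hat
F2_beta = beta_hat @ beta_hat

# Quadratic form of the likelihood exponent for an arbitrary parameter vector.
alpha = rng.normal(size=K)
d_alpha = alpha - alpha_hat
d_beta = T @ alpha - beta_hat
quad_alpha = d_alpha @ H @ d_alpha
quad_beta = d_beta @ d_beta
print(F2, F2_beta, quad_alpha, quad_beta)
```

`np.linalg.eigh` is used because the Hessian is symmetric; it returns orthonormal eigenvectors, which is what the transformation requires.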

3 Model space and noise space

The expanded model-noise space introduced in this Section is best understood in the context of Akaike’s rederivation in a Bayesian framework of the bic and aic in [Akaike, 1978]. His central message was that both could be understood by introducing, over and above the $K$ parameters ${\bm{\beta}}_{\scriptscriptstyle K}=(\beta_{1},\ldots,\beta_{\scriptscriptstyle K})$ making up the model, an additional set of parameters ${\bm{\beta}}_{\scriptscriptstyle L}=(\beta_{{\scriptscriptstyle K}+1},\ldots,\beta_{\scriptscriptstyle M})$ with $K<M\leq N$ , for which particular choices of priors yield the bic and aic. Details of the derivation are postponed to Section 4.

The introduction of additional parameters in Akaike’s derivations allows the method to account for misspecified model functions: if some signal is left over in the residuals, the noise parameters are able to fit it. The difference between a model ${\mathcal{H}}_{\scriptscriptstyle K}$ with $K$ parameters and another model ${\mathcal{H}}_{{\scriptscriptstyle K}+1}$ with $K{+}1$ parameters must then be found not in the existence or nonexistence of an additional parameter $\beta_{{\scriptscriptstyle K}+1}$ but in different priors $p(\beta_{{\scriptscriptstyle K}+1}|{\mathcal{H}}_{\scriptscriptstyle K})\neq p(\beta_{{\scriptscriptstyle K}+1}|{\mathcal{H}}_{{\scriptscriptstyle K}+1})$. In this view, all model parameters ${\bm{\beta}}_{\scriptscriptstyle K}$ should be assigned priors which allow them to exhibit large deviations from zero, while the additional noise parameters ${\bm{\beta}}_{\scriptscriptstyle L}$ should be assigned priors which do not fix them to exactly zero but restrict them to small intervals around the origin.

Taking this line of thought to its logical conclusion, we let $M\equiv N$ and introduce $L=N-K$ additional noise parameters $\beta_{\ell}\in{\bm{\beta}}_{\scriptscriptstyle L}=(\beta_{{\scriptscriptstyle K}+1},\ldots,\beta_{\scriptscriptstyle N})$ along with $L$ additional basis functions $\{f_{\ell}(x)\}_{\ell=K{+}1}^{N}$ spanning what we shall call the noise space ${\mathcal{A}}_{\scriptscriptstyle L}$ . While the mathematics does not preclude overlap, it seems natural to demand that model space (also called signal space) ${\mathcal{A}}_{\scriptscriptstyle K}$ and noise space ${\mathcal{A}}_{\scriptscriptstyle L}$ partition the data space,

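$${\mathcal{A}}_{\scriptscriptstyle K}\cup{\mathcal{A}}_{\scriptscriptstyle L}={\mathcal{A}}_{\scriptscriptstyle N},\qquad{\mathcal{A}}_{\scriptscriptstyle K}\cap{\mathcal{A}}_{\scriptscriptstyle L}=\emptyset.$$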

In this view, model construction is seen as a successive decomposition of the data space ${\mathcal{A}}_{\scriptscriptstyle N}$ into sequences of partitions $\{{\mathcal{A}}_{\scriptscriptstyle K},{\mathcal{A}}_{\scriptscriptstyle L}\}_{K=1}^{N}$ with progressively increasing $K$ and decreasing $L$ , with model selection based on the maximum evidence or Bayes Factor as a function of $K$ . The partitioning property can be enforced by constructing, if necessary by a Gram-Schmidt procedure, a set of noise functions $f_{\ell}(x)$ which are orthogonal to all model functions $f_{k}(x)$ ,

$$\sum_{n=1}^{N}\frac{f_{k}(x_{n})\,f_{\ell}(x_{n})}{\sigma_{n}^{2}}=0\qquad\text{for all }k\leq K<\ell,$$

thereby ensuring that the Hessian of the complete basis set $\{f_{k}\}_{k=1}^{N}$ is block-diagonal, $\mathbb{H}_{\scriptscriptstyle N}=\mathbb{H}_{\scriptscriptstyle K}{\oplus}\mathbb{H}_{\scriptscriptstyle L}$. (Footnote 1: The simplest way to ensure block-diagonality is to construct a complete orthogonal basis for all $K{=}N$ functions $f_{k}$, which would trivially fulfil these requirements. The block-diagonal form is, however, more widely applicable.) We note that the basis functions $f_{\ell}(x)$ may have to be adapted as $K$ changes to safeguard block-diagonality. The resulting sequence of models is therefore not nested in the strict sense.
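One way to picture the construction of the noise space is the following sketch, which works directly with the sampled basis vectors ${\bm{v}}_{k}$ rather than with functions $f_{\ell}(x)$. As an assumption made for numerical convenience, it uses a complete QR decomposition in place of an explicit Gram-Schmidt procedure; the cosine model basis and the variable names are also illustrative.

```python
import numpy as np

N, K = 32, 4
L = N - K
x = np.linspace(0.0, np.pi, N)
sigma = np.ones(N)

# Model-space design matrix X_K (assumed cosine basis, not orthonormalised).
X_K = np.column_stack([np.cos(k * x) / sigma for k in range(K)])

# Complete QR: the last N-K columns of Q span the orthogonal complement of
# the column space of X_K, and serve here as noise-space basis vectors v_l.
Q, _ = np.linalg.qr(X_K, mode="complete")
X_L = Q[:, K:]

# Hessian of the full basis set [X_K, X_L].
X_N = np.hstack([X_K, X_L])
H_N = X_N.T @ X_N
off_block = H_N[:K, K:]          # couples model and noise parameters

print(X_L.shape)                 # (N, L)
print(np.abs(off_block).max())   # zero to rounding error: H_N = H_K (+) H_L
```

Because the noise vectors are built from the column space of `X_K`, they have to be recomputed whenever $K$ changes, which mirrors the remark that the model sequence is not nested in the strict sense.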

As already mentioned in Section 2.2, ${\bm{f}}_{\scriptscriptstyle K}^{\scriptscriptstyle\sf T}\hat{{\bm{\chi}}}=0$ for all ${\bm{f}}_{\scriptscriptstyle K}$ because, with the help of Eq. (11), ${\bm{\alpha}}_{\scriptscriptstyle K}^{\scriptscriptstyle\sf T}\mathbb{X}_{\scriptscriptstyle K}^{\scriptscriptstyle\sf T}({\bm{z}}-\mathbb{X}_{\scriptscriptstyle K}\bm{\hat{\alpha}}_{\scriptscriptstyle K})={\bm{\alpha}}_{\scriptscriptstyle K}^{\scriptscriptstyle\sf T}\mathbb{X}_{\scriptscriptstyle K}^{\scriptscriptstyle\sf T}({\bm{z}}-\mathbb{X}_{\scriptscriptstyle K}\mathbb{H}_{\scriptscriptstyle K}^{-1}\mathbb{X}_{\scriptscriptstyle K}^{\scriptscriptstyle\sf T}{\bm{z}})=0$. Together with ${\mathcal{A}}_{\scriptscriptstyle K}\cap{\mathcal{A}}_{\scriptscriptstyle L}=\emptyset$, this means that $\hat{{\bm{\chi}}}$ is a vector in ${\mathcal{A}}_{\scriptscriptstyle L}$ and hence has a representation in the noise-space basis ${\bm{v}}_{\ell}=[f_{\ell}(x_{1})/\sigma_{1},\ldots,f_{\ell}(x_{\scriptscriptstyle N})/\sigma_{\scriptscriptstyle N}]$ with coefficients $\bm{\hat{\alpha}}_{\scriptscriptstyle L}=(\hat{\alpha}_{{\scriptscriptstyle K}+1},\ldots,\hat{\alpha}_{\scriptscriptstyle N})$ or, equivalently, in terms of the noise-space design matrix $\mathbb{X}_{\scriptscriptstyle L}=[{\bm{v}}_{{\scriptscriptstyle K}{+}1},\ldots,{\bm{v}}_{\scriptscriptstyle N}]$,

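$$\hat{{\bm{\chi}}}=\mathbb{X}_{\scriptscriptstyle L}\bm{\hat{\alpha}}_{\scriptscriptstyle L}=\sum_{\ell=K+1}^{N}{\bm{v}}_{\ell}\,\hat{\alpha}_{\ell}.$$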

Diagonalisation by $\mathbb{S}_{\scriptscriptstyle L}$ and rescaling by $\mathbb{L}_{\scriptscriptstyle L}$ in noise space results in $\hat{{\bm{\chi}}}=\mathbb{X}_{\scriptscriptstyle L}\mathbb{S}_{\scriptscriptstyle L}\mathbb{L}_{\scriptscriptstyle L}^{-1/2}{\bm{\hat{\beta}}}_{\scriptscriptstyle L}$ with ${\bm{\hat{\beta}}}_{\scriptscriptstyle L}=\mathbb{L}_{\scriptscriptstyle L}^{1/2}\mathbb{S}_{\scriptscriptstyle L}^{\scriptscriptstyle\sf T}\bm{\hat{\alpha}}_{\scriptscriptstyle L}$ just as in model space, so that, in close analogy with Eq. (15),

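$$\chi^{2}=\hat{{\bm{\chi}}}^{\scriptscriptstyle\sf T}\hat{{\bm{\chi}}}={\bm{\hat{\beta}}}_{\scriptscriptstyle L}^{\scriptscriptstyle\sf T}{\bm{\hat{\beta}}}_{\scriptscriptstyle L}=\sum_{\ell=K+1}^{N}\hat{\beta}_{\ell}^{2}.$$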

Apart from the requirement that the basis functions $f_{\ell}(x)$ must be orthogonal to those in model space, their specific choice is arbitrary; correspondingly, individual coefficients ${\bm{\alpha}}_{\scriptscriptstyle L}$ , ${\bm{\beta}}_{\scriptscriptstyle L}$ and their maximum-likelihood cases $\bm{\hat{\alpha}}_{\scriptscriptstyle L}$ and ${\bm{\hat{\beta}}}_{\scriptscriptstyle L}$ are not fixed by the model. All that matters is that $\chi^{2}$ can be written as a sum of $L=N{-}K$ squared components $\hat{\beta}_{\ell}$ .
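The arbitrariness of the noise basis can be made concrete with a small experiment. In this sketch (synthetic data, an assumed cosine model basis, and an orthonormal noise basis so that ${\bm{\beta}}_{\scriptscriptstyle L}={\bm{\alpha}}_{\scriptscriptstyle L}$; all names are illustrative), two different noise bases give different individual coefficients $\hat{\beta}_{\ell}$ but the same sum of squares, equal to $\chi^{2}$.

```python
import numpy as np

rng = np.random.default_rng(2)
N, K = 32, 4
L = N - K
x = np.linspace(0.0, np.pi, N)
X_K = np.column_stack([np.cos(k * x) for k in range(K)])   # sigma_n = 1 assumed
z = X_K @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(size=N)

# Minimum-chisquared residual from an ordinary least-squares fit.
alpha_hat = np.linalg.lstsq(X_K, z, rcond=None)[0]
chi_hat = z - X_K @ alpha_hat
chi2 = chi_hat @ chi_hat

# First noise basis: orthonormal complement of the model space.
Q, _ = np.linalg.qr(X_K, mode="complete")
X_L = Q[:, K:]
beta_hat_L = X_L.T @ chi_hat

# Second noise basis: the first one rotated by a random orthogonal matrix.
rot, _ = np.linalg.qr(rng.normal(size=(L, L)))
X_L2 = X_L @ rot
beta_hat_L2 = X_L2.T @ chi_hat

print(chi2, beta_hat_L @ beta_hat_L, beta_hat_L2 @ beta_hat_L2)
```

Only the noise radius survives the change of basis, which is the property the radial priors of Section 5 rely on.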

The extension from ${\mathcal{A}}_{\scriptscriptstyle K}$ to the larger space ${\mathcal{A}}_{\scriptscriptstyle K}\cup{\mathcal{A}}_{\scriptscriptstyle L}$ has an important consequence for the likelihood and evidence. Rather than using the conventional $K$ -dimensional version which would result from Eq. (13),

[TABLE]

the limited model $\sum_{k=1}^{K}f_{k}(x)\alpha_{k}$ in the likelihood of Eq. (6) is replaced with the full set,

[TABLE]

which for block-diagonal $\mathbb{H}_{\scriptscriptstyle N}=\mathbb{H}_{\scriptscriptstyle K}\oplus\mathbb{H}_{\scriptscriptstyle L}$ takes the form

[TABLE]

for which the evidence factorises into noise and signal parts,

[TABLE]

We briefly consider the case $K{=}N,L{=}0$ as this constitutes our reference model for Bayes Factors. The design matrix $\mathbb{X}_{\scriptscriptstyle N}$ is invertible so that $\bm{\hat{\alpha}}_{\scriptscriptstyle N}=\mathbb{X}_{\scriptscriptstyle N}^{-1}{\bm{z}}$ , all the data becomes signal ( ${\bm{z}}=\hat{{\bm{f}}}_{\scriptscriptstyle K=N}$ , $z^{2}=F_{\scriptscriptstyle N}^{2}$ ), there is no noise ( $\hat{{\bm{\chi}}}_{\scriptscriptstyle{L}=0}=0$ ), and the model amounts to a change of basis for ${\mathcal{A}}_{\scriptscriptstyle N}$ from ${\bm{v}}_{\scriptscriptstyle N}$ to ${\bm{\alpha}}_{\scriptscriptstyle N}$ .

4 Akaike’s BIC/AIC priors in model and noise space

We now rederive Akaike’s central insight in the language of model and noise space. The bic results when the priors for the model parameters $\beta_{k},k=1,\ldots,K$ are normally distributed with variance $N$ (Footnote 2: the reasoning behind setting the model prior variances to $N$ is that the exponent of the likelihood $L[{\bm{\alpha}}_{\scriptscriptstyle K},{\bm{\alpha}}_{\scriptscriptstyle L}]$ scales roughly with $N$ as long as parameter-parameter correlations do not dominate, so that ${\bm{\beta}}\approx{\bm{\alpha}}/\sqrt{N}$ and ${\bm{\beta}}^{\scriptscriptstyle\sf T}{\bm{\beta}}\approx 1/N$), while the additional parameters $\beta_{\ell},\ell=K{+}1,\ldots,N$ are set to zero exactly by means of Dirac delta functions,

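$$p(\beta_{k}\,|\,{\mathcal{H}}_{\textsc{bic}})=\frac{1}{\sqrt{2\pi N}}\,e^{-\beta_{k}^{2}/2N},\quad k=1,\ldots,K;\qquad p(\beta_{\ell}\,|\,{\mathcal{H}}_{\textsc{bic}})=\delta(\beta_{\ell}),\quad\ell=K{+}1,\ldots,N.$$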

The evidence (3) and corresponding $K{=}N$ reference evidence are then

[TABLE]

yielding a Bayes Factor

[TABLE]

Dropping $K$ -independent constants and assuming $N\gg 1$ , we recover the BIC from the logarithm

[TABLE]

since $\chi^{2}\propto-2\log$ (maximum likelihood). In rederiving the aic, [Akaike, 1978] similarly suggested that model and noise parameters be treated on the same basis but with different scales for their priors,

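$$p(\beta_{k}\,|\,{\mathcal{H}}_{\textsc{aic}})=\frac{1}{\sqrt{2\pi\Delta^{2}}}\,e^{-\beta_{k}^{2}/2\Delta^{2}},\quad k=1,\ldots,K;\qquad p(\beta_{\ell}\,|\,{\mathcal{H}}_{\textsc{aic}})=\frac{1}{\sqrt{2\pi\delta^{2}}}\,e^{-\beta_{\ell}^{2}/2\delta^{2}},\quad\ell=K{+}1,\ldots,N,$$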

resulting in evidence, reference evidence and Bayes Factor

[TABLE]

and information criterion

[TABLE]

At this point, Akaike argued that $\delta$ and $\Delta$ should approach 1 from below and above, respectively, where 1 represents the situation of equal signal and noise magnitude. This is the critical case where it is difficult to distinguish between model and noise and the log Bayes Factor will tend to zero. To find the next-to-leading-order behaviour of the Bayes Factor we follow Akaike in taking the limit $\delta\rightarrow 1^{-}$ and $\Delta\rightarrow 1^{+}$, and recover the aic,

[TABLE]
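The two limits discussed in this section can be checked numerically. The sketch below rests on assumptions that should be read with care: zero-mean Gaussian priors with variance $\Delta^{2}$ on each model parameter and $\delta^{2}$ on each noise parameter (a Dirac delta for $\delta=0$), and a likelihood of the form $C\exp[-\tfrac{1}{2}\sum_{i}(\beta_{i}-\hat{\beta}_{i})^{2}]$ in the full $N$-dimensional space. Under these assumptions each Gaussian integral contributes a factor $(1+s^{2})^{-1/2}\exp[-\hat{\beta}^{2}/2(1+s^{2})]$ for prior variance $s^{2}$, and the resulting closed form for $-2\log\mathrm{BF}[K,N]$ in the code is our own working, not a formula quoted from the text. The function names and the numerical values are hypothetical.

```python
import math

def neg2_log_bayes_factor(chi2, K, N, Delta2, delta2):
    """-2 log BF[K, N] for Gaussian priors with variance Delta2 (model
    parameters) and delta2 (noise parameters); z^2 cancels in the ratio."""
    L = N - K
    return (L * (math.log1p(delta2) - math.log1p(Delta2))
            + chi2 * (1.0 / (1.0 + delta2) - 1.0 / (1.0 + Delta2)))

def bic(chi2, K, N):
    return chi2 + K * math.log(N)

def aic(chi2, K):
    return chi2 + 2 * K

# bic limit: Delta^2 = N, delta = 0. Compare differences between two models,
# since K-independent constants drop out. chi2 values are hypothetical.
N_big = 10_000
chi2_by_K = {3: 10_050.0, 4: 10_020.0}
exact_diff = (neg2_log_bayes_factor(chi2_by_K[4], 4, N_big, N_big, 0.0)
              - neg2_log_bayes_factor(chi2_by_K[3], 3, N_big, N_big, 0.0))
bic_diff = bic(chi2_by_K[4], 4, N_big) - bic(chi2_by_K[3], 3, N_big)

# aic limit: Delta^2 = 1 + eps, delta^2 = 1 - eps with eps -> 0. The leading
# term vanishes; the next order is (eps/2) * (chi2 + 2K - 2N).
eps, chi2, K, N = 1e-4, 25.0, 3, 32
ic_eps = neg2_log_bayes_factor(chi2, K, N, 1.0 + eps, 1.0 - eps)
aic_like = 2.0 * ic_eps / eps + 2 * N

print(exact_diff, bic_diff)
print(aic_like, aic(chi2, K))
```

The rescaling by $2/\varepsilon$ and the shift by $2N$ in the last step only remove a positive factor and a $K$-independent constant, neither of which affects which $K$ minimises the criterion.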

Figure 2 summarises schematically the generic form of the data and model and noise parameter priors for the various information criteria. If the true behaviour of the system resulted from some true model with ${S}$ parameters plus noise, the data would correspond to ${S}$ non-zero parameters $\hat{\beta}_{k}$ and $N{-}{S}$ near-null parameters, both within some uncertainty range; this is sketched in the two leftmost columns. The third and fourth columns in Fig. 2 remind us that bic and aic set priors for both model and noise parameters centered around zero, differing only in the scale of the variation around zero for the noise parameters. The bic conflates probabilistic intervals for model parameters with point probabilities for noise parameters, a strong assumption which reduces its effectiveness for weak-signal cases. While consistently using intervals for all parameters, the aic fails to take account of the fact that model parameters will usually not be centered around zero. Consistent with the generic data behaviour, the robust version of bic displayed in the last two columns explicitly shifts model parameter priors away from zero with the help of a hyperparameter $\gamma$ and consistently uses intervals for all.

5 Noncentral radial priors and information criterion

The lesson of Fig. 2 is that properties and performance of information criteria depend crucially both on their treatment of noise parameters and on the location of the modes of the model parameter priors. Seen from this perspective, a generic weakness of the aic and bic is self-evident: their model parameter priors are maximal near zero, while the likelihood will be maximal for nonzero values of ${\bm{\beta}}_{\scriptscriptstyle K}$, resulting in poor overlap of prior and likelihood. This is exactly the problem which the Empirical Bayes criterion tries to correct, albeit in a nonrigorous way [George and Foster, 2000].

There is a good reason, of course, to maximise these priors around the origin. By definition, the state of knowledge embodied in a prior excludes the location of the data-dependent maximum ${\bm{\hat{\beta}}}_{\scriptscriptstyle K}$ , so that, barring supplementary prior information, the origin becomes the preferred mode for $p({\bm{\beta}}_{\scriptscriptstyle K}|{\mathcal{H}}_{\scriptscriptstyle K})$ . However, while ${\bm{\hat{\beta}}}_{\scriptscriptstyle K}$ itself may not be used, we can and should take into consideration the generic fact that for a good model the parameters ${\bm{\hat{\beta}}}_{\scriptscriptstyle K}$ will be nonzero. We do not know where in ${\mathcal{A}}_{\scriptscriptstyle K}$ the ${\bm{\hat{\beta}}}_{\scriptscriptstyle K}$ is located, but we do know that the posterior model radius $({\bm{\hat{\beta}}}_{\scriptscriptstyle K}^{\scriptscriptstyle\sf T}{\bm{\hat{\beta}}}_{\scriptscriptstyle K})^{1/2}$ is significantly nonzero for any and all data, which implies that the prior for what we call the prior model radius ${\parallel}{\bm{\beta}}_{\scriptscriptstyle K}{\parallel}=({\bm{\beta}}_{\scriptscriptstyle K}^{\scriptscriptstyle\sf T}{\bm{\beta}}_{\scriptscriptstyle K})^{1/2}$ should be chosen to be significantly nonzero. Moreover, the identity $\hat{{\bm{f}}}_{\scriptscriptstyle K}^{\scriptscriptstyle\sf T}\hat{{\bm{f}}}_{\scriptscriptstyle K}={\bm{\hat{\beta}}}_{\scriptscriptstyle K}^{\scriptscriptstyle\sf T}{\bm{\hat{\beta}}}_{\scriptscriptstyle K}$ in Eq. (15) and generally ${\bm{f}}_{\scriptscriptstyle K}^{\scriptscriptstyle\sf T}{\bm{f}}_{\scriptscriptstyle K}={\bm{\beta}}_{\scriptscriptstyle K}^{\scriptscriptstyle\sf T}{\bm{\beta}}_{\scriptscriptstyle K}$ imply that the same model radius ${\parallel}{\bm{\beta}}_{\scriptscriptstyle K}{\parallel}$ sets the scale both in model space ${\mathcal{A}}_{\scriptscriptstyle K}$ and in the corresponding parameter space ${\mathcal{A}}_{\beta,{\scriptscriptstyle K}}$ . 
Likewise, we have generic knowledge that a good model will be characterised by a near-zero posterior noise radius ${\parallel}{\bm{\hat{\beta}}}_{\scriptscriptstyle L}{\parallel}=({\bm{\hat{\beta}}}_{\scriptscriptstyle L}^{\scriptscriptstyle\sf T}{\bm{\hat{\beta}}}_{\scriptscriptstyle L})^{1/2}$ for which a near-zero prior noise radius ${\parallel}{\bm{\beta}}_{\scriptscriptstyle L}{\parallel}=({\bm{\beta}}_{\scriptscriptstyle L}^{\scriptscriptstyle\sf T}{\bm{\beta}}_{\scriptscriptstyle L})^{1/2}$ is of course appropriate, and the same noise radius $q={\parallel}{\bm{\beta}}_{\scriptscriptstyle L}{\parallel}$ sets the scale in both noise space ${\mathcal{A}}_{\scriptscriptstyle L}$ and its parameter space ${\mathcal{A}}_{\beta,{\scriptscriptstyle L}}$ . These considerations are compactly summarised in the last column of Fig. 2 and in the equation set

[TABLE]

All of this constitutes prior knowledge without reference to the particulars of the data. Crucially, this knowledge pertains to the radii. The model evidence is therefore expanded in terms of two radial parameters $q$ and $r$ and two radial priors,

[TABLE]

where the evidence conditioned on $r$ and $q$ is

[TABLE]

Given the factorised likelihood (22), the conditioned evidence also factorises,

[TABLE]

With only radial information available, the prior for ${\bm{\beta}}_{\scriptscriptstyle K}$ must be uniform on the $K$ -hypersphere,

[TABLE]

with the Dirac delta function constraining the vector ${\bm{\beta}}_{\scriptscriptstyle K}$ to the surface of the $K$ -sphere of radius $r$ , while spherical symmetry on the $L$ -hypersphere with radius $q$ requires

[TABLE]

As shown in [De Kock and Eggers, 2017], the conditioned evidence (36) can be expressed in closed form,

[TABLE]

We now turn to the radial priors. To capture the generic information $|\beta_{\scriptscriptstyle L}|\simeq 0$ , we presuppose that ${\bm{\beta}}_{\scriptscriptstyle L}\sim{\mathcal{N}}(0,\mathbb{I}_{\scriptscriptstyle L}\delta^{2})$ , a normal distribution with zero mode and variance $\delta^{2}$ , so that the prior for the radius $q$ is a chi-squared distribution or Gamma Distribution in $q^{2}/2\delta^{2}$ with hyperparameter $\delta$ and $L=N-K$ as usual,

[TABLE]

Likewise projecting a $K$ -dimensional normal distribution $p({\bm{\beta}}_{\scriptscriptstyle K}|\Delta,{\bm{\mu}})={\mathcal{N}}({\bm{\mu}},\mathbb{I}_{\scriptscriptstyle K}\Delta^{2})$ with nonzero mode ${\bm{\mu}}=(\mu_{1},\ldots,\mu_{\scriptscriptstyle K})$ and variance $\mathbb{I}_{\scriptscriptstyle K}\Delta^{2}$ onto radius $r$ results in a noncentral Gamma Distribution with radial hyperparameter $\gamma=[\sum_{k}\mu_{k}^{2}]^{1/2}$ ,

[TABLE]

Later, we shall interpret $\gamma$ as a signal-to-noise ratio, and with $\gamma\to 0$ , $p(r\,|\,\Delta,\gamma,K)$ consistently reverts to the ordinary Gamma Distribution characteristic of noise. The noncentral Gamma Distribution results from the projection of $\mathcal{N}({\bm{\mu}},\mathbb{I}_{\scriptscriptstyle K}\Delta^{2})$ onto the squared radius $p(r^{2}|\Delta,{\bm{\mu}})=\int\mathrm{d}\,{\bm{\beta}}_{\scriptscriptstyle K}p(r^{2}|{\bm{\beta}}_{\scriptscriptstyle K})\,p({\bm{\beta}}_{\scriptscriptstyle K}|{\bm{\mu}},\Delta)$ and using the integral representation of the Dirac delta function

[TABLE]

whereby

[TABLE]

becomes the Noncentral Gamma Distribution with the help of [Bateman et al., 1955]

[TABLE]

Inserting Eqs. (5), (41) and (42) into (35) and using [Bateman et al., 1955]

[TABLE]

the evidence is found to be

[TABLE]

where the Humbert function is defined in terms of Pochhammer symbols $(x)_{y}=\Gamma(x{+}y)/\Gamma(x)$ as [Bateman et al., 1953]

[TABLE]

which for equal arguments reduces to $\Psi_{(2)}(a,a,a;x,y)=e^{x+y}\,{}_{0}F_{1}(a;xy)$ , and so

[TABLE]
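The reduction of the Humbert function for equal arguments can be verified numerically. The sketch below assumes the standard double-series definition $\Psi_{2}(a;b,c;x,y)=\sum_{m,n\geq 0}\frac{(a)_{m+n}}{(b)_{m}(c)_{n}}\frac{x^{m}y^{n}}{m!\,n!}$ and the usual series for ${}_{0}F_{1}$; the truncation order, the test point and the function names are illustrative choices.

```python
import math

def pochhammer(x, n):
    """(x)_n = Gamma(x + n) / Gamma(x) for integer n >= 0, as a product."""
    out = 1.0
    for i in range(n):
        out *= x + i
    return out

def humbert_psi2(a, b, c, x, y, terms=40):
    """Truncated double series for Humbert's Psi_2(a; b, c; x, y)."""
    total = 0.0
    for m in range(terms):
        for n in range(terms):
            coeff = pochhammer(a, m + n) / (pochhammer(b, m) * pochhammer(c, n))
            total += coeff * x**m * y**n / (math.factorial(m) * math.factorial(n))
    return total

def hyp0f1(a, w, terms=40):
    """Truncated series for the confluent hypergeometric limit function 0F1(a; w)."""
    return sum(w**j / (pochhammer(a, j) * math.factorial(j)) for j in range(terms))

a, x, y = 2.5, 0.7, 1.3          # arbitrary test point
lhs = humbert_psi2(a, a, a, x, y)
rhs = math.exp(x + y) * hyp0f1(a, x * y)
print(lhs, rhs)
```

Only the standard library is used so that the check does not depend on any particular special-function package; for large arguments the truncated series would need more terms or a dedicated implementation.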

Unlike the aic derivation, we have no reason to maintain the distinction between noise and model prior variances and can set $\delta=\Delta$ , so the evidence reduces via Eq. (12) to

[TABLE]

If signal-to-noise ratios are known beforehand, $\gamma$ can be set to a fixed number; otherwise, it must remain indeterminate and integrated out. Aiming to have a maximally uniform but proper prior for $\gamma$ , we use a half-Gaussian with arbitrarily large variance $\sigma_{\gamma}^{2}$ ,

[TABLE]

yielding the evidence

[TABLE]

and Bayes Factor

[TABLE]

We can now take the limit $\sigma_{\gamma}\rightarrow\infty$ to obtain

[TABLE]

The role of $\gamma$ has been to differentiate model and noise parameter behaviour. For finite $\Delta$ , integration over both $\gamma$ in Eq. (53) and over $r$ in Eqs. (35) and (47) results, however, in redundancy which can safely be eliminated by letting $\Delta\rightarrow 0$ : unlike the aic, our scales are set not by $\Delta$ but by $\gamma$ so we have no further need for it. The Bayes Factor hence simplifies to

[TABLE]

Using the asymptotic properties of the confluent hypergeometric function [Bateman et al., 1953],

[TABLE]

we obtain for large $N$ a robust version of the bic,

[TABLE]

which for large $K$ reduces further to

[TABLE]

The three new forms of the bic in Eqs. (55), (56) and (58) are our central result.

We now show that, in the appropriate limits, the robust version approaches the aic and bic. The prior expectation value of $q^{2}$ for the Gamma Distribution (41) is $E[q^{2}]=L\delta^{2}/2$ , while $E[r^{2}]=\gamma^{2}+K\Delta^{2}/2$ for the Noncentral Gamma Distribution (42), so that via Eq. (43) each parameter scales on average as

[TABLE]

As a result, the expectation value of the partial sum $F_{j}^{2}=\sum_{i=1}^{j}\hat{\beta}_{i}^{2}$ for $j=1,\ldots,N$ scales as

[TABLE]

In the bic limit, $\Delta^{2}=N$ and $\delta=0$ , so that $E[F_{\scriptscriptstyle K}^{2}/K]=N/2+\mbox{constant}$ and so for $N\gg K$ , the robust bic becomes the bic up to a constant. In the aic limit, $\Delta^{2}=\delta^{2}=1$ so that $E[F_{\scriptscriptstyle K}^{2}/K]=1+\mathcal{O}(\gamma^{2}/K)$ and NIC reduces to approximately $\chi^{2}+K+\mathcal{O}(\gamma^{2}/K)$ , close to the aic’s $\chi^{2}+2K$ .

6 Results

To test the performance of our robust bic, we present in this section a numerical simulation, followed by semi-analytical estimates of the salient quantities.

In the first part, we tested the success rate of information criteria in correctly identifying the number of parameters $S$ for competing models with varying parameter number $K$ . Data sampling points $x_{n}$ were spread evenly over the interval $[0,\pi]$ ,

[TABLE]

and we generated $N=32$ data points throughout, setting the experimental uncertainties to $\sigma_{n}=1$ . For each model ${\mathcal{H}}_{\scriptscriptstyle S}$ constructed from ${S}$ simulation parameters, data ${\bm{z}}({S})=(z_{1}({S}),\ldots,z_{\scriptscriptstyle N}({S}))$ was generated as the sum of an “ideal data” term, a cosine series

[TABLE]

whose amplitude $a$ was varied randomly by an additive term $b\phi_{k}$ with $\phi_{k}$ drawn from the standardised Gaussian distribution $\phi_{k}\sim\mathcal{N}(0,1)$ and $b\geq 0$ an adjustable parameter. Conceptually, $a$ represents the signal strength while $b$ controls the variance of the signal. To simulate randomness associated with the experimental uncertainty normally captured in $\sigma_{n}$ , a second random term $\varepsilon_{n}\sim\mathcal{N}(0,1)$ was added, so that the 32 data points generated from the true parameter model ${\mathcal{H}}_{\scriptscriptstyle S}$ are

[TABLE]

For quenched values of $\phi_{k}$ and $\varepsilon_{n}$ , one dataset was generated for each ${S}=1,2,\ldots N=32$ . All datasets were efficiently computed in terms of $(N{\times}N)$ -dimensioned matrices

[TABLE]

where $\mathbb{D}$ contains the $N$ column vectors ${\bm{z}}({S})$, one for each ${S}$, matrix $\mathbb{X}_{\scriptscriptstyle N}$ has elements $(\mathbb{X}_{\scriptscriptstyle N})_{nk}=f_{k}(x_{n})$, $\mathbb{I}$ is the identity matrix, noise matrix $\mathbb{F}$ is diagonal with elements $\phi_{k}$ and $\mathbb{E}$ contains the $N^{2}$ Gaussian random numbers $\varepsilon_{n}({S})$ with unit variance. To limit the $k$-sum in Eq. (63) to ${S}$, one must include an upper-triangular matrix with components $\mathbb{A}_{{\scriptscriptstyle K},{\scriptscriptstyle S}}=\Theta(K,{S})=1$ for integers $K\geq{S}$ and 0 otherwise. To calculate $F_{\scriptscriptstyle K}^{2}$ and $\chi^{2}$ for given ${S}$ for use in the Bayes Factor (55) and elsewhere, we must modify the notation to keep track of the “true” number of parameters ${S}$ to be compared to the number of model parameters $K$. We therefore write ${\bm{\hat{\beta}}}_{{\scriptscriptstyle K}|{\scriptscriptstyle S}}$ for the parameter mode of the model with $K$ parameters for data simulated from ${S}$ parameters; correspondingly Eq. (15) becomes $F_{{\scriptscriptstyle K}|{\scriptscriptstyle S}}^{2}={\bm{\hat{\beta}}}_{{\scriptscriptstyle K}|{\scriptscriptstyle S}}^{\scriptscriptstyle\sf T}{\bm{\hat{\beta}}}_{{\scriptscriptstyle K}|{\scriptscriptstyle S}}^{\,}=\sum_{k=1}^{K}\hat{\beta}_{k|{\scriptscriptstyle S}}^{2}$.
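The data-generation step described above can be sketched numerically as follows. This is a minimal illustration under stated assumptions rather than the code used for the results: the basis is assumed to be the discrete orthonormal cosine (DCT-II) system on the midpoints $x_{n}=\pi(n-\tfrac{1}{2})/N$, the matrix relation is assumed to have the form $\mathbb{D}=\mathbb{X}_{\scriptscriptstyle N}(a\mathbb{I}+b\mathbb{F})\mathbb{A}+\mathbb{E}$, and the function names `cosine_design` and `simulate_datasets` are hypothetical.

```python
import numpy as np

def cosine_design(N):
    """Assumed orthonormal cosine basis: X[n, k] = f_k(x_n) on midpoints of [0, pi]."""
    x = np.pi * (np.arange(N) + 0.5) / N
    X = np.sqrt(2.0 / N) * np.cos(np.outer(x, np.arange(N)))
    X[:, 0] = 1.0 / np.sqrt(N)           # constant function, normalised separately
    return X

def simulate_datasets(N=32, a=1.0, b=1.0, seed=None):
    """Return (X, D); column s of D is the dataset built from S = s + 1 parameters."""
    rng = np.random.default_rng(seed)
    X = cosine_design(N)
    phi = rng.standard_normal(N)         # quenched amplitude fluctuations phi_k
    E = rng.standard_normal((N, N))      # unit-variance noise eps_n(S)
    amp = a + b * phi                    # diagonal of a*I + b*F
    # A[k, s] = 1 if basis function k (0-based) belongs to the first s + 1 terms
    A = np.triu(np.ones((N, N)))
    D = X @ (amp[:, None] * A) + E
    return X, D

X, D = simulate_datasets(N=32, a=1.0, b=1.0, seed=1)
```

Generating all $N$ datasets as columns of one matrix keeps the subsequent mode and $\chi^{2}$ calculations to a few matrix products.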

For model construction, we used the same cosine functions (62) used for data generation, and since the cosines form an orthonormal system, the Hessian is diagonal, $\mathbb{H}_{\scriptscriptstyle K}=\mathbb{X}_{{\scriptscriptstyle K}|{\scriptscriptstyle S}}^{\scriptscriptstyle\sf T}\mathbb{X}_{{\scriptscriptstyle K}|{\scriptscriptstyle S}}=\mathbb{I}_{\scriptscriptstyle K}$ , as are the rotation and eigenvalue matrices, so that with $\bm{\hat{\alpha}}_{{\scriptscriptstyle K}|{\scriptscriptstyle S}}=\mathbb{H}_{\scriptscriptstyle K}^{-1}\mathbb{X}_{\scriptscriptstyle K}^{\scriptscriptstyle\sf T}{\bm{z}}({S})$ the mode simplifies to

[TABLE]

The $N{\times}K$ design matrix $\mathbb{X}_{\scriptscriptstyle K}$ is augmented by means of a projector $\mathbb{P}_{\scriptscriptstyle K}$ which contains 1’s along its first $K$ diagonal elements and 0 elsewhere; the truncated design matrix $\mathbb{X}_{\scriptscriptstyle K}=\mathbb{X}_{\scriptscriptstyle N}\mathbb{P}_{\scriptscriptstyle K}$ then contains zeros in the last $N{-}K$ columns of the $N{\times}N$ matrix and $f_{k}(x_{n})$ elsewhere. The $N{\times}N$ matrix of modes $\mathbb{B}$ with elements $\mathbb{B}_{{\scriptscriptstyle K},{\scriptscriptstyle S}}=\hat{\beta}_{{\scriptscriptstyle K}|{\scriptscriptstyle S}}$ is then compactly represented as

[TABLE]

Note that this matrix formulation is possible only for orthogonal basis functions; for nonorthogonal cases, the Hessian and its eigensystem must be recalculated for every $K$ .
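As an illustration of the projector construction, the sketch below computes the mode of a $K$-parameter model and the full tables of $F_{{\scriptscriptstyle K}|{\scriptscriptstyle S}}^{2}$ and $\chi^{2}$. It assumes the same hypothetical DCT-II cosine basis with $\sigma_{n}=1$; for an orthonormal basis the residual satisfies $\chi^{2}={\bm{z}}^{\scriptscriptstyle\sf T}{\bm{z}}-F_{\scriptscriptstyle K}^{2}$, which is what the cumulative sum exploits. The function names are illustrative only.

```python
import numpy as np

def cosine_design(N):
    """Assumed orthonormal cosine basis (DCT-II on midpoints of [0, pi])."""
    x = np.pi * (np.arange(N) + 0.5) / N
    X = np.sqrt(2.0 / N) * np.cos(np.outer(x, np.arange(N)))
    X[:, 0] = 1.0 / np.sqrt(N)
    return X

def modes_for_model(X, z, K):
    """Mode of the K-parameter model via the projector P_K."""
    N = X.shape[1]
    P = np.diag((np.arange(N) < K).astype(float))  # ones on the first K diagonal entries
    X_K = X @ P                                    # last N - K columns are zero
    return X_K.T @ z                               # Hessian is the identity, no inverse needed

def signal_and_chi2(X, D):
    """F2[K-1, s] and chi2[K-1, s] for every model size K and dataset s."""
    C = X.T @ D                                    # all coefficients at once
    F2 = np.cumsum(C**2, axis=0)                   # partial sums of squared modes
    chi2 = np.sum(D**2, axis=0)[None, :] - F2      # valid only for an orthonormal basis
    return F2, chi2
```

For a nonorthogonal basis neither shortcut applies, and a least-squares solve per $K$ would be needed instead.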

We now turn to the test results. Given a dataset with ${S}$ parameters, the $K$ value which minimises the AIC, BIC or the robust version in its Eq. (58) form is deemed a success if that $K$ correctly matches the data’s ${S}$ . In Fig. 3, we plot the number of successes as percentages of $2^{16}$ repetitions of 32 datasets as described above as a function of ${S}$ for four information criteria. The upper panel displays results for weak-signal data generated with $a=1,b=1$ , while the strong-signal data shown in the lower panel used $a=5,b=1$ . Red diamonds represent the robust bic success rate, green circles the corresponding aic success rate and black triangles the bic. Also shown as blue squares is the success rate of the corrected aic, [George and Foster, 2000].
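The success-counting procedure can be sketched as below for the aic ($\chi^{2}+2K$) and the bic in its standard form ($\chi^{2}+K\ln N$). The sketch is hedged in several ways: it assumes the hypothetical DCT-II cosine basis and the data model $z_{n}({S})=\sum_{k\leq S}(a+b\phi_{k})f_{k}(x_{n})+\varepsilon_{n}$, it uses far fewer repetitions than the $2^{16}$ quoted above, and it omits the robust criterion, which would enter as one more penalty evaluated from $F_{{\scriptscriptstyle K}|{\scriptscriptstyle S}}^{2}$ in its Eq. (58) form.

```python
import numpy as np

def cosine_design(N):
    """Assumed orthonormal cosine basis (DCT-II on midpoints of [0, pi])."""
    x = np.pi * (np.arange(N) + 0.5) / N
    X = np.sqrt(2.0 / N) * np.cos(np.outer(x, np.arange(N)))
    X[:, 0] = 1.0 / np.sqrt(N)
    return X

def success_rates(a, b, N=32, reps=200, seed=0):
    """Percentage of repetitions in which argmin_K of each criterion equals S."""
    rng = np.random.default_rng(seed)
    X = cosine_design(N)
    A = np.triu(np.ones((N, N)))                  # limits the k-sum to S
    K = np.arange(1, N + 1)[:, None]              # model sizes as a column
    S = np.arange(1, N + 1)                       # true sizes, one per dataset
    penalties = {"aic": 2.0 * K, "bic": K * np.log(N)}
    hits = {name: np.zeros(N) for name in penalties}
    for _ in range(reps):
        amp = a + b * rng.standard_normal(N)      # fresh phi_k for each repetition
        D = X @ (amp[:, None] * A) + rng.standard_normal((N, N))
        C = X.T @ D
        chi2 = np.sum(D**2, axis=0)[None, :] - np.cumsum(C**2, axis=0)
        for name, pen in penalties.items():
            best_K = np.argmin(chi2 + pen, axis=0) + 1
            hits[name] += (best_K == S)
    return {name: 100.0 * h / reps for name, h in hits.items()}

weak = success_rates(a=1.0, b=1.0)     # weak-signal setting
strong = success_rates(a=5.0, b=1.0)   # strong-signal setting
```

Each returned array holds one success percentage per ${S}$ and can be plotted against ${S}$ for comparison with Fig. 3.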

As expected, the bic performs poorly in the weak-signal environment where the models are badly specified but very well for strong signals, where the models are more successful. Conversely, the aic is more successful in the weak-signal case but underperforms the bic for strong signals. It is not a surprise that the corrected aic, which was designed for a particular subset of data scenarios, does very well in the mid-range of the strong-signal case but fails badly otherwise. (The corrected aic corrects the aic formula for small $N$ and therefore small ${S}$. In effect, this improves the aic for strong-signal cases, but destroys its performance for the weak-signal case and larger ${S}$.) By contrast, the robust bic matches or exceeds the performance of all other information criteria in both the strong and weak signal scenarios. The exact bic result (55) and the form in Eq. (58) differ by less than one percent.

The general increase in success rates for ${S}$ near the simulation lower limit 1 and upper limit 32 is easily understood because there are fewer alternatives to ${S}$ at these edges. While this rise will persist for small ${S}$, it will for large ${S}$ shift with increasing $N$ and is therefore a nonpersistent “boundary effect”.

In the second test, we utilise the simple linear system of Eqs. (64)–(66) to obtain analytical estimates of the squared signal and noise for a detailed but statistically approximate analysis of the shapes and sizes of the criteria’s $K$ vs ${S}$ curves. Because $\mathbb{F}^{\scriptscriptstyle\sf T}\mathbb{F}\simeq\mathbb{I}$ and $\mathrm{var}(\varepsilon)=1$ , the squared data vectors scale approximately as

[TABLE]

The squared signal is obtained from the diagonal elements of the squared mode matrix, $F_{{\scriptscriptstyle K}|{\scriptscriptstyle S}}^{2}=[\mathbb{B}^{\scriptscriptstyle\sf T}\mathbb{B}]_{{\scriptscriptstyle S},{\scriptscriptstyle S}}$ . Inserting the explicit simulation model (64) results in an approximate estimate of

[TABLE]

while the squared data vector and squared noise vector are obtained from

[TABLE]

These expressions provide instructive, if approximate, insights into the behaviour of the information criteria as a function of model parameter number $K$. Fig. 4 illustrates by example the shapes of the minima as functions of $K$ of the simplest robust bic form (58) as well as the aic and bic for fixed ${S}=8,b=0$ and strong-signal $a=3$ and weak-signal $a=1$ scenarios. The aic and bic do not depend on $F_{\scriptscriptstyle K}^{2}$ but only on $\chi^{2}$, which exhibits the well-known behaviour of steadily decreasing with $K$. Upper and lower branches of these curves denote the strong-signal and weak-signal cases respectively. Based on $\chi^{2}$ and the simple penalty terms, the aic and bic both exhibit a reasonably strong minimum at $K{=}{S}$ for $a=3$; for the weak-signal $a{=}1$, however, the aic remains flat while the bic has no minimum at all. This is reflected in the low bic success rate in Fig. 3. Since $\chi^{2}$ becomes independent of $a$ for $K\geq{S}$, the aic and bic do not distinguish between strong and weak scenarios in that region.
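The qualitative behaviour of the aic and bic curves can be reproduced with a back-of-envelope estimate. For an orthonormal basis, unit noise variance and $b=0$, each fitted coefficient is distributed as $\mathcal{N}(a,1)$ for $k\leq{S}$ and $\mathcal{N}(0,1)$ otherwise, so that $E[\chi^{2}]=(N-K)+a^{2}\max({S}-K,0)$. This closed form is an illustrative assumption introduced here and is not taken from Eqs. (67)–(69); the function name `expected_curves` is likewise hypothetical.

```python
import numpy as np

def expected_curves(a, S=8, N=32):
    """Expected aic and bic versus K for b = 0, unit noise and an orthonormal basis."""
    K = np.arange(1, N + 1)
    chi2 = (N - K) + a**2 * np.maximum(S - K, 0)   # independent of a once K >= S
    aic = chi2 + 2 * K
    bic = chi2 + K * np.log(N)
    return K, aic, bic

K, aic_strong, bic_strong = expected_curves(a=3.0)
_, aic_weak, bic_weak = expected_curves(a=1.0)
```

Below $K={S}$ the aic has slope $1-a^{2}$ and the bic has slope $\ln N-1-a^{2}$ in $K$. For $a=3$ both slopes are negative, giving a minimum at $K={S}$. For $a=1$ the aic slope vanishes, so the curve is flat up to $K={S}$, while the bic slope is positive for $N=32$, so the bic increases monotonically and has no interior minimum.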

The robust bic, by contrast, is sensitive to the squared signal $F_{\scriptscriptstyle K}^{2}$, which lifts the degeneracy between strong and weak signal for $K\geq{S}$. Like the aic and bic, the robust bic has no trouble identifying $K=S$ for strong signal. For weak signal, it exhibits a minimum at the correct answer, albeit a shallow one. Shallow minima reflect the inherent uncertainty regarding the signal or noise character of the data. Details of Figure 4 and its discussion are, of course, specific to the model and numbers used and of illustrative value only.

7 Discussion and conclusions

The robust version of the bic introduced in this paper is based on three simple but novel ideas. Firstly, we have expanded Akaike’s original argument for a larger model space into a model space plus a fully-fledged noise space which together partition the entire data space. The resulting symmetries and scale behaviour of model and noise space provide a surprisingly unified and indeed beautiful framework for linear regression.

Secondly, building on the insight of earlier work [De Kock and Eggers, 2017], we posit that both model parameter and noise parameter spaces should be projected onto a radial coordinate on the respective hypersphere. Unlike [De Kock and Eggers, 2017], however, we now have not one but two hyperspheres reflecting the separate symmetries and scales of the model and noise spaces.

The third insight is that the crucial difference between model and noise parameters lies not in the scales $\delta$ and $\Delta$ — indeed we set these equal and eventually even set $\Delta=0$ — but in explicitly taking into account that the maximum-likelihood parameter vector’s magnitude ${\parallel}{\bm{\hat{\beta}}}_{\scriptscriptstyle K}{\parallel}$ must, by the very definition of “signal”, be significantly nonzero, while ${\parallel}{\bm{\hat{\beta}}}_{\scriptscriptstyle L}{\parallel}\simeq 0$ for noise. This results in a Gamma Distribution for noise parameters arising from projection of a zero-mode Gaussian on the one hand, and a noncentral Gamma Distribution for model parameters arising from projection of a nonzero-mode Gaussian.

Together, these three insights have allowed us to calculate Bayes Factors for model comparison in closed form and construct a robust version of the bic which extends the robustness the aic has against model misspecification to the bic. Unlike the latter, the robust bic depends explicitly on the squared signal strength $F_{\scriptscriptstyle K}^{2}$, and as $F_{\scriptscriptstyle K}^{2}$ approaches the weak or the strong signal limit, the robust bic correspondingly approaches the aic and bic cases as limiting forms.

The noncentrality parameter $\gamma$ as a measure of signal strength appears to be the essence of the difference between signal and noise. Where the signal-to-noise ratio is known beforehand, $\gamma$ can be set to a fixed number or restricted to a limited interval. In the general case of unknown signal-to-noise ratio, however, it is better to integrate $\gamma$ over all possible values as implemented here.

We conclude with a few general remarks. Naturally, the scope of the numerical results presented here is limited, and this robust version of the bic should be tested and possibly improved when applied to a diversity of data scenarios. The present results should also be extended from the fixed experimental uncertainties ${\bm{\sigma}}$ to variable $\sigma$ . The analysis was done in the context of linear regression and should strictly speaking be used only in that context. The degree of success for nonlinear situations cannot be estimated or guaranteed within the present framework. Our derivations presume that there is only one model per $K$ . This limited approach is easily generalised to include more than one model for a given $K$ using, for example, indicator vectors as set out in [Liang et al., 2008].

Acknowledgements

This work was supported in part by the South African National Research Foundation.

Bibliography

[Aho et al., 2014] Aho, K., Derryberry, D., and Peterson, T. (2014). Model selection for ecologists: the worldviews of AIC and BIC. Ecology, 95(3):631–636.
[Akaike, 1974] Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6):716–723.
[Akaike, 1978] Akaike, H. (1978). A Bayesian analysis of the minimum AIC procedure. Annals of the Institute of Statistical Mathematics, 30(1):9–14.
[Bateman et al., 1953] Bateman, H., Erdélyi, A., Magnus, W., Oberhettinger, F., and Tricomi, F. G. (1953). Higher Transcendental Functions, volume 1. McGraw-Hill, New York.
[Bateman et al., 1955] Bateman, H., Erdélyi, A., Magnus, W., Oberhettinger, F., and Tricomi, F. G. (1955). Higher Transcendental Functions, volume 2. McGraw-Hill, New York.
[Burnham and Anderson, 2003] Burnham, K. P. and Anderson, D. R. (2003). Model selection and multimodel inference: a practical information-theoretic approach. Springer Science & Business Media, 2nd edition.
[De Kock and Eggers, 2017] De Kock, M. B. and Eggers, H. C. (2017). Bayesian variable selection with spherically symmetric priors. Communications in Statistics - Theory and Methods, 46(9):4250–4263.
[George and Foster, 2000] George, E. I. and Foster, D. P. (2000). Calibration and empirical Bayes variable selection. Biometrika, 87:731–747.