Robust approximate Bayesian inference

Erlis Ruli; Nicola Sartori; Laura Ventura

arXiv:1706.01752·stat.ME·June 13, 2019

TL;DR

This paper introduces a robust Bayesian inference method using $M$-estimating functions within ABC algorithms, with theoretical analysis, simulations, and a clinical application demonstrating its effectiveness.

Contribution

It presents a novel approach combining $M$-estimating functions with ABC for robust posterior inference, including theoretical properties and practical implementation.

Findings

1. The method produces robust posterior distributions in linear mixed models.
2. Simulation studies validate the approach's effectiveness.
3. Application to clinical data demonstrates practical utility.

Abstract

We discuss an approach for deriving robust posterior distributions from $M$-estimating functions using Approximate Bayesian Computation (ABC) methods. In particular, we use $M$-estimating functions to construct suitable summary statistics in ABC algorithms. The theoretical properties of the robust posterior distributions are discussed. Special attention is given to the application of the method to linear mixed models. Simulation results and an application to a clinical study demonstrate the usefulness of the method. An R implementation is also provided in the robustBLME package.

Equations

$$\Psi_{\theta}=\Psi(y;\theta)=\sum_{i=1}^{n}\psi(y_{i};\theta)-c(\theta),$$

$$K(\theta)=H(\theta)^{-1}J(\theta)H(\theta)^{-\top},$$

$$\pi_{R}(\theta\mid y)\propto\pi(\theta)\,L_{R}(\theta),$$

$$\eta_{R}(y^{*};\theta)=B_{R}(\theta)^{-1}\Psi(y^{*};\theta),$$

$$\pi_{R}^{ABC}(\theta\mid\tilde{\theta})=\frac{\int_{\mathcal{Y}^{*}}\pi(\theta)f(y^{*};\theta)K_{h}(\tilde{\eta}_{R}(y^{*}))\,dy^{*}}{\int_{\mathcal{Y}^{*}\times\Theta}\pi(\theta)f(y^{*};\theta)K_{h}(\tilde{\eta}_{R}(y^{*}))\,dy^{*}\,d\theta}.$$

$$\tilde{\eta}_{R}(y^{*})=B_{R}(\tilde{\theta})^{-1}\left(\Psi(y^{*};\tilde{\theta})-\Psi(y;\tilde{\theta})\right)=B_{R}(\tilde{\theta})^{-1}\sum_{i=1}^{n}\left(\psi(y_{i}^{*};\tilde{\theta})-\psi(y_{i};\tilde{\theta})\right).$$

$$\pi_{R}^{ABC}(\theta\mid\tilde{\theta})\mathrel{\dot{\sim}}N_{d}(\tilde{\theta},K(\tilde{\theta})).$$

$$\eta_{R}(y;\theta)^{\top}\eta_{R}(y;\theta)=\Psi_{\theta}^{\top}J(\theta)^{-1}\Psi_{\theta}=(\tilde{\theta}-\theta)^{\top}K(\theta)^{-1}(\tilde{\theta}-\theta)+o_{p}(1).$$

$$\Psi_{\mu}=\sum_{i=1}^{n}\psi_{c_{1}}(z_{i})\quad\text{and}\quad\Psi_{\sigma}=\sum_{i=1}^{n}\left(\psi_{c_{2}}(z_{i})^{2}-k(c_{2})\right),$$

$$y=X\alpha+\sum_{i=1}^{c-1}Z_{i}\beta_{i}+\varepsilon,$$

$$\ell(\theta)=\log L(\theta)=-\frac{1}{2}\sum_{j=1}^{g}\left\{\log|V_{j}|+(y_{j}-X_{j}\alpha)^{\top}V_{j}^{-1}(y_{j}-X_{j}\alpha)\right\},$$

$$X^{\top}V^{-1/2}\psi_{c_{1}}(r)=0,$$

$$\psi_{c_{2}}(r)^{\top}V^{-1/2}Z_{i}Z_{i}^{\top}V^{-1/2}\psi_{c_{2}}(r)-\text{tr}(CPZ_{i}Z_{i}^{\top})=0,\quad i=1,\dots,c,$$

$$y_{ij}=\mu+\alpha_{j}+\beta_{i}+\varepsilon_{ij},$$

$$y_{i}=X_{i}^{\top}\alpha+X_{i}^{\top}\times w_{i}\gamma+\beta_{i}1_{6}+\varepsilon_{i},\quad i=1,\dots,24,$$

$$\eta_{R}(y;\theta)\sim N_{d}(0_{d},I_{d}),$$


Full text

Robust approximate Bayesian inference

Erlis Ruli, Nicola Sartori and Laura Ventura

Department of Statistical Sciences, University of Padova, Italy

Keywords: Influence function; likelihood-free inference; $M$-estimators; quasi-likelihood; robustness; unbiased estimating function.

1 Introduction

The normality assumption underlies many statistical analyses in fields such as medicine, health sciences, quality control and engineering statistics. Under this assumption, standard parametric estimation and testing procedures are simple and efficient. However, from both a frequentist and a Bayesian perspective, it is well known that these procedures are not robust when the normal distribution is only an approximate model or when the observed data contain outliers. In these situations, robust statistical methods can be used to produce statistical procedures that are stable with respect to small changes in the data or to small model departures; see Huber and Ronchetti (2009) for a review of robust methods.

The concept of robustness has been widely discussed in the frequentist literature; see, for instance, Hampel et al. (1986), Tsou and Royall (1995) and Markatou et al. (1998). Bayesian robustness with respect to model misspecification has also attracted considerable attention. For instance, Lazar (2003), Greco et al. (2008), Ventura et al. (2010) and Agostinelli and Greco (2013) discuss approaches based on robust pseudo-likelihood functions, such as the empirical likelihood, as a replacement for the genuine likelihood in Bayes’ formula. Lewis et al. (2014) discuss an approach for building posterior distributions from robust $M$-estimators using constrained Markov Chain Monte Carlo (MCMC) methods. Recent approaches based on tilted likelihoods can be found in Grünwald and van Ommen (2017), Watson and Holmes (2016) and Miller and Dunson (2018). Finally, approaches based on model embedding through heavy-tailed distributions are discussed by Andrade and O’Hagan (2006).

The aforementioned approaches may present some drawbacks. The empirical likelihood is not computable for small sample sizes, and posterior distributions based on the quasi-likelihood can be easily obtained only for scalar parameters. The restricted likelihood approach of Lewis et al. (2014), as well as all the approaches based on estimating equations, can be computationally cumbersome with some robust $M$-estimating functions (such as, for instance, those used in linear mixed effects models). The tilted and the weighted likelihood approaches refer to concepts of robustness that are not directly related to the one considered in this paper, which is based on the influence function (Hampel et al., 1986; Huber and Ronchetti, 2009). Finally, the idea of embedding the model in a larger structure has the cost of requiring the elicitation of a prior distribution for the extra parameters introduced. Moreover, the statistical procedures derived under an embedded model are not necessarily robust in a broad sense, since the larger model may still be too restrictive.

Here we focus on the robustness approach based on the influence function and on the derivation of robust posterior distributions from robust M-estimating functions, i.e. estimating equations with bounded influence function (see, e.g., Huber and Ronchetti, 2009, Chap. 3). In particular, we propose an approach based on Approximate Bayesian Computation (ABC) methods (see, e.g., Beaumont et al., 2002) using robust M-estimating functions as summary statistics. The idea extends results of Ruli et al. (2016) on composite score functions to Bayesian robustness. The method is easy to implement and computationally efficient, even when the M-estimating functions are potentially cumbersome to evaluate. Theoretical properties, implementation details and simulation results are discussed.

The rest of the paper is structured as follows. Section 2 sets the background. Section 3 describes the proposed method and its properties. Section 4 investigates the properties of the proposed method in the context of linear mixed models through simulations and an application to a clinical study. Concluding remarks are given in Section 5.

2 Background on robust $M$-estimating functions

Let $y=(y_{1},\ldots,y_{n})$ be a random sample of size $n$, having independent and identically distributed components, according to a distribution function $F_{\theta}=F(y;\theta)$, with $\theta\in\Theta\subseteq\mathbb{R}^{d}$, $d\geq 1$ and $y\in\mathcal{Y}$. Let $L(\theta)$ be the likelihood function based on model $F_{\theta}$.

Furthermore, let

$$\Psi_{\theta}=\Psi(y;\theta)=\sum_{i=1}^{n}\psi(y_{i};\theta)-c(\theta),\qquad(1)$$

be an unbiased estimating function for $\theta$ , i.e. such that $E_{\theta}(\Psi(Y;\theta))=0$ for every $\theta$ . In (1), $\psi(\cdot)$ is a known function, $E_{\theta}(\cdot)$ is the expectation with respect to $F_{\theta}$ and the function $c(\cdot)$ is a consistency correction which ensures unbiasedness of the estimating function.

A general $M$-estimator (see, e.g., Hampel et al., 1986; Huber and Ronchetti, 2009) is defined as the root $\tilde{\theta}$ of the estimating equation $\Psi_{\theta}=0$. The class of $M$-estimators is wide and includes a variety of well-known estimators. For example, it includes the maximum likelihood estimator (MLE), the maximum composite likelihood estimator (see, e.g., Ruli et al., 2016, and references therein) and the scoring rule estimator (see, e.g., Dawid et al., 2016, and references therein). Under broad regularity conditions, assumed throughout this paper, an $M$-estimator is consistent and approximately normal with mean $\theta$ and variance

$$K(\theta)=H(\theta)^{-1}J(\theta)H(\theta)^{-\top},\qquad(2)$$

where $H(\theta)=-E_{\theta}(\partial\Psi_{\theta}/\partial\theta^{\top})$ and $J(\theta)=E_{\theta}(\Psi_{\theta}\Psi_{\theta}^{\top})$ are the sensitivity and the variability matrices, respectively. The matrix $G(\theta)=K(\theta)^{-1}$ is known as the Godambe information, and the form of $K(\theta)$ is due to the failure of the information identity since, in general, $H(\theta)\neq J(\theta)$.

The influence function (IF) of the estimator $\tilde{\theta}$ is $\mathrm{IF}(x;\tilde{\theta},F_{\theta})\propto\psi(x;\theta)$ and it measures the effect on the estimator $\tilde{\theta}$ of an infinitesimal contamination at the point $x$, standardised by the mass of the contamination. A desirable robustness property for $\tilde{\theta}$ is that its IF is bounded (B-robustness), i.e. that $\psi(x;\theta)$ is bounded in $x$. Note that the IF of the MLE is proportional to the score function; therefore, in general, the MLE has unbounded IF, i.e. it is not B-robust.
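As a small numerical illustration of the sandwich form, the sketch below computes $K=H^{-1}JH^{-\top}$ and the Godambe information for two made-up matrices. The function name `sandwich_covariance` and the numerical values of `H` and `J` are hypothetical and chosen only for illustration; they do not come from any model in the paper.

```python
import numpy as np

def sandwich_covariance(H, J):
    """Return K = H^{-1} J H^{-T} for a sensitivity matrix H and a variability matrix J."""
    H_inv = np.linalg.inv(H)
    return H_inv @ J @ H_inv.T

# Illustrative (made-up) symmetric sensitivity matrix.
H = np.array([[2.0, 0.3], [0.3, 1.0]])

# If the information identity held (J equal to H), K would collapse to H^{-1}.
K_identity = sandwich_covariance(H, H)

# With J different from H the sandwich form no longer simplifies.
J = np.array([[3.0, 0.1], [0.1, 2.0]])
K = sandwich_covariance(H, J)
godambe = np.linalg.inv(K)  # G = K^{-1} = H^T J^{-1} H
```

When `J` equals `H`, the middle factor cancels and only the inverse sensitivity remains, which is the familiar likelihood case.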

3 Robust ABC inference

One possibility to perform robust Bayesian inference is to resort to a pseudo-posterior distribution of the form

$$\pi_{R}(\theta\mid y)\propto\pi(\theta)\,L_{R}(\theta),\qquad(3)$$

where $\pi(\theta)$ is a prior distribution for $\theta$ and $L_{R}(\theta)$ is a pseudo-likelihood based on a robust $\Psi_{\theta}$, such as the quasi- or the empirical likelihood. This approach has two main drawbacks. First, the empirical likelihood is not computable for very small sample sizes, and for moderate sample sizes the corresponding posterior always appears to have heavy tails (see, e.g., Greco et al., 2008). Second, the posterior distribution based on the quasi-likelihood can be easily obtained only for scalar parameters. A further limitation is computational: the approach requires repeated evaluations of the consistency correction $c(\theta)$ in (1), which in practice is often cumbersome.

We propose an alternative method for computing posterior distributions based on robust $M$-estimating functions, extending the idea in Ruli et al. (2016). The method resorts to the ABC machinery (see, e.g., Beaumont et al., 2002) in which a standardised version of $\Psi_{\theta}$, evaluated at a fixed value of $\theta$, is used as a summary statistic. In Ruli et al. (2016) the composite score function is used as a model-based data reduction procedure for ABC in complex models. Here we generalise the approach to general unbiased robust estimating functions. In particular, let $\tilde{\theta}=\tilde{\theta}(y)$ be the $M$-estimate of $\theta$ based on the observed sample $y$. Furthermore, let $B_{R}(\theta)$ be such that $J(\theta)=B_{R}(\theta)B_{R}(\theta)^{\top}$. The summary statistic in ABC is then the rescaled $M$-estimating function

$$\eta_{R}(y^{*};\theta)=B_{R}(\theta)^{-1}\Psi(y^{*};\theta),\qquad(4)$$

evaluated at $\tilde{\theta}$, where $y^{*}$ is a simulated sample. In the sequel we use the shorthand notation $\tilde{\eta}_{R}(y^{*})=\eta_{R}(y^{*};\tilde{\theta})$.

To generate posterior samples we propose to use the ABC-R algorithm with an MCMC kernel (Algorithm 1), which is similar to Algorithm 2 of Fearnhead and Prangle (2012); see also Marjoram et al. (2003). More specifically, the ABC-R algorithm (Algorithm 1) involves a kernel density $K_{h}(\cdot)$ , which is governed by the bandwidth $h>0$ and a proposal density $q(\cdot|\cdot)$ ; see the Appendix for the implementation details.
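Since Algorithm 1 itself is not reproduced here, the following is only a minimal sketch of a generic ABC-MCMC sampler of this kind, not the authors' implementation. It assumes a $N(\mu,1)$ location model with known unit scale, Huber's $\psi$-function as the estimating function (so that $c(\theta)=0$ by symmetry), a Gaussian kernel for $K_{h}$ and a symmetric random-walk proposal. The names `huber_location` and `abc_r_mcmc` and all tuning values are hypothetical.

```python
import math
import numpy as np

def huber_psi(z, c=1.345):
    """Huber psi-function: clip z to the interval [-c, c]."""
    return np.clip(z, -c, c)

def huber_location(y, c=1.345, tol=1e-10):
    """Root of sum(psi_c(y_i - mu)) = 0, found by bisection (scale fixed at 1)."""
    lo, hi = float(np.min(y)), float(np.max(y))
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if np.sum(huber_psi(y - mid, c)) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def abc_r_mcmc(y, log_prior, n_iter=2000, h=0.5, step=0.3, c=1.345, seed=0):
    """ABC-MCMC for a N(mu, 1) model with a rescaled Huber summary statistic."""
    rng = np.random.default_rng(seed)
    n = len(y)
    mu_tilde = huber_location(y, c)
    # J = n * E[psi_c(Z)^2] for Z ~ N(0, 1); B_R is its square root (scalar case).
    Phi = 0.5 * (1.0 + math.erf(c / math.sqrt(2.0)))
    phi = math.exp(-0.5 * c * c) / math.sqrt(2.0 * math.pi)
    e_psi2 = (2.0 * Phi - 1.0) - 2.0 * c * phi + 2.0 * c * c * (1.0 - Phi)
    b_r = math.sqrt(n * e_psi2)

    def summary(y_star):
        # Rescaled estimating function at mu_tilde; it is zero for the observed y.
        return np.sum(huber_psi(y_star - mu_tilde, c)) / b_r

    def log_kernel(s):
        # Gaussian kernel K_h, up to an additive constant.
        return -0.5 * (s / h) ** 2

    mu = mu_tilde
    s = summary(rng.normal(mu, 1.0, n))
    chain = np.empty(n_iter)
    for t in range(n_iter):
        mu_prop = mu + step * rng.normal()            # symmetric proposal
        s_prop = summary(rng.normal(mu_prop, 1.0, n)) # simulate y* and summarise
        log_ratio = (log_kernel(s_prop) + log_prior(mu_prop)
                     - log_kernel(s) - log_prior(mu))
        if rng.uniform() < math.exp(min(0.0, log_ratio)):
            mu, s = mu_prop, s_prop
        chain[t] = mu
    return mu_tilde, chain

# Usage: normal data with a few gross outliers and a N(0, 10^2) prior on mu.
rng = np.random.default_rng(1)
y = rng.normal(0.0, 1.0, 50)
y[:3] = [12.0, -15.0, 20.0]
log_prior = lambda mu: -0.5 * (mu / 10.0) ** 2
mu_tilde, chain = abc_r_mcmc(y, log_prior)
```

Because the proposal is symmetric, the proposal densities cancel in the acceptance ratio; a non-symmetric $q(\cdot|\cdot)$ would need the usual Hastings correction.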

The proposed method gives Markov-dependent samples from the ABC-R posterior

$$\pi_{R}^{ABC}(\theta\mid\tilde{\theta})=\frac{\int_{\mathcal{Y}^{*}}\pi(\theta)f(y^{*};\theta)K_{h}(\tilde{\eta}_{R}(y^{*}))\,dy^{*}}{\int_{\mathcal{Y}^{*}\times\Theta}\pi(\theta)f(y^{*};\theta)K_{h}(\tilde{\eta}_{R}(y^{*}))\,dy^{*}\,d\theta}.\qquad(5)$$

While neither Algorithm 1 nor the use of a kernel in (5) is new in the ABC literature, the novelty here is to incorporate in such machinery the robust summary statistic $\tilde{\eta}_{R}(y^{*})$ in order to obtain a simulated sample from a robust posterior distribution. Using similar arguments to Soubeyrand et al. (2013), it can be shown that, for $h\to 0$, $\pi_{R}^{ABC}(\theta|\tilde{\theta})$ converges to $\pi(\theta|\tilde{\theta})$ pointwise (see also Blum, 2010), in the sense that $\pi_{R}^{ABC}(\theta|\tilde{\theta})$ and $\pi(\theta|\tilde{\theta})$ are equivalent for sufficiently small $h$. Since in general (4) does not give a sufficient summary statistic, $\pi(\theta|\tilde{\theta})$ differs from $\pi(\theta|y)$ and information is lost by using (4) instead of $y$. However, this loss of information pays off in terms of robustness in inference about $\theta$.

Posteriors conditional on partial information have been extensively discussed in the literature. Soubeyrand and Haon-Lasportes (2015) study the properties of the ABC posterior when the summary statistic is the MLE or the pseudo-MLE derived from a simplified parametric model. An alternative version of the ABC-R algorithm could be based directly on $\tilde{\theta}$ as the summary statistic, together with a possibly rescaled distance between the observed and the simulated values of the statistic. These two versions of ABC, namely the one based on $\tilde{\theta}$ and the one based on (4), seem to be treated in the literature as two separate approaches (see, e.g., Drovandi et al., 2015). However, both alternatives use essentially the same information, i.e. $\tilde{\theta}$, but through different distance metrics. In addition, for small tolerance levels, these two distances converge to zero, and both methods give a posterior distribution conditional on the same statistic $\tilde{\theta}$. Indeed, let $\tilde{\theta}$ be the summary statistic of the ABC posterior, let the corresponding tolerance threshold $\epsilon$ be sufficiently small, and consider a random draw $\theta^{*}$ and its corresponding simulated summary statistic $\tilde{\theta}^{*}$ obtained with the ABC algorithm. Then, by construction, $\tilde{\theta}^{*}$ will be close to $\tilde{\theta}$. This implies that $\tilde{\eta}_{R}(y^{*})=\eta_{R}(y^{*};\tilde{\theta})$ will also be close to $\eta_{R}(y^{*};\tilde{\theta}^{*})=0$, and hence $\theta^{*}$ is also a sample from the ABC-R posterior which uses the summary statistic $\tilde{\eta}_{R}$.

Nevertheless, the use of $\tilde{\theta}$ as summary statistic requires the solution of $\Psi_{\theta}=0$ at each iteration of the algorithm, which could be computationally cumbersome. In contrast, the proposed approach shares the invariance properties stated by Ruli et al. (2016), i.e. invariance with respect to both monotonic transformations of the data and reparameterisations, and has the further advantage of avoiding computational problems related to the repeated evaluation of $\Psi_{\theta}$, as shown by the following lemma.

Lemma 3.1

The ABC-R algorithm does not require repeated evaluations of the consistency correction $c(\theta)$ involved in $\Psi_{\theta}$ , as given by (1).

Proof

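Let $\tilde{\theta}$ be the solution of $\Psi_{\theta}=0$, with $\Psi_{\theta}$ of the form (1). Then, for a given simulated $y^{*}$ from $F_{\theta^{*}}$, we have

$$\tilde{\eta}_{R}(y^{*})=B_{R}(\tilde{\theta})^{-1}\left(\Psi(y^{*};\tilde{\theta})-\Psi(y;\tilde{\theta})\right)=B_{R}(\tilde{\theta})^{-1}\sum_{i=1}^{n}\left(\psi(y_{i}^{*};\tilde{\theta})-\psi(y_{i};\tilde{\theta})\right).$$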

This implies that $c(\theta)$ is computed only once, at $\tilde{\theta}$ .
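The cancellation behind this argument can be written out explicitly. The display below is a sketch of the intermediate step, using only the form (1) of $\Psi_{\theta}$ and the fact that $\Psi(y;\tilde{\theta})=0$ at the observed sample:

```latex
\begin{align*}
\Psi(y^{*};\tilde{\theta})
  &= \Psi(y^{*};\tilde{\theta}) - \Psi(y;\tilde{\theta})
     && \text{since } \Psi(y;\tilde{\theta}) = 0 \\
  &= \Big\{ \sum_{i=1}^{n} \psi(y_{i}^{*};\tilde{\theta}) - c(\tilde{\theta}) \Big\}
     - \Big\{ \sum_{i=1}^{n} \psi(y_{i};\tilde{\theta}) - c(\tilde{\theta}) \Big\} \\
  &= \sum_{i=1}^{n} \big\{ \psi(y_{i}^{*};\tilde{\theta}) - \psi(y_{i};\tilde{\theta}) \big\}.
\end{align*}
```

The two copies of $c(\tilde{\theta})$ cancel, so each new simulated sample $y^{*}$ only requires evaluating $\psi$ at the fixed value $\tilde{\theta}$.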

Theorem 3.1 below shows that the proposed method gives a robust approximate posterior distribution with the correct curvature, even though $\Psi_{\theta}$ , unlike the full score function, does not satisfy the information identity. Here, correct curvature means that asymptotically the robust posterior distribution and its normal approximation have the same covariance matrix, which is the inverse of the Godambe information, i.e. $K(\theta)$ .

Theorem 3.1

The ABC-R algorithm with rescaled M-estimating function $\tilde{\eta}_{R}(y)$ as summary statistic, as $h\to 0$ , leads to an approximate posterior distribution with the correct curvature and is also invariant to reparameterisations.

Proof

The proof follows from Theorem 3.2 of Ruli et al. (2016), by substituting the composite estimating equation with the more general M-estimating function $\Psi_{\theta}$ .

The ABC-R algorithm thus delivers a robust approximate posterior distribution which does not need calibration. In contrast, a calibration is typically required for (3).

Theorem 3.2 below shows that the proposed ABC posterior distribution is asymptotically normal.

Theorem 3.2

Assume that the regularity assumptions of Soubeyrand and Haon-Lasportes (2015) and the usual regularity conditions on $M$-estimators (Huber and Ronchetti, 2009, Chap. 4) are satisfied. Then, for $n\to\infty$ and $h\to 0$, the posterior $\pi_{R}^{ABC}(\theta|\tilde{\theta})$ is asymptotically equivalent to the density of the normal distribution with mean vector $\tilde{\theta}$ and covariance matrix $K(\tilde{\theta})$:

$$\pi_{R}^{ABC}(\theta\mid\tilde{\theta})\mathrel{\dot{\sim}}N_{d}(\tilde{\theta},K(\tilde{\theta})).$$

Proof

The proof follows from Lemma 2 and Theorem 1 in Soubeyrand and Haon-Lasportes (2015) and from the asymptotic relation between the Wald-type statistic and the score-type statistic, i.e.

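$$\eta_{R}(y;\theta)^{\top}\eta_{R}(y;\theta)=\Psi_{\theta}^{\top}J(\theta)^{-1}\Psi_{\theta}=(\tilde{\theta}-\theta)^{\top}K(\theta)^{-1}(\tilde{\theta}-\theta)+o_{p}(1).$$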
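The link between the score-type and the Wald-type statistic can be sketched with the standard first-order argument; this is an outline under the regularity conditions assumed above, not a reproduction of the cited proofs. Expanding $\Psi_{\tilde{\theta}}=0$ around $\theta$ and replacing the derivative of $\Psi_{\theta}$ by its expectation $-H(\theta)$ gives:

```latex
\begin{align*}
0 = \Psi_{\tilde{\theta}}
  &\approx \Psi_{\theta} - H(\theta)(\tilde{\theta} - \theta)
  \quad\Longrightarrow\quad
  \Psi_{\theta} \approx H(\theta)(\tilde{\theta} - \theta), \\
\eta_{R}(y;\theta)^{\top} \eta_{R}(y;\theta)
  &= \Psi_{\theta}^{\top} \big( B_{R}(\theta) B_{R}(\theta)^{\top} \big)^{-1} \Psi_{\theta}
   = \Psi_{\theta}^{\top} J(\theta)^{-1} \Psi_{\theta}, \\
\Psi_{\theta}^{\top} J(\theta)^{-1} \Psi_{\theta}
  &\approx (\tilde{\theta} - \theta)^{\top} H(\theta)^{\top} J(\theta)^{-1} H(\theta) (\tilde{\theta} - \theta)
   = (\tilde{\theta} - \theta)^{\top} K(\theta)^{-1} (\tilde{\theta} - \theta),
\end{align*}
```

where the last equality uses $K(\theta)^{-1}=\{H(\theta)^{-1}J(\theta)H(\theta)^{-\top}\}^{-1}=H(\theta)^{\top}J(\theta)^{-1}H(\theta)$.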

If $\psi(y;\theta)$ is bounded in $y$ , i.e. if the estimator $\tilde{\theta}$ is B-robust, then the ABC-R posterior is resistant with respect to slight violations of model assumptions. More precisely, the following theorem shows that the ABC-R posterior inherits the robustness properties of the estimating equation.

Theorem 3.3

If $\psi(y;\theta)$ is bounded in $y$, i.e. if the estimator $\tilde{\theta}$ is B-robust, then asymptotically the posterior mode, as well as other posterior summaries of $\pi_{R}^{ABC}(\theta|\tilde{\theta})$, have bounded IF.

Proof

From Theorem 3.2, the asymptotic posterior mode of $\pi_{R}^{ABC}(\theta|\tilde{\theta})$ is $\tilde{\theta}$, which is B-robust. Moreover, following results in Greco et al. (2008), it can be shown that asymptotic posterior summaries have bounded IF if and only if the posterior mode has bounded IF.

Example. We consider an illustrative example in which we numerically compare the ABC-R posterior with the classical posterior based on the assumed model and with the pseudo-posterior (3) based on the empirical likelihood (Lazar, 2003; Greco et al., 2008). Scenarios with data simulated either from the assumed model or from a slightly misspecified model are considered.

Let $F_{\theta}$ be a location-scale distribution with location $\mu$ and scale $\sigma>0$, and let $\theta=(\mu,\sigma)$. Huber’s estimating function is a standard choice for robust estimation of location and scale parameters. The $M$-estimating function is $\Psi_{\theta}=(\Psi_{\mu},\Psi_{\sigma})$, with

$$\Psi_{\mu}=\sum_{i=1}^{n}\psi_{c_{1}}(z_{i})\quad\text{and}\quad\Psi_{\sigma}=\sum_{i=1}^{n}\left(\psi_{c_{2}}(z_{i})^{2}-k(c_{2})\right),$$

where $z_{i}=(y_{i}-\mu)/\sigma$, $i=1,\ldots,n$, $\psi_{c}(z)=\max[-c,\min(c,z)]$ is the Huber $\psi$-function, $c>0$ is a scalar tuning constant which controls the desired degree of robustness of $\tilde{\theta}$, and $k(\cdot)$ is a consistency correction term. Let $F_{\theta}$ be the normal distribution $N(\mu,\sigma^{2})$ and assume $\mu$ and $\sigma$ a priori independent with $\mu\sim N(0,10^{2})$ and $\sigma\sim\text{halfCauchy}(5)$, where $\text{halfCauchy}(a)$ is the half-Cauchy distribution with scale parameter equal to $a$. We consider random samples of sizes $n\in\{15,30\}$ drawn either from the normal distribution with $\theta=(0,1)$ or from the contaminated model $(1-\delta)N(0,1)+\delta N(0,\sigma_{1}^{2})$, with $\sigma_{1}^{2}>0$. We set the contamination level equal to 10%, i.e. $\delta=0.1$, and $\sigma_{1}^{2}=10$. Moreover, we fix $c_{1}=1.345$ and $c_{2}=2.07$, which imply that $\tilde{\mu}$ and $\tilde{\sigma}$ are, respectively, 5% and 10% less efficient than the corresponding MLE under the assumed model (see Huber and Ronchetti, 2009, Chap. 6).

The genuine posterior, i.e. the posterior based on the likelihood function of the normal model, and the pseudo-posterior (3) based on the empirical likelihood (EL) are computed by numerical integration. The ABC-R posterior is obtained using Algorithm 1. From the posterior distributions illustrated in Figure 1 we note that, when the data come from the central model (panels (a)-(b)), i.e. for $\delta=0$, all the posteriors are in reasonable agreement, even if the EL posterior behaves slightly worse, especially the marginal posterior of $\sigma$ with $n=15$. When the data are contaminated (panels (c)-(d)), the genuine posterior is less trustworthy, as the bulk of the posterior drifts away from the true parameter value (vertical and horizontal straight lines). This is not the case, however, for the ABC-R posterior, which remains centred around the true parameter value. We note that in the contaminated case the ABC-R posterior is the one with the smallest variability. This is because the ABC-R posterior is not affected by the very outlying observations coming from the contamination component.

To highlight the robustness properties of the ABC-R posterior, we consider a sensitivity analysis. A sample $y$ of size $n=31$ is taken from the central model and the aforementioned posteriors are computed from the contaminated data $y^{w}$ given by the original data with the median observation $y_{(n+1)/2}$ replaced by $y_{(n+1)/2}+w$, where $w$ is a contamination scalar with possible values $\{-15,-14,\ldots,15\}$. The results of the sensitivity analysis, illustrated by means of violin plots in Figure 2, highlight that the posterior median of the genuine posterior (panel (c)) is substantially driven by $w$. On the other hand, the ABC-R and EL posteriors are robust. For all posteriors, the behaviour of the posterior median reflects the behaviour of the IF of the posterior mode. Furthermore, the variability of all posteriors is comparable for values of $w$ close to 0. More generally, these plots confirm that the genuine and EL posteriors under contamination are much more dispersed than the ABC-R posterior.
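The Huber location-scale estimating function can be sketched in a few lines. The code below is illustrative only: it assumes that $k(c)=E\{\psi_{c}(Z)^{2}\}$ with $Z\sim N(0,1)$, which is the usual choice for centring $\Psi_{\sigma}$ at the normal model but is not spelled out in the text, and it solves $\Psi_{\theta}=0$ with a simple fixed-point iteration started from the median and the rescaled MAD. The function names are hypothetical.

```python
import math
import numpy as np

def huber_psi(z, c):
    """Huber psi-function: max(-c, min(c, z))."""
    return np.clip(z, -c, c)

def k_const(c):
    """Assumed correction: E[psi_c(Z)^2] for Z ~ N(0, 1), in closed form."""
    Phi = 0.5 * (1.0 + math.erf(c / math.sqrt(2.0)))
    phi = math.exp(-0.5 * c * c) / math.sqrt(2.0 * math.pi)
    return (2.0 * Phi - 1.0) - 2.0 * c * phi + 2.0 * c * c * (1.0 - Phi)

def huber_estimating_function(y, mu, sigma, c1=1.345, c2=2.07):
    """Return (Psi_mu, Psi_sigma) evaluated at (mu, sigma)."""
    z = (y - mu) / sigma
    psi_mu = np.sum(huber_psi(z, c1))
    psi_sigma = np.sum(huber_psi(z, c2) ** 2 - k_const(c2))
    return np.array([psi_mu, psi_sigma])

def huber_m_estimate(y, c1=1.345, c2=2.07, max_iter=500, tol=1e-12):
    """Fixed-point iteration for the root of the Huber estimating function."""
    mu = float(np.median(y))
    sigma = 1.4826 * float(np.median(np.abs(y - mu)))  # rescaled MAD start
    k2 = k_const(c2)
    for _ in range(max_iter):
        z = (y - mu) / sigma
        mu_new = mu + sigma * np.mean(huber_psi(z, c1))
        sigma_new = sigma * math.sqrt(np.mean(huber_psi(z, c2) ** 2) / k2)
        converged = abs(mu_new - mu) + abs(sigma_new - sigma) < tol
        mu, sigma = mu_new, sigma_new
        if converged:
            break
    return mu, sigma

# Usage on a simulated sample of size 30 from the central model N(0, 1).
rng = np.random.default_rng(0)
y = rng.normal(0.0, 1.0, 30)
mu_tilde, sigma_tilde = huber_m_estimate(y)
residual = huber_estimating_function(y, mu_tilde, sigma_tilde)
```

At convergence both components of `residual` are numerically zero, which is the defining property of the $M$-estimate.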
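The contamination scheme of this sensitivity analysis is easy to mimic at the level of point estimates. The sketch below is a simplified analogue, not the posterior computation reported in the paper: it shifts the median observation by $w$ and compares the sample mean with a Huber location estimate whose scale is fixed at the rescaled MAD (an assumption made here to keep the example short). The seed and the helper name `huber_location` are hypothetical.

```python
import numpy as np

def huber_location(y, scale, c=1.345, tol=1e-10):
    """Root of sum(psi_c((y_i - mu) / scale)) = 0, found by bisection."""
    lo, hi = float(np.min(y)), float(np.max(y))
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if np.sum(np.clip((y - mid) / scale, -c, c)) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

rng = np.random.default_rng(2019)
y = np.sort(rng.normal(0.0, 1.0, 31))
mid_idx = (len(y) - 1) // 2          # position of the median observation
shifts = np.arange(-15, 16)          # contamination values w

mean_est, huber_est = [], []
for w in shifts:
    y_w = y.copy()
    y_w[mid_idx] += w                # replace the median observation by y + w
    mad = 1.4826 * np.median(np.abs(y_w - np.median(y_w)))
    mean_est.append(np.mean(y_w))
    huber_est.append(huber_location(y_w, mad))
mean_est = np.array(mean_est)
huber_est = np.array(huber_est)
```

The sample mean moves linearly in $w$ (by $w/n$), whereas the Huber estimate stops changing once the shifted point lies beyond the clipping threshold, which is the bounded-influence behaviour described above.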

4 Application to linear mixed models

Linear mixed models (LMM) are a popular choice when analysing data in the context of hierarchical, longitudinal or repeated measures. A general formulation is

$$y=X\alpha+\sum_{i=1}^{c-1}Z_{i}\beta_{i}+\varepsilon,\qquad(8)$$

where $y$ is an $n$-dimensional vector of response observations, $X$ and $Z_{i}$ are known $n\times q$ and $n\times p_{i}$ design matrices, $\alpha$ is a $q$-vector of unknown fixed effects, the $\beta_{i}$ are $p_{i}$-vectors of unobserved random effects $(1\leq i\leq c-1)$ and $\varepsilon$ is a vector of unobserved errors. The $p_{i}$ levels of each random effect $\beta_{i}$ are assumed to be independent with mean zero and variance $\sigma_{i}^{2}$. Moreover, each random error $\varepsilon_{i}$ is assumed to be independent with mean zero and variance $\sigma_{c}^{2}$, and $\beta_{1},\ldots,\beta_{c-1}$ and $\varepsilon$ are assumed to be independent.

Here we focus on the classical normal LMM, which assumes that $\varepsilon\sim N_{n}(0_{n},\sigma_{c}^{2}I_{n})$ and $\beta_{i}\sim N(0,\sigma_{i}^{2})$, $i=1,\dots,c-1$. For a normal LMM, it follows that $Y$ is multivariate normal with $E(Y)=X\alpha$ and $\text{var}(Y)=V=\sum_{i=1}^{c}\sigma_{i}^{2}Z_{i}Z_{i}^{\top}$, where $Z_{c}=I_{n}$. We assume that the set of $d=q+c$ unknown parameters $\theta=(\alpha,\sigma^{2})=(\alpha,\sigma_{1}^{2},\ldots,\sigma_{c}^{2})$ is identifiable. The validity and performance of this LMM require strict adherence to the assumed model, which is usually chosen because it simplifies the analyses and not because it fits the data at hand exactly. The robust procedure discussed in this paper specifically takes into account the fact that the normal model is only approximate, and thus produces statistical analyses that are stable with respect to outliers, deviations from the model or model misspecifications.
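To make the covariance structure concrete, the following sketch builds $V=\sum_{i}\sigma_{i}^{2}Z_{i}Z_{i}^{\top}$ for a toy design with one random intercept per unit. The helper `lmm_covariance`, the dimensions and the variance values are hypothetical and serve only as an illustration.

```python
import numpy as np

def lmm_covariance(Z_list, sigma_sq):
    """V = sum_i sigma_i^2 Z_i Z_i^T, with the identity as the last Z for the errors."""
    return sum(s2 * Z @ Z.T for s2, Z in zip(sigma_sq, Z_list))

g, q = 3, 2                                   # 3 units, 2 observations per unit
Z1 = np.kron(np.eye(g), np.ones((q, 1)))      # n x g indicator of unit membership
Zc = np.eye(g * q)                            # Z_c = I_n for the error term
V = lmm_covariance([Z1, Zc], [4.0, 1.0])      # sigma_1^2 = 4, sigma_c^2 = 1
```

With nested random effects `V` is block diagonal, one block per unit, which is what allows the observations to be split into independent groups.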

Although the $n$ observations $y$ are not independent, if the random effects are nested, then independent subgroups of observations can be found. Indeed, in many situations, $y$ can be split into $g$ independent groups of observations $y_{j}$ , $j=1,\ldots,g$ , and the log-likelihood is

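$$\ell(\theta)=\log L(\theta)=-\frac{1}{2}\sum_{j=1}^{g}\left\{\log|V_{j}|+(y_{j}-X_{j}\alpha)^{\top}V_{j}^{-1}(y_{j}-X_{j}\alpha)\right\},\qquad(9)$$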

where $(y_{1},\ldots,y_{g})$ and $X$ and $V$ are partitioned accordingly. Classical Bayesian inference for $\theta$ is based on $\pi(\theta|y)\propto L(\theta)\,\pi(\theta)$ , where $\pi(\theta)$ is a prior distribution for $\theta$ . However, (9) can be very sensitive to model deviations (Richardson and Welsh, 1995, Richardson, 1997, Copt and Victoria-Feser, 2006); see also results of the simulation study in Section 4.1.

In the frequentist literature, there are two broad classes of estimators for robust estimation of Gaussian LMM: $M$-estimators (see, e.g., Richardson and Welsh, 1995, Richardson, 1997, and references therein) and $S$-estimators (Copt and Victoria-Feser, 2006). The latter are generally available for balanced designs, whereas the former can be applied to a wide variety of situations; for instance, they can deal with unbalanced designs and with robustness with respect to the design matrix (Richardson, 1997). In this work we focus on $M$-estimators, but it is worth stressing that the idea can be applied to $S$-estimators as well. Following Richardson and Welsh (1995), we focus on the system of $M$-estimating equations

$$X^{\top}V^{-1/2}\psi_{c_{1}}(r)=0,\qquad(10)$$

$$\psi_{c_{2}}(r)^{\top}V^{-1/2}Z_{i}Z_{i}^{\top}V^{-1/2}\psi_{c_{2}}(r)-\text{tr}(CPZ_{i}Z_{i}^{\top})=0,\quad i=1,\dots,c,\qquad(11)$$

where $r=V^{-1/2}(y-X\alpha)$ is the vector of scaled marginal residuals, $C=E_{\theta}\left[\psi_{c_{2}}(R)\psi_{c_{2}}(R)^{\top}\right]$, with $R=V^{-1/2}(Y-X\alpha)$, $P=V^{-1}-V^{-1}X(X^{\top}V^{-1}X)^{-1}X^{\top}V^{-1}$ and $\text{tr}(\cdot)$ is the trace operator. The term $\text{tr}(CPZ_{i}Z_{i}^{\top})$ is a correction factor needed to ensure consistency at the Gaussian model for each $i=1,\ldots,c$. Equations (10)-(11) are called robust REML II estimating equations and are bounded versions of restricted likelihood equations. Richardson (1997) shows that the $M$-estimator based on (10)-(11) is asymptotically normal with mean equal to the true parameter $\theta$ and covariance matrix of the form (2). The ABC-R procedure in the normal LMM based on (10)-(11) will be studied by means of simulations in Section 4.1 and then applied to a dataset from a clinical study in Section 4.2.

4.1 Simulation study

Let us consider the two-component nested model

$$y_{ij}=\mu+\alpha_{j}+\beta_{i}+\varepsilon_{ij},\qquad(12)$$

where $\mu$ is the grand mean, $\alpha_{j}$ are the fixed effects, constrained such that $\sum_{j=1}^{q}\alpha_{j}=0$ , $\beta_{i}\sim N(0,\sigma_{1}^{2})$ are the random effects and $\varepsilon_{ij}\sim N(0,\sigma_{2}^{2})$ is the residual term, for $j=1,\ldots,q$ and $i=1,\ldots,g$ . Model (12) is a particular case of (8) with $c=2$ , a single random effect $\beta_{1}$ with $p_{1}=g$ levels and $Z_{1}$ the unit diagonal matrix. Moreover, the covariate is a categorical variable with $q$ levels; hence the design matrix is given by $q-1$ dummy variables.

We assess the properties of the proposed method via simulations with 500 Monte Carlo replications. For each Monte Carlo replication, the true values for $(\sigma_{1}^{2},\sigma^{2}_{2})$ and for $\alpha$ are drawn uniformly in $(1,10)\times(1,10)$ and $(-5,5)$, respectively. With these values, two datasets of size $g$ are generated: one from the central model and one from the contaminated model $(1-\delta)N(X_{i}^{\top}\alpha,V_{i})+\delta N(X_{i}^{\top}\alpha,15V_{i})$, where $X_{i}$ is the matrix of covariates for the $i$th unit, $\theta=(\alpha,\sigma^{2}_{1},\sigma_{2}^{2})$ and $\delta=0.10$. We consider $q\in\{3,5,7\}$ and $g\in\{30,50,70\}$. The prior distributions are $\alpha\sim N_{q}(0,10^{2}I_{q})$ and $(\sigma_{1}^{2},\sigma^{2}_{2})\sim\text{halfCauchy}(7)\times\text{halfCauchy}(7)$. For each scenario, we fit model (12) in the classical Bayesian way, using an adaptive random walk Metropolis-Hastings algorithm. The same model is fitted by the ABC-R method using the estimating equations (10)-(11). As in Richardson and Welsh (1995), we set $c_{1}=1.345$ and $c_{2}=2.07$ and we find $\tilde{\theta}$ by solving (10)-(11) iteratively until convergence. The classical REML estimate, computed by the function lmer of the lme4 package, is used as starting value. In our experiments, the convergence of the solution is quite rapid, i.e. $\tilde{\theta}$ stabilises within 10–15 iterations.
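The data-generating step of such a simulation can be sketched as follows. This is an illustrative reading of the contaminated model, assuming that each unit is contaminated independently with probability $\delta$ and that the within-unit covariance is $V_{i}=\sigma_{2}^{2}I_{q}+\sigma_{1}^{2}1_{q}1_{q}^{\top}$; the function `simulate_nested`, the per-unit mean vector passed as `mean` and the numerical values are hypothetical.

```python
import numpy as np

def simulate_nested(mean, sigma1_sq, sigma2_sq, g, delta=0.0, inflate=15.0, rng=None):
    """Draw g independent units of length q; a unit is contaminated with probability delta."""
    rng = np.random.default_rng() if rng is None else rng
    mean = np.asarray(mean, dtype=float)
    q = mean.size
    # Within-unit covariance: shared random intercept plus independent errors.
    V = sigma2_sq * np.eye(q) + sigma1_sq * np.ones((q, q))
    chol = np.linalg.cholesky(V)
    contaminated = rng.uniform(size=g) < delta
    # Contaminated units have covariance inflate * V, i.e. scale sqrt(inflate).
    scale = np.where(contaminated, np.sqrt(inflate), 1.0)
    noise = rng.standard_normal((g, q)) @ chol.T
    return mean + scale[:, None] * noise, contaminated

rng = np.random.default_rng(42)
mean = np.array([1.0, -2.0, 3.0])             # illustrative per-unit mean vector, q = 3
y_central, flags_central = simulate_nested(mean, 4.0, 2.0, g=50, rng=rng)
y_contam, flags_contam = simulate_nested(mean, 4.0, 2.0, g=50, delta=0.10, rng=rng)
```

Each row of the returned array is one unit, so the rows can be treated as the independent groups $y_{j}$ entering the log-likelihood.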

We assess the component-wise bias of the posterior median $\tilde{\theta}_{m}$ by the modulus of $\tilde{\theta}_{m}-\theta_{0}$ in logarithmic scale, where $\theta_{0}$ is the true value. Moreover, the efficiency of the classical Bayesian estimator relative to the ABC-R estimator is assessed through the index $MD_{MCMC}/MD_{ABC}$ , where $MD=\text{med}(|\tilde{\theta}_{m}-\theta_{0}|)$ ; see Richardson and Welsh (1995) and Copt and Victoria-Feser (2006). In addition, for each Monte Carlo replication we compute the Euclidean distance of $\tilde{\theta}_{m}$ from $\theta_{0}$ , which can be considered as a global measure of bias. Contrary to Richardson and Welsh (1995), we consider a different $\theta_{0}$ for each Monte Carlo replication. The bias and efficiency of the classical Bayesian posterior and of the ABC-R posterior for the 500 replications are illustrated in Figures 3 and 4, respectively.
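The two summaries used here are straightforward to compute from a matrix of posterior medians. The sketch below assumes that estimates and true values are stored as arrays with one row per Monte Carlo replication and one column per parameter; the function names and the toy numbers are hypothetical.

```python
import numpy as np

def md(estimates, truth):
    """Component-wise median absolute deviation of the estimates from the true values."""
    return np.median(np.abs(estimates - truth), axis=0)

def efficiency_index(est_mcmc, est_abc, truth):
    """MD_MCMC / MD_ABC for each component; values above 1 favour ABC-R."""
    return md(est_mcmc, truth) / md(est_abc, truth)

def global_bias(estimates, truth):
    """Euclidean distance between estimate and truth, one value per replication."""
    return np.linalg.norm(estimates - truth, axis=1)

# Toy example: 3 replications, 2 parameters, a different true value per replication.
truth = np.array([[0.0, 1.0], [1.0, 2.0], [2.0, 3.0]])
est_mcmc = truth + np.array([[2.0, -1.0], [-4.0, 2.0], [6.0, -3.0]])
est_abc = truth + np.array([[1.0, -1.0], [-2.0, 2.0], [3.0, -3.0]])
ratio = efficiency_index(est_mcmc, est_abc, truth)
distances = global_bias(est_abc, truth)
```

Because the true value changes across replications, the deviations are taken row by row before the median is computed.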

Under the central model, inference with the ABC-R and the classical Bayesian posteriors is roughly similar, i.e. both bias and efficiency compare equally well across the two methods. This holds both for the fixed effects $\alpha$ and for the variance components $(\sigma_{1}^{2},\sigma^{2}_{2})$. Under the contaminated model, we notice important differences between ABC-R and classical Bayesian estimation. In particular, $\tilde{\theta}_{m}$ based on ABC-R is less biased, both globally and on a component-by-component basis, and more efficient. The gain in efficiency is particularly evident for the variance components.

4.2 Effects of GRP94-based complexes on IL-10

The GRP94 dataset (Tramentozzi et al., 2016) concerns the measurement of glucose-regulated protein 94 (GRP94) in plasma or other biological fluids and the study of its role as a tumour antigen, i.e. its ability to alter the production of immunoglobulins (IgGs) and inflammatory cytokines in the peripheral blood mononuclear cells (PBMCs) of tumour patients. The study involved 27 patients admitted to the division of General Surgery of the Civil Hospital of Padova for ablation of primary, solid cancer of the gastro-intestinal tract. For each patient, gender, age (expressed in years), type and stage of tumour (ordinal scales of four levels) are given. Patients’ plasma and PBMCs were challenged with GRP94 complexes, and the levels of IgG and of the cytokines interferon $\gamma$ (IFN$\gamma$), interleukin 6 (IL-6), interleukin 10 (IL-10) and tumour necrosis factor $\alpha$ (TNF$\alpha$) were measured. Owing to time and cost constraints, for patient IDs 17, 27 and 28 only IgG was measured. The following five treatments were considered: GRP94 at the dose of either 10 ng/ml or 100 ng/ml, GRP94 in complex with IgG (GRP94+IgG) at the doses 10 ng/ml or 100 ng/ml, and IgG at the dose 100 ng/ml. Finally, baseline measurements of IgG and of the aforementioned cytokines were taken from untreated PBMCs. Although fresh plasma and PBMCs are taken for each treatment and patient, the resulting measures are likely to be correlated, since the plasma and PBMCs come from the same patient. Hence, a LMM can be suitable for these data. Using paired Mann-Whitney tests, Tramentozzi et al. (2016) show that GRP94 in complex with IgG at the higher dose can significantly inhibit the production of IgG, whereas GRP94 at both doses can stimulate the secretion of IL-6 and TNF$\alpha$ from PBMCs of cancer patients. In addition, some of the differences between treatments were significant for a specific gender; see Tramentozzi et al. (2016) for full details.

A feature of these data is the presence of extreme observations, both in the baseline and in the challenged PBMC-based measurements, as can be seen from the strip plots in Figure 5. Such extreme observations induce high variability in the response measurements, especially for IFN$\gamma$, IL-6, IL-10 and TNF$\alpha$. Hence, one must be cautious when fitting a LMM to such data.

We fit the two-component nested LMM (12) to the IL-10 measurements with ABC-R, using estimating equations (10)-(11). Since all measures are positive and some of them are highly skewed, a logarithmic transformation is used to alleviate the skewness. Furthermore, since Tramentozzi et al. (2016) highlight a possible gender effect (especially with respect to the cytokines), we also check for gender effects by including an interaction with gender. The model with interaction is

[TABLE]

where $w_{i}$ is a dummy variable for gender, $\gamma$ is the fixed effect of the treatment-gender interaction, and $1_{6}$ is the vector of ones of dimension 6. The interaction model (13) has $12$ unknown fixed effects $(\alpha,\gamma)$.

As in this case there is no extra-experimental information, we assume vague priors. In particular, $\alpha_{j}\sim N(0,100)$ and $\gamma_{j}\sim N(0,100)$ , for $j=1,\ldots,6$ . For the variance components, following Gelman (2006), we assume $\sigma_{1}^{2}\sim\text{halfCauchy}(7)$ and $\sigma_{2}^{2}\sim\text{halfCauchy}(7)$ in both models. However, we note that one of the features of the proposed method is the simultaneous ability to have robustness to possible model misspecification and to include prior information on model parameters, if available.

ABC-R posterior samples are drawn using Algorithm 1. For comparison purposes, we also fit a classical Bayesian LMM with the aforementioned priors, sampling from its posterior with an adaptive random walk Metropolis-Hastings algorithm. Figure 6 compares the ABC-R and the classical posteriors for a subset of the fixed effects of models (12) and (13) by means of kernel density estimates. The parameters shown are those referring to the treatments based on GRP94 at the dose of 10 ng/ml (GRP94_10), GRP94 at the dose of 100 ng/ml (GRP94_100) and GRP94 in complex with IgG at the dose of 100 ng/ml (GRP94+IgG_100), which according to Tramentozzi et al. (2016) are the most prominent. The first row (d1) illustrates the marginal posteriors of the parameters of (12) (with baseline being the reference category). The second row (d2) shows the marginal posteriors of the parameters of (13) (with baseline and female being the reference categories). The numbers in parentheses in the plot subtitles give the evidence in favour of the null hypothesis $H_{0}$ that the parameter is equal to zero, computed under the Full Bayesian Significance Testing (FBST) setting of Pereira et al. (2008); within the parentheses, the first value refers to the ABC-R posterior and the last to the classical posterior.

The FBST evidence in favour of $H_{0}$ was proposed by Pereira and Stern (1999) as an intuitive measure of evidence, defined as the posterior probability of the points of the parameter space that are less probable than the null. Large values favour $H_{0}$. The measure is based on a specific loss function, so the decision made under this procedure is the action that minimises the corresponding posterior risk (Pereira et al., 2008). The FBST avoids a drawback of the usual Bayesian testing procedure based on the Bayes factor (BF): when the null hypothesis is precise and improper or vague priors are assumed, the BF can be undetermined and can lead to the so-called Jeffreys-Lindley paradox.
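For a scalar parameter and a point null, this evidence can be approximated from posterior draws as the share of draws whose posterior density does not exceed the density at the null value. The Python sketch below is only an illustration under our own assumptions: a Gaussian kernel density estimate with a rule-of-thumb bandwidth stands in for the posterior density, and the function names are hypothetical. It is not the computation used for Figure 6.

```python
import math
import random
import statistics


def gaussian_kde(samples):
    """Return a Gaussian kernel density estimate built from the samples."""
    n = len(samples)
    bandwidth = 1.06 * statistics.stdev(samples) * n ** (-0.2)  # rule of thumb
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


def fbst_evidence(samples, null_value=0.0):
    """Estimate ev(H0) = 1 - P(density(theta) > density(null_value) | y)."""
    density = gaussian_kde(samples)
    d0 = density(null_value)
    # Share of draws whose estimated density does not exceed that at the null.
    return sum(density(s) <= d0 for s in samples) / len(samples)


rng = random.Random(0)
near_zero = [rng.gauss(0.0, 1.0) for _ in range(400)]  # posterior centred at H0
far_away = [rng.gauss(5.0, 1.0) for _ in range(400)]  # posterior far from H0
print(fbst_evidence(near_zero), fbst_evidence(far_away))
```

A posterior concentrated around the null gives evidence close to one, while a posterior located far from it gives evidence close to zero.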

Under the classical Bayesian LMM, there is a high posterior probability that the effect of GRP94_100, with or without the interaction with gender, differs from the baseline, since the evidence in favour of $H_{0}$ is rather low. However, such effects vanish under the robust ABC-R procedure. This indicates that the classical LMM posterior for log IL-10 is likely to be driven by a few extreme observations.

5 Discussion

Currently, the only available approach for obtaining posterior distributions explicitly using robust unbiased estimating functions is through pseudo-likelihood methods such as the empirical or the quasi-likelihood (Greco et al., 2008). Bissiri et al. (2016) show how robust posterior distributions can be based on generic loss functions, which in some special cases are derived from robust estimating equations. In this work, we present an alternative approach that directly incorporates robust estimating functions into approximate Bayesian computation techniques. Compared with the available approaches based on pseudo-likelihoods, our method can be computationally faster when the evaluation of the estimating function is expensive.

Motivated by the GRP94 dataset, we focused on two-component nested LMMs, but more complex models can be fitted since the estimating equations (10)-(11) are very general (see Richardson, 1997). For instance, it is possible to deal with models with multiple random effects or even with robustness with respect to the design matrix. An R implementation of the proposed method is provided in the robustBLME package (Ruli et al., 2018).

The proposed method can be applied to any unbiased robust estimating equations, such as $S$-estimating equations. The study of the proposed approach with $S$-estimating equations is left for future work.

From a practical perspective, we recommend fitting both the classical and the robust LMM and comparing their posteriors, say by FBST. If the differences are mild, then the posterior is probably not affected by outliers and the classical LMM can be safely used. If instead there are important differences between them, it is likely that the classical LMM posterior is driven by outliers, and the robust posterior is the safer choice.

Acknowledgements

This work was partially supported by University of Padova (Progetti di Ricerca di Ateneo 2015, CPDA153257) and by the Italian Ministry of Education under the PRIN 2015 grant (2015EASZFS_003).

Appendix: Computational details

Provided simulation from $F_{\theta}$ is fast, the main demanding requirement of the proposed method is essentially the computation of the observed $\tilde{\theta}$ and the scaling matrix $B_{R}(\theta)$ evaluated at $\tilde{\theta}$ . Given that, for large sample sizes,

$$\eta_{R}(y;\theta)\;\dot{\sim}\;N_{d}(0_{d},I_{d}),$$

where $0_{d}$ is a $d$ -vector of zeros and $I_{d}$ is the identity matrix of order $d$ , it is reasonable to replace $K_{h}(\cdot)$ with the multivariate normal density centred at zero and with covariance matrix $hI_{d}$ . In order to choose the bandwidth $h$ we consider several pilot runs of the ABC-R algorithm for a grid of $h$ values, and select the value of $h$ that delivers approximately 0.1% acceptance ratio (as done, for instance, by Fearnhead and Prangle, 2012).
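For concreteness, the normal kernel described above, i.e. the $N_{d}(0_{d},hI_{d})$ density evaluated at a $d$-dimensional summary $\eta$, can be written out as follows. This is the standard multivariate normal density, stated here only for reference; note that with this parametrisation $h$ plays the role of a variance rather than a standard deviation.

$$K_{h}(\eta)=(2\pi h)^{-d/2}\exp\left\{-\frac{\eta^{\sf T}\eta}{2h}\right\},\qquad \log K_{h}(\eta)=-\frac{d}{2}\log(2\pi h)-\frac{\eta^{\sf T}\eta}{2h}.$$

Smaller values of $h$ concentrate the kernel around zero, which lowers the acceptance ratio and tightens the approximation.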

Contrary to other ABC-MCMC algorithms in which the proposal requires pilot runs (see Cabras et al., 2015, for building proposal distributions in ABC-MCMC), in our case a scaling matrix for the proposal $q(\cdot|\cdot)$ can be readily built, almost effortlessly, by using the usual sandwich formula (2) evaluated at $\tilde{\theta}$ (see also Ruli et al., 2016). Even in cases in which $H(\theta)$ and $J(\theta)$ are not analytically available, they can be straightforwardly estimated via simulation. Indeed, in our experience, 100-500 samples from the model $F_{\tilde{\theta}}$ give estimates with reasonably low Monte Carlo variability (see also Cattelan and Sartori, 2015). Throughout the examples considered, we use the multivariate $t$-density with 5 degrees of freedom as the proposal density $q(\cdot|\cdot)$, and the ABC-R algorithm is always started from $\tilde{\theta}$. In the ABC algorithm, we fix the tolerance threshold so as to give a pre-specified but small acceptance ratio, as is frequently done in the ABC literature. In our experiments we found that an acceptance ratio of 0.1% gives satisfactory results.
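The simulation-based estimate of the sandwich quantities can be sketched in Python on a toy problem. The setting is our own assumption and is much simpler than the LMM: a scalar location $\theta$ with $y_{i}\sim N(\theta,1)$ and $\Psi(y;\theta)=\sum_{i}\psi_{c}(y_{i}-\theta)$, with Huber's $\psi_{c}$. We take $H$ as the expected negative derivative of $\Psi$ and $J$ as its variance, so that $K=J/H^{2}$ in the scalar case. The function names are hypothetical.

```python
import random

C_HUBER = 1.345  # tuning constant, the value of c_1 quoted in the text


def huber_psi(r, c=C_HUBER):
    """Huber's psi function: identity on [-c, c], clipped outside."""
    return max(-c, min(c, r))


def simulated_sandwich(theta, n, n_sim, rng, c=C_HUBER):
    """Monte Carlo estimates of H, J and K = J / H**2 for a scalar location.

    Toy assumption: y_i ~ N(theta, 1), Psi(y; theta) = sum_i psi_c(y_i - theta).
    """
    h_sum = 0.0
    j_sum = 0.0
    for _ in range(n_sim):
        residuals = [rng.gauss(theta, 1.0) - theta for _ in range(n)]
        psi_total = sum(huber_psi(r, c) for r in residuals)
        # -dPsi/dtheta counts the residuals that are not clipped.
        h_sum += sum(1 for r in residuals if abs(r) <= c)
        j_sum += psi_total ** 2  # Psi has mean zero under the model
    h_hat = h_sum / n_sim
    j_hat = j_sum / n_sim
    return h_hat, j_hat, j_hat / h_hat ** 2


rng = random.Random(2019)
h_hat, j_hat, k_hat = simulated_sandwich(theta=0.0, n=30, n_sim=500, rng=rng)
print(h_hat, j_hat, k_hat)
```

In this toy case the estimated $K$ can then serve as the scale of the proposal, which avoids pilot runs for tuning it.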

Bibliography

Agostinelli, C. and Greco, L. (2013) A weighted strategy to handle likelihood uncertainty in Bayesian inference. Computational Statistics, 28, 319–339.
Andrade, J. A. A. and O’Hagan, A. (2006) Bayesian robustness modeling using regularly varying distributions. Bayesian Analysis, 1, 169–188.
Beaumont, M. A., Zhang, W. and Balding, D. J. (2002) Approximate Bayesian computation in population genetics. Genetics, 162, 2025–2035.
Bissiri, P. G., Holmes, C. C. and Walker, S. G. (2016) A general framework for updating belief distributions. Journal of the Royal Statistical Society: Series B, 78, 1103–1130.
Blum, M. G. B. (2010) Approximate Bayesian computation: a nonparametric perspective. Journal of the American Statistical Association, 105, 1178–1187.
Cabras, S., Castellanos Nueda, M. E. and Ruli, E. (2015) Approximate Bayesian computation by modelling summary statistics in a quasi-likelihood framework. Bayesian Analysis, 10, 411–439.
Cattelan, M. and Sartori, N. (2015) Empirical and simulated adjustments of composite likelihood ratio statistics. Journal of Statistical Computation and Simulation, 86, 1056–1067.
Copt, S. and Victoria-Feser, M. P. (2006) High-breakdown inference for mixed linear models. Journal of the American Statistical Association, 101, 292–300.

2 Background on robust $M$-estimating functions