Multilevel linear models, Gibbs samplers and multigrid decompositions

Giacomo Zanella; Gareth Roberts

arXiv:1703.06098·stat.CO·June 27, 2019


TL;DR

This paper analyzes the convergence of Gibbs samplers in multilevel Bayesian models, providing explicit formulas and guidelines for optimizing their implementation across various hierarchical structures.

Contribution

It introduces a multigrid approach to derive convergence rates for Gibbs samplers in complex multilevel models, extending analysis beyond two-level hierarchies.

Findings

1. Explicit convergence rate formulas for multilevel models

2. Guidelines for parametrization and identifiability in Gibbs sampling

3. Simulation results indicating broader applicability to non-Gaussian and gradient-based MCMC

Abstract

We study the convergence properties of the Gibbs Sampler in the context of posterior distributions arising from Bayesian analysis of conditionally Gaussian hierarchical models. We develop a multigrid approach to derive analytic expressions for the convergence rates of the algorithm for various widely used model structures, including nested and crossed random effects. Our results apply to multilevel models with an arbitrary number of layers in the hierarchy, while most previous work was limited to the two-level nested case. The theoretical results provide explicit and easy-to-implement guidelines to optimize practical implementations of the Gibbs Sampler, such as indications on which parametrization to choose (e.g. centred and non-centred), which constraint to impose to guarantee statistical identifiability, and which parameters to monitor in the diagnostic process. Simulations suggest…


Tables

Table 1: Optimal parametrization for 3-levels hierarchical models as a function of the normalized variance components.

	${\tilde{σ}}_{a}^{2} \geq {\tilde{σ}}_{b}^{2} + {\tilde{σ}}_{e}^{2}$	${\tilde{σ}}_{a}^{2} < {\tilde{σ}}_{b}^{2} + {\tilde{σ}}_{e}^{2}$
${\tilde{σ}}_{b}^{2} \geq {\tilde{σ}}_{e}^{2}$	$(μ, 𝜸, 𝜼)$	$(μ, a, 𝜼)$
${\tilde{σ}}_{b}^{2} < {\tilde{σ}}_{e}^{2}$	$(μ, 𝜸, b)$	$(μ, a, b)$

Table 2: Effective sample sizes for the standard unconstrained Gibbs Sampler for Model CkP, and for the two versions where identifiability is obtained by imposing the constraints $a^{(1)}_{1}=a^{(2)}_{1}=1$ and $\bar{a}^{(1)}=\bar{a}^{(2)}=1$, respectively. Effective sample sizes correspond to $10^{5}$ iterations of each algorithm, with the first half of the samples discarded as burn-in.

	$n_{1} = 5$ , $n_{2} = 5$			$n_{1} = 5$ , $n_{2} = 100$			$n_{1} = 100$ , $n_{2} = 100$
	$μ$	$a_{i_{1}}^{(1)}$	$a_{i_{2}}^{(2)}$	$μ$	$a_{i_{1}}^{(1)}$	$a_{i_{2}}^{(2)}$	$μ$	$a_{i_{1}}^{(1)}$	$a_{i_{2}}^{(2)}$
Unconstrained	273	287	268	273	314	280	287	351	255
$a_{1}^{(1)} = a_{1}^{(2)} = 1$	1483	2741	1656	506	33614	507	372	553	300
${\bar{a}}^{(1)} = {\bar{a}}^{(2)} = 1$	52405	52537	49847	46404	49634	50134	43869	50682	51259

Table 3: Comparison of HMC, NUTS and the Gibbs Sampler for Model CkP with and without linear constraints. The values of the effective sample sizes and the runtimes refer to a single run of each algorithm for $10^{4}$ iterations, with the first 2000 iterations discarded as burn-in.

	Effective Sample Size (ESS)			Runtime	ESS/time [1/s]
	$μ$	$a_{i_{1}}^{(1)}$	$a_{i_{2}}^{(2)}$		$μ$	$a_{i_{1}}^{(1)}$	$a_{i_{2}}^{(2)}$
HMC (unconstrained)	530	545	518	5697s	0.09	0.10	0.09
HMC ( $a_{1}^{(1)} = a_{1}^{(2)} = 1$ )	1661	1611	3907	1737s	0.96	0.93	2.2
HMC ( ${\bar{a}}^{(1)} = {\bar{a}}^{(2)} = 1$ )	1369	1263	1897	1459s	0.94	0.87	1.3
NUTS (unconstrained)	30	18	28	906s	0.03	0.02	0.03
NUTS ( $a_{1}^{(1)} = a_{1}^{(2)} = 1$ )	108	176	696	312s	0.35	0.57	2.2
NUTS ( ${\bar{a}}^{(1)} = {\bar{a}}^{(2)} = 1$ )	8000	19914	20062	134s	60	149	150
Gibbs (unconstrained)	21	27	4.3	0.91s	24	30	4.7
Gibbs ( $a_{1}^{(1)} = a_{1}^{(2)} = 1$ )	27	27	5.6	0.92s	29	30	6.1
Gibbs ( ${\bar{a}}^{(1)} = {\bar{a}}^{(2)} = 1$ )	8000	7992	8037	1.00s	8000	7992	8037

Equations

$\lim_{s\to\infty}\frac{\|P^{s}f-\mathbb{E}_{\pi}[f]\|_{L^{2}(\pi)}}{r^{s}}=0\quad\forall f\in L^{2}(\pi),$

$T=\min\{s\,;\,\|P^{s}f-\mathbb{E}_{\pi}[f]\|_{L^{2}(\pi)}\leq\epsilon\}$

$y_{ijk}=\mu+a_{i}+b_{ij}+\epsilon_{ijk},$

$y_{ijk}\sim N(\eta_{ij},\sigma_{e}^{2}),\quad\eta_{ij}\sim N(\gamma_{i},\sigma_{b}^{2}),\quad\gamma_{i}\sim N(\mu,\sigma_{a}^{2}),\quad p(\mu)\propto 1.$

$\delta({\bm{\beta}})=\left(\begin{array}[]{c}\delta^{(0)}{\bm{\beta}}\\ \delta^{(1)}{\bm{\beta}}\\ \delta^{(2)}{\bm{\beta}}\end{array}\right)=\left(\begin{array}[]{c}\delta^{(0)}{\bm{\beta}}^{(0)}\,,\,\delta^{(0)}{\bm{\beta}}^{(1)}\,,\,\delta^{(0)}{\bm{\beta}}^{(2)}\\ \delta^{(1)}{\bm{\beta}}^{(1)}\,,\,\delta^{(1)}{\bm{\beta}}^{(2)}\\ \delta^{(2)}{\bm{\beta}}^{(2)}\end{array}\right)\,,$

$\delta^{(0)}{\bm{\beta}}^{(0)}=\beta^{(0)},\quad\delta^{(0)}{\bm{\beta}}^{(1)}=\beta^{(1)}_{\cdot},\quad\delta^{(0)}{\bm{\beta}}^{(2)}=\beta^{(2)}_{\cdot\cdot},$

$\delta^{(1)}{\bm{\beta}}^{(1)}=(\beta^{(1)}_{1}-\beta^{(1)}_{\cdot},\dots,\beta^{(1)}_{I}-\beta^{(1)}_{\cdot}),\quad\delta^{(1)}{\bm{\beta}}^{(2)}=(\beta^{(2)}_{1\cdot}-\beta^{(2)}_{\cdot\cdot},\dots,\beta^{(2)}_{I\cdot}-\beta^{(2)}_{\cdot\cdot}),$

$\delta^{(2)}{\bm{\beta}}^{(2)}=(\beta^{(2)}_{11}-\beta^{(2)}_{1\cdot},\beta^{(2)}_{12}-\beta^{(2)}_{1\cdot},\dots,\beta^{(2)}_{I(J-1)}-\beta^{(2)}_{I\cdot},\beta^{(2)}_{IJ}-\beta^{(2)}_{I\cdot}),$

$\beta^{(1)}_{\cdot}=\frac{\sum_{i}\beta^{(1)}_{i}}{I},\quad\beta^{(2)}_{\cdot\cdot}=\frac{\sum_{i,j}\beta^{(2)}_{ij}}{IJ},\quad\beta^{(2)}_{i\cdot}=\frac{\sum_{j}\beta^{(2)}_{ij}}{J}.$

$\rho(\delta^{(0)}{\bm{\beta}}(s))\geq\rho(\delta^{(1)}{\bm{\beta}}(s))\geq\rho(\delta^{(2)}{\bm{\beta}}(s))=0.$

$\rho({\bm{\beta}}(s))=\rho(\delta^{(0)}{\bm{\beta}}(s)).$

$\rho_{00}=1-\frac{\tilde{\sigma}_{a}^{2}}{\tilde{\sigma}_{a}^{2}+\tilde{\sigma}_{b}^{2}}\,\frac{\tilde{\sigma}_{b}^{2}}{\tilde{\sigma}_{b}^{2}+\tilde{\sigma}_{e}^{2}},$

$\rho_{01}=1-\frac{\tilde{\sigma}_{a}^{2}}{\tilde{\sigma}_{a}^{2}+\tilde{\sigma}_{e}^{2}}\,\frac{\tilde{\sigma}_{e}^{2}}{\tilde{\sigma}_{b}^{2}+\tilde{\sigma}_{e}^{2}},$

$(\rho_{00},\rho_{11},\rho_{01},\rho_{10})=(0.995,0.998,0.007,0.999).$

$y_{i_{1}\dots i_{k}}=$

$\bar{a}^{(s)}$

$((\mu,a)(t))_{t=1}^{\infty}=(\mu(t),a^{(1)}(t),\dots,a^{(k)}(t))_{t=1}^{\infty}$

$\rho=\max_{s\in\{1,\dots,k\}}\frac{N\tau_{e}}{N\tau_{e}+n_{s}\tau_{s}}.$

$(\mu,\beta^{(1)},\beta^{(2)})=(\mu,a^{(1)}+(1-\lambda_{1})\mu,a^{(2)}+(1-\lambda_{2})\mu)$, for $(\lambda_{1},\lambda_{2})\in\{0,1\}^{2}.$

$\rho_{11}$

$\rho_{01}$

$\rho=\max_{s\in\{1,\dots,k\}}\left(\frac{N\tau_{e}(1-q_{s})}{N\tau_{e}+n_{s}\tau_{s}}\right),$

$\rho=\max_{s\in\{1,\dots,k\}}\left(\frac{N\tau_{e}}{N\tau_{e}+n_{s}\tau_{s}}\,\frac{n_{s}-1}{n_{s}}\right),$

$y_{i_{1}\dots i_{k}}\sim$

$\mu\mid y,\tilde{a}$

$\tilde{a}^{(s)}\mid y,\mu,\tilde{a}^{(-s)}$

$p(\mu)$

$\gamma_{i}$

$y_{ij}$

$\rho_{CP}=\frac{\tau_{a}}{\tau_{a}+\tilde{\tau}_{e}}$ and $\rho_{NCP}=\frac{\tilde{\tau}_{e}}{\tau_{a}+\tilde{\tau}_{e}},$

$\rho_{\lambda_{1}\dots\lambda_{I}}=\frac{\sum_{i:\lambda_{i}=1}\tilde{\tau}_{i}\frac{\tilde{\tau}_{i}}{\tilde{\tau}_{i}+\tau_{a}}+\sum_{i:\lambda_{i}=0}\tau_{a}\frac{\tau_{a}}{\tilde{\tau}_{i}+\tau_{a}}}{\sum_{i:\lambda_{i}=1}\tilde{\tau}_{i}+\sum_{i:\lambda_{i}=0}\tau_{a}},$

$\rho_{\bar{\lambda}_{1}\dots\bar{\lambda}_{I}}\leq\rho_{\lambda_{1}\dots\lambda_{I}}$ for any $(\lambda_{1}\dots\lambda_{I})\in\{0,1\}^{I}.$


Taxonomy

Topics: Statistical Methods and Bayesian Inference · Bayesian Methods and Mixture Models · Statistical Methods and Inference

Full text

Multilevel linear models, Gibbs samplers and multigrid decompositions

Giacomo Zanella (Department of Decision Sciences, BIDSA and IGIER, Bocconi University, via Roentgen 1, 20136 Milan, Italy)

and Gareth Roberts (Department of Statistics, University of Warwick, Coventry, CV4 7AL, UK)

Abstract

We study the convergence properties of the Gibbs Sampler in the context of posterior distributions arising from Bayesian analysis of conditionally Gaussian hierarchical models. We develop a multigrid approach to derive analytic expressions for the convergence rates of the algorithm for various widely used model structures, including nested and crossed random effects. Our results apply to multilevel models with an arbitrary number of layers in the hierarchy, while most previous work was limited to the two-level nested case. The theoretical results provide explicit and easy-to-implement guidelines to optimize practical implementations of the Gibbs Sampler, such as indications on which parametrization to choose (e.g. centred and non-centred), which constraint to impose to guarantee statistical identifiability, and which parameters to monitor in the diagnostic process. Simulations suggest that the results are informative also in the context of non-Gaussian distributions and more general MCMC schemes, such as gradient-based ones.

1 Introduction

Markov chain Monte Carlo (MCMC) is established as the computational workhorse of most Bayesian statistical analyses for complex models. For hierarchical models with conditionally conjugate priors, the Gibbs sampler (Gelfand and Smith, 1990; Smith and Roberts, 1993) remains one of the most natural algorithms of choice, thanks to its simplicity of implementation and low computational cost per iteration (a consequence of conjugacy and conditional independence). Nonetheless, the speed of convergence of the resulting Markov chain can be a major issue and can be highly sensitive to the model structure and to implementation details, such as the choice of parametrization (Hills and Smith, 1992; Gelfand et al., 1995) or of identifiability constraints (Vines et al., 1996; Gelfand and Sahu, 1999; Xie and Carlin, 2006). This work provides a contribution towards gaining a quantitative understanding of the interaction between Bayesian hierarchical structures and the behaviour of MCMC algorithms, which lies at the heart of the practical success of Bayesian statistics.

While there is some previous work in the area (Roberts and Sahu, 1997; Meng and Van Dyk, 1997; Papaspiliopoulos et al., 2003; Jones and Hobert, 2004; Papaspiliopoulos et al., 2007; Yu and Meng, 2011), current theoretical understanding of the interaction between Bayesian hierarchical models and MCMC convergence is still very limited, and almost nothing is known for models of hierarchical depth greater than two. The present paper offers a contribution towards such an understanding, focusing on theory for Gaussian hierarchical models and seeking quantitative results. In particular, we derive analytic expressions for the convergence rates of the Gibbs Sampler for various multilevel linear models and explore the dependence of these rates on the model structure, the choice of parametrization and the introduction of identifiability constraints. The theoretical results given in this paper extend and improve substantially on existing literature (Roberts and Sahu, 1997; Yu and Meng, 2011; Bass and Sahu, 2016b; Gao and Owen, 2017) both in terms of generality of hierarchical structure and the availability of explicit rates. We also show by simulations that the understanding gained from the Gaussian case can be extrapolated to more general settings.

In general, the Gibbs sampler can be elegantly described in terms of orthogonal projections (Amit, 1991, 1996; Diaconis et al., 2010). While in principle this theory provides the tools to extract practical convergence information for Gibbs samplers in the context of multivariate Gaussian distributions, in order to apply it to practically used Bayesian multilevel models one needs detailed knowledge of the spectrum of non-trivial high-dimensional matrices, which has drastically limited its applicability to derive analytic results. In this paper we combine this general framework with a novel multigrid decomposition approach that allows us to focus on low-dimensional Markov chains and derive explicit analytic results concerning Gibbs sampler rates of convergence for multilevel linear models, such as nested and crossed random effect models with an arbitrary number of layers and/or factors.

Our results have various practical implications. First, they can be readily used in the popular context of conditionally Gaussian models, where there exist unknown variances at various levels of the hierarchy (Gelman and Hill, 2006). In that case our results describe, for example, the optimal updating strategies for the hierarchical mean structure conditional on the variances, allowing one to optimize the mean parametrization on the fly (Section 3.2), or the computationally optimal way of imposing statistical identifiability (Section 4.2), and provide theoretically grounded indications of which parameters to monitor in the convergence diagnostic process (Section 2.1). Also, our results can be used as a building block to derive computational complexity statements about the Gibbs Sampler in the context of multilevel linear models (see e.g. Papaspiliopoulos et al. (2019) for work in that direction). Note that in the context of conditionally Gaussian models the entire Gaussian mean component could be updated in a single block, thus avoiding convergence issues related to single-site updates. However, these block updates can in principle be computationally expensive (up to $O(n^{3})$ cost in the dimension $n$ of the Gaussian to be updated), while single-site updating schemes with provably bounded convergence rate can offer a more scalable alternative. For some classes of models, sparse linear algebra methods can reduce the cost of the block update by exploiting sparsity in the posterior precision matrix, but the resulting computational cost depends on the model structure and can still be super-linear (see e.g. Section 4 for models leading to dense precision matrices and Papaspiliopoulos et al. (2019) for more discussion).

While impressive results are being obtained with black-box software implementations of Hamiltonian Monte Carlo (HMC) such as STAN (Carpenter et al., 2017), our results suggest that Gibbs Sampling schemes built on our methodological guidance can be substantially cheaper than gradient-based ones in the context of hierarchical models, leading to improved performance (Section 5). Moreover, our simulations show that the methodological results we develop in this paper are also helpful when fitting multilevel models with gradient-based schemes (Section 5.1) and allow one to obtain drastic improvements in efficiency even when using generic software, such as STAN.

Throughout the paper, we shall couch all our results in terms of $L^{2}$ rates of convergence. Specifically, let $({\bm{\beta}}(s))_{s=1,2,\dots}$ be a Markov chain with stationary distribution $\pi$ and transition operator defined by $P^{s}f({\bm{\beta}}(0))=\mathbb{E}[f({\bm{\beta}}(s))|{\bm{\beta}}(0)]$ . The rate of convergence $\rho({\bm{\beta}}(s))$ associated to $({\bm{\beta}}(s))_{s=1,2,\dots}$ is defined as the smallest number $\rho$ such that for all $r>\rho$

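$$\lim_{s\to\infty}\frac{\|P^{s}f-\mathbb{E}_{\pi}[f]\|_{L^{2}(\pi)}}{r^{s}}=0\qquad\forall f\in L^{2}(\pi),$$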

where $L^{2}(\pi)$ denotes the space of square $\pi$ -integrable functions, $\|\cdot\|_{L^{2}(\pi)}$ is its associated $L^{2}$ -norm and $\mathbb{E}_{\pi}[f]=\int f\,d\pi$ is the expectation of $f$ with respect to $\pi$ . The rate of convergence $\rho({\bm{\beta}}(s))$ characterizes the speed at which $({\bm{\beta}}(s))_{s=1,2,\dots}$ converges to its stationary distribution $\pi$ , with a simple argument giving that if

$$T=\min\{s\,;\,\|P^{s}f-\mathbb{E}_{\pi}[f]\|_{L^{2}(\pi)}\leq\epsilon\}$$

then $T=\mathcal{O}\left(\frac{1}{-\log(\rho)}\right)$ .
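To give a feel for this scaling, the following minimal sketch converts a rate $\rho$ into an approximate number of iterations. It assumes, as an idealization that the text does not state, that the $L^{2}$ error decays exactly like $\rho^{s}$ with unit constant; the function name `iterations_to_tolerance` is purely illustrative.

```python
import math


def iterations_to_tolerance(rho: float, eps: float) -> int:
    """Smallest integer s with rho**s <= eps (idealized geometric decay)."""
    if not (0.0 < rho < 1.0) or not (0.0 < eps < 1.0):
        raise ValueError("need 0 < rho < 1 and 0 < eps < 1")
    # log(rho) < 0, so rho**s <= eps is equivalent to s >= log(eps) / log(rho).
    return math.ceil(math.log(eps) / math.log(rho))


if __name__ == "__main__":
    for rho in (0.5, 0.9, 0.995):
        print(rho, iterations_to_tolerance(rho, eps=0.01))
```

Under this idealization, a rate close to 1 translates into a number of iterations roughly proportional to $1/(1-\rho)$, since $-\log(\rho)\approx 1-\rho$ for $\rho$ near 1.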

1.1 Paper overview and structure

Section 2 carefully introduces the 3-level hierarchical models we shall consider, and provides motivating simulations. Then in Section 3 we shall give a complete analysis for 3-level symmetric models (i.e. homogeneous variances and symmetric data structure). At the heart of the analysis is a multigrid decomposition of the Gibbs sampler into completely independent Markov chains describing different levels of hierarchical granularity (Theorem 1). Such a multigrid decomposition applies simultaneously to every Gibbs sampler induced by the centred/non-centred parametrizations and is fundamentally a statistical property of the hierarchical models under consideration. Although multigrid ideas have already been used in methodological contexts to design improved MCMC schemes (Goodman and Sokal, 1989; Liu and Sabatti, 2000), to our knowledge they had never been used in theoretical contexts to study convergence rates. We demonstrate that the slowest of these independent chains is always the one corresponding to the coarsest level, regardless of the values of the variance components and of the number of branches in the hierarchy, and thus derive explicit expressions for the rates of convergence in symmetric contexts.

In Section 4 we focus on crossed effect models, using again a multigrid decomposition approach to derive explicit convergence rates. The results show that in the context of crossed models, centred/non-centred reparametrizations are not sufficient to guarantee fast convergence of the resulting Gibbs Sampler. On the other hand, we show that the latter can be achieved by imposing stronger statistical identifiability through additional linear constraints and our theory provides indications on which constraints lead to faster convergence. Finally, a simulation study reported in Section 5 suggests that the analysis of the Gaussian case leads to useful guidance also in the case of non-Gaussian models for both the Gibbs Sampler and Hamiltonian Monte Carlo algorithms (Neal et al., 2011).

Section 6 considers 3-level non-symmetric hierarchical models, providing bounds on convergence rates based on comparisons with related symmetric models and discussing the use of bespoke parametrizations, where the choice of centred or non-centred parametrization in each branch of the hierarchy depends on the branch-specific parameter.

Section 7 considers hierarchical models with arbitrary depth ( $\geq 4$ ). Using an appropriate auxiliary random walk, whose evolution through the hierarchical tree is governed by the parameters’ squared partial correlations, we are able to extend the multigrid analysis to general tree structures and some non-symmetric cases. We again demonstrate a fundamental multigrid decomposition in Theorem 9 where the coarsest level chain converges the slowest, and we give explicit formulae for optimal partial non-centering strategies.

2 Three level hierarchical linear models

The theoretical innovation in this paper is centred around an important case in which we can obtain explicit Gibbs sampler rates of convergence, and as a result study explicitly the effects of particular models, parametrization schemes and blocking strategies. Therefore we shall begin with a detailed study of the following three-level Gaussian linear model, giving a fairly complete understanding of the interaction between model structure and parametrization and the Gibbs Sampler convergence behaviour.

Model S3 (Symmetric 3-levels hierarchical model).

Suppose

$$y_{ijk}=\mu+a_{i}+b_{ij}+\epsilon_{ijk}, \tag{2}$$

where $i$ , $j$ and $k$ run from 1 to $I$ , $J$ and $K$ respectively and $\epsilon_{ijk}$ are iid normal random variables with mean 0 and variance $\sigma_{e}^{2}$ . We employ the standard Bayesian model specification assuming $a_{i}\sim N(0,\sigma_{a}^{2})$ , $b_{ij}\sim N(0,\sigma_{b}^{2})$ and a flat prior on $\mu$ .

For the theoretical analysis, we will consider the variance terms $\sigma_{a}^{2}$, $\sigma_{b}^{2}$ and $\sigma_{e}^{2}$ to be known, while in the simulations we will assume them to be unknown and give them a prior distribution. Defining $\textbf{a}=(a_{i})_{i}$, $\textbf{b}=(b_{ij})_{i,j}$ and $\textbf{y}=(y_{ijk})_{i,j,k}$, the Gibbs Sampler explores the posterior distribution $(\mu,\textbf{a},\textbf{b})|\textbf{y}$ by iteratively sampling from the full conditional distributions of $\mu$, $\textbf{a}$ and $\textbf{b}$ as follows (see below for motivation of denoting such sampler as GS($1,1$)).

Sampler GS( $1,1$ ).

Initialize $\mu(0)$ , $\textbf{a}(0)$ and $\textbf{b}(0)$ and then iterate

1. Sample $\mu(s+1)$ from $p(\mu|\textbf{a}(s),\textbf{b}(s),\textbf{y})$;

2. Sample $a_{i}(s+1)$ from $p(a_{i}|\mu(s+1),\textbf{b}(s),\textbf{y})$ for all $i$;

3. Sample $b_{ij}(s+1)$ from $p(b_{ij}|\mu(s+1),\textbf{a}(s+1),\textbf{y})$ for all $i$ and $j$,

where $p(\mu|\textbf{a},\textbf{b},\textbf{y})$ , $p(a_{i}|\mu,\textbf{b},\textbf{y})$ and $p(b_{ij}|\mu,\textbf{a},\textbf{y})$ are the full conditionals of Model S3 (see supplementary material for explicit expressions).
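The three steps of Sampler GS($1,1$) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the variances are treated as known, the data are a balanced nested list `y[i][j][k]`, and the full conditionals are the standard conjugate Gaussian updates worked out here from Model S3 (the paper's own expressions are in its supplementary material, which is not reproduced). The name `gibbs_ncp` and the toy data are hypothetical.

```python
import random


def gibbs_ncp(y, sigma_a, sigma_b, sigma_e, n_iter, seed=0):
    """Run GS(1,1) on a balanced array y[i][j][k]; returns a list of (mu, a, b)."""
    rng = random.Random(seed)
    I, J, K = len(y), len(y[0]), len(y[0][0])
    ta, tb, te = sigma_a ** -2, sigma_b ** -2, sigma_e ** -2  # precisions
    ybar = [[sum(cell) / K for cell in row] for row in y]     # cell means
    mu, a, b = 0.0, [0.0] * I, [[0.0] * J for _ in range(I)]
    prec_a = ta + J * K * te  # conditional precision of each a_i
    prec_b = tb + K * te      # conditional precision of each b_ij
    draws = []
    for _ in range(n_iter):
        # Step 1: mu | a, b, y (flat prior on mu).
        resid = sum(ybar[i][j] - a[i] - b[i][j]
                    for i in range(I) for j in range(J)) / (I * J)
        mu = rng.gauss(resid, (I * J * K * te) ** -0.5)
        # Step 2: a_i | mu, b, y, independently over i.
        for i in range(I):
            s = sum(ybar[i][j] - mu - b[i][j] for j in range(J))
            a[i] = rng.gauss(K * te * s / prec_a, prec_a ** -0.5)
        # Step 3: b_ij | mu, a, y, independently over (i, j).
        for i in range(I):
            for j in range(J):
                m = K * te * (ybar[i][j] - mu - a[i]) / prec_b
                b[i][j] = rng.gauss(m, prec_b ** -0.5)
        draws.append((mu, a[:], [row[:] for row in b]))
    return draws


if __name__ == "__main__":
    rng = random.Random(1)
    # Hypothetical toy data: 4 groups, 3 subgroups each, 5 observations per cell.
    y = [[[rng.gauss(0.0, 1.0) for _ in range(5)] for _ in range(3)] for _ in range(4)]
    draws = gibbs_ncp(y, sigma_a=1.0, sigma_b=1.0, sigma_e=1.0, n_iter=1000)
    print(sum(d[0] for d in draws[500:]) / 500)  # rough posterior-mean estimate of mu
```

Because the $a_{i}$ are conditionally independent given $(\mu,\textbf{b})$, and likewise the $b_{ij}$ given $(\mu,\textbf{a})$, the inner loops amount to block updates of $\textbf{a}$ and $\textbf{b}$.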

Given the conditional independence structure of the model, Sampler GS(1,1) is equivalent to a blocked Gibbs sampler with components $\mu$, $\textbf{a}$ and $\textbf{b}$, i.e. a scheme performing consecutive updates of $\mu|\textbf{a},\textbf{b}$, $\textbf{a}|\mu,\textbf{b}$ and $\textbf{b}|\mu,\textbf{a}$ at each iteration.

The parametrization $(\mu,\textbf{a},\textbf{b})$ induced by (2) is often referred to as non-centred parametrization (NCP) and it is contrasted with the centred parametrization (CP) obtained by replacing $a_{i}$ and $b_{ij}$ with $\gamma_{i}=\mu+a_{i}$ and $\eta_{ij}=\gamma_{i}+b_{ij}$ respectively. Under the centred parametrization $(\mu,\bm{\gamma},\bm{\eta})$ the model formulation becomes

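$$y_{ijk}\sim N(\eta_{ij},\sigma_{e}^{2}),\quad\eta_{ij}\sim N(\gamma_{i},\sigma_{b}^{2}),\quad\gamma_{i}\sim N(\mu,\sigma_{a}^{2}),\quad p(\mu)\propto 1. \tag{3}$$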

Figures 1(b) and 1(a) provide a graphical representation of the two parametrizations. In the $(\mu,\textbf{a},\textbf{b})$ case $(1,1)$ refers to the fact that both levels 1 and 2 use a non-centred parametrization, while in the $(\mu,\bm{\gamma},\bm{\eta})$ case $(0,0)$ indicates that both levels use a centred parametrization. The resulting Gibbs sampler for the centred parametrization is as follows.

Sampler GS( $0,0$ ).

Initialize $\mu(0)$ , $\bm{\gamma}(0)$ and $\bm{\eta}(0)$ and then iterate

1. Sample $\mu(s+1)$ from $p(\mu|\bm{\gamma}(s),\bm{\eta}(s),\textbf{y})$;

2. Sample $\gamma_{i}(s+1)$ from $p(\gamma_{i}|\mu(s+1),\bm{\eta}(s),\textbf{y})$ for all $i$;

3. Sample $\eta_{ij}(s+1)$ from $p(\eta_{ij}|\mu(s+1),\bm{\gamma}(s+1),\textbf{y})$ for all $i$ and $j$,

where $p(\mu|\bm{\gamma},\bm{\eta},\textbf{y})$ , $p(\gamma_{i}|\mu,\bm{\eta},\textbf{y})$ and $p(\eta_{ij}|\mu,\bm{\gamma},\textbf{y})$ are the full conditionals induced by (3) (see supplementary material for explicit expressions).
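A corresponding sketch of Sampler GS($0,0$) is given below, again assuming known variances and balanced data. The conditionals are the standard Gaussian updates implied by the centred formulation ($\mu$ depends only on $\bm{\gamma}$, each $\gamma_{i}$ on $\mu$ and its $\eta_{ij}$, each $\eta_{ij}$ on $\gamma_{i}$ and its observations); they are derived here rather than copied from the supplementary material, and `gibbs_cp` is a hypothetical name.

```python
import random


def gibbs_cp(y, sigma_a, sigma_b, sigma_e, n_iter, seed=0):
    """Run GS(0,0) on a balanced array y[i][j][k]; returns a list of (mu, gamma, eta)."""
    rng = random.Random(seed)
    I, J, K = len(y), len(y[0]), len(y[0][0])
    ta, tb, te = sigma_a ** -2, sigma_b ** -2, sigma_e ** -2  # precisions
    ybar = [[sum(cell) / K for cell in row] for row in y]     # cell means
    mu, gamma, eta = 0.0, [0.0] * I, [[0.0] * J for _ in range(I)]
    prec_g = ta + J * tb  # conditional precision of each gamma_i
    prec_e = tb + K * te  # conditional precision of each eta_ij
    draws = []
    for _ in range(n_iter):
        # Step 1: mu | gamma ~ N(mean(gamma), sigma_a^2 / I) under the flat prior.
        mu = rng.gauss(sum(gamma) / I, (I * ta) ** -0.5)
        # Step 2: gamma_i | mu, eta combines the prior mean mu with the J children.
        for i in range(I):
            m = (ta * mu + tb * sum(eta[i])) / prec_g
            gamma[i] = rng.gauss(m, prec_g ** -0.5)
        # Step 3: eta_ij | gamma, y combines the parent gamma_i with K observations.
        for i in range(I):
            for j in range(J):
                m = (tb * gamma[i] + K * te * ybar[i][j]) / prec_e
                eta[i][j] = rng.gauss(m, prec_e ** -0.5)
        draws.append((mu, gamma[:], [row[:] for row in eta]))
    return draws


if __name__ == "__main__":
    rng = random.Random(1)
    # Hypothetical toy data: 4 groups, 3 subgroups each, 5 observations per cell.
    y = [[[rng.gauss(0.0, 1.0) for _ in range(5)] for _ in range(3)] for _ in range(4)]
    draws = gibbs_cp(y, sigma_a=1.0, sigma_b=1.0, sigma_e=1.0, n_iter=1000)
    print(sum(d[0] for d in draws[500:]) / 500)  # rough posterior-mean estimate of mu
```

In this parametrization the data enter only through the update of $\bm{\eta}$, which is what distinguishes the dependence structure from that of the non-centred sampler.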

Together with the fully non-centred parametrization $(\mu,\textbf{a},\textbf{b})$ and the fully centred parametrization $(\mu,\bm{\gamma},\bm{\eta})$ , one can also consider the mixed parametrizations given by $(\mu,\bm{\gamma},\textbf{b})$ and $(\mu,\textbf{a},\bm{\eta})$ and the corresponding Gibbs Sampler schemes $GS{(0,1)}$ and $GS(1,0)$ . See Figures 1(c) and 1(d) for graphical representations.
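The four parametrizations are linked by the deterministic maps $\gamma_{i}=\mu+a_{i}$ and $\eta_{ij}=\gamma_{i}+b_{ij}$. The short sketch below makes the correspondence with the labels $(\lambda_{1},\lambda_{2})\in\{0,1\}^{2}$ explicit, where $\lambda_{l}=1$ means level $l$ is non-centred; the function names are illustrative only.

```python
def from_non_centred(mu, a, b, lam1, lam2):
    """Map (mu, a, b) to the parametrization used by GS(lam1, lam2)."""
    gamma = [mu + a_i for a_i in a]
    beta1 = list(a) if lam1 == 1 else gamma          # a (NCP) or gamma (CP)
    beta2 = [[b_ij if lam2 == 1 else g + b_ij for b_ij in row]  # b or eta
             for g, row in zip(gamma, b)]
    return mu, beta1, beta2


def to_non_centred(mu, beta1, beta2, lam1, lam2):
    """Inverse map: recover (mu, a, b) from the GS(lam1, lam2) parametrization."""
    a = [x if lam1 == 1 else x - mu for x in beta1]
    b = [[x if lam2 == 1 else x - (mu + a_i) for x in row]
         for a_i, row in zip(a, beta2)]
    return mu, a, b


if __name__ == "__main__":
    mu, a, b = 1.5, [0.5, -1.0], [[0.1, 0.2], [1.0, 0.0]]
    print(from_non_centred(mu, a, b, 0, 1))  # the mixed parametrization (mu, gamma, b)
```

Since each map is a bijection, all four samplers target the same posterior and differ only in the coordinates in which the single-site updates are performed.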

2.1 Illustrative example

As an illustrative example, we simulated data from Model S3 with $I=J=100$, $K=5$, $\mu=0$, $\sigma_{a}=\sigma_{e}=10$ and $\sigma_{b}=10^{-0.5}$. This corresponds to a scenario with a high level of noise in the measurements. We fit Model S3 assuming the standard deviations $(\sigma_{a},\sigma_{b},\sigma_{e})$ to be unknown and placing weakly informative priors, namely $\frac{1}{\sigma_{a}^{2}}$, $\frac{1}{\sigma_{b}^{2}}$ and $\frac{1}{\sigma_{e}^{2}}$ a priori distributed according to an Inverse Gamma distribution with shape and rate parameters equal to $0.01$. We compare the efficiency of the Gibbs sampling schemes corresponding to the four different parametrizations, denoting them by $GS(1,1)$, $GS(0,0)$, $GS(0,1)$ and $GS(1,0)$. For this simple example we initialized the chains at the true values of the parameters $(\mu,\textbf{a},\textbf{b})$ and $(\sigma_{a},\sigma_{b},\sigma_{e})$, which we know because the dataset is simulated. The more realistic case of starting the chains from randomly chosen states led to the same conclusions as this illustrative example. Note that all four schemes have the same starting states (modulo re-parametrization) to ensure a fair comparison.

Figure 2 shows the mixing behaviour of the global mean $\mu$ and displays the potentially dramatic difference among mixing properties of the Gibbs Sampler under different parametrizations.

Based on Figure 2 one would certainly exclude using $GS(1,1)$ and $GS(1,0)$ to fit the model under consideration and may be tempted to deduce that both $GS(0,0)$ and $GS(0,1)$ lead to good mixing properties of the resulting chain. However, as an additional check, a cautious practitioner may also explore the mixing of the parameters at the first level, namely $\textbf{a}$ and $\bm{\gamma}$. Figure 3 displays the behaviour of the global averages of such parameters, namely $a_{\cdot}=\frac{\sum_{i}a_{i}}{I}$ and $\gamma_{\cdot}=\frac{\sum_{i}\gamma_{i}}{I}$, in the first 1000 iterations.

Again, we see a dramatic difference induced by different parametrizations and, somewhat surprisingly, we see that, despite having good mixing behaviour at level 0 (i.e. $\mu$), $GS(0,0)$ displays very poor mixing behaviour at level 1 (i.e. $\bm{\gamma}$). It is then natural to explore also the mixing behaviour at level 2, and Figure 4 does so, again by plotting the global averages $b_{\cdot\cdot}=\frac{\sum_{ij}b_{ij}}{IJ}$ and $\eta_{\cdot\cdot}=\frac{\sum_{ij}\eta_{ij}}{IJ}$.

In this case $GS(1,1)$ and $GS(0,1)$ are the only ones achieving good mixing. Based on Figures 2, 3 and 4 it is natural to choose to fit the model using the sampler $GS(0,1)$ corresponding to the mixed parametrization $(\mu,\bm{\gamma},\textbf{b})$, as it is the only one providing good mixing across all three levels.

This simple example shows many typical issues arising when fitting Bayesian multi-level models and raises many questions. For example, one would like to know which parameters are good ones to monitor when diagnosing convergence, in order to avoid misleading conclusions like the one suggested by Figure 2. In fact, while in two-level models good mixing of the global hyperparameters such as $\mu$ typically indicates good global mixing, this is not true in other multi-level models. Indeed, it is legitimate to wonder whether diagnoses based only on the global means, like in Figures 2-4, are enough to deduce good mixing of the whole Markov chain, which in our example has more than $10^{4}$ dimensions ($1+I+IJ$ mean components and $3$ precision components). Below we will show that for Model S3, mixing of the global means ensures mixing of the whole $(1+I+IJ)$-dimensional mean components of the chain given the variances (see e.g. Corollary 1). Therefore it is enough to monitor the three global means and the three variances to ensure a reliable check of the chain mixing properties.

Even more crucially, it is desirable to have simple and theoretically grounded guidance in choosing a computationally efficient parametrization, given the huge impact it can have on computational performances. The theoretical analysis developed in the next section will provide useful guidance in this respect.

3 Multigrid decomposition for the three level hierarchical model

The basic ingredient of our analysis is the following multigrid decomposition. Consider the four possible parametrizations of Model S3: $(\mu,\textbf{a},\textbf{b})$, $(\mu,\bm{\gamma},\bm{\eta})$ and the mixed parametrizations $(\mu,\bm{\gamma},\textbf{b})$ and $(\mu,\textbf{a},\bm{\eta})$. In order to provide a unified treatment, regardless of the chosen parametrization, we denote the parameters used by $({\bm{\beta}}^{(0)},{\bm{\beta}}^{(1)},{\bm{\beta}}^{(2)})$ and the resulting Gibbs Sampler by $GS({\bm{\beta}})$. For example, in the NCP case ${\bm{\beta}}^{(0)}=\mu$, ${\bm{\beta}}^{(1)}=\textbf{a}$, ${\bm{\beta}}^{(2)}=\textbf{b}$ and $GS({\bm{\beta}})$ coincides with GS(1,1). First consider the map $\delta$ sending ${\bm{\beta}}=({\bm{\beta}}^{(0)},{\bm{\beta}}^{(1)},{\bm{\beta}}^{(2)})$ to

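$$\delta({\bm{\beta}})=\left(\begin{array}[]{c}\delta^{(0)}{\bm{\beta}}\\ \delta^{(1)}{\bm{\beta}}\\ \delta^{(2)}{\bm{\beta}}\end{array}\right)=\left(\begin{array}[]{c}\delta^{(0)}{\bm{\beta}}^{(0)}\,,\,\delta^{(0)}{\bm{\beta}}^{(1)}\,,\,\delta^{(0)}{\bm{\beta}}^{(2)}\\ \delta^{(1)}{\bm{\beta}}^{(1)}\,,\,\delta^{(1)}{\bm{\beta}}^{(2)}\\ \delta^{(2)}{\bm{\beta}}^{(2)}\end{array}\right)\,,$$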

where, loosely speaking, $\delta^{(i)}{\bm{\beta}}$ represents the increments of ${\bm{\beta}}$ at the $i$-th coarseness level. More precisely

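$$\delta^{(0)}{\bm{\beta}}^{(0)}=\beta^{(0)},\quad\delta^{(0)}{\bm{\beta}}^{(1)}=\beta^{(1)}_{\cdot},\quad\delta^{(0)}{\bm{\beta}}^{(2)}=\beta^{(2)}_{\cdot\cdot},$$

$$\delta^{(1)}{\bm{\beta}}^{(1)}=(\beta^{(1)}_{1}-\beta^{(1)}_{\cdot},\dots,\beta^{(1)}_{I}-\beta^{(1)}_{\cdot}),\quad\delta^{(1)}{\bm{\beta}}^{(2)}=(\beta^{(2)}_{1\cdot}-\beta^{(2)}_{\cdot\cdot},\dots,\beta^{(2)}_{I\cdot}-\beta^{(2)}_{\cdot\cdot}),$$

$$\delta^{(2)}{\bm{\beta}}^{(2)}=(\beta^{(2)}_{11}-\beta^{(2)}_{1\cdot},\beta^{(2)}_{12}-\beta^{(2)}_{1\cdot},\dots,\beta^{(2)}_{I(J-1)}-\beta^{(2)}_{I\cdot},\beta^{(2)}_{IJ}-\beta^{(2)}_{I\cdot}),$$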

where

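$$\beta^{(1)}_{\cdot}=\frac{\sum_{i}\beta^{(1)}_{i}}{I},\quad\beta^{(2)}_{\cdot\cdot}=\frac{\sum_{i,j}\beta^{(2)}_{ij}}{IJ},\quad\beta^{(2)}_{i\cdot}=\frac{\sum_{j}\beta^{(2)}_{ij}}{J}.$$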

It is easy to see that the map $\delta$ is a bijection between $\mathbb{R}^{d}$ and $\mathbb{R}^{3}\times(\mathbb{R}^{I})^{*}\times(\mathbb{R}^{I})^{*}\times_{i=1}^{I}(\mathbb{R}^{J})^{*}$ , where $(\mathbb{R}^{p})^{*}=\{(v_{1},\dots,v_{p})\in\mathbb{R}^{p}\,:\,\sum_{i=1}^{p}v_{i}=0\}$ . The dimensionality of $\delta{\bm{\beta}}$ equals the one of ${\bm{\beta}}$ , which is $1+I+IJ$ , because $\delta{\bm{\beta}}$ has $3+2I+IJ$ parameters and $2+I$ constraints. The following theorem shows that the Markov chain induced by $GS({\bm{\beta}})$ factorizes under the transformation $\delta$ .
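As a concrete illustration of the map $\delta$ and of the dimension count above, the following sketch computes the three components for a balanced array, checks the zero-sum constraints, and inverts the map. It uses plain nested lists, and the names `delta` and `delta_inverse` are illustrative.

```python
def delta(beta0, beta1, beta2):
    """Multigrid map: global means, level-1 increments, level-2 increments."""
    I, J = len(beta1), len(beta2[0])
    b1_bar = sum(beta1) / I                      # beta^(1)_.
    row_means = [sum(row) / J for row in beta2]  # beta^(2)_{i.}
    b2_bar = sum(row_means) / I                  # beta^(2)_.. (balanced design)
    d0 = (beta0, b1_bar, b2_bar)
    d1 = ([x - b1_bar for x in beta1], [r - b2_bar for r in row_means])
    d2 = [[x - r for x in row] for row, r in zip(beta2, row_means)]
    return d0, d1, d2


def delta_inverse(d0, d1, d2):
    """Recover (beta0, beta1, beta2) from the three multigrid components."""
    beta0, b1_bar, b2_bar = d0
    beta1 = [x + b1_bar for x in d1[0]]
    beta2 = [[x + r + b2_bar for x in row] for row, r in zip(d2, d1[1])]
    return beta0, beta1, beta2


if __name__ == "__main__":
    d0, d1, d2 = delta(0.3, [1.0, 2.0, 6.0], [[1.0, 3.0], [0.0, 4.0], [5.0, 5.0]])
    print(d0)  # the three global means
    print(d1)  # two length-I vectors, each summing to zero
    print(d2)  # I rows of length J, each summing to zero
```

The two vectors in $\delta^{(1)}{\bm{\beta}}$ and the $I$ rows of $\delta^{(2)}{\bm{\beta}}$ each sum to zero, which accounts for the $2+I$ constraints and leaves $3+2(I-1)+I(J-1)=1+I+IJ$ free coordinates.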

Theorem 1 (Multigrid Decomposition).

Let $({\bm{\beta}}(s))_{s=1}^{\infty}$ be a Markov chain on $\mathbb{R}^{d}$ evolving according to $GS({\bm{\beta}})$ . Then the timewise transformations $(\delta^{(0)}{\bm{\beta}}(s))_{s=1}^{\infty}$ , $(\delta^{(1)}{\bm{\beta}}(s))_{s=1}^{\infty}$ and $(\delta^{(2)}{\bm{\beta}}(s))_{s=1}^{\infty}$ are each a Markov chain and evolve independently.

While the posterior independence of $\delta^{(0)}{\bm{\beta}}$ , $\delta^{(1)}{\bm{\beta}}$ and $\delta^{(2)}{\bm{\beta}}$ is well-known, the remarkable fact following from Theorem 1 is that also the Markov chains induced by the Gibbs Sampler are independent (note that the independence of the random vector under the target measure is a necessary but not sufficient condition for the independence of a corresponding MCMC scheme).

Remark 1.

It is worth noting that the three subspaces of $\mathbb{R}^{d}$ spanned by the vectors $\delta^{(0)}{\bm{\beta}}$ , $\delta^{(1)}{\bm{\beta}}$ and $\delta^{(2)}{\bm{\beta}}$ , respectively, do not depend on the choice of parametrization ${\bm{\beta}}$ . Thus the multigrid decomposition is intrinsic to the model, and not dependent on the particular parametrization being considered.

Theorem 1 provides a useful tool to analyze the Markov chain of interest, ${\bm{\beta}}(s)$ . In fact the factorization into independent Markov chains implies that the rate of convergence of ${\bm{\beta}}(s)$ is simply given by the worst rate of convergence among $\delta^{(0)}{\bm{\beta}}(s)$ , $\delta^{(1)}{\bm{\beta}}(s)$ and $\delta^{(2)}{\bm{\beta}}(s)$ . Interestingly, the slowest chain is always the chain at the highest level $\delta^{(0)}{\bm{\beta}}(s)$ , regardless of the choice of parametrization and the values of $(I,J,K,\sigma_{a},\sigma_{b},\sigma_{e})$ .

Theorem 2 (Hierarchical ordering of convergence rates).

Let $\delta^{(0)}{\bm{\beta}}(s)$ , $\delta^{(1)}{\bm{\beta}}(s)$ and $\delta^{(2)}{\bm{\beta}}(s)$ be the Markov chains defined in Theorem 1. Then the associated convergence rates satisfy

[TABLE]

Theorems 1 and 2 imply that the rate of convergence of the global chain ${\bm{\beta}}(s)$ coincides with the one of the sub-chain $\delta^{(0)}{\bm{\beta}}(s)$ sampling the global means $(\beta^{(0)},\beta^{(1)}_{\cdot},\beta^{(2)}_{\cdot\cdot})$ .

Corollary 1.

(Rate of convergence of $GS({\bm{\beta}})$ ) Given the notation of Theorem 1,

[TABLE]
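Corollary 1 can be checked numerically on a small instance. The sketch below assumes the non-centred form of Model S3 from Section 2, namely $y_{ijk}=\mu+a_{i}+b_{ij}+\epsilon_{ijk}$ with $a_{i}\sim N(0,\sigma_{a}^{2})$ , $b_{ij}\sim N(0,\sigma_{b}^{2})$ , $\epsilon_{ijk}\sim N(0,\sigma_{e}^{2})$ and a flat prior on $\mu$ ; this form is not restated in this section and is an assumption of the example. It uses the standard fact that a deterministic-scan Gibbs Sampler on a Gaussian target with precision matrix $Q$ is an autoregressive process whose rate is the spectral radius of $-(D+L)^{-1}U$ , where $D+L$ and $U$ are the block lower-triangular and strictly upper-triangular parts of $Q$ . The helper names are hypothetical.

```python
import numpy as np

def ncp_precision(I, J, K, sa, sb, se):
    """Posterior precision of (mu, a, b) under the assumed NCP form of Model S3."""
    d = 1 + I + I * J
    Q = np.zeros((d, d))
    for i in range(I):
        for j in range(J):
            x = np.zeros(d)
            x[0] = x[1 + i] = x[1 + I + i * J + j] = 1.0   # mu + a_i + b_ij
            Q += (K / se**2) * np.outer(x, x)              # K observations per cell
    Q[1:1 + I, 1:1 + I] += np.eye(I) / sa**2               # prior on a
    Q[1 + I:, 1 + I:] += np.eye(I * J) / sb**2             # prior on b
    return Q

def gibbs_matrix(Q, blocks):
    """Matrix B such that one Gibbs sweep is x' = B x + noise."""
    d = Q.shape[0]
    lower = np.zeros((d, d))                # D + L: block lower-triangular part
    for r, br in enumerate(blocks):
        for bc in blocks[:r + 1]:
            lower[np.ix_(br, bc)] = Q[np.ix_(br, bc)]
    return -np.linalg.solve(lower, Q - lower)

I, J, K = 3, 4, 2
Q = ncp_precision(I, J, K, sa=1.5, sb=0.7, se=2.0)
blocks = [np.arange(0, 1), np.arange(1, 1 + I), np.arange(1 + I, 1 + I + I * J)]
B = gibbs_matrix(Q, blocks)

# Linear map giving the global means (mu, a_bar, b_bar), i.e. delta^(0).
M = np.zeros((3, Q.shape[0]))
M[0, 0] = 1.0
M[1, 1:1 + I] = 1.0 / I
M[2, 1 + I:] = 1.0 / (I * J)
B0 = M @ B @ M.T @ np.linalg.inv(M @ M.T)   # candidate 3x3 skeleton dynamics

closed = np.allclose(M @ B, B0 @ M)         # skeleton evolves autonomously
rate_full = np.max(np.abs(np.linalg.eigvals(B)))
rate_skeleton = np.max(np.abs(np.linalg.eigvals(B0)))
```

Under these assumptions the flag `closed` indicates that the global means form a Markov chain on their own, and `rate_full` agrees with `rate_skeleton`, as Corollary 1 predicts.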

3.1 Explicit rates of convergence under different parametrizations

The multigrid decomposition developed in Section 3 allows us to perform a direct analysis of the convergence properties of the Markov chain of interest ${\bm{\beta}}(s)$ . The latter is a Gibbs Sampler targeting a multivariate Gaussian distribution and thus, in principle, could be analyzed using, for example, the tools developed in Amit (1996); Roberts and Sahu (1997); Khare et al. (2009). However, these results require a full characterization of the spectrum of a $d\times d$ matrix, where $d$ is the dimension of the Markov chain under consideration. Given the high dimensionality of ${\bm{\beta}}(s)$ , which has $1+I+IJ$ parameters, it is hard to apply such results directly, and in fact the convergence properties of ${\bm{\beta}}(s)$ have been studied heuristically or numerically in the literature (see e.g. Gelfand et al., 1995, Sec. 4 and Roberts and Sahu, 1997, Sec. 4.2). Corollary 1, however, implies that it suffices to study the skeleton chain $\delta^{(0)}{\bm{\beta}}(s)$ , which is a low-dimensional chain (namely 3-dimensional) amenable to direct analysis. Therefore, using Corollary 1, we can derive analytic expressions for the rates of convergence of the Gibbs Sampler under different parametrizations.

Theorem 3.

Given an instance of Model S3, the rates of convergence of the four Gibbs Sampler schemes $GS(0,0)$ , $GS(1,1)$ , $GS(0,1)$ and $GS(1,0)$ are given by

[TABLE]

where $\tilde{\sigma}_{a}^{2}=\frac{\sigma_{a}^{2}}{I}$ , $\tilde{\sigma}_{b}^{2}=\frac{\sigma_{b}^{2}}{IJ}$ and $\tilde{\sigma}_{e}^{2}=\frac{\sigma_{e}^{2}}{IJK}$ .

Theorem 3 provides explicit and informative formulas regarding the interaction between choice of parametrization and resulting efficiency of the Gibbs Sampler for Model S3.

Figure 6 summarizes graphically the dependence of the convergence rates of different parametrizations on the values of the variances at the various levels. Roughly speaking, the figure suggests that there is a partition of the hyperparameter space (corresponding to the white regions in each plot) such that in each region one and only one of the four parametrizations performs well.

Consider for example the illustrative example of Section 2.1. Applying Theorem 3 in this context, we obtain that the $L^{2}$ rates of convergence (up to the third decimal digit) of the various Gibbs Samplers under consideration given $(I,J,K,\sigma_{a},\sigma_{b},\sigma_{e})=(100,100,5,10,10^{-0.5},10)$ are

[TABLE]

Recall that values of $\rho$ close to 1 mean slow convergence, see (1) and discussion thereof. These numbers provide a quantitative and theoretically grounded description of the behaviour heuristically observed in Section 2.1 and can be easily used to optimize performances (see e.g. Section 3.2 below).

3.2 Conditionally optimal parametrization

A natural and practically relevant question is what is the optimal parametrization (among the four possible choices $(\mu,\textbf{a},\textbf{b})$ , $(\mu,\bm{\gamma},\textbf{b})$ , $(\mu,\textbf{a},\bm{\eta})$ and $(\mu,\bm{\gamma},\bm{\eta})$ ) as a function of the normalized variance components $(\tilde{\sigma}_{a}^{2},\tilde{\sigma}_{b}^{2},\tilde{\sigma}_{e}^{2})$ . Using the formulas of Theorem 3 we can obtain the following explicit answers.

Corollary 2 (Optimal parametrization for Model S3).

The rate of convergence of the Gibbs Sampler targeting Model S3 is minimized by the following choice of parametrization:

• use a centred parametrization $\bm{\eta}$ at the lowest level if and only if $\tilde{\sigma}_{b}^{2}\geq\tilde{\sigma}_{e}^{2}$ ,

• use a centred parametrization $\bm{\gamma}$ at the middle level if and only if $\tilde{\sigma}_{a}^{2}\geq\tilde{\sigma}_{b}^{2}+\tilde{\sigma}_{e}^{2}$ .

The resulting Gibbs Sampler has a rate of convergence $\rho$ upper bounded by $\frac{2}{3}$ , with the equality $\rho=\frac{2}{3}$ holding if and only if $\tilde{\sigma}_{a}^{2}=\tilde{\sigma}_{b}^{2}+\tilde{\sigma}_{e}^{2}$ and $\tilde{\sigma}_{b}^{2}=\tilde{\sigma}_{e}^{2}$ (in which case all parametrizations are equivalent).

Table 1 provides a graphical representation of the decision process. This simple rule guarantees that the resulting Gibbs Sampler has a rate of convergence no larger than $\frac{2}{3}$ , thus guaranteeing a high sampling efficiency for fixed variances.

Table 1 implies that the choice of parametrization of a given level (i.e. whether it is computationally convenient to use a centred or non-centred parametrization) depends on the ratio between the normalized variance at the level under consideration and the sum of the normalized variances of the levels below. These results extend previous intuition for the two-level case (e.g. Papaspiliopoulos et al. (2003)) to deeper hierarchical levels (in this case three levels).

Corollary 2 allows for simple and effective strategies to ensure high sampling efficiency in practical implementations of Gibbs Sampling for Model S3 in the case of unknown variances. Common implementations choose a parametrization ${\bm{\beta}}=({\bm{\beta}}^{(0)},{\bm{\beta}}^{(1)},{\bm{\beta}}^{(2)})$ of the Gaussian component (for example the fully centred parametrization ${\bm{\beta}}=(\mu,\bm{\gamma},\bm{\eta})$ ) and alternate updating ${\bm{\beta}}|(\sigma_{a},\sigma_{b},\sigma_{e})$ with $GS({\bm{\beta}})$ and $(\sigma_{a},\sigma_{b},\sigma_{e})|{\bm{\beta}}$ with direct sampling (which is straightforward using the conditional independence of $\sigma_{a}$ , $\sigma_{b}$ and $\sigma_{e}$ given ${\bm{\beta}}$ ). Given Corollary 2, instead, one can choose the optimal parametrization ${\bm{\beta}}$ given $(\sigma_{a},\sigma_{b},\sigma_{e})$ on-the-fly according to Table 1. This ensures that the sampling step ${\bm{\beta}}|(\sigma_{a},\sigma_{b},\sigma_{e})$ will have a high efficiency, regardless of the values of $(I,J,K,\sigma_{a},\sigma_{b},\sigma_{e})$ . Note that the additional computational cost required by choosing the optimal parametrization according to Table 1 at each step is negligible compared to the cost of a Gibbs Sampling iteration.
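The decision rule of Corollary 2 is cheap to evaluate inside a sampler. As a minimal sketch (the function name `optimal_parametrization` and its return convention are hypothetical), it could be coded in Python as follows and called at every iteration with the current values of $(\sigma_{a},\sigma_{b},\sigma_{e})$ .

```python
def optimal_parametrization(I, J, K, sigma_a, sigma_b, sigma_e):
    """Return (centre_middle, centre_lowest) following the rule of Corollary 2."""
    va = sigma_a**2 / I                # normalized variance at the a level
    vb = sigma_b**2 / (I * J)          # normalized variance at the b level
    ve = sigma_e**2 / (I * J * K)      # normalized variance of the noise
    centre_lowest = vb >= ve           # use eta instead of b
    centre_middle = va >= vb + ve      # use gamma instead of a
    return centre_middle, centre_lowest

# Human-readable labels for the four parametrizations.
LABELS = {
    (True, True): "(mu, gamma, eta)",
    (True, False): "(mu, gamma, b)",
    (False, True): "(mu, a, eta)",
    (False, False): "(mu, a, b)",
}

# Hyperparameter values of the illustrative example of Section 2.1.
choice = optimal_parametrization(100, 100, 5, 10, 10**-0.5, 10)
label = LABELS[choice]
```

For the illustrative example the normalized variances are $\tilde{\sigma}_{a}^{2}=1$ , $\tilde{\sigma}_{b}^{2}=10^{-5}$ and $\tilde{\sigma}_{e}^{2}=2\times 10^{-3}$ , so the rule selects the mixed parametrization $(\mu,\bm{\gamma},\textbf{b})$ .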

Figure 7 compares the resulting autocorrelation functions in the context of the illustrative example of Section 2.1, with unknown variances $(\sigma_{a},\sigma_{b},\sigma_{e})$ .

We compare the Gibbs Sampler with optimal parametrization (updating ${\bm{\beta}}|(\sigma_{a},\sigma_{b},\sigma_{e})$ with $GS({\bm{\beta}})$ , where ${\bm{\beta}}$ is the optimal parametrization chosen according to Table 1, and $(\sigma_{a},\sigma_{b},\sigma_{e})|{\bm{\beta}}$ exactly) with the centred Gibbs Sampler (updating $(\mu,\bm{\gamma},\bm{\eta})|(\sigma_{a},\sigma_{b},\sigma_{e})$ with $GS(0,0)$ and $(\sigma_{a},\sigma_{b},\sigma_{e})|(\mu,\bm{\gamma},\bm{\eta})$ exactly) and a blocked Gibbs Sampler (updating ${\bm{\beta}}|(\sigma_{a},\sigma_{b},\sigma_{e})$ and $(\sigma_{a},\sigma_{b},\sigma_{e})|{\bm{\beta}}$ exactly), which can be implemented because the distribution of ${\bm{\beta}}|(\sigma_{a},\sigma_{b},\sigma_{e})$ is multivariate Gaussian. All schemes are then combined with the parameter expansion methodology of Meng and Van Dyk (1999); Liu and Wu (1999). The results in Figure 7 show that using the optimal parametrization reduces significantly the autocorrelation compared to, e.g., the fully centred one, and achieves a mixing that is basically equivalent to the one obtained by the exact blocked Gibbs Sampler. In all cases parameter expansion helps to reduce the autocorrelation in the samples of the standard deviations $(\sigma_{a},\sigma_{b},\sigma_{e})$ .

The similarity of performances between the Gibbs Sampler with optimal parametrization and the blocked one is interesting because the Gibbs update of ${\bm{\beta}}|(\sigma_{a},\sigma_{b},\sigma_{e})$ only requires univariate updates and has a potentially lower computational cost compared to a full multivariate block update of ${\bm{\beta}}|(\sigma_{a},\sigma_{b},\sigma_{e})$ , which requires large matrix operations. While these matrix operations can be performed efficiently in the context of nested linear models (see e.g. Papaspiliopoulos and Zanella, 2017), their cost becomes significantly larger for example in the context of crossed random effect models (see Section 4 below and Papaspiliopoulos et al., 2019). Note that such a similarity of performances is not surprising given our theoretical results above. In fact Corollary 2 guarantees that the sampler $GS({\bm{\beta}})$ used in the Gibbs update has a rate of convergence upper bounded by $2/3$ , which is well separated from 1. When such updates are nested within a larger sampler (e.g. the one updating ${\bm{\beta}}|(\sigma_{a},\sigma_{b},\sigma_{e})$ and $(\sigma_{a},\sigma_{b},\sigma_{e})|{\bm{\beta}}$ ) the difference between an exact update of ${\bm{\beta}}$ and a Gibbs one with good rate of convergence can easily become negligible.

4 Multigrid decomposition for crossed effect models

Interestingly, the multigrid decomposition can be used to analyze non-nested models. In this section we focus on the following crossed effect model.

Model Ck (k-factors crossed-effects model).

[TABLE]

with $a^{(s)}_{i_{s}}\stackrel{{\scriptstyle iid}}{{\sim}}N(0,1/\tau_{s})$ for $s\in\{1,\dots,k\}$ , $\epsilon_{i_{1}\dots i_{k}}\stackrel{{\scriptstyle iid}}{{\sim}}N(0,1/\tau_{e})$ and $p(\mu)\propto 1$ . We denote the number of observed datapoints by $N=\prod_{s=1}^{k}n_{s}$ .

Similarly to Sections 2 and 3, we use bold letters to denote the following vectors: $\bm{y}=(y_{i_{1}\dots i_{k}})_{i_{1},\dots,i_{k}}$ , ${\bm{a}}^{(s)}=(a^{(s)}_{i_{s}})_{i_{s}}$ , ${\bm{a}}=({\bm{a}}^{(1)},\dots,{\bm{a}}^{(k)})$ and ${\bm{a}}^{(-s)}=({\bm{a}}^{(1)},\dots,{\bm{a}}^{(s-1)},{\bm{a}}^{(s+1)},\dots,{\bm{a}}^{(k)})$ . The standard Gibbs Sampler to sample from the posterior distribution $\mathcal{L}(\mu,{\bm{a}}|\bm{y})$ of Model Ck is defined as follows.

Sampler GS-crossed.

At each iteration

1. sample $\mu$ from $\mathcal{L}(\mu|{\bm{a}},\bm{y})$ ,

2. sample ${\bm{a}}^{(s)}$ from $\mathcal{L}\left({\bm{a}}^{(s)}|\mu,{\bm{a}}^{(-s)},\bm{y}\right)$ with $s$ going from $1$ to $k$ .

Model Ck and Sampler GS-crossed have recently been analysed in Papaspiliopoulos et al. (2019) using the multigrid decomposition approach developed in Section 3 of this paper to derive expressions for the convergence rate of Sampler GS-crossed. In particular, Papaspiliopoulos et al. (2019) considered the following linear functions of ${\bm{a}}$

[TABLE]

for each $s\in\{1,\dots,k\}$ and proved the following result.

Theorem 4 (Papaspiliopoulos et al. (2019)).

Let

[TABLE]

be the Markov chain generated by Sampler GS-crossed. Then the time-wise transformations $\left((\mu,\bar{a}^{(1)},\dots,\bar{a}^{(k)})(t)\right)_{t=1}^{\infty}$ and $\left(\delta{\bm{a}}^{(1)}(t)\right)_{t=1}^{\infty}$ , …, $\left(\delta{\bm{a}}^{(k)}(t)\right)_{t=1}^{\infty}$ are $(k+1)$ independent Markov chains. Moreover, the rate of convergence of $\left((\mu,{\bm{a}})(t)\right)_{t=1}^{\infty}$ is

[TABLE]

Theorem 4 implies that the convergence properties of Sampler GS-crossed deteriorate as $N$ increases because $\max_{s\in\{1,\dots,k\}}N\tau_{e}(N\tau_{e}+n_{s}\tau_{s})^{-1}$ goes to 1 as $N\to\infty$ . Motivated by this consideration, Papaspiliopoulos et al. (2019) propose a collapsed Gibbs Sampler that avoids such slowdown for increasing data size while preserving the same computational cost per iteration of Sampler GS-crossed. In the following two sections we extend the analysis of Model Ck performed in Papaspiliopoulos et al. (2019), focusing on the role of, respectively, reparametrizations and statistical identifiability.
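The slowdown can be illustrated with a few lines of Python. The sketch below evaluates the ratio $N\tau_{e}/(N\tau_{e}+n_{s}\tau_{s})$ , maximized over $s$ , for a balanced two-factor design with growing $n_{1}=n_{2}$ ; the unit precisions and the function name `crossed_rate` are illustrative assumptions.

```python
def crossed_rate(n, tau, tau_e):
    """max_s N*tau_e / (N*tau_e + n_s*tau_s) for factor sizes n and precisions tau."""
    N = 1
    for n_s in n:
        N *= n_s                       # N is the product of the factor sizes
    return max(N * tau_e / (N * tau_e + n_s * t_s) for n_s, t_s in zip(n, tau))

# Balanced two-factor designs of increasing size, all precisions set to 1.
rates = [crossed_rate((m, m), (1.0, 1.0), 1.0) for m in (10, 100, 1000)]
```

With these values the ratio equals $m/(m+1)$ , so it increases monotonically towards 1 as the design grows.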

4.1 Reparametrizations and crossed effects models

In the context of nested models, reparametrization techniques based on hierarchical centering offer a way to make the Gibbs Sampler robust to large datasets (see e.g. Corollary 2). We now show that this is not the case in the crossed effects context of Model Ck. In this section we focus on the case $k=2$ , which is a case often studied theoretically in the literature (see e.g. Gao and Owen (2017); Brown et al. (2018) for recent examples). In this case, hierarchical centering leads to four possible parametrizations defined as

[TABLE]

Each parametrization corresponds to a different Gibbs Sampler, which at each iteration updates $\mu$ from $\mathcal{L}(\mu|{\bm{\beta}}^{(1)},{\bm{\beta}}^{(2)},\bm{y})$ , ${\bm{\beta}}^{(1)}$ from $\mathcal{L}\big{(}{\bm{\beta}}^{(1)}|\mu,{\bm{\beta}}^{(2)},\bm{y}\big{)}$ , and ${\bm{\beta}}^{(2)}$ from $\mathcal{L}\big{(}{\bm{\beta}}^{(2)}|\mu,{\bm{\beta}}^{(1)},\bm{y}\big{)}$ . The following result characterizes the rate of convergence $\rho_{\lambda_{1}\lambda_{2}}$ of such Gibbs Samplers for all combinations $(\lambda_{1},\lambda_{2})\in\{0,1\}^{2}$ .

Theorem 5.

Let $r_{1}=\frac{N\tau_{e}}{N\tau_{e}+n_{1}\tau_{1}}$ and $r_{2}=\frac{N\tau_{e}}{N\tau_{e}+n_{2}\tau_{2}}$ . Then we have

[TABLE]

Figure 8 summarizes graphically the results of Theorem 5, showing the dependence of the convergence rates on the choice of parametrization. The rate displayed in Figure 8 for the fully centred parametrization is the lower bound given in (9).

Theorem 5 implies that centering both factors (i.e. setting $\lambda_{1}=\lambda_{2}=0$ ) is always computationally worse than any of the other parametrizations because $\rho_{00}\geq\max\{\rho_{11},\rho_{01},\rho_{10}\}$ . On the other hand, the optimal choice of $(\lambda_{1},\lambda_{2})$ among $(1,1)$ , $(0,1)$ and $(1,0)$ depends on the specific values of $r_{1}$ and $r_{2}$ . More precisely, the expressions in (9) imply that the convergence rate is minimized by centering the first factor (i.e. setting $\lambda_{1}=0$ ) if and only if $r_{1}\geq(2-r_{2})^{-1}$ and centering the second factor (i.e. setting $\lambda_{2}=0$ ) if and only if $r_{2}\geq(2-r_{1})^{-1}$ . These results are in agreement with, for example, the empirical results obtained in Gelfand et al. (1996, Sec.6) and Browne (2004).
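The two centering conditions above translate directly into code. The following Python sketch (with the hypothetical helper `centring_choice`) returns the indicators $(\lambda_{1},\lambda_{2})$ suggested by the rule and checks on a grid of $(r_{1},r_{2})\in(0,1)^{2}$ that the two conditions never hold simultaneously, consistent with the fully centred parametrization never being selected.

```python
def centring_choice(r1, r2):
    """Return (lambda1, lambda2), where 0 means centred and 1 non-centred."""
    lam1 = 0 if r1 >= 1.0 / (2.0 - r2) else 1
    lam2 = 0 if r2 >= 1.0 / (2.0 - r1) else 1
    return lam1, lam2

# Grid of values strictly inside (0, 1).
grid = [i / 50 for i in range(1, 50)]
both_centred = [(a, b) for a in grid for b in grid
                if centring_choice(a, b) == (0, 0)]

example = centring_choice(0.9, 0.2)    # r1 large, r2 small
```

The two conditions together would require $(1-r_{1})(1-r_{2})\leq 0$ , which cannot happen for $r_{1},r_{2}<1$ ; this is why `both_centred` is expected to be empty.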

More crucially, Theorem 5 implies that $\min\{\rho_{00},\rho_{01},\rho_{10},\rho_{11}\}\rightarrow 1$ as $n_{1},n_{2}\rightarrow\infty$ . Therefore, regardless of the parametrization chosen, the convergence of Gibbs Samplers targeting Model Ck deteriorates as the numbers of factor levels $n_{1}$ and $n_{2}$ increase. This is in contrast with the nested case analysed in Section 3, where reparametrization techniques are successful in providing samplers with good convergence properties for all choices of hyperparameter values. In the next section we show that a more effective way to achieve good convergence properties is to impose stronger identifiability constraints.

4.2 Connections to statistical identifiability

The parameters $(\mu,{\bm{a}}^{(1)},\dots,{\bm{a}}^{(k)})$ in Model Ck are not identifiable, in the sense that the mapping $(\mu,{\bm{a}}^{(1)},\dots,{\bm{a}}^{(k)})\rightarrow\mathcal{L}(\bm{y}|\mu,{\bm{a}}^{(1)},\dots,{\bm{a}}^{(k)})$ is not injective. While this is not strictly speaking an issue for Bayesian inferences, one may wonder whether imposing identifiability on model parameters results in avoiding the degradation of mixing described in previous sections (see e.g. Vines et al. (1996); Gelfand and Sahu (1999); Xie and Carlin (2006); Kaufman et al. (2010); Vallejos et al. (2015) for related discussion and some examples in applications). We consider imposing identifiability by conditioning on some linear constraints, such as the commonly used choices of $a^{(s)}_{1}=0$ or $\bar{a}^{(s)}=0$ . More generally, one can obtain identifiability for Model Ck by imposing a linear constraint $c_{s}=0$ for each $s$ from $1$ to $k$ , where $c_{s}=\sum_{j=1}^{n_{s}}w_{j}^{(s)}a^{(s)}_{j}$ is a linear combination of $(a_{1}^{(s)},\dots,a_{n_{s}}^{(s)})$ weighted by some non-negative terms $(w_{1}^{(s)},\dots,w_{n_{s}}^{(s)})$ satisfying $\sum_{j=1}^{n_{s}}w_{j}^{(s)}>0$ . Interestingly, one can exploit the multigrid decomposition to derive the convergence rates of the resulting Gibbs Samplers for all choices of weights $(w_{1}^{(s)},\dots,w_{n_{s}}^{(s)})$ .

Theorem 6.

The rate of convergence of Sampler GS-crossed conditioned on $c_{s}=0$ for $s=1,\dots,k$ is given by

[TABLE]

where $q_{s}=(\sum_{j=1}^{n_{s}}w_{j}^{(s)})^{2}/(n_{s}\sum_{j=1}^{n_{s}}(w_{j}^{(s)})^{2})$ .

Comparing (10) with (7) we can see that, since $(1-q_{s})\in[0,1)$ , the rate of convergence always decreases after imposing the identifiability constraints $c_{s}=0$ for $s=1,\dots,k$ . Thus, Theorem 6 implies that, in this context, imposing identifiability always improves the convergence properties of the Gibbs Sampler. To our knowledge, this is the first rigorous result of this kind in the Bayesian computation literature. On the other hand, the result also shows that imposing identifiability per se does not guarantee fast convergence. For example, Theorem 6 implies that the rate of convergence of Sampler GS-crossed conditioned on $a^{(s)}_{1}=0$ for each $s\in\{1,\dots,k\}$ is given by

[TABLE]

while the rate of convergence of Sampler GS-crossed conditioned on $\bar{a}^{(s)}=0$ for each $s\in\{1,\dots,k\}$ equals 0, i.e. the sampler produces i.i.d. draws from the posterior distribution $\mathcal{L}(\mu,{\bm{a}}|\bm{y},\bar{a}^{(1)}=\dots=\bar{a}^{(k)}=0)$ . While in both cases we observe an improvement over the original Gibbs Sampler in terms of convergence rates, the result shows that conditioning on $a^{(s)}_{1}=0$ for each $s\in\{1,\dots,k\}$ leads to a convergence rate that can still go to 1 as the data size increases. Interestingly (10) implies that the rate of convergence is minimized when $q_{s}$ is maximized, which happens when the weights in the linear constraints are constant, for example $w_{j}^{(s)}=n_{s}^{-1}$ for all $s=1,\dots,k$ and $j=1,\dots,n_{s}$ .
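The quantity $q_{s}$ of Theorem 6 is easy to evaluate for any choice of weights. The short Python sketch below (the function name `q` is hypothetical) computes it for the two constraints discussed above and for an arbitrary non-negative weight vector drawn at random for illustration.

```python
import numpy as np

def q(weights):
    """q_s = (sum_j w_j)^2 / (n_s * sum_j w_j^2) for non-negative weights."""
    w = np.asarray(weights, dtype=float)
    return w.sum() ** 2 / (w.size * np.sum(w ** 2))

n = 50
q_first = q(np.eye(n)[0])            # constraint a_1 = 0: all weight on one level
q_mean = q(np.full(n, 1.0 / n))      # constraint a_bar = 0: constant weights
q_rand = q(np.random.default_rng(1).uniform(size=n))   # illustrative weights
```

By the Cauchy-Schwarz inequality $q_{s}\leq 1$ , with equality for constant weights, while putting all the weight on a single level gives $q_{s}=1/n_{s}$ , which vanishes as $n_{s}$ grows.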

5 Beyond the Gaussian case: a Poisson example

The results of Section 4.2 provide guidance on the choice of which linear constraint to use to impose identifiability for models also beyond the Gaussian case. As an example, we consider the following crossed random effect model with Poisson likelihood, which is the simplest analogue of Model Ck in the context of count data.

Model CkP (Poisson crossed-effects model).

[TABLE]

with $a^{(s)}_{i_{s}}\stackrel{{\scriptstyle iid}}{{\sim}}Gamma(\alpha_{s},\beta_{s})$ for $s=1,\dots,k$ and $\mu\sim Gamma(\alpha_{\mu},\beta_{\mu})$ .

We consider sampling from the posterior distribution $\mathcal{L}(\mu,{\bm{a}}|\bm{y})$ of Model CkP using the standard Gibbs Sampler that, similarly to Sampler GS-crossed, at each iteration updates $\mu$ from $\mathcal{L}(\mu|{\bm{a}},\bm{y})$ and then ${\bm{a}}^{(s)}$ from $\mathcal{L}\left({\bm{a}}^{(s)}|\mu,{\bm{a}}^{(-s)},\bm{y}\right)$ for $s=1,\dots,k$ . Here $\bm{y}$ , ${\bm{a}}$ and ${\bm{a}}^{(-s)}$ are defined as in the beginning of Section 4.
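For concreteness, the following Python sketch implements such a Gibbs Sampler for $k=2$ . It relies on two assumptions that are not confirmed by the text above: that the likelihood of Model CkP is $y_{i_{1}i_{2}}\sim Poisson(\mu\,a^{(1)}_{i_{1}}a^{(2)}_{i_{2}})$ , and that $Gamma(\alpha,\beta)$ denotes the shape-rate parametrization. Under these assumptions all full conditionals are Gamma distributions by conjugacy. The function name `gibbs_poisson_crossed` is hypothetical and this is not the authors' R implementation.

```python
import numpy as np

def gibbs_poisson_crossed(y, n_iter, alpha=(2.0, 2.0), beta=(0.1, 0.1),
                          alpha_mu=2.0, beta_mu=0.1, seed=0):
    """Unconstrained Gibbs sampler for the assumed two-factor Poisson model."""
    rng = np.random.default_rng(seed)
    n1, n2 = y.shape
    a1, a2 = np.ones(n1), np.ones(n2)
    row, col, tot = y.sum(axis=1), y.sum(axis=0), y.sum()
    mu_trace = np.empty(n_iter)
    for t in range(n_iter):
        # NumPy's gamma takes (shape, scale), so scale = 1 / rate.
        mu = rng.gamma(alpha_mu + tot, 1.0 / (beta_mu + a1.sum() * a2.sum()))
        a1 = rng.gamma(alpha[0] + row, 1.0 / (beta[0] + mu * a2.sum()))
        a2 = rng.gamma(alpha[1] + col, 1.0 / (beta[1] + mu * a1.sum()))
        mu_trace[t] = mu
    return mu_trace, a1, a2

# Small synthetic count table, used only to exercise the sampler.
data_rng = np.random.default_rng(42)
y = data_rng.poisson(3.0, size=(5, 6))
mu_trace, a1, a2 = gibbs_poisson_crossed(y, n_iter=200)
```

The sketch covers only the unconstrained sampler; the constrained versions compared below require conditioning on the linear constraints and are not reproduced here.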

We explore the extent to which the conclusions drawn from Theorem 6 apply also to Model CkP by means of simulations. We consider the case $k=2$ with three different combinations of values of $n_{1}$ and $n_{2}$ , with data $(y_{i_{1}\dots i_{k}})_{i_{1}\dots i_{k}}$ simulated from the model. For the prior hyperparameters we use $\alpha_{1}=\alpha_{2}=\alpha_{\mu}=2$ and $\beta_{1}=\beta_{2}=\beta_{\mu}=0.1$ . We compare the standard Gibbs Sampler with no constraints with the versions obtained by imposing the linear constraints $a^{(1)}_{1}=a^{(2)}_{1}=1$ and $\bar{a}^{(1)}=\bar{a}^{(2)}=1$ , respectively, where $\bar{a}^{(1)}$ and $\bar{a}^{(2)}$ are defined as in (6). Table 2 reports the resulting effective sample sizes for various parameters (namely $\mu$ , the average effective sample size of $a^{(1)}_{i}$ for $i=2,\dots,n_{1}$ and similarly for $a^{(2)}_{j}$ over $j=2,\dots,n_{2}$ ).

The results in Table 2 are consistent with the theoretical results obtained in the Gaussian case in Theorem 6. In particular, imposing identifiability always improves mixing of the Gibbs Sampler, and imposing constraints on $\bar{a}^{(1)}$ and $\bar{a}^{(2)}$ leads to a sampler with faster convergence compared to imposing constraints on $a^{(1)}_{1}$ and $a^{(2)}_{1}$ . Moreover, the difference in resulting efficiency between the two sets of linear constraints increases with $n_{1}$ and $n_{2}$ , as suggested by Theorem 6.

5.1 Comparison with Hamiltonian Monte Carlo

Finally, we also explore whether the results in Theorem 6 are useful also to guide the implementation of other MCMC schemes targeting Model CkP, such as Hamiltonian Monte Carlo (HMC) (Neal et al., 2011) and the No-U-Turn Sampler (NUTS) (Hoffman and Gelman, 2014) implemented in the widely used software STAN (Carpenter et al., 2017). We consider the same setting of the rightmost column of Table 2 where $n_{1}=n_{2}=100$ .

Table 3 reports effective sample sizes (ESS), runtime and ESS per unit of computation time for HMC and NUTS, and compares those to the ones of the Gibbs Sampler (see also the supplementary material for traceplots and autocorrelation functions). Table 3 suggests that imposing identifiability through linear constraints as suggested by Theorem 6 also significantly helps gradient-based samplers such as HMC and NUTS, both by speeding up convergence (higher ESS) and by reducing the cost per iteration (lower runtime). We expect the reduction in runtime for HMC and NUTS to arise from the adaptation of the tuning parameters performed by STAN (e.g. step-size and number of leapfrog steps within each iteration) so that for better identified and less correlated targets (as the one with the linear constraints) the number of required leapfrog steps per iteration is lower, leading to a reduction in runtime. Overall, imposing identifiability results in a significantly higher sampling efficiency: for example, when the constraints $\bar{a}^{(1)}=\bar{a}^{(2)}=1$ are imposed, NUTS is over three orders of magnitude more efficient than in the unconstrained version. Also, especially for NUTS, the constraints $\bar{a}^{(1)}=\bar{a}^{(2)}=1$ lead to a more efficient sampler than the constraints $a^{(1)}_{1}=a^{(2)}_{1}=1$ , which is analogous to the results of Theorem 6. In general, Table 3 supports the fact that the results of Theorem 6 can also provide useful guidance to derive significantly more efficient implementations of gradient-based MCMC algorithms. Finally, note that the runtime of the Gibbs Sampler is orders of magnitude lower than that of HMC and NUTS, suggesting that for random effect models such as Model CkP gradient-based schemes can be much more costly than Gibbs-type schemes, which exploit more directly the conditional independence among random effects. We leave a more detailed investigation of these aspects in the context of more general and complex models to future work.

All simulations reported in Tables 2 and 3 were performed on the same desktop computer with 16GB of RAM and an Intel core i7-7700 @ 3.60 GHz processor, using the R programming language (R Core Team, 2018). Effective sample sizes are estimated using the mcmcse R package (Flegal et al., 2017). The supplementary material provides the R code used to implement the Gibbs Samplers and the Stan code used to specify the models.

Remark 2.

Interestingly, the multigrid decomposition can be applied also to Model CkP, with the appropriate modifications. In this case the Markov chain $\left((\mu,{\bm{a}})(t)\right)_{t=1}^{\infty}$ induced by the Gibbs Sampler can be decomposed into $(k+1)$ independent Markov chains $\left((\mu,\tilde{a}^{(1)},\dots,\tilde{a}^{(k)})(t)\right)_{t=1}^{\infty}$ and $\big{(}\tilde{\delta}{\bm{a}}^{(1)}(t)\big{)}_{t=1}^{\infty}$ , …, $\big{(}\tilde{\delta}{\bm{a}}^{(k)}(t)\big{)}_{t=1}^{\infty}$ , where $\tilde{a}^{(s)}=\sum_{i_{s}}a^{(s)}_{i_{s}}$ and $\tilde{\delta}a^{(s)}_{i_{s}}=a^{(s)}_{i_{s}}/\tilde{a}^{(s)}$ . In this case the rate of convergence of the original chain coincides with the one of $\left((\mu,\tilde{a}^{(1)},\dots,\tilde{a}^{(k)})(t)\right)_{t=1}^{\infty}$ , which evolves according to a $(k+1)$ -dimensional Gibbs Sampler with full conditionals given by:

[TABLE]

where $y_{\cdot}=\sum_{i_{1},\dots,i_{k}}y_{i_{1}\dots i_{k}}$ . We expect such a $(k+1)$ -dimensional Gibbs Sampler to be potentially amenable to analysis using the framework of iterated random functions (Diaconis and Freedman, 1999), in order to obtain an upper bound on convergence rates (see e.g. Alsmeyer and Fuh, 2001, Theorem 2.1.(b)). We leave these extensions to future work and mention them in Section 8 as a possible avenue for future research.

6 Non-symmetric hierarchical models

Section 3 describes how to choose the optimal parametrization as a function of $(I,J,K,\sigma_{a},\sigma_{b},\sigma_{e})$ for Model S3. In general, both the variance terms $\sigma_{b}^{2}$ and $\sigma_{e}^{2}$ , and the number of branches $J$ and $K$ in the hierarchy could depend on $i$ and $j$ . In this section we consider such non-symmetric cases for two and three level hierarchical models. In these non-symmetric cases the computationally optimal strategy will involve centering some branches of the hierarchy and non-centering others: we will refer to these strategies as bespoke parametrizations.

Consider the following non-symmetric 2-levels model (which we describe in terms of precisions rather than variances for notational convenience).

Model NS2 (Non-symmetric 2-levels hierarchical model).

Consider the following 2-levels model with centred parametrization

[TABLE]

where the precision components $(\tau_{a},(\tau_{e,i})_{i})$ are assumed to be known.

Papaspiliopoulos et al. (2003) studied the symmetric version of Model NS2, where $J_{i}=J$ and $\tau_{e,i}=\tau_{e}$ for all $i$ and some fixed $J$ and $\tau_{e}$ . They showed that the rates of convergence induced by the centred and non-centred parametrizations are given respectively by

[TABLE]

where $\tilde{\tau}_{e}=J\tau_{e}$ . The following Theorem provides an extension to the general non-symmetric case. We consider Model NS2 with a bespoke parametrization $(\mu,\beta_{1},\dots,\beta_{I})$ defined by $I$ indicators $(\lambda_{1},\dots,\lambda_{I})\in\{0,1\}^{I}$ as $\beta_{i}=\gamma_{i}-\lambda_{i}\mu$ , meaning that $\lambda_{i}$ equals 0 if component $i$ is centred and 1 if it is non-centred.

Theorem 7.

The rate of convergence of the Gibbs Sampler targeting Model NS2 with bespoke parametrization given by $(\lambda_{1},\dots,\lambda_{I})\in\{0,1\}^{I}$ is

[TABLE]

where $\tilde{\tau}_{i}=J_{i}\tau_{e,i}$ .

Equation (14) shows that in the non-symmetric case, the GS rate of convergence is given by a weighted average of the precision ratios $\frac{\tau_{a}}{\tilde{\tau}_{i}+\tau_{a}}$ and $\frac{\tilde{\tau}_{i}}{\tilde{\tau}_{i}+\tau_{a}}$ depending on whether each component is centred or not. This has clear analogies with the symmetric case in (13). The weights in the average of (14) are themselves a function of $(\lambda_{1},\dots,\lambda_{I})$ , thus introducing dependence across components in terms of centering and the overall convergence rate. Nonetheless, the following corollary shows that even in the context of Model NS2, the choice of optimal parametrization in each branch of the tree can be carried out independently following the same intuition of the symmetric case: for each $i$ in $\{1,\dots,I\}$ use centred parametrization $\gamma_{i}$ if and only if $\tau_{a}\leq J_{i}\tau_{e,i}$ , otherwise use a non-centred parametrization $a_{i}=\gamma_{i}-\mu$ . Note that the optimal choice on each branch of the hierarchy can be taken independently of other branches, which makes the implementation easy (compared to a scenario where the optimal decision on each branch was influenced by other branches).

Corollary 3.

Let $\bar{\lambda}_{i}=\mathbbm{1}(\tau_{a}>\tilde{\tau}_{i})$ for all $i$ from 1 to $I$ . Then

[TABLE]

By (14), the strategy described in Corollary 3 ensures that $\rho_{\bar{\lambda}_{1}\dots\bar{\lambda}_{I}}\leq 1/2$ . This is the same upper bound one can obtain in the symmetric case (see (13) and Papaspiliopoulos et al. (2003)), meaning that in this case bespoke parametrizations are successful in dealing with the lack of symmetry.
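The bespoke strategy can be explored numerically. The Python sketch below assumes that Model NS2 has the form $y_{ij}\sim N(\gamma_{i},1/\tau_{e,i})$ for $j=1,\dots,J_{i}$ , $\gamma_{i}\sim N(\mu,1/\tau_{a})$ and $p(\mu)\propto 1$ (the displayed definition is not reproduced above, so this is an assumption), and computes the rate of the two-block Gibbs Sampler alternating $\mu$ and $(\beta_{1},\dots,\beta_{I})$ directly from the entries of the posterior precision matrix. The function names are hypothetical.

```python
import numpy as np

def ns2_rate(lam, tau_a, tau_tilde):
    """Two-block Gibbs rate for beta_i = gamma_i - lam_i * mu (assumed Model NS2)."""
    lam = np.asarray(lam, dtype=float)
    tau_tilde = np.asarray(tau_tilde, dtype=float)
    q_mu = np.sum(lam * tau_tilde + (1 - lam) * tau_a)   # precision entry of mu
    q_beta = tau_tilde + tau_a                           # precision entry of each beta_i
    q_cross = lam * tau_tilde - (1 - lam) * tau_a        # mu / beta_i cross terms
    return np.sum(q_cross**2 / q_beta) / q_mu

def bespoke(tau_a, tau_tilde):
    """Indicators of Corollary 3: non-centre branch i iff tau_a > tau_tilde_i."""
    return (tau_a > np.asarray(tau_tilde)).astype(int)

tau_a = 1.0
tau_tilde = np.array([10.0, 0.1, 3.0, 0.5])   # illustrative values of J_i * tau_{e,i}
lam_star = bespoke(tau_a, tau_tilde)
rate_star = ns2_rate(lam_star, tau_a, tau_tilde)
rate_cp = ns2_rate(np.zeros(4), tau_a, tau_tilde)    # all branches centred
rate_ncp = ns2_rate(np.ones(4), tau_a, tau_tilde)    # all branches non-centred
```

Under these assumptions the computed rate is a weighted average of the branch-wise precision ratios, in line with the description of (14), and for the illustrative values used here the bespoke choice improves on both the fully centred and the fully non-centred parametrizations.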

Consider now the three-level non-symmetric case.

Model NS3 (Non-symmetric 3-levels hierarchical model).

Consider a more general 3-levels model with centred parametrization

[TABLE]

where variance components are assumed to be known.

In this case the multigrid factorization of Theorem 1 does not apply directly to Model NS3, but nonetheless it can still be used to obtain upper bounds on the rates of convergence.

Theorem 8.

Given an instance of Model NS3 we define

[TABLE]

If $a/b^{(i)}\geq a/b^{(i^{\prime})}e/b^{(i^{\prime})}$ for every $i,i^{\prime}\in\{1,\dots,I\}$ , then the rate of convergence of the Gibbs Sampler with centred parametrization $(\mu,\bm{\gamma},\bm{\eta})$ satisfies

[TABLE]

The results of Theorem 8 suggest that as the number of datapoints increases the efficiency of the Gibbs sampler with centred parametrization increases. In fact, as $K_{ij}$ increases the assumptions of Theorem 8 are eventually satisfied and the bound on the convergence rate goes to [math] as $J_{i}$ and $K_{ij}$ increase. Theorem 8 provides rigorous theoretical support and characterization of the well known fact that the centred parametrization is to be preferred in contexts of large and informative datasets (Gelfand et al., 1995; Papaspiliopoulos et al., 2003). We note that the convergence rate for the Gibbs Sampler targeting Model NS3 is not easily tractable, and that deriving analytic expressions for the optimal bespoke parametrization in this context is still an open problem.

7 Hierarchical linear models with arbitrary number of levels

In this section we consider Gaussian hierarchical models with an arbitrary number of levels, namely $k$ levels. We refer to the highest level of the hierarchy (i.e. the one furthest away from the data) as level 0, down to level $k-1$ being the lowest level (i.e. closest to the data). The 3 level model of Section 3 is a special case of the theory developed here where $k=3$ .

7.1 Model formulation

In order to allow for more generality and keep the notation concise, in this section we will use a graphical models notation. In particular $T$ will denote a finite tree with $k$ levels and root $t_{0}\in T$ . For each node $t\in T$ we will denote by $pa(t)$ the unique parent of $t$ and by $ch(t)$ the collection of children of $t$ . Moreover we write $s\preceq t$ and $s\succeq t$ if $s$ is respectively an ancestor or a descendant of $t$ (with $s$ and $t$ possibly being equal) while $s\prec t$ and $s\succ t$ denote the same notions with the additional condition of $s\neq t$ . For each node $t\in T$ we denote by $\ell(t)$ the level of node $t$ (i.e. its distance from $t_{0}$ ). For each $d\in\{0,\dots,k-1\}$ we denote by $T_{d}=\{t\in T\,:\,\ell(t)=d\}$ the collection of nodes at level $d$ . For example we have $T_{0}=\{t_{0}\}$ and $T=\cup_{d=0}^{k-1}T_{d}$ . Noisy observations will occur only at leaf nodes. The collection of leaf nodes is denoted as $T_{L}=\{t\in T\,:\,ch(t)=\emptyset\}$ . For simplicity we assume that all leaf nodes are at level $k-1$ , i.e. $T_{L}=T_{k-1}$ , although this assumption could be easily relaxed allowing some branches to be longer than others.

Model NSk ( $k$ -levels hierarchical model).

Suppose that we have a hierarchy described by a tree $T$ with observations occurring at leaf nodes $T_{L}$ . We assume the following hierarchical model

[TABLE]

where $i\in\{1,\dots,n_{t}\}$ with $n_{t}$ being the number of observed data at leaf node $t$ , $(\tau_{t})_{t\in T\backslash t_{0}}$ and $(\tau_{t}^{(e)})_{t\in T_{L}}$ are known precision components and all normal random variables are sampled independently. Following the standard Bayesian model specification we assume a flat prior on $\gamma_{t_{0}}$ .

We are interested in sampling from the posterior distribution of ${\bm{\gamma}}_{T}=(\gamma_{t})_{t\in T}$ given some observations $\textbf{y}=(y_{t})_{t\in T_{L}}$ . The centred parametrization ${\bm{\gamma}}_{T}$ of Model NSk leads to the following Gibbs Sampler.

Sampler GS( ${\bm{\gamma}}_{T}$ ).

Initialize ${\bm{\gamma}}_{T}(0)$ and then iterate the following kernel:

For $d=0,\dots,k-1$ , sample $\gamma_{t}(s+1)$ from $p(\gamma_{t}|{\bm{\gamma}}_{T_{d-1}}(s+1),{\bm{\gamma}}_{T_{d+1}}(s),\textbf{y})$ for all $t\in T_{d}$ , where $p(\gamma_{t}|{\bm{\gamma}}_{T_{d-1}},{\bm{\gamma}}_{T_{d+1}},\textbf{y})=p(\gamma_{t}|{\bm{\gamma}}_{T\backslash t},\textbf{y})$ is the full conditional distribution of $\gamma_{t}$ given by Model NSk. When $d$ equals 0 or $k-1$ the levels ${\bm{\gamma}}_{T_{d-1}}$ and ${\bm{\gamma}}_{T_{d+1}}$ have to be replaced by empty sets in the conditioning.
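As an illustration, one sweep of Sampler GS( ${\bm{\gamma}}_{T}$ ) can be sketched in Python as follows. This is a minimal sketch and not a reference implementation: it assumes that Model NSk takes the form $y_{t,i}\sim N(\gamma_{t},1/\tau_{t}^{(e)})$ at the leaves and $\gamma_{t}\sim N(\gamma_{pa(t)},1/\tau_{t})$ for $t\neq t_{0}$ , and the tree encoding (a `parent` array with `-1` at the root), the function names and the toy data are hypothetical.

```python
import numpy as np

def full_conditional(t, gamma, parent, children, tau, tau_e, y):
    """Mean and precision of gamma_t given all other parameters and the data
    (under the assumed Gaussian form of Model NSk)."""
    prec, lin = 0.0, 0.0
    if parent[t] >= 0:                  # prior term linking t to its parent
        prec += tau[t]
        lin += tau[t] * gamma[parent[t]]
    for c in children[t]:               # prior terms linking t to its children
        prec += tau[c]
        lin += tau[c] * gamma[c]
    if len(y[t]) > 0:                   # likelihood term (leaf nodes only)
        prec += len(y[t]) * tau_e[t]
        lin += tau_e[t] * float(np.sum(y[t]))
    return lin / prec, prec

def gibbs_sweep(gamma, parent, level, tau, tau_e, y, rng):
    """One sweep of GS(gamma_T): update levels d = 0, ..., k-1 in order, in place."""
    n = len(parent)
    children = [[c for c in range(n) if parent[c] == t] for t in range(n)]
    for d in range(max(level) + 1):
        for t in range(n):
            if level[t] == d:
                mean, prec = full_conditional(t, gamma, parent, children, tau, tau_e, y)
                gamma[t] = mean + rng.standard_normal() / np.sqrt(prec)
    return gamma

# Hypothetical 2-level example: root 0 with two leaves 1 and 2.
parent = [-1, 0, 0]
level = [0, 1, 1]
tau = [0.0, 2.0, 2.0]        # tau[0] is unused (flat prior on the root)
tau_e = [0.0, 1.0, 1.0]      # only leaf entries are used
y = [[], [1.0, 3.0], [5.0]]
gamma = gibbs_sweep(np.zeros(3), parent, level, tau, tau_e, y, np.random.default_rng(0))
```

Nodes at the same level are updated one at a time here. Under the centred parametrization each full conditional involves only the parent and the children of $t$ , so this coincides with a joint update of the whole level.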

7.2 Non centering and hierarchical reparametrizations

Model NSk expresses Gaussian hierarchical models using a centred parametrization. The corresponding non-centred version is given by the following example.

Example 1 (Fully non-centred parametrization).

Under the same setting as Model NSk, define

[TABLE]

and assume a flat prior on $\alpha_{t_{0}}$ .

The non-centred parametrization ${\bm{\alpha}}_{T}$ can be obtained as a linear transformation of the centred version ${\bm{\gamma}}_{T}$ of Model NSk. More generally, we will consider the class of parametrizations that can be obtained by reparametrizing ${\bm{\gamma}}_{T}$ as follows.

Definition 1 (Hierarchical reparametrizations).

Let ${\bm{\gamma}}_{T}=(\gamma_{t})_{t\in T}$ be a random vector with elements indexed by a tree $T$ . We say that ${\bm{\beta}}_{T}=(\beta_{t})_{t\in T}$ is a hierarchical (linear) reparametrization of ${\bm{\gamma}}_{T}$ if

[TABLE]

for some real-valued coefficients $\Lambda=(\lambda_{tr})_{r\preceq t,t\in T}$ satisfying $\lambda_{tt}\neq 0$ for all $t\in T$ . We denote (17) by ${\bm{\beta}}_{T}=\Lambda{\bm{\gamma}}_{T}$ .

Using terminology from Papaspiliopoulos et al. (2003), we refer to the family of hierarchical reparametrizations of $\bm{\gamma}_{T}=(\gamma_{t})_{t\in T}$ as partially non-centred parametrizations (PNCP) of Model NSk. Note that (17) does not span the space of all linear transformations of ${\bm{\gamma}}_{T}$ . In fact $\Lambda=(\lambda_{tr})_{r\preceq t,t\in T}$ can be thought of as a $|T|\times|T|$ matrix $\Lambda=(\lambda_{tr})_{r,t\in T}$ inducing a linear transformation of ${\bm{\gamma}}_{T}$ with the additional sparsity constraint of being zero on all elements $\lambda_{tr}$ such that $r\npreceq t$ . The following Lemma shows that the definition of the class of PNCP does not depend on the starting parametrization used to formulate Model NSk. For example, one could equivalently define the class of PNCP of Model NSk as the collection of hierarchical reparametrizations of the non-centred parametrization ${\bm{\alpha}}_{T}$ of Example 1.

Lemma 1.

If ${\bm{\beta}}_{T}$ is a hierarchical reparametrization of ${\bm{\gamma}}_{T}$ , then also ${\bm{\gamma}}_{T}$ is a hierarchical reparametrization of ${\bm{\beta}}_{T}$ .
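Lemma 1 can be checked numerically on a small example. The sketch below draws a random matrix $\Lambda$ with the sparsity pattern of a hierarchical reparametrization (non-zero only when $r\preceq t$ , with non-zero diagonal) and verifies that its inverse has the same pattern. The tree, the helper names and the random coefficients are hypothetical and serve only as an illustration.

```python
import numpy as np

def ancestor_mask(parent):
    """mask[t, r] is True iff r equals t or r is an ancestor of t."""
    n = len(parent)
    mask = np.zeros((n, n), dtype=bool)
    for t in range(n):
        s = t
        while s != -1:          # walk from t up to the root
            mask[t, s] = True
            s = parent[s]
    return mask

def random_hierarchical_matrix(parent, rng):
    """Random Lambda that vanishes unless r is t or an ancestor of t."""
    mask = ancestor_mask(parent)
    lam = np.where(mask, rng.standard_normal(mask.shape), 0.0)
    np.fill_diagonal(lam, 1.0 + rng.random(len(parent)))   # lambda_tt != 0
    return lam

parent = [-1, 0, 0, 1, 1, 2]    # hypothetical 3-level tree, root = node 0
rng = np.random.default_rng(0)
lam = random_hierarchical_matrix(parent, rng)
lam_inv = np.linalg.inv(lam)
outside = ~ancestor_mask(parent)
# Largest entry of the inverse outside the allowed pattern (zero up to rounding).
print(np.abs(lam_inv[outside]).max())
```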

As for the 3-levels case we are interested in assessing the computational efficiency of the different Gibbs Sampling schemes arising from different PNCP’s. For each PNCP ${\bm{\beta}}_{T}$ the corresponding Gibbs Sampler scheme $GS({\bm{\beta}}_{T})$ is defined analogously to $GS({\bm{\gamma}}_{T})$ .

Sampler GS( ${\bm{\beta}}_{T}$ ).

Initialize ${\bm{\beta}}_{T}(0)$ and then iterate the following kernel:

For $d=0,\dots,k-1$ , sample $\beta_{t}(s+1)$ from $p(\beta_{t}|({\bm{\beta}}_{T_{p}}(s+1))_{0\leq p<d},({\bm{\beta}}_{T_{p}}(s))_{d<p\leq k-1},\textbf{y})$ for all $t\in T_{d}$ , where $p(\beta_{t}|({\bm{\beta}}_{T_{p}})_{0\leq p<d},({\bm{\beta}}_{T_{p}})_{d<p\leq k-1},\textbf{y})=p(\beta_{t}|{\bm{\beta}}_{T\backslash t},\textbf{y})$ is the full conditional distribution of $\beta_{t}$ given by Model NSk.

Sampler GS( ${\bm{\beta}}_{T}$ ) is easy to implement because the family of PNCP preserves the hierarchical structure of Model NSk. In fact, any PNCP of Model NSk exhibits the following conditional independence structure:

[TABLE]

Note that this is a weaker condition than the one holding for the centred parametrization ${\bm{\gamma}}_{T}$ . In the latter case, the conditional independence graph corresponds exactly to the tree $T$ , meaning that if $r\neq t$

[TABLE]

Despite being weaker than (T), condition (H) still guarantees that all parameters at the same level are conditionally independent (thus simplifying the update of ${\bm{\beta}}_{T_{d}}|{\bm{\beta}}_{T\backslash T_{d}}$ ) and that the full conditional distribution of each $\beta_{t}$ depends only on the descendants or ancestors of $t$ . The following Lemma and Corollary provide a simple way to check that any PNCP of Model NSk satisfies (H).

Lemma 2.

Property (H) is closed under hierarchical re-parametrizations, meaning that if ${\bm{\beta}}_{T}$ satisfies (H) then any hierarchical re-parametrization of ${\bm{\beta}}_{T}$ satisfies (H) too.

Corollary 4.

Any PNCP ${\bm{\beta}}_{T}$ of Model NSk satisfies (H).

7.3 Symmetry assumption

To provide a full analysis of the convergence properties of Sampler GS( ${\bm{\beta}}_{T}$ ) we need a symmetry assumption that we now define. Let $\rho_{tr}$ denote the partial correlation $Corr\left(\beta_{t},\beta_{r}\Big{|}{\bm{\beta}}_{T\backslash\{t,r\}}\right)$ , namely

$$\rho_{tr}=-\frac{Q_{tr}}{\sqrt{Q_{tt}Q_{rr}}}\qquad\hbox{for }t\neq r,$$

and $\rho_{tt}=1$ for all $t$ . Here $Q$ is the precision matrix of ${\bm{\beta}}_{T}$ . Let $\textbf{X}=(X_{\ell})_{\ell=0}^{k-1}$ be a random walk going through $T$ from root to leaves as follows: $X_{0}=t_{0}$ almost surely and then, for $\ell\in\{0,\dots,k-2\}$

[TABLE]

Equation (18) implies that at each step the Markov chain X jumps from the current state $r$ to one of its children $t\in ch(r)$ choosing $t$ proportionally to the squared partial correlation between $\beta_{r}$ and $\beta_{t}$ . Since $\ell(X_{d})=d$ almost surely for all $d\in\{0,\dots,k-1\}$ we can use the following simplified notation: for any $t$ and $r$ in $T$ we use $P(t)$ , $P(t|r)$ and $P(t\cap r)$ to denote respectively $P(X_{\ell(t)}=t)$ , $P(X_{\ell(t)}=t\,|\,X_{\ell(r)}=r)$ and $P(X_{\ell(t)}=t\,\cap\,X_{\ell(r)}=r)$ .
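The transition probabilities of X can be computed directly from a precision matrix $Q$ . The sketch below uses the standard expression $\rho_{tr}=-Q_{tr}/\sqrt{Q_{tt}Q_{rr}}$ for partial correlations and normalizes the squared partial correlations over the children of each node, as described above. The function names and the `parent` array encoding of the tree are hypothetical.

```python
import numpy as np

def partial_correlations(Q):
    """Partial correlations rho_tr = -Q_tr / sqrt(Q_tt Q_rr), with rho_tt = 1."""
    d = np.sqrt(np.diag(Q))
    rho = -Q / np.outer(d, d)
    np.fill_diagonal(rho, 1.0)
    return rho

def transition_probabilities(Q, parent):
    """P[r, t] = P(X jumps to child t | X is at r), proportional to rho_tr^2."""
    n = len(parent)
    rho2 = partial_correlations(Q) ** 2
    P = np.zeros((n, n))
    for r in range(n):
        ch = [t for t in range(n) if parent[t] == r]
        if ch:                                   # leaves have no outgoing jumps
            P[r, ch] = rho2[ch, r] / rho2[ch, r].sum()
    return P

# Hypothetical 2-level example: root 0 with children 1 and 2.
Q = np.array([[2.0, -1.0, -0.5],
              [-1.0, 3.0, 0.0],
              [-0.5, 0.0, 1.0]])
P = transition_probabilities(Q, [-1, 0, 0])
```

Since partial correlations do not change when $Q$ is replaced by $DQD$ for a positive diagonal matrix $D$ , the same transition probabilities are obtained after any coordinate-wise rescaling of ${\bm{\beta}}_{T}$ .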

Given the above definitions, we define the following symmetry condition: there exists a $k\times k$ symmetric matrix $C=(c_{dp})_{d,p=0}^{k-1}$ such that

[TABLE]

and $\rho_{tr}=0$ if $r\npreceq t$ and $t\npreceq r$ . Note that $\rho_{tr}$ is invariant to coordinate-wise rescaling of ${\bm{\beta}}_{T}$ and therefore both property (S) and the transition kernel of X are invariant to rescalings. Therefore we can consider, without loss of generality, the following rescaled version of ${\bm{\beta}}_{T}$ defined by

[TABLE]

Given (19), condition (S) can be written, in terms of the precision matrix of $\tilde{{\bm{\beta}}}_{T}=(\tilde{\beta}_{t})_{t\in T}$ as

[TABLE]

The rescaled version $\tilde{{\bm{\beta}}}_{T}$ will be helpful later to design an appropriate multigrid decomposition of ${\bm{\beta}}_{T}$ . Also, it can be seen that property ( $\widetilde{\hbox{S}}$ ) is closed under symmetric hierarchical reparametrizations.

Definition 2 (Symmetric hierarchical reparametrizations).

We say that ${\bm{\beta}}_{T}=\Lambda{\bm{\alpha}}_{T}$ is a symmetric hierarchical reparametrization of ${\bm{\alpha}}_{T}$ if the coefficients of $\Lambda=(\lambda_{tr})_{r\preceq t,t\in T}$ depend only on the levels of $r$ and $t$ in the hierarchy $T$ .

Lemma 3.

Property ( $\widetilde{\hbox{S}}$ ) is closed under symmetric hierarchical reparametrizations, meaning that if $\tilde{{\bm{\beta}}}_{T}$ satisfies ( $\widetilde{\hbox{S}}$ ) then any symmetric hierarchical reparametrization of $\tilde{{\bm{\beta}}}_{T}$ satisfies ( $\widetilde{\hbox{S}}$ ) too.

Various special cases of Model NSk satisfy assumption (S). For example, we now consider three cases: a fully symmetric case (both the tree $T$ and the variances $(\tau_{t})_{t\in T}$ are symmetric), a weakly symmetric case (non-symmetric tree and symmetric variances) and a non-symmetric case (both tree and variances non-symmetric).

Model Sk (Symmetric $k$ -levels hierarchical model).

Consider the $k$ -level Gaussian Hierarchical model where the observed data are generated from

[TABLE]

where $[N]=\{1,\dots,N\}$ for any positive integer $N$ . The parameters have the following hierarchical structure: for each level $d$ from $1$ to $k-1$

[TABLE]

Here $(\tau_{1},\dots,\tau_{k-1},\tau_{e})$ are known precisions and the root parameter $\gamma^{(0)}$ is given a flat prior $p(\gamma^{(0)})\propto 1$ . For each $d\in\{1,\dots,k-1\}$ the positive integer $I_{d}$ represents the number of branches from level $d-1$ to level $d$ .

It is easy to see that the posterior distribution of Model Sk, conditioned on the observed data $\textbf{y}=(y_{i_{1},\dots,i_{k-1},j})_{i_{1},\dots,i_{k-1},j}$ , satisfies (S). In this case the random walk X defined by (18) coincides with the natural random walk going through $T$ .

Example 2 (Weakly symmetric case).

Another special case of Model NSk satisfying (S) is given by the case of a general tree $T$ and precision terms defined as $\tau_{t}=\frac{\tau_{\ell(t)}}{\prod_{s\prec t}|ch(s)|}$ for all $t\in T$ and $\tau^{(e)}_{t}=\frac{\tau_{e}}{n_{t}\prod_{s\prec t}|ch(s)|}$ , where $(\tau_{1},\dots,\tau_{k},\tau_{e})\in\mathbb{R}_{+}^{k+1}$ are level-specific precision terms. This is an extension of Model Sk where the lack of symmetry of $T$ is compensated by appropriate variance terms. Condition (S) can be checked by evaluating the partial correlations $(\rho_{tr})_{t,r\in T}$ of the resulting vector ${\bm{\gamma}}_{T}$ .

Example 3 (Non-symmetric cases).

In both cases previously considered (Model Sk and Example 2) the auxiliary Markov chain X defined in (18) follows a natural random walk, in the sense that at each time the chain chooses the next state uniformly at random among children nodes. However, condition (S) is also satisfied by non-symmetric cases where X is not a natural random walk. In particular any instance of Model NSk such that

[TABLE]

for some $(k-1)$ -dimensional vector $(c_{0},\dots,c_{k-2})$ induces a posterior distribution satisfying (S). In fact, in the context of Model NSk conditions (S*) and (S) are equivalent (this can be derived from (T) and (18)).

The cases previously considered are expressed in terms of centred parametrization, meaning that as all the instances of Model NSk they satisfy (T). Nevertheless Lemma 3 shows that any symmetric hierarchical reparametrization of a vector satisfying ( $\widetilde{\hbox{S}}$ ) still satisfies ( $\widetilde{\hbox{S}}$ ). This implies, for example, that the fully non-centred version of Model Sk and any mixed strategy where some level is centred and some is not centred, still satisfies ( $\widetilde{\hbox{S}}$ ) (after rescaling).

Moreover, note that the exact analysis we will now provide for the Gibbs sampler on models satisfying ( $\widetilde{\hbox{S}}$ ) can be used to provide bounds on general cases that do not satisfy ( $\widetilde{\hbox{S}}$ ) (see for example Theorem 8).

7.4 Multigrid decomposition

We now show how to use the multigrid decomposition to analyze the Gibbs Sampler for random vectors ${\bm{\beta}}_{T}$ satisfying (H) and (S). Our aim is to provide a transformation of ${\bm{\beta}}_{T}$ that factorizes the Gibbs Sampler Markov Chain into independent and more tractable sub-chains. Similarly to Section 3 in the following we will often denote ${\bm{\beta}}_{T_{d}}=(\beta_{t})_{t\in T_{d}}$ by ${\bm{\beta}}^{(d)}$ . We proceed in two steps, first introducing the averaging operators $\phi^{(p)}$ and then the residual operators $\delta^{(p)}$ . For any $p\leq d$ the averaging operator $\phi^{(p)}:\mathbb{R}^{T_{d}}\rightarrow\mathbb{R}^{T_{p}}$ is defined as

[TABLE]

where $\textbf{X}=(X_{\ell})_{\ell=0}^{k-1}$ is the Markov chain defined by (18). Loosely speaking $\phi^{(p)}{\bm{\beta}}^{(d)}=\mathbb{E}[\beta_{X_{d}}|{\bm{\beta}}_{T},X_{p}]$ can be interpreted as the averages of ${\bm{\beta}}^{(d)}$ at the coarseness corresponding to the $p$ -th level of the hierarchy. In particular $\phi^{(d)}{\bm{\beta}}^{(d)}={\bm{\beta}}^{(d)}$ and $\phi^{(0)}_{t_{0}}{\bm{\beta}}^{(d)}=\mathbb{E}[\beta_{X_{d}}|{\bm{\beta}}_{T}]$ .
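To make the averaging operators concrete, the sketch below builds the matrix of $\phi^{(p)}$ acting on ${\bm{\beta}}^{(d)}$ in the special case where X is the natural random walk, so that $P(t|r)$ is the product of $1/|ch(s)|$ along the path from $r$ down to $t$ . It relies on the interpretation $\phi^{(p)}_{r}{\bm{\beta}}^{(d)}=\sum_{t\in T_{d}}P(t|r)\beta_{t}$ of the conditional expectation above. The tree and the function name are hypothetical.

```python
import numpy as np

def averaging_operator(parent, level, p, d):
    """Matrix of phi^(p): R^{T_d} -> R^{T_p} when X is the natural random walk.

    Entry (r, t) equals P(X_d = t | X_p = r), i.e. the product of 1/|ch(s)|
    along the path from r down to t, and 0 if t is not a descendant of r.
    """
    n = len(parent)
    n_children = np.bincount([q for q in parent if q >= 0], minlength=n)
    rows = [s for s in range(n) if level[s] == p]
    cols = [s for s in range(n) if level[s] == d]
    phi = np.zeros((len(rows), len(cols)))
    for j, t in enumerate(cols):
        prob, s = 1.0, t
        while level[s] > p:          # walk up from t to its ancestor at level p
            s = parent[s]
            prob /= n_children[s]
        phi[rows.index(s), j] = prob
    return phi

# Hypothetical 3-level tree: root 0, children 1 and 2, leaves 3, 4 (under 1) and 5 (under 2).
parent = [-1, 0, 0, 1, 1, 2]
level = [0, 1, 1, 2, 2, 2]
phi_1 = averaging_operator(parent, level, 1, 2)   # averages of the leaves at level 1
phi_0 = averaging_operator(parent, level, 0, 2)   # global average of the leaves
```

In this example the global average weights the leaves unequally, because the leaf under node 2 is reached with probability $1/2$ while each leaf under node 1 is reached with probability $1/4$ .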

Example 4 (Averaging operators in the symmetric case).

Let ${\bm{\beta}}_{T}={\bm{\gamma}}_{T}$ be given by Model Sk. Then

[TABLE]

Given $\phi$ , we define the residual operators $\delta^{(p)}:\mathbb{R}^{T_{d}}\rightarrow\mathbb{R}^{T_{p}}$ as $\delta^{(p)}=(\delta^{(p)}_{r})_{r\in T_{p}}$ with $\delta^{(p)}_{r}:\mathbb{R}^{T_{d}}\rightarrow\mathbb{R}$ defined as

[TABLE]

for $1\leq p\leq d\leq k-1$ and $\delta^{(0)}{\bm{\beta}}^{(d)}=\phi^{(0)}{\bm{\beta}}^{(d)}$ for $0=p\leq d\leq k-1$ . Analogously to the 3-level case of Section 3, under suitable assumptions the residual operators $\delta^{(p)}$ decompose the Markov chain generated by the Gibbs Sampler into independent sub-chains. To prove the result we first need the following lemma.

Lemma 4 ( $p$ -residuals interact only with $p$ -residuals).

Let ${\bm{\beta}}_{T}$ be a Gaussian random vector satisfying (H) and ( $\widetilde{\hbox{S}}$ ). Then for any $p$ and $d$ with $0\leq p\leq d\leq k-1$ , for all $r\in T_{p}$ we have the identity

[TABLE]

Given Lemma 4 we can prove the following multigrid decomposition for hierarchical linear models.

Theorem 9 (Multigrid decomposition for $k$ levels).

Let $({\bm{\beta}}(s))_{s\in\mathbb{N}}$ be a Markov chain evolving according to GS( ${\bm{\beta}}_{T}$ ) with ${\bm{\beta}}_{T}$ satisfying (H) and ( $\widetilde{\hbox{S}}$ ). Then $(\delta^{(0)}{\bm{\beta}}(s))_{s}$ , …, $(\delta^{(k-1)}{\bm{\beta}}(s))_{s}$ are $k$ independent Markov chains. Moreover, each chain $\delta^{(p)}{\bm{\beta}}(s)=(\delta^{(p)}{\bm{\beta}}^{(d)}(s))_{d=p}^{k-1}$ evolves according to the following blocked Gibbs sampler scheme with $(k-p)$ components: for $d$ going from $p$ to $k-1$ sample

[TABLE]

where $\mathcal{L}(X|Y)$ denotes the conditional distribution of $X$ given $Y$ .

Theorem 9 implies that running a Gibbs sampler $({\bm{\beta}}(s))_{s}$ targeting distributions satisfying (H) and ( $\widetilde{\hbox{S}}$ ) is equivalent to running $k$ independent blocked Gibbs Samplers, one for each level of coarseness from $(\delta^{(0)}{\bm{\beta}}(s))_{s}$ to $(\delta^{(k-1)}{\bm{\beta}}(s))_{s}$ .

Corollary 5.

Let ${\bm{\beta}}_{T}$ satisfy (H) and ( $\widetilde{\hbox{S}}$ ). Then the rate of convergence of GS( ${\bm{\beta}}_{T}$ ) is given by $\max\{\rho_{0},\dots,\rho_{k-1}\}$ where for each $p\in\{0,\dots,{k-1}\}$ , $\rho_{p}$ is the rate of convergence of $(\delta^{(p)}{\bm{\beta}}(s))_{s}$ .

7.5 Hierarchical ordering of rates

Combining the results in Roberts and Sahu (1997, Sec.2.2) with the multigrid decomposition, we can characterize the rates of convergence of the $k$ independent Markov chains described above as follows.

Theorem 10.

The rate of convergence of $(\delta^{(p)}{\bm{\beta}}(s))_{s}$ is given by the largest modulus eigenvalue of $(\mathbb{I}_{k-p}-L)^{-1}U$ . Here $\mathbb{I}_{k-p}$ is the $(k-p)$ dimensional identity matrix, while $L$ and $U$ are, respectively, the strictly lower and strictly upper triangular part of $(c_{d\ell})_{d,\ell=p}^{k-1}$ , with $c_{d\ell}$ given by ( $\widetilde{\hbox{S}}$ ).
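The recipe of Theorem 10 is straightforward to evaluate numerically once the matrix $C$ is available. The following sketch computes the rates of the $k$ sub-chains by taking, for each $p$ , the trailing $(k-p)\times(k-p)$ block of $C$ and the spectral radius of $(\mathbb{I}_{k-p}-L)^{-1}U$ . The function name and the example matrix are hypothetical and chosen only for illustration.

```python
import numpy as np

def subchain_rates(C):
    """Rates rho_0, ..., rho_{k-1} obtained from a symmetric k x k matrix C.

    For each p, L and U are the strictly lower and strictly upper triangular
    parts of the block (c_dl) with d, l = p, ..., k-1.
    """
    k = C.shape[0]
    rates = []
    for p in range(k):
        sub = C[p:, p:]
        L = np.tril(sub, k=-1)
        U = np.triu(sub, k=1)
        B = np.linalg.solve(np.eye(k - p) - L, U)      # (I - L)^{-1} U
        rates.append(np.abs(np.linalg.eigvals(B)).max())
    return np.array(rates)

# Hypothetical 3 x 3 example with non-zero entries only between adjacent levels.
C = np.array([[1.0, 0.5, 0.0],
              [0.5, 1.0, 0.4],
              [0.0, 0.4, 1.0]])
rates = subchain_rates(C)
```

Only the off-diagonal entries of $C$ enter the computation, and the last sub-chain always has rate zero because its block is one-dimensional.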

Theorem 10 implies that the convergence properties of the $k$ independent Markov chains are closely related. In particular, from the rates of convergence point of view, the $k$ Markov chains updating $\delta^{(p)}{\bm{\beta}}$ for $p=0,\dots,k-1$ behave as Gibbs samplers targeting a decreasing number of dimensions (from $k$ down to 1) of a single $k$ -dimensional Gaussian distribution with precision matrix given by $-C$ , where $C=(c_{d\ell})_{d,\ell=p}^{k-1}$ is given by ( $\widetilde{\hbox{S}}$ ). This suggests that the convergence properties of the sub-chains will typically improve from that of $(\delta^{(0)}{\bm{\beta}}(s))_{s}$ to that of $(\delta^{(k-1)}{\bm{\beta}}(s))_{s}$ and that the rate of convergence of $(\delta^{(0)}{\bm{\beta}}(s))_{s}$ will typically determine the rate of the whole sampler GS( ${\bm{\beta}}_{T}$ ). In particular, in the centred parametrization case we can use the well-known Cauchy interlacing theorem (see e.g. Bhatia, 2013) to show that the rate of convergence is monotonically non-increasing from $(\delta^{(0)}{\bm{\beta}}(s))_{s}$ to $(\delta^{(k-1)}{\bm{\beta}}(s))_{s}$ .

Theorem 11 (Ordering of rates for centred parametrization).

Let ${\bm{\gamma}}$ be a Gaussian vector satisfying (T) and ( $\widetilde{\hbox{S}}$ ) and let $({\bm{\gamma}}(s))_{s\in\mathbb{N}}$ be the corresponding Markov chain evolving according to GS( ${\bm{\gamma}}_{T}$ ). Then the convergence rates of the $k$ independent Markov chains $(\delta^{(0)}{\bm{\gamma}}(s))_{s}$ , …, $(\delta^{(k-1)}{\bm{\gamma}}(s))_{s}$ satisfy

[TABLE]

In Theorem 11 we needed the additional assumption (T) to prove (23). The reason is that, while in most cases the convergence rate of a deterministic-scan Gibbs Sampler targeting an $n$ -dimensional Gaussian distribution improves if one of the coordinates is conditioned to a fixed value and the sampler targets only the remaining $(n-1)$ coordinates, this is not true in general. Example 2 of Roberts and Sahu (1997) provides a counter-example (see also Whittaker, 1990, page 319). In Roberts and Sahu (1997) this example was used as a counter-example regarding blocking strategies, but it also works in the present context. We note that, if one were to consider a random scan version of the Gibbs Sampler, the reversibility of the induced Markov chains would allow us to prove the ordering result in Theorem 11 with no need to assume (T). We leave this as a direction of future research and briefly mention it in Section 8.

Theorem 11 implies the following corollary.

Corollary 6.

Let ${\bm{\gamma}}$ be a Gaussian vector satisfying (T) and ( $\widetilde{\hbox{S}}$ ). Then the rate of convergence of GS( ${\bm{\gamma}}_{T}$ ) is given by the largest squared eigenvalue of the $k$ -dimensional matrix $C-\mathbb{I}_{k}$ , where $C=(c_{d\ell})_{d,\ell=0}^{k-1}$ is defined by ( $\widetilde{\hbox{S}}$ ) and $\mathbb{I}_{k}$ is the $k$ -dimensional identity matrix.
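The agreement between Corollary 6 and the $(\mathbb{I}-L)^{-1}U$ recipe of Theorem 10 can be illustrated numerically. The sketch below assumes a hypothetical tridiagonal matrix $C$ with unit diagonal (so that $C-\mathbb{I}_{k}$ has zero diagonal and couples only adjacent levels) and compares the two expressions for the rate. The matrix entries and the function names are illustrative assumptions, not quantities taken from a specific model.

```python
import numpy as np

def gauss_seidel_rate(C):
    """Largest modulus eigenvalue of (I - L)^{-1} U for the off-diagonal part of C."""
    L = np.tril(C, k=-1)
    U = np.triu(C, k=1)
    B = np.linalg.solve(np.eye(len(C)) - L, U)
    return np.abs(np.linalg.eigvals(B)).max()

def squared_eigenvalue_rate(C):
    """Largest squared eigenvalue of the symmetric matrix C - I."""
    return (np.linalg.eigvalsh(C - np.eye(len(C))) ** 2).max()

# Hypothetical tridiagonal C with unit diagonal and k = 4 levels.
off = np.array([0.5, 0.4, 0.3])
C = np.eye(4) + np.diag(off, 1) + np.diag(off, -1)

rate_gs = gauss_seidel_rate(C)
rate_sq = squared_eigenvalue_rate(C)
```

For this example both expressions equal $0.45$ , which can be confirmed by hand from the characteristic polynomial $\mu^{4}-0.5\mu^{2}+0.0225$ of $C-\mathbb{I}_{4}$ .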

In particular, considering the special case of Model Sk it is easy to deduce the following result.

Corollary 7.

The rate of convergence of $GS({\bm{\gamma}}_{T})$ targeting Model Sk is given by the largest squared eigenvalue of the $k$ -dimensional matrix

[TABLE]

where $r_{\ell}=\frac{I_{\ell}\tau_{\ell}}{\tau_{\ell-1}+I_{\ell}\tau_{\ell}}$ with $(\tau_{1},\dots,\tau_{k-1})$ and $(I_{1},\dots,I_{k-1})$ given by Model Sk, $\tau_{0}=0$ , $\tau_{k}=\tau_{e}$ and $I_{k}=J$ .

7.6 Example: rates of convergence for 4-level models

The results developed in Sections 7.4 and 7.5 allow us to analyze hierarchical models with an arbitrary number of levels. For example we could consider 4-level extensions of Model S3.

Model S4 (Symmetric 4-levels hierarchical model).

Suppose

$$y_{ijk\ell}=\mu+a_{i}+b_{ij}+c_{ijk}+\epsilon_{ijk\ell}\,,$$

where $i$ , $j$ , $k$ and $\ell$ run from 1 to $I$ , $J$ , $K$ and $L$ respectively and $\epsilon_{ijk\ell}$ are iid normal random variables with mean 0 and variance $\sigma_{e}^{2}$ . We employ a standard Bayesian model specification assuming $a_{i}\sim N(0,\sigma_{a}^{2})$ , $b_{ij}\sim N(0,\sigma_{b}^{2})$ , $c_{ijk}\sim N(0,\sigma_{c}^{2})$ and a flat prior on $\mu$ .
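For experimentation, data from Model S4 can be simulated with a few lines of NumPy. The sketch below assumes the additive form $y_{ijk\ell}=\mu+a_{i}+b_{ij}+c_{ijk}+\epsilon_{ijk\ell}$ with the Gaussian random effects described above. The function name and the parameter values are hypothetical.

```python
import numpy as np

def simulate_model_s4(I, J, K, L, mu, sigma_a, sigma_b, sigma_c, sigma_e, rng):
    """Draw y with shape (I, J, K, L) from the assumed additive 4-level model."""
    a = rng.normal(0.0, sigma_a, size=(I, 1, 1, 1))     # a_i
    b = rng.normal(0.0, sigma_b, size=(I, J, 1, 1))     # b_ij
    c = rng.normal(0.0, sigma_c, size=(I, J, K, 1))     # c_ijk
    eps = rng.normal(0.0, sigma_e, size=(I, J, K, L))   # epsilon_ijkl
    return mu + a + b + c + eps                         # broadcasting adds the effects

rng = np.random.default_rng(0)
y = simulate_model_s4(I=3, J=4, K=5, L=6, mu=1.0,
                      sigma_a=1.0, sigma_b=0.5, sigma_c=0.5, sigma_e=0.1, rng=rng)
```

Broadcasting over the singleton axes ensures that all observations sharing the same $i$ share the same $a_{i}$ , and similarly for $b_{ij}$ and $c_{ijk}$ .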

In order to fit Model S4 with a Gibbs Sampler like GS( ${\bm{\beta}}_{T}$ ), one could consider centering or non-centering each of the three levels $(a_{i})_{i}$ , $(b_{ij})_{ij}$ and $(c_{ijk})_{ijk}$ , resulting in $8=2^{3}$ combinations. Let $(\lambda_{1},\lambda_{2},\lambda_{3})\in\{0,1\}^{3}$ be the associated non-centering indicators. Here $\lambda_{d}=1$ indicates that the $d$ -th level is non-centred while $\lambda_{d}=0$ indicates that it is centred. The corresponding rates of convergence $\rho_{(\lambda_{1},\lambda_{2},\lambda_{3})}$ can then be expressed in terms of the following normalized variance ratios

[TABLE]

where $\tilde{\sigma}_{1}^{2}=\frac{\sigma_{a}^{2}}{I}$ , $\tilde{\sigma}_{2}^{2}=\frac{\sigma_{b}^{2}}{IJ}$ , $\tilde{\sigma}_{3}^{2}=\frac{\sigma_{c}^{2}}{IJK}$ and $\tilde{\sigma}_{4}^{2}=\frac{\sigma_{e}^{2}}{IJKL}$ . If $\lambda_{1}=1$ (i.e. using the non-centred parametrization $(a_{i})_{i}$ at level $1$ ) the rates are

[TABLE]

When $\lambda_{1}=0$ the expressions for the convergence rates are still explicit, but slightly more involved and are reported in Section 3.1 of the supplementary material. These rates can be derived from Corollary 5 and Theorem 10. It is worth noting that also in this 4-level case the skeleton chain $\delta^{(0)}{\bm{\beta}}$ is always the slowest chain for all centred and non-centred parametrizations (which can be checked by computing the rates of convergence of $\delta^{(1)}{\bm{\beta}}$ , $\delta^{(2)}{\bm{\beta}}$ and $\delta^{(3)}{\bm{\beta}}$ using Theorem 10 and comparing those to the ones of $\delta^{(0)}{\bm{\beta}}$ ), even if for the general $k$ -level case we were able to prove this fact only for the fully-centred parametrization (Theorem 11). The expressions given here can be easily used to derive conditionally optimal parametrizations for Model S4 given the rescaled variance components $(\tilde{\sigma}_{i}^{2})_{i=1}^{4}$ . For example, choosing whether to center or not each level by comparing the level-specific rescaled variances with the sum of the rescaled variances of the lower levels like in Section 3.2 leads to rates of convergence upper bounded by $\frac{3}{4}$ .

8 Conclusions and future work

In this work we studied the convergence properties of the Gibbs Sampler algorithm in the context of Gaussian multilevel models. To do so we developed a novel analytic approach based on multigrid decompositions that allows us to factorize the Markov chain of interest into independent and easier to analyze sub-chains. This decomposition enables us to evaluate explicitly the $L^{2}$ -rate of convergence in various models of interest. The results offer a detailed and valuable insight into the interaction between multilevel structures (e.g. nested and crossed) and the Gibbs Sampler and provide guidance on the choice of the computationally optimal parametrizations or linear constraints, which can potentially be relevant also beyond the Gaussian case (see e.g. Section 5), and indication of which parameters to monitor in the convergence diagnostic process (see Theorem 2 and discussion at the end of Section 2.1). Since the first preprint version of this paper, the multigrid decomposition developed here has already found other practical applications. In particular Papaspiliopoulos et al. (2019) have successfully exploited it to analyze the computational complexity of the Gibbs Sampler in the context of crossed random effect models (see also Gao and Owen, 2017) and to design an algorithmic modification with linear computational complexity.

Together with explicit formulas for $L^{2}$ -rates of convergence, the multigrid decomposition we developed in this paper provides simple and intuitive theoretical characterizations of behaviors commonly observed in practice when fitting hierarchical models with MCMC, such as slower mixing for hyper-parameters at higher levels (see Theorems 2 and 11), algorithmic scalability with width of the hierarchy but not with height (e.g. Theorem 3 and Corollary 7) and good performance of the centred parametrization in data-rich contexts (Theorem 8). We hope that the results presented in this work will be a first step towards a quantitative understanding of the behavior of MCMC algorithms (even beyond the Gibbs Sampler) in the extremely popular context of Bayesian hierarchical and multilevel models.

The present work could be extended in many directions. For example, it would be interesting to extend the results for non-symmetric cases, either by generalizing the bounds of Theorem 8 or by weakening the symmetry assumption in (S). In terms of classes of models considered, a natural and important extension would be to consider the multivariate case (where each parameter $\gamma_{t}$ is a multivariate random vector) and the regression case. We expect many results developed in this work to extend to the multivariate and regression case, even if in that context the role played by non-symmetric cases will be more crucial. Another important class of models that would be worth approaching with methodologies analogous to the ones developed here are models based on Gaussian processes commonly used, for example, in spatial statistics (see e.g. Bass and Sahu, 2016a).

An important and ambitious aim would be to extend the results to other tractable distributions within the exponential family beyond the Gaussian case. A starting point for this could be to analyze Model CkP as mentioned in Remark 2. Also, many non-Gaussian hierarchical models can be well-approximated by Gaussian ones for sufficiently large data sets, so that it is reasonable to conjecture that the qualitative conclusions (at least) of our study might remain valid when extrapolated to non-Gaussian models, rather like the analysis given in Sahu and Roberts (1999). A detailed study of this question is left for future work.

We have concentrated in this paper on deterministic-scan samplers. However, explicit rates of convergence of random scan samplers are also available in the Gaussian case as described in Amit (1996) and extended in Roberts and Sahu (1997). Deterministic and random scan samplers can sometimes differ substantially in their convergence properties, see for example Roberts and Rosenthal (2015), although this phenomenon is not well understood in general, so that the insights of this work could be particularly useful in this direction. Also, in the random scan case the reversibility of the induced Markov chains would allow us to apply the Cauchy interlacing theorem under weaker assumptions than Theorem 11 and thus prove ordering results for general hierarchical parametrizations ${\bm{\beta}}_{T}=(\beta_{t})_{t\in T}$ .

While this work is focused on $L^{2}$ -rates of convergence, the same approach could be used to derive bounds on the distance (e.g. total variation or Wasserstein) between the distribution of the Markov chain at a given iteration and the target distribution (see e.g. Amit (1996), Roberts and Sahu (1997, eq.(15)) and Khare et al., 2009, Sec.4.4). Such bounds would be an interesting contribution to the growing literature on rigorous characterizations of the computational complexity of Bayesian hierarchical linear models, see for example Rajaratnam and Sparks (2015); Roberts and Rosenthal (2016); Johndrow et al. (2015). In order to provide full characterizations, however, the case of unknown variances should be considered (see e.g. Jones and Hobert (2004) for the two level case).

Acknowledgments

The authors are grateful for stimulating discussions with Omiros Papaspiliopoulos and Art Owen. GZ was supported in part by an EPSRC Doctoral Prize fellowship, by the European Research Council (ERC) through StG “N-BNP” 306406 and by MIUR through the PRIN Project 2015SNS29B. GOR acknowledges support from EPSRC through grants EP/K014463/1 (i-Like) and EP/D002060/1 (CRiSM).

1 Remarks on the results’ proofs

We list proofs following the mathematical chronology. This is different from the order of appearance in the paper because here we start from the results for $k$ -level models, namely the results from Lemma 1 to Corollary 7, and then move to the ones for $3$ -level models and crossed models, namely the results from Theorem 1 to Theorem 8.

The results developed here rely on classical theory for Gaussian Gibbs Samplers and their autoregression representations. The following theorem summarizes the most relevant facts for our purposes. See Lemma 1 and Theorem 1 of Roberts and Sahu [1997] for proofs and more detailed statements.

Theorem 1.1.

The Markov chain $({\bm{\beta}}(s))_{s=1,2,\dots}$ generated by a deterministic-scan Gibbs Sampler targeting an $n$ -dimensional Gaussian distribution with covariance matrix $\Sigma$ evolves as a multivariate Gaussian autoregressive process with transition kernel given by

[TABLE]

for some fixed $n$ -dimensional square matrix $B$ and $n$ -dimensional vector $b$ . The rate of convergence of $({\bm{\beta}}(s))_{s=1,2,\dots}$ equals the largest modulus eigenvalue of $B$ .

The autoregressive matrix $B$ , which we will sometimes refer to as $B$ -matrix of the Gibbs Sampler under consideration, can be explicitly derived from the target covariance $\Sigma$ , following the recipe in Section 2.2 of Roberts and Sahu [1997]. The latter involves computing an $n$ -dimensional square matrix $A$ and then exploiting the identity $B=(\mathbb{I}_{n}-L)^{-1}U$ , where $\mathbb{I}_{n}$ is the $n$ -dimensional identity matrix, $L$ is the lower triangular part of $A$ and $U=A-L$ . The specific form of $A$ , which we will sometimes refer to as $A$ -matrix, depends on the updating strategy of the Gibbs Sampler under consideration.
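For the single-site updating strategy, the recipe can be sketched as follows. The code assumes the standard choice $A=\mathbb{I}_{n}-\mathrm{diag}(Q_{11}^{-1},\dots,Q_{nn}^{-1})Q$ with $Q=\Sigma^{-1}$ , which is the form we recall from Roberts and Sahu [1997] for coordinate-wise updates, and checks that $B$ reproduces one sweep of conditional-mean updates for a zero-mean target. The function names and the random test matrix are hypothetical.

```python
import numpy as np

def gibbs_b_matrix(Q):
    """B-matrix of the coordinate-wise deterministic-scan Gibbs sampler
    targeting a Gaussian with precision matrix Q (assumed A-matrix form)."""
    n = Q.shape[0]
    A = np.eye(n) - Q / np.diag(Q)[:, None]   # A_ij = -Q_ij / Q_ii, A_ii = 0
    L = np.tril(A, k=-1)                      # lower triangular part of A
    U = A - L
    return np.linalg.solve(np.eye(n) - L, U)  # (I - L)^{-1} U

def sweep_of_conditional_means(Q, x):
    """One sweep replacing each coordinate by its conditional mean (zero-mean target)."""
    x = x.copy()
    for i in range(len(x)):
        x[i] = -(Q[i] @ x - Q[i, i] * x[i]) / Q[i, i]
    return x

rng = np.random.default_rng(3)
M = rng.standard_normal((4, 4))
Q = M @ M.T + 4 * np.eye(4)                   # hypothetical positive definite precision
B = gibbs_b_matrix(Q)
rate = np.abs(np.linalg.eigvals(B)).max()     # largest modulus eigenvalue of B
```

The comparison with `sweep_of_conditional_means` works because the conditional expectation of the chain after one sweep is linear in the current state, with matrix $B$ .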

In principle, Theorem 1.1 provides a constructive way to compute the rates of convergence of Gaussian Gibbs Samplers. In realistic high-dimensional statistical scenarios, however, it is typically very challenging to compute analytically the largest modulus eigenvalue of $B$ , as the latter is a matrix with dimensionality equal to the number of parameters in the problem. In the following proofs, we will exploit the probabilistic structure of hierarchical models, and the multigrid decomposition in particular, to reduce the problem to studying low-dimensional Gaussian autoregressions, namely the skeleton chains $\delta^{(0)}({\bm{\beta}}(s))_{s=1,2,\dots}$ , with more tractable $B$ -matrices.

2 Proofs of Lemmas

Proofs of Lemmas 1 and 2

In order to prove Lemmas 1 and 2 we need some preliminary results on matrices $M=(M_{tr})_{t,r\in T}$ indexed by elements of a tree $T$ .

Lemma 5 (Triangular matrices on trees).

Suppose that a matrix $L=(L_{tr})_{t,r\in T}$ satisfies the following lower-triangularity condition

[TABLE]

Then $L$ is invertible and its inverse satisfies (Lm). Similarly, if $U=(U_{tr})_{t,r\in T}$ satisfies the following upper-triangularity condition

[TABLE]

then $U$ is invertible and its inverse still satisfies (Um).

Proof.

Suppose that $L$ satisfies (Lm). Also without loss of generality suppose $L_{tt}=1$ for all $t\in T$ by rescaling. Then write $L=(\mathbb{I}_{T}+N)$ where $\mathbb{I}_{T}$ is the $|T|\times|T|$ identity matrix and $N$ satisfies the following strict lower-triangularity condition

[TABLE]

Consider $N^{2}_{tr}=\sum_{s\in T}N_{ts}N_{sr}=\sum_{s\in T\,:\,r\prec s\prec t}N_{ts}N_{sr}$ . From the last expression it follows that $N^{2}_{tr}\neq 0$ implies $r\prec t$ and $|\ell(r)-\ell(t)|\geq 2$ , where $\ell(t)$ denotes the level of $t$ in the tree $T$ . Iterating the same argument we have that $N^{p}_{tr}\neq 0$ implies $r\prec t$ and $|\ell(r)-\ell(t)|\geq p$ . It follows that $N^{p}$ satisfies (L ${}^{*}_{m}$ ) for all $p\geq 1$ and that $N^{p}=\bm{0}_{T}$ for all $p\geq k$ where $\bm{0}_{T}$ is the $|T|\times|T|$ zero matrix. Here $k$ indicates the number of levels of $T$ , as in Section 7 of the paper. From $N^{p}=\bm{0}_{T}$ for $p\geq k$ it follows that

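$$(\mathbb{I}_{T}+N)\Big(\mathbb{I}_{T}+\sum_{p=1}^{k-1}(-1)^{p}N^{p}\Big)=\mathbb{I}_{T}+(-1)^{k-1}N^{k}=\mathbb{I}_{T}$$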

and therefore $L^{-1}=(\mathbb{I}_{T}+N)^{-1}=\mathbb{I}_{T}+\sum_{p=1}^{k-1}(-1)^{p}N^{p}$ . Since $N^{p}$ satisfies (L ${}^{*}_{m}$ ) for all $p\geq 1$ it follows that $L^{-1}=\mathbb{I}_{T}+\sum_{p=1}^{k-1}(-1)^{p}N^{p}$ satisfies (Lm).

The analogous statement for (Um) can be deduced by observing that $U$ satisfies (Um) if and only if its transpose satisfies (Lm). ∎
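The argument in the proof of Lemma 5 can be checked numerically on a small example. The sketch below is illustrative only: the six-node tree with $k=3$ levels, the random entries and all variable names are assumptions made for the check. It builds a matrix whose entry $(t,r)$ vanishes unless $r\preceq t$ , with unit diagonal, and compares the finite series $\mathbb{I}_{T}-N+N^{2}$ with the inverse computed by `solve`.

```r
# Toy tree with k = 3 levels (hypothetical): node 1 is the root,
# nodes 2 and 3 are its children, nodes 4 and 5 are children of node 2
# and node 6 is a child of node 3.
parent <- c(NA, 1, 1, 2, 2, 3)
n <- length(parent)

# anc[t, r] is TRUE when r is an ancestor of t or r equals t
anc <- diag(n) == 1
for (t in seq_len(n)) {
  r <- parent[t]
  while (!is.na(r)) {
    anc[t, r] <- TRUE
    r <- parent[r]
  }
}

set.seed(1)
L <- matrix(rnorm(n * n), n, n) * anc   # zero out entries outside the pattern
diag(L) <- 1                            # unit diagonal, as after rescaling

N <- L - diag(n)
N2 <- N %*% N
N3 <- N2 %*% N
L_inv_series <- diag(n) - N + N2        # I + sum_{p=1}^{k-1} (-1)^p N^p for k = 3

nilpotent <- all(N3 == 0)                                  # N^k vanishes
series_ok <- isTRUE(all.equal(L_inv_series, solve(L)))     # series equals the inverse
pattern_ok <- all(L_inv_series[!anc] == 0)                 # inverse keeps the zero pattern
```

All three logical flags should be `TRUE`: the third power of $N$ vanishes because no chain of strict ancestors in a three-level tree has more than three nodes.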

Lemma 5 can be used to deduce Lemma 1 in a straightforward way.

Proof of Lemma 1.

Suppose that ${\bm{\beta}}_{T}=\Lambda{\bm{\gamma}}_{T}$ is a hierarchical reparametrization of ${\bm{\gamma}}_{T}$ . This is equivalent to saying that $\Lambda=(\Lambda_{tr})_{t,r\in T}$ is a matrix satisfying (Lm) and that for all $t\in T$ , $\beta_{t}=\sum_{s\in T}\Lambda_{ts}\gamma_{s}$ . Lemma 5 implies that $\Lambda$ is invertible and that $Z=\Lambda^{-1}$ satisfies (Lm). Therefore ${\bm{\gamma}}_{T}=Z{\bm{\beta}}_{T}$ is a hierarchical reparametrization of ${\bm{\beta}}_{T}$ . ∎

To prove Lemma 2 we need an additional preliminary result.

Lemma 6 (Closure of Hm).

Suppose that $M=(M_{tr})_{t,r\in T}$ satisfies the following condition

$$M_{tr}=0\quad\hbox{unless}\quad r\preceq t\ \hbox{ or }\ r\succeq t.\qquad\hbox{(Hm)}$$

Then if we multiply $M$ from the right by a matrix $L$ satisfying (Lm), the product $ML$ still satisfies (Hm). Similarly, if we multiply $M$ from the left by a matrix $U$ satisfying (Um), the product $UM$ still satisfies (Hm).

Proof.

Consider $(UM)_{tr}=\sum_{s\in T}U_{ts}M_{sr}$ . From (Hm) and (Um), the elements $U_{ts}M_{sr}$ in the latter sum are non-zero only when $s$ belongs to the intersection of $\{s\in T\,:\,s\succeq t\}$ and $\{s\in T\,:\,s\succeq r\hbox{ or }s\preceq r\}$ . It is easy to see that the intersection of these two sets is non-empty only if $t\succeq r$ or $t\preceq r$ . Therefore $UM$ satisfies (Hm). The argument to show that $ML$ satisfies (Hm) is analogous. ∎

We now combine Lemmas 5 and 6 to prove Lemma 2.

Proof of Lemma 2.

Suppose that ${\bm{\beta}}_{T}$ satisfies (H). This is equivalent to saying that its precision matrix $Q^{({\bm{\beta}})}=(Q^{({\bm{\beta}})}_{tr})_{t,r\in T}$ satisfies (Hm). Consider a hierarchical reparametrization of ${\bm{\beta}}_{T}$ denoted by $\Lambda{\bm{\beta}}_{T}$ . Then its precision matrix is given by $Q^{(\Lambda{\bm{\beta}})}=(\Lambda^{T})^{-1}Q^{({\bm{\beta}})}\Lambda^{-1}$ where $\Lambda^{T}$ denotes the transpose of $\Lambda$ . By definition of hierarchical reparametrizations, $\Lambda$ satisfies (Lm). Therefore Lemma 5 implies that $\Lambda^{-1}$ satisfies (Lm) and, consequently, Lemma 6 implies that $Q^{({\bm{\beta}})}\Lambda^{-1}$ satisfies (Hm). Since $\Lambda$ satisfies (Lm), then $\Lambda^{T}$ satisfies (Um) and thus Lemma 5 implies that $(\Lambda^{T})^{-1}$ satisfies (Um). We can then apply Lemma 6 to $(\Lambda^{T})^{-1}$ and $Q^{({\bm{\beta}})}\Lambda^{-1}$ to deduce that $(\Lambda^{T})^{-1}Q^{({\bm{\beta}})}\Lambda^{-1}$ satisfies (Hm) and thus $\Lambda{\bm{\beta}}_{T}$ satisfies (H). ∎
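The expression for the precision matrix of $\Lambda{\bm{\beta}}_{T}$ used in the proof of Lemma 2 is the usual Gaussian change of variables. A short derivation, assuming for brevity that ${\bm{\beta}}_{T}$ has zero mean and writing ${\bm{\xi}}=\Lambda{\bm{\beta}}_{T}$ (a symbol introduced only for this remark), is

$$\exp\Big(-\tfrac{1}{2}{\bm{\beta}}_{T}^{T}Q^{({\bm{\beta}})}{\bm{\beta}}_{T}\Big)=\exp\Big(-\tfrac{1}{2}(\Lambda^{-1}{\bm{\xi}})^{T}Q^{({\bm{\beta}})}(\Lambda^{-1}{\bm{\xi}})\Big)=\exp\Big(-\tfrac{1}{2}{\bm{\xi}}^{T}(\Lambda^{T})^{-1}Q^{({\bm{\beta}})}\Lambda^{-1}{\bm{\xi}}\Big),$$

where the last step uses $(\Lambda^{-1})^{T}=(\Lambda^{T})^{-1}$ . Since the Jacobian of a linear map is constant, the density of ${\bm{\xi}}$ is proportional to the right-hand side, so its precision matrix is $(\Lambda^{T})^{-1}Q^{({\bm{\beta}})}\Lambda^{-1}$ .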

Corollary 4 follows easily from equation (T) and Lemma 2.

Proof of Lemma 3

The strategy to prove Lemma 3 is similar to the one used to prove Lemmas 1 and 2 above, with the difference that we have to check that the symmetry condition is preserved under the operations considered. To do so we first prove two auxiliary lemmas.

Lemma 7 (Symmetric triangular matrices on trees).

Suppose that a matrix $L=(L_{tr})_{t,r\in T}$ satisfies the following symmetric lower-triangularity condition

[TABLE]

where $(l_{pd})_{p,d=0}^{k-1}$ is a $k\times k$ real valued matrix. Then $L$ is invertible and its inverse satisfies (SLm). Similarly, if $U=(U_{tr})_{t,r\in T}$ satisfies the following symmetric upper-triangularity condition

[TABLE]

with $(u_{pd})_{p,d=0}^{k-1}$ being a $k\times k$ real valued matrix, then $U$ is invertible and its inverse still satisfies (SUm).

Proof.

Suppose that $L$ satisfies (SLm). Also without loss of generality suppose $L_{tt}=1$ for all $t\in T$ by rescaling. Since $L$ also satisfies (Lm), arguing as in the proof of Lemma 5 we can write $L^{-1}=\mathbb{I}_{T}+\sum_{p=1}^{k-1}(-1)^{p}N^{p}$ where $N=L-\mathbb{I}_{T}$ and $N$ satisfies the following symmetric strict lower-triangularity condition

[TABLE]

From $N^{2}_{tr}=\sum_{s\in T}N_{ts}N_{sr}=\sum_{s\in T\,:\,r\prec s\prec t}N_{ts}N_{sr}=\sum_{\ell^{\prime}\,:\,\ell(r)<\ell^{\prime}<\ell(t)}l_{\ell(t)\ell^{\prime}}l_{\ell^{\prime}\ell(r)}$ , where the last equality holds because each level strictly between $\ell(r)$ and $\ell(t)$ contains exactly one node $s$ with $r\prec s\prec t$ , we deduce that $N^{2}$ still satisfies (SL ${}^{*}_{m}$ ) for some different $(l_{pd})_{p,d=0}^{k-1}$ . Iterating the same argument we have that $N^{p}$ satisfies (SL ${}^{*}_{m}$ ) for all $p\geq 1$ . Since $L^{-1}=\mathbb{I}_{T}+\sum_{p=1}^{k-1}(-1)^{p}N^{p}$ it follows that $L^{-1}$ satisfies (SLm).

The analogous statement for (SUm) can be deduced by observing that $U$ satisfies (SUm) if and only if its transpose satisfies (SLm). ∎

Lemma 8.

Let $M=(M_{tr})_{t,r\in T}$ and $L=(L_{tr})_{t,r\in T}$ satisfy respectively ( $\widetilde{\hbox{S}}$ ) and (SLm). Then the product $ML$ satisfies ( $\widetilde{\hbox{S}}$ ) for some different matrix $(c_{pd})_{p,d=0}^{k-1}$ . Similarly, if $U=(U_{tr})_{t,r\in T}$ satisfies (SUm) then the product $UM$ satisfies ( $\widetilde{\hbox{S}}$ ).

Proof.

First consider the product $ML$ . Lemma 6 implies that $ML$ satisfies (Hm) and therefore $(ML)_{tr}=0$ unless $t\preceq r$ or $t\succeq r$ . Given $t\preceq r$ or $t\succeq r$ and using ( $\widetilde{\hbox{S}}$ ) and (SLm) we have

[TABLE]

where the sum from $\ell(r)$ to $\ell(t)$ equals 0 if $\ell(r)\geq\ell(t)$ . The latter equation implies that $ML$ satisfies ( $\widetilde{\hbox{S}}$ ).

To prove that the product $UM$ satisfies ( $\widetilde{\hbox{S}}$ ) note that $M$ and $U$ satisfying ( $\widetilde{\hbox{S}}$ ) and (SUm) respectively is equivalent to $M^{T}$ and $U^{T}$ satisfying ( $\widetilde{\hbox{S}}$ ) and (SLm) respectively. Therefore, by the first part of this Lemma, $(UM)^{T}=M^{T}U^{T}$ satisfies ( $\widetilde{\hbox{S}}$ ) and thus $((UM)^{T})^{T}=UM$ satisfies ( $\widetilde{\hbox{S}}$ ). ∎

We now combine Lemmas 7 and 8 to prove Lemma 3.

Proof of Lemma 3.

Suppose that the precision matrix $Q^{(\tilde{{\bm{\beta}}})}=(Q^{(\tilde{{\bm{\beta}}})}_{tr})_{t,r\in T}$ of $\tilde{{\bm{\beta}}}_{T}$ satisfies ( $\widetilde{\hbox{S}}$ ). Consider a symmetric hierarchical reparametrization of $\tilde{{\bm{\beta}}}_{T}$ denoted by $\Lambda\tilde{{\bm{\beta}}}_{T}$ . Then its precision matrix is given by $Q^{(\Lambda\tilde{{\bm{\beta}}})}=(\Lambda^{T})^{-1}Q^{(\tilde{{\bm{\beta}}})}\Lambda^{-1}$ where $\Lambda^{T}$ denotes the transpose of $\Lambda$ . By definition of symmetric hierarchical reparametrizations, $\Lambda$ satisfies (SLm). Therefore Lemma 7 implies that $\Lambda^{-1}$ satisfies (SLm) and, consequently, Lemma 8 implies that $Q^{(\tilde{{\bm{\beta}}})}\Lambda^{-1}$ satisfies ( $\widetilde{\hbox{S}}$ ). Since $\Lambda$ satisfies (SLm), then $\Lambda^{T}$ satisfies (SUm) and thus Lemma 7 implies that $(\Lambda^{T})^{-1}$ satisfies (SUm). We can then apply Lemma 8 to $(\Lambda^{T})^{-1}$ and $Q^{(\tilde{{\bm{\beta}}})}\Lambda^{-1}$ to deduce that $Q^{(\Lambda\tilde{{\bm{\beta}}})}=(\Lambda^{T})^{-1}Q^{(\tilde{{\bm{\beta}}})}\Lambda^{-1}$ satisfies ( $\widetilde{\hbox{S}}$ ). ∎

Proof of Lemma 4

Proof of Lemma 4.

Suppose ${\bm{\beta}}_{T}$ has zero mean (otherwise replace ${\bm{\beta}}_{T}$ by ${\bm{\beta}}_{T}-\mathbb{E}[{\bm{\beta}}]$ ). As in Section 7.4 of the paper, given any $r$ and $t$ in $T$ we denote $P(X_{\ell(t)}=t|X_{\ell(r)}=r)$ by $P(t|r)$ . Using (H), for any $d\in\{0,\dots,k-1\}$ and $t\in T_{d}$ , we can write the full conditional expectation of $\beta_{t}$ as

[TABLE]

where $A_{tr}=-\frac{Q_{tr}}{Q_{tt}}$ for any $r\neq t$ . Note that ( $\widetilde{\hbox{S}}$ ) implies

[TABLE]

It follows

[TABLE]

Since the last equation does not depend on ${\bm{\beta}}^{(d)}$ we have $\mathbb{E}[\beta_{t}|{\bm{\beta}}_{T\backslash t}]=\mathbb{E}[\beta_{t}|{\bm{\beta}}\backslash{\bm{\beta}}^{(d)}].$ For any $r\in T_{p}$ , by definition of $\phi^{(p)}_{r}{\bm{\beta}}^{(d)}$ , we have

[TABLE]

From the latter equation and the definition of $\delta^{(p)}_{r}{\bm{\beta}}^{(d)}$ it follows

[TABLE]

∎

3 Proofs for hierarchical models with an arbitrary number of levels

Proof of Theorem 9

To prove Theorem 9 we first need the following lemma.

Lemma 9.

Given $d\in\{0,\dots,k-1\}$ and $p,p^{\prime}\in\{0,\dots,d\}$ with $p\neq p^{\prime}$ ,

[TABLE]

Proof.

Let $d\in\{0,\dots,k-1\}$ , $p,p^{\prime}\in\{0,\dots,d\}$ with $p<p^{\prime}$ and ${\bm{\beta}}\backslash{\bm{\beta}}^{(d)}$ be fixed. To make the notation more compact we denote $\mathbb{E}[\cdot|{\bm{\beta}}\backslash{\bm{\beta}}^{(d)}]$ by $\tilde{\mathbb{E}}[\cdot]$ , $\phi^{(p)}_{r}{\bm{\beta}}^{(d)}$ by $\tilde{\phi}^{(p)}_{r}$ and $P(X_{p^{\prime}}=r^{\prime}|X_{p}=r)$ by $P(r^{\prime}|r)$ for all $r\in T_{p}$ and $r^{\prime}\in T_{p^{\prime}}$ . By replacing $\beta_{t}$ with $\beta_{t}-\tilde{\mathbb{E}}[\beta_{t}]$ we can suppose without loss of generality that $\tilde{\mathbb{E}}[\beta_{t}]=0$ for all $t\in T_{d}$ and therefore $\tilde{\mathbb{E}}[\tilde{\phi}^{(p)}_{r}]=\tilde{\mathbb{E}}[\tilde{\phi}^{(p^{\prime})}_{r^{\prime}}]=0$ and $\tilde{\mathbb{E}}[\delta^{(p)}_{r}{\bm{\beta}}^{(d)}]=\tilde{\mathbb{E}}[\delta^{(p^{\prime})}_{r^{\prime}}{\bm{\beta}}^{(d)}]=0$ for all $r\in T_{p}$ and $r^{\prime}\in T_{p^{\prime}}$ . By definition $\tilde{\phi}^{(p)}_{r}=\sum_{s\in T_{p^{\prime}}}P(s|r)\tilde{\phi}^{(p^{\prime})}_{s}$ and therefore

[TABLE]

where we used $\tilde{\mathbb{E}}[\tilde{\phi}^{(p^{\prime})}_{s}\tilde{\phi}^{(p^{\prime})}_{r^{\prime}}]=0$ for $r^{\prime}\neq s$ and $r^{\prime},s\in T_{p^{\prime}}$ , which follows from the conditional independence of $(\beta_{t})_{t\in T_{d}\,,\,t\succeq r^{\prime}}$ and $(\beta_{t})_{t\in T_{d}\,,\,t\succeq s}$ given ${\bm{\beta}}\backslash{\bm{\beta}}^{(d)}$ . Note that $P(r^{\prime}|r)$ in (3.1) could be 0. From $\tilde{\mathbb{E}}[\beta_{t}]=0$ for any $t\in T_{d}$ and ( $\widetilde{\hbox{S}}$ ) we have $\tilde{\mathbb{E}}[\beta_{t}^{2}]=P(t)^{-1}$ and therefore

[TABLE]

Combining (3.1) and (3.2) we have $\tilde{\mathbb{E}}[\tilde{\phi}^{(p)}_{r}\tilde{\phi}^{(p^{\prime})}_{r^{\prime}}]=0$ if $r^{\prime}\nsucc r$ and

[TABLE]

From the last equality and the definition of $\delta^{(p)}_{r}{\bm{\beta}}^{(d)}$ in (21) we have

[TABLE]

where the last equality is trivial if $pa(r)\nprec r^{\prime}$ and can be deduced from (3.3) otherwise. Since the residuals are jointly Gaussian given ${\bm{\beta}}\backslash{\bm{\beta}}^{(d)}$ , the desired conditional independence follows from $\tilde{\mathbb{E}}[\delta^{(p)}_{r}{\bm{\beta}}^{(d)}\delta^{(p^{\prime})}_{r^{\prime}}{\bm{\beta}}^{(d)}]=0=\tilde{\mathbb{E}}[\delta^{(p)}_{r}{\bm{\beta}}^{(d)}]\tilde{\mathbb{E}}[\delta^{(p^{\prime})}_{r^{\prime}}{\bm{\beta}}^{(d)}]$ for all $r\in T_{p}$ and $r^{\prime}\in T_{p^{\prime}}$ . ∎

Proof of Theorem 9.

Theorem 9 follows easily from Lemmas 4 and 9 as follows. For each $d\in\{0,\dots,k-1\}$ the sampling step

[TABLE]

in Sampler GS( ${\bm{\beta}}_{T}$ ) is equal in distribution to sampling jointly the $(d+1)$ residuals

[TABLE]

From the conditional independence statement in Lemma 9 the latter is equivalent to sampling independently each residual $\delta^{(p)}{\bm{\beta}}^{(d)}(s+1)$ from

[TABLE]

Moreover, from Lemma 4

[TABLE]

Therefore the original sampling step in (3.4) is equivalent to sampling independently

[TABLE]

for $p=0,\dots,d$ . The thesis follows from the equivalence between (3.4) and (3.5). ∎

Proofs of Corollary 5, Theorem 10 and Theorem 11

Proof of Corollary 5.

The map from ${\bm{\beta}}_{T}$ to $(\delta^{(0)}{\bm{\beta}},\dots,\delta^{(k-1)}{\bm{\beta}})$ is an injective linear transformation. The injectivity holds because for any $d\in\{0,\dots,k-1\}$ and $t\in T_{d}$ we can reconstruct $\beta_{t}$ from $(\delta^{(0)}{\bm{\beta}}^{(d)},\dots,\delta^{(d)}{\bm{\beta}}^{(d)})$

[TABLE]

It follows that $(\delta{\bm{\beta}}(s))_{s\in\mathbb{N}}=(\delta^{(0)}{\bm{\beta}}(s),\dots,\delta^{(k-1)}{\bm{\beta}}(s))_{s\in\mathbb{N}}$ is a Markov chain with the same rate of convergence as the original chain $({\bm{\beta}}(s))_{s\in\mathbb{N}}$ . Then the thesis follows from Theorem 9 and the fact that the rate of convergence of a collection of independent Markov chains equals the supremum of the rates of convergence of the single chains. ∎

Proof of Theorem 10.

We are interested in the rate of convergence of the blocked sampler targeting $\delta^{(p)}{\bm{\beta}}=(\delta^{(p)}{\bm{\beta}}^{(p)},\dots,\delta^{(p)}{\bm{\beta}}^{(k-1)})$ and evolving according to (22). Consider first the case $p\in\{1,\dots,k-1\}$ . Note that $\delta^{(p)}{\bm{\beta}}$ has a singular variance-covariance matrix because for each $t\in T_{p-1}$ and $d\in\{p,\dots,k-1\}$ it follows from (7.4) and (21) that

[TABLE]

and therefore some elements of $(\delta^{(p)}{\bm{\beta}})$ are linear combinations of the others. In order to use standard tools it is more convenient to work with non-singular Gaussian random vectors. To do so it is sufficient to consider a sub-vector of $\delta^{(p)}{\bm{\beta}}$ obtained by removing from $T_{p}$ one child node for each parent node in $T_{p-1}$ . More formally, let $f$ be an arbitrary map from $T_{p-1}$ to $T_{p}$ such that $f(t)\in ch(t)$ for all $t\in T_{p-1}$ and then define the subset $T^{\prime}_{p}\subseteq T_{p}$ as $T^{\prime}_{p}=T_{p}\backslash f(T_{p-1})$ . It is then easy to see that the resulting sub-vector $\delta_{T^{\prime}_{p}}^{(p)}{\bm{\beta}}=(\delta_{T^{\prime}_{p}}^{(p)}{\bm{\beta}}^{(p)},\dots,\delta_{T^{\prime}_{p}}^{(p)}{\bm{\beta}}^{(k-1)})$ with $\delta_{T^{\prime}_{p}}^{(p)}{\bm{\beta}}^{(d)}=(\delta_{r}^{(p)}{\bm{\beta}}^{(d)})_{r\in T^{\prime}_{p}}$ for all $d\in\{p,\dots,k-1\}$ has an invertible variance-covariance matrix. Moreover, since each $\delta^{(p)}{\bm{\beta}}^{(d)}$ is a function of the corresponding $\delta_{T^{\prime}_{p}}^{(p)}{\bm{\beta}}^{(d)}$ via (3.6) it follows that the blocked sampler targeting $\delta^{(p)}{\bm{\beta}}$ and evolving according to (22) is equivalent in distribution to a blocked Gibbs Sampler targeting $\delta_{T^{\prime}_{p}}^{(p)}{\bm{\beta}}$ and evolving according to

[TABLE]

for $d\in\{p,\dots,k-1\}$ . Let $A_{\delta^{(p)}_{T^{\prime}_{p}}{\bm{\beta}}}=(A_{\delta_{r}^{(p)}{\bm{\beta}}^{(d)}\,\delta_{r^{\prime}}^{(p)}{\bm{\beta}}^{(d^{\prime})}})_{r,r^{\prime}\in T^{\prime}_{p}\,,\,d,d^{\prime}\in\{p,\dots,k-1\}}$ be the $A$ -matrix associated to the Gibbs sampler in (3.7), defined by

[TABLE]

See the discussion of Theorem 1.1 for some details and references on $A$ -matrices. Then Lemma 4 implies that $A_{\delta_{r}^{(p)}{\bm{\beta}}^{(d)}\,\delta_{r^{\prime}}^{(p)}{\bm{\beta}}^{(d^{\prime})}}=c_{dd^{\prime}}$ if $r=r^{\prime}$ and $d^{\prime}\in\{p,\dots,k-1\}\backslash d$ and 0 otherwise. The latter is equivalent to the equation

[TABLE]

where $C^{(p)}$ is the $(k-p)\times(k-p)$ square matrix $C^{(p)}=(c_{dd^{\prime}})_{d,d^{\prime}=p}^{k-1}$ , $\mathbb{I}_{n}$ denotes the $n$ dimensional identity matrix, $\varotimes$ denotes the Kronecker product of matrices and $|T^{\prime}_{p}|=|T_{p}|-|T_{p-1}|$ is the cardinality of $T^{\prime}_{p}$ . Theorem 1.1 implies that the rate of convergence of the Gibbs sampler in (3.7) equals the largest modulus eigenvalue of $B=(\mathbb{I}_{(k-p)|T^{\prime}_{p}|}-L)^{-1}U$ , where $L$ is the lower triangular part of $A_{\delta^{(p)}_{T^{\prime}_{p}}{\bm{\beta}}}$ and $U=A_{\delta^{(p)}_{T^{\prime}_{p}}{\bm{\beta}}}-L$ . Using basic properties of the Kronecker product we can see that

[TABLE]

where $\tilde{L}$ is the lower triangular part of $(C^{(p)}-\mathbb{I}_{k-p})$ and $\tilde{U}=(C^{(p)}-\mathbb{I}_{k-p})-\tilde{L}$ . From $B=\tilde{B}\varotimes\mathbb{I}_{|T^{\prime}_{p}|}$ , where $\tilde{B}=(\mathbb{I}_{(k-p)}-\tilde{L})^{-1}\tilde{U}$ , it follows that the distinct eigenvalues of $B$ are the same as the distinct eigenvalues of $\tilde{B}$ and thus the largest modulus eigenvalue of $B$ equals the one of $\tilde{B}$ . The case $p=0$ is analogous (with no need to consider a sub-vector of $\delta^{(0)}{\bm{\beta}}$ and $T^{\prime}_{0}$ being equal to $T_{0}$ itself) and trivial to check. ∎
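The Kronecker-product step in the proof of Theorem 10 can be illustrated numerically. In the sketch below the $3\times 3$ matrix `C` (with unit diagonal), the block size `n` and the helper names `gibbs_B` and `spectral_radius` are hypothetical choices made for the example. It checks that the $B$ -matrix of $(C-\mathbb{I})\varotimes\mathbb{I}_{n}$ equals $\tilde{B}\varotimes\mathbb{I}_{n}$ and that the two largest modulus eigenvalues coincide.

```r
# Hypothetical 3 x 3 matrix C with unit diagonal (illustrative values only)
C <- matrix(c(1.0, 0.3, 0.0,
              0.4, 1.0, 0.2,
              0.0, 0.5, 1.0), nrow = 3, byrow = TRUE)
m <- nrow(C)   # plays the role of k - p
n <- 4         # plays the role of |T'_p|

# B = (I - L)^{-1} U, with L the strictly lower triangular part of A
gibbs_B <- function(A) {
  L <- A * lower.tri(A)
  U <- A - L
  solve(diag(nrow(A)) - L) %*% U
}
spectral_radius <- function(M) max(Mod(eigen(M, only.values = TRUE)$values))

A_small <- C - diag(m)                      # A-matrix of the skeleton sampler
A_big <- kronecker(A_small, diag(n))        # A-matrix of the full sampler

B_small <- gibbs_B(A_small)
B_big <- gibbs_B(A_big)

kron_ok <- isTRUE(all.equal(B_big, kronecker(B_small, diag(n))))
rate_ok <- isTRUE(all.equal(spectral_radius(B_big), spectral_radius(B_small)))
```

Both flags should be `TRUE`. With these particular values the common rate is $0.3\cdot 0.4+0.2\cdot 0.5=0.22$ , whatever the block size `n`.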

Proof of Theorem 11.

Theorem 9 shows that $(\delta^{(p)}{\bm{\gamma}}(s))_{s}$ for $p\in\{0,\dots,k-1\}$ are $k$ independent Markov chains. Arguing as in the proof of Theorem 10 above for each $p$ we consider $(\delta_{T^{\prime}_{p}}^{(p)}{\bm{\gamma}}(s))_{s}$ rather than $(\delta^{(p)}{\bm{\gamma}}(s))_{s}$ to avoid working with singular Gaussian random vectors. For any $p$ from 0 to $k-1$ , (T) implies that the $Q$ -matrix of $\delta_{T^{\prime}_{p}}^{(p)}{\bm{\gamma}}$ is tridiagonal (with $k-p$ blocks corresponding to $\delta_{T^{\prime}_{p}}^{(p)}{\bm{\gamma}}^{(p)}$ up to $\delta_{T^{\prime}_{p}}^{(p)}{\bm{\gamma}}^{(k-1)}$ ). It follows by Theorem 5 of Roberts and Sahu [1997] that the largest modulus eigenvalue of the autoregressive matrix $B$ of the Gibbs Sampler in (3.7) coincides with the square of the largest eigenvalue of the corresponding $A$ -matrix. We denote the latter by $\lambda(A_{\delta_{T^{\prime}_{p}}^{(p)}{\bm{\gamma}}(s)})^{2}$ , where $\lambda(\cdot)$ is the function mapping a symmetric matrix to its largest eigenvalue. Then, using (3.9) from the proof of Theorem 10, it follows that $\lambda(A_{\delta_{T^{\prime}_{p}}^{(p)}{\bm{\gamma}}(s)})^{2}=\lambda(C^{(p)}-\mathbb{I}_{k-p})^{2}$ and thus the rate of convergence of $(\delta^{(p)}{\bm{\gamma}}(s))_{s}$ is given by $\rho(\delta^{(p)}{\bm{\gamma}}(s))=\lambda(C^{(p)}-\mathbb{I}_{k-p})^{2}$ . Noting that $C^{(p+1)}-\mathbb{I}_{k-(p+1)}$ is obtained from $C^{(p)}-\mathbb{I}_{k-p}$ by removing the first row and column, the desired inequality $\rho(\delta^{(p)}{\bm{\gamma}}(s))=\lambda(C^{(p)}-\mathbb{I}_{k-p})^{2}\geq\lambda(C^{(p+1)}-\mathbb{I}_{k-(p+1)})^{2}=\rho(\delta^{(p+1)}{\bm{\gamma}}(s))$ follows by applying the Cauchy interlacing theorem (see e.g. Bhatia [2013]), which states that the eigenvalues of a principal submatrix of a symmetric matrix interlace the original eigenvalues. ∎
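To make the interlacing step concrete, consider $k=3$ and assume, for this illustration only, that $C^{(0)}$ is tridiagonal with unit diagonal, so that

$$C^{(0)}-\mathbb{I}_{3}=\begin{pmatrix}0&c_{01}&0\\ c_{10}&0&c_{12}\\ 0&c_{21}&0\end{pmatrix},\qquad C^{(1)}-\mathbb{I}_{2}=\begin{pmatrix}0&c_{12}\\ c_{21}&0\end{pmatrix}.$$

The characteristic polynomials are $x(x^{2}-c_{01}c_{10}-c_{12}c_{21})$ and $x^{2}-c_{12}c_{21}$ respectively, hence

$$\lambda(C^{(0)}-\mathbb{I}_{3})^{2}=c_{01}c_{10}+c_{12}c_{21}\;\geq\;c_{12}c_{21}=\lambda(C^{(1)}-\mathbb{I}_{2})^{2}$$

whenever $c_{01}c_{10}\geq 0$ , which is the inequality $\rho(\delta^{(0)}{\bm{\gamma}}(s))\geq\rho(\delta^{(1)}{\bm{\gamma}}(s))$ in this special case.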

3.1 Convergence rates for the example in Section 7.6 of the paper

If instead $\lambda_{1}=0$ (i.e. using $(\gamma_{i})_{i}$ with $\gamma_{i}=\mu+a_{i}$ at level 1) we have four different parametrizations with rates of convergence given by

[TABLE]

4 Proofs for hierarchical models with three levels

Proof of Theorems 1, 2 and 3

Theorems 1, 2 and 3 are essentially special cases of the analogous theorems for $k$ -levels. In particular Theorem 1 is a special case of Theorem 9 for $k=2$ and Theorems 2 and 3 can be directly verified as follows. Using Corollary 5 we can evaluate the rates of convergence of the three subchains $\rho(\delta^{(0)}{\bm{\beta}}(s))$ , $\rho(\delta^{(1)}{\bm{\beta}}(s))$ and $\rho(\delta^{(2)}{\bm{\beta}}(s))$ for the four parametrizations under consideration in Section 3 and check by inspection that

[TABLE]

and that the rates of convergence $\rho(\delta^{(0)}{\bm{\beta}}(s))$ are the ones given by Theorem 3. In particular the rates of convergence of $\delta^{(0)}{\bm{\beta}}(s)$ , $\delta^{(1)}{\bm{\beta}}(s)$ and $\delta^{(2)}{\bm{\beta}}(s)$ under $GS({\bm{\beta}})$ are given by the following Table. The inequality $\rho(\delta^{(1)}{\bm{\beta}}(s))\geq\rho(\delta^{(2)}{\bm{\beta}}(s))$ is trivial, while the inequality $\rho(\delta^{(0)}{\bm{\beta}}(s))\geq\rho(\delta^{(1)}{\bm{\beta}}(s))$ can be checked case by case using the expressions in Figure 1 and the fact that the ratios of variances lie between 0 and 1.

Proofs of Corollaries 1-3 and Theorems 7-8

Proof of Corollary 1.

By Theorem 1 the rate of convergence of the whole chain $({\bm{\beta}}(s))_{s\in\mathbb{N}}$ coincides with the maximum of the rates of the subchains, meaning that $\rho({\bm{\beta}}(s))$ equals the maximum of $\rho(\delta^{(0)}{\bm{\beta}}(s))$ , $\rho(\delta^{(1)}{\bm{\beta}}(s))$ and $\rho(\delta^{(2)}{\bm{\beta}}(s))$ . By Theorem 2 the latter equals $\rho(\delta^{(0)}{\bm{\beta}}(s))$ . ∎

Proof of Corollary 2.

Follows from Theorem 3 by checking that for both ${\bm{\alpha}}=\bm{\gamma}$ and ${\bm{\alpha}}=\textbf{a}$ , the inequality $\rho_{(\mu,{\bm{\alpha}},\bm{\eta})}\leq\rho_{(\mu,{\bm{\alpha}},\textbf{b})}$ holds if and only if $\tilde{\sigma}_{b}^{2}\geq\tilde{\sigma}_{e}^{2}$ ; and for both ${\bm{\alpha}}=\bm{\gamma}$ and ${\bm{\alpha}}=\textbf{a}$ the inequality $\rho_{(\mu,\bm{\gamma},{\bm{\alpha}})}\leq\rho_{(\mu,\textbf{a},{\bm{\alpha}})}$ holds if and only if $\tilde{\sigma}_{a}^{2}\geq\tilde{\sigma}_{b}^{2}+\tilde{\sigma}_{e}^{2}$ . ∎

Proof of Theorem 7.

The Markov chain under consideration is a Gibbs Sampler sweeping through $(\mu,\beta_{1},\dots,\beta_{I})|\textbf{y}$ for some observed data $\textbf{y}=((y_{ij})_{j=1}^{J_{i}})_{i=1}^{I}$ assuming Model NS2 and $\beta_{i}=\gamma_{i}-\lambda_{i}\mu$ with $\lambda_{i}\in\{0,1\}$ . To compute the Gibbs Sampler rate of convergence we first need to compute the $(I+1)\times(I+1)$ matrix $A$ indexed by $(\alpha_{1},\alpha_{2})\in\{\mu,\beta_{1},\dots,\beta_{I}\}\times\{\mu,\beta_{1},\dots,\beta_{I}\}$ and defined as $A_{\alpha_{1}\alpha_{2}}=-\frac{Q_{\alpha_{1}\alpha_{2}}}{Q_{\alpha_{1}\alpha_{1}}}$ for $\alpha_{1}\neq\alpha_{2}$ and $A_{\alpha_{1}\alpha_{2}}=0$ for $\alpha_{1}=\alpha_{2}$ , where $Q$ is the precision matrix of $(\mu,\beta_{1},\dots,\beta_{I})|\textbf{y}$ . See the discussion of Theorem 1.1 for more details and references on the derivation of $A$ -matrices. By computing the precision matrix of $(\mu,\beta_{1},\dots,\beta_{I})|\textbf{y}$ it can be seen that $A$ is given by

[TABLE]

where for all $i$

[TABLE]

From Theorem 1.1 the rate of convergence of the Gibbs sampler of interest equals the largest modulus eigenvalue of the autoregressive matrix $B=(\mathbb{I}_{I+1}-L)^{-1}U$ , where $\mathbb{I}_{I+1}$ is the $(I+1)$ -dimensional identity matrix, $L$ is the lower triangular part of $A$ and $U=A-L$ . In this case $B$ is given by

[TABLE]

Finally note that $B$ has $I$ eigenvalues equal to 0 and one equal to

[TABLE]

∎

Proof of Corollary 3.

Starting from (14) we can see that for any $i\in\{1,\dots,I\}$

[TABLE]

where $\rho_{-i}=\sum_{\ell\neq i\,:\,\lambda_{\ell}=0}\tilde{\tau}_{\ell}\frac{\tilde{\tau}_{\ell}}{\tilde{\tau}_{\ell}+\tau_{a}}+\sum_{\ell\neq i\,:\,\lambda_{\ell}=1}\tilde{\tau}_{a}\frac{\tilde{\tau}_{a}}{\tilde{\tau}_{\ell}+\tau_{a}}\geq 0$ . Equation (4.1) implies that $\rho_{\lambda_{1}\dots\lambda_{i-1}0\lambda_{i+1}\dots\lambda_{I}}>\rho_{\lambda_{1}\dots\lambda_{i-1}1\lambda_{i+1}\dots\lambda_{I}}$ if and only if $\tau_{a}>\tilde{\tau}_{i}$ , which in turn implies the statement in Corollary 3. ∎

Proof of Theorem 8.

Given an instance of Model NS3 with variance terms $(\sigma_{a}^{2},(\sigma_{b,i}^{2})_{i},(\sigma_{e,ij}^{2})_{ij})$ satisfying

[TABLE]

the proof will proceed by comparing the original Gibbs Sampler with an auxiliary Gibbs Sampler targeting a different instance of Model NS3 with variance terms $(\sigma_{a}^{2},(\sigma_{b,i}^{2})_{i},(\bar{\sigma}_{e,ij}^{2})_{ij})$ satisfying (S*) and thus allowing direct analysis using Corollary 6. In the context of Model NS3, (S*) reduces to requiring $\sum_{j=1}^{J_{i}}\rho_{\gamma_{i}\eta_{ij}}^{2}$ to be constant over $i$ , where $\rho_{\gamma_{i}\eta_{ij}}$ is the partial correlation $Corr(\gamma_{i},\eta_{ij}|\mu,(\gamma_{\ell})_{\ell\neq i},(\eta_{\ell s})_{(\ell s)\neq(ij)})$ as in Section 7.3 of the paper. By computing the partial correlations of Model NS3 it can be checked that $\sum_{j=1}^{J_{i}}\rho_{\gamma_{i}\eta_{ij}}^{2}=r^{(i)}_{a,b}r^{(i)}_{e,b}$ , where $r^{(i)}_{a,b}$ and $r^{(i)}_{e,b}$ are defined in Theorem 8. For each $i=1,\dots,I$ we define auxiliary variance terms $(\bar{\sigma}_{e,ij}^{2})_{j=1}^{J_{i}}$ such that $\bar{\sigma}_{e,ij}^{2}\geq\sigma_{e,ij}^{2}$ for all $j=1,\dots,J_{i}$ and

[TABLE]

Such $(\bar{\sigma}_{e,ij}^{2})_{j=1}^{J_{i}}$ exist because $r^{(i)}_{a,b}\geq\max_{\ell=1,\dots,I}r^{(\ell)}_{a,b}r^{(\ell)}_{e,b}$ by (4.2) and the left hand side of (4.3) can take any value in $(0,r^{(i)}_{a,b}]$ for $(\bar{\sigma}_{e,ij}^{2})_{j=1}^{J_{i}}$ belonging to $[0,\infty)$ . (4.3) implies that the instance of Model NS3 with variance terms $(\sigma_{a}^{2},(\sigma_{b,i}^{2})_{i},(\bar{\sigma}_{e,ij}^{2})_{ij})$ satisfies (S*) with $c_{0}=\sum_{i=1}^{I}\rho_{\mu\gamma_{i}}^{2}=1-\frac{1}{I}\sum_{i=1}^{I}r^{(i)}_{a,b}$ and $c_{1}=\sum_{j=1}^{J_{i}}\rho_{\gamma_{i}\eta_{ij}}^{2}=\max_{\ell=1,\dots,I}r^{(\ell)}_{a,b}r^{(\ell)}_{e,b}$ . As discussed in Example 3, for models with centred parametrization like Model NS3, (S*) implies (S) and, after rescaling, ( $\widetilde{\hbox{S}}$ ). In this case the matrix $C=(c_{dp})_{d,p=0}^{2}$ is given by

[TABLE]

Therefore, by Corollary 6, the rate of convergence of the Gibbs Sampler targeting the posterior distribution of Model NS3 with variance terms $(\sigma_{a}^{2},(\sigma_{b,i}^{2})_{i},(\bar{\sigma}_{e,ij}^{2})_{ij})$ is given by $c_{0}+c_{1}$ , which is the largest squared eigenvalue of $C-\mathbb{I}_{3}$ , where $\mathbb{I}_{3}$ is the 3-dimensional identity matrix.

Finally we show that the Gibbs Sampler rate of convergence induced by the auxiliary variance terms $(\sigma_{a}^{2},\sigma_{b,i}^{2},\bar{\sigma}_{e,ij}^{2})$ is greater than or equal to the original one given by $(\sigma_{a}^{2},\sigma_{b,i}^{2},\sigma_{e,ij}^{2})$ . Denote by $Q$ the precision matrix of the original posterior distribution and by $\bar{Q}$ the auxiliary one. By deriving $Q$ and $\bar{Q}$ from the definition of Model NS3, it is easy to see that the only terms of $Q$ and $\bar{Q}$ affected by replacing $\sigma_{e,ij}^{2}$ with $\bar{\sigma}_{e,ij}^{2}$ are $(Q_{\eta_{ij}\eta_{ij}})_{ij}$ and $(\bar{Q}_{\eta_{ij}\eta_{ij}})_{ij}$ . Moreover $\bar{\sigma}_{e,ij}^{2}\geq\sigma_{e,ij}^{2}$ implies $Q_{\eta_{ij}\eta_{ij}}=\frac{1}{\sigma_{b,i}^{2}}+\frac{K_{ij}}{\sigma_{e,ij}^{2}}\geq\frac{1}{\sigma_{b,i}^{2}}+\frac{K_{ij}}{\bar{\sigma}_{e,ij}^{2}}=\bar{Q}_{\eta_{ij}\eta_{ij}}$ . The result then follows from the fact that the convergence rate of a deterministic scan Gibbs Sampler with single-site update is a non-increasing function of the diagonal elements of the target precision matrix (when the off-diagonal terms are kept constant), see Theorem 7 of Roberts and Sahu [1997]. ∎
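The monotonicity property invoked at the end of the proof of Theorem 8 is easy to see in the simplest case. As an illustration (a two-component example, not the general statement of Roberts and Sahu [1997]), take a bivariate Gaussian target with precision matrix entries $Q_{11}$ , $Q_{22}$ and $Q_{12}=Q_{21}$ . The recipe of Theorem 1.1 gives

$$A=\begin{pmatrix}0&-\frac{Q_{12}}{Q_{11}}\\ -\frac{Q_{12}}{Q_{22}}&0\end{pmatrix},\qquad B=(\mathbb{I}_{2}-L)^{-1}U=\begin{pmatrix}0&-\frac{Q_{12}}{Q_{11}}\\ 0&\frac{Q_{12}^{2}}{Q_{11}Q_{22}}\end{pmatrix},$$

so the rate of convergence is $\frac{Q_{12}^{2}}{Q_{11}Q_{22}}$ , which does not increase when $Q_{11}$ or $Q_{22}$ grows with $Q_{12}$ held fixed.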

5 Proofs for crossed models

Proof of Theorem 5.

The case $(\lambda_{1},\lambda_{2})=(1,1)$ follows directly from Theorem 4. Consider an arbitrary $(\lambda_{1},\lambda_{2})$ in $\{0,1\}^{2}$ and, analogously to (6), define

[TABLE]

for $s\in\{1,2\}$ . It can be checked that the first part of Theorem 4 extends to any $(\lambda_{1},\lambda_{2})\in\{0,1\}^{2}$ , meaning that $\left((\mu,{\bar{\beta}}^{(1)},{\bar{\beta}}^{(2)})(t)\right)_{t=1}^{\infty}$ , $\left(\delta{\bm{\beta}}^{(1)}(t)\right)_{t=1}^{\infty}$ and $\left(\delta{\bm{\beta}}^{(2)}(t)\right)_{t=1}^{\infty}$ are three independent Markov chains. Also, $\left(\delta{\bm{\beta}}^{(1)}(t)\right)_{t=1}^{\infty}$ and $\left(\delta{\bm{\beta}}^{(2)}(t)\right)_{t=1}^{\infty}$ perform i.i.d. sampling from $\mathcal{L}(\delta{\bm{\beta}}^{(1)}|\bm{y})$ and $\mathcal{L}(\delta{\bm{\beta}}^{(2)}|\bm{y})$ respectively and thus have rate of convergence equal to 0. It follows that the rate of convergence of the original Gibbs Sampler $\left((\mu,{\bm{\beta}}^{(1)},{\bm{\beta}}^{(2)})(t)\right)_{t=1}^{\infty}$ coincides with the one of the (low dimensional) three component Gibbs sampler targeting $\mathcal{L}(\mu,\bar{\beta}^{(1)},\bar{\beta}^{(2)}|\bm{y})$ . Denote by $A_{\lambda_{1}\lambda_{2}}$ the $A$ -matrix of the Gibbs Sampler targeting $\mathcal{L}(\mu,\bar{\beta}^{(1)},\bar{\beta}^{(2)}|\bm{y})$ , and by $L_{\lambda_{1}\lambda_{2}}$ and $U_{\lambda_{1}\lambda_{2}}$ its lower and upper triangular parts. Then it holds

[TABLE]

and it can be obtained with simple calculations that the 3 eigenvalues of $(I-L_{01})^{-1}U_{01}$ are $\{0,0,1-r_{2}(1-r_{1})\}$ . It follows by Theorem 1.1 that $\rho_{01}=1-r_{1}(1-r_{2})$ and, by symmetry, $\rho_{10}=1-r_{2}(1-r_{1})$ . Consider now $(\lambda_{1},\lambda_{2})=(0,0)$ . We have

[TABLE]

The matrix $(I-L_{00})^{-1}U_{00}$ has one zero eigenvalue and two eigenvalues given by $\frac{1}{2}\left(1+r_{1}r_{2}+q\right)$ and $\frac{1}{2}\left(1+r_{1}r_{2}-q\right)$ where

[TABLE]

It follows, again by Theorem 1.1, that $\rho_{00}=\frac{1}{2}\left(1+r_{1}r_{2}+q\right)$ . To conclude the proof we now check that the inequality $\rho_{00}\geq 1+r_{1}r_{2}-\min\{r_{1},r_{2}\}$ holds for all $0<r_{1},r_{2}<1$ . The latter is equivalent to

[TABLE]

Squaring both sides, rearranging and dividing by 4 one obtains

[TABLE]

Dividing by $\min\{r_{1},r_{2}\}$ and using $\frac{r_{1}r_{2}}{\min\{r_{1},r_{2}\}}=\max\{r_{1},r_{2}\}$

[TABLE]

The left-hand side equals

[TABLE]

which can be easily seen to be non-negative as both terms in the product are non-negative due to $0<r_{1},r_{2}<1$ . ∎
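The lower bound checked at the end of the proof of Theorem 5 is the larger of the two mixed rates. Reading off $\rho_{01}=1-r_{1}(1-r_{2})$ and $\rho_{10}=1-r_{2}(1-r_{1})$ as stated in the proof, one has

$$\max\{\rho_{01},\rho_{10}\}=\max\{1+r_{1}r_{2}-r_{1},\,1+r_{1}r_{2}-r_{2}\}=1+r_{1}r_{2}-\min\{r_{1},r_{2}\},$$

so the inequality $\rho_{00}\geq 1+r_{1}r_{2}-\min\{r_{1},r_{2}\}$ is the same as $\rho_{00}\geq\max\{\rho_{01},\rho_{10}\}$ .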

To prove Theorem 6 we first need the following lemma. Consider the linear mapping from $(\mu,{\bm{a}})$ to $(\mu,\bar{a},\tilde{\delta}{\bm{a}})$ defined as

[TABLE]

where $\bar{a}^{(s)}=\frac{1}{n_{s}}\sum_{i=1}^{n_{s}}a^{(s)}_{i}$ and $\tilde{\delta}{\bm{a}}^{(s)}=(\tilde{\delta}a^{(s)}_{i})_{i=1}^{n_{s}}$ for $s=1,\dots,k$ with

[TABLE]

if $\|w^{(s)}\|^{2}>n_{s}^{-1}$ and $v^{(s)}_{i}=n_{s}^{-1}$ if $\|w^{(s)}\|^{2}=n_{s}^{-1}$ .

Lemma 10.

Denote the $k$ linear constraints $c_{1}=\dots=c_{k}=0$ by $c=0$ for brevity. Then $(\mu,\bar{a})$ and $\tilde{\delta}{\bm{a}}$ are conditionally independent given $\bm{y}$ and $c=0$ , meaning that

[TABLE]

Moreover $\tilde{\delta}{\bm{a}}^{(1)}$ , …, $\tilde{\delta}{\bm{a}}^{(k)}$ are conditionally independent given $\bm{y}$ and $c=0$ , meaning that

[TABLE]

Proof of Lemma 10.

Since the joint distribution $\mathcal{L}(\mu,\bar{a},\tilde{\delta}{\bm{a}}|\bm{y},c=0)$ is multivariate Gaussian it is sufficient to check that the conditional expectations $\mathbb{E}[\mu|\bm{y},\bar{a},\tilde{\delta}{\bm{a}},c=0]$ and $\mathbb{E}[\bar{a}^{(s)}|\bm{y},\mu,\bar{a}^{(-s)},\tilde{\delta}{\bm{a}},c=0]$ do not depend on $\tilde{\delta}{\bm{a}}$ and similarly that $\mathbb{E}[\tilde{\delta}{\bm{a}}^{(s)}|\bm{y},\mu,\bar{a},\tilde{\delta}{\bm{a}}^{(-s)},c=0]$ does not depend on $\mu$ , $\bar{a}$ and $\tilde{\delta}{\bm{a}}^{(-s)}$ in order to deduce (5.1). With standard (but tedious) calculations for conditioned multivariate Gaussian distributions one can explicitly compute

[TABLE]

where $\tilde{r}_{s}$ is defined as

[TABLE]

with $\|w^{(s)}\|^{2}=\sum_{i=1}^{n_{s}}(w_{i}^{(s)})^{2}/((\sum_{i=1}^{n_{s}}w_{i}^{(s)})^{2})$ . ∎

Proof of Theorem 6.

Using Lemma 10 and arguing as in the proof of Theorem 9 we can deduce that $\big((\mu,\bar{a})(t)\big)_{t=1}^{\infty}$ , $\big(\tilde{\delta}{\bm{a}}^{(1)}(t)\big)_{t=1}^{\infty}$ , …, $\big(\tilde{\delta}{\bm{a}}^{(k)}(t)\big)_{t=1}^{\infty}$ are $(k+1)$ Markov chains and they evolve independently. Thus the rate of convergence of the original Markov chain $\left((\mu,{\bm{a}})(t)\right)_{t=1}^{\infty}$ coincides with the supremum of the rates of convergence of these $(k+1)$ Markov chains. Since $\big(\tilde{\delta}{\bm{a}}^{(s)}(t)\big)_{t=1}^{\infty}$ performs i.i.d. sampling for all $s=1,\dots,k$ , its rate of convergence is 0 and thus the rate of convergence of $\left((\mu,{\bm{a}})(t)\right)_{t=1}^{\infty}$ coincides with the one of $\big((\mu,\bar{a})(t)\big)_{t=1}^{\infty}$ . The latter is a $(k+1)$ -component Gaussian Gibbs Sampler with one dimensional components. Using (5.5) and arguing as in the proof of Proposition 3 of Papaspiliopoulos et al. [2019] one can show that the $B$ matrix (as denoted in Theorem 1.1) of such Gibbs Sampler scheme is

[TABLE]

where $L$ is a $k\times k$ lower triangular matrix with diagonal elements equal to $(\tilde{r}_{1},\dots,\tilde{r}_{k})$ , and that the resulting rate of convergence of $(\mu(t),\bar{a}(t))_{t}$ is $\max_{s\in\{1,\dots,k\}}\tilde{r}_{s}$ , which is the desired thesis. ∎

6 Full conditionals and additional material on the simulations

6.1 Full conditional distributions of GS(1,1) and GS(0,0)

This section reports the full conditional distributions involved in Samplers GS(1,1) and GS(0,0) described in Section 2 of the paper.

Sampler GS( $1,1$ ).

Initialize $\mu(0)$ , $\textbf{a}(0)$ and $\textbf{b}(0)$ and then iterate

[TABLE]

where we use the dot subscript to indicate averaging over one dimension, meaning that $a_{\cdot}=\sum_{i}\frac{a_{i}}{I}$ , $b_{\cdot\cdot}=\sum_{i,j}\frac{b_{ij}}{IJ}$ , $y_{\cdot\cdot\cdot}=\sum_{i,j,k}\frac{y_{ijk}}{IJK}$ , ${b}_{i\cdot}=\sum_{j}\frac{b_{ij}}{J}$ , $y_{i\cdot\cdot}=\sum_{j,k}\frac{y_{ijk}}{JK}$ and $y_{ij\cdot}=\sum_{k}\frac{y_{ijk}}{K}$ .
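The dot-subscript averages can be computed directly from arrays. The following R sketch illustrates the notation on simulated data; the dimensions `nI`, `nJ`, `nK` and the arrays `y` and `b` are hypothetical and only serve the illustration.

```r
# Hypothetical dimensions and data, for illustration only.
set.seed(1)
nI <- 3; nJ <- 4; nK <- 5
y <- array(rnorm(nI * nJ * nK), dim = c(nI, nJ, nK))  # y[i, j, k]
b <- matrix(rnorm(nI * nJ), nrow = nI, ncol = nJ)     # b[i, j]

y_ddd <- mean(y)                  # y_{...}: average over i, j and k
y_idd <- apply(y, 1, mean)        # y_{i..}: average over j and k, one value per i
y_ijd <- apply(y, c(1, 2), mean)  # y_{ij.}: average over k, one value per (i, j)
b_id  <- rowMeans(b)              # b_{i.}: average over j, one value per i
b_dd  <- mean(b)                  # b_{..}: average over i and j
```

In this balanced setting, averaging the partial averages recovers the overall one, for example `mean(y_idd)` equals `y_ddd` up to floating point error.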

Sampler GS( $0,0$ ).

Initialize $\mu(0)$ , $\bm{\gamma}(0)$ and $\bm{\eta}(0)$ and then iterate

[TABLE]

where as before the dot subscript indicates averaging over indices.

6.2 Traceplots and autocorrelation function for the simulations in Section 5 of the paper

Figures 2-4 provide traceplots and autocorrelation functions for $\log(\mu)$, $\log(a^{(1)}_{2})$ and $\log(a^{(2)}_{2})$ for the Gibbs Samplers considered in Table 2 of the paper (NB: we consider $a^{(1)}_{2}$ and $a^{(2)}_{2}$ rather than $a^{(1)}_{1}$ and $a^{(2)}_{1}$ because the latter are constrained to be equal to 1 in one of the sampler implementations).

Similarly, Figures 5 and 6 provide traceplots and autocorrelation functions for $\log(\mu)$ , $\log(a^{(1)}_{2})$ and $\log(a^{(2)}_{2})$ for the HMC and NUTS algorithms considered in Table 3 of the paper. See Table 3 of the paper for corresponding runtimes and effective sample sizes.
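Diagnostics of this kind can be obtained with base R. The sketch below is not the code used for the figures: it uses a synthetic AR(1) chain as a stand-in for the draws of $\log(\mu)$, and the effective sample size estimate (truncating the autocorrelation sum at the first non-positive lag) is a simple rule assumed here for illustration.

```r
set.seed(1)
n_iter <- 5000
phi <- 0.9  # hypothetical lag-1 autocorrelation of the stand-in chain

# Synthetic AR(1) chain standing in for log(mu) from one of the samplers.
log_mu <- as.numeric(arima.sim(model = list(ar = phi), n = n_iter))

# Empirical autocorrelations at lags 1..100 (plot = TRUE would draw the ACF plot;
# plot(log_mu, type = "l") would draw the traceplot).
rho <- acf(log_mu, lag.max = 100, plot = FALSE)$acf[-1]

# Crude effective sample size: keep lags up to the first non-positive autocorrelation.
first_nonpos <- which(rho <= 0)[1]
if (!is.na(first_nonpos)) rho <- rho[seq_len(first_nonpos - 1)]
ess <- n_iter / (1 + 2 * sum(rho))
```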

6.3 R code for the simulations in Section 5 of the paper

In this section we provide the R and Stan code used to perform the simulations reported in Tables 2 and 3 of the paper. First we provide the R code defining the functions for the three Gibbs Samplers implemented in Table 2 of the paper.


# R CODE FOR THE GIBBS SAMPLERS IMPLEMENTED IN TABLE 2
# NOTE: the hyperparameters alpha_mu, beta_mu, alpha_a, beta_a, alpha_b and beta_b
# are not arguments of the functions and must be defined in the calling environment.
# The use of rowMeans(y) and colMeans(y) implies that y is an n1 x n2 matrix.

Gibbs_CrossedPoisson_1 <- function(y, num_iterations, n1, n2){
  # Gibbs Sampler for crossed random effect model with Poisson likelihood and Gamma prior
  # Version 1: unconstrained
  y_bar <- mean(y)
  samples <- matrix(NA, nrow = num_iterations, ncol = 1+n1+n2) # initialize empty matrix for posterior samples
  mu <- 1; a <- rep(1, n1); b <- rep(1, n2) # set starting parameters
  for (t in 1:num_iterations){ # Gibbs iterations
    mu <- rgamma(1, shape = alpha_mu+y_bar, rate = beta_mu+mean(a)*mean(b))
    a <- rgamma(n1, shape = alpha_a+rowMeans(y), rate = beta_a+mu*mean(b))
    b <- rgamma(n2, shape = alpha_b+colMeans(y), rate = beta_b+mu*mean(a))
    samples[t,] <- c(mu, a, b)
  }
  return(samples)
}

Gibbs_CrossedPoisson_2 <- function(y, num_iterations, n1, n2){
  # Gibbs Sampler for crossed random effect model with Poisson likelihood and Gamma prior
  # Version 2: conditioning on a[1]=b[1]=1
  y_bar <- mean(y)
  samples <- matrix(NA, nrow = num_iterations, ncol = 1+n1+n2) # initialize empty matrix for posterior samples
  mu <- 1; a <- rep(1, n1); b <- rep(1, n2) # set starting parameters
  for (t in 1:num_iterations){ # Gibbs iterations
    mu <- rgamma(1, shape = alpha_mu+y_bar, rate = beta_mu+mean(a)*mean(b))
    # draw n1-1 (resp. n2-1) values only: a[1] and b[1] stay fixed at 1
    a[-1] <- rgamma(n1-1, shape = alpha_a+rowMeans(y)[-1], rate = beta_a+mu*mean(b))
    b[-1] <- rgamma(n2-1, shape = alpha_b+colMeans(y)[-1], rate = beta_b+mu*mean(a))
    samples[t,] <- c(mu, a, b)
  }
  return(samples)
}

Gibbs_CrossedPoisson_3 <- function(y, num_iterations, n1, n2){
  # Gibbs Sampler for crossed random effect model with Poisson likelihood and Gamma prior
  # Version 3: conditioning on mean(a)=mean(b)=1
  y_bar <- mean(y)
  samples <- matrix(NA, nrow = num_iterations, ncol = 1+n1+n2) # initialize empty matrix for posterior samples
  mu <- 1; a <- rep(1, n1); b <- rep(1, n2) # set starting parameters
  for (t in 1:num_iterations){ # Gibbs iterations
    mu <- rgamma(1, shape = alpha_mu+y_bar, rate = beta_mu+mean(a)*mean(b))
    a <- rgamma(n1, shape = alpha_a+rowMeans(y), rate = beta_a+mu*mean(b))
    a <- a/mean(a)
    b <- rgamma(n2, shape = alpha_b+colMeans(y), rate = beta_b+mu*mean(a))
    b <- b/mean(b)
    samples[t,] <- c(mu, a, b)
  }
  return(samples)
}
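A possible way to call these functions is sketched below on simulated data. Everything in it is an assumption made for illustration: the hyperparameter values (chosen to mirror the gamma(2, 0.1) priors of the Stan code, not necessarily those used for Table 2), the data dimensions, the true parameter values and the burn-in length. The definition of `Gibbs_CrossedPoisson_3` is repeated only so that the block runs on its own.

```r
# Hypothetical hyperparameters; the sampler reads them as global variables.
alpha_mu <- 2; beta_mu <- 0.1
alpha_a <- 2; beta_a <- 0.1
alpha_b <- 2; beta_b <- 0.1

# Copy of Gibbs_CrossedPoisson_3 (mean(a) = mean(b) = 1 constraint).
Gibbs_CrossedPoisson_3 <- function(y, num_iterations, n1, n2) {
  y_bar <- mean(y)
  samples <- matrix(NA, nrow = num_iterations, ncol = 1 + n1 + n2)
  mu <- 1; a <- rep(1, n1); b <- rep(1, n2)
  for (t in 1:num_iterations) {
    mu <- rgamma(1, shape = alpha_mu + y_bar, rate = beta_mu + mean(a) * mean(b))
    a <- rgamma(n1, shape = alpha_a + rowMeans(y), rate = beta_a + mu * mean(b))
    a <- a / mean(a)
    b <- rgamma(n2, shape = alpha_b + colMeans(y), rate = beta_b + mu * mean(a))
    b <- b / mean(b)
    samples[t, ] <- c(mu, a, b)
  }
  samples
}

# Simulated n1 x n2 table of Poisson counts with made-up true effects.
set.seed(1)
n1 <- 4; n2 <- 6
a_true <- runif(n1, 0.5, 1.5)
b_true <- runif(n2, 0.5, 1.5)
y <- matrix(rpois(n1 * n2, lambda = 5 * outer(a_true, b_true)), nrow = n1, ncol = n2)

samples <- Gibbs_CrossedPoisson_3(y, num_iterations = 1000, n1 = n1, n2 = n2)
colnames(samples) <- c("mu", paste0("a", 1:n1), paste0("b", 1:n2))

# Posterior means after discarding the first 200 draws as burn-in.
post_mean <- colMeans(samples[-(1:200), ])
```

Each row of `samples` holds one draw of `(mu, a, b)`, so the columns can be passed directly to traceplot or autocorrelation diagnostics.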

Second we provide the Stan code defining the models used for the HMC and NUTS simulations in Table 3 of the paper.


// STAN CODE DEFINING THE MODELS USED FOR THE SIMULATIONS IN TABLE 3

/*
  Crossed random effect model with Poisson likelihood and Gamma prior
  Version 1: unconstrained
*/
data {
  int<lower=0> N;
  int<lower=0> n1;
  int<lower=0> n2;
  int<lower=1,upper=n1> blk1[N];
  int<lower=1,upper=n2> blk2[N];
  int<lower=0> y[N];
}
parameters {
  vector<lower=0>[n1] a;
  vector<lower=0>[n2] b;
  real<lower=0> mu;
}
transformed parameters {
  vector[N] lambda;
  real<lower=0> abar;
  real<lower=0> bbar;
  for (n in 1:N){
    lambda[n] = mu * a[blk1[n]] * b[blk2[n]];
  }
  abar = mean(a);
  bbar = mean(b);
}
model {
  a ~ gamma(2, 0.1);
  b ~ gamma(2, 0.1);
  mu ~ gamma(2, 0.1);
  y ~ poisson(lambda);
}

/*
  Crossed random effect model with Poisson likelihood and Gamma prior
  Version 2: conditioning on a[1]=b[1]=1
*/
data {
  int<lower=0> N;
  int<lower=0> n1;
  int<lower=0> n2;
  int<lower=1,upper=n1> blk1[N];
  int<lower=1,upper=n2> blk2[N];
  int<lower=0> y[N];
}
parameters {
  vector<lower=0>[n1-1] a;
  vector<lower=0>[n2-1] b;
  real<lower=0> mu;
}
transformed parameters {
  vector[N] lambda;
  vector<lower=0>[n1] a_all;
  vector<lower=0>[n2] b_all;
  real<lower=0> abar;
  real<lower=0> bbar;
  // first level of each factor is fixed to 1, the others are free parameters
  a_all[1] = 1;
  for (i in 2:n1){
    a_all[i] = a[i-1];
  }
  b_all[1] = 1;
  for (j in 2:n2){
    b_all[j] = b[j-1];
  }
  for (n in 1:N){
    lambda[n] = mu * a_all[blk1[n]] * b_all[blk2[n]];
  }
  abar = mean(a);
  bbar = mean(b);
}
model {
  a ~ gamma(2, 0.1);
  b ~ gamma(2, 0.1);
  mu ~ gamma(2, 0.1);
  y ~ poisson(lambda);
}

/*
  Crossed random effect model with Poisson likelihood and Gamma prior
  Version 3: conditioning on mean(a)=mean(b)=1
*/
data {
  int<lower=0> N;
  int<lower=0> n1;
  int<lower=0> n2;
  int<lower=1,upper=n1> blk1[N];
  int<lower=1,upper=n2> blk2[N];
  int<lower=0> y[N];
  vector[n1] alpha_a;
  vector[n2] alpha_b;
}
parameters {
  simplex[n1] a_norm;
  simplex[n2] b_norm;
  real<lower=0> mu;
}
transformed parameters {
  vector<lower=0>[n1] a;
  vector<lower=0>[n2] b;
  vector[N] lambda;
  real<lower=0> abar;
  real<lower=0> bbar;
  // rescale the simplex so that mean(a) = mean(b) = 1
  a = a_norm * n1;
  b = b_norm * n2;
  for (n in 1:N){
    lambda[n] = mu * a[blk1[n]] * b[blk2[n]];
  }
  abar = mean(a);
  bbar = mean(b);
}
model {
  a_norm ~ dirichlet(alpha_a);
  b_norm ~ dirichlet(alpha_b);
  mu ~ gamma(2, 0.1);
  y ~ poisson(lambda);
}
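The data blocks above expect the observations in long format, with one factor level per observation in `blk1` and `blk2`. A minimal R sketch of how such a data list could be assembled from an n1 x n2 table of counts is given below. The simulated table, the object names `y_mat`, `stan_data` and `stan_data_v3`, and the Dirichlet parameter values are hypothetical.

```r
# Simulated n1 x n2 table of counts (hypothetical), one observation per cell.
set.seed(1)
n1 <- 3; n2 <- 4
y_mat <- matrix(rpois(n1 * n2, lambda = 5), nrow = n1, ncol = n2)

# Long format: one entry per observation, with its row and column factor level.
stan_data <- list(
  N = length(y_mat),
  n1 = n1,
  n2 = n2,
  blk1 = as.vector(row(y_mat)),  # level of the first factor for each observation
  blk2 = as.vector(col(y_mat)),  # level of the second factor for each observation
  y = as.vector(y_mat)
)

# Version 3 also declares alpha_a and alpha_b; the values here are placeholders.
stan_data_v3 <- c(stan_data, list(alpha_a = rep(1, n1), alpha_b = rep(1, n2)))
```

A list of this shape could then be supplied as the `data` argument of `rstan::stan`, assuming the rstan package is installed; that step is not shown here.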

Bibliography

1. Alsmeyer and Fuh [2001] Gerold Alsmeyer and Cheng-Der Fuh. Limit theorems for iterated random functions by regenerative methods. Stochastic Processes and their Applications, 96(1):123–142, 2001.
2. Amit [1991] Yali Amit. On rates of convergence of stochastic relaxation for Gaussian and non-Gaussian distributions. J. Multivariate Anal., 38(1):82–99, 1991.
3. Amit [1996] Yali Amit. Convergence properties of the Gibbs sampler for perturbations of Gaussians. Ann. Statist., 24(1):122–140, 1996.
4. Bass and Sahu [2016a] Mark R Bass and Sujit K Sahu. A comparison of centring parameterisations of Gaussian process-based models for Bayesian computation using MCMC. Statistics and Computing, pages 1–22, 2016a.
5. Bass and Sahu [2016b] Mark R Bass and Sujit K Sahu. A comparison of centring parameterisations of Gaussian process-based models for Bayesian computation using MCMC. Statistics and Computing, pages 1–22, 2016b.
6. Bhatia [2013] Rajendra Bhatia. Matrix Analysis, volume 169. Springer Science & Business Media, 2013.
7. Brown et al. [2018] Lawrence D Brown, Gourab Mukherjee, Asaf Weinstein, et al. Empirical Bayes estimates for a two-way cross-classified model. The Annals of Statistics, 46(4):1693–1720, 2018.
8. Browne [2004] William J Browne. An illustration of the use of reparameterisation methods for improving MCMC efficiency in crossed random effect models. Multilevel Modelling Newsletter, 16(1):13–25, 2004.
