A Framework for Adaptive MCMC Targeting Multimodal Distributions

Emilia Pompe; Chris Holmes; Krzysztof {\L}atuszy\'nski

arXiv:1812.02609·stat.CO·January 14, 2019


TL;DR

This paper introduces an adaptive MCMC framework for efficiently sampling from multimodal distributions by combining local moves and mode-jumping strategies, learning optimal parameters dynamically.

Contribution

It presents a novel auxiliary variable adaptive MCMC method that automatically learns parameters and effectively explores multimodal distributions.

Findings

01 Proves ergodic properties of the proposed class of algorithms.

02 Develops an auxiliary variable scheme for adaptive MCMC.

03 Demonstrates improved sampling efficiency in multimodal contexts.

Abstract

We propose a new Monte Carlo method for sampling from multimodal distributions. The idea of this technique is based on splitting the task into two: finding the modes of a target distribution $π$ and sampling, given the knowledge of the locations of the modes. The sampling algorithm relies on steps of two types: local ones, preserving the mode; and jumps to regions associated with different modes. Besides, the method learns the optimal parameters of the algorithm while it runs, without requiring user intervention. Our technique should be considered as a flexible framework, in which the design of moves can follow various strategies known from the broad MCMC literature. In order to design an adaptive scheme that facilitates both local and jump moves, we introduce an auxiliary variable representing each mode and we define a new target distribution $\tilde{π}$ on an augmented state…


Tables

Table 1: The lowest and the highest value (across 20 runs of the experiment) of the acceptance rates of jump moves between the two modes for the mixture of Gaussians for different jump methods and dimensions.

Dimension	deterministic, lowest	deterministic, highest	Gaussian, lowest	Gaussian, highest	$t$-distributed, lowest	$t$-distributed, highest
d=10	0.98	0.99	0.85	0.87	0.71	0.73
d=20	0.98	0.99	0.79	0.83	0.66	0.68
d=80	0.91	0.98	0.23	0.41	0.24	0.39
d=130	0.72	0.98	0.04	0.13	0.06	0.15
d=160	0.79	0.97	0.01	0.07	0.03	0.07
d=200	0.64	0.97	0.01	0.05	0.02	0.06

Table 2: The lowest and the highest value (across 20 runs of the experiment) of the acceptance rate of jumps from a given mode, dimensions 10 and 20.

Dimension	deterministic, lowest	deterministic, highest	Gaussian, lowest	Gaussian, highest	$t$-distributed, lowest	$t$-distributed, highest
d=10	0.27	0.77	0.12	0.52	0.20	0.65
d=20	0.20	0.75	0.09	0.35	0.11	0.48

Table 3: Settings of the parameters used for the examples presented in this paper.

	Mixture of Gaussians	Mixture of banana-shaped and t-distributions	Sensor network	LOH example
Main algorithm
number of iterations	500,000	500,000	500,000	200,000
$α$	0.7	0.7	0.7	0.7
$β$	0.0001	0.0001	0.0001	0.0001
$ϵ$	0.1	0.1	0.1	0.1
$\tilde{ϵ_{w}}$	0.01	0.01	0.01	0.01
$AC_{2}$	1000	1000	500	500
optimal acceptance rate	0.234	0.234	0.234	0.234
local proposal	Gaussian	Gaussian	Gaussian	Gaussian/ $t$ -distributed
distributions $Q_{i}$	$t$ with 7 df	$t$ with 7 df	$t$ with 7 df	$t$ with 7 df
df of the proposal (if $t$ -distributed)	7	7	7	7
Burn-in algorithm
number of BFGS runs	1500	40,000	10,000	500
$b_{acc}$	1.1	1.1	1.1	1.1

Table 4: The lowest and the highest value (across 20 runs of the experiment) of the acceptance rates of jump moves between the two modes of the posterior distribution in the LOH study for different jump methods (for the Gaussian and $t$-distributed local proposal).

Jump direction	deterministic, lowest	deterministic, highest	Gaussian, lowest	Gaussian, highest	$t$-distributed, lowest	$t$-distributed, highest
Gaussian local proposal
mode 1 to mode 2	0.01	0.02	0.02	0.02	0.02	0.03
mode 2 to mode 1	0.44	0.71	0.70	0.76	0.71	0.77
$t$ -distributed local proposal
mode 1 to mode 2	0.01	0.03	0.02	0.03	0.02	0.02
mode 2 to mode 1	0.50	0.78	0.61	0.76	0.63	0.76

Table 5: First part: number of the target density and its gradient evaluations in the optimisation runs for the mixture of Gaussians. Second part: number of iterations used for the estimation of the covariance matrices in the burn-in algorithm.

Dimension	Optimisation runs, minimum	Optimisation runs, mean	Optimisation runs, maximum	Covariance matrix estimation, minimum	Covariance matrix estimation, maximum
d=10	9	11.39	42	3000	3000
d=20	9	10.61	39	3000	7000
d=80	6	7.36	27	255,000	511,000
d=130	8	8.07	24	511,000	511,000
d=160	6	8.02	22	511,000	511,000
d=200	6	6.85	23	1,023,000	1,023,000

Table 6: First part: number of the target density and its gradient evaluations in the optimisation runs for the mixture of banana-shaped and $t$-distributions. Second part: number of iterations used for the estimation of the covariance matrices in the burn-in algorithm.

Dimension	Optimisation runs, minimum	Optimisation runs, mean	Optimisation runs, maximum	Covariance matrix estimation, minimum	Covariance matrix estimation, maximum
d=10	21	49	220	7000	15,000
d=20	21	48	216	15,000	63,000
d=50	-	-	-	255,000	255,000
d=80	-	-	-	511,000	511,000

Table 7: The lowest and the highest value (across 20 runs of the experiment) of the acceptance rate of jumps from a given mode, dimensions 50 and 80.

Dimension	deterministic, lowest	deterministic, highest	Gaussian, lowest	Gaussian, highest	$t$-distributed, lowest	$t$-distributed, highest
d=50	0.26	1.00	0.03	0.14	0.06	0.23
d=80	0.16	1.00	0.01	0.07	0.03	0.12

Equations

\tilde{\pi}_{\gamma}(x,i) := \pi(x) \frac{w_{\gamma,i} Q_{i}(\mu_{i},\Sigma_{\gamma,i})(x)}{\sum_{j\in\mathcal{I}} w_{\gamma,j} Q_{j}(\mu_{j},\Sigma_{\gamma,j})(x)},

\tilde{\pi}_{\gamma}(B\times\mathcal{I}) = \int_{B} \sum_{i\in\mathcal{I}} \pi(x) \frac{w_{\gamma,i} Q_{i}(\mu_{i},\Sigma_{\gamma,i})(x)}{\sum_{j\in\mathcal{I}} w_{\gamma,j} Q_{j}(\mu_{j},\Sigma_{\gamma,j})(x)} \, dx = \int_{B} \pi(x)\cdot 1 \, dx = \pi(B).

\alpha_{\gamma,L}((x,i)\to(y,i)) = \min\left[1, \frac{\tilde{\pi}_{\gamma}(y,i)}{\tilde{\pi}_{\gamma}(x,i)}\right] = \min\left[1, \frac{\pi(y) Q_{i}(\mu_{i},\Sigma_{\gamma,i})(y)}{\pi(x) Q_{i}(\mu_{i},\Sigma_{\gamma,i})(x)} \frac{\sum_{j\in\mathcal{I}} w_{\gamma,j} Q_{j}(\mu_{j},\Sigma_{\gamma,j})(x)}{\sum_{j\in\mathcal{I}} w_{\gamma,j} Q_{j}(\mu_{j},\Sigma_{\gamma,j})(y)}\right].

\alpha_{\gamma,J}((x,i)\to(y,k)) = \min\left[1, \frac{\tilde{\pi}_{\gamma}(y,k)}{\tilde{\pi}_{\gamma}(x,i)} \frac{a_{\gamma,ki} R_{\gamma,J,i}(x)}{a_{\gamma,ik} R_{\gamma,J,k}(y)}\right].

(x-\mu_{i})^{T} \Sigma_{\gamma,i}^{-1} (x-\mu_{i}) = (y-\mu_{k})^{T} \Sigma_{\gamma,k}^{-1} (y-\mu_{k}).

y := \mu_{k} + \Lambda_{\gamma,k} \Lambda_{\gamma,i}^{-1} (x-\mu_{i}),

\Sigma_{\gamma,i} = \Lambda_{\gamma,i} \Lambda_{\gamma,i}^{T} \text{ and } \Sigma_{\gamma,k} = \Lambda_{\gamma,k} \Lambda_{\gamma,k}^{T}.

\alpha_{\gamma,J}((x,i)\to(y,k)) = \min\left[1, \frac{\tilde{\pi}_{\gamma}(y,k)}{\tilde{\pi}_{\gamma}(x,i)} \frac{a_{\gamma,ki} \det\Sigma_{\gamma,k}}{a_{\gamma,ik} \det\Sigma_{\gamma,i}}\right].

[L_{1},U_{1}] \times \dots \times [L_{d},U_{d}],

\tilde{\pi}_{\gamma}(B\times\Phi) = \pi(B) \text{ for every } B\in\mathcal{B}(\mathcal{X}) \text{ and } \gamma\in\mathcal{Y}.

(\tilde{\pi}_{\gamma}\tilde{P}_{\gamma})(\cdot) = \tilde{\pi}_{\gamma}(\cdot) \text{ and } \lim_{n\to\infty} \| \tilde{P}_{\gamma}^{n}(\tilde{x},\cdot) - \tilde{\pi}_{\gamma}(\cdot) \|_{TV} = 0 \text{ for all } \tilde{x} := (x,\phi) \in \tilde{\mathcal{X}}.

\mathcal{G}_{n} := \sigma\{\tilde{X}_{0},\dots,\tilde{X}_{n},\Gamma_{0},\dots,\Gamma_{n}\}.

\mathbb{P}\big[\tilde{X}_{n+1}\in\tilde{B} \,\big|\, \tilde{X}_{n}=\tilde{x},\Gamma_{n}=\gamma,\mathcal{G}_{n-1}\big] = \tilde{P}_{\gamma}(\tilde{x},\tilde{B}) \text{ for } \tilde{x}\in\tilde{\mathcal{X}}, \gamma\in\mathcal{Y}, \tilde{B}\in\mathcal{B}(\tilde{\mathcal{X}}).

\tilde{A}_{n}^{\mathcal{G}_{t}}(\tilde{B}) := \mathbb{P}\big[\tilde{X}_{n}\in\tilde{B} \,\big|\, \tilde{X}_{0}=\tilde{x}_{0},\dots,\tilde{X}_{t}=\tilde{x}_{t},\Gamma_{0}=\gamma_{0},\dots,\Gamma_{t}=\gamma_{t}\big]

\tilde{A}_{n}^{(\tilde{x},\gamma)}(\tilde{B}) := \tilde{A}_{n}^{\mathcal{G}_{0}}(\tilde{B}) = \mathbb{P}\big[\tilde{X}_{n}\in\tilde{B} \,\big|\, \tilde{X}_{0}=\tilde{x},\Gamma_{0}=\gamma\big] \quad \text{for } \tilde{B}\in\mathcal{B}(\tilde{\mathcal{X}}).

A_{n}^{\mathcal{G}_{t}}(B) := \tilde{A}_{n}^{\mathcal{G}_{t}}(B\times\Phi) \text{ and } A_{n}^{(\tilde{x},\gamma)}(B) := \tilde{A}_{n}^{(\tilde{x},\gamma)}(B\times\Phi), \text{ for } B\in\mathcal{B}(\mathcal{X}).

T_{n}(\tilde{x},\gamma) := \| A_{n}^{(\tilde{x},\gamma)}(\cdot) - \pi(\cdot) \|_{TV} = \sup_{B\in\mathcal{B}(\mathcal{X})} | A_{n}^{(\tilde{x},\gamma)}(B) - \pi(B) |.

\lim_{n\to\infty} T_{n}(\tilde{x},\gamma) = 0 \text{ for all } \tilde{x}\in\tilde{\mathcal{X}}, \gamma\in\mathcal{Y}.

\| \tilde{P}_{\gamma}^{N}(\tilde{x},\cdot) - \tilde{\pi}_{\gamma}(\cdot) \|_{TV} \leq \varepsilon, \text{ for all } \tilde{x}\in\tilde{\mathcal{X}} \text{ and } \gamma\in\mathcal{Y}.

D_{n} := \sup_{\tilde{x}\in\tilde{\mathcal{X}}} \| \tilde{P}_{\Gamma_{n+1}}(\tilde{x},\cdot) - \tilde{P}_{\Gamma_{n}}(\tilde{x},\cdot) \|_{TV}

M_{\varepsilon}(\tilde{x},\gamma) := \inf\{k\geq 1 : \| \tilde{P}_{\gamma}^{k}(\tilde{x},\cdot) - \tilde{\pi}_{\gamma}(\cdot) \|_{TV} \leq \varepsilon\}.

\mathbb{P}(M_{\varepsilon}(\tilde{X}_{n},\Gamma_{n}) > N \mid \tilde{X}_{0}=\tilde{x}, \Gamma_{0}=\gamma) \leq \tilde{\delta}

\frac{\sum_{i=1}^{n} g(X_{i})}{n} \to \pi(g)

\tilde{P}_{\gamma} V_{\tilde{\pi}_{\gamma}}(\tilde{x}) \leq \lambda V_{\tilde{\pi}_{\gamma}}(\tilde{x}) + b \text{ for all } \tilde{x}\in\tilde{\mathcal{X}} \text{ and } \gamma\in\mathcal{Y},

\tilde{P}_{\gamma} V_{\tilde{\pi}_{\gamma}}(\tilde{x}) := \mathbb{E}\left(V_{\tilde{\pi}_{\gamma}}(\tilde{X}_{n+1}) \,\big|\, \tilde{X}_{n}=\tilde{x}, \Gamma_{n}=\gamma\right).

\tilde{P}_{\gamma}^{n_{0}}(\tilde{x},\cdot) \geq \delta \nu_{\gamma}(\cdot) \text{ for all } \tilde{x} \text{ with } V_{\tilde{\pi}_{\gamma}}(\tilde{x}) \leq v.

N_{j} = \sum_{k=1}^{j} n_{k} \text{ with } N_{0} = 0 \text{ and } n_{0} = 0.

D_{n} = \sup_{\tilde{x}\in\tilde{\mathcal{X}}} \| \tilde{P}_{\Gamma_{n+1}}(\tilde{x},\cdot) - \tilde{P}_{\Gamma_{n}}(\tilde{x},\cdot) \|_{TV}

N_{j}^{*} = \sum_{k=1}^{j} n_{k}^{*} \text{ with } N_{0}^{*} = 0 \text{ and } n_{0}^{*} = 0.

n_{k}^{*} = n_{k} + \mathrm{Uniform}[0, \lfloor k^{\kappa^{*}} \rfloor] \text{ for some } \kappa^{*}\in(0,\kappa).


Code & Models

Repositories

timsf/jams


Full text

A Framework for Adaptive MCMC Targeting Multimodal Distributions

Emilia Pompe, Chris Holmes and Krzysztof Łatuszyński

University of Oxford and University of Warwick

Department of Statistics, University of Oxford, 24–29 St Giles', Oxford OX1 3LB, United Kingdom

Department of Statistics, University of Warwick, Coventry, CV4 7AL, United Kingdom

Abstract

We propose a new Monte Carlo method for sampling from multimodal distributions. The idea of this technique is based on splitting the task into two: finding the modes of a target distribution $\pi$ and sampling, given the knowledge of the locations of the modes. The sampling algorithm relies on steps of two types: local ones, preserving the mode; and jumps to regions associated with different modes. Besides, the method learns the optimal parameters of the algorithm while it runs, without requiring user intervention. Our technique should be considered as a flexible framework, in which the design of moves can follow various strategies known from the broad MCMC literature.

In order to design an adaptive scheme that facilitates both local and jump moves, we introduce an auxiliary variable representing each mode and we define a new target distribution $\tilde{\pi}$ on an augmented state space $\mathcal{X}\times\mathcal{I}$ , where $\mathcal{X}$ is the original state space of $\pi$ and $\mathcal{I}$ is the set of the modes. As the algorithm runs and updates its parameters, the target distribution $\tilde{\pi}$ also keeps being modified. This motivates a new class of algorithms, Auxiliary Variable Adaptive MCMC. We prove general ergodic results for the whole class before specialising to the case of our algorithm.

MSC subject classifications: 60J05, 65C05, 62F15.

Keywords: multimodal distribution, adaptive MCMC, ergodicity.

t1: Supported by the EPSRC and MRC Centre for Doctoral Training in Next Generation Statistical Science: the Oxford-Warwick Statistics Programme, EP/L016710/1, and the Clarendon Fund. t2: Supported by the MRC, the EPSRC, the Alan Turing Institute, Health Data Research UK, and the Li Ka Shing foundation. t3: Supported by the Royal Society via the University Research Fellowship scheme.

1 Introduction

Poor mixing of standard Markov Chain Monte Carlo (MCMC) methods, such as the Metropolis-Hastings algorithm or Hamiltonian Monte Carlo, on multimodal target distributions with isolated modes is a well-described problem in statistics. Due to their dynamics, these algorithms struggle to cross the low probability barriers separating the modes and thus take a long time to move from one mode to another, even in low dimensions. Sequential Monte Carlo (SMC) has often been shown empirically to outperform MCMC on this task; its robust behaviour, however, relies strongly on good between-mode mixing of the Markov kernel used within the SMC algorithm (see [38]). Therefore, constructing an MCMC algorithm which enables fast exploration of the state space for complicated target functions is of great interest, especially as multimodal distributions are common in applications. Examples include, but are not limited to, problems in genetics [15, 31], astrophysics [20, 21, 49] and sensor network localisation [28].

Moreover, multimodality is an inherent issue of Bayesian mixture models (e.g. [30]), where it is caused by label-switching, if both the prior and the likelihood are symmetric with respect to all components, or more generally, it may be caused by model identifiability issues or model misspecification (see [18]).

Designing MCMC algorithms for sampling from a multimodal target distribution $\pi$ on a $d$ -dimensional space $\mathcal{X}$ needs to address three fundamental challenges:

(1) Identifying high probability regions where the modes are located;

(2) Moving between the modes by crossing low probability barriers;

(3) Sampling efficiently within the modes by accounting for inhomogeneity between them and their local geometry.

These challenges are fundamental in the sense that the high probability regions in (1) are typically exponentially small in dimension with respect to a reference measure on the space $\mathcal{X}$ . Furthermore, the design of basic reversible Markov chain kernels prevents them from visiting and crossing low probability barriers, hence from overcoming (2). Besides, accounting for inhomogeneity of the modes in (3) requires dividing the $d$ -dimensional space $\mathcal{X}$ into regions, which is an intractable task on its own that requires detailed a priori knowledge of $\pi$ .

Existing MCMC methodology for multimodal distributions usually identifies these challenges separately and a systematic way of addressing (1-3) is not available. Section 1.1 discusses main areas of the abundant literature on the topic in more detail.

In this paper we introduce a unifying framework for addressing (1-3) simultaneously via a novel design of Auxiliary Variable Adaptive MCMC. The framework allows us to split the sampling task into mode finding, between-region jump moves and local moves. In addition, it incorporates parameter adaptations for optimisation of the local and jump kernels, and identification of local regions. Unlike other state space augmentation techniques for multimodal distributions, where the auxiliary variables are introduced to improve mixing on the extended state space, auxiliary variables in our approach help to design an efficient adaptive scheme. We present the adaptive mechanics and main properties of the resulting algorithm in Section 1.2 after reviewing the literature.

1.1 Other approaches

Numerous MCMC methods have been proposed to address the issue of multimodality and we review briefly the main strands of the literature.

The most popular approach is based on tempering. The idea behind methods of this type relies on the observation that raising a multimodal distribution $\pi$ to the power $\beta\in(0,1)$ makes the modes "flatter" and, as a result, moves to low probability regions are more likely to be accepted. Hence, it is easier to explore the state space and find the regions where the modes of $\pi$ are located, addressing challenge (1) above, and also to move between these regions, addressing challenge (2). Examples of such methods, which incorporate $\pi^{\beta}$ by augmenting the state space, are parallel tempering proposed by [23] and its adaptive version [34], simulated tempering [33], tempered transitions [36] and the equi-energy sampler [31]. Despite their popularity, tempering-based approaches, as noticed by [56], tend to mix between modes exponentially slowly in dimension if the modes have different local covariance structures. Addressing this issue is an area of active research [50].
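The flattening effect of tempering can be illustrated numerically. The following is a minimal sketch, assuming a hypothetical one-dimensional target (an equal mixture of $N(-4,1)$ and $N(4,1)$, not an example from the paper); the function names are illustrative only. It compares the log ratio between the density at a mode and at the barrier between the modes, for $\pi$ and for $\pi^{\beta}$.

```python
import numpy as np

def log_pi(x):
    """Unnormalised log-density of an equal mixture of N(-4, 1) and N(4, 1).
    This target is a hypothetical illustration."""
    return np.logaddexp(-0.5 * (x + 4.0) ** 2, -0.5 * (x - 4.0) ** 2)

def log_barrier_ratio(beta):
    """log of (pi(mode) / pi(barrier))^beta, with mode at x = 4 and barrier at x = 0."""
    return beta * (log_pi(4.0) - log_pi(0.0))

for beta in (1.0, 0.5, 0.1):
    print(f"beta = {beta}: log ratio = {log_barrier_ratio(beta):.3f}")
```

Since the log ratio scales linearly in $\beta$, a small $\beta$ shrinks the gap between the mode and the barrier, which is why moves across the barrier are accepted more often at higher temperatures.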

Another strand of research is optimisation-based methods, which address challenge (1) by running preliminary optimisation searches in order to identify local maxima of the target distribution. They use this information in their between-mode proposal design to overcome challenge (2). A method called Smart Darting Monte Carlo, introduced in [2], relies on moves of two types: jumps between the modes, allowed only in non-overlapping $\epsilon$ -spheres around the local maxima identified earlier; and local moves (Random Walk Metropolis steps). This technique was generalised in [48] by allowing the jumping regions to overlap and have an arbitrary volume and shape. [1] went one step further by introducing updates of the jumping regions and parameters of the proposal distribution at regeneration times, hence the name of their method Regeneration Darting Monte Carlo (RDMC). This includes a possibility of adding new locations of the modes at regeneration times if they are detected by optimisation searches running on separate cores.

Another optimisation-based method, Wormhole Hamiltonian Monte Carlo, was introduced by [32] as an extension of Riemannian Manifold HMC (see [19] and [24]). The main underlying idea here is to construct a network of "wormholes" connecting the modes (neighbourhoods of straight line segments between the local maxima of $\pi$ ). The Riemannian metric used in the algorithm is a weighted mixture of a standard metric responsible for local HMC-based moves and another metric, influential in the vicinity of the wormholes, which shortens the distances between the modes. As before, updates of the parameters of the algorithm, including the network system, are allowed at regeneration times.

As we will see later, the algorithm we propose also falls into the category of optimisation-based methods.

The Wang-Landau algorithm [54, 53] and its adaptive version proposed by [12] belong to the exploratory strategies that aim to push the algorithm away from well-known regions and visit new ones, hence addressing challenge (1). The multi-domain sampling technique, proposed in [57], combines the idea of the Wang-Landau algorithm with the optimisation-based approach. This algorithm relies on partitioning the state space into domains of attraction of the modes. Local moves are Random Walk Metropolis steps proposed from a distribution depending on the domain of attraction of the current state. Jumps between the modes follow the independence sampler scheme, where the new states are proposed from a mixture of Gaussian distributions approximating $\pi$ .

Other common approaches include Metropolis-Hastings algorithms with a special design of the proposal distribution accounting for the necessity of moving between the modes [51, 49] and MultiNest algorithms based on nested sampling [20, 21].

1.2 Contribution

As mentioned before, the existing MCMC methods for multimodal distributions struggle to tackle challenges (1-3) simultaneously. In particular, challenge (3) typically fails to be addressed. The difficulty behind this challenge is that when modes have distinct shapes, different local proposal distributions will work well in regions associated with different modes. Note that the majority of the methods described above (all tempering-based techniques, the equi-energy sampler, the adaptive and non-adaptive Wang-Landau algorithm) only employ a single transition kernel, regardless of the region.

In applied problems the optimal parameters of the MCMC kernels are unknown, so recent approaches tune them while the algorithm runs. In the case of unimodal target distributions, Adaptive MCMC techniques prove to be useful [43, 5, 25]. The parameters, such as the covariance matrices, the scaling, the step size and the number of leapfrog steps of the Metropolis-Hastings, MALA, or HMC kernels involved, can be learned on the fly as the simulation progresses, based on the samples observed so far. The adaptive algorithms remain ergodic under suitable regularity conditions [3, 42, 22, 7, 14].

In the case of multimodal distributions, an analogous idea would be to apply these Adaptive MCMC methods separately to regions associated with different modes, to improve the within-mode mixing. Note that in order to sample from different proposal distributions in regions associated with different modes, one needs to keep track, at each step of the algorithm, of which region the current state belongs to. Besides, adapting parameters of the local proposal distributions on the fly must be based on samples that actually come from the corresponding region. The only known approach to assigning samples to regions is that of the multi-domain sampler [57]. However, in their setting keeping track of the regions requires running a gradient ascent procedure at each MCMC step, which imposes a high computational burden on the whole algorithm. Other optimisation-based approaches known in the literature (e.g. [48] and [1]) tend to ignore the necessity of assigning samples to regions and the possibility of moving between the modes via local steps.

An issue that we have not raised so far is that the adaptive optimisation-based methods presented above, such as those of [1] and [32], allow for adaptations only at regeneration times. Although this approach seems appealing from the point of view of the theory, since no further proofs of convergence are needed, it does not work well in practice in high dimensions. The reason for this is that regenerations happen rarely in large dimensions, which makes the adaptive scheme prohibitively inefficient. Besides, identifying regeneration times using the method of [35], as the authors of both algorithms propose, requires case-specific calculations, which precludes any generic implementation of an algorithm based on regenerations. Moreover, the regenerations identified in this way are orders of magnitude less frequent than the "true" ones, which are already rare.

We aim to remedy these shortcomings by proposing a framework for designing an adaptive algorithm on an augmented state space $\mathcal{X}\times\mathcal{I}$ , where $\mathcal{I}=\{1,\dots,N\}$ , and the auxiliary variable $i$ of the resulting sample $(x,i)$ encodes the corresponding region for $x$ . Local MCMC kernels update $x$ only, while jump kernels that move between the modes update $x$ and $i$ simultaneously. Furthermore, the design of the target distribution on the augmented state space prevents the algorithm from moving to a region associated with a different mode via local steps. In the sequel we make specific choices for the adaptive scheme, the local and jump kernels, as well as the burn-in routine used for setting up initial values of the parameters of the algorithm. However, the design is modular and different approaches can be incorporated in the framework. Besides, it allows for a multicore implementation of a large part of the algorithm.

This approach motivates introducing the Auxiliary Variable Adaptive MCMC class, where not only transition kernels are allowed to be modified on the fly, but also the augmented target distributions. It turns out that apart from our method, there is a wide range of algorithms that belong to this class, including adaptive parallel tempering or adaptive versions of pseudo-marginal MCMC. Thus our general ergodicity results, proved for the whole class under standard regularity conditions, can potentially be useful for analysing other methods.

The remainder of the paper is organised as follows. In Section 2 we present our algorithm, the Jumping Adaptive Multimodal Sampler (JAMS) and discuss its properties. In Section 3 we define the Auxiliary Variable Adaptive MCMC class and establish convergence in distribution and a Weak Law of Large Numbers for this class, under the uniform and the non-uniform scenario. We present theoretical results specialised to the case of our proposed algorithm in Section 4. Ergodicity is derived here from the analogues of the Containment and Diminishing Adaptation conditions introduced in [42], as opposed to identifying regeneration times, which allows us to circumvent the issues described above. The proofs of all our theorems along with some additional comments about the theoretical results are gathered in Supplementary Material A. Section 5 demonstrates the performance of our method on two synthetic and one real data example. Additional details of our numerical experiments are available in Supplementary Material B. We conclude with a summary of our results in Section 6.

2 Jumping Adaptive Multimodal Sampler (JAMS)

2.1 Main algorithm

Let $\pi$ be the multimodal target distribution of interest defined on $(\mathcal{X},\mathcal{B}(\mathcal{X}))$ . We introduce a collection of target distributions $\{\tilde{\pi}_{\gamma}\}_{\gamma\in\mathcal{Y}}$ on the augmented state space $\mathcal{X}\times\mathcal{I}$ , where $\mathcal{I}:=\{1,\ldots,N\}$ is the finite set of indices of the modes of $\pi$ . We defer the discussion about finding the modes to Section 2.2. Here $\gamma$ denotes the design parameter of the algorithm that may be adapted on the fly. For a fixed $\gamma\in\mathcal{Y}$ , $\tilde{\pi}_{\gamma}$ is defined as

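$$\tilde{\pi}_{\gamma}(x,i) := \pi(x) \frac{w_{\gamma,i} Q_{i}(\mu_{i},\Sigma_{\gamma,i})(x)}{\sum_{j\in\mathcal{I}} w_{\gamma,j} Q_{j}(\mu_{j},\Sigma_{\gamma,j})(x)},$$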

where $Q_{i}(\mu_{i},\Sigma_{\gamma,i})$ is an elliptical distribution (such as the normal or the multivariate $t$ distribution) centred at $\mu_{i}$ with covariance matrix $\Sigma_{\gamma,i}$ . We shall think of $\{\mu_{i}\}_{i\in\mathcal{I}}$ and $\{\Sigma_{\gamma,i}\}_{i\in\mathcal{I}}$ as locations and covariances of the modes of $\pi$ , respectively. Firstly, notice that constructing a Markov chain targeting $\tilde{\pi}_{\gamma}$ provides a natural way of identifying the mode at each step by recording the auxiliary variable $i$ . Besides, for each $B\in\mathcal{B}(\mathcal{X})$ and $\gamma\in\mathcal{Y}$ we have

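$$\tilde{\pi}_{\gamma}(B\times\mathcal{I}) = \int_{B} \sum_{i\in\mathcal{I}} \pi(x) \frac{w_{\gamma,i} Q_{i}(\mu_{i},\Sigma_{\gamma,i})(x)}{\sum_{j\in\mathcal{I}} w_{\gamma,j} Q_{j}(\mu_{j},\Sigma_{\gamma,j})(x)} \, dx = \int_{B} \pi(x)\cdot 1 \, dx = \pi(B).$$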

Hence, $\pi$ is the marginal distribution of $\tilde{\pi}_{\gamma}$ for each $\gamma\in\mathcal{Y}$ , so sampling from $\tilde{\pi}_{\gamma}$ can be used to generate samples from $\pi$ .
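The marginal property can be checked numerically at a single point. The following is a minimal sketch, assuming Gaussian $Q_{i}$ and a hypothetical two-mode target in two dimensions; the function names (`gauss_pdf`, `augmented_target`) and all numerical values are illustrative and not taken from the paper.

```python
import numpy as np

def gauss_pdf(x, mu, cov):
    """Density of N(mu, cov) at x; plays the role of Q_i (assumed Gaussian here)."""
    d = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(cov, diff)
    norm = np.sqrt((2.0 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(-0.5 * quad) / norm

def augmented_target(x, i, pi, w, mus, covs):
    """tilde-pi_gamma(x, i) = pi(x) * w_i Q_i(x) / sum_j w_j Q_j(x)."""
    comps = np.array([w[j] * gauss_pdf(x, mus[j], covs[j]) for j in range(len(w))])
    return pi(x) * comps[i] / comps.sum()

# Hypothetical bimodal target: equal mixture of two Gaussians in 2D.
mus = [np.array([-3.0, 0.0]), np.array([3.0, 0.0])]
covs = [np.eye(2), np.diag([2.0, 0.5])]
w = np.array([0.5, 0.5])
pi = lambda x: sum(w[j] * gauss_pdf(x, mus[j], covs[j]) for j in range(2))

x = np.array([-2.5, 0.3])
values = [augmented_target(x, i, pi, w, mus, covs) for i in range(2)]
print(values, sum(values), pi(x))  # the sum over i matches pi(x)
```

At a point close to $\mu_{1}$ almost all of the mass $\pi(x)$ is assigned to $i=1$, which is how the auxiliary variable records the mode. A practical implementation would work with log-densities to avoid underflow far from all modes.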

The sampling algorithm that we propose is summarised in Algorithm 1. The method relies on MCMC steps of two types, performed with probabilities $1-\epsilon$ and $\epsilon$ , respectively.

• Local move: Given the current state of the chain $(x,i)$ and the current parameter $\gamma$ , a local kernel $\tilde{P}_{\gamma,L,i}$ invariant with respect to $\tilde{\pi}_{\gamma}$ is used to update $x$ , while $i$ remains fixed, hence the mode is preserved.

• Jump move: Given the current state of the chain $(x,i)$ and the current parameter $\gamma$ , a new mode $k$ is proposed with probability $a_{\gamma,ik}$ . Then a new point $y$ is proposed using a distribution $R_{\gamma,J,ik}(x,\cdot)$ . The new pair is accepted or rejected using the standard Metropolis-Hastings formula, so that the jump kernel is invariant with respect to $\tilde{\pi}_{\gamma}$ .

Our choice for the local kernel is Random Walk Metropolis (RWM) with proposal $R_{\gamma,L,i}(x,\cdot)$ that follows either the normal or the $t$ distribution. This allows us to employ well-developed adaptation strategies for RWM and build on its stability properties to establish ergodicity of JAMS in Section 4. However, in practice any other MCMC kernel, such as MALA or HMC, may be used. The standard Metropolis-Hastings acceptance probability formula that admits $\tilde{\pi}_{\gamma}$ as the invariant distribution for the local move becomes:

[TABLE]

As for the jump moves, we consider two different methods of proposing a new point $y$ associated with mode $k$ . The first one, which we call independent proposal jumps, is to draw $y$ from an elliptical distribution centred at $\mu_{k}$ with covariance matrix $\Sigma_{\gamma,k}$ , independently from the current point $(x,i)$ . Since there is no dependence on $x$ and $i$ , in case of independent proposal jumps the proposal distribution to mode $k$ will be denoted by $R_{\gamma,J,k}(\cdot)$ . For independent proposal jumps, the acceptance probability is equal to

[TABLE]

Alternatively, given that the current state is $(x,i)$ , we can propose a “corresponding” point $y$ in mode $k$ such that

[TABLE]

The required equality is satisfied for

[TABLE]

where

[TABLE]

Herein this method will be called deterministic jumps. The acceptance probability is then given by

[TABLE]

Note that in both cases the design of the jump moves takes into account the shapes of the two modes involved, which helps achieve high acceptance rates and consequently improves the between-mode mixing.
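The deterministic jump can be illustrated with a short sketch. It assumes that the matching condition is equality of squared Mahalanobis distances, $(x-\mu_{i})^{T}\Sigma_{\gamma,i}^{-1}(x-\mu_{i})=(y-\mu_{k})^{T}\Sigma_{\gamma,k}^{-1}(y-\mu_{k})$, and that Cholesky factors are used as matrix square roots; both are assumptions made here for illustration, and the function names and numerical values are hypothetical.

```python
import numpy as np

def deterministic_jump(x, mu_i, sigma_i, mu_k, sigma_k):
    """Map x in mode i to a point y in mode k with equal Mahalanobis distance.

    Assumption: Cholesky factors serve as the matrix square roots.
    """
    chol_i = np.linalg.cholesky(sigma_i)   # sigma_i = chol_i @ chol_i.T
    chol_k = np.linalg.cholesky(sigma_k)
    z = np.linalg.solve(chol_i, x - mu_i)  # whitened offset from mu_i
    return mu_k + chol_k @ z               # re-coloured with the shape of mode k

def sq_mahalanobis(x, mu, sigma):
    """Squared Mahalanobis distance of x from mu under covariance sigma."""
    diff = x - mu
    return float(diff @ np.linalg.solve(sigma, diff))

# Illustrative mode locations and covariances.
mu_i, sigma_i = np.array([0.0, 0.0]), np.array([[2.0, 0.5], [0.5, 1.0]])
mu_k, sigma_k = np.array([10.0, -4.0]), np.array([[0.3, -0.1], [-0.1, 0.8]])

x = np.array([1.2, -0.7])
y = deterministic_jump(x, mu_i, sigma_i, mu_k, sigma_k)
```

Because the map is affine and invertible, jumping from $y$ back to mode $i$ with the same construction returns $x$, which is what makes the move reversible.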

As presented in Algorithm 1, the method involves learning the parameters on the fly. We design an adaptation scheme of three lists of parameters: covariance matrices (used both for adapting the target distribution $\tilde{\pi}_{\gamma}$ and the proposal distributions), weights $w_{\gamma,i}$ and probabilities $a_{\gamma,ik}$ of proposing mode $k$ in a jump from mode $i$ . Hence, formally $\mathcal{Y}$ refers to the product space of $\Sigma_{\gamma,i}$ , $w_{\gamma,i}$ and $a_{\gamma,ik}$ for $i,k\in\{1,\ldots,N\}$ restricted by $\sum_{j\in\mathcal{I}}w_{\gamma,j}=1$ and $\sum_{k\in\mathcal{I}}a_{\gamma,ik}=1$ for each $\gamma\in\mathcal{Y}$ and each $i\in\mathcal{I}$ . An adaptive scheme for $w_{\gamma,i}$ and $a_{\gamma,ik}$ that follows an intuitive heuristic is discussed briefly in Section 10 of Supplementary Material B.

Our method of adapting the covariance matrices $\Sigma_{\gamma,i}$ is presented in Algorithm 2. For every $i\in\mathcal{I}$ the matrix $\Sigma_{\gamma,i}$ is based on the empirical covariance matrix of the samples from the region associated with mode $i$ obtained so far. This is possible in our framework by keeping track of the auxiliary variable $i$ . Updates are performed every certain number of iterations (denoted by $AC_{2}$ in Algorithm 2). This method follows the classical Adaptive Metropolis methodology (cf. [27, 43]) applied separately to the covariance structure associated with each mode. For the local proposal distributions the covariance matrices are additionally scaled by the factor $2.38^{2}/d$ , which is commonly used as optimal for Adaptive Metropolis algorithms [40, 44]. Since representing a covariance matrix in high dimensions reliably typically requires a large number of samples, we do not apply this method straight away. Instead, we perform adaptive scaling, aiming to achieve the optimal acceptance rate (typically fixed at 0.234; see [40, 44]) for local moves, until the number of samples observed in a given mode exceeds a pre-specified constant (denoted by $AC_{1}$ in Algorithm 2).
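A minimal sketch of the covariance-based part of this adaptation is given below. It computes the empirical covariance of the samples assigned to one mode, adds a small multiple $\beta I_{d}$ of the identity, and scales the result by $2.38^{2}/d$ for the local proposal. The value of `beta`, the function name and the synthetic samples are assumptions; the scheduling through $AC_{1}$ and $AC_{2}$ and the initial adaptive scaling phase of Algorithm 2 are not reproduced.

```python
import numpy as np

def adapted_covariances(samples_in_mode, beta=1e-4):
    """Return (mode covariance, local proposal covariance) for one mode.

    beta is a hypothetical regularisation constant (beta * I_d is added).
    """
    samples = np.asarray(samples_in_mode, dtype=float)
    d = samples.shape[1]
    sigma = np.cov(samples, rowvar=False) + beta * np.eye(d)
    proposal = (2.38 ** 2 / d) * sigma  # scaling for the local RWM proposal
    return sigma, proposal

# Synthetic samples standing in for the states recorded with auxiliary variable i.
rng = np.random.default_rng(0)
samples = rng.normal(size=(500, 3)) * np.array([1.0, 2.0, 0.5])
sigma, proposal = adapted_covariances(samples)
```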

It is worth pointing out that this special construction of the target distribution $\tilde{\pi}_{\gamma}$ makes it unlikely for the algorithm to escape via local steps from the mode it is assigned to and settle in another one. Indeed, if a proposed point $y$ is very distant from the current mode $\mu_{i}$ and close to another mode $\mu_{k}$ , the acceptance probability becomes very small due to the expression $Q_{i}(\mu_{i},\Sigma_{\gamma,i})(y)$ in the numerator of (2.3) and $Q_{k}(\mu_{k},\Sigma_{\gamma,k})(y)$ in the denominator, as $Q_{i}(\mu_{i},\Sigma_{\gamma,i})(y)$ will typically be tiny in such a case and $Q_{k}(\mu_{k},\Sigma_{\gamma,k})(y)$ will be large. This allows for controlling from which mode a given state of the chain was sampled. This property is crucial for the efficiency of the algorithm, as it enables estimating the matrices $\Sigma_{\gamma,i}$ based on samples that are indeed close to mode $\mu_{i}$ , which in turn improves both the within-mode and the between-mode mixing. If we were working directly with $\pi$ , the corresponding acceptance probability would be given by $\min\left[1,\frac{\pi(y)}{\pi(x)}\right]$ and we would not have a natural mechanism for preventing the sampler from visiting different regions via local moves.
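The effect described above can be seen in a one-dimensional toy computation. The sketch assumes the augmented density has the form $\pi(x)\,w_{\gamma,i}Q_{i}(x)/\sum_{j}w_{\gamma,j}Q_{j}(x)$ and a symmetric local proposal, so that the acceptance probability reduces to a ratio of augmented densities; the target, parameter values and function names are illustrative assumptions.

```python
import math

# Hypothetical setting: two well-separated modes with equal weights.
MEANS, VARIANCES, WEIGHTS = [-3.0, 3.0], [1.0, 1.0], [0.5, 0.5]

def log_normal_pdf(x, mean, var):
    return -0.5 * (x - mean) ** 2 / var - 0.5 * math.log(2.0 * math.pi * var)

def log_sum_exp(values):
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))

def log_target(x):
    """Toy target pi: equal mixture of N(-3, 1) and N(3, 1)."""
    return log_sum_exp([math.log(0.5) + log_normal_pdf(x, m, v)
                        for m, v in zip(MEANS, VARIANCES)])

def log_augmented(x, i):
    """Assumed form: log pi(x) + log(w_i Q_i(x)) - log(sum_j w_j Q_j(x))."""
    terms = [math.log(w) + log_normal_pdf(x, m, v)
             for w, m, v in zip(WEIGHTS, MEANS, VARIANCES)]
    return log_target(x) + terms[i] - log_sum_exp(terms)

def local_accept_prob(x, y, i):
    """Acceptance probability of a symmetric local proposal within mode i."""
    return min(1.0, math.exp(log_augmented(y, i) - log_augmented(x, i)))

def plain_accept_prob(x, y):
    """Acceptance probability when targeting pi directly."""
    return min(1.0, math.exp(log_target(y) - log_target(x)))

x, y = -3.0, 3.0                     # current point at mode 0, proposal at mode 1
p_aug = local_accept_prob(x, y, 0)   # heavily penalised by Q_0(y) / Q_1(y)
p_plain = plain_accept_prob(x, y)    # no penalty: pi(y) equals pi(x) here
```

In this toy case the proposal to the other mode would always be accepted under $\pi$, while under the augmented target it is accepted with negligible probability.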

2.2 Burn-in algorithm

Note that Algorithm 1 takes mode locations $\{\mu_{1},\ldots,\mu_{N}\}$ and initial values of the matrices $\{\Sigma_{\gamma_{0},1},\ldots,\Sigma_{\gamma_{0},N}\}$ as input. Recall also that further improvements in the estimation of $\Sigma_{\gamma,i}$ are possible after some samples in mode $i$ have been observed (see Algorithm 2). Hence, the matrices $\{\Sigma_{\gamma_{0},1},\ldots,\Sigma_{\gamma_{0},N}\}$ need to represent well the shapes of the corresponding modes so that jumps to all the modes are accepted reasonably quickly.

We address the issues of finding the local maxima of $\pi$ , setting up the starting values of the covariance matrices and other aspects of the implementation of our method by introducing a burn-in algorithm, summarised by Algorithm 3.

The burn-in algorithm runs in advance, before the main MCMC sampler (Algorithm 1) is started, in order to provide initial values of the parameters. Since it needs to find the locations of the modes of $\pi$ , and this may be arbitrarily hard, one may prefer a version of this method in which the burn-in algorithm continues running in parallel to the main sampler on multiple cores. These cores communicate with the main sampler every certain number of iterations so that it can incorporate recently discovered modes into the augmented target distribution $\tilde{\pi}_{\gamma}$ . For clarity of presentation, we focus on the sequential setting where the burn-in routine runs before the main algorithm, and in Sections 3 and 4 we develop ergodic theory that covers this case. However, the ergodic theory is immediately applicable to the version where the burn-in and the main algorithm run in parallel, as explained in Remark 4.4.

We sketch the different stages of the burn-in routine below, in Sections 2.2.1 – 2.2.4; additional details are given in Supplementary Material B. The flowchart illustrating how the full algorithm works is shown in Figure 1.

2.2.1 Starting points for the optimisation procedure

We sample the starting points for optimisation searches uniformly on a compact set which is a product of intervals provided by the user

[TABLE]

where $d$ is the dimension of the state space $\mathcal{X}$ . Note that if the domain of attraction of each mode overlaps with $[L_{1},U_{1}]\times\ldots\times[L_{d},U_{d}]$ , then asymptotically all modes will be found, as we will have at least one starting point in each domain.

When dealing with Bayesian models, one can alternatively sample the starting points from the prior distribution.
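Drawing the starting points uniformly on the user-supplied box can be sketched as follows. The bounds, the number of points and the function name are illustrative assumptions.

```python
import numpy as np

def sample_starting_points(lower, upper, n_points, seed=None):
    """Draw n_points uniformly on the box [L_1, U_1] x ... x [L_d, U_d]."""
    rng = np.random.default_rng(seed)
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    # Each row is one starting point for an optimisation run.
    return rng.uniform(lower, upper, size=(n_points, lower.size))

# Hypothetical user-provided intervals in dimension d = 3.
lower, upper = [-10.0, 0.0, -1.0], [10.0, 5.0, 1.0]
starts = sample_starting_points(lower, upper, n_points=1500, seed=1)
```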

2.2.2 Mode finding via an optimisation procedure

The BFGS optimisation algorithm [37] is initiated from every starting point. The BFGS method provides the optimum point and the Hessian matrix at this point, which is particularly useful in the next step of mode merging.

For numerical reasons, instead of working directly with $\pi$ , we typically use the BFGS algorithm to find the local minima of $-\log(\pi)$ .

2.2.3 Mode merging

Starting the optimisation procedure from different points belonging to the same basin of attraction will take us to points which are close to the true local maxima, but numerically different, an issue that seems to be ignored in optimisation-based MCMC literature.

We deal with this in a heuristic way (lines 5-16 of Algorithm 3) by classifying two vectors $m_{i}$ and $m_{j}$ as corresponding to the same mode if the squared Mahalanobis distance between them is smaller than some pre-specified value $q$ . If we let $H_{i}$ and $H_{j}$ denote the Hessian matrices of $-\log(\pi)$ at $m_{i}$ and $m_{j}$ , respectively, the above Mahalanobis distance is calculated for $H_{i}^{-1}$ and $H_{j}^{-1}$ (for symmetry, we average over these two values). This method is scale invariant as the Hessian captures the local shape and scale.
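The pairwise test behind this heuristic might be sketched as below. Since the Mahalanobis distance is taken with respect to the covariance $H^{-1}$, its square is the quadratic form in $H$ itself. The function name, the threshold value and the example Hessians are assumptions; the full merging loop of Algorithm 3 is not reproduced.

```python
import numpy as np

def same_mode(m_i, m_j, hess_i, hess_j, q):
    """Decide whether two optimiser end points represent the same mode."""
    diff = np.asarray(m_i, dtype=float) - np.asarray(m_j, dtype=float)
    # With covariance H^{-1}, the squared Mahalanobis distance is diff^T H diff.
    dist_i = diff @ hess_i @ diff
    dist_j = diff @ hess_j @ diff
    # Average the two values for symmetry and compare with the threshold q.
    return bool(0.5 * (dist_i + dist_j) < q)

# Hypothetical Hessian of -log(pi), taken equal at both end points.
hess = 4.0 * np.eye(2)
close = same_mode([0.0, 0.0], [1e-4, -1e-4], hess, hess, q=1.0)
far = same_mode([0.0, 0.0], [5.0, 5.0], hess, hess, q=1.0)
```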

2.2.4 Initial covariance matrix estimation

In order to find initial covariance matrix estimates $\Sigma_{\gamma_{0},1},...,\Sigma_{\gamma_{0},N}$ that accurately reflect the geometry of different modes, we employ the augmented target machinery of Algorithm 1 in the following way. We run Algorithm 1 without jumps, i.e. with $\epsilon=0$ , in parallel, starting from each of the modes $\mu_{1},\ldots,\mu_{N}$ . This implies that we run $N$ chains and each of them adapts only the matrix $\Sigma_{i}$ corresponding to the mode $\mu_{i}$ which was its starting point. We make a number of rounds (denoted by $K$ ) of this procedure and after each round we update the target distribution $\tilde{\pi}$ by exchanging the knowledge about the adapted covariance matrices between cores. The final covariance matrices passed to the main MCMC sampler are calculated based on the samples collected in all rounds.

The reason why we exchange information between rounds, despite the additional cost of communication between cores, is that we want the sampler adapting $\Sigma_{k,i}$ to know where the regions associated with other modes are so that it is less likely to visit those regions and contaminate the estimate. Essentially the initial covariance estimation revisits the problem of collecting samples only from the corresponding regions, discussed in the previous parts of this paper.

The initial value of the matrix corresponding to mode $i$ is the inverse of the Hessian evaluated at $\mu_{i}$ (see line 17 of Algorithm 3). The values of the other parameters of the algorithm, such as $\alpha$ , $\beta$ and $AC_{2}$ , are set to be the same as in the main algorithm. The values of $w_{\gamma,i}$ and $a_{\gamma,ik}$ are not updated during those runs. Besides, $w_{\gamma,i}$ are set to $1/N$ .

The intuition for the choice of the number of rounds $K$ of the above procedure is to stop the burn-in algorithm when running an additional round does not yield much improvement in the accuracy of the estimation of $\Sigma_{1},\ldots,\Sigma_{N}$ . We use the inhomogeneity factor (see [43] and [47]), a well-established measure of covariance estimation accuracy in the MCMC context, to choose $K$ automatically. We quantify the dissimilarity between $\Sigma_{k-1,i}$ and $\Sigma_{k,i}$ for $i\in\mathcal{I}$ by their inhomogeneity factor, denoted by $b_{k,i}$ , and stop the covariance estimation when this factor drops below a pre-specified threshold $b_{\text{acc}}$ for all $i\in\mathcal{I}$ . Details are given in Supplementary Material B.

2.3 Further comments

From the point of view of the fundamental challenges (1-3) discussed in Section 1, JAMS deals with (1) through its mode finding stage. Challenges (2) and (3) are addressed via jumps and local moves, respectively. As explained in Section 2.1, the auxiliary variable approach facilitates moving efficiently between modes as well as accounting for inhomogeneity between them by using different local proposal distributions in different regions.

It is important to point out that the auxiliary variable approach presented above should be thought of as a flexible framework rather than one specific method. The BFGS algorithm used for mode finding could be replaced with another optimisation procedure and similarly, local moves could be performed using a different MCMC sampler, e.g. HMC. One could also consider another scheme for updating the parameters, for example, combining adaptive scaling with covariance matrix estimation (see [52]).

3 Auxiliary Variable Adaptive MCMC

We introduce a general class of Auxiliary Variable Adaptive MCMC algorithms, as follows.

Recall that $\pi(\cdot)$ is a fixed target probability density on $(\mathcal{X},\mathcal{B}(\mathcal{X}))$ . For an auxiliary pair $(\Phi,\mathcal{B}(\Phi))$ , define $\tilde{\mathcal{X}}:=\mathcal{X}\times\Phi,$ and for an index set $\mathcal{Y}$ , consider a family of probability measures $\{\tilde{\pi}_{\gamma}(\cdot)\}_{\gamma\in\mathcal{Y}}$ on $(\tilde{\mathcal{X}},\mathcal{B}(\tilde{\mathcal{X}})),$ such that

[TABLE]

Let $\{\tilde{P}_{\gamma}\}_{\gamma\in\mathcal{Y}}$ be a collection of Markov chain transition kernels on $(\tilde{\mathcal{X}},\mathcal{B}(\tilde{\mathcal{X}})),$ such that each $\tilde{P}_{\gamma}$ has $\tilde{\pi}_{\gamma}$ as its invariant distribution and is Harris ergodic, i.e. for all $\gamma\in\mathcal{Y},$

[TABLE]

Here $\|\cdot-\cdot\|_{TV}$ is the usual total variation distance, defined for two probability measures $\mu$ and $\nu$ on a $\sigma-$ algebra of sets $\mathcal{G}$ as $\|\mu(\cdot)-\nu(\cdot)\|_{TV}=\sup_{B\in\mathcal{G}}|\mu(B)-\nu(B)|.$
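For distributions on a finite set, the supremum in this definition equals half the $\ell_{1}$ distance between the probability vectors. The sketch below checks this on a small example by brute force over all events; the two distributions and the function names are purely illustrative.

```python
from itertools import combinations

def total_variation(mu, nu):
    """TV distance between two distributions on a common finite set."""
    support = set(mu) | set(nu)
    return 0.5 * sum(abs(mu.get(s, 0.0) - nu.get(s, 0.0)) for s in support)

def sup_over_events(mu, nu):
    """Brute-force evaluation of sup_B |mu(B) - nu(B)| over all subsets B."""
    support = sorted(set(mu) | set(nu))
    best = 0.0
    for size in range(len(support) + 1):
        for event in combinations(support, size):
            gap = abs(sum(mu.get(s, 0.0) for s in event)
                      - sum(nu.get(s, 0.0) for s in event))
            best = max(best, gap)
    return best

mu = {"a": 0.5, "b": 0.5}
nu = {"a": 0.25, "b": 0.25, "c": 0.5}
tv = total_variation(mu, nu)
tv_brute = sup_over_events(mu, nu)
```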

To define the dynamics of the Auxiliary Variable Adaptive MCMC sequence $\{(\tilde{X}_{n},\Gamma_{n})\}_{n=0}^{\infty},$ where $\Gamma$ represents a random variable taking values in $(\mathcal{Y},\mathcal{B}(\mathcal{Y}))$ , denote its filtration as

[TABLE]

Now, the conditional distribution of $\Gamma_{n+1}$ given $\mathcal{G}_{n}$ will be specified by the adaptive algorithm being used, such as Algorithm 1, while the dynamics of the $\tilde{X}$ coordinate follows

[TABLE]

Note that depending on the adaptive update rule for $\Gamma_{n}$ , the sequence $\{(\tilde{X}_{n},\Gamma_{n})\}_{n=0}^{\infty},$ defined above is not necessarily a Markov chain. By $\tilde{A}_{n}^{\mathcal{G}_{t}}(\cdot)$ denote the distribution of the $\tilde{\mathcal{X}}$ -marginal of $\{(\tilde{X}_{n},\Gamma_{n})\}_{n=0}^{\infty}$ at time $n$ , conditionally on the history up to time $t,$ i.e.

[TABLE]

for $\tilde{B}\in\mathcal{B}(\tilde{\mathcal{X}})$ , and in particular for $t=0$ , we shall write

[TABLE]

By $A_{n}^{\mathcal{G}_{t}}(\cdot)$ and $A_{n}^{(\tilde{x},\gamma)}(\cdot)$ denote the further marginalisation of $\tilde{A}_{n}^{\mathcal{G}_{t}}(\cdot)$ and $\tilde{A}_{n}^{(\tilde{x},\gamma)}(\cdot)$ , respectively, onto the space of interest $\mathcal{X},$ where the target measure $\pi(\cdot)$ lives, namely

[TABLE]

Finally, in order to define ergodicity of the Auxiliary Variable Adaptive MCMC, let

[TABLE]

Definition 3.1.

We say that the Auxiliary Variable Adaptive MCMC algorithm generating $\{(\tilde{X}_{n},\Gamma_{n})\}_{n=0}^{\infty},$ is ergodic, if

[TABLE]

As we shall see in Section 4, JAMS belongs to the class defined above. There exist other algorithms falling into this category, therefore the results presented in this paper, in particular Theorems 3.2, 3.3 and 3.4, may be useful for analysing their ergodicity. Examples of other algorithms in this class include adaptive parallel tempering [34] and adaptive versions of pseudo-marginal algorithms [4, 6]. A more detailed discussion on this may be found in Supplementary Material A.

3.1 Theoretical results for the class

The two main approaches to verifying ergodicity of Adaptive MCMC are based on martingale approximations [3, 22, 7] or coupling [42]. Here we extend the latter to the Auxiliary Variable Adaptive MCMC class by constructing explicit couplings. In particular, ergodicity of this class of algorithms will be verified for the uniform and the non-uniform case, providing results analogous to Theorems 1 and 2 of [42].

For the uniform case analogues of the usual conditions of Simultaneous Uniform Ergodicity and Diminishing Adaptation will be required.

Theorem 3.2 (Ergodicity – uniform case).

Consider an Auxiliary Variable Adaptive MCMC algorithm on a state space $\tilde{\mathcal{X}}=\mathcal{X}\times\Phi$ , following dynamics (3.3) with a family of transition kernels $\{\tilde{P}_{\gamma}\}_{\gamma\in\mathcal{Y}}$ satisfying (3.1) and (3.2). If conditions (a) and (b) below are satisfied, then the algorithm is ergodic in the sense of Definition 3.1.

(a) (Simultaneous Uniform Ergodicity). For all $\varepsilon>0,$ there exists $N=N(\varepsilon)\in\mathbb{N}$ such that

[TABLE]

(b) (Diminishing Adaptation). The random variable

[TABLE]

converges to $0$ in probability.

In fact assumption (a) of Theorem 3.2 can be relaxed. To this end, define the $\varepsilon-$ convergence time as

[TABLE]

It is enough that the random variable $M_{\varepsilon}(\tilde{X}_{n},\Gamma_{n})$ is bounded in probability. Precisely, the following ergodicity result holds for the non-uniform case.

Theorem 3.3 (Ergodicity – non-uniform case).

Consider an Auxiliary Variable Adaptive MCMC algorithm under the assumptions of Theorem 3.2, with condition (a) replaced by the following:

(a) (Containment). For all $\varepsilon>0$ and all $\tilde{\delta}>0$ , there exists $N=N(\varepsilon,\tilde{\delta})$ such that

[TABLE]

for all $n\in\mathbb{N}$ .

Then the algorithm is ergodic in the sense of Definition 3.1.

We establish the Weak Law of Large Numbers for the class of Auxiliary Variable Adaptive MCMC algorithms for both the uniform and the non-uniform case. By letting $\Phi$ be a singleton, our result applies to the standard Adaptive MCMC setting and extends the result of [42] where the WLLN was provided for the uniform case only.

Theorem 3.4 (WLLN).

Consider an Auxiliary Variable Adaptive MCMC algorithm, as in Theorem 3.3, together with assumptions a) and b) of this theorem. Let $g:\mathcal{X}\to\mathbb{R}$ be a bounded measurable function. Then

[TABLE]

in probability as $n\to\infty$ .

While Containment is a weaker condition than Simultaneous Uniform Ergodicity, it is less tractable, and in the standard Adaptive MCMC setting drift conditions are typically used to verify it [42, 10]. Lemma 3.5 helps verify Containment via geometric drift conditions in the Auxiliary Variable framework. The lemma additionally assumes that the adaptation happens on a compact set only (cf. condition e) below). Adapting on a compact set has been theoretically investigated in [16] and used in certain adaptive Gibbs sampler contexts in [13]. We shall use Lemma 3.5 as the main tool for establishing ergodic theorems for JAMS.

Lemma 3.5.

Assume that the following conditions are satisfied.

a) For each $\gamma\in\mathcal{Y}$ , $\|\tilde{P}_{\gamma}^{k}(\tilde{x},\cdot)-\tilde{\pi}_{\gamma}(\cdot)\|_{TV}\to 0$ as $k\to\infty$ .

b) There exist $\lambda<1$ , $b<\infty$ and a collection of functions $V_{\tilde{\pi}_{\gamma}}:\tilde{\mathcal{X}}\to[1,\infty)$ for $\gamma\in\mathcal{Y}$ , such that the following simultaneous drift condition is satisfied:

[TABLE]

where for $\tilde{x}\in\tilde{\mathcal{X}}$

[TABLE]

Moreover, $V_{\tilde{\pi}_{\gamma}}(\tilde{x})$ is bounded on compact sets as a function of $(\tilde{x},\gamma)$ .

c) There exist $\delta>0$ , $v>2n_{0}b/(1-\lambda^{n_{0}})$ and a positive integer $n_{0}$ , such that the following minorisation condition holds: for each $\gamma\in\mathcal{Y}$ we can find a probability measure $\nu_{\gamma}$ on $\tilde{\mathcal{X}}$ satisfying

[TABLE]

d) $\mathcal{Y}$ is compact in some topology.

e) There exists a compact set $A$ such that if $X_{n}\notin A$ , then $\Gamma_{n+1}=\Gamma_{n}$ .

f) $\mathbb{E}V_{\tilde{\pi}_{\Gamma_{0}}}(\tilde{X}_{0})<\infty$ .

Then the Containment condition (3.5) holds.

3.2 Adaptive Increasingly Rarely version of the class

Adaptive Increasingly Rarely (AIR) MCMC algorithms were introduced in [14] as an alternative to classical Adaptive MCMC methods. While they share the same self-tuning properties, their ergodic properties are mathematically easier to analyse and their computational cost of adaptation is smaller.

The key idea behind the AIR algorithms is to allow the updates of parameters only at pre-specified times $N_{j}$ with an increasing sequence of lags $n_{k}$ between them. $N_{j}$ is therefore defined as

[TABLE]

For the sequence $\{n_{k}\}_{k>1}$ [14] proposed using any scheme that satisfies $c_{2}k^{\kappa}\geq n_{k}\geq c_{1}k^{\kappa}$ for some positive $c_{1}$ , $c_{2}$ and $\kappa$ . In order to ensure that the random variable

[TABLE]

converges to $0$ in probability (which is equivalent to Diminishing Adaptation), the following modification is introduced. The updates happen at times $N^{*}_{j}$ , where

[TABLE]

and

[TABLE]

Observe that $D_{n}$ is only positive if $n+1\in\{N^{*}_{j}\}_{j\geq 1}$ . Besides, if $n+1>N_{k}$ then $\mathbb{P}(D_{n}>0)\leq\frac{1}{\lfloor{k^{\kappa^{*}}}\rfloor},$ so in particular $D_{n}$ goes to 0 as $n$ tends to infinity.

We apply the same idea to Auxiliary Variable Adaptive MCMC algorithms, by adapting the parameters of the transition kernels and the target distributions only at times $N_{j}^{*}$ , as described above, so that Diminishing Adaptation is automatically satisfied for these algorithms. In Section 4 we study in detail an AIR version of JAMS (see Algorithm 4).
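The deterministic part of the AIR schedule can be sketched in a few lines. The sketch assumes that $N_{j}$ is the cumulative sum of the lags, $N_{j}=\sum_{k=1}^{j}n_{k}$, and uses $n_{k}=\lceil ck^{\kappa}\rceil$, which satisfies the two-sided bound with $c_{1}=c$ and $c_{2}=c+1$. The constants and the function name are hypothetical, and the randomised modification leading to $N^{*}_{j}$ is not reproduced.

```python
import math

def air_update_times(n_updates, c=10.0, kappa=1.5):
    """Return N_1, ..., N_J, assuming N_j = n_1 + ... + n_j.

    The lags n_k = ceil(c * k**kappa) are one admissible choice; c and kappa
    are illustrative values.
    """
    times, total = [], 0
    for k in range(1, n_updates + 1):
        total += math.ceil(c * k ** kappa)
        times.append(total)
    return times

times = air_update_times(5)
# Gaps between consecutive adaptation times grow with the update index.
lags = [b - a for a, b in zip([0] + times, times)]
```

With such a schedule the fraction of iterations at which adaptation can occur tends to zero as the chain runs.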

4 Ergodicity of the Jumping Adaptive Multimodal Sampler

We will use our results from Section 3 to prove ergodicity of JAMS. Firstly observe that this algorithm indeed belongs to the Auxiliary Variable Adaptive MCMC class. To see this, recall that the method utilises a collection of distributions $\{\tilde{\pi}_{\gamma}(\cdot)\}_{\gamma\in\mathcal{Y}}$ on $\mathcal{\tilde{X}}:=\mathcal{X}\times\mathcal{I}$ , which corresponds to the notation introduced for the Auxiliary Variable Adaptive MCMC class, with $\Phi=\mathcal{I}$ . Indeed, for each $B\in\mathcal{B}(\mathcal{X})$ and $\gamma\in\mathcal{Y}$ we have $\tilde{\pi}_{\gamma}(B\times\mathcal{I})=\pi(B)$ (see (2.2)).

Let $\tilde{P}_{\gamma,L,i}$ denote the kernel associated with the local move around mode $i$ and analogously let $\tilde{P}_{\gamma,J,i}$ be the kernel of the jump to mode $i$ . The full transition kernel $\tilde{P}_{\gamma}$ is thus defined as

[TABLE]

It is easily checked that the acceptance probabilities (2.3) and (2.4) or (2.6) ensure that detailed balance holds for the above kernels $\tilde{P}_{\gamma}$ , admitting $\tilde{\pi}_{\gamma}$ as their invariant distributions. They also satisfy the Harris ergodicity condition. The above discussion shows that the algorithm indeed falls into the category of the Auxiliary Variable Adaptive MCMC, so Theorems 3.2 and 3.3 can be used to establish its ergodicity.

The main results of this section are stated in Theorems 4.1 and 4.2, which establish convergence of our algorithm to the correct limiting distribution under the uniform and the non-uniform scenario, respectively.

4.1 Overview of the assumptions

In order to prove ergodic results for JAMS, we consider Algorithm 4, which is a slightly modified version of Algorithm 1. While being easier to analyse mathematically, it inherits the main properties of Algorithm 1. The modifications are two-fold: firstly, we update the parameters only if the most recent sample $(x_{n},i_{n})$ is such that $x_{n}$ belongs to some fixed compact set $A_{i_{n}}$ and secondly, we adapt them “increasingly rarely” (see Section 3.2). If jumps are proposed deterministically, we additionally assume that they are allowed only on “jumping regions” $JR_{\gamma,i}$ defined as

[TABLE]

for $i\in\mathcal{I}$ and some $R>0$ . Note that equation (2.5) ensures that if $x$ belongs to $JR_{\gamma,i}$ and we propose a deterministic jump from $(x,i)$ to $(y,k)$ , then $y$ must be in $JR_{\gamma,k}$ . Thus the detailed balance condition is satisfied. The reasons for these modifications will become clearer when we present the proofs of the ergodic theorems.

Even though the theory presented below works for any choice of the compact sets $A_{1},\ldots,A_{N}$ , we propose to define these sets in the following way. Recall that the burn-in routine (Algorithm 3) provides the list of mode locations $\{\mu_{1},...,\mu_{N}\}$ and initial estimates of covariance matrices $\{\Sigma_{\gamma_{0},1},...,\Sigma_{\gamma_{0},N}\}$ . By $\lambda_{i}$ denote the maximum eigenvalue of $\Sigma_{\gamma_{0},i}$ and let $\lambda_{M}=\max\{\lambda_{1},...,\lambda_{N}\}$ . Let $C$ be the convex hull of $\{\mu_{1},...,\mu_{N}\}$ and $D_{C}$ its diameter. Define

[TABLE]

where $d$ is the dimension of $\mathcal{X}.$

Observe that Algorithm 4 is constructed in such a way that all the covariance matrices $\Sigma_{\gamma,i}$ are based on samples belonging to a compact set $A_{i}$ . This implies that these matrices are bounded from above. Since we keep adding $\beta I_{d}$ to the covariance matrix at each step, they are also bounded from below. Recall also that the covariance matrices for the local proposal distributions are scaled by a fixed factor $2.38^{2}/d$ . Consequently, there exist positive constants $m$ and $M$ for which

[TABLE]

As for the adaptive scheme for $w_{\gamma,i}$ and $a_{\gamma,ik}$ , we only require that these values be bounded away from 0, i.e. there exist $\epsilon_{a}$ and $\epsilon_{w}$ such that

[TABLE]

Therefore, the parameter space $\mathcal{Y}$ may be considered as compact.

4.2 Theoretical results for JAMS

We begin with the case when the jump moves are proposed independently from distributions $R_{\gamma,J,i}$ with heavier tails than the tails of the target distribution $\pi$ for all $i\in\mathcal{I}$ and $\gamma\in\mathcal{Y}$ , i.e.

[TABLE]

We prove that under this assumption Simultaneous Uniform Ergodicity is satisfied for Algorithm 4 and consequently, by Theorem 3.2, the algorithm is ergodic.

Theorem 4.1.

Consider Algorithm 4 and assume that the relationship between the target distribution $\pi$ and the proposal distributions $R_{\gamma,J,i}$ satisfies (4.4). Then Algorithm 4 is ergodic.

When the tails of the distribution $\pi$ are heavier than the tails of the proposal distributions $R_{\gamma,J,i}$ , or when the jumps follow the deterministic scheme, Simultaneous Uniform Ergodicity does not hold. However, it turns out that under some additional regularity conditions Algorithm 4 is still ergodic, as it satisfies the assumptions of Lemma 3.5.

Theorem 4.2.

Consider Algorithm 4 and assume that the following conditions are satisfied.

a) For each $i\in\mathcal{I},\gamma\in\mathcal{Y}$ the proposal distribution for local moves $R_{\gamma,L,i}$ follows an elliptical distribution parametrised by $\Sigma_{\gamma,i}$ . Furthermore, the family of distributions $R_{\gamma,L,i}(\textbf{0},\cdot)$ , $\gamma\in\mathcal{Y}$ , has uniformly bounded probability density functions, and for any compact set $C\subset\mathcal{X}$ we have

[TABLE]

b) Let $r_{\gamma,i}(x)$ be the rejection set for local moves, i.e. $r_{\gamma,i}(x):=\{y\in\mathcal{X}:\tilde{\pi}_{\gamma}(y,i)<\tilde{\pi}_{\gamma}(x,i)\}$ . We assume that

[TABLE]

c) The target distribution $\pi$ is super-exponential, i.e. it is positive with continuous first derivatives and satisfies

[TABLE]

d) Every $Q_{i},$ $i\in\mathcal{I},$ is an elliptical distribution parametrised by $\Sigma_{\gamma,i}$ and positive on $\mathcal{X}$ ; additionally, the following condition is satisfied:

[TABLE]

Additionally, one of the following two conditions for jump moves holds.

e1) Jump moves follow the procedure for deterministic jumps, as described in Section 2.1.

e2) Jump moves follow the independent proposal procedure, as described in Section 2.1. The proposal distributions for jumps have uniformly bounded probability density functions and satisfy

[TABLE]

where $B\left(\mu_{i},r\right)$ is a ball of radius $r$ and centre $\mu_{i}$ . Moreover, the relationship between the target distribution $\pi$ and the proposal distributions $R_{\gamma,J,i}$ is given by

[TABLE]

Then Algorithm 4 is ergodic.

When proving the above result, we will refer to the proof of Theorem 4.1 of [29]. Assumptions b) and c) are analogues of the regularity conditions considered in [29]. Condition a) holds automatically for our algorithm if we assume that the proposal distributions for local moves follow either the normal or the $t$ distribution (see Section 2.1) and when (4.2) holds. Condition (4.9) is satisfied if the proposal distributions for jumps follow, for example, the normal distribution. Condition d) can be easily verified if every $Q_{i}$ , $i\in\mathcal{I}$ follows the $t$ distribution with the same number of degrees of freedom.

The result stated below establishes the Weak Law of Large Numbers for our algorithm.

Theorem 4.3.

Consider Algorithm 4 and assume that conditions of either Theorem 4.1 or Theorem 4.2 are satisfied. Then the Weak Law of Large Numbers holds for all bounded and measurable functions.

Remark 4.4.

Note that Theorems 4.1, 4.2 and 4.3 are based on an assumption that the list of modes is fixed. Let us now consider Algorithm 4 in the version with mode finding running in parallel to the main MCMC sampler, as shown in Figure 1. Assume additionally that

[TABLE]

where $\tau$ is the time of adding the last mode. In this case Theorems 4.1, 4.2 and 4.3 still hold. Indeed, as the parallel burn-in algorithm runs independently of JAMS, we can rephrase all the probabilistic limiting statements in the proofs on the set $C_{t}:=\{\tau<t\}$ and then let $t\to\infty.$

The following lemmas are useful in verifying assumption b) of Theorem 4.2.

Lemma 4.5.

Let $r(x):=\{y\in\mathcal{X}:\pi(y)<\pi(x)\}$ and $a(x):=\{y\in\mathcal{X}:\pi(y)\geq\pi(x)\}$ . Consider Algorithm 4 together with conditions a), c) and d) of Theorem 4.2. Assume additionally that for some $\gamma^{*}\in\mathcal{Y}$

[TABLE]

Then condition (4.6) holds.

Lemma 4.6.

Consider Algorithm 4 together with conditions a), c) and d) of Theorem 4.2. Assume additionally that the target distribution $\pi$ satisfies

[TABLE]

Then condition (4.6) holds.

The following corollary shows that Algorithm 4 in a standard setting is successful at targeting mixtures of normal distributions.

Corollary 4.7.

Let the target distribution $\pi$ be given by

[TABLE]

where $w_{i}>0$ and $p_{i}$ is a polynomial of order $\geq 2$ for each $i=1,\ldots,n$ . If additionally $Q_{i}$ for $i\in\mathcal{I}$ follows the multivariate $t$ distribution with the same number of degrees of freedom, and $R_{\gamma,L,i}(\textbf{0},\cdot)$ follows the normal distribution, the assumptions of Lemma 4.6 are satisfied.

5 Examples

In this section we present empirical results for our method (Algorithm 1 preceded by Algorithm 3). We test its performance on three examples – the first one is a mixture of two Gaussians motivated by [56]; the second one is a mixture of fifteen multivariate $t$ distributions and five banana-shaped ones; the third one is a Bayesian model for sensor network localisation. Our implementation admits three versions, varying in the way the jumps between modes are performed. In particular, we consider here the deterministic jump and two independent proposal jumps, with Gaussian and $t$ -distributed proposals.

Additionally, we compare the performance of our algorithm against adaptive parallel tempering [34], which was chosen here as it is the refined version of the most commonly used MCMC method for multimodal distributions (parallel tempering). What is more, this algorithm has a generic implementation, where the user only needs to provide the target density function. In order to make a comparison between the efficiency of these algorithms, among other things, we analyse the Root Mean Square Error (RMSE) divided by the square root of the dimension of the state space, given a computational budget. We measure the computational cost by the number of evaluations of the target distribution (and its gradient, if applicable), as this is typically the dominating factor in real data examples. Herein we define RMSE as the Euclidean distance between the true $d$ -dimensional expected value (if known) and its empirical estimate based on MCMC samples.
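The error measure described above can be written down directly. The sketch below computes the Euclidean distance between the true mean and the sample mean and divides it by $\sqrt{d}$; the function name and the tiny data set are illustrative only.

```python
import math

def scaled_rmse(true_mean, samples):
    """Euclidean distance between true and estimated mean, divided by sqrt(d)."""
    d = len(true_mean)
    n = len(samples)
    # Empirical estimate of the expected value, coordinate by coordinate.
    estimate = [sum(s[j] for s in samples) / n for j in range(d)]
    distance = math.sqrt(sum((e - t) ** 2 for e, t in zip(estimate, true_mean)))
    return distance / math.sqrt(d)

# Toy stand-in for MCMC output in dimension d = 2.
samples = [[1.0, 2.0], [3.0, 2.0], [2.0, 5.0]]
value = scaled_rmse([1.0, 2.0], samples)
```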

In order to depict the variability in the results delivered by both methods, each simulation was repeated 20 times. For exact settings of the experiments, as well as some additional results, we refer the reader to Supplementary Material B.

5.1 Mixture of Gaussians

The following target density was studied by [56]:

[TABLE]

for $\sigma_{1}\neq\sigma_{2}$ . In particular, they showed that the parallel tempering algorithm will tend to stay in the wider mode and, if started in the wider mode, may take a long time before getting to the narrower one. We looked at the results for the target distribution (5.1) in several different dimensions $d$ ranging between 10 and 200, for $\sigma_{1}^{2}=0.5\sqrt{d/100}$ and $\sigma_{2}^{2}=\sqrt{d/100}$ . The results for our method shown below are based on 500,000 iterations of the main algorithm, preceded by the burn-in algorithm including 1500 BFGS runs. The length of the covariance matrix estimation was chosen automatically using the rule described in Supplementary Material B and varied from 3000 iterations (for $d=10$ ) to 1,023,000 iterations (for $d=200$ ) per mode. For dimensions $d=10$ and $d=20$ we also ran the adaptive parallel tempering (APT) algorithm, with 700,000 iterations and 5 temperatures. Overall this requires 3,500,000 evaluations of the target density that cannot be performed in parallel, despite the name of the method, as communication between chains running at different temperatures is needed after every iteration. In light of the tendency of the parallel tempering algorithm to stay in wider modes, each time the APT algorithm was started in $-\underbrace{(1,\ldots,1)}_{d}\in\mathbb{R}^{d}$ . In order to base our analysis on the same sample size of 500,000 for the two methods, in the case of adaptive parallel tempering we applied an initial burn-in period of 200,000 steps.
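For orientation, a log-density of this kind can be sketched as follows. The display (5.1) is not reproduced here, so the sketch assumes an equal-weight mixture of two isotropic Gaussians centred at $-(1,\ldots,1)$ (variance $\sigma_{1}^{2}$) and $(1,\ldots,1)$ (variance $\sigma_{2}^{2}$); the weights, the centres and all function names are assumptions for illustration only.

```python
import math

def log_normal_iso(x, centre, var):
    """Log-density of N(centre, var * I) at x."""
    d = len(x)
    sq = sum((xi - ci) ** 2 for xi, ci in zip(x, centre))
    return -0.5 * d * math.log(2.0 * math.pi * var) - 0.5 * sq / var

def log_target(x, var1, var2):
    """Assumed equal-weight two-component mixture, evaluated via log-sum-exp."""
    d = len(x)
    a = math.log(0.5) + log_normal_iso(x, [-1.0] * d, var1)
    b = math.log(0.5) + log_normal_iso(x, [1.0] * d, var2)
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def variances(d):
    """Variances used in the text: sigma_1^2 = 0.5*sqrt(d/100), sigma_2^2 = sqrt(d/100)."""
    return 0.5 * math.sqrt(d / 100.0), math.sqrt(d / 100.0)
```

Working on the log scale with log-sum-exp avoids underflow, which matters in the higher dimensions considered, where each component density is extremely small away from its centre.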

The results presented in the boxplots of Figure 2, as well as the upper panel of density plots (Figure 4) show that our method outperforms adaptive parallel tempering on this example, even when the latter method is given a much larger computational budget. The summary of the acceptance rates of the jump moves presented in Table 1 demonstrates that the algorithm preserves good mixing between the modes in all its jump versions up to dimension 80. It is remarkable that the deterministic jump ensures excellent mixing even in much higher dimensions, outperforming the remaining two methods (see Figure 3 and the lower panel of Figure 4), with the acceptance rate between 0.64 and 0.97 in dimension 200.

5.2 Mixture of $t$ and banana-shaped distributions

A classic example of a multimodal distribution is a mixture of 20 bivariate Gaussian distributions introduced in [31] (in two versions, with equal and unequal weights and covariance matrices). It was later studied also by [34] and [49]. Our algorithm works well on both versions; however, since the example is relatively simple and the performance of the existing methods on it is already satisfactory, we do not expect our method to yield much improvement. Therefore, we modified this example as described below in order to make it more challenging. Instead of the Gaussian distribution, the first five modes follow the banana-shaped distribution with $t$ tails and the remaining ones follow the multivariate $t$ distribution with 7 degrees of freedom and the covariance matrices $0.01\sqrt{d}I_{d}$ , where $d$ is the dimension (the covariance matrices in the original example were given by $0.01I_{2}$ ). The weights are assumed to be equal to 0.05. We consider dimensions $d=10$ and $d=20$ by repeating the original coordinates of the centres of the modes five and ten times, respectively.

Recall the definition of the $d$ -dimensional banana-shaped distribution introduced by [26] (originally, in the paper by Haario et al. [26], the function $f$ was the density of the Gaussian distribution $N(\textbf{0},C)$ ). Let $f$ be the density of the centred $t$ distribution with 7 degrees of freedom and shape matrix $C$ , for $C=\text{diag}(100,\underbrace{1,\ldots,1}_{d-1})$ . Then the density of the banana-shaped distribution (with $t$ -tails) is given by

[TABLE]

where

[TABLE]
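As a rough sketch of this construction, the code below composes a multivariate $t$ log-density with a "twist" of the second coordinate. The twist $x_{2}\mapsto x_{2}+bx_{1}^{2}-100b$ is the form commonly attributed to Haario et al. and is an assumption here, since the displays are not reproduced; the function names are hypothetical, and any further rescaling applied in the experiments is not included.

```python
import math

DF = 7.0  # degrees of freedom stated in the text

def log_t_density(x, shape_diag, df=DF):
    """Log-density of a centred multivariate t with diagonal shape matrix."""
    d = len(x)
    quad = sum(xk * xk / ck for xk, ck in zip(x, shape_diag))
    log_norm = (math.lgamma(0.5 * (df + d)) - math.lgamma(0.5 * df)
                - 0.5 * d * math.log(df * math.pi)
                - 0.5 * sum(math.log(ck) for ck in shape_diag))
    return log_norm - 0.5 * (df + d) * math.log1p(quad / df)

def twist(x, b):
    """Assumed banana map: only the second coordinate is changed."""
    y = list(x)
    y[1] = x[1] + b * x[0] ** 2 - 100.0 * b
    return y

def log_banana(x, b):
    """log f_b(x) = log f(twist(x)); the twist has unit Jacobian."""
    d = len(x)
    shape_diag = [100.0] + [1.0] * (d - 1)  # C = diag(100, 1, ..., 1)
    return log_t_density(twist(x, b), shape_diag)
```

Because the map only shifts the second coordinate by a function of the first, its Jacobian determinant is one, so no correction term is needed for `log_banana` to be a normalised log-density.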

In order to decrease the variance of the banana-shaped elements of the mixture, we used the following transformation of $f_{b}$ (setting $b=0.03$ )

[TABLE]

Furthermore, the formula on the second coordinate of (5.2) was assigned to coordinate 2, 4, 6, 8, 10 for mode 1, 2, 3, 4 and 5, respectively.

The results below are based on 500,000 iterations, preceded by 40,000 BFGS runs. The number of iterations of the covariance matrix estimation varied between 7,000 and 15,000 steps per mode for dimension $d=10$ and between 15,000 and 63,000 steps per mode for dimension $d=20$ . For adaptive parallel tempering we used 2,100,000 iterations and 5 temperatures. We applied an initial burn-in period of 600,000 steps and we thinned the chain keeping every third sample.
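The budget arithmetic for adaptive parallel tempering can be checked with a short helper. It assumes one target evaluation per temperature per iteration, which matches the count of 3,500,000 evaluations quoted for 700,000 iterations and 5 temperatures in Section 5.1; the function name `apt_budget` is hypothetical.

```python
def apt_budget(iterations, temperatures, burn_in, thin):
    """Return (target evaluations, retained samples) for an APT run."""
    evaluations = iterations * temperatures   # one evaluation per chain per step
    kept = (iterations - burn_in) // thin      # samples left after burn-in and thinning
    return evaluations, kept

# Settings of this section: 2,100,000 iterations, 5 temperatures,
# burn-in of 600,000 steps, every third sample kept.
print(apt_budget(2_100_000, 5, 600_000, 3))   # (10500000, 500000)
```

Under this assumption the retained APT sample has the same size, 500,000, as the number of main-algorithm iterations used for our method.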

In Supplementary Material B we present results for the same example obtained using JAMS in dimensions $d=50$ and $d=80$ assuming that the modes of the target distribution are known, since mode finding (in particular, getting to each basin of attraction) is the main bottleneck for this example.

For dimensions $d=10$ and $d=20$ all modes were found by the BFGS runs in each of the 20 simulations. Even though the banana-shaped modes are highly skewed, our method exhibits good between-mode mixing properties, as shown in Table 2. Figure 5 illustrates that the empirical means based on JAMS samples approximate well the true expected value of the target distribution, consistently across all experiments, and that our method significantly outperforms APT with a smaller computational cost.

5.3 Sensor network localisation

We consider here an example from [28], analysed later by [1], [32] and, in a modified version, by [49]. There are 11 sensors with locations $x_{1},\ldots,x_{11}$ scattered on a space $[0,1]^{2}$ . The locations of sensors $x_{1},\ldots,x_{8}$ are unknown; the remaining three locations are known. For any two sensors $i$ and $j$ we observe the distance $y_{ij}$ between them with probability $\exp\left(-\frac{\|x_{i}-x_{j}\|^{2}}{2\times 0.3^{2}}\right)$ . Once observed, the distance $y_{ij}$ follows the normal distribution given by $y_{ij}\sim N\left(\|x_{i}-x_{j}\|,0.02^{2}\right)$ . Let $w_{ij}$ be equal to 1 when $y_{ij}$ is observed and 0 otherwise, and denote $y:=\{y_{ij}\}$ and $w:=\{w_{ij}\}$ . The goal of the study is to make inference about the unknown locations $x_{i}=(z_{i1},z_{i2})$ for $i=1,\ldots,8$ given $y$ and $w$ . Following [1] and [32] we put an improper uniform prior on each of the coordinates $z_{i1}$ and $z_{i2}$ for $i=1,\ldots,8$ . The resulting posterior distribution is given by

[TABLE]

where

[TABLE]

Since there are few observed distances with known locations (see the top left panel of Figure 6), the model is non-identifiable, which results in multimodality of the posterior distribution.
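The model described in words above translates into an unnormalised log-posterior along the following lines. This is a sketch under stated assumptions: the data layout (matrices `y` and `w` indexed over unknown sensors followed by known ones), the function name, and the decision to skip pairs of known sensors (which only contribute a constant) are choices made here, not taken from the paper.

```python
import math

R = 0.3       # scale in the observation probability exp(-dist^2 / (2 * 0.3^2))
SIGMA = 0.02  # standard deviation of the observed distance

def log_posterior(unknown, known, y, w):
    """Unnormalised log-posterior under a flat prior on the unknown locations.

    unknown, known: lists of (z1, z2) pairs; sensors are indexed unknown first.
    y[i][j], w[i][j]: observed distance and observation indicator for i < j.
    """
    x = list(unknown) + list(known)
    n_unknown = len(unknown)
    total = 0.0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            if i >= n_unknown:
                continue  # both sensors known: constant term, omitted
            dist = math.hypot(x[i][0] - x[j][0], x[i][1] - x[j][1])
            log_p_obs = -dist ** 2 / (2.0 * R ** 2)
            if w[i][j] == 1:
                # Distance was observed: Bernoulli term plus Gaussian noise term.
                total += log_p_obs
                total += (-0.5 * math.log(2.0 * math.pi * SIGMA ** 2)
                          - (y[i][j] - dist) ** 2 / (2.0 * SIGMA ** 2))
            else:
                if dist == 0.0:
                    return -math.inf  # a missing observation has probability zero
                total += math.log(-math.expm1(log_p_obs))  # log(1 - p_obs)
    return total
```

A function of this shape is all that a generic sampler needs from the user; the multimodality arises because several configurations of the unknown sensors reproduce the few observed distances equally well.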

We ran JAMS on this example for 500,000 iterations of the main algorithm. This was preceded by 10,000 BFGS runs and covariance matrix estimation (between 7000 and 15,0000 iterations per mode). For parallel tempering we used 700,000 iterations (with a burn-in period of 200,000) and 4 temperatures. If JAMS is implemented on 8 cores, this means that running an APT simulation is about twice as costly as running a JAMS one (see Supplementary Material B for details).

Despite the fact that for all 20 APT experiments the acceptance rates at all temperature levels, as well as for between-temperature swaps, converged to the optimal acceptance rate 0.234 (see [8]), the behaviour of this algorithm was unstable. As shown in Figure 6, in the case of APT the estimation of the location of sensor 1 depends on the starting point. In the case of JAMS, both modes for $x_{1}$ (in red) are represented.

Figure 7 illustrates stability of JAMS across all experiments and jump methods. In Supplementary Material B we assign an even higher computational budget to adaptive parallel tempering allowing for 5 temperatures and observe a substantial improvement in mixing and stability, but the results are still worse than those of JAMS.

6 Summary and discussion

The approach we proposed here is based on three fundamental ideas. Firstly, we split the task into mode finding and sampling from the target distribution. Secondly, we base our algorithm on local moves responsible for mixing within the same mode and jumps that facilitate crossing the low probability barriers between the modes. Finally, we account for inhomogeneity between the modes by using different proposal distributions for local moves at each mode and adapting their parameters separately. Similarly, the jump moves account for the difference in geometry of the two involved modes. This is possible thanks to the auxiliary variable approach which enables assigning each MCMC sample to one of the modes and ensuring that it is unlikely to escape to another mode via local moves. This improves over the popular tempering-based approaches, which do not have the mechanism of controlling the mode at each step, and therefore their adaptive versions [34] only learn the global covariance matrix rather than the local ones. This is highly inefficient if the shapes of the modes are very distinct and results in exponential efficiency decay.

The optimisation-based approaches are naturally well-suited for the task of collecting the MCMC samples separately for each mode and learning the covariance matrices on this basis. However, the approaches known in the literature do not have a suitable framework for adaptation and tend to be either very costly (e.g. [57]) or to ignore the possibility of moving between the modes via local steps (e.g. [1]). Moreover, some of the other fundamental issues of optimisation-based methods have not been systematically addressed so far. These include an efficient design of the mode finding phase, distinguishing between newly discovered modes and replicated ones, as well as adapting beyond the infrequent regeneration times, which does not require case-specific calculations. We hope that the method we proposed will fill this gap.

Furthermore, an important advantage of our approach from the point of view of the modern compute resources is that a large part of the algorithm can be implemented on multiple cores.

To develop a methodological approach and prove ergodic results for our algorithm, we introduced the Auxiliary Variable Adaptive MCMC class. As discussed briefly in Section 3, there are other adaptive algorithms falling in this category, so our theoretical results may potentially be useful beyond the scope of the Jumping Adaptive Multimodal Sampler. We have also shown that the Auxiliary Variable Adaptive MCMC methods enjoy robust ergodicity properties analogous to Adaptive MCMC under essentially the same well-studied regularity conditions.

Currently the main bottleneck of the method is mode finding, and in particular, sampling starting points for optimisation runs in such a way that there is at least one point in the basin of attraction of each mode. Therefore in our future work we will focus on designing more efficient algorithms for identifying high probability regions.

Acknowledgements

We thank Louis Aslett, Ewan Cameron, Arnaud Doucet, Geoff Nicholls and Jeffrey Rosenthal for helpful comments, and Shiwei Lan for pointing us to the data for the sensor network example. We would also like to thank Radu Craiu for providing the data set for the LOH example considered in Supplementary Material B.

Supplementary Material A

In Section 7 we present the proofs of our theoretical results stated in Section 3. In Section 8 we prove the results stated in Section 4. In Section 9 we give some comments about other algorithms in the Auxiliary Variable Adaptive MCMC class.

7 Proofs for Section 3

To prove our results presented in Section 3 we will use the coupling construction analogous to [42] (see also [45] for a more rigorous presentation).

Our proofs will be rigorous and will rely on an explicit coupling construction. The more complex setting of Auxiliary Variable Adaptive MCMC necessitates a few preliminary steps: to interpolate between the adaptive process and the target distribution we shall construct two processes to be thought of as "Markovian" and "intermediate". These processes will facilitate application of the triangle inequality in the proofs.

Recall $\{(\tilde{X}_{n},\Gamma_{n})\}_{n=0}^{\infty},$ the adaptive process on $\tilde{\mathcal{X}}:=\mathcal{X}\times\Phi$ defined in Section 3 with dynamics governed by equation (3.3). On the same probability space define two additional sequences, namely $\{(\tilde{X}^{m(t^{*})}_{n},\Gamma^{m(t^{*})}_{n})\}_{n=0}^{\infty}$ and $\{(\tilde{X}^{i(t^{*},\kappa)}_{n},\Gamma^{i(t^{*},\kappa)}_{n})\}_{n=0}^{\infty}$ which are identical to $\{(\tilde{X}_{n},\Gamma_{n})\}_{n=0}^{\infty},$ before a pre-specified time $t^{*}$ , i.e.

[TABLE]

After time $t^{*},$ the adaptive parameter $\Gamma^{m(t^{*})}$ of $\{(\tilde{X}^{m(t^{*})}_{n},\Gamma^{m(t^{*})}_{n})\}_{n=0}^{\infty}$ freezes and $\tilde{X}^{m(t^{*})}_{n}$ becomes a Markov chain with the marginal dynamics defined for $n+1>t^{*}$ as:

[TABLE]

The second sequence $\{(\tilde{X}^{i(t^{*},\kappa)}_{n},\Gamma^{i(t^{*},\kappa)}_{n})\}_{n=0}^{\infty}$ interpolates between $\{(\tilde{X}_{n},\Gamma_{n})\}_{n=0}^{\infty}$ and $\{(\tilde{X}^{m(t^{*})}_{n},\Gamma^{m(t^{*})}_{n})\}_{n=0}^{\infty}$ . We first define the dynamics of $\Gamma^{i(t^{*},\kappa)}$ for $n+1>t^{*},$ as:

[TABLE]

and define an auxiliary stopping time that records decoupling of $\Gamma_{n}$ and $\Gamma^{i(t^{*},\kappa)}_{n}$ as

[TABLE]

with the convention $\min\emptyset=\infty.$ Now, define the dynamics of $\tilde{X}^{i(t^{*},\kappa)}_{n}$ as:

[TABLE]

for $n+1>\tau_{i(t^{*},\kappa)}$ and $\tilde{B}\in\mathcal{B}(\tilde{\mathcal{X}})$ .

Define also the filtration $\{\mathcal{G}_{n}^{*}\}_{n=0}^{\infty}$ as an extension of $\{\mathcal{G}_{n}\}_{n=0}^{\infty}$ by:

[TABLE]

Let the distributions

[TABLE]

and

[TABLE]

be analogues of $\tilde{A}_{n}^{(\tilde{x},\gamma)}(\cdot)$ , $\tilde{A}_{n}^{\mathcal{G}_{t}}(\cdot)$ , $A_{n}^{(\tilde{x},\gamma)}(\cdot)$ , $A_{n}^{\mathcal{G}_{t}}(\cdot)$ , where in the definitions of the above terms, instead of $\{(\tilde{X}_{n},\Gamma_{n})\}_{n=0}^{\infty},$ we use the sequences $\{(\tilde{X}^{m(t^{*})}_{n},\Gamma^{m(t^{*})}_{n})\}_{n=0}^{\infty}$ and $\{(\tilde{X}^{i(t^{*},\kappa)}_{n},\Gamma^{i(t^{*},\kappa)}_{n})\}_{n=0}^{\infty}$ , respectively, and condition on the extended $\sigma$ -algebras defined in (7.9).

Lemma 7.1.

Let $\tilde{\nu}_{1}$ and $\tilde{\nu}_{2}$ be probability measures on $\tilde{\mathcal{X}}=\mathcal{X}\times\Phi$ and let $\nu_{1}$ and $\nu_{2}$ be their marginals on $\mathcal{X}$ . Then

$$\|\nu_{1}-\nu_{2}\|_{TV}\leq\|\tilde{\nu}_{1}-\tilde{\nu}_{2}\|_{TV}.$$

Proof.

Total variation distances on $\tilde{\mathcal{X}}$ involve suprema over larger classes of sets than those on $\mathcal{X}$ , in particular $|\nu_{1}(B)-\nu_{2}(B)|=|\tilde{\nu}_{1}(B\times\Phi)-\tilde{\nu}_{2}(B\times\Phi)|$ . ∎
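Written out, the argument reads as follows. The sketch assumes the convention that total variation is the supremum over measurable sets of the absolute difference of the two measures; with a different normalising constant the same chain of steps applies.

```latex
\begin{align*}
\|\nu_{1}-\nu_{2}\|_{TV}
 &= \sup_{B\in\mathcal{B}(\mathcal{X})}\bigl|\nu_{1}(B)-\nu_{2}(B)\bigr| \\
 &= \sup_{B\in\mathcal{B}(\mathcal{X})}\bigl|\tilde{\nu}_{1}(B\times\Phi)-\tilde{\nu}_{2}(B\times\Phi)\bigr| \\
 &\leq \sup_{\tilde{B}\in\mathcal{B}(\tilde{\mathcal{X}})}\bigl|\tilde{\nu}_{1}(\tilde{B})-\tilde{\nu}_{2}(\tilde{B})\bigr|
 = \|\tilde{\nu}_{1}-\tilde{\nu}_{2}\|_{TV}.
\end{align*}
```

The inequality holds because every set of the form $B\times\Phi$ is itself a measurable subset of $\tilde{\mathcal{X}}$ , so the second supremum ranges over a subclass of the sets in the third.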

Lemma 7.2.

Let Simultaneous Uniform Ergodicity, i.e. condition (a) of Theorem 3.2, hold. Then for all $\varepsilon>0,$ there exists $N_{0}=N_{0}(\varepsilon)$ such that for all $N\geq N_{0}$

[TABLE]

Proof.

First observe that for any $B\in\mathcal{B}(\mathcal{X})$ the object $A_{t^{*}+N}^{m(t^{*}),\mathcal{G}_{t^{*}}^{*}}(B)$ is a $\mathcal{G}_{t^{*}}^{*}$ -measurable random variable and apply Jensen’s inequality to obtain (7.12) below. Next, use Lemma 7.1 in (7.13). To get (7.14) recall that $\{\tilde{X}_{k}^{m(t^{*})}\}_{k=t^{*}}^{\infty}$ is a Markov chain started from $\tilde{X}_{t^{*}}$ with dynamics $\tilde{P}_{\Gamma_{t^{*}}}$ . Then use monotonicity of the total variation for Markov chains, and finally pick $N_{0}=N_{0}(\varepsilon)$ via assumption (a) of Theorem 3.2 to conclude (7.16).

[TABLE]

∎

Lemma 7.3.

Let Containment, i.e. condition (a) of Theorem 3.3, hold. Then for all $\tilde{x}\in\tilde{\mathcal{X}},\;\gamma\in\mathcal{Y}$ and $\varepsilon>0,$ there exists $N_{0}=N_{0}(\varepsilon,\tilde{x},\gamma),$ such that for all $t^{*}\in\mathbb{N}$ and all $N\geq N_{0},$

[TABLE]

Proof.

Reiterate the argument in the proof of Lemma 7.2 to get the first part of (7.17) and then to arrive at (7.14). Next, use the Containment condition, namely choose $N_{0}=N_{0}(\varepsilon,\tilde{x},\gamma)$ such that

[TABLE]

for all $n\in\mathbb{N}$ . Let $G:=\{M_{\varepsilon/2}(\tilde{X}_{t^{*}},\Gamma_{t^{*}})\leq N_{0}\}$ , then $\mathbb{P}(G^{\prime})\leq\varepsilon/2$ and $\|\tilde{P}_{\Gamma_{t^{*}}}^{N_{0}}(\tilde{X}_{t^{*}},\cdot)-\tilde{\pi}_{\Gamma_{t^{*}}}(\cdot)\|_{TV}\leq\varepsilon/2$ on $G$ . Therefore we get

[TABLE]

as required. ∎

Lemma 7.4.

Let Diminishing Adaptation, i.e. condition (b) of Theorem 3.2, hold. Then for all $\varepsilon>0,$ $\kappa>0$ and $N_{0}\in\mathbb{N}$ there exists $t_{0}=t_{0}(\varepsilon,\kappa,N_{0})$ such that for every $t^{*}\geq t_{0}$ and every $N\leq N_{0},$

[TABLE]

Proof.

Recall $D_{n}$ defined in condition (b) of Theorem 3.2 and let

[TABLE]

Note that by Diminishing Adaptation for every $n\geq t_{0}=t_{0}(\varepsilon,\kappa,N_{0})$ we have $\mathbb{P}(H_{n})\leq\varepsilon/N_{0}.$ Now, for $t^{*}\geq t_{0},$ define

[TABLE]

Consider the process $\{(\tilde{X}^{i(t^{*},\kappa)}_{n},\Gamma^{i(t^{*},\kappa)}_{n})\}_{n=0}^{\infty}$ with $t^{*}\geq t_{0}$ . Note that on $E$ we have $\tilde{X}_{n}=\tilde{X}^{i(t^{*},\kappa)}_{n}$ , for $n=0,1,\dots,t^{*}+N_{0},$ and therefore the coupling inequality (see e.g. Section 4.1 of [41]) for every $N\leq N_{0}$ yields as claimed

[TABLE]

∎

Lemma 7.5.

For every $\kappa>0,$ $t^{*}\in\mathbb{N}$ and $N\in\mathbb{N}$ , the distributions of $\{(\tilde{X}^{i(t^{*},\kappa)}_{n},\Gamma^{i(t^{*},\kappa)}_{n})\}_{n=0}^{\infty}$ and $\{(\tilde{X}^{m(t^{*})}_{n},\Gamma^{m(t^{*})}_{n})\}_{n=0}^{\infty}$ satisfy the following:

[TABLE]

Proof.

First apply Jensen’s inequality

[TABLE]

and recall equations (7.1) and (7.2) to note that

[TABLE]

that is $\{\tilde{X}_{n}^{m(t^{*})}\}_{n=t^{*}}^{t^{*}+N}$ is a Markov chain started from $\tilde{X}_{t^{*}}$ with dynamics $\tilde{P}_{\Gamma_{t^{*}}}$ . Combining (7.27) with (7.28) yields (7.25).

Now recall (7.5), (7.7), (7.8), i.e. the dynamics of $\{(\tilde{X}^{i(t^{*},\kappa)}_{n},\Gamma^{i(t^{*},\kappa)}_{n})\}_{n=K-N}^{K},$ and observe that (7.5) yields

[TABLE]

Hence, for every $n=t^{*},\dots,t^{*}+N-1,\;$ if $\tilde{X}^{m(t^{*})}_{n}=\tilde{X}^{i(t^{*},\kappa)}_{n},$ then by (7.29) and Proposition 3(g) of [41], there exists a coupling of $\tilde{X}^{m(t^{*})}_{n+1}$ and $\tilde{X}^{i(t^{*},\kappa)}_{n+1},$ such that

[TABLE]

Reiterating this construction $N$ times from $n=t^{*}$ to $n=t^{*}+N-1$ implies that there exists a coupling such that

[TABLE]

Hence by the coupling inequality

[TABLE]

as required. ∎
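The coupling used in the proof above relies on the fact that two distributions can always be coupled so that the probability of disagreement equals their total variation distance. For intuition only, the following sketch builds such a maximal coupling explicitly for two discrete distributions; the discrete setting and the function names are assumptions for illustration, whereas the proof works on a general state space.

```python
def total_variation(p, q):
    """Total variation distance between two pmfs on the same finite set."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def maximal_coupling(p, q):
    """Joint pmf with marginals p and q such that P(X != Y) = TV(p, q)."""
    n = len(p)
    overlap = [min(a, b) for a, b in zip(p, q)]
    tv = 1.0 - sum(overlap)
    joint = [[0.0] * n for _ in range(n)]
    # Put as much mass as possible on the diagonal, where X == Y.
    for k in range(n):
        joint[k][k] = overlap[k]
    # Spread the leftover mass as a product of the normalised residuals.
    if tv > 0.0:
        for i in range(n):
            for j in range(n):
                joint[i][j] += (p[i] - overlap[i]) * (q[j] - overlap[j]) / tv
    return joint
```

The residual product never adds mass to the diagonal, because for each index at least one of the two residuals is zero; hence the off-diagonal mass is exactly $1-\sum_{k}\min(p_{k},q_{k})$ , which is the total variation distance.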

Proof of Theorem 3.2

Proof.

By the triangle inequality, for any $n,t^{*},$ and $\kappa,$ we have

[TABLE]

where in the second inequality, for the first two terms, we have used Lemma 7.1.

Now fix $\delta>0$ . To prove the claim, it is enough to construct a target time $K_{0}=K_{0}(\delta,\tilde{x},\gamma)$ , s.t.

[TABLE]

We shall find such target time of the form $K_{0}=t_{0}+N_{0}$ . To this end let $\varepsilon=\delta/3$ .

First, use Lemma 7.2 to fix $N_{0}=N_{0}(\varepsilon)$ so that

[TABLE]

Next, take $\kappa:=\varepsilon/N_{0}^{2}$ and use Lemma 7.5 to conclude that

[TABLE]

Finally, use Lemma 7.4 to find $t_{0}=t_{0}(\varepsilon,\kappa,N_{0})$ such that

[TABLE]

Letting $K_{0}:=t_{0}+N_{0}$ allows us to decompose every $K\geq K_{0}$ into $K=t^{*}+N_{0}$ with $t^{*}\geq t_{0}$ , so that (7.33), (7.34), (7.35) are satisfied, which yields the claim. ∎

Proof of Theorem 3.3

Proof.

The proof is identical, except that we use Lemma 7.3 instead of Lemma 7.2 to find $N_{0}=N_{0}(\varepsilon,\tilde{x},\gamma)$ in (7.33). ∎

Proof of Theorem 3.4

Proof.

To prove that the Weak Law of Large Numbers holds for the Auxiliary Variable Adaptive MCMC class, recall again the sequences $\{(\tilde{X}^{m(t^{*})}_{n},\Gamma^{m(t^{*})}_{n})\}_{n=0}^{\infty}$ and $\{(\tilde{X}^{i(t^{*},\kappa)}_{n},\Gamma^{i(t^{*},\kappa)}_{n})\}_{n=0}^{\infty}$ defined above. Without loss of generality we will assume that $\pi(g)=0$ and that $|g(x)|<a$ .

By Markov’s inequality

[TABLE]

hence to obtain the WLLN it is enough to show that for every $\delta>0$ there exists $T_{0}=T_{0}(\delta)=T_{0}(\delta,\tilde{x},\gamma),$ where $(\tilde{x},\gamma)$ are the starting points of $(\tilde{X}_{n},\Gamma_{n}),$ such that for all $T>T_{0}$

[TABLE]

We shall deal with (7.37) by considering second moments and will therefore have to deal with mixed terms of the form $\mathbb{E}g(X_{i})g(X_{j})$ . Fix $\varepsilon>0$ ; we shall pick a specific value later. Firstly, for $i<j,$ consider the following calculation.

[TABLE]

where $N_{0}(\varepsilon,\tilde{x},\gamma)$ has been obtained from Lemma 7.2, if assuming Simultaneous Uniform Ergodicity, or from Lemma 7.3, if assuming Containment.

Secondly, given $N_{0}$ in (7.38), fix $N_{1}\geq N_{0}$ such that $1/N_{1}<\varepsilon$ and consider pairs $i,j$ satisfying $N_{1}\leq j-i\leq N_{1}^{2}.$ Set $\kappa:=\varepsilon/N_{1}^{4},$ and compute

[TABLE]

Finally, use Lemma 7.4 to find $t_{0}=t_{0}(\varepsilon,\kappa,N_{1}^{2})$ and conclude

[TABLE]

We are ready to address the mixed term. Since $|g|<a$ , trivially for any $i,j$

[TABLE]

Moreover, for $\varepsilon>0,$ $N_{1}\geq N_{0}(\varepsilon,\tilde{x},\gamma),$ $\kappa=\varepsilon/N_{1}^{4}$ , pairs $i,j$ such that $N_{1}\leq j-i\leq N_{1}^{2}$ , and $i>t_{0}(\varepsilon,\kappa,N_{1}^{2})$ equations (7.38), (7.39) and (7.40) yield

[TABLE]

Consequently, for any $N_{1}>\max\{1/\varepsilon,N_{0}\}$ and $t_{1}>t_{0}$ , chosen as above, we can compute

[TABLE]

where we have used (7.42) and (7.41) to bound the first and second summation, respectively.

By the Cauchy-Schwarz inequality (7.43) implies

[TABLE]

for $N_{1}>\max\{1/\varepsilon,N_{0}(\varepsilon,\tilde{x},\gamma)\}$ and $t_{1}>t_{0}(\varepsilon,\kappa,N_{1}^{2})$ .

Following the proof of Theorem 5 of [42], fix $T$ so large that

[TABLE]

Use (7.44) and (7.45) to observe that

[TABLE]

Setting $\varepsilon:=(\delta/2(\sqrt{2}a+1))^{2}$ in the above argument yields (7.37) as desired. ∎

Proof of Lemma 3.5

Proof.

We will begin the proof by showing that assumption (3.6) implies that an analogous drift condition is satisfied for $\tilde{P}_{\gamma}^{n_{0}}$ , $n_{0}$ defined in (3.7), perhaps with different constants $\lambda$ and $b$ , which we define below. For any $k\in\{1,\ldots,n_{0}\}$ we have

[TABLE]

For $k=2$ we have

[TABLE]

By similar calculations and induction we obtain

[TABLE]

as required.
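The induction can be sketched as follows, under the assumption that the one-step drift condition (3.6) has (or implies) the form $\tilde{P}_{\gamma}V\leq\lambda V+b$ with $\lambda<1$ , writing $V$ for $V_{\tilde{\pi}_{\gamma}}$ ; the exact constants used in the paper may differ.

```latex
\begin{align*}
\tilde{P}_{\gamma}^{2}V
 &\leq \lambda\,\tilde{P}_{\gamma}V + b
 \leq \lambda^{2}V + (1+\lambda)\,b, \\
\tilde{P}_{\gamma}^{k}V
 &\leq \lambda^{k}V + b\sum_{j=0}^{k-1}\lambda^{j}
 = \lambda^{k}V + b\,\frac{1-\lambda^{k}}{1-\lambda}
 \leq \lambda^{k}V + \frac{b}{1-\lambda}.
\end{align*}
```

In particular, the $n_{0}$ -step kernel satisfies a drift condition of the same type with contraction factor $\lambda^{n_{0}}<1$ and an additive constant of at most $b/(1-\lambda)$ .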

By Theorem 12 of [46] and conditions (3.7) and (7.48), there exists $K<\infty$ and $\rho<1$ , depending only on $\lambda$ , $b$ , $v$ , $n_{0}$ and $\delta$ , such that for each $\gamma\in\mathcal{Y}$ and for any $k\in\mathbb{N}$ we have

[TABLE]

We now use the monotonicity of $\|\tilde{P}_{\gamma}^{n}(\tilde{x},\cdot)-\tilde{\pi}_{\gamma}(\cdot)\|_{TV}$ in $n$ (see Proposition 3b) of [41]) to argue that

[TABLE]

Let $\tilde{\rho}:=\rho^{\frac{1}{n_{0}}}$ . It follows that for every $m\in\mathbb{N}$

[TABLE]
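The passage from multiples of $n_{0}$ to arbitrary times can be spelled out as follows. The sketch writes the bound available at times $kn_{0}$ generically as $C(\tilde{x})\rho^{k}$ ; the symbol $C(\tilde{x})$ is an assumption standing in for the right-hand side of (7.49), whose exact form is not reproduced here. For $n=kn_{0}+r$ with $0\leq r<n_{0}$ , monotonicity of the total variation distance in $n$ gives

```latex
\begin{align*}
\|\tilde{P}_{\gamma}^{n}(\tilde{x},\cdot)-\tilde{\pi}_{\gamma}(\cdot)\|_{TV}
 &\leq \|\tilde{P}_{\gamma}^{kn_{0}}(\tilde{x},\cdot)-\tilde{\pi}_{\gamma}(\cdot)\|_{TV}
 \leq C(\tilde{x})\,\rho^{k} \\
 &= C(\tilde{x})\,\tilde{\rho}^{\,kn_{0}}
 = C(\tilde{x})\,\tilde{\rho}^{\,n-r}
 \leq C(\tilde{x})\,\tilde{\rho}^{\,n-n_{0}}
 = \frac{C(\tilde{x})}{\rho}\,\tilde{\rho}^{\,n}.
\end{align*}
```

The price of interpolating between multiples of $n_{0}$ is therefore only the constant factor $1/\rho$ , while the geometric rate becomes $\tilde{\rho}=\rho^{1/n_{0}}<1$ .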

The next step of the proof will be to show that the sequence $V_{\tilde{\pi}_{\Gamma_{n}}}(\tilde{X}_{n})$ is bounded in probability. By Lemma 3 in [42], it suffices to show that $\sup_{n}\mathbb{E}V_{\tilde{\pi}_{\Gamma_{n}}}(\tilde{X}_{n})<\infty$ . Firstly, let us show that $\tilde{P}V_{\tilde{\pi}_{\gamma}}(\tilde{x})$ is bounded for $\gamma\in\mathcal{Y}$ and $\tilde{x}\in A$ . Note that

[TABLE]

Since $A$ and $\mathcal{Y}$ were assumed to be compact, $\sup_{\gamma\in\mathcal{Y}}\sup_{\tilde{x}\in A}V_{\tilde{\pi}_{\gamma}}(\tilde{x})<\infty$ . Additionally, the drift condition (3.6) yields

[TABLE]

Therefore we can define $M:=\sup_{\gamma\in\mathcal{Y}}\sup_{\tilde{x}\in A}\tilde{P}_{\gamma}V_{\tilde{\pi}_{\gamma}}(\tilde{x})<\infty$ . It follows that

[TABLE]

By the law of total expectation,

[TABLE]

which combined with (7.51) gives

[TABLE]

This implies, using Lemma 2 in [42] that

[TABLE]

Lemma 3.5 will now follow from combining the fact that the sequence $V_{\tilde{\pi}_{\Gamma_{n}}}(\tilde{X}_{n})$ is bounded in probability with (7.50). Note that for any fixed $\varepsilon$ and $\tilde{\delta}$ , there exists $N$ such that

[TABLE]

for all $n\in\mathbb{N}$ . The last inequality holds since $\varepsilon\tilde{\rho}^{-N}\rho-K\to\infty$ as $N\to\infty$ and $V_{\tilde{\pi}_{\Gamma_{n}}}(\tilde{X}_{n})$ is bounded in probability. ∎

8 Proofs for Section 4

Proof of Theorem 4.1

Proof.

The aim of the proof is to verify the assumptions of Theorem 3.2 and conclude. Diminishing Adaptation has been addressed in Section 3.2, so it is enough to prove that Simultaneous Uniform Ergodicity holds. Note that assumption (4.4) implies that for some positive constant $c_{1}$

[TABLE]

for each $k\in\mathcal{I}$ , $y\in\mathcal{X}$ and $\gamma\in\mathcal{Y}$ . For any $(x,i)\in\mathcal{X}\times\mathcal{I}$ , any set $\hat{C}\subset\mathcal{X}$ and any $k\in\mathcal{I}$ we can compute

[TABLE]

where $\epsilon_{a}$ is as in equation (4.3).

Furthermore, any set $C\subset\mathcal{X}\times\mathcal{I}$ may be decomposed as $C=\bigcup_{k\in\mathcal{I}}\hat{C}_{k}\times\{k\}$ , therefore

[TABLE]

Since $\tilde{\pi}_{\gamma}$ is a probability measure on $\mathcal{X}\times\mathcal{I}$ for each $\gamma\in\mathcal{Y}$ and (8.2) holds for all $(x,i)\in\mathcal{X}\times\mathcal{I}$ , by Theorem 8 of [41] we have

[TABLE]

which completes the proof. ∎

Proof of Theorem 4.2

We will show that the assumptions of Theorem 3.3 are satisfied. Since Diminishing Adaptation was discussed in Section 3.2, it suffices to prove that the Containment condition holds, which we will do using Lemma 3.5. Assumptions a) and d) were discussed in Section 4.1. Assumption e) follows directly from the construction of the algorithm for $A:=\bigcup_{i\in\mathcal{I}}A_{i}\times\{i\}$ . Assumption f) holds trivially, since $\tilde{X}_{0}$ and $\Gamma_{0}$ are deterministic (chosen by the user of the algorithm). The remaining part of the proof is organised as follows.

We show that the drift condition expressed in assumption b) of Lemma 3.5 is satisfied under the assumptions of Theorem 4.2. To this end, we consider a drift function of the form

[TABLE]

for some $s\in(0,1)$ and $c$ such that $c\pi(x)^{-s}\geq 1$ (thus enforcing $V_{\tilde{\pi}_{\gamma}}(\tilde{x})>1$ ). We first focus on obtaining the appropriate result for the local kernels and subsequently we combine it with the result for jumps. Finally, we prove that assumption c) of Lemma 3.5 is satisfied for $n_{0}=3$ .

Assumption b) of Lemma 3.5 (local kernels)

Proof.

The drift function $V_{\tilde{\pi}_{\gamma}}$ defined as above is a jointly continuous function of $(x,\gamma)$ so it is bounded on compact sets in $\mathcal{X}\times\mathcal{Y}$ for each $i\in\mathcal{I}$ , as required by assumption b) of Lemma 3.5. Therefore, it is also bounded on compact sets in $\tilde{\mathcal{X}}\times\mathcal{Y}$ . The proof will be continued for $s=\frac{1}{2}$ but analogous reasoning would be valid for any $s\in(0,1)$ .
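To build intuition for a drift function of the form $c\pi^{-1/2}$ , the following toy computation evaluates the ratio $PV(x)/V(x)$ numerically for a random-walk Metropolis kernel. Everything in it is an assumption made for illustration: a one-dimensional standard Gaussian target, a Gaussian proposal, a single mode with no auxiliary variable, and the hypothetical function name `drift_ratio`.

```python
import math

def drift_ratio(x, proposal_sd=1.0, half_width=12.0, step=1e-3):
    """Approximate PV(x)/V(x) for RWM on a N(0,1) target with V = pi^{-1/2}.

    The integral over the proposal increment is computed by the trapezoid rule.
    """
    n = int(round(2 * half_width / step))
    total = 0.0
    for k in range(n + 1):
        z = -half_width + k * step          # standardised proposal increment
        y = x + proposal_sd * z             # proposed point
        log_r = 0.5 * (x * x - y * y)       # log pi(y) - log pi(x)
        if log_r >= 0.0:
            # Always accepted; V(y)/V(x) = r^{-1/2} <= 1.
            value = math.exp(-0.5 * log_r)
        else:
            # Accepted with probability r: r * r^{-1/2} + (1 - r) * 1.
            r = math.exp(log_r)
            value = math.sqrt(r) + 1.0 - r
        weight = 0.5 if k in (0, n) else 1.0
        total += weight * value * math.exp(-0.5 * z * z)
    return total * step / math.sqrt(2.0 * math.pi)
```

In this toy setting the ratio exceeds one near the centre of the target and falls below one in the tails, approaching the limiting rejection probability of one half; this is the qualitative behaviour that a geometric drift condition outside a compact set formalises.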

We will prove that there exists $\lambda_{L}<1$ such that for the local move kernels we have

[TABLE]

for all $i\in\mathcal{I}$ . We will refer multiple times to the proof of Theorem 4.1 of [29]. Following the notation used there, let $C_{\pi(x)}(\delta)$ denote the radial $\delta$ -zone around $C_{\pi(x)}$ , where $C_{\pi(x)}$ is the contour manifold corresponding to $\pi(x)$ . Firstly, there exists $R_{0}$ such that for $|x|>R_{0}$ the contour manifold $C_{\pi(x)}$ is parametrised by $S^{d-1}$ and encloses the acceptance set for $\pi$ defined as $a(x):=\{y\in\mathcal{X}:\pi(y)>\pi(x)\}$ (we refer to Section 4 of [29] for the details of this argument). In our proof we will only consider $|x|>R_{0}$ . Define also

[TABLE]

By assumption (4.6) $\lambda_{L,i}<1$ .

Fix $i\in\mathcal{I}$ and $\epsilon>0$ . We will show that for sufficiently large $x$

[TABLE]

The idea of this proof is to split $\mathcal{X}$ into disjoint sets $\mathcal{X}\setminus B(x,K)$ , $B(x,K)\cap C_{\pi(x)}(\delta)$ and $B(x,K)\setminus C_{\pi(x)}(\delta)$ and show that for any $x$ with a sufficiently large norm the integral representing acceptance, that is, of the function $R_{\gamma,L,i}(x,y)\min\left[1,\frac{\tilde{\pi}_{\gamma}(y,i)}{\tilde{\pi}_{\gamma}(x,i)}\right]\frac{V_{\tilde{\pi}_{\gamma}}\left((y,i)\right)}{V_{\tilde{\pi}_{\gamma}}\left((x,i)\right)}$ on those sets is bounded from above by $\epsilon$ , $\epsilon$ and $\epsilon^{1/2}$ , respectively. We fix the values of $K$ and $\delta$ below. As for the rejection part, we use (8.5) to show that the corresponding integral is bounded by $\lambda_{L,i}+\epsilon$ , for all $x$ at a sufficient distance from 0. Putting all these upper bounds together, we obtain the required $\lambda_{L,i}+3\epsilon+\epsilon^{1/2}$ .

Firstly, observe that by assumption a) and condition (4.2) the family of distributions $R_{\gamma,L,i}(x,\cdot)$ is tight. Thus, there exists $K$ such that

[TABLE]

Furthermore, as shown in the proof of Theorem 4.1 of [29], under assumption c) of Theorem 4.2 for any positive $\delta$ and $K$

[TABLE]

Fix $K$ satisfying (8.7). Since $\lim_{x\to\infty}\left(\frac{|x|+K}{|x|-K}\right)^{d-1}=1$ , there exists $R_{1}>0$ such that for $|x|>\max[R_{0},R_{1}]$

[TABLE]

Recall that by assumption a) for any $x\in\mathcal{X}$ we have $\sup_{\gamma\in\mathcal{Y}}\sup_{y\in\mathcal{X}}R_{\gamma,L,i}(x,y)>0$ . Now let us choose $\delta$ such that for $|x|>R_{1}$

[TABLE]

therefore getting

[TABLE]

Let $r(x)=\{y\in\mathcal{X}:\pi(y)<\pi(x)\}$ and $a(x)=\{y\in\mathcal{X}:\pi(y)\geq\pi(x)\}$ . We now split $B(x,K)\setminus C_{\pi(x)}(\delta)$ into $\left(r(x)\cap B(x,K)\right)\setminus C_{\pi(x)}(\delta)$ and $\left(a(x)\cap B(x,K)\right)\setminus C_{\pi(x)}(\delta)$ and we estimate the value of $\min\left[1,\frac{\tilde{\pi}_{\gamma}(y,i)}{\tilde{\pi}_{\gamma}(x,i)}\right]\frac{V_{\tilde{\pi}_{\gamma}}\left((y,i)\right)}{V_{\tilde{\pi}_{\gamma}}\left((x,i)\right)}$ on each of those sets separately. Fix $\tilde{K}$ such that

[TABLE]

This is possible by assumption d) combined with conditions (4.2) and (4.3). Since $\pi$ is super-exponential, there exists $R_{2}$ so large that for $|x|>\max[R_{0},R_{2}]$ :

1) If $y\in\left(r(x)\cap B(x,K)\right)\setminus C_{\pi(x)}(\delta)$ , then $\frac{\pi(y)}{\pi(x)}\leq\frac{\epsilon}{\tilde{K}}$ .

2) If $y\in\left(a(x)\cap B(x,K)\right)\setminus C_{\pi(x)}(\delta)$ , then $\frac{\pi(x)}{\pi(y)}\leq\frac{\epsilon}{\tilde{K}}$ .

In the first case we have (using (8.10)):

[TABLE]

Similarly for $y\in\left(a(x)\cap B(x,K)\right)\setminus C_{\pi(x)}(\delta)$ we get $\frac{\tilde{\pi}_{\gamma}(x,i)}{\tilde{\pi}_{\gamma}(y,i)}\leq\epsilon$ . Hence, on $B(x,K)\setminus C_{\pi(x)}(\delta)$ we have

[TABLE]

Furthermore, by assumption (8.5) we can choose $R_{3}$ such that for $|x|>R_{3}$

[TABLE]

Finally, for $|x|>\max[R_{0},R_{1},R_{2},R_{3}]$ we obtain

[TABLE]

which ends the proof of (8.6). Consequently, by setting $\lambda_{L}$ such that $\max_{i\in\mathcal{I}}\lambda_{L,i}<\lambda_{L}<1$ , we obtain (8.4). Observe that there exists $R_{L}>0$ such that if $|x|>R_{L}$ , then

[TABLE]

For $|x|\leq R_{L}$ we have

[TABLE]

Now analogously to $r_{\gamma,i}(x)$ , let us define the acceptance region for $\tilde{\pi}_{\gamma}$ as

[TABLE]

Note that

[TABLE]

Besides

[TABLE]

as for each $i$ the function $V_{\tilde{\pi}_{\gamma}}\left((x,i)\right)$ is jointly continuous with respect to $x$ and $\gamma$ . By setting

[TABLE]

we obtain

[TABLE]

for all $(x,i)\in\mathcal{X}\times\mathcal{I}$ . ∎

Assumption b) of Lemma 3.5 (jump kernels)

Proof.

Firstly recall that under assumption e1) of Theorem 4.2 we have:

[TABLE]

if $x$ belongs to the jumping region $JR_{\gamma,i}$ , and $\tilde{P}_{\gamma}V_{\tilde{\pi}_{\gamma}}\left((x,i)\right)=\int_{\mathcal{X}}\tilde{P}_{\gamma,L,i}\left((x,i),(dy,i)\right)V_{\tilde{\pi}_{\gamma}}\left((y,i)\right)$ otherwise. Recall as well that all the jumping regions $JR_{\gamma,i}$ for $\gamma\in\mathcal{Y}$ , $i\in\mathcal{I}$ are contained within a compact set $D$ and consequently any point $(y,k)$ proposed in a deterministic jump satisfies $(y,k)\in D\times\{k\}$ . Let us now define

[TABLE]

Observe that

[TABLE]

and so for all $(x,i)$

[TABLE]

Finally, setting $\lambda:=\lambda_{L}$ and $b:=b_{L}+b_{J}$ yields (3.6) under assumption e1).

Let us now consider assumption e2). Recall that for any $s\in(0,1)$ if $V_{\tilde{\pi}_{\gamma}}\left((x,i)\right)=c\tilde{\pi}_{\gamma}(x,i)^{-s}$ , then (8.14) holds for some $\lambda_{L}$ , $b_{L}$ and $R_{L}$ . Furthermore,

[TABLE]

By assumption (4.10) there exists a constant $c_{2}$ such that $\frac{R_{\gamma,J,i}(x)}{\pi(x)^{s_{J}}}<c_{2}$ for each $x\in\mathcal{X}$ , $i\in\mathcal{I}$ and $\gamma\in\mathcal{Y}$ and as a consequence,

[TABLE]

where the last inequality follows from (8.10). Fix $s<s_{J}$ and observe that

[TABLE]

where the last inequality follows from $\pi$ being super-exponential and $s_{J}-s$ positive.

Recall additionally that under e2) (8.15) holds for all $(x,i)\in\mathcal{X}\times\mathcal{I}$ . Putting together (8.14), (8.16), (8.17) and (8.15) yields

[TABLE]

By setting $\lambda:=(1-\epsilon)\lambda_{L}+\epsilon$ and $b:=(1-\epsilon)b_{L}+\epsilon b_{J}$ , we obtain the drift condition as given by (3.6). ∎

Assumption c) of Lemma 3.5

Proof.

Proving the minorisation condition (3.7) amounts to specifying $n_{0},$ $\delta$ , $\nu_{\gamma}$ and $v$ , and verifying that $\tilde{P}_{\gamma}^{n_{0}}(\tilde{x},B)\geq\delta\nu_{\gamma}(B)$ for all measurable sets $B$ and all $\tilde{x}$ satisfying $V_{\tilde{\pi}_{\gamma}}(\tilde{x})\leq v$ . Let $C_{v}$ be defined as $C_{v}:=\{x\in\mathcal{X}:c\pi(x)^{-s}\leq v\}$ , where $c$ is defined in (8.3). We specify the value of $v$ below, separately for assumptions e1) and e2).

Note that if $x\in C_{v}$ , then $V_{\tilde{\pi}_{\gamma}}\left((x,i)\right)\leq v$ for each $i\in\mathcal{I}$ and each $\gamma\in\mathcal{Y}$ . Observe also that $C_{v}$ is a compact set. Let $\nu_{\gamma}$ be the uniform distribution on $C_{v}\times\mathcal{I}$ (and 0 everywhere else) i.e. for $A\subseteq C_{v}$ we have $\nu_{\gamma}(A\times\{i\})=\frac{1}{N}\frac{\mu^{\text{Leb}}(A)}{\mu^{\text{Leb}}(C_{v})}$ . To prove the claim, it is enough to show that

[TABLE]

for $\hat{B}$ of the form $B\times\{k\}$ , for any $B\subseteq C_{v}$ and any $k\in\mathcal{I}$ .

Firstly, note that for any $i,k\in\mathcal{I}$

[TABLE]

where the last inequality is satisfied by assumption c) and equation (8.10).

We will first focus on verifying the minorisation condition under assumption e1). Recall that $D$ is a compact set in $\mathcal{X}$ such that for each $\gamma\in\mathcal{Y}$ and $i\in\mathcal{I}$ we have $JR_{\gamma,i}\subseteq D$ . Recall also that by the construction of the jumping regions there exists $r_{1}$ such that for each $\gamma\in\mathcal{Y}$ and $i\in\mathcal{I}$ the ball $B(\mu_{i},r_{1})\subseteq JR_{\gamma,i}$ . Let us now pick $v$ so large that $D\subseteq C_{v}$ and $v>2n_{0}b/(1-\lambda^{n_{0}})$ for $n_{0}=3$ .

The minorisation condition will be proved for $n_{0}=3$ . Indeed, three steps of the algorithm are enough to get from a point $(x,i)$ to a set $B\times\{k\}$ (a local step within mode $i$ to reach its jumping region, a jump to mode $k$ and a local move within mode $k$ to set $B$ ).

Fix $(x,i)\in C_{v}\times\mathcal{I}$ and a set $\hat{B}=B\times\{k\}$ for $B\subseteq C_{v}$ (we allow for the case $k=i$ ). Note that since $JR_{\gamma,i}\subset C_{v}$ for all $i\in\mathcal{I}$ and $\gamma\in\mathcal{Y}$ we have

[TABLE]

By equations (4.5) and (8.18) we get that $p_{1,i}$ defined above is strictly positive for $i\in\mathcal{I}$ . Considering the probability of accepting a deterministic jump from mode $i$ to mode $k$ , we obtain

[TABLE]

It follows from equations (8.18), (4.3) and (4.2) that $p_{2,ik}>0$ for $i,k\in\mathcal{I}$ . Analogous arguments show that

[TABLE]

Combining (8.19), (8.20) and (8.21) yields

[TABLE]

Setting $\delta:=(1-\epsilon)^{2}\epsilon\epsilon_{a}\min_{i,k\in\mathcal{I}}p_{1,i}p_{2,ik}p_{3,k}N\mu^{\text{Leb}}(C_{v})$ ends the proof.

We will now verify the minorisation condition under assumption e2). Let $v$ be so large that $B(\mu_{i},r)\subseteq C_{v}$ (see assumption (4.9)) for $i\in\mathcal{I}$ and $v>2n_{0}b/(1-\lambda^{n_{0}})$ , for $n_{0}=3$ . We will prove that indeed the minorisation condition holds for $n_{0}=3$ . Note that if we want to move from $(x,i)$ to a set $B\times\{k\}$ , it is enough to make a local step to $B(\mu_{i},r)$ , and then a jump to $B(\mu_{k},r)$ followed by a local step to $B$ .

As before, fix $(x,i)\in C_{v}\times\mathcal{I}$ and a set $\hat{B}=B\times\{k\}$ for $B\subseteq C_{v}$ . Again we include the case $k=i$ . Analogous calculations to (8.19) show that

[TABLE]

For the jump kernel involved we obtain

[TABLE]

Note that $p_{5,ik}$ is positive by equations (8.18) and (4.3), and assumption e2). Finally, similar calculations to (8.21) yield

[TABLE]

for $p_{3,k}$ defined in the previous part of the proof. We now combine (8.22), (8.23) and (8.24) to get

[TABLE]

Setting $\delta:=(1-\epsilon)^{2}\epsilon\epsilon_{a}\min_{i,k\in\mathcal{I}}p_{4,i}p_{5,ik}p_{3,k}N\mu^{\text{Leb}}(C_{v})$ ends the proof. ∎

Proof of Theorem 4.3

Proof.

This theorem is a direct corollary of Theorem 3.4. The assumptions of Theorem 3.4 were verified in the proofs of Theorems 4.1 and 4.2, under the uniform and the non-uniform scenario, respectively. ∎

Proof of Lemma 4.5

Proof.

Fix $i\in\mathcal{I}$ and let $\epsilon_{L}$ be such that for $|x|$ larger than some $R_{0}$

[TABLE]

(such $\epsilon_{L}$ can be found due to assumption (4.12)). Hence, for $K$ sufficiently large

[TABLE]

which implies that for any $|x|>R_{0}$

[TABLE]

and consequently

[TABLE]

By assumption a) of Theorem 4.2 $\tilde{\epsilon}_{L}$ is indeed positive.

Let the acceptance region $a_{\gamma,i}(x)$ be given by (8.13). We will show that for $|x|$ sufficiently large and for each $\gamma\in\mathcal{Y}$

[TABLE]

which will prove the claim. We shall now repeat similar arguments to those used in the proof of Theorem 4.2, in the part for the local kernels. Firstly, we use formula (8.8) to conclude that for $|x|$ larger than some $R_{1}$ (which may depend on $K$ ) and for sufficiently small $\delta$ (which may depend on $K$ , $R_{1}$ and $\tilde{\epsilon}_{L}$ ), we have

[TABLE]

We put (8.25) together with (8.26) to obtain

[TABLE]

for $|x|>\max\left[R_{0},R_{1}\right]$ . Now recall that for each $\delta$ there exists $R_{2}$ such that for $|x|>R_{2}$ if $y\in\left(a(x)\cap B(x,K)\right)\setminus C_{\pi(x)}(\delta)$ then $\frac{\pi(y)}{\pi(x)}\geq\tilde{K}$ for $\tilde{K}$ defined in (8.10). Therefore in particular $y\in a_{\gamma,i}(x)$ for each $\gamma\in\mathcal{Y}$ . Finally, for $|x|>\max\left[R_{0},R_{1},R_{2}\right]$ we have

[TABLE]

which ends the proof. ∎

Proof of Lemma 4.6

Proof.

Fix any $\gamma\in\mathcal{Y}$ and $i\in\mathcal{I}$ . To prove the required result, we will use analogous arguments to those from the proof of Theorem 4.3 of [29]. Let $\epsilon>0$ and $R$ be such that for $|x|>R$

[TABLE]

Fix $K>0$ and define the cone $W(x)$ as

[TABLE]

We now refer to the proof of Theorem 4.3 of [29] to see that for $x$ sufficiently large $W(x)\subset a(x)$ . What is more,

[TABLE]

Hence, since $i$ was chosen arbitrarily, assumption (4.12) is satisfied for $\gamma^{*}:=\gamma$ .

We would like to point out here that originally Theorem 4.3 of [29] was proved under a stronger assumption, that is, $R_{\gamma,L,i}(x,y)=R_{\gamma,L,i}(|x-y|)$ . However, careful inspection of this proof shows that it is enough to assume that $R_{\gamma,L,i}(x,y)=R_{\gamma,L,i}(y,x)$ , which is satisfied in our case as $R_{\gamma,L,i}$ follows an elliptical distribution. ∎

Proof of Corollary 4.7

Proof.

We will again refer multiple times to [29]. Firstly, by Theorem 4.4 of [29], if $\pi_{1}$ and $\pi_{2}$ are super-exponential and satisfy (4.13), then $a_{1}\pi_{1}+a_{2}\pi_{2}$ is also super-exponential and satisfies (4.13) for positive $a_{1}$ and $a_{2}$. By Theorem 4.6 of the same paper, each density of the form $\pi(x)\propto\exp\left(-p(x)\right)$ is super-exponential and satisfies (4.13) if $p$ is a polynomial of order $\geq 2$. Therefore, the assumptions of Lemma 4.6 hold, as required. ∎

9 Other algorithms in the Auxiliary Variable Adaptive MCMC class

As mentioned in Section 3, an instance of an algorithm in the Auxiliary Variable Adaptive MCMC class is adaptive parallel tempering introduced by [34]. Indeed, let us consider $\Phi:=\mathcal{X}^{N}$ and $\tilde{X}:=\mathcal{X}\times\Phi=\mathcal{X}\times\mathcal{X}^{N}$ and

[TABLE]

Then for any $B\in\mathcal{B}(\mathcal{X})$ we have

[TABLE]

where the last equality follows since $\beta_{N,\gamma}=1$ for all $\gamma$ . Additionally, the transition kernels used in adaptive parallel tempering $\{\tilde{P}_{\gamma}\}_{\gamma\in\mathcal{Y}}$ are defined in such a way that detailed balance holds.

Another example of a group of algorithms in the Auxiliary Variable Adaptive MCMC class is an adaptive version of pseudomarginal algorithms. Recall that pseudomarginal algorithms are a powerful tool used in situations when the target density $\pi(x)$ on $\mathcal{X}$ cannot be evaluated pointwise or this evaluation would be very expensive, but an unbiased estimator of $\pi(x)$ is available. In the simplest setting an importance sampling estimator is used for this purpose. Then the pseudomarginal algorithm is equivalent to the Metropolis-Hastings algorithm targeting a distribution $\tilde{\pi}^{N}(x,z)$ on an augmented state space $\mathcal{X}\times\mathcal{Z}$, where $Z\in\mathcal{Z}$ is a vector representing $N$ samples on which the importance sampling estimator is based. A remarkable property of the pseudomarginal algorithms is that $\pi(x)$ is the marginal distribution of $\tilde{\pi}^{N}(x,z)$ regardless of $N$ (see [4] and [6]). The number of samples $N$ and, in more complex settings, the amount of correlation between those samples (see [17]), may follow an adaptive scheme. Therefore, the two conditions defining the class, namely the ergodicity of the kernels (\eqref{ker_erg}) and the requirement that $\pi$ be the marginal of the augmented target (\eqref{marginal_ok}), are satisfied for $\Phi=\mathcal{Z}^{\mathbb{N}}\times\mathbb{N}$, where $N\in\mathbb{N}$ corresponds to the number of samples used for estimation.
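To make the pseudomarginal construction concrete, here is a minimal, non-adaptive sketch with a fixed number of samples $N$. The toy model (a standard normal prior on $x$, a latent $z\sim N(x,1)$ and an observation $y\sim N(z,1)$), the function names and all tuning values are assumptions chosen for illustration; they do not come from the paper. The essential feature is that the estimate attached to the current state is stored and reused rather than recomputed.

```python
import numpy as np

def normal_pdf(x, mean, var):
    """Density of N(mean, var) evaluated at x (vectorised)."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def estimate_target(x, y, n_samples, rng):
    """Unbiased estimate of prior(x) * p(y | x) in the hypothetical toy model.

    p(y | x) = integral of N(y; z, 1) N(z; x, 1) dz is estimated by averaging
    N(y; z_j, 1) over draws z_j ~ N(x, 1), i.e. importance sampling with the
    latent prior as the proposal.
    """
    z = rng.normal(loc=x, scale=1.0, size=n_samples)
    likelihood_hat = normal_pdf(y, z, 1.0).mean()
    return normal_pdf(x, 0.0, 1.0) * likelihood_hat

def pseudo_marginal_mh(y, n_iter, n_samples, step, rng):
    """Random-walk Metropolis-Hastings driven by noisy target estimates."""
    x = 0.0
    pi_hat = estimate_target(x, y, n_samples, rng)
    chain = np.empty(n_iter)
    for t in range(n_iter):
        x_prop = x + step * rng.normal()
        pi_hat_prop = estimate_target(x_prop, y, n_samples, rng)
        # Accept with probability min(1, pi_hat_prop / pi_hat); the stored
        # estimate pi_hat for the current state is NOT refreshed.
        if rng.uniform() * pi_hat < pi_hat_prop:
            x, pi_hat = x_prop, pi_hat_prop
        chain[t] = x
    return chain

rng = np.random.default_rng(0)
chain = pseudo_marginal_mh(y=1.5, n_iter=8000, n_samples=10, step=1.5, rng=rng)
```

In this toy model the exact posterior of $x$ is Gaussian with mean $y/3$, so the chain average can be compared against $0.5$ for $y=1.5$. An adaptive variant would additionally let `n_samples` change between iterations according to some rule, which is the setting the paragraph above refers to.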

Supplementary Material B

In Sections 10 and 11 we present details of the implementation of our method. An additional simulation example and settings of our numerical experiments are shown in Section 12.

10 Updating $w_{\gamma,i}$ and $a_{\gamma,ik}$

Recall that $N$ denotes the number of modes. The weights $w_{\gamma,i}$ are set to $1/N$ at the beginning of the algorithm and they are adapted while the algorithm runs in such a way that they represent the proportion of samples observed so far in each mode. At the same time we do not allow any of the weights to get below some pre-specified value $\epsilon_{w}$ ; otherwise the target distribution $\tilde{\pi}_{\gamma}$ could run the risk of being severely distorted by weights very close or equal to 0. In particular we use the update scheme described below.

Let $n_{i,\text{obs}}$ be the number of samples in mode $i$ for $i=1,\ldots,N$ observed after $n$ iterations of the main algorithm. Then $n=\sum_{i=1}^{N}n_{i,\text{obs}}$ . Define

[TABLE]

It is easily checked that if there are no observations in mode $i$ , i.e. $n_{i,\text{obs}}$ is equal to 0, then $w_{i}=\epsilon_{w}$ . Since $\epsilon_{w}$ must satisfy $N\epsilon_{w}<1$ and the number of modes $N$ is typically unknown in advance, in our implementation the user provides $\tilde{\epsilon}_{w}$ and the algorithm sets $\epsilon_{w}:=\tilde{\epsilon}_{w}/N$ .
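The properties listed above (weights summing to one, a floor of $\epsilon_{w}$, and $\epsilon_{w}=\tilde{\epsilon}_{w}/N$) can be illustrated with a short sketch. The specific formula used below, $w_{i}=\epsilon_{w}+(1-N\epsilon_{w})\,n_{i,\text{obs}}/n$, is an assumption: it is one rule that has these properties, and it is not claimed to be identical to the authors' definition. The function name `update_weights` is likewise hypothetical.

```python
import numpy as np

def update_weights(n_obs, eps_tilde):
    """Floored mode weights from per-mode sample counts.

    ASSUMED rule (illustration only): w_i = eps_w + (1 - N * eps_w) * n_i / n,
    with eps_w = eps_tilde / N, so that N * eps_w = eps_tilde < 1.
    """
    if not 0.0 < eps_tilde < 1.0:
        raise ValueError("eps_tilde must lie in (0, 1)")
    n_obs = np.asarray(n_obs, dtype=float)
    num_modes = n_obs.size
    eps_w = eps_tilde / num_modes
    n = n_obs.sum()
    if n == 0:
        # No samples yet: fall back to the initial uniform weights 1/N.
        return np.full(num_modes, 1.0 / num_modes)
    return eps_w + (1.0 - num_modes * eps_w) * n_obs / n

weights = update_weights([90, 10, 0], eps_tilde=0.03)
```

With these counts the third mode has no observations, so under the assumed rule its weight equals the floor $\epsilon_{w}=0.03/3=0.01$, while the remaining mass is shared in proportion to the counts.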

Even though the theory we present in Section 4 holds when $a_{\gamma,ik}$ follow some adaptive rule, we propose to keep these values fixed throughout the run of the algorithm, with a default choice $a_{\gamma,ii}=0$ and $a_{\gamma,ik}=1/(N-1)$ for $\gamma\in\mathcal{Y}$, $i,k\in\{1,\ldots,N\}$ and $i\neq k$. If $N$ is not very large, the benefit of adapting $a_{\gamma,ik}$ is rather marginal, while it may add to finite-sample instability. A natural alternative improving acceptance rates would be to keep $a_{\gamma,ii}=0$ and $a_{\gamma,ik}=w_{\gamma,k}/\sum_{j\neq i}w_{\gamma,j}$. However, consider a scenario in which a mode with a significant weight in the target distribution is particularly difficult to jump into (for example, because the covariance matrix estimation has not been run for long enough). The jumps to this mode will very likely get rejected many times before we observe the first sample in this mode and start adapting its covariance matrix. In such a case, proposing modes proportionally to the number of samples collected so far in those modes would effectively make moves to this "difficult" mode even less frequent. Consequently, we could face the risk of underestimating the weight of this mode for a fixed computational budget. Hence, we adopt the more conservative approach of keeping these values fixed to avoid the risk described above.

Note also that in our implementation we use $a_{\gamma,ii}=0$ even though formally we assumed in Section 4 that $a_{\gamma,ii}>\epsilon_{a}$. This is because in practice we do not want to propose jumps to the same mode. In the case of deterministic jumps this would mean proposing a move to the same state (recall equation (2.5)), which would have a negative impact on the mixing of the algorithm.
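The two choices of $a_{\gamma,ik}$ discussed above can be written down directly. The sketch below builds both matrices; the function names are hypothetical and the example weights are made up for illustration.

```python
import numpy as np

def uniform_mode_proposal(num_modes):
    """Default choice: a_ii = 0 and a_ik = 1 / (N - 1) for i != k."""
    if num_modes < 2:
        raise ValueError("at least two modes are needed to propose a jump")
    a = np.full((num_modes, num_modes), 1.0 / (num_modes - 1))
    np.fill_diagonal(a, 0.0)
    return a

def weight_proportional_mode_proposal(w):
    """Alternative: a_ii = 0 and a_ik = w_k / sum_{j != i} w_j."""
    w = np.asarray(w, dtype=float)
    a = np.tile(w, (w.size, 1))      # row i initially holds (w_1, ..., w_N)
    np.fill_diagonal(a, 0.0)         # forbid proposing the current mode
    return a / a.sum(axis=1, keepdims=True)

a_default = uniform_mode_proposal(3)
a_weighted = weight_proportional_mode_proposal([0.98, 0.01, 0.01])
```

With the made-up weights above, a chain sitting in mode 2 would propose mode 3 with probability $0.01/0.99\approx 0.01$ under the weight-proportional rule, against $0.5$ under the default. This is the effect described in the text: a mode that has been visited rarely is also proposed rarely.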

11 Burn-in algorithm

For the mode-finding part of our implementation we use the BFGS method from the optimx package in R [39]. We only pass to the next stage of the burn-in algorithm (mode merging) those vectors for which the first- and second-order Karush-Kuhn-Tucker (KKT) optimality conditions are satisfied. Checking these conditions is necessary in order to avoid including points that are not local minima of $-\log(\pi)$ (but, for example, saddle points) in the list of modes. Besides, we recommend that users code up their own function for calculating the gradient and the Hessian whenever possible, or use packages that compute those values with high numerical precision. This will help ensure numerical stability of the optimisation runs. What is more, working with variables with bounded support tends to be problematic: the optimisation algorithm will typically struggle in the neighbourhood of the boundary. In such cases it is usually beneficial to work with transformed variables, defined on the whole space (see Section 12.1).

Recall that the initial value of the matrix corresponding to mode $i$ at the beginning of round 1 of the covariance matrix estimation is the inverse of the Hessian evaluated at $\mu_{i}$ (see line 17 of Algorithm 3). The heuristic behind this idea is that for a Gaussian distribution the inverse of the Hessian of $-\log(\pi)$ is exactly the covariance matrix, so intuitively for a large class of target distributions this will be a good starting value.
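The Gaussian case behind this heuristic is easy to check numerically. The sketch below uses a made-up two-dimensional Gaussian and a generic central finite-difference Hessian (an illustrative stand-in, not the routine used in the authors' R implementation) and recovers the covariance matrix as the inverse Hessian of $-\log(\pi)$ at the mode.

```python
import numpy as np

def numerical_hessian(f, x, h=1e-4):
    """Central finite-difference approximation of the Hessian of f at x."""
    x = np.asarray(x, dtype=float)
    d = x.size
    hess = np.empty((d, d))
    for j in range(d):
        for k in range(d):
            e_j = np.zeros(d)
            e_j[j] = h
            e_k = np.zeros(d)
            e_k[k] = h
            hess[j, k] = (f(x + e_j + e_k) - f(x + e_j - e_k)
                          - f(x - e_j + e_k) + f(x - e_j - e_k)) / (4.0 * h * h)
    return hess

# Hypothetical Gaussian "mode": mean mu and covariance sigma.
mu = np.array([1.0, -2.0])
sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
sigma_inv = np.linalg.inv(sigma)

def neg_log_pi(x):
    """-log density of N(mu, sigma), up to an additive constant."""
    diff = x - mu
    return 0.5 * diff @ sigma_inv @ diff

# Initial covariance estimate: inverse Hessian of -log(pi) at the mode.
sigma_init = np.linalg.inv(numerical_hessian(neg_log_pi, mu))
```

For non-Gaussian modes the inverse Hessian only captures the local curvature at $\mu_{i}$, which is why it serves as a starting value that the subsequent estimation rounds refine.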

As mentioned in Section 2.2, we propose a semi-automatic way of choosing the number of rounds of the covariance matrix estimation, denoted by $K$ . Recall that $\Sigma_{k,i}$ is the matrix corresponding to mode $i$ updated during round $k$ . The choice of $K$ is based on monitoring the following quantity, called inhomogeneity factor (see [43] and [47]), given by

[TABLE]

where $d$ is the dimension of the state space of $\pi$ and $\lambda_{j}$ for $j=1,\ldots,d$ are the eigenvalues of $\Sigma_{k-1,i}^{-1}\Sigma_{k,i}$. Note that this factor is always a real number even though $\Sigma_{k-1,i}^{-1}\Sigma_{k,i}$ does not need to be symmetric. If $\lambda$ is a complex eigenvalue of $\Sigma_{k-1,i}^{-1}\Sigma_{k,i}$, its conjugate $\bar{\lambda}$ is also an eigenvalue of $\Sigma_{k-1,i}^{-1}\Sigma_{k,i}$, and so the imaginary components cancel both in the numerator and the denominator of the expression defining $b_{k,i}$. Moreover, by Jensen’s inequality $b_{k,i}$ satisfies $b_{k,i}\geq 1$, and $b_{k,i}=1$ if and only if $\Sigma_{k-1,i}$ and $\Sigma_{k,i}$ are proportional to each other. In particular, the value of $b_{k,i}$ is always equal to 1 in the scaling phase.

The procedure we propose is the following: perform $AC_{1}$ scaling steps for each matrix (perhaps split into several rounds). Then perform at least one covariance-based round for each mode. Continue running covariance-matrix rounds until the inhomogeneity factor drops below a certain threshold $b_{\text{acc}}$ for all matrices. In other words, having performed $AC_{1}$ scaling steps and at least one covariance-based round, we set $K$ to the smallest value of $k$ satisfying $\max_{i\in\{1,\ldots,N\}}b_{k,i}\leq b_{\text{acc}}.$ In the version in which the modes can be added when the main algorithm runs, one could consider stopping the burn-in algorithm separately for each mode and passing the covariance matrix to the main algorithm once its corresponding inhomogeneity factor goes below $b_{\text{acc}}$ .
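The stopping rule can be sketched as follows, taking the inhomogeneity factors as given. The function name, the nested-list layout of the input and the numerical values are hypothetical. The search starts at the first covariance-based round because $b_{k,i}=1$ throughout the scaling phase, so scaling rounds would otherwise satisfy the threshold trivially.

```python
def choose_num_rounds(b, b_acc, first_cov_round):
    """Smallest round K (1-based) with max_i b[K-1][i] <= b_acc.

    b[k-1][i] is the inhomogeneity factor of mode i after round k.
    Rounds before first_cov_round are scaling rounds and are skipped.
    Returns None if the threshold has not been reached yet.
    """
    for k in range(first_cov_round, len(b) + 1):
        if max(b[k - 1]) <= b_acc:
            return k
    return None

# Made-up factors for two modes: two scaling rounds, then covariance rounds.
b_history = [
    [1.0, 1.0],    # round 1 (scaling)
    [1.0, 1.0],    # round 2 (scaling)
    [1.8, 1.3],    # round 3 (first covariance-based round)
    [1.2, 1.05],   # round 4
    [1.08, 1.02],  # round 5
]
num_rounds = choose_num_rounds(b_history, b_acc=1.1, first_cov_round=3)
```

In this made-up history the rule stops at round 5: round 4 is rejected because one mode still has a factor of 1.2, even though the other is already below the threshold.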

As for the choice of the lengths of the rounds, by default we use a geometric sequence with a common ratio 2. The number $AC_{1}$ should grow with the dimension of the state space $d$ since the initial covariance matrix will be based on $AC_{1}$ samples for each mode. In our experiments $AC_{1}$ is equal to $\max(1000,d^{2}/2)$ .
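As a small worked example of these defaults, the sketch below evaluates $AC_{1}=\max(1000,d^{2}/2)$ for a few dimensions and generates round lengths with common ratio 2. The length of the first round is not stated in this section, so `first_length` is a hypothetical parameter, as are the function names.

```python
def scaling_budget(d):
    """AC_1 = max(1000, d^2 / 2): number of scaling steps per matrix."""
    return max(1000, d * d / 2)

def round_lengths(first_length, num_rounds, ratio=2):
    """Geometric sequence of round lengths with the given common ratio."""
    return [first_length * ratio ** k for k in range(num_rounds)]

budgets = {d: scaling_budget(d) for d in (10, 20, 80, 200)}
lengths = round_lengths(first_length=500, num_rounds=4)
```

The quadratic term only takes over once $d^{2}/2>1000$, that is, for $d\geq 45$; below that dimension the budget is the constant 1000.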

Note that this construction implies that adapting the matrices by scaling will happen only during the burn-in algorithm: the number of samples in each mode at the beginning of the main algorithm will be equal to the total number of iterations of the burn-in algorithm, so in particular this number will exceed $AC_{1}$.

The adaptation scheme of the main algorithm is based on updating the covariance matrices passed from the burn-in algorithm.

12 Examples – further details

Below we present one more example, a hierarchical Bayesian model for cancer data. We also discuss some further details related to the simulations described in Section 5. The exact parameter settings of our experiments are summarised in Table 3. For all examples shown in this paper we used an implementation of the algorithm in which the burn-in algorithm runs only before the main algorithm (without adding modes on the fly).

12.1 Hierarchical Bayesian model for LOH data

The example presented here is based on the Seattle Barrett’s Oesophagus study (see [11]), analysed later by [55], [15] and [9]. Loss of Heterozygosity (LOH) is the process by which a region of the genome is deleted on either the paternally or the maternally inherited chromosome, leading to a loss of diversity. LOH rates were collected from oesophageal cancers for 40 regions, each on a distinct chromosome arm. They are of interest since chromosome regions with high rates of LOH are thought to contain so-called Tumour Suppressor Genes (TSGs), whose functionality is adversely affected by the reduction in genetic diversity. There also exists a proportion of "background" (not cancer-related) LOH. The aim of this study is to provide, for each LOH rate, the probability of being in the TSG group and in the "background" group. Following the approach adopted in the above papers, we consider the following mixture model:

[TABLE]

where $x_{i}$ is the number of events of interest (Loss of Heterozygosity) observed in region $i$ , and $N_{i}$ – the corresponding sample size. Besides, $\eta$ denotes the probability of a location being a member of the binomial group, $\pi_{1}$ is the probability of LOH in the binomial group, $\pi_{2}$ is the probability of LOH in the beta-binomial group, and $\gamma$ controls the variability of the beta-binomial group. That is, the likelihood function for this model is given by $\prod_{i=1}^{40}f(x_{i},N_{i}|\eta,\pi_{1},\pi_{2},\gamma)$ for

[TABLE]

where B denotes the beta function and $\omega_{2}:=\frac{\exp(\gamma)}{2\left(1+\exp(\gamma)\right)}$ . The following prior distributions were used for the parameters of interest:

[TABLE]

The resulting target distribution has two non-symmetric and well-separated modes, as depicted in Figure 8, one of which has a significantly bigger weight than the other; below we denote them by mode 1 and 2, respectively. We based our analysis on 200,000 steps of the main algorithm and 500 BFGS runs for the mode-finding stage. The length of the covariance matrix estimation in the burn-in algorithm was equal to 3000 iterations for each experiment (chosen automatically). Table 4 summarises the acceptance rates of jumps between the modes for the three versions of the implementation of the algorithm.

As stated above, the prior distribution for all the variables has its support on a compact set. Since this typically has an adverse effect on both mode-finding and sampling, we decided to work with transformed variables, which live on the real line. For the first three variables we applied the logit transformation, i.e. we transformed them using a function $t_{1}(x)=\log(x)-\log(1-x)$ . For the last variable we used the transformation given by $t_{2}(x)=\log(30+x)-\log(30-x)$ .
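The two transformations and their inverses can be sketched as below. The function names are hypothetical; the domains $(0,1)$ and $(-30,30)$ are those implied by the formulas for $t_{1}$ and $t_{2}$. The inverse of $t_{2}$ follows from $e^{y}=(30+x)/(30-x)$, which gives $x=30\tanh(y/2)$.

```python
import numpy as np

def t1(x):
    """Logit transform, mapping (0, 1) to the real line."""
    return np.log(x) - np.log1p(-x)

def t1_inv(y):
    """Inverse logit (logistic function), mapping the real line to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-y))

def t2(x):
    """Scaled logit, mapping (-30, 30) to the real line."""
    return np.log(30.0 + x) - np.log(30.0 - x)

def t2_inv(y):
    """Inverse of t2: x = 30 * (e^y - 1) / (e^y + 1) = 30 * tanh(y / 2)."""
    return 30.0 * np.tanh(y / 2.0)

probabilities = np.array([0.05, 0.5, 0.9])
bounded_values = np.array([-29.0, 0.0, 12.5])
roundtrip_1 = t1_inv(t1(probabilities))
roundtrip_2 = t2_inv(t2(bounded_values))
```

When the sampler is run on the transformed variables, the density on the real line must include the Jacobian of the inverse transformation; samples are then mapped back with `t1_inv` and `t2_inv`.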

The starting points for the optimisation runs were sampled from the prior distribution and transformed in the way described above. The number of function and gradient evaluations for the $20\times 500$ BFGS runs varied between 27 and 428, with an average of 73.

12.2 Mixture of Gaussians

The starting points for the optimisation runs were sampled uniformly on $[-2,2]^{d}$. In Table 5 we gathered information about the joint number of evaluations of the target density and its gradient in the BFGS runs. We reported the minimum, the mean and the maximum value required for the optimisation algorithm to converge. The last two columns show the minimum and the maximum number of iterations used for the estimation of the covariance matrices in the burn-in algorithm. These figures show that indeed the computational budget used by our method for dimensions $d=10$ and $d=20$ was significantly smaller than the budget of APT (see Section 5.1).

Figures 9 and 10 illustrate good performance of our method in dimensions $d=130$ and $d=160$ , especially for the deterministic jumps. Interestingly, the Gaussian proposal for jumps gives results of the poorest quality on this example.

12.3 Mixture of $t$ and banana-shaped distributions

The starting points for the optimisation runs were sampled uniformly on $[-2,12]^{d}$ . Table 6 presents analogous information to Table 5, for the mixture of $t$ and banana-shaped distributions considered in Section 5.2. For dimensions $d=50$ and $d=80$ we did not run the mode-finding part, assuming the locations of the modes were known. Overall, our method in all its versions proved to perform well on this high-dimensional example, despite the complicated shapes of the modes. Table 7 shows that, as before, the deterministic jump method ensures best between-mode mixing. However, it can be noticed that given 20 runs of the experiment, a few times this method delivered results that deviated significantly from the truth (see Figure 11).

12.4 Sensor network localisation

Starting points for the BFGS procedures were sampled uniformly on $[0,1]^{16}$ . The number of function and gradient evaluations for these runs varied between 175 and 876, with an average of 400. The starting points for the APT simulations were the 14 modes identified by the BFGS optimiser and 6 points sampled uniformly on $[0,1]^{16}$ .

Recall that the results for APT presented in Section 5.3 were based on 4 temperatures. In Figure 12 we present analogous results to those shown in Figure 7, with the number of temperatures increased to 5 (and the same number of iterations equal to 700,000). Under such settings the APT algorithm mixes better between the modes; however, as illustrated in Figure 12, it still yields less stability than our method. Note that APT required $4\times 700,000=2,800,000$ or $5\times 700,000=3,500,000$ target evaluations, for $4$ and $5$ temperature levels, respectively. Assuming an implementation of JAMS on a standard desktop computer with 8 cores, the computational cost measured by the number of target and gradient evaluations per core would be at most:

• for mode finding: $10,000/8\times 875$ (as for each BFGS run we had at most 875 such evaluations);

• for the burn-in algorithm: $2\times 15,000$ target evaluations (as the estimation of 14 covariance matrices needed to be split across 8 cores);

• for the main algorithm: $500,000$ target evaluations.

Altogether this would give $1,623,750$ evaluations, plus the additional overhead resulting from the communication between cores. This figure shows that APT required a larger computational cost than JAMS in our setup even though the above analysis was carried out under a pessimistic scenario. Firstly, on average there were 400 evaluations per BFGS procedure, and plugging this value into the above calculations would decrease the overall number of evaluations to $10,000/8\times 400+2\times 15,000+500,000=1,030,000$. What is more, typically a user would run our algorithm on a server or a cloud service, which we in fact did as well. This would allow the computational cost (in particular, that of the BFGS runs) to be split across a much larger number of cores.
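The per-core budget above is simple arithmetic and can be reproduced with a few lines. The function name and argument names are hypothetical; the figures plugged in are those quoted in the list above.

```python
import math

def jams_cost_per_core(bfgs_runs, cores, evals_per_run, burn_in_evals, main_evals):
    """Target/gradient evaluations per core: mode finding + burn-in + main run."""
    runs_per_core = math.ceil(bfgs_runs / cores)
    return runs_per_core * evals_per_run + burn_in_evals + main_evals

# Pessimistic case: every BFGS run needs the maximum of 875 evaluations.
pessimistic = jams_cost_per_core(10_000, 8, 875, 2 * 15_000, 500_000)

# Average case: 400 evaluations per BFGS run.
average = jams_cost_per_core(10_000, 8, 400, 2 * 15_000, 500_000)

# APT with 4 and 5 temperature levels, 700,000 iterations each.
apt_costs = [levels * 700_000 for levels in (4, 5)]
```

Both JAMS figures stay below the cost of APT with 4 temperature levels; the mode-finding term dominates the pessimistic estimate, which is why spreading the BFGS runs over more cores reduces the per-core cost the most.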

Bibliography

[1] Ahn, S., Chen, Y., and Welling, M. (2013). Distributed and adaptive darting Monte Carlo through regenerations. In Artificial Intelligence and Statistics, pages 108–116.
[2] Andricioaei, I., Straub, J. E., and Voter, A. F. (2001). Smart darting Monte Carlo. The Journal of Chemical Physics, 114(16):6994–7000.
[3] Andrieu, C. and Moulines, E. (2006). On the ergodicity properties of some adaptive MCMC algorithms. Ann. Appl. Probab., 16(3):1462–1505.
[4] Andrieu, C. and Roberts, G. O. (2009). The pseudo-marginal approach for efficient Monte Carlo computations. The Annals of Statistics, 37(2):697–725.
[5] Andrieu, C. and Thoms, J. (2008). A tutorial on adaptive MCMC. Stat. Comput., 18(4):343–373.
[6] Andrieu, C. and Vihola, M. (2015). Convergence properties of pseudo-marginal Markov chain Monte Carlo algorithms. The Annals of Applied Probability, 25(2):1030–1077.
[7] Atchadé, Y. and Fort, G. (2010). Limit theorems for some adaptive MCMC algorithms with subgeometric kernels. Bernoulli, 16(1):116–154.
[8] Atchadé, Y. F., Roberts, G. O., and Rosenthal, J. S. (2011). Towards optimal scaling of Metropolis-coupled Markov chain Monte Carlo. Statistics and Computing, 21(4):555–568.
