Parallel-tempered Stochastic Gradient Hamiltonian Monte Carlo for Approximate Multimodal Posterior Sampling

Rui Luo, Qiang Zhang, and Yuanyuan Liu

arXiv:1812.01181·stat.ML·December 10, 2018

TL;DR

This paper introduces a novel sampler combining parallel tempering with Nosé-Hoover dynamics to efficiently sample from complex, multimodal posterior distributions in large-scale Bayesian learning tasks.

Contribution

The paper presents a new stochastic gradient Hamiltonian Monte Carlo method that integrates parallel tempering and Nosé-Hoover dynamics for improved multimodal posterior sampling.

Findings

1. Effectively samples from complex multimodal distributions.
2. Handles noisy stochastic gradients in large datasets.
3. Facilitates deep Bayesian learning with complex posteriors.

Abstract

We propose a new sampler that integrates the protocol of parallel tempering with the Nosé-Hoover (NH) dynamics. The proposed method can efficiently draw representative samples from complex posterior distributions with multiple isolated modes in the presence of noise arising from stochastic gradient. It potentially facilitates deep Bayesian learning on large datasets where complex multimodal posteriors and mini-batch gradient are encountered.

Equations

$$\pi(\theta|\mathscr{D})=\pi(\theta)\prod_{x\in\mathscr{D}}\ell(\theta;x).$$

$$U(\theta)=-\log\pi(\theta|\mathscr{D})=-\log\pi(\theta)-\sum_{x\in\mathscr{D}}\log\ell(\theta;x)-\mathrm{const}.$$

$$\nabla\tilde{U}(\theta)$$

$$\frac{\mathrm{d}\theta_{j}}{\mathrm{d}t}=M^{-1}p_{j},\qquad\frac{\mathrm{d}p_{j}}{\mathrm{d}t}=-\frac{\nabla\tilde{U}(\theta_{j})}{T_{j}}-\xi_{j}p_{j},\qquad\frac{\mathrm{d}\xi_{j}}{\mathrm{d}t}=\frac{p_{j}^{\top}M^{-1}p_{j}-D}{Q},$$

$$\pi_{j}(\theta_{j})\propto e^{-U(\theta_{j})/T_{j}}.$$

$$\pi_{j}(\theta_{j})\pi_{k}(\theta_{k})\,\alpha[(j,k)\to(k,j)]=\pi_{j}(\theta_{k})\pi_{k}(\theta_{j})\,\alpha[(k,j)\to(j,k)],$$

$$\alpha[(j,k)\to(k,j)]=\frac{\pi_{j}(\theta_{k})\pi_{k}(\theta_{j})}{\pi_{j}(\theta_{j})\pi_{k}(\theta_{k})+\pi_{j}(\theta_{k})\pi_{k}(\theta_{j})}=\frac{1}{1+e^{-\delta E}},$$

$$p_{\mathscr{C}}=\frac{1}{2\pi}\int_{-\infty}^{\infty}\frac{\phi_{\mathscr{L}}(t)}{\phi_{\mathscr{N}_{\sigma^{2}}}(t)}e^{-ixt}\,\mathrm{d}t,\quad\text{since }\phi_{\mathscr{C}}=\phi_{\mathscr{L}}/\phi_{\mathscr{N}_{\sigma^{2}}},$$

$$\hat{p}_{\mathscr{C}}=\frac{1}{2\pi}\int_{-\infty}^{\infty}\psi\cdot\frac{\phi_{\mathscr{L}}}{\phi_{\mathscr{N}_{\sigma^{2}}}}e^{-itx}\,\mathrm{d}t=\frac{1}{2\pi}\int_{-\infty}^{\infty}\bigg[\frac{\psi}{\phi_{\mathscr{N}_{\sigma^{2}}}}\bigg]\phi_{\mathscr{L}}e^{-ixt}\,\mathrm{d}t.$$

$$\frac{\psi}{\phi_{\mathscr{N}_{\sigma^{2}}}}=e^{-\gamma^{2}t^{4}+\sigma^{2}t^{2}/2}=\sum_{k=0}^{\infty}\frac{\gamma^{k}}{k!}H_{k}\big(\sigma^{2}/4\gamma\big)t^{2k}.$$

$$\hat{p}_{\mathscr{C}}=\sum_{k=0}^{\infty}\frac{(-1)^{k}}{k!}H_{k}\big(\sigma^{2}/4\gamma\big)\gamma^{k}\bigg[\frac{1}{2\pi}\int_{-\infty}^{\infty}(-it)^{2k}\phi_{\mathscr{L}}e^{-itx}\,\mathrm{d}t\bigg]=\sum_{k=0}^{\infty}\frac{(-1)^{k}}{k!}H_{k}\big(\sigma^{2}/4\gamma\big)\gamma^{k}p_{\mathscr{L}}^{(2k)},$$

Taxonomy

Topics: Gaussian Processes and Bayesian Inference · Markov Chains and Monte Carlo Methods · Bayesian Methods and Mixture Models

Full text

AABI 2018: 1st Symposium on Advances in Approximate Bayesian Inference, 2018

Rui Luo, Qiang Zhang, Yuanyuan Liu

American International Group, Inc.

1 Introduction

In Bayesian inference, one of the fundamental problems is to efficiently draw i.i.d. samples from the posterior distribution $\pi(\theta|\mathscr{D})$ given the dataset $\mathscr{D}=\{x\}$, where $\theta\in\mathbb{R}^{D}$ denotes the variable of interest. Given the prior distribution $\pi(\theta)$ and the likelihood per datum $\ell(\theta;x)$, the posterior to be sampled can be formulated as

$$\pi(\theta|\mathscr{D})=\pi(\theta)\prod_{x\in\mathscr{D}}\ell(\theta;x). \tag{1}$$

To facilitate posterior sampling, the framework of Markov chain Monte Carlo (MCMC) has been established, which has initiated a broad family of methods that generate Markov chains to propose new sample candidates and then apply tests of acceptance in order to guarantee the condition of detailed balance. Methods like the Metropolis-Hastings (MH) algorithm (Metropolis et al., 1953; Hastings, 1970), the Gibbs sampler (Geman and Geman, 1984), and the hybrid/Hamiltonian Monte Carlo (HMC) (Duane et al., 1987; Neal, 2011) are famous representatives for the MCMC family where different generating procedures of Markov chains are adopted; each of those methods has achieved great success on various tasks in statistics and related fields.

Among MCMC methods, HMC, in particular, has attracted attention due to its exploitation of gradient information. In a typical HMC setting (Neal, 2011), the target posterior distribution $\pi(\theta|\mathscr{D})$ is embedded into a virtual physical system fixed at the standard temperature $T=1$ with the potential energy defined in the form of

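$$U(\theta)=-\log\pi(\theta|\mathscr{D})=-\log\pi(\theta)-\sum_{x\in\mathscr{D}}\log\ell(\theta;x)-\mathrm{const}. \tag{2}$$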

The variable of interest $\theta$ is interpreted as the position of the system in the phase space; an auxiliary variable $p\in\mathbb{R}^{D}$ is then introduced as the conjugate momentum corresponding to the kinetic energy $p^{\top}M^{-1}p/2$. By defining the total energy, i.e. the Hamiltonian, as the sum of the potential and kinetic energy, the Hamiltonian dynamics that governs the physical system can be derived from Hamilton’s formalism. From the perspective of sampling, new sample candidates are proposed by simulating the Hamiltonian dynamics, where the gradient of the potential $\nabla U(\theta)$ is utilized.
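The proposal mechanism described above can be illustrated with a minimal leapfrog sketch. The text does not state which integrator is used; leapfrog is the standard choice for HMC (Neal, 2011). The function names, the scalar inverse mass `M_inv`, the step size and the standard Gaussian target are assumptions made for illustration only.

```python
import numpy as np

def leapfrog(theta, p, grad_U, step, n_steps, M_inv=1.0):
    """Simulate Hamiltonian dynamics with the leapfrog integrator."""
    theta = np.array(theta, dtype=float)
    p = np.array(p, dtype=float)
    p = p - 0.5 * step * grad_U(theta)        # initial half step for the momentum
    for i in range(n_steps):
        theta = theta + step * M_inv * p      # full step for the position
        if i < n_steps - 1:
            p = p - step * grad_U(theta)      # full step for the momentum
    p = p - 0.5 * step * grad_U(theta)        # final half step for the momentum
    return theta, p

def hamiltonian(theta, p, U, M_inv=1.0):
    """Total energy: potential U(theta) plus kinetic p^T M^{-1} p / 2."""
    return U(theta) + 0.5 * np.sum(p * M_inv * p)

# Illustrative target: standard Gaussian, U(theta) = |theta|^2 / 2.
U = lambda th: 0.5 * np.sum(th ** 2)
grad_U = lambda th: th

theta0, p0 = np.array([1.0, -0.5]), np.array([0.3, 0.8])
theta1, p1 = leapfrog(theta0, p0, grad_U, step=0.05, n_steps=40)
energy_drift = abs(hamiltonian(theta1, p1, U) - hamiltonian(theta0, p0, U))
```

With an exact gradient the energy drift stays small, which is what keeps the acceptance rate of HMC proposals high; a noisy mini-batch gradient breaks this property.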

Despite its numerous advantages over alternatives within the MCMC family, HMC still suffers from two major issues: 1. gradient noise arising from mini-batches may cause the dynamics to deviate severely from the desired orbit; 2. isolated modes may be sampled incorrectly or even left undiscovered. Unfortunately, when one trains deep neural networks on large datasets, both problems arise simultaneously: deep neural networks lead to complex posterior distributions over the parameters, which may contain a number of isolated modes, and efficient training on large datasets requires mini-batching, so the gradient is quite noisy because it is evaluated on a small fraction of the dataset.

It has long been known that the tempering mechanism helps a system cross high energy barriers and hence improves ergodicity (Marinari and Parisi, 1992; Earl and Deem, 2005). Recently, research on incorporating tempering into MCMC methods has provided a practical approach towards efficient multimodal posterior sampling (Graham and Storkey, 2017; Luo et al., 2018). In the meantime, advances in thermostatting techniques for molecular dynamics (Jones and Leimkuhler, 2011) have shed some light on adaptive control for noisy dynamics. In this paper, we propose a novel method that addresses the two issues mentioned above for HMC; it combines the protocol of parallel tempering (Swendsen and Wang, 1986; Sugita and Okamoto, 1999) with the dynamics of the Nosé-Hoover (NH) thermostat (Nosé, 1984; Hoover, 1985). Simulations show the advantages of our method in both accuracy and efficiency over the classic HMC (Neal, 2011) and one of its stochastic variants, the Stochastic Gradient Nosé-Hoover Thermostat (SGNHT) (Ding et al., 2014).

2 Parallel-tempered Stochastic Gradient Hamiltonian Monte Carlo

The proposed method consists of two alternating subroutines: 1. the parallel dynamics simulation of system replicas, and 2. the configuration exchange between replicas. The first subroutine utilizes the Nosé-Hoover thermostat to adaptively detect and neutralize the noise within mini-batch gradient; the second incorporates a mini-batch acceptance test to ensure the detailed balance during exchanges.

2.1 Parallel Dynamics Simulation of System Replicas

We define an increasing ladder $\{T_{j}\}_{j=1}^{R}$ of temperatures with $R$ rungs; the temperature ranges from the standard $T_{1}=1$ to some higher temperature. On each rung $j$, a replica $(\theta_{j},p_{j})$ of the physical system is initialized and the actual potential energy for that replica is rescaled to $U(\theta_{j})/T_{j}$.

As the datum $x$ within each mini-batch $\mathscr{S}$ is independently selected at random, the mini-batch gradient can be approximated by a Gaussian variable due to the Central Limit Theorem (CLT):

[TABLE]
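The CLT argument can be checked numerically. The sketch below assumes the common unbiased estimator in which the likelihood term over the mini-batch is rescaled by $|\mathscr{D}|/|\mathscr{S}|$; the Gaussian likelihood, the prior variance, the batch size and all function names are illustrative assumptions rather than details taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=10_000)   # synthetic dataset D
prior_var = 100.0                                     # assumed N(0, 100) prior

def grad_U(theta):
    """Full-data gradient of U for a unit-variance Gaussian likelihood."""
    return theta / prior_var + np.sum(theta - data)

def grad_U_minibatch(theta, batch_size, rng):
    """Mini-batch estimate; the likelihood term is rescaled by |D| / |S|."""
    batch = rng.choice(data, size=batch_size, replace=True)  # independent draws
    return theta / prior_var + (data.size / batch_size) * np.sum(theta - batch)

theta = 1.5
full = grad_U(theta)
draws = np.array([grad_U_minibatch(theta, 100, rng) for _ in range(2000)])

# The estimates scatter around the full gradient with roughly Gaussian noise.
rel_bias = abs(draws.mean() - full) / abs(full)
noise_std = draws.std()
```

Here the noise standard deviation is about a fifth of the gradient magnitude, which illustrates why uncorrected HMC trajectories can deviate strongly under mini-batching.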

To retain the correct trajectory in simulating the system dynamics, we leverage the NH thermostat because of its capability of adaptive control of the gradient noise (Jones and Leimkuhler, 2011; Ding et al., 2014). According to the formulation of Hoover (1985), for each replica $(\theta_{j},p_{j})$, we augment the system with an NH thermostat variable $\xi_{j}\in\mathbb{R}$ and modify the dynamics as:

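$$\frac{\mathrm{d}\theta_{j}}{\mathrm{d}t}=M^{-1}p_{j},\qquad\frac{\mathrm{d}p_{j}}{\mathrm{d}t}=-\frac{\nabla\tilde{U}(\theta_{j})}{T_{j}}-\xi_{j}p_{j},\qquad\frac{\mathrm{d}\xi_{j}}{\mathrm{d}t}=\frac{p_{j}^{\top}M^{-1}p_{j}-D}{Q}, \tag{4}$$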

where $M$ denotes the mass and $Q$ the thermal inertia. It can be proved via the Fokker-Planck equation (Risken and Haken, 1989) that the dynamics in Eq. (4) lead to a stationary distribution w.r.t. $\theta_{j}$:

$$\pi_{j}(\theta_{j})\propto e^{-U(\theta_{j})/T_{j}}. \tag{5}$$

This guarantees that, during the simulation, one can readily recover the desired distribution at a certain temperature $T_{j}$ by simply retaining the position $\theta_{j}$ and discarding the momentum $p_{j}$ as well as the thermostat $\xi_{j}$. Note that for the replica on rung $1$, the temperature is fixed at standard $T_{1}=1$ and the position $\theta_{1}=\theta$ is distributed as the target posterior $\pi_{1}(\theta_{1})=e^{-U(\theta_{1})/T_{1}}=e^{-U(\theta)}=\pi(\theta|\mathscr{D})$.
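A minimal sketch of how one replica could be advanced under the thermostatted dynamics of Eq. (4). The text does not specify a discretisation; a simple Euler-type update is assumed here, and the step size `h`, the scalar inverse mass `M_inv`, the quadratic potential and the noise level are illustrative assumptions.

```python
import numpy as np

def nh_step(theta, p, xi, noisy_grad, T, h=0.01, Q=1.0, M_inv=1.0):
    """One Euler-type step of the Nose-Hoover dynamics for a replica at temperature T."""
    D = theta.size
    p = p + h * (-noisy_grad(theta) / T - xi * p)       # tempered force plus friction
    theta = theta + h * M_inv * p                       # position update
    xi = xi + h * (np.sum(p * M_inv * p) - D) / Q       # thermostat tracks kinetic energy
    return theta, p, xi

# Illustrative run: U(theta) = |theta|^2 / 2 with additive Gaussian gradient noise.
rng = np.random.default_rng(1)
noisy_grad = lambda th: th + rng.normal(scale=0.5, size=th.shape)

theta, p, xi = np.zeros(2), rng.normal(size=2), 0.0
samples = []
for _ in range(5000):
    theta, p, xi = nh_step(theta, p, xi, noisy_grad, T=2.0)
    samples.append(theta.copy())       # keep the position, discard p and xi
samples = np.asarray(samples)
```

Only the positions are stored, mirroring the remark that the momentum and the thermostat variable are discarded when recovering the distribution at temperature $T_{j}$.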

2.2 Configuration Exchange between Replicas

The principles of statistical physics suggest that high temperature helps physical systems cross energy barriers, which means replicas at higher temperatures are more likely to traverse between different modes of the distribution. As a consequence, however, the distribution sampled at high temperature is more spread out and hence biased. To recover an unbiased distribution, we perform configuration exchanges between replicas at higher temperatures and the one at the standard temperature.

Consider the configuration exchange between the replicas on rungs $j$ and $k$; as this is a non-physical process, the exchange has to satisfy the condition of detailed balance:

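$$\pi_{j}(\theta_{j})\pi_{k}(\theta_{k})\,\alpha[(j,k)\to(k,j)]=\pi_{j}(\theta_{k})\pi_{k}(\theta_{j})\,\alpha[(k,j)\to(j,k)], \tag{6}$$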

where the transition probability reads

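$$\alpha[(j,k)\to(k,j)]=\frac{\pi_{j}(\theta_{k})\pi_{k}(\theta_{j})}{\pi_{j}(\theta_{j})\pi_{k}(\theta_{k})+\pi_{j}(\theta_{k})\pi_{k}(\theta_{j})}=\frac{1}{1+e^{-\delta E}}, \tag{7}$$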

and $\delta E=\big[U(\theta_{j})-U(\theta_{k})\big]\big[(T_{k}-T_{j})/T_{j}T_{k}\big]$ is the logarithm of the density ratio $\pi_{j}(\theta_{k})\pi_{k}(\theta_{j})/\pi_{j}(\theta_{j})\pi_{k}(\theta_{k})$. It is straightforward to verify that Eq. (6) holds. Note that the transition probability $\alpha[(j,k)\to(k,j)]$ takes the form of the logistic function; this logistic test of acceptance was developed by Barker (1965).
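The detailed-balance condition can be verified numerically. The sketch below evaluates the exchange probability directly from the tempered densities $\pi_{j}(\theta)\propto e^{-U(\theta)/T_{j}}$; the function names and the energy and temperature values are hypothetical and chosen only for illustration.

```python
import numpy as np

def log_tempered(U_val, T):
    """Unnormalised log-density of a replica: -U(theta) / T."""
    return -U_val / T

def swap_prob(U_j, U_k, T_j, T_k):
    """Barker-type probability of exchanging the configurations on rungs j and k."""
    log_swapped = log_tempered(U_k, T_j) + log_tempered(U_j, T_k)
    log_current = log_tempered(U_j, T_j) + log_tempered(U_k, T_k)
    return 1.0 / (1.0 + np.exp(log_current - log_swapped))

# Hypothetical energies of two configurations a, b and two temperatures.
U_a, U_b, T_j, T_k = 0.7, 2.3, 1.0, 4.0

# Forward move: rung j holds a, rung k holds b, then they are exchanged.
forward = np.exp(log_tempered(U_a, T_j) + log_tempered(U_b, T_k)) * swap_prob(U_a, U_b, T_j, T_k)
# Reverse move: rung j holds b, rung k holds a, then they are exchanged back.
reverse = np.exp(log_tempered(U_b, T_j) + log_tempered(U_a, T_k)) * swap_prob(U_b, U_a, T_j, T_k)
```

The two probability flows `forward` and `reverse` coincide, and moving the higher-energy configuration onto the colder rung is accepted with probability below one half.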

With mini-batching, the potential energy $\tilde{U}(\theta_{j})$ becomes a random variable (r.v.), and so does the difference $\tilde{U}(\theta_{k})-\tilde{U}(\theta_{j})$. By the CLT, $\delta E$ is asymptotically Gaussian with some variance $\sigma^{2}$. Seita et al. (2017) proposed a mini-batch version of Barker’s logistic test of acceptance in which $\delta E+\mathscr{C}>0$ must hold for the exchange to be carried out, where $\mathscr{C}$ denotes an auxiliary correction r.v. that bridges the gap between the logistic distribution and the Gaussian. The probability density $p_{\mathscr{C}}$ of this correction variable satisfies the convolution equation $p_{\mathscr{C}}*p_{\mathscr{N}_{\sigma^{2}}}=p_{\mathscr{L}}$; finding it is equivalent to solving the Gaussian deconvolution problem w.r.t. the standard logistic distribution.
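The reason the Gaussian noise and the correction must add up to a logistic variable can be seen in the noise-free case: accepting whenever $\delta E$ plus a standard logistic r.v. is positive reproduces the logistic acceptance probability $1/(1+e^{-\delta E})$. The sketch below checks this by simulation; the value of `delta_E` and the sample size are arbitrary illustrative choices, and sampling the correction variable $\mathscr{C}$ itself is not attempted.

```python
import numpy as np

rng = np.random.default_rng(0)

def barker_accept(delta_E, rng):
    """Accept iff delta_E + L > 0, with L a standard logistic random variable."""
    return delta_E + rng.logistic(loc=0.0, scale=1.0) > 0

delta_E = 0.8
n = 200_000

# Empirical acceptance rate of the additive-noise form of the test.
accept_rate = np.mean(delta_E + rng.logistic(size=n) > 0)
# Logistic acceptance probability it should match.
target = 1.0 / (1.0 + np.exp(-delta_E))
```

In the mini-batch setting the estimate of $\delta E$ already carries approximately Gaussian noise, so only the remainder $\mathscr{C}$ needs to be added for the total perturbation to be logistic.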

By the convolution theorem for distributions, the Gaussian deconvolution can be converted into an inverse Fourier transform of the quotient of characteristic functions

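$$p_{\mathscr{C}}=\frac{1}{2\pi}\int_{-\infty}^{\infty}\frac{\phi_{\mathscr{L}}(t)}{\phi_{\mathscr{N}_{\sigma^{2}}}(t)}e^{-ixt}\,\mathrm{d}t,\quad\text{since }\phi_{\mathscr{C}}=\phi_{\mathscr{L}}/\phi_{\mathscr{N}_{\sigma^{2}}}, \tag{8}$$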

where $\phi_{\mathscr{N}_{\sigma^{2}}}$ and $\phi_{\mathscr{L}}$ denote the characteristic functions of $\mathscr{N}(0,\sigma^{2})$ and the standard logistic r.v., respectively. As the logistic distribution has much heavier tails than the Gaussian, the exact solution of $p_{\mathscr{C}}$ does not exist: the “integrand” on the RHS of Eq. (8) is in fact not integrable. We can only approximate $p_{\mathscr{C}}$ by introducing the kernel $\psi=e^{-\gamma^{2}t^{4}}$ of bandwidth $1/\gamma$ (see Fan, 1991) in Eq. (8):

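$$\hat{p}_{\mathscr{C}}=\frac{1}{2\pi}\int_{-\infty}^{\infty}\psi\cdot\frac{\phi_{\mathscr{L}}}{\phi_{\mathscr{N}_{\sigma^{2}}}}e^{-itx}\,\mathrm{d}t=\frac{1}{2\pi}\int_{-\infty}^{\infty}\bigg[\frac{\psi}{\phi_{\mathscr{N}_{\sigma^{2}}}}\bigg]\phi_{\mathscr{L}}e^{-ixt}\,\mathrm{d}t. \tag{9}$$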

Using the Hermite polynomials $H_{k}$ (Abramowitz and Stegun, 1965), we now expand the quotient within the brackets of Eq. (9) as

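$$\frac{\psi}{\phi_{\mathscr{N}_{\sigma^{2}}}}=e^{-\gamma^{2}t^{4}+\sigma^{2}t^{2}/2}=\sum_{k=0}^{\infty}\frac{\gamma^{k}}{k!}H_{k}\big(\sigma^{2}/4\gamma\big)t^{2k}. \tag{10}$$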
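The Hermite expansion of the bracketed quotient follows from the generating function $e^{2xs-s^{2}}=\sum_{k}H_{k}(x)s^{k}/k!$ with $s=\gamma t^{2}$ and $x=\sigma^{2}/4\gamma$. A small numerical check, using NumPy's physicists' Hermite polynomials; the parameter values and the truncation at 30 terms are arbitrary choices for illustration.

```python
import math
import numpy as np
from numpy.polynomial.hermite import hermval

def kernel_quotient_series(t, sigma2, gamma, n_terms=30):
    """Truncated series sum_k gamma^k / k! * H_k(sigma^2 / (4 gamma)) * t^(2k)."""
    x = sigma2 / (4.0 * gamma)
    total = 0.0
    for k in range(n_terms):
        coeffs = np.zeros(k + 1)
        coeffs[k] = 1.0                      # selects the single polynomial H_k
        H_k = hermval(x, coeffs)             # physicists' Hermite polynomial H_k(x)
        total += gamma ** k / math.factorial(k) * H_k * t ** (2 * k)
    return total

sigma2, gamma, t = 0.25, 0.1, 1.2
closed_form = math.exp(-gamma ** 2 * t ** 4 + sigma2 * t ** 2 / 2.0)
series = kernel_quotient_series(t, sigma2, gamma)
```

For small $\gamma t^{2}$ the series converges after a handful of terms, which suggests that a short truncation may suffice in practice.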

The correction distribution can be approximated via Fourier’s differential theorem:

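$$\hat{p}_{\mathscr{C}}=\sum_{k=0}^{\infty}\frac{(-1)^{k}}{k!}H_{k}\big(\sigma^{2}/4\gamma\big)\gamma^{k}\bigg[\frac{1}{2\pi}\int_{-\infty}^{\infty}(-it)^{2k}\phi_{\mathscr{L}}e^{-itx}\,\mathrm{d}t\bigg]=\sum_{k=0}^{\infty}\frac{(-1)^{k}}{k!}H_{k}\big(\sigma^{2}/4\gamma\big)\gamma^{k}p_{\mathscr{L}}^{(2k)}, \tag{11}$$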

where $p_{\mathscr{L}}^{(j)}$ represents the $(j+1)$-th derivative of the logistic function, which can be efficiently calculated in a recursive fashion (Minai and Williams, 1993).
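One way to obtain these derivatives recursively is to write each derivative of the logistic function $s(x)=1/(1+e^{-x})$ as a polynomial in $s$ and apply the chain rule with $s'=s(1-s)$. The sketch below follows that idea; it is an illustration in the spirit of the cited recursion, not a reproduction of the scheme of Minai and Williams (1993), and the function names are hypothetical.

```python
import numpy as np
from numpy.polynomial import Polynomial

def logistic_derivative_polys(n_max):
    """Polynomials P_n such that the n-th derivative of s(x) equals P_n(s(x))."""
    s = Polynomial([0.0, 1.0])
    ds = s * (1 - s)                             # s' = s (1 - s)
    polys = [s]
    for _ in range(n_max):
        polys.append(polys[-1].deriv() * ds)     # chain rule: d/dx P(s) = P'(s) s'
    return polys

def logistic_density_derivative(x, j, polys):
    """j-th derivative of the logistic density, i.e. the (j+1)-th derivative of s."""
    s = 1.0 / (1.0 + np.exp(-x))
    return polys[j + 1](s)

polys = logistic_derivative_polys(5)
density_at_zero = logistic_density_derivative(0.0, 0, polys)      # equals 1/4
curvature_at_zero = logistic_density_derivative(0.0, 2, polys)    # equals -1/8
```

Only the even-order derivatives $p_{\mathscr{L}}^{(2k)}$ enter the series for $\hat{p}_{\mathscr{C}}$, so the polynomials can be built once and reused for every evaluation point.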

3 Experiment

We conduct two sets of experiments on synthetic distributions: the first is a mixture of $4$ Gaussians in $1d$, and the second is a $2d$ Gaussian mixture with $5$ isolated modes. The potential energy as well as its gradient is perturbed by zero-mean Gaussian noise with variance $\sigma^{2}=0.25$, and none of the samplers under test has access to the actual parameters of that noise. We establish a temperature ladder with $R=10$ rungs ranging from $T_{1}=1$ to $T_{R}=10$, i.e. $10$ replicas in total are simulated in parallel. The baselines are the classic HMC (Neal, 2011) and the adaptive variant SGNHT (Ding et al., 2014). Fig. 1 and 2 demonstrate that, in both synthetic test cases, our method accurately samples the target distributions with multiple isolated modes in the presence of mini-batch gradient noise, whereas both baselines fail: SGNHT manages to control the gradient noise but does not discover the isolated modes, while the classic HMC appears unable to draw samples correctly because of the deviated dynamics. Moreover, the subplot on the left of Fig. 1 illustrates the sampling trajectory of our method, indicating a good mixing property.
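A rough sketch of how the $1d$ test setup could be laid out. The mixture means, standard deviations and weights are not given in the text and are assumed here, as is the linear spacing of the temperature ladder; only the noise variance $0.25$, the number of rungs and the temperature range come from the description above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed mixture parameters (4 well-separated modes); not taken from the text.
means = np.array([-6.0, -2.0, 2.0, 6.0])
stds = np.full(4, 0.5)
weights = np.full(4, 0.25)

def _components(theta):
    """Weighted component densities of the mixture at theta."""
    z = (theta - means) / stds
    return weights * np.exp(-0.5 * z ** 2) / (stds * np.sqrt(2.0 * np.pi))

def U(theta):
    """Potential energy: negative log-density of the 1-D Gaussian mixture."""
    return -np.log(np.sum(_components(theta)))

def grad_U(theta):
    """Exact gradient of U, written via the component responsibilities."""
    comp = _components(theta)
    resp = comp / np.sum(comp)
    return np.sum(resp * (theta - means) / stds ** 2)

noise_std = np.sqrt(0.25)                                   # variance 0.25 as in the text
noisy_U = lambda th: U(th) + rng.normal(scale=noise_std)
noisy_grad_U = lambda th: grad_U(th) + rng.normal(scale=noise_std)

# R = 10 rungs from T_1 = 1 to T_R = 10; the linear spacing is an assumption.
temperatures = np.linspace(1.0, 10.0, 10)
```

Each sampler would then see only `noisy_U` and `noisy_grad_U`, which matches the requirement that the noise parameters remain hidden from the methods under test.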

Bibliography

1. Abramowitz and Stegun (1965). Milton Abramowitz and Irene A. Stegun. Handbook of mathematical functions: with formulas, graphs, and mathematical tables, volume 55. Courier Corporation, 1965.
2. Barker (1965). A. A. Barker. Monte Carlo calculations of the radial distribution functions for a proton-electron plasma. Australian Journal of Physics, 18(2):119–134, 1965.
3. Ding et al. (2014). Nan Ding, Youhan Fang, Ryan Babbush, Changyou Chen, Robert D. Skeel, and Hartmut Neven. Bayesian sampling using stochastic gradient thermostats. In Advances in Neural Information Processing Systems, pages 3203–3211, 2014.
4. Duane et al. (1987). Simon Duane, Anthony D. Kennedy, Brian J. Pendleton, and Duncan Roweth. Hybrid Monte Carlo. Physics Letters B, 195(2):216–222, 1987.
5. Earl and Deem (2005). David J. Earl and Michael W. Deem. Parallel tempering: Theory, applications, and new perspectives. Physical Chemistry Chemical Physics, 7(23):3910–3916, 2005.
6. Fan (1991). Jianqing Fan. On the optimal rates of convergence for nonparametric deconvolution problems. The Annals of Statistics, pages 1257–1272, 1991.
7. Geman and Geman (1984). Stuart Geman and Donald Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, (6):721–741, 1984.
8. Graham and Storkey (2017). Matthew M. Graham and Amos J. Storkey. Continuously tempered Hamiltonian Monte Carlo. In Proceedings of the Thirty-Third Conference on Uncertainty in Artificial Intelligence, UAI 2017, Sydney, Australia, August 11-15, 2017, 2017.