Ensemble Quasi-Newton HMC

Xiao-Yong Jin; James C. Osborn

arXiv:1904.10039·hep-lat·April 24, 2019


TL;DR

This paper introduces an enhanced Hybrid Monte Carlo algorithm that incorporates an approximate inverse Hessian to improve sampling efficiency in lattice gauge theories, demonstrated on 2D U(1) models.

Contribution

It proposes a novel method to exchange information within Markov chain ensembles and integrates a quasi-Newton inspired Hessian into HMC to mitigate critical slowing down.

Findings

1. Improved sampling efficiency in 2D U(1) gauge theory
2. Effective exchange of information within Markov chain ensembles
3. Potential for application to more complex gauge theories

Abstract

We present a modification of the Hybrid Monte Carlo algorithm for tackling the critical slowing down of generating Markov chains of lattice gauge configurations towards the continuum limit. We propose a new method to exchange information within an ensemble of Markov chains, and use it to construct an approximate inverse Hessian matrix of the action inspired from quasi-Newton algorithms for optimization. The kinetic term of the molecular dynamics evolution includes the approximate Hessian for long distance couplings among the momenta. We show the result of applying the new algorithm to the $U(1)$ gauge theory in two dimensions, and discuss our future plans.

Equations

$$ H(x,p) = S(x) + \frac{1}{2} p^{\dagger} G^{-1} p, \quad \dot{x} = G^{-1} p, \quad \dot{p} = -\nabla S, \tag{1} $$

$$ G^{-1}=\mathcal{G}^{-1}\big(\{\mathcal{X}_{i}\}_{i=0}^{\mathcal{N}-1}\big). \tag{2} $$

$$ G^{-1}=\mathcal{F}\big(\{\mathcal{X}_{i}\}_{i=0}^{\mathcal{N}-1}\big)=\mathcal{G}^{-1}\left(\operatorname{sorted}\big(\{\mathcal{X}_{i}\}_{i=0}^{\mathcal{N}-1}\big)\right), \tag{3} $$

$$ s_{k} = \log \mathcal{U}_{k} \mathcal{U}_{k-1}^{-1} = \log \mathcal{U}_{k} \mathcal{U}_{k-1}^{\dagger}, \quad y_{k} = \nabla S(\mathcal{U}_{k}) - \nabla S(\mathcal{U}_{k-1}), \quad k = 1 \dots L \leq \mathbb{N}-2, \tag{4} $$

$$ \mathcal{G}_{k}^{-1} = (I - \rho_{k} s_{k} y_{k}^{\dagger})\, \mathcal{G}_{k-1}^{-1}\, (I - \rho_{k} y_{k} s_{k}^{\dagger}) + \rho_{k} s_{k} s_{k}^{\dagger}, \quad \mathcal{G}_{0}^{-1} = 1/(2\beta), \tag{5} $$

$$ \mathcal{G}_{k} = \mathcal{G}_{k-1} + \frac{y_{k} y_{k}^{\dagger}}{y_{k}^{\dagger} s_{k}} - \frac{\mathcal{G}_{k-1} s_{k} s_{k}^{\dagger} \mathcal{G}_{k-1}}{s_{k}^{\dagger} \mathcal{G}_{k-1} s_{k}}. \tag{6} $$

$$ \mathcal{G}_{k} = \mathcal{G}_{k-1} + \frac{y_{k} y_{k}^{\dagger}}{y_{k}^{\dagger} s_{k}} - \left(1 - \lambda \frac{s_{k}^{\dagger} s_{k}}{s_{k}^{\dagger} \mathcal{G}_{k-1} s_{k}}\right) \frac{\mathcal{G}_{k-1} s_{k} s_{k}^{\dagger} \mathcal{G}_{k-1}}{s_{k}^{\dagger} \mathcal{G}_{k-1} s_{k}}. \tag{8} $$

$$ \begin{cases} 0 \leq x_{0} < N_{t}-1 & \text{with } \mu = \hat{0} \text{ for temporal links}, \\ x_{0} = 0;\ 0 \leq x_{1} < N_{s}-1 & \text{with } \mu = \hat{1} \text{ for spatial links}. \end{cases} \tag{9} $$

$$ \begin{cases} U_{x,\hat{0}} \to U_{x,\hat{0}} \Lambda_{0} & \text{for } x_{0} = N_{t}-1, \\ U_{x,\hat{1}} \to U_{x,\hat{1}} \Lambda_{1} & \text{for } x_{1} = N_{s}-1, \end{cases} \tag{10} $$


Taxonomy

Topics: Markov Chains and Monte Carlo Methods · Protein Structure and Dynamics · Stochastic processes and statistical mechanics

Full text

Ensemble Quasi-Newton HMC

Xiao-Yong Jin and James C. Osborn

Computational Science Division

Argonne National Laboratory

9700 S. Cass Ave.

Lemont, IL 60439, USA



1 Introduction

In generating a Markov chain, we aim to speed up Monte Carlo simulations by proposing configurations far from the current configuration in phase space at relatively low cost. Molecular Dynamics (MD) evolution in fictitious time using random momenta naturally extends Langevin-like updates and mitigates their random walk behavior. This Hybrid Monte Carlo (HMC) algorithm [1] works well in high dimensional systems, such as lattice QCD. Approaching the continuum limit of the lattice theory, some physical modes in MD slow down exponentially, leading to research in Fourier acceleration [2, 3, 4] as a possible remedy. The analogous Riemannian manifold HMC [5] reports success for some probability density functions in guiding the MD evolution through the phase space. Recent efforts [6, 7, 8] analyze and apply similar techniques to lattice QCD. We focus on employing, as the acceleration kernel, a numerically cheaper approximation of the Hessian matrix from the limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) algorithm [9, 10], a common quasi-Newton optimization method. This approximation applies not only to the gauge fields but also to the pseudo-fermion fields. HMC, nevertheless, requires changes to adopt such an approximation, which uses information from multiple configurations.

In this paper we present the ensemble quasi-Newton HMC (QNHMC) method, discuss the characteristics of the method on two-dimensional $U(1)$ lattice gauge theory, and show preliminary results of its effect on the autocorrelation of the average plaquette value and topological charge.

2 Markov chain for assisted MD evolution

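The MD evolution at the heart of the HMC algorithm follows from Hamiltonian dynamics,

$$ H(x,p) = S(x) + \frac{1}{2} p^{\dagger} G^{-1} p, \quad \dot{x} = G^{-1} p, \quad \dot{p} = -\nabla S, \tag{1} $$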

where $S$ is the action, $p$ the fictitious momenta, and $G$ a fixed MD mass matrix. A symplectic and reversible discrete integrator advances the state of the Markov chain from $(x,p)$ to $(x^{\prime},p^{\prime})$ over a fictitious time period, $\tau$ , the trajectory length. Using the Hamiltonian as the negative log probability of the enlarged phase space including $x$ and $p$ , the correctness of the HMC demands a positive definite $G$ .
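As a minimal sketch of equation (1) with a fixed $G^{-1}$, the following leapfrog integrator (a simpler scheme than the one used for the results later in the text) evolves a toy two-mode quadratic action. The action, the stiffness values, and all function names are assumptions chosen for illustration. Setting $G^{-1}$ to the inverse Hessian of the toy action gives both modes the same MD frequency, and the final lines check reversibility by flipping the momentum and integrating back.

```python
def matvec(M, v):
    """Dense matrix-vector product for small illustrative systems."""
    return [sum(m * vj for m, vj in zip(row, v)) for row in M]

def hamiltonian(x, p, S, Ginv):
    """H(x, p) = S(x) + p^T Ginv p / 2 for real vectors."""
    return S(x) + 0.5 * sum(pi * vi for pi, vi in zip(p, matvec(Ginv, p)))

def leapfrog(x, p, grad_S, Ginv, dt, n_steps):
    """Integrate dx/dt = Ginv p, dp/dt = -grad S with Ginv held fixed."""
    x, p = list(x), list(p)
    p = [pi - 0.5 * dt * gi for pi, gi in zip(p, grad_S(x))]       # initial half kick
    for step in range(n_steps):
        x = [xi + dt * vi for xi, vi in zip(x, matvec(Ginv, p))]    # drift
        kick = dt if step < n_steps - 1 else 0.5 * dt               # final kick is a half step
        p = [pi - kick * gi for pi, gi in zip(p, grad_S(x))]
    return x, p

# Toy quadratic action with stiffness 1 and 100 (hypothetical example).
k = [1.0, 100.0]
S = lambda x: 0.5 * sum(ki * xi * xi for ki, xi in zip(k, x))
grad_S = lambda x: [ki * xi for ki, xi in zip(k, x)]
Ginv = [[1.0, 0.0], [0.0, 0.01]]    # inverse Hessian of the toy action

x0, p0 = [1.0, 0.1], [0.5, 2.0]
x1, p1 = leapfrog(x0, p0, grad_S, Ginv, dt=0.1, n_steps=10)
dH = hamiltonian(x1, p1, S, Ginv) - hamiltonian(x0, p0, S, Ginv)

# Reversibility: flip the momentum, integrate again, flip back.
xb, pb = leapfrog(x1, [-q for q in p1], grad_S, Ginv, dt=0.1, n_steps=10)
pb = [-q for q in pb]
```

In a full HMC step the momenta would be drawn with covariance $G$ before the integration and `dH` would enter the accept/reject test; both are omitted here.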

The choice of $G$ affects the performance of HMC. For a general action, an MD mass matrix containing the local information of the Riemann curvature can bring considerable speedups [5] in the efficiency of Markov Chain Monte Carlo methods. Reference [5] recommends the Fisher information matrix as $G$ . Fourier acceleration [3, 4] suggests the field Laplacian operator as $G$ . We are interested in using a fixed $G$ during one MD trajectory, for its simplicity and efficiency. Any explicit symplectic reversible integrator for equation (1) would still be applicable. However, this also means that $G$ cannot depend on any configuration along that MD trajectory.

In general we want a proposal of $(x^{\prime},p^{\prime})$ for the next state of the Markov chain from $(x,p)$ following a symplectic and reversible discretization of the MD evolution (1) where $G^{-1}$ comes from our choice of a function, $\mathcal{G}^{-1}$ ,

$$ G^{-1}=\mathcal{G}^{-1}\big(\{\mathcal{X}_{i}\}_{i=0}^{\mathcal{N}-1}\big), \tag{2} $$

whose argument is a list of $\mathcal{N}$ configurations $\mathcal{X}_{i}$ , which are all different from $x$ , $x^{\prime}$ , or any configuration along the discretized path of this particular MD evolution. Assuming a particular choice of $G^{-1}$ could help throughout the Markov chain generation, we can fix $\mathcal{G}^{-1}$ and $\{\mathcal{X}_{i}\}_{i=0}^{\mathcal{N}-1}$ , optimally picked from available configurations, and the HMC procedure remains the same except for the additional mass matrix $G$ .

In this paper, we focus on building a $G^{-1}$ that is fixed during one MD evolution but changes after each trajectory. We use $\{\mathcal{X}_{i}\}_{i=0}^{\mathcal{N}-1}$ from a set of parallel streams of Markov chains. We update one of the streams using information from the others, as suggested in references [11, 12]. There are multiple ways to generate such Markov chains and to obtain $\mathcal{G}$ from the configurations $\mathcal{X}$ of neighboring streams. The following is the base case with provable reversibility. We use an arbitrary information exchange kernel $\mathcal{F}$ to generalize the ensemble assisted Markov chain.

Let $\mathbb{N}$ be the number of coupled parallel streams, each labeled $\mathbb{X}_{j}$ , for $j=0$ to $\mathbb{N}-1$ . Let $\mathcal{F}$ be a function on an unordered set of configurations, $\{\mathcal{X}_{i}\}_{i=0}^{\mathcal{N}-1}$ , with $\mathcal{N}=\mathbb{N}-1$ . Let $\mathcal{U}$ be a symplectic and reversible mapping that generates the next state of one Markov chain, from $\mathbb{X}_{j}$ to $\mathbb{X}^{\prime}_{j}=\mathcal{U}\left(\mathcal{F}\big(\{\mathcal{X}_{i}\}_{i=0}^{\mathcal{N}-1}\big)\right)\mathbb{X}_{j}$ . Given a fixed set of $\{\mathcal{X}_{i}\}_{i=0}^{\mathcal{N}-1}$ , $\mathcal{U}$ depends on the value of $\mathcal{F}$ and satisfies detailed balance, $\pi(\mathbb{X}_{j})P(\mathbb{X}_{j}|\mathbb{X}^{\prime}_{j})=\pi(\mathbb{X}^{\prime}_{j})P(\mathbb{X}^{\prime}_{j}|\mathbb{X}_{j})$ , where $\pi$ is the probability density we want to simulate and $P$ the transition probability. We give the definition of $\{\mathcal{X}_{i}\}_{i=0}^{\mathcal{N}-1}$ within the steps of the ensemble assisted Markov chain described in the following.

1. When updating $\mathbb{X}_{k}$ for each $k$ from $0$ to $\mathbb{N}-1$ :

   (a) Set the list $\{\mathcal{X}_{i}\}_{i=0}^{\mathcal{N}-1}$ from the list $\{\mathbb{X}_{j}\}_{j\neq k}$ , with $\mathcal{N}=\mathbb{N}-1$ .

   (b) Evolve $\mathbb{X}_{k}$ according to $\mathcal{U}\left(\mathcal{F}\big(\{\mathcal{X}_{i}\}_{i=0}^{\mathcal{N}-1}\big)\right)$ .

2. After generating one trajectory for each of the $\mathbb{N}$ parallel streams, set the new $\{\mathbb{X}^{\prime}_{j}\}$ to its reversed sequence, $\mathbb{X}^{\prime}_{j}\leftarrow\mathbb{X}^{\prime}_{\mathbb{N}-1-j}$ . This is for the purpose of reversibility.

Figure 1 illustrates an example of one update with $\mathbb{N}=4$ streams. The current configurations from these four streams are labeled $0$ , $1$ , $2$ , and $3$ . We update those successively, each time with information obtained from the function $\mathcal{F}$ applied to configurations from the other streams. After updating each stream to $0^{\prime}$ , $1^{\prime}$ , $2^{\prime}$ , and $3^{\prime}$ , we reverse their ordering, which completes this one update. When we reverse the Markov chain, we would update $3^{\prime}$ first with the information from $\mathcal{F}(\{2^{\prime},1^{\prime},0^{\prime}\})$ . Because $\mathcal{F}$ does not depend on the ordering of $\mathcal{X}_{i}$ , this is the same information that was used to produce $3^{\prime}$ , so we can reproduce the reversed Markov chain.
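The sweep and the final reversal can be sketched as follows. This is an illustration under stated assumptions, not the actual implementation: configurations are plain floats, `kernel_F` is the mean of the other streams (order-independent by construction), and `evolve_U` is a deterministic reflection that stands in for a reversible MD update. Each stream is updated using the already-updated values of the streams before it, which is one reading of the successive update described for Figure 1.

```python
import math

def kernel_F(others):
    """Order-independent summary of the other streams (here: their mean)."""
    return math.fsum(others) / len(others)

def evolve_U(info, x):
    """Toy reversible update: reflect x about info (applying it twice restores x)."""
    return 2.0 * info - x

def ensemble_update(streams, F=kernel_F, U=evolve_U):
    """One update of all coupled streams: sweep over k, then reverse the ordering."""
    streams = list(streams)
    for k in range(len(streams)):
        # Step (a): every stream except k, including those already updated in this sweep.
        others = [x for j, x in enumerate(streams) if j != k]
        # Step (b): evolve stream k with the information from the others.
        streams[k] = U(F(others), streams[k])
    streams.reverse()   # step 2: reversal of the stream ordering
    return streams

start = [0.3, -1.2, 2.5, 0.7]
once = ensemble_update(start)
twice = ensemble_update(once)   # retraces the first update and returns to the start
```

Because the toy update is its own inverse, applying `ensemble_update` twice returns the initial ensemble; without the `reverse()` call the second sweep would see different neighbor information at each step and would not retrace the first.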

For the purpose of assisting MD evolution, where $\mathcal{U}$ represents the procedure of refreshing $p$ , integrating equation (1), and finishing with a Metropolis-Hastings accept/reject step, we use

$$ G^{-1}=\mathcal{F}\big(\{\mathcal{X}_{i}\}_{i=0}^{\mathcal{N}-1}\big)=\mathcal{G}^{-1}\left(\operatorname{sorted}\big(\{\mathcal{X}_{i}\}_{i=0}^{\mathcal{N}-1}\big)\right), \tag{3} $$

where a special routine $\operatorname{sorted}$ sorts the set of $\mathcal{X}_{i}$ before applying the function $\mathcal{G}^{-1}$ , because our choice of $\mathcal{G}$ in equation (2) depends on the ordering of $\mathcal{X}_{i}$ , and sorting guarantees the same $G^{-1}$ with a reversed procedure. In addition to the simple case presented here, we can build more involved Markov chains by using more states per parallel stream for the exchange kernel $\mathcal{F}$ , or by decoupling some of the parallel streams for parallel evolution.

3 L-BFGS approximated Hessian

Among $\mathbb{N}$ coupled parallel HMC streams, for each stream, we use the latest configurations from the other streams to construct the approximate Hessian with the L-BFGS algorithm. For these $\mathbb{N}-1$ configurations, we compute the site-wise finite differences of lattice gauge fields, $\mathcal{U}_{k}$ ,

$$ s_{k} = \log \mathcal{U}_{k} \mathcal{U}_{k-1}^{-1} = \log \mathcal{U}_{k} \mathcal{U}_{k-1}^{\dagger}, \quad y_{k} = \nabla S(\mathcal{U}_{k}) - \nabla S(\mathcal{U}_{k-1}), \quad k = 1 \dots L \leq \mathbb{N}-2, \tag{4} $$

where $s_{k}$ and $y_{k}$ are fields of the elements of the Lie algebra, $L$ is the length of the L-BFGS memory, and the inequality comes from selecting only those field pairs with $s_{k}^{\dagger}y_{k}=y_{k}^{\dagger}s_{k}>0$ (the inner product implicitly traces over the color indices and sums over the lattice) for the positive definiteness of the approximated Hessian. As required in equation (3) we sort the field pairs according to $s_{k}^{\dagger}y_{k}$ .
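For $U(1)$, a link is $e^{i\theta}$ and the logarithm of $\mathcal{U}_{k}\mathcal{U}_{k-1}^{\dagger}$ reduces to a wrapped angle difference, so the pair construction can be sketched with real vectors. The sketch below makes several assumptions: configurations are taken in the order given, the gradient comes from a toy single-site action $\beta\sum_i(1-\cos\theta_i)$ rather than the plaquette action, and the pairs are sorted in ascending $s^{\dagger}y$ (the text does not state the direction).

```python
import math

def wrap(a):
    """Map an angle to the branch [-pi, pi)."""
    return (a + math.pi) % (2.0 * math.pi) - math.pi

def lbfgs_pairs(configs, grad_S):
    """Build (s, y) pairs from consecutive configurations, keep s.y > 0, sort by s.y."""
    kept = []
    for prev, cur in zip(configs, configs[1:]):
        s = [wrap(c - p) for c, p in zip(cur, prev)]          # log U_k U_{k-1}^dagger for U(1)
        y = [gc - gp for gc, gp in zip(grad_S(cur), grad_S(prev))]
        sy = sum(a * b for a, b in zip(s, y))
        if sy > 0.0:                                          # positivity condition on s.y
            kept.append((sy, s, y))
    kept.sort(key=lambda item: item[0])                       # ascending order: an assumption
    return [(s, y) for _, s, y in kept]

beta = 4.5
grad_toy = lambda theta: [beta * math.sin(t) for t in theta]  # toy action, not the plaquette action

configs = [[0.1, 2.0], [0.3, 2.5], [0.2, 2.4], [0.6, 0.0]]
pairs = lbfgs_pairs(configs, grad_toy)
```

With four configurations there are three candidate pairs; the first has $s^{\dagger}y<0$ for this toy action and is dropped, leaving two.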

The L-BFGS algorithm gives the inverse Hessian as a recursively defined operator,

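$$ \mathcal{G}_{k}^{-1} = (I - \rho_{k} s_{k} y_{k}^{\dagger})\, \mathcal{G}_{k-1}^{-1}\, (I - \rho_{k} y_{k} s_{k}^{\dagger}) + \rho_{k} s_{k} s_{k}^{\dagger}, \quad \mathcal{G}_{0}^{-1} = 1/(2\beta), \tag{5} $$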

where $2\beta$ as an initial value comes from the diagonal term of the Hessian matrix of the 2-D $U(1)$ action in the weak coupling limit. The associated L-BFGS Hessian matrix can be expressed as,
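The recursion for the inverse Hessian can be applied to a vector without ever forming a matrix. The sketch below does this for real vectors and assumes the standard BFGS definition $\rho_{k}=1/(y_{k}^{\dagger}s_{k})$, which is not spelled out in the surrounding text; the sample pairs are made-up numbers with $s^{\dagger}y>0$.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def apply_Ginv(pairs, v, beta):
    """Apply the recursively defined inverse Hessian to v.

    pairs is a list of (s, y) in update order; the empty list gives the
    initial operator 1/(2 beta). rho = 1/(y.s) is the standard BFGS choice.
    """
    if not pairs:
        return [vi / (2.0 * beta) for vi in v]
    s, y = pairs[-1]
    rho = 1.0 / dot(y, s)
    sv = dot(s, v)
    w = [vi - rho * sv * yi for vi, yi in zip(v, y)]     # (I - rho y s^T) v
    u = apply_Ginv(pairs[:-1], w, beta)                  # previous operator applied to w
    yu = dot(y, u)
    # (I - rho s y^T) u + rho s (s^T v)
    return [ui - rho * yu * si + rho * sv * si for ui, si in zip(u, s)]

beta = 4.5
pairs = [([0.2, -0.1, 0.05], [1.0, -0.3, 0.4]),
         ([-0.1, 0.3, 0.2], [-0.5, 1.5, 0.9])]

# Secant condition: the newest pair satisfies Ginv y = s exactly.
s_last, y_last = pairs[-1]
recovered = apply_Ginv(pairs, y_last, beta)
```

Each application costs one pass over the stored pairs. The secant condition holds for the most recent pair only; older pairs are satisfied approximately.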

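$$ \mathcal{G}_{k} = \mathcal{G}_{k-1} + \frac{y_{k} y_{k}^{\dagger}}{y_{k}^{\dagger} s_{k}} - \frac{\mathcal{G}_{k-1} s_{k} s_{k}^{\dagger} \mathcal{G}_{k-1}}{s_{k}^{\dagger} \mathcal{G}_{k-1} s_{k}}. \tag{6} $$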

This rank-2 update has a symmetric product form [13], showing the explicit positive definiteness,

[TABLE]
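For orientation, one textbook way to write the BFGS update in product form is sketched here. The auxiliary vector $a_{k}$ is notation introduced for this sketch, and the normalization may differ from the form used in [13]:

$$ \alpha_{k} = \sqrt{\frac{y_{k}^{\dagger} s_{k}}{s_{k}^{\dagger} \mathcal{G}_{k-1} s_{k}}}, \qquad a_{k} = \frac{y_{k} - \alpha_{k} \mathcal{G}_{k-1} s_{k}}{\alpha_{k}\, s_{k}^{\dagger} \mathcal{G}_{k-1} s_{k}}, \qquad \mathcal{G}_{k} = (I + a_{k} s_{k}^{\dagger})\, \mathcal{G}_{k-1}\, (I + s_{k} a_{k}^{\dagger}). $$

Expanding the product and using $\alpha_{k}^{2}\, s_{k}^{\dagger} \mathcal{G}_{k-1} s_{k} = y_{k}^{\dagger} s_{k}$ , the cross terms cancel and what remains is $\mathcal{G}_{k-1} + y_{k} y_{k}^{\dagger}/(y_{k}^{\dagger} s_{k}) - \mathcal{G}_{k-1} s_{k} s_{k}^{\dagger} \mathcal{G}_{k-1}/(s_{k}^{\dagger} \mathcal{G}_{k-1} s_{k})$ , the rank-2 update. In this form $\mathcal{G}_{k}$ is manifestly positive definite whenever $\mathcal{G}_{k-1}$ is and $y_{k}^{\dagger} s_{k} > 0$ , and a factor $J_{k-1}$ with $\mathcal{G}_{k-1} = J_{k-1} J_{k-1}^{\dagger}$ updates to $J_{k} = (I + a_{k} s_{k}^{\dagger}) J_{k-1}$ .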

In our test with the unmodified L-BFGS algorithm, the low eigenvalues of the approximated Hessian matrix decrease rapidly as the L-BFGS memory length increases, even after removing all the exact zero modes from the theory as described in the next section. While the largest eigenvalues are stable, the condition number increases and the Hessian matrix becomes singular at a modest L-BFGS memory length, because the approximate action surface spanned by the samples we draw for the L-BFGS algorithm has zero modes and may even be concave. Since the MD evolution involves the inverse of the Hessian matrix, the evolution becomes unstable with near zero modes. A straightforward method to regulate the approximated Hessian would be to add a small term to the diagonal of $\mathcal{G}_{k}$ . This nevertheless breaks the rank-2 update iteration and invalidates both the simple inversion formula and the symmetric decomposition for the square root of $\mathcal{G}$ . It would then require a conjugate gradient inversion and a rational approximation of the square root of $\mathcal{G}$ .

Investigating the determinant behavior from the symmetric product form (7) leads us to one solution: adding a small term to one of the rank-1 updates in equation (6),

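$$ \mathcal{G}_{k} = \mathcal{G}_{k-1} + \frac{y_{k} y_{k}^{\dagger}}{y_{k}^{\dagger} s_{k}} - \left(1 - \lambda \frac{s_{k}^{\dagger} s_{k}}{s_{k}^{\dagger} \mathcal{G}_{k-1} s_{k}}\right) \frac{\mathcal{G}_{k-1} s_{k} s_{k}^{\dagger} \mathcal{G}_{k-1}}{s_{k}^{\dagger} \mathcal{G}_{k-1} s_{k}}. \tag{8} $$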

This still invalidates the iteration formula (5). The symmetric product form (7), however, remains applicable with minimal changes. The complexity of iterating the symmetric product form is always linear in lattice volume and, in terms of $L$ , $O(L)$ in space and $O(L^{2})$ in time.
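A quick consistency check of the regulated update, assuming equation (8) as written: contracting both sides with $s_{k}$ and writing $\sigma_{k} = s_{k}^{\dagger} \mathcal{G}_{k-1} s_{k}$ (a shorthand introduced here) gives

$$ s_{k}^{\dagger} \mathcal{G}_{k} s_{k} = \sigma_{k} + y_{k}^{\dagger} s_{k} - \left(1 - \lambda \frac{s_{k}^{\dagger} s_{k}}{\sigma_{k}}\right) \sigma_{k} = y_{k}^{\dagger} s_{k} + \lambda\, s_{k}^{\dagger} s_{k}. $$

For $\lambda = 0$ this is the usual BFGS curvature $y_{k}^{\dagger} s_{k}$ along $s_{k}$ . A positive $\lambda$ raises the curvature along $s_{k}$ by $\lambda\, s_{k}^{\dagger} s_{k}$ , the same amount that adding $\lambda$ to the diagonal would contribute in that direction, while keeping the update a sum of rank-1 terms.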

4 $U(1)$ gauge theory on a 2-D lattice

We use the Wilson plaquette action for the $U(1)$ gauge theory on a two-dimensional lattice with periodic boundary conditions. As the QNHMC algorithm uses an approximated Hessian, we first need to remove all the exact zero modes of the Hessian from the theory, in order to improve the stability of the MD evolution.

The gauge degrees of freedom form the exact zero modes of the Hessian. With periodic boundary conditions, there are $N_{s}\times N_{t}-1$ such zero modes for a lattice with spatial and temporal extent $N_{s}$ and $N_{t}$ . We fix the gauge using a maximal tree of links on which we set the gauge variables $U_{x,\mu}$ to unity. The maximal tree includes lattice sites $x=(x_{0},x_{1})$ and directions $\mu\in\{\hat{0},\hat{1}\}$ satisfying

$$ \begin{cases} 0 \leq x_{0} < N_{t}-1 & \text{with } \mu = \hat{0} \text{ for temporal links}, \\ x_{0} = 0;\ 0 \leq x_{1} < N_{s}-1 & \text{with } \mu = \hat{1} \text{ for spatial links}. \end{cases} \tag{9} $$

There are two global gauge degrees of freedom, due to the abelian nature of the theory,

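$$ \begin{cases} U_{x,\hat{0}} \to U_{x,\hat{0}} \Lambda_{0} & \text{for } x_{0} = N_{t}-1, \\ U_{x,\hat{1}} \to U_{x,\hat{1}} \Lambda_{1} & \text{for } x_{1} = N_{s}-1, \end{cases} \tag{10} $$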

where $\Lambda_{0}$ and $\Lambda_{1}$ are elements of the $U(1)$ group. Thus we fix two more gauge links, $U_{(N_{t}-1,0),\hat{0}}=U_{(0,N_{s}-1),\hat{1}}=1$ , during the MD evolution of QNHMC to remove these two zero modes.
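The link counting can be checked by enumerating the maximal tree directly. The sketch below is illustrative only: it assumes links are labeled `(x0, x1, mu)` with `mu = 0` temporal and `mu = 1` spatial, and the helper names are hypothetical. It verifies that the selected links form a spanning tree of the lattice sites and counts the links left free after the two additional links are fixed.

```python
def maximal_tree(Nt, Ns):
    """Links (x0, x1, mu) selected by the maximal-tree condition."""
    tree = []
    for x0 in range(Nt):
        for x1 in range(Ns):
            if x0 < Nt - 1:
                tree.append((x0, x1, 0))    # temporal link from (x0, x1) to (x0 + 1, x1)
            if x0 == 0 and x1 < Ns - 1:
                tree.append((x0, x1, 1))    # spatial link from (0, x1) to (0, x1 + 1)
    return tree

def is_spanning_tree(tree, Nt, Ns):
    """True if the links connect all sites without closing any loop."""
    parent = {(a, b): (a, b) for a in range(Nt) for b in range(Ns)}

    def find(site):
        while parent[site] != site:
            site = parent[site]
        return site

    for x0, x1, mu in tree:
        a = find((x0, x1))
        b = find((x0 + 1, x1) if mu == 0 else (x0, x1 + 1))
        if a == b:
            return False                    # this link would close a loop
        parent[a] = b
    return len(tree) == Nt * Ns - 1         # V - 1 loop-free links connect all V sites

Nt = Ns = 24
tree = maximal_tree(Nt, Ns)
n_links = 2 * Nt * Ns                       # two links per site in two dimensions
n_fixed = len(tree) + 2                     # maximal tree plus the two extra fixed links
n_free = n_links - n_fixed
```

For the $24\times 24$ lattice this gives 575 tree links out of 1152, and 575 links that still evolve once the two extra links are fixed.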

We are interested in observables that are slow to evolve in the Markov chain, particularly the topological charge. We use the definition of the topological charge [14, 15], $Q=(\sum_{x}\operatorname{Arg}P_{x})/(2\pi)$ , where the complex argument $\operatorname{Arg}$ takes the principal value in $(-\pi,\pi)$ . This definition does not apply to exceptional configurations (with no contribution to the partition function in the continuum limit) where $P_{x}=-1$ for some $x$ . On a two-dimensional lattice with periodic boundary conditions, this definition of the topological charge gives exact integer values.
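A minimal sketch of this definition, assuming link angles stored as `theta[mu][x0][x1]` with $U_{x,\mu}=e^{i\theta}$ and the plaquette $P_{x}=U_{x,\hat{0}}U_{x+\hat{0},\hat{1}}U^{\dagger}_{x+\hat{1},\hat{0}}U^{\dagger}_{x,\hat{1}}$ (the storage layout and orientation convention are choices made here, not taken from the text). Every link enters one plaquette with a plus sign and one with a minus sign, so the unwrapped angles sum to zero and the wrapped sum is a multiple of $2\pi$.

```python
import math
import random

def topological_charge(theta, Nt, Ns):
    """Q = sum_x Arg(P_x) / (2 pi) on a periodic Nt x Ns lattice."""
    total = 0.0
    for x0 in range(Nt):
        for x1 in range(Ns):
            xp0 = (x0 + 1) % Nt             # periodic neighbor in the temporal direction
            xp1 = (x1 + 1) % Ns             # periodic neighbor in the spatial direction
            plaq = (theta[0][x0][x1] + theta[1][xp0][x1]
                    - theta[0][x0][xp1] - theta[1][x0][x1])
            total += math.atan2(math.sin(plaq), math.cos(plaq))   # principal value of the angle
    return total / (2.0 * math.pi)

Nt = Ns = 8
rng = random.Random(12345)
theta = [[[rng.uniform(-math.pi, math.pi) for _ in range(Ns)] for _ in range(Nt)]
         for _ in range(2)]
Q_random = topological_charge(theta, Nt, Ns)

cold = [[[0.0] * Ns for _ in range(Nt)] for _ in range(2)]
Q_cold = topological_charge(cold, Nt, Ns)
```

`Q_random` is an integer up to rounding error for any non-exceptional configuration, and the cold start gives zero.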

5 Current status, and future plans

We implement the $U(1)$ gauge theory in the QEX framework [16]. We use PRIMME [17] to study the eigenmodes of the exact Hessian matrix and the L-BFGS approximated one.

The results below come from the 2-D $U(1)$ theory at $\beta=4.5$ on a lattice of size $24\times 24$ , with the number of coupled parallel HMC streams $\mathbb{N}=10$ and $20$ , using between 8192 and 65536 configurations, depending on the trajectory length. The MD evolution uses Omelyan's second-order minimum-norm integrator [18]. We keep the number of steps per MD trajectory fixed at 8, 16, 32, or 64, while tuning for the optimal trajectory length separately for conventional HMC and QNHMC.

Figure 2 shows the integrated autocorrelation length of the topological charge squared (left) and the average plaquette (right). To include the cost of generating the Markov chain, we multiply the integrated autocorrelation length by the number of MD steps in an HMC trajectory, converting the correlation length from units of configurations to units of MD steps. We refer to this quantity as the cost of generating configurations for uncorrelated measurable quantities, in terms of force evaluations, and we tune simulation parameters to lower the cost. The apparent increase of the cost for the average plaquette after the trajectory length grows longer than three is due to the fact that the autocorrelation becomes minimal between successive configurations, so the cost becomes linearly proportional to the number of MD steps.
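The cost measure can be sketched as below. The estimator is an assumption: it uses the convention $\tau_{\rm int}=\tfrac{1}{2}+\sum_{t=1}^{W}\rho(t)$ with a fixed summation window $W$, which may differ from the convention and windowing used for Figure 2, and the function names are hypothetical.

```python
import random

def integrated_autocorrelation(series, window):
    """tau_int = 1/2 + sum_{t=1}^{window} rho(t), with rho the normalized autocorrelation."""
    n = len(series)
    mean = sum(series) / n
    dev = [v - mean for v in series]
    c0 = sum(d * d for d in dev) / n
    tau = 0.5
    for t in range(1, window + 1):
        ct = sum(dev[i] * dev[i + t] for i in range(n - t)) / (n - t)
        tau += ct / c0
    return tau

def cost_in_md_steps(tau_int, n_md_steps):
    """Convert an autocorrelation length in configurations to units of MD steps."""
    return tau_int * n_md_steps

# Uncorrelated toy data: tau_int should be close to 1/2 in this convention.
rng = random.Random(0)
white = [rng.gauss(0.0, 1.0) for _ in range(20000)]
tau_white = integrated_autocorrelation(white, window=20)
cost_white = cost_in_md_steps(tau_white, n_md_steps=16)
```

Once successive configurations are nearly uncorrelated, `tau_white` stays near its floor and the cost grows in proportion to `n_md_steps`, which is the linear rise described for the average plaquette.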

Comparing the conventional HMC with and without gauge fixing, we see that the cost for the topological quantity increases by about a factor of two when going from no gauge fixing to gauge fixing. The autocorrelation of the average plaquette value, however, depends less on the gauge fixing. Using QNHMC with $\lambda=0.1$ shows no improvement for the topological quantity, and the cost worsens with $\lambda=0.01$ . On the other hand, QNHMC reduces the cost for the average plaquette with $\lambda=0.1$ , and more so with an increased number of coupled Markov chains, from $\mathbb{N}=10$ to $20$ .

Moving forward, we will do more tuning and testing with the QNHMC algorithm, studying the scaling behavior toward the continuum limit. On the other hand, we will also look for other approaches to approximate the Hessian. L-BFGS is designed for its efficiency in iterative optimizations. Since we have an ensemble of Markov chains, we will look for other ways to approximate the Hessian matrix [19, 20].

Acknowledgments.

This research was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. Department of Energy Office of Science and the National Nuclear Security Administration. We gratefully acknowledge the computing resources provided and operated by the Joint Laboratory for System Evaluation (JLSE) at Argonne National Laboratory.

Bibliography

[1] S. Duane, A. Kennedy, B. Pendleton and D. Roweth, Hybrid Monte Carlo, Phys. Lett. B 195 (1987) 216.
[2] G. G. Batrouni, G. R. Katz, A. S. Kronfeld, G. P. Lepage, B. Svetitsky and K. G. Wilson, Langevin Simulations of Lattice Field Theories, Phys. Rev. D 32 (1985) 2736.
[3] S. Duane, R. Kenway, B. J. Pendleton and D. Roweth, Acceleration of Gauge Field Dynamics, Phys. Lett. B 176 (1986) 143.
[4] S. Duane and B. J. Pendleton, Gauge Invariant Fourier Acceleration, Phys. Lett. B 206 (1988) 101.
[5] M. Girolami and B. Calderhead, Riemann manifold Langevin and Hamiltonian Monte Carlo methods, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 73 (2011) 123.
[6] G. Cossu, P. Boyle, N. Christ, C. Jung, A. Jüttner and F. Sanfilippo, Testing algorithms for critical slowing down, EPJ Web Conf. 175 (2018) 02008 [1710.07036].
[7] N. H. Christ and E. W. Wickenden, Fourier acceleration, the HMC algorithm and renormalizability, PoS LATTICE2018 (2018) 025 [1812.05281].
[8] Y. Zhao, Numerical Implementation of Gauge-Fixed Fourier Acceleration, PoS LATTICE2018 (2018) 026 [1812.05790].
