Detecting and Mitigating Mode-Collapse for Flow-based Sampling of Lattice Field Theories
Kim A. Nicoli, Christopher J. Anders, Tobias Hartung, Karl Jansen, Pan Kessel, Shinichi Nakajima

TL;DR
This paper investigates mode-collapse in normalizing flows used for lattice field theory sampling, revealing that it shifts the tunneling problem to training and proposing metrics and strategies to mitigate bias caused by mode-collapse.
Contribution
It identifies mode-collapse as a training issue in normalizing flows for lattice field theories and introduces metrics and mitigation methods to reduce bias.
Findings
Mode-collapse occurs during training, not sampling.
A metric to quantify mode-collapse is proposed.
Mitigation strategies improve estimation of thermodynamic observables.
Detecting and Mitigating Mode-Collapse for Flow-based Sampling of Lattice Field Theories
Kim A. Nicoli
Transdisciplinary Research Area (TRA) Matter, University of Bonn, Germany
Helmholtz Institute for Radiation and Nuclear Physics (HISKP), Bonn, Germany
Berlin Institute for the Foundations of Learning and Data (BIFOLD), Berlin, Germany
Christopher J. Anders
Berlin Institute for the Foundations of Learning and Data (BIFOLD), Berlin, Germany
Machine Learning Group, Technische Universität Berlin, Berlin, Germany
Tobias Hartung
Northeastern University - London, London, United Kingdom
Karl Jansen
CQTA, Deutsches Elektronen-Synchrotron DESY, Zeuthen, Germany
Pan Kessel
Prescient Design, gRED, Roche, Basel, Switzerland
Shinichi Nakajima
Berlin Institute for the Foundations of Learning and Data (BIFOLD), Berlin, Germany
Machine Learning Group, Technische Universität Berlin, Berlin, Germany
RIKEN Center for AIP, Tokyo, Japan
Abstract
We study the consequences of mode-collapse of normalizing flows in the context of lattice field theory. Normalizing flows allow for independent sampling. For this reason, it is hoped that they can avoid the tunneling problem of local-update MCMC algorithms for multi-modal distributions. In this work, we first point out that the tunneling problem is also present for normalizing flows but is shifted from the sampling to the algorithm’s training phase. Specifically, normalizing flows often suffer from mode-collapse for which the training process assigns vanishingly low probability mass to relevant modes of the physical distribution. This may result in a significant bias when the flow is used as a sampler in a Markov-Chain or with Importance Sampling. We propose a metric to quantify the degree of mode-collapse and derive a bound on the resulting bias. Furthermore, we propose various mitigation strategies in particular in the context of estimating thermodynamic observables, such as the free energy.
I Introduction
Using normalizing flows for sampling in lattice field theory has gained significant attention over the last few years. Several works have been carried out in the domain of scalar field theories [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], $U(1)$ [11, 12] and $SU(N)$ [13, 14, 15] pure gauge theories, and fermionic gauge theories [16, 17]. This rapid development is attributed to the appealing conceptual properties of flow-based sampling. A well-trained flow approximately acts as a trivializing map [18] and can therefore significantly reduce the integrated autocorrelation time of physical observables. The practical obstruction to harnessing this conceptual advantage is that the training process becomes increasingly challenging as the dimensionality of the lattice increases, resulting in poor volume scaling [19, 20, 21, 22]. Furthermore, it is well known that generative models struggle to learn long-range correlations [23], which become crucial as a critical point is approached. When the continuum limit of the theory is taken, both challenges manifest simultaneously: the volume needs to be increased as the critical point is approached. As a result, it remains an open question whether useful architectures can be found for addressing critical slowing down in the continuum limit.
Another conceptually appealing property of normalizing flows is that they allow for independent sampling, thus making flow densities suitable for being combined with Metropolis-Hastings accept-reject schemes. This approach is often referred to as Neural-MCMC [24, 1, 25, 26]. As a result, it may be hoped that they can avoid the tunneling problem which arises when local-update MCMC algorithms are applied to theories that have degenerate minima separated by high action barriers. However, normalizing flows are typically trained by self-sampling in the context of lattice field theory [1]. As we will discuss, this bears the risk that the training will assign vanishingly low probability mass to some of the modes of the theory [4, 2, 3], since the training objective does not strongly penalize this. If mode-collapse happens, certain modes of the theory will not be probed by the sampler. This problem therefore leads to substantially biased estimators of physical observables, as shown in fig. 1.
In our work, we study mode-collapse and the more general mode-mismatch phenomenon, both theoretically and numerically. We first discuss in detail the mode-seeking nature of the standard self-sampling-based training procedure, which corresponds to minimizing the reverse Kullback-Leibler (KL) divergence [3]. We compare this to an alternative training procedure which is based on minimizing the forward (as opposed to the reverse) KL divergence and review why it is equivalent to maximum likelihood training. This objective has the advantage that it is substantially less vulnerable to mode-collapse but has the disadvantage that it requires representative configurations sampled from the theory. In many applications, this prevents this objective from being of any use since, if such configurations are available, we can directly measure physical observables on them and a flow is not necessary. However, we point out that there is an important exception to this: for thermodynamic observables, such as the free energy, it is still useful to train a flow. This is because these observables are typically obtained by integration through the parameter space of the theory and thus require a significant number of Markov chains along a discretized trajectory in the parameter space. By training a flow on samples generated at a single point in parameter space, we can completely avoid the need for these additional Markov chains. In this important scenario, it is thus sensible and, as we argue, advisable to use forward KL training for the flow to significantly reduce mode-collapse. Besides modifying the training procedure, we also propose to mitigate mode-collapse by combining two flow-based estimators for the free energy. As a side remark, we note that concurrent works have proposed strategies, alternative to the forward KL objective, for mitigating mode-collapse. These include more stable path gradient estimators [27, 28], learning deformed target distributions [29] and annealed importance sampling [30].
We then study the bias induced by mode-collapse theoretically. Specifically, we derive a bound on the bias of the estimator for physical observables. This allows us to propose a natural metric to quantify the degree of mode-collapse of the sampler.
The effectiveness of our proposed methods is then demonstrated on a two-dimensional scalar theory.
We stress that our study focuses on the estimation of the free energy, an example of a thermodynamic quantity involving the partition function. Such quantities form a crucial subset of physical observables in lattice field theory [32]; other examples of thermodynamic observables not directly accessible with HMC are, for instance, entropy and pressure. Estimating these observables with standard Markov-Chain-based methods requires sampling configurations at many different values in parameter space and integrating free energy differences from a known reference value, see previous works for more details [4, 2]. This approach is computationally expensive since it often requires a significant number of HMC chains along the trajectory in parameter space, and crucially leads to high uncertainty, as errors from each chain accumulate upon integration. This problem becomes more severe when one needs to cross a phase transition. There, integrated autocorrelation times explode, thus resulting in larger errors for each Markov chain. For this reason, training a normalizing flow using a forward KL objective can often be advantageous: training a normalizing flow requires samples from only a single Markov chain at the target point in parameter space and thus allows us to circumvent the need for any additional chains along the trajectory through parameter space.
We emphasize that the intricacies of training a normalizing flow for multimodal distributions in the context of lattice field theories have already been discussed in [3]. Our work builds on this reference but differs in that we consider thermodynamic observables. As explained above, these observables cannot be estimated from the Markov-Chain samples at the target point alone; standard methods require additional Markov chains at different coupling values. As a result, training normalizing flows using the forward KL objective is particularly natural for the estimation of thermodynamic observables.
II Training a Generative Model
Normalizing flows [33, 34, 35] are a particular class of generative models giving access to an analytic form of the likelihood. While this work focuses on flows for concreteness, we stress that the theoretical arguments made in the following sections hold for any generative model allowing for exact likelihood estimation. For such models, a variational density $q_\theta$, the sampler, parameterized by a set of weights $\theta$, is optimized to approximate the target density $p$ of the lattice field theory
$$ p(\phi) = \frac{1}{Z}\, e^{-S(\phi)}, \tag{1} $$
where $Z$ is the partition function and $S(\phi)$ is the action of the theory.
During the training of a normalizing flow, an efficient transformation to map a base density $q_z$ into a non-trivial target is learned. In practice, the base distribution is chosen such that it allows for efficient sampling. Common choices for the base density are therefore normal or uniform distributions.
The flow uses a diffeomorphism $g_\theta$ between the base space $\mathcal{Z}$ and the configuration space $\mathcal{X}$, hence
$$ g_\theta : \mathcal{Z} \to \mathcal{X}, \qquad z \mapsto \phi = g_\theta(z). \tag{2} $$
The diffeomorphism $g_\theta$ is a composition of bijective transformations referred to as coupling blocks. Each of these blocks satisfies the following requirements:
1. the block is a bijection,
2. both the block and its inverse are differentiable,
3. the determinant of the Jacobian is efficient to evaluate.
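The inverse of the transformation therefore always exists by construction. Leveraging these properties, an analytic expression for the likelihood of the flow-based model reads
$$ q_\theta(\phi) = q_z\!\left(g_\theta^{-1}(\phi)\right) \left| \det \frac{\partial g_\theta^{-1}(\phi)}{\partial \phi} \right|. \tag{3} $$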
Different coupling blocks satisfying the requirements above have been proposed; these include Non-Linear Independent Component Estimation (NICE) [36], Real Non-Volume Preserving (RealNVP) [37], and Generative flow (GLOW) [38]. We refer to [35, 34] for an overview of the existing coupling blocks and further technical details.
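As a minimal sketch of how the three requirements yield a tractable likelihood, the following toy implements a single affine coupling block on a two-component field. The functions `scale` and `shift` are hypothetical stand-ins for trained networks, and the block is an illustration in the spirit of RealNVP rather than the architecture used in this work.

```python
import math

def log_normal(z):
    """Log-density of independent standard normals (the base density)."""
    return sum(-0.5 * zi * zi - 0.5 * math.log(2.0 * math.pi) for zi in z)

def scale(x):
    # Hypothetical stand-in for a trained network output.
    return 0.5 * math.tanh(x)

def shift(x):
    # Hypothetical stand-in for a trained network output.
    return 0.3 * x

def coupling_forward(z):
    """Map a base sample z to phi and return log|det J| of the map."""
    z1, z2 = z
    s = scale(z1)
    phi = (z1, z2 * math.exp(s) + shift(z1))
    # The Jacobian is triangular with diagonal (1, exp(s)), so log|det J| = s.
    return phi, s

def coupling_inverse(phi):
    """Exact inverse: the first component is untouched, so s and t can be recomputed."""
    p1, p2 = phi
    return (p1, (p2 - shift(p1)) * math.exp(-scale(p1)))

def flow_log_prob(phi):
    """log q(phi) = log q_z(g^{-1}(phi)) - log|det dg/dz|."""
    z = coupling_inverse(phi)
    return log_normal(z) - scale(z[0])

z = (0.4, -1.2)
phi, log_det = coupling_forward(z)
z_back = coupling_inverse(phi)
log_q = flow_log_prob(phi)
```

Because one component passes through unchanged, inversion never requires inverting the networks themselves, and the log-determinant costs a single function evaluation.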
II.1 The Forward- and Reverse-KL Divergences
During training, the normalizing flow is optimized by density matching. It is common practice to minimize the so-called KL divergences to this end although other types of generalized divergences can be used [39, 40, 41, 42, 43, 44, 45, 46]. As we will discuss in this section, and in section III, choosing an appropriate divergence is crucial to ensure successful training.
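The so-called reverse-KL divergence reads
$$ \mathrm{KL}(q_\theta \,\|\, p) = \int \mathcal{D}[\phi]\; q_\theta(\phi)\, \ln \frac{q_\theta(\phi)}{p(\phi)}, \tag{4} $$
where $\mathcal{D}[\phi]$ represents the measure of a high-dimensional integral. It is worth stressing that the KL divergence is not symmetric, hence
$$ \mathrm{KL}(q_\theta \,\|\, p) \neq \mathrm{KL}(p \,\|\, q_\theta). \tag{5} $$
The right-hand side of eq. 5 is usually referred to as the forward-KL, which can be written as an expectation value with respect to the target density $p$
$$ \mathrm{KL}(p \,\|\, q_\theta) = \int \mathcal{D}[\phi]\; p(\phi)\, \ln \frac{p(\phi)}{q_\theta(\phi)} = \mathbb{E}_{\phi \sim p}\!\left[ \ln \frac{p(\phi)}{q_\theta(\phi)} \right]. \tag{6} $$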
These two choices for the divergence lead to different training procedures, as we will discuss in sections II.1.1 and II.1.2. We also note that the use of reverse and forward KL is not mutually exclusive. In the context of quantum chemistry [47], for instance, a combination of the two is typically chosen.
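The asymmetry of the KL divergence can be made concrete on a small discrete example. The four-state distributions below are invented for illustration: a bimodal target, a candidate that sits on one mode only, and a candidate that spreads its mass uniformly.

```python
import math

def kl(a, b):
    """KL(a || b) for discrete distributions given as equal-length lists."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0.0)

# Bimodal target: two modes separated by a low-probability barrier.
p = [0.49, 0.01, 0.01, 0.49]
# Candidate concentrated on the left mode only (mode-collapsed).
q_collapsed = [0.97, 0.01, 0.01, 0.01]
# Candidate covering every state, at the price of a poor fit to each mode.
q_covering = [0.25, 0.25, 0.25, 0.25]

reverse_collapsed = kl(q_collapsed, p)   # expectation under q
reverse_covering = kl(q_covering, p)
forward_collapsed = kl(p, q_collapsed)   # expectation under p
forward_covering = kl(p, q_covering)
```

In this toy, the reverse divergence ranks the collapsed candidate as the better fit, while the forward divergence ranks the covering candidate as the better fit, which is the mode-seeking versus mode-covering behaviour discussed in the text.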
II.1.1 Reverse-KL: training by self-sampling
The reverse KL divergence is the standard choice for training normalizing flows on the lattice. This is because lattice field theory comes with an action which is known in closed form – in contrast to many other machine learning applications.
The reverse KL divergence (4) can be approximated by a Monte-Carlo estimate [24, 2] as follows
$$ \mathrm{KL}(q_\theta \,\|\, p) \approx \frac{1}{N} \sum_{i=1}^{N} \left[ \ln q_\theta(\phi_i) + S(\phi_i) \right] + \ln Z. \tag{7} $$
Here, the field configurations $\phi_i$ are sampled from the flow, i.e. $\phi_i \sim q_\theta$, and thus the training relies on self-sampling. In particular, the partition function $Z$ contributes by a shift term that is constant with respect to the parameters $\theta$ of the flow and thus can be ignored for optimization by gradient descent.
Training using a reverse-KL is therefore very efficient because it does not require samples from the target density due to self-sampling. Unfortunately, this comes at a cost as this objective is known to be prone to mode-collapse. The sketch on the left-hand side of fig. 2 shows that a reverse-KL-trained flow tends to focus its support on a subset of the modes when the target density is multimodal. This undesirable behavior strongly affects the reverse KL as the dropped modes are not probed in the self-sampling process.
This mode-seeking nature of the reverse KL represents a major drawback and limits the applicability of this framework when the physical density to be learned has more than one mode [3, 4, 29].
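The self-sampling estimate of the reverse KL can be sketched in one dimension. Everything here is an illustrative assumption: a Gaussian with mean `mu` and width `sigma` stands in for the flow, and `action` is a toy double-well rather than a lattice action. The constant $\ln Z$ is dropped, as it would be during optimization.

```python
import math
import random

def action(phi, a=8.0):
    """Toy double-well action with minima at phi = -1 and phi = +1."""
    return a * (phi * phi - 1.0) ** 2

def log_q(phi, mu, sigma):
    """Log-density of a Gaussian sampler, standing in for a flow."""
    return -0.5 * ((phi - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2.0 * math.pi))

def reverse_kl_loss(mu, sigma, n=20000, seed=0):
    """Monte-Carlo estimate of KL(q || p) up to the constant ln Z."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        phi = rng.gauss(mu, sigma)  # self-sampling: phi ~ q
        total += log_q(phi, mu, sigma) + action(phi)
    return total / n

loss_right = reverse_kl_loss(mu=1.0, sigma=0.2)   # sits on the right mode only
loss_left = reverse_kl_loss(mu=-1.0, sigma=0.2)   # sits on the left mode only
loss_broad = reverse_kl_loss(mu=0.0, sigma=1.0)   # straddles both modes
```

Both single-mode samplers reach nearly the same loss, well below that of the broad sampler, so nothing in this objective signals that half of the target mass is missing.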
II.1.2 Forward-KL: training by maximum likelihood
The forward KL divergence can be written as an expectation value with respect to the target density $p$ and thus can also be approximated by Monte-Carlo
$$ \mathrm{KL}(p \,\|\, q_\theta) \approx -\frac{1}{N} \sum_{i=1}^{N} \ln q_\theta(\phi_i) + \mathrm{const.} \tag{8} $$
In contrast to the reverse KL divergence, the samples $\phi_i$ are here to be drawn from the density of the theory, i.e. $\phi_i \sim p$, and the constant does not depend on $\theta$. As can be seen from the equation, the minimization of the forward KL corresponds to maximizing the likelihood of the model. In the machine learning literature, the forward training procedure is thus also known as maximum likelihood training. Indeed, this has already been explored in the context of lattice field theory [3].
This training procedure has the advantage that it is mode-covering since all modes of the physical target density will necessarily be probed in training. It has, however, the disadvantage that it requires samples from the target $p$. In lattice applications, these are typically generated by a Monte-Carlo algorithm, such as HMC. However, if these configurations are available, one can directly measure physical observables on them and there is therefore no need to train a flow in the first place.
One may thus wonder if this training procedure is of any use then. For thermodynamic observables however, such as the free energy, one typically does not only require a single Markov chain for the target density but a whole series of Markov chains along a discretized trajectory in the parameter space of the theory. As we will review in the next section, a flow allows us to completely avoid the need for these additional Markov chains. For the important class of thermodynamic observables, forward KL training is thus well-justified and, as we will show, advisable.
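A minimal sketch of the maximum likelihood objective, under toy assumptions: `sample_target` draws exactly from a symmetric two-mode density and stands in for a thermalized HMC chain, and the two Gaussian candidates stand in for flows.

```python
import math
import random

def log_normal(phi, mu, sigma):
    return -0.5 * ((phi - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2.0 * math.pi))

def sample_target(n, seed=0):
    """Stand-in for HMC output: exact draws from a symmetric two-mode density."""
    rng = random.Random(seed)
    return [rng.gauss(rng.choice((-1.0, 1.0)), 0.2) for _ in range(n)]

def forward_kl_loss(samples, log_q):
    """Negative mean log-likelihood: forward KL up to a theta-independent constant."""
    return -sum(log_q(phi) for phi in samples) / len(samples)

samples = sample_target(10000)
# Candidate concentrated on the right mode only.
nll_collapsed = forward_kl_loss(samples, lambda phi: log_normal(phi, 1.0, 0.2))
# Broad candidate covering both modes.
nll_covering = forward_kl_loss(samples, lambda phi: log_normal(phi, 0.0, 1.0))
```

Since half of the training samples lie in the mode that the collapsed candidate ignores, its loss is far larger than that of the covering candidate. This is the sense in which every mode present in the training data is probed by this objective.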
III Reliable Estimators in Presence of Mode-Collapse
Combining deep generative models, e.g. normalizing flows, with neural importance sampling (NIS) has been shown to be a fruitful approach for estimating thermodynamic observables in lattice field theory [2, 4], statistical mechanics [24], and chemistry [47, 48, 49]. This approach enables direct estimation of the free energy as well as other thermodynamic observables because flow-based sampling allows for estimating those observables at arbitrary points in the parameter space. Remarkably, this is in stark contrast to standard Markov-Chain Monte-Carlo methods which instead require non-trivial integration in the parameter space [2]. More specifically, NIS allows the computation of a direct Monte-Carlo estimate of the partition function $Z$, which is crucial for many thermodynamic observables [2] such as entropy and free energy
$$ F = -T \ln Z. \tag{9} $$
In the following, we review two different estimators of the free energy, namely the p-estimator and the q-estimator. These allow estimation using samples drawn from the target and the generative model respectively [4]. We stress that these estimators are applicable for any generative model that has a tractable likelihood, such as normalizing flows [34], autoregressive neural networks [50], and diffusion models [51].
III.1 Different Estimators for the Partition Function
Given a generative model, irrespective of whether it has been trained using reverse-KL or forward-KL, the resulting sampler $q_\theta$ is an approximation for the target density $p$. Leveraging this, recent works proposed to estimate the partition function of a physical system [24, 2] directly at a given point in parameter space. This approach samples lattice configurations from the generative model and estimates the partition function with a so-called q-estimator
$$ \hat{Z}_q = \frac{1}{N} \sum_{i=1}^{N} \tilde{w}(\phi_i), \quad \phi_i \sim q_\theta, \qquad \text{with} \quad \tilde{w}(\phi) = \frac{e^{-S(\phi)}}{q_\theta(\phi)}. \tag{10} $$
Alternatively, when samples from the target $p$ are available, i.e., using a thermalized Markov chain, one can estimate the inverse partition function with the so-called p-estimator
$$ \hat{Z}_p^{-1} = \frac{1}{N} \sum_{i=1}^{N} \frac{q_\theta(\phi_i)}{e^{-S(\phi_i)}}, \quad \phi_i \sim p. \tag{11} $$
Combining these results with eq. 9, one immediately derives corresponding estimators for the free energy
$$ \hat{F}_q = -T \ln \hat{Z}_q, \tag{12} $$
$$ \hat{F}_p = -T \ln \hat{Z}_p. \tag{13} $$
Both the p-estimator and the q-estimator can be shown to be asymptotically consistent under the assumption that the supports of the flow and the target density match, i.e. $\mathrm{supp}(q_\theta) = \mathrm{supp}(p)$ [24]. By construction, the learned density, $q_\theta$, has full support over the entire domain of the base distribution, at least from a purely theoretical point of view. This implies that $\mathrm{supp}(p) \subseteq \mathrm{supp}(q_\theta)$ always holds in theory. However, in practice it is not unlikely to have regions of the domain where the density $q_\theta$ is vanishingly small. Hence, for a finite number of samples, it can effectively be zero. This leads to incorrect estimation of expectation values of physical observables for any reasonable number of samples $N$. Furthermore, ensuring that a normalizing flow is invertible also in practice, i.e. to numerical precision, can be very challenging [52].
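The two free energy estimators can be sketched on a one-dimensional toy in which the answer is known. All names and numbers below are assumptions made for illustration: the target is a Gaussian mixture rescaled by a chosen constant `Z_TRUE`, so that exact sampling from $p$ is trivial, and a slightly broader mixture stands in for a well-trained flow whose support matches that of the target.

```python
import math
import random

T = 1.0        # temperature, set to one for the toy
Z_TRUE = 7.5   # hypothetical partition function of the toy target
TARGET = [(0.5, -1.0, 0.5), (0.5, 1.0, 0.5)]    # (weight, mean, std) of p
SAMPLER = [(0.5, -1.0, 0.6), (0.5, 1.0, 0.6)]   # stand-in for a trained flow q

def mixture_pdf(x, comps):
    return sum(w * math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))
               for w, m, s in comps)

def mixture_sample(rng, comps, n):
    picks = rng.choices(comps, weights=[w for w, _, _ in comps], k=n)
    return [rng.gauss(m, s) for _, m, s in picks]

def boltzmann(x):
    """exp(-S(x)) of the toy target; integrates to Z_TRUE by construction."""
    return Z_TRUE * mixture_pdf(x, TARGET)

def free_energy_q(q_samples):
    """q-estimator: average the unnormalized weights over samples from q."""
    z_hat = sum(boltzmann(x) / mixture_pdf(x, SAMPLER) for x in q_samples) / len(q_samples)
    return -T * math.log(z_hat)

def free_energy_p(p_samples):
    """p-estimator: average the inverse weights over samples from p."""
    inv_z_hat = sum(mixture_pdf(x, SAMPLER) / boltzmann(x) for x in p_samples) / len(p_samples)
    return T * math.log(inv_z_hat)  # equals -T ln(Z_p)

rng = random.Random(1)
f_true = -T * math.log(Z_TRUE)
f_q = free_energy_q(mixture_sample(rng, SAMPLER, 20000))
f_p = free_energy_p(mixture_sample(rng, TARGET, 20000))
```

With matching supports, both estimates land close to `f_true`; the estimators themselves only ever use the unnormalized weight `boltzmann`, never `Z_TRUE` directly.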
To analyze the resulting implications for the estimation process, it is useful to define the following generalized notion of the support of the variational density
Definition 1.
The effective support of the variational density relative to is given by
[TABLE]
for a given numerical threshold . The mode dropping set is then given by
[TABLE]
This definition is useful for the following reason: if the flow is effectively mode dropping, i.e., the mode-dropping set is non-empty, the importance weighted estimator, with a finite number of samples , will miss a contribution from the mass with approximately the probability . We note that defining the threshold for the effective support relative to the target distribution is pivotal. This is because the absolute definition would have no meaning with regard to mode-dropping. As an example, suppose we have an area of size and in that area. Then the corresponding area with probability mass would be considered “-mode-dropped” even though is an exact copy of . We therefore choose a definition for which mode-dropping only exists when approximately vanishes relative to .
It is also useful to define the effective sampler distribution
[TABLE]
where represents the multiplicative renormalization factor necessary to guarantee the normalization of , i.e., the probability mass out of the effective support is redistributed to the effective support proportionally to the original density . With this definition, we express the practical situation where the importance weighted estimator for a physical observable typically misses the contribution from the mode-dropping set , as the assumption that the following approximation holds:
[TABLE]
where , for the sample size large enough for Monte Carlo sampling but not too large to assume that , i.e., the probability that all samples drawn from lie within the effective support is close to one. Since has the full support, it holds that . Throughout the manuscript, we will indicate by a hat a (finite sample) estimator, by a bar the expectation over the effective distribution – which corresponds to the average over typical samples – and by an asterisk the expectation over the original distribution . Note that, under our assumption of mode-dropping (17), i.e., , the typical values of the estimator can be significantly different from the true expectation value
[TABLE]
A more detailed discussion of the effective relative support is provided in section .2.
We also remark that is also interesting to consider, as implies effective “fake” modes that are present in but not in .
The following theorem holds:
Theorem 2.
Suppose the trained model is mode dropping, i.e., the approximation (17) holds. Then the $q$-estimator and the $p$-estimator for the free energy approximate $\bar{F}_q$ and $\bar{F}_p$, respectively, being bounds on the true free energy $F$ as
$$ \bar{F}_p \le F \le \bar{F}_q. $$
Furthermore, if it follows
[TABLE]
and similarly if
[TABLE]
We prove in section .3 that the estimators serve as upper and lower bounds of the free energy.
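The bounding behaviour can be checked numerically on a one-dimensional toy. This is a sketch under invented assumptions, not the experiment of this work: the target is a rescaled two-mode Gaussian mixture with known constant `Z_TRUE`, `Q_DROPPED` misses one mode entirely, and `Q_FAKE` places half of its mass where the target has effectively none.

```python
import math
import random

T, Z_TRUE = 1.0, 7.5   # toy temperature and hypothetical partition function
TARGET = [(0.5, -3.0, 0.5), (0.5, 3.0, 0.5)]    # two well-separated modes
Q_DROPPED = [(1.0, 3.0, 0.6)]                   # misses the left mode
Q_FAKE = [(0.25, -3.0, 0.6), (0.25, 3.0, 0.6), (0.5, 12.0, 0.6)]  # extra fake mode

def mixture_pdf(x, comps):
    return sum(w * math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))
               for w, m, s in comps)

def mixture_sample(rng, comps, n):
    picks = rng.choices(comps, weights=[w for w, _, _ in comps], k=n)
    return [rng.gauss(m, s) for _, m, s in picks]

def boltzmann(x):
    """exp(-S(x)) of the toy target; integrates to Z_TRUE by construction."""
    return Z_TRUE * mixture_pdf(x, TARGET)

def free_energy_q(rng, q_comps, n=20000):
    xs = mixture_sample(rng, q_comps, n)        # samples from the model
    z_hat = sum(boltzmann(x) / mixture_pdf(x, q_comps) for x in xs) / n
    return -T * math.log(z_hat)

def free_energy_p(rng, q_comps, n=20000):
    xs = mixture_sample(rng, TARGET, n)         # samples from the target
    inv_z_hat = sum(mixture_pdf(x, q_comps) / boltzmann(x) for x in xs) / n
    return T * math.log(inv_z_hat)

rng = random.Random(2)
f_true = -T * math.log(Z_TRUE)
fq_dropped = free_energy_q(rng, Q_DROPPED)
fp_dropped = free_energy_p(rng, Q_DROPPED)
fq_fake = free_energy_q(rng, Q_FAKE)
fp_fake = free_energy_p(rng, Q_FAKE)
```

In this toy, the mode-dropping model overestimates the free energy with the q-estimator by about $T \ln 2$, because it only ever sees half of the target mass, while its p-estimate stays close to the truth. The model with a fake mode shows the mirror image: its p-estimate is too low by about $T \ln 2$ and its q-estimate is close to the truth.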
In the presence of mode-collapse, the flow has smaller effective support than the target, i.e.,
[TABLE]
Crucially, this may also happen when the variational density $q_\theta$ is a very bad approximation of the true density $p$, see bottom left of fig. 5. While this is, strictly speaking, not a common manifestation of mode-collapse, the following discussion also holds for such badly-trained models, where the overlap between the supports of $q_\theta$ and $p$ is very small. In this case, the q-estimator in eq. 12 may thus lead to (possibly strongly) biased results since the sampler may not have full effective support under the assumption that the approximation (17) holds. On the other hand, the q-estimator has the advantage that it is typically more efficient to sample directly from the flow while the p-estimator requires the (possibly costly) generation of configurations by a Markov Chain. Nevertheless, it is advisable to estimate the free energy with both estimators if there is a risk of mode mismatch and ensure that both lead to consistent results. The phenomenon of mode-collapse is a widely known issue in the field of density estimation [53, 54, 55, 56]. In particular, when deploying generative models for physical systems, this becomes crucial as neglecting subsets of the modes of a target density would inevitably lead to highly biased estimation of physical quantities. Moreover, this may sometimes not even be detected unless appropriate estimators are used [4]. We want to stress that this problem is not restricted to lattice field theories [3, 4] but is also found within other contexts, such as molecular systems [47, 57, 30]. Having an estimator which quantifies the amount of probability mass being missed by a variational ansatz is therefore highly desirable for more reliable and unbiased estimation of physical quantities.
When the trained model neglects some modes of the target density, hence missing full effective support over the target domain, estimates of physical observables may be biased. A desirable property of our framework is to detect such bias by providing reliable bounds on the error when the model is mode-dropping. When $q_\theta$ has full effective support on the domain of $p$, the expected value of the importance weights $w(\phi) = p(\phi)/q_\theta(\phi)$ reads
$$ \mathbb{E}_{\phi \sim q_\theta}\!\left[ w(\phi) \right] = \int \mathcal{D}[\phi]\; q_\theta(\phi)\, \frac{p(\phi)}{q_\theta(\phi)} = 1. \tag{19} $$
This expectation value thus measures the degree to which the support of the target density $p$ is covered by the sampler $q_\theta$. Statistically, in the limit of infinite measurements, it is always equal to one. However, if the sampler is mode-dropping, hence the approximation (17) holds, then its estimator will lie in $[0, 1]$, providing us with a natural quantity to measure the sampler’s ability to probe the entire support of the target density $p$.
We will now derive an estimator for this expectation value. To this end, we rewrite the above expression as
$$ \mathbb{E}_{\phi \sim q_\theta}\!\left[ w(\phi) \right] = \frac{1}{Z}\, \mathbb{E}_{\phi \sim q_\theta}\!\left[ \frac{e^{-S(\phi)}}{q_\theta(\phi)} \right], \tag{20} $$
and we note that the expectation value is now taken with respect to $q_\theta$. As shown in the last section, the partition function can be approximated by the p-estimator (11) when samples from the target density are available. Thus, under the assumption that the approximation (17) holds, the following Monte-Carlo estimate approximates eq. 20, i.e.,
$$ \hat{w} = \left( \frac{1}{N} \sum_{i=1}^{N} \frac{e^{-S(\phi_i)}}{q_\theta(\phi_i)} \right) \left( \frac{1}{M} \sum_{j=1}^{M} \frac{q_\theta(\phi'_j)}{e^{-S(\phi'_j)}} \right), \tag{21} $$
where $\phi_i$ and $\phi'_j$ are sampled from the flow and the target density respectively.
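A sketch of this mode-dropping measure on a one-dimensional toy: the average unnormalized importance weight over model samples is multiplied by the p-estimate of the inverse partition function, so the unknown normalization cancels. The mixtures and the constant hidden inside `boltzmann` are invented for illustration, and `coverage` is a hypothetical name.

```python
import math
import random

TARGET = [(0.5, -3.0, 0.5), (0.5, 3.0, 0.5)]        # toy target with two modes
Q_COVERING = [(0.5, -3.0, 0.6), (0.5, 3.0, 0.6)]    # model covering both modes
Q_DROPPED = [(1.0, 3.0, 0.6)]                       # model missing the left mode

def mixture_pdf(x, comps):
    return sum(w * math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))
               for w, m, s in comps)

def mixture_sample(rng, comps, n):
    picks = rng.choices(comps, weights=[w for w, _, _ in comps], k=n)
    return [rng.gauss(m, s) for _, m, s in picks]

def boltzmann(x):
    """Unnormalized target weight exp(-S(x)); the constant 7.5 is arbitrary."""
    return 7.5 * mixture_pdf(x, TARGET)

def coverage(rng, q_comps, n=20000):
    """Product of the q-estimate of Z and the p-estimate of 1/Z."""
    q_samples = mixture_sample(rng, q_comps, n)
    p_samples = mixture_sample(rng, TARGET, n)
    z_hat = sum(boltzmann(x) / mixture_pdf(x, q_comps) for x in q_samples) / n
    inv_z_hat = sum(mixture_pdf(x, q_comps) / boltzmann(x) for x in p_samples) / n
    return z_hat * inv_z_hat

rng = random.Random(3)
w_covering = coverage(rng, Q_COVERING)
w_dropped = coverage(rng, Q_DROPPED)
```

For the covering model the value is close to one, while for the model that drops one of two equally weighted modes it is close to one half, which matches the fraction of target mass the sampler actually probes.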
III.2 Bounding the Bias of Physical Observables
Following [2], given a physical observable $\mathcal{O}$, our goal is to compute the importance weighted estimator $\hat{\mathcal{O}}$, defined in the left-hand side of (17), which approximates the expectation value $\bar{\mathcal{O}}$ over the effective sampler distribution (16). This estimator is not necessarily an unbiased estimate of the true value $\mathcal{O}^{*}$ (18) if the model is affected by mode-collapse, i.e., if the approximation (17) holds. Similarly, the bias of the estimator evaluated over a finite number of trials should approximate
[TABLE]
i.e. the bias arises due to the insufficient effective support of the sampler.
IV The Mode-Dropping Estimator
In the following, we aim to derive a bound on this bias. In the process, we will also obtain a natural measure for the degree of mode-collapse. To this end, we foliate the sampling space into disjoint sets
[TABLE]
for and . We refer to fig. 3 for a visual illustration of the foliation. For fixed , we can define the following weights
[TABLE]
Leveraging these definitions, we derive a bound on the bias in section .5 which is summarized in the following theorem.
Theorem 3.
Let the action of the theory and the observable be polynomially bounded, i.e.
[TABLE]
for some and
[TABLE]
for some . The bias, i.e., the difference between the expectation value and the true value , then satisfies
[TABLE]
The bias is therefore bounded by a weighted sum over the . The weighting of each summand depends on the observable of interest. We note that the present discussion is only relevant for non-compact variables. Indeed, continuous functions living on compact manifolds are integrable, and therefore no additional care is required to handle indefinite forms of the type . In particular, it follows that for a compact variable the bound is straightforward
[TABLE]
We note that many physical observables are simple powers of fields, i.e. for . It can be shown that the foliation (23) along with the polynomial bound of the action implies that
[TABLE]
where we have defined . We refer to section .4 for more details. For such observables, the bias can thus be bounded by
[TABLE]
The theorem also naturally relates to the quantity introduced in the last section which quantifies the degree of mode-collapse. In order to provide a single number for the degree of the mode-collapse of the sampler, it is natural to choose a uniform, i.e. observable agnostic, weighing. It then follows from the definition of the , see (24), that this weighting measures the mismatch in support between the sampler and the target density
[TABLE]
and is thus directly related to the mode-dropping estimator derived in the last section.
V Numerical Experiments
We evaluate our proposed methods to detect and mitigate mode-collapse using the two-dimensional scalar $\phi^4$-theory with action
$$ S(\phi) = \sum_{x \in \Lambda} \left[ -2\kappa \sum_{\hat{\mu}=1}^{2} \phi(x)\, \phi(x + \hat{\mu}) + (1 - 2\lambda)\, \phi(x)^2 + \lambda\, \phi(x)^4 \right], $$
where $\lambda$ is the bare coupling while $\kappa$ is the hopping parameter. We refer to [2] for more details on this hopping parameterization of the action. Throughout all our experiments we keep the bare coupling $\lambda$ fixed and vary $\kappa$ such that the theory crosses the phase transition due to the spontaneous breaking of its $\mathbb{Z}_2$ symmetry, i.e. $\phi \to -\phi$. As the hopping parameter increases, spontaneous magnetization is observed. This is illustrated in fig. 4, for which the hopping parameter takes values through the critical region. The curves show the density (top) and log density (bottom) of the normalized magnetization with different colors referring to different values of the hopping parameter $\kappa$. Spontaneous symmetry breaking is observed as the distribution of the magnetization changes from a wider single-mode to a bi-modal density with a suppressed tunneling probability between the two modes. This suppression is accentuated as the value of $\kappa$ increases.
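For concreteness, the action can be sketched in plain Python under the assumption that it takes the standard hopping-parameter form, $S = \sum_x [-2\kappa \sum_{\hat\mu} \phi(x)\phi(x+\hat\mu) + (1-2\lambda)\phi(x)^2 + \lambda\phi(x)^4]$, on a periodic two-dimensional lattice. The coupling values and the 4×4 lattice below are placeholders chosen for illustration, not the settings of the experiments.

```python
import random

def phi4_action(phi, kappa, lam):
    """Hopping-parameter phi^4 action on a periodic 2D lattice (list of rows)."""
    n_t, n_x = len(phi), len(phi[0])
    total = 0.0
    for t in range(n_t):
        for x in range(n_x):
            f = phi[t][x]
            # Sum over the forward neighbour in each of the two directions.
            forward = phi[(t + 1) % n_t][x] + phi[t][(x + 1) % n_x]
            total += -2.0 * kappa * f * forward + (1.0 - 2.0 * lam) * f * f + lam * f ** 4
    return total

rng = random.Random(0)
config = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(4)]
flipped = [[-v for v in row] for row in config]

s_config = phi4_action(config, kappa=0.3, lam=0.022)     # illustrative couplings
s_flipped = phi4_action(flipped, kappa=0.3, lam=0.022)
s_const = phi4_action([[1.0] * 4 for _ in range(4)], kappa=0.3, lam=0.022)
```

Flipping the sign of every field value leaves the action unchanged, which is the global sign symmetry whose spontaneous breaking produces the two modes of the magnetization.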
V.1 Free Energy Estimators
Our first numerical experiment analyzes the performance of two normalizing flows trained with both objectives described in section II.1. We refer to those flows as the forward-KL flow and the reverse-KL flow if they were trained with maximum likelihood or self-sampling, respectively. We train for a hopping parameter $\kappa$ such that the theory is in its broken phase, see fig. 5. For maximum likelihood training, we use 50M samples generated by an overrelaxed HMC. Following [2], we choose an architecture for the (reverse-KL) normalizing flows such that those models are manifestly invariant under the $\mathbb{Z}_2$ symmetry (blue). In order to highlight the effects of mode-collapse, we also train reverse-KL flows without the $\mathbb{Z}_2$ inductive bias (green), expecting these models to be prone to mode-collapse.
The reference estimates for the true free energy were obtained via HMC simulations. Similarly to the approach followed in [2], such estimates are obtained by discretizing the hopping parameter space so that free energy differences can be estimated via HMC along the trajectory. Those contributions are added up, integrating along the trajectory to the point at which the free energy needs to be estimated. Further technical details on how the deep generative models were trained and how the HMC reference values were obtained can be found in section .6.
As can be seen on the left-hand side of fig. 5, the forward-KL flow (orange) very closely reproduces the reference distribution by HMC (pink). For the reverse-KL trained flows, we see that for smaller systems (top row in fig. 5), leveraging the $\mathbb{Z}_2$ inductive bias leads to a good approximation (blue) while the non-$\mathbb{Z}_2$-equivariant flow (green) fails to capture both modes. For larger systems, instead, both equivariant and non-equivariant flows are not able to capture most of the support of the target density $p$, thereby resulting in poor approximations.
On the right-hand side of fig. 5, we estimate the free energy density of the system using different flows. We use the same color scheme as on the left-hand side and measure the free energy using both the p-estimator (circle), eq. 13, and the q-estimator (square), eq. 12. Our numerical results indeed agree with the theoretical prediction of theorem 2. Specifically, for the model trained with maximum likelihood, both estimators lead to predictions compatible with the HMC estimate. This is consistent with the left-hand side of the plot, which suggests that no mode-collapse took place for this model.
For the reverse-KL flow, however, such an agreement may not be expected as the left-hand side of fig. 5 shows a mismatch in the support. The right-hand side plots show that the non-$\mathbb{Z}_2$-equivariant flow (green) in the smaller-lattice (top row) case is dropping the left-hand mode while its $\mathbb{Z}_2$-equivariant counterpart (blue) covers both modes. Nonetheless, as the dimensionality increases, the density estimation task becomes increasingly challenging, thus preventing the $\mathbb{Z}_2$-equivariant reverse-KL flows from training effectively on the larger lattice. As a result, we find that the q-estimator overestimates the true value $F$ for both lattice sizes (top and bottom rows), while for the larger lattice (bottom row) the p-estimator substantially underestimates $F$, i.e. $\hat{F}_p < F < \hat{F}_q$, as predicted by theorem 2. This latter situation suggests that the reverse-KL flows (green and blue) are limited in approximating the target density $p$, resulting in very different effective supports. While strictly speaking this is not usually referred to as mode-collapse, it can be understood through the same lens.
These experiments thus illustrate that: a) a mode-covering objective such as the forward-KL is more resilient when the target density is multimodal and shows sparse effective support, and b) when the effective supports do not match, the p- and q-estimators of the free energy from section III give lower and upper bounds, respectively. Moreover, we note that training using a forward-KL objective does not worsen the performance compared to using a reverse KL. Practically, if the variational distribution presents some “fake” modes, s.t. , field configurations sampled in these regions will always be exponentially suppressed in the reweighting phase. We emphasize once again that a significant drawback of training using the pure form of the forward KL is the need for training samples. Although this limitation applies in general, it does not pose a problem for our specific task of estimating thermodynamic observables.
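The bounding behaviour of the two estimators can be illustrated with a one-dimensional toy problem. The sketch below is not the setup used in the experiments: it assumes the standard importance-weight forms of the estimators (the q-estimator as minus the log of the mean weight under the sampler, the p-estimator as the log of the mean inverse weight under the target), a hypothetical two-mode target with known normalization, and a sampler that has collapsed onto the right-hand mode. All names and the mode separation `A` are illustrative assumptions.

```python
import math
import random

A = 6.0  # half-distance between the two modes of the toy target (assumed value)
LOG_SQRT_2PI = 0.5 * math.log(2.0 * math.pi)


def log_p_tilde(x):
    """Unnormalized log-density of a symmetric two-mode toy target."""
    a = -0.5 * (x - A) ** 2
    b = -0.5 * (x + A) ** 2
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def log_q(x):
    """Normalized log-density of a mode-collapsed sampler centred on +A."""
    return -0.5 * (x - A) ** 2 - LOG_SQRT_2PI


def log_mean_exp(values):
    """Numerically stable log of the arithmetic mean of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))


def free_energy_q(n, rng):
    """q-estimator (assumed form): -log mean_{x~q} w(x), with w = p_tilde / q."""
    xs = [rng.gauss(A, 1.0) for _ in range(n)]
    return -log_mean_exp([log_p_tilde(x) - log_q(x) for x in xs])


def free_energy_p(n, rng):
    """p-estimator (assumed form): +log mean_{x~p} 1 / w(x)."""
    xs = [rng.gauss(rng.choice((-A, A)), 1.0) for _ in range(n)]
    return log_mean_exp([log_q(x) - log_p_tilde(x) for x in xs])


# Exact free energy of the toy target: Z = 2 * sqrt(2 * pi).
F_TRUE = -math.log(2.0 * math.sqrt(2.0 * math.pi))
```

In this toy case the q-estimator misses half of the probability mass and therefore lands close to `F_TRUE + log(2)`, while the p-estimator, fed with exact samples from the target, stays close to `F_TRUE`, so the true value is bracketed from below and above.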
We repeated this analysis for a number of values of the hopping parameter . The results are summarized in fig. 6 for the larger lattice with . We evaluate the gap between the neural importance sampling (NIS) estimate and the HMC reference, normalized by the total standard error. Namely, if the normalized gap is within the range , both estimators are compatible (see inset in the top plot of fig. 6). Dashed curves connect q-estimates (eq. 12), while solid curves connect p-estimates (eq. 13), of the free energy at different values of . The results obtained with equivariant reverse-KL and forward-KL flows are shown in blue and orange, respectively. The inset shows close agreement of both estimators and both flow models for . However, deep in the broken phase, e.g. , the two modes of the target distribution start to lie further apart, resulting in a failure of the mode-seeking objective, i.e. the reverse-KL, to properly capture the target density, see also the bottom left plot of fig. 5. As a result, the probability mass transport induced by the normalizing flow fails to reproduce the correct target distribution, leading to a larger gap between the p- and q-estimators. When using the forward-KL trained flow instead, the support of the sampler closely matches the support of the target, making the free energy estimates compatible with the HMC reference even at the higher values of the hopping parameter , when eq. 17 holds. This effect is shown in the inset, where there is good agreement between both estimators of the free energy. This observation suggests that the mode-covering nature of the forward-KL is crucial to ensure that the flow leads to unbiased estimates of physical observables.
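The compatibility check described above amounts to a few lines of arithmetic. The following sketch assumes that the "total standard error" is the quadrature sum of the two individual standard errors, which is appropriate only if the NIS and HMC estimates are independent; the function names are hypothetical, and the acceptance threshold is left as an explicit argument.

```python
import math


def normalized_gap(f_nis, err_nis, f_hmc, err_hmc):
    """Gap between the NIS and HMC estimates in units of the combined error.

    Assumption: the two errors are independent and add in quadrature.
    """
    total_err = math.hypot(err_nis, err_hmc)
    return (f_nis - f_hmc) / total_err


def compatible(gap, threshold):
    """True if the normalized gap lies within [-threshold, threshold]."""
    return abs(gap) <= threshold
```

For example, `normalized_gap(1.0, 0.3, 0.0, 0.4)` evaluates to a gap of two combined standard errors.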
In fig. 6, it is also shown that our proposed mode-dropping estimator (section III.1) correlates well with the observed gap in the free energy estimation. Lastly, we use our estimator to evaluate the support mismatch of forward and reverse flow models trained at several values and different lattice sizes, as shown in fig. 7. The top and bottom plots refer to lattices of size and respectively. These results demonstrate that the quality of the sampler very quickly deteriorates in the broken phase due to mode-collapse for the model trained by self-sampling. This is not the case for models trained with the forward KL. Indeed, as shown in fig. 7, these models scale significantly better in the volume of the system. Furthermore, the non-equivariant reverse-KL flow (green) is manifestly mode dropping for , see fig. 5, with values of around for values . This agrees with the left-hand side of fig. 5, where only half of the support is covered by the learned variational density in the top row.
VI Outlook and Summary
Mode-collapse presents a significant limitation to flow-based sampling on the lattice because it may lead to inaccurate approximations of the target density, either partially or completely. Intuitively, it can be understood as being loosely related to the tunneling problem in local MCMC algorithms. Specifically, the algorithmic challenges in sampling from multi-modal distributions are shifted from the sampling to the training phase for normalizing flows. In this work, we have studied this important limitation of flow-based sampling in great detail. We argue that in the important case of thermodynamic observables, there are practical and theoretically grounded mitigation strategies available. Specifically, the flow can be trained using the forward KL divergence and the free energy can be evaluated with two estimators that bound the true value. Furthermore, we have analyzed mode-mismatch theoretically and derived a bound on its induced bias as well as a quantitative measure for its severity. Normalizing flows are currently limited to toy models. Encouragingly, we also observed, as a side-product of our analysis, that the forward KL objective leads to better scaling in the system size. This observation may be worth studying further in future work.
Acknowledgements
The authors thank the referee for stimulating discussions and useful suggestions that significantly improved the manuscript. K.A.N., C.J.A., S.N., and P.K. are supported by the German Ministry for Education and Research (BMBF) as BIFOLD - Berlin Institute for the Foundations of Learning and Data under the grant BIFOLD23B. K.A.N. has been partially supported by the Einstein Research Unit Quantum (ERU) Project under grant ERU-2020-607. This work is supported with funds from the Ministry of Science, Research, and Culture of the State of Brandenburg within the Centre for Quantum Technologies and Applications (CQTA). This work is funded by the European Union’s Horizon Europe Framework Program (HORIZON) under the ERA Chair scheme with grant agreement No. 101087126. This work is funded by the European Union’s HORIZON MSCA Doctoral Networks programme and the AQTIVATE project (101072344). The authors acknowledge Lena Funcke and Paolo Stornati for helpful discussions.
Appendix
.1 Forward-KL training
Training a normalizing flow with forward-KL in the context of lattice field theory requires pre-generated samples at a given point in parameter space. Before training a flow model, one should instantiate a thermalized Markov chain at a fixed value of the coupling parameters and generate a sufficient number of Monte-Carlo configurations, which are then used to train the flow. A pseudo-code for this approach is presented in algorithm 1. We note that practically this approach may not always be feasible. For example, the number of pre-generated configurations needed for training a flow to an acceptable accuracy increases as the size of the lattice grows. For instance, training a flow for a lattice in the context of the field theory, in the broken phase, already requires more than fifty million samples. This problem, therefore, limits the practical deployment of forward-KL training schemes at larger scales. Moreover, another limitation of such an approach is that generating samples with HMC may not always be possible. Indeed, in the proximity of a phase transition, long-range autocorrelations will prevent one from generating a sufficiently large number of uncorrelated samples in a reasonable amount of time. One would therefore need to be very careful in generating a suitable dataset of HMC configurations to avoid incorporating any additional unwanted bias when training the flow.
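The two-stage procedure (generate samples first, then train by maximum likelihood) can be sketched on a deliberately small example. This is not the algorithm used in the paper: the stored HMC configurations are replaced by exact draws from a hypothetical two-mode one-dimensional target, and the flow is a single affine map of a standard normal. The sketch only shows that minimizing the forward KL over samples is the same as minimizing the mean negative log-likelihood, and that the resulting model spreads over both modes.

```python
import math
import random


def make_dataset(n, rng):
    """Stand-in for stored HMC configurations: a symmetric two-mode 1D target."""
    return [rng.choice((-2.0, 2.0)) + rng.gauss(0.0, 1.0) for _ in range(n)]


def train_forward_kl(data, steps=3000, batch_size=100, lr=0.05, seed=1):
    """Fit the affine flow x = mu + exp(s) * z, z ~ N(0, 1), by maximum likelihood.

    Per-sample loss (up to a constant): 0.5 * ((x - mu) / sigma)^2 + s,
    with sigma = exp(s). Its sample mean estimates the forward KL(p || q).
    """
    rng = random.Random(seed)
    mu, s = 1.5, 0.0  # deliberately off-centre, too-narrow initialization
    for _ in range(steps):
        batch = rng.sample(data, batch_size)  # minibatch of stored samples
        inv_var = math.exp(-2.0 * s)
        # Analytic gradients of the mean loss with respect to mu and s.
        grad_mu = -sum(x - mu for x in batch) * inv_var / batch_size
        grad_s = 1.0 - sum((x - mu) ** 2 for x in batch) * inv_var / batch_size
        mu -= lr * grad_mu
        s -= lr * grad_s
    return mu, math.exp(s)
```

With this toy target the fit converges towards a mean near zero and a standard deviation near `sqrt(5)`, i.e. a broad density covering both modes rather than a narrow one sitting on a single mode.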
.2 Relative effective support
Let be the open ball centered at with radius . A point is called -dropped if and only if
[TABLE]
By the Lebesgue differentiation theorem, this implies that holds for for almost every -dropped , and if and are continuous, it actually means . We recall the definition of effective relative support eq. 14
[TABLE]
Setting and assuming means that the importance weighted estimator with samples lacks a contribution from the mass with probability
[TABLE]
.3 Proof of Theorem 2
Theorem.
Suppose the trained model is mode dropping, i.e., the approximation (17) holds. Then the -estimator and the -estimator for the free energy approximate and , respectively, being bounds on the true free energy as
[TABLE]
Furthermore, if it follows
[TABLE]
and similarly if
[TABLE]
Proof.
From the definition of the free energy we first note that is equivalent to . Using the fact that , we obtain (footnote 2: to make the notation more compact, in the integrals, we drop the subscript in the effective supports of both and )
[TABLE]
where the last inequality holds because . Thus, we conclude with the corollary that implies equality .
Similarly,
[TABLE]
shows , in general, and given .
Hence, by combining the inequalities we can conclude
[TABLE]
∎
.4 Bound on the configuration
Let us assume to be polynomially bounded
[TABLE]
with non-negative coefficients . The left-hand and right-hand sides represent the lower and the upper bounds on the action . One needs to find appropriate coefficients such that the inequalities are satisfied. We now perform a foliation of the sampling space
[TABLE]
which can be seen as a re-distribution of the lattice configurations into infinitely many buckets labeled by the index . Combining eq. 36 and eq. 37 one can rewrite a condition on the norm of (footnote 3: we take the -norm for the field configuration and drop the subscript for notational convenience; we implicitly assume ). For a configuration
[TABLE]
This implies
[TABLE]
for . On the other hand, it follows that
[TABLE]
which implies that
[TABLE]
for . Combining eq. 39 and eq. 41 we obtain the following bounds on the norm of the lattice configuration
[TABLE]
In particular, this can be shown to imply a bound on the volume of , e.g., making the volume finite
[TABLE]
where we used the volume of the ball,
[TABLE]
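For lattice-sized dimensions the ball volume is far outside the floating-point range, so it is convenient to evaluate it in log space. The sketch below uses the standard closed form for the volume of a Euclidean ball, pi^(d/2) R^d / Gamma(d/2 + 1); the Euclidean norm and the argument names `dim` and `radius` are assumptions made here for illustration.

```python
import math


def log_ball_volume(dim, radius):
    """Log-volume of a Euclidean ball of the given radius in `dim` dimensions.

    Uses log V = (dim / 2) * log(pi) + dim * log(radius) - lgamma(dim / 2 + 1).
    """
    return (0.5 * dim * math.log(math.pi)
            + dim * math.log(radius)
            - math.lgamma(0.5 * dim + 1.0))
```

The formula reproduces the familiar low-dimensional cases (area `pi` of the unit disc, volume `4 * pi / 3` of the unit sphere), while for hundreds of dimensions the unit-ball log-volume becomes strongly negative.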
.5 Proof of Theorem 3
We now leverage the result from section .4 to derive a bound on the bias for the general observable when and therefore the importance sampling estimator may not be unbiased when eq. 17 holds true.
Theorem.
Let the action of the theory and the observable be polynomially bounded, i.e.
[TABLE]
for and
[TABLE]
for . Then, the bias between the estimated observable and the true value is given by
[TABLE]
where .
Proof.
Let’s assume the generic observable to be polynomially bounded
[TABLE]
When , eq. 46 can be bounded as
[TABLE]
where is the coefficient defined for each bucket, i.e.,
[TABLE]
such that the following relation holds
[TABLE]
One, therefore, concludes that the bias is bounded by the following series
[TABLE]
In order to obtain convergence of this series, we observe that polynomial boundedness of the observable implies
[TABLE]
i.e., grows polynomially in . Similarly, from eq. 52, it follows
[TABLE]
showing that decays exponentially in . Thus, decays exponentially in . This implies convergence of the series in the right-hand side of eq. 54.
It is important to note that each is weighted by the maximum of the observable on the corresponding volume which makes the bias inherently dependent on the observable while the coefficients are universal and represent the amount of mode dropping per bucket. ∎
As an explicit example, let us consider and the observable . This means that the true value of the observable is
[TABLE]
If we assume a mode dropping model , with , then
[TABLE]
For the definition of the , we can choose and thus and obtain . We note that since and the observable is it follows that
[TABLE]
Hence, the bias is within the bound given by the theorem
[TABLE]
.6 Details on the Numerical Experiments
In the following, we summarize the details and setup used to perform the training of both forward- and reverse-KL normalizing flows as well as to estimate the HMC reference values. For our experiments, we focused on the action from section V as a function of while keeping the coupling fixed throughout the analysis.
.6.1 HMC sampling
For estimating the HMC reference values of the free energy density reported in section V, we followed the same approach as in [2]. The idea is to discretize the trajectory in the -space (hopping parameter) into a sequence of finite steps where free energy differences can be calculated by running an HMC. The target free energy at an arbitrary point is then obtained by summing up all the free energy differences from to . We note that the higher the kappa values, the more steps one needs to make in order to discretize the trajectory up to the target point in parameter space. It follows that the uncertainty on the estimates also grows when increases as more terms are combined to obtain the free energy at the desired . Specifically, in our experiments, we chose a regular step-size between two subsequent values in the trajectory to be . Such step-size is used to discretize the trajectory starting from all the way up to the target. For instance, measuring the free energy density at would therefore require thirty steps, hence 30 independent HMC chains. Each of these chains is initialized around the vacuum expectation value (vev), has an overrelaxation every 10 steps, and a total of 10k thermalization steps, i.e. discarded configuration updates, followed by 500k sampling steps. Those configurations, from the equilibrium distribution, are used to estimate the free energy difference at a single given point of the trajectory. The total number of HMC samples needed to estimate the free energy at an arbitrary point thus depends on . For instance, referring to the previous example, 30 chains with 500k steps each add up to 15M HMC samples.
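The bookkeeping behind this procedure can be made explicit with a short sketch. It assumes one independent chain per trajectory step, as described above, and combines the per-step errors in quadrature, which is an assumption that holds only for independent chains. The function names and the reference value `f_start` at the starting point of the trajectory are hypothetical.

```python
import math


def hmc_budget(n_steps, sampling_steps=500_000, thermalization_steps=10_000):
    """Return (stored samples, total updates) for one chain per trajectory step."""
    stored = n_steps * sampling_steps
    total = n_steps * (sampling_steps + thermalization_steps)
    return stored, total


def accumulate_free_energy(f_start, deltas, errors):
    """Sum free-energy differences along the trajectory.

    Errors are added in quadrature (independent chains assumed), so the
    uncertainty grows with the number of steps combined.
    """
    f_target = f_start + sum(deltas)
    err = math.sqrt(sum(e * e for e in errors))
    return f_target, err
```

For the thirty-step example, `hmc_budget(30)` gives 15M stored samples, matching the count quoted above, plus 300k discarded thermalization updates.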
.6.2 Reverse-KL Flow
To train the reverse-KL normalizing flows we followed the same strategy presented in [2] with the same setup of hyperparameters. We used a batch size of 8k samples and a learning rate update according to the ReduceLROnPlateau scheduler of PyTorch with an initial learning rate of and patience of 3k steps. The flows have the same number of coupling blocks and the same type of checkerboard partitioning discussed in [2]. Models were trained for 700k steps in total and the last saved checkpoint is used for sampling. Every reverse-KL model was trained on two GPUs (in parallel), either P100 or A100 NVIDIA devices. Depending on the lattice volume and the model type the training took up to 50hrs of wall time.
.6.3 Forward-KL Flow
Training a forward-KL flow requires a different procedure which was outlined in section .1. For every flow model, we used 50M pre-sampled HMC configurations as input data. These were sampled in batches of 100 independent HMC chains each of which had 10k equilibration (discarded) and 500k sampling (stored) steps. The stored configurations from each chain in the batch were concatenated to generate the full training set.
During training, the 50M configurations are loaded in batches of 8k samples per iteration (training step). When the entire dataset has been processed once, the full set of configurations is reshuffled and reused (as is standard practice in deep learning) until the desired number of training iterations is reached. Again, forward-KL models were trained for 700k steps on two GPUs (in parallel), either P100 or A100 NVIDIA devices. Depending on the lattice volume and the model type, the training took up to 55 hours of wall time.
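The data schedule implied by these numbers can be checked with a small sketch. It reads "8k" as exactly 8,000 and "50M" as exactly 50,000,000, which is an assumption; the helper names are hypothetical and the generator only yields index batches, not the configurations themselves.

```python
import random


def epoch_stats(n_configs=50_000_000, batch_size=8_000, n_steps=700_000):
    """Return (training steps per pass over the data, number of passes)."""
    steps_per_epoch = n_configs // batch_size
    return steps_per_epoch, n_steps / steps_per_epoch


def reshuffled_batches(n_configs, batch_size, n_steps, seed=0):
    """Yield index batches, reshuffling whenever the dataset is exhausted."""
    rng = random.Random(seed)
    order = list(range(n_configs))
    pos = n_configs  # forces a shuffle before the first batch
    for _ in range(n_steps):
        if pos + batch_size > n_configs:
            rng.shuffle(order)
            pos = 0
        yield order[pos:pos + batch_size]
        pos += batch_size
```

Under these assumptions one pass over the data takes 6,250 steps, so 700k training steps correspond to 112 passes over the stored configurations.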
.6.4 Flow sampling
For sampling configurations from both forward- and reverse-KL normalizing flows we proceed as follows. In order to have a fair comparison with HMC one would need to sample as many configurations as those needed to integrate the trajectory in the hopping parameter space, as discussed in section .6.1. However, for our flow estimates, we took only 1M configurations and used the estimators for mean and variance introduced in [24, 2] and proposed in section III.1. Though 1M is in general a lower bound on the total amount of configurations used to compute HMC estimates, we empirically observed this was sufficient to obtain estimates with errors several orders of magnitude smaller than HMC. Therefore, we took this as a sufficient number of samples for comparing the two sampling approaches.
- [1] M. S. Albergo, G. Kanwar, and P. E. Shanahan, Phys. Rev. D 100, 034515 (2019).
- [2] K. A. Nicoli, C. J. Anders, L. Funcke, T. Hartung, K. Jansen, P. Kessel, S. Nakajima, and P. Stornati, Phys. Rev. Lett. 126, 032001 (2021).
- [3] D. C. Hackett, C.-C. Hsieh, M. S. Albergo, D. Boyda, J.-W. Chen, K.-F. Chen, K. Cranmer, G. Kanwar, and P. E. Shanahan, arXiv e-prints (2021), arXiv:2107.00734.
- [4] K. A. Nicoli, C. J. Anders, L. Funcke, T. Hartung, K. Jansen, P. Kessel, S. Nakajima, and P. Stornati, in 38th International Symposium on Lattice Field Theory (2021), arXiv:2111.11303 [hep-lat].
- [5] M. Gerdes, P. de Haan, C. Rainone, R. Bondesan, and M. C. Cheng, arXiv e-prints (2022), arXiv:2207.00283.
- [6] M. Caselle, E. Cellini, A. Nada, and M. Panero, JHEP 2022 (7), 15.
- [7] M. Caselle, E. Cellini, A. Nada, and M. Panero, arXiv e-prints (2022), arXiv:2210.03139.
- [8] A. Singha, D. Chakrabarti, and V. Arora, Phys. Rev. D 107, 014512 (2023).
