Fast and Unified Path Gradient Estimators for Normalizing Flows
Lorenz Vaitl, Ludwig Winkler, Lorenz Richter, Pan Kessel

TL;DR
This paper introduces a fast, scalable path gradient estimator for normalizing flows that reduces variance, improves training efficiency, and is applicable to maximum likelihood training, enhancing performance in scientific applications.
Contribution
The authors develop a computationally efficient path gradient estimator applicable to all practical normalizing flow architectures, enabling scalable maximum likelihood training with variance reduction.
Findings
Significantly improved computational efficiency of path gradient estimators.
Reduced variance in gradient estimates across multiple applications.
Enhanced performance in natural sciences tasks using the new estimator.
Abstract
Recent work shows that path gradient estimators for normalizing flows have lower variance compared to standard estimators for variational inference, resulting in improved training. However, they are often prohibitively more expensive from a computational point of view and cannot be applied to maximum likelihood training in a scalable manner, which severely hinders their widespread adoption. In this work, we overcome these crucial limitations. Specifically, we propose a fast path gradient estimator which improves computational efficiency significantly and works for all normalizing flow architectures of practical relevance. We then show that this estimator can also be applied to maximum likelihood training for which it has a regularizing effect as it can take the form of a given target energy function into account. We empirically establish its superior performance and reduced variance for…
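To make the variance claim concrete, the following is a minimal sketch. It is not the paper's fast estimator. It assumes a one-dimensional affine flow x = mu + sigma * z with a Gaussian target, and it uses hand-derived gradients of the reverse KL. The function name `reverse_kl_gradients` and all of its parameters are illustrative. The standard reparameterized gradient equals the path gradient plus a score term whose expectation is zero. Dropping that term leaves the expected gradient unchanged, and the per-sample path gradient vanishes exactly when the flow matches the target.

```python
import numpy as np

def reverse_kl_gradients(mu, sigma, m, s, z):
    """Per-sample gradients of E_z[log q(x) - log p(x)] w.r.t. (mu, sigma).

    Assumed toy setup (not from the paper): q = N(mu, sigma^2) written as the
    flow x = mu + sigma * z with z ~ N(0, 1), and target p = N(m, s^2).
    Returns two arrays of shape (len(z), 2): standard and path gradients.
    """
    x = mu + sigma * z
    dlogp_dx = -(x - m) / s**2          # derivative of log p with respect to x
    dlogq_dx = -(x - mu) / sigma**2     # derivative of log q with respect to x, parameters held fixed

    # Standard reparameterized gradient: total derivative through x and theta.
    # Here log q(x(z)) = -z^2/2 - log(sigma) + const, so only sigma contributes.
    std_mu = -dlogp_dx
    std_sigma = -1.0 / sigma - z * dlogp_dx

    # Path gradient: differentiate only through the sample path x(theta),
    # using dx/dmu = 1 and dx/dsigma = z.
    delta = dlogq_dx - dlogp_dx
    path_mu = delta
    path_sigma = z * delta

    standard = np.stack([std_mu, std_sigma], axis=1)
    path = np.stack([path_mu, path_sigma], axis=1)
    return standard, path

# Flow equal to the target: compare per-sample variances of both estimators.
rng = np.random.default_rng(0)
z = rng.standard_normal(100_000)
standard, path = reverse_kl_gradients(mu=0.0, sigma=1.0, m=0.0, s=1.0, z=z)
print("per-sample variance, standard:", standard.var(axis=0))
print("per-sample variance, path:    ", path.var(axis=0))
```

Away from the optimum both estimators average to the same gradient, so the choice between them in this toy setting affects only variance, not bias.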
Peer Reviews
Decision: ICLR 2024 poster
+ Fast pathwise gradients are certainly necessary for normalizing flows, and the current work provides this with a large improvement over the prior work in terms of computational speed.
+ The method improves in both walltime and efficiency.
+ The method allows both forward and reverse KL training.
- The literature review is a bit misleading, as pathwise gradients have been around for a long time, e.g., see [L'Ecuyer, P. (1991). An overview of derivative estimation] where it is referred to as "infinitesimal perturbation analysis". Moreover, reparameterization gradients are a type of pathwise gradient, and there are other works discussing it, e.g., [Jankowiak & Obermeyer, 2018] or [Parmas & Sugiyama, 2021]. The current work is mainly referring to pathwise gradients in the context of normaliz
The paper is technically precise and, to my knowledge, presents valuable original work with immediate applications. The experiments were generally informative. Its major contribution is reducing the computational complexity for calculating path gradients of both forward and reverse KL when $\log p(x) + c$ is queryable. The theoretical results appear sound after some inspection. I believe the overall contribution is valuable enough to share with the broader ICLR community, though I was surprised t
I had some difficulty reading this work, despite some prior exposure to the subject matter. It took me several passes to make sense of what the key contribution was, and I wished for additional clarity. The key idea behind "path gradients" (dropping a term that has zero expectation value) from the empirical estimation of the gradient is easy enough to understand, but took some time to distill from the intro [1]. Regarding the experiments, at least one sentence introducing effective sample size
- The method obtains significant improvement in speed in practice, especially for the case of flows that require implicit differentiation for inversion.
- The method obtains improved generalization for the forward KL training relative to
- Incorporating the energy function of the target in the forward KL training is novel. And having a loss with the “sticking the landing” property for the forward KL is useful.
- The speedup for explicitly invertible flows (which are more common) is relatively minor.
- The authors emphasise that an advantage of their method relative to those from Vaitl et al. for the estimation of the forward KL is that their method does not require reweighting. However, their method uses samples from the target, while the method from Vaitl et al. uses samples from the flow - hence the two methods are not directly comparable as they are for different situations. I think this is somewh
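One of the reviews above asks for a sentence introducing effective sample size. As a hedged illustration, the sketch below assumes the common normalized Kish form, ESS = (sum of w)^2 / (N * sum of w^2), computed from log importance weights log w = log p(x) - log q(x). The function name `normalized_ess` is illustrative, and the paper's exact definition may differ.

```python
import numpy as np

def normalized_ess(log_w):
    """Normalized Kish effective sample size, in (0, 1].

    Assumed definition (may differ from the paper's): log_w holds unnormalized
    log importance weights; a value of 1.0 means all weights are equal.
    """
    log_w = np.asarray(log_w, dtype=float)
    # Subtract the maximum for numerical stability; the ratio is scale invariant.
    w = np.exp(log_w - log_w.max())
    return w.sum() ** 2 / (len(w) * (w ** 2).sum())

# Equal weights give the maximum value; a single dominant weight gives about 1/N.
print(normalized_ess(np.zeros(4)))
print(normalized_ess([0.0, -1000.0, -1000.0, -1000.0]))
```

Because the ratio is invariant to rescaling the weights, an unknown normalizing constant of the target does not need to be estimated to compute it.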
Taxonomy
Topics: Fluid Dynamics and Turbulent Flows · Advanced Image Processing Techniques · Model Reduction and Neural Networks
Methods: Normalizing Flows
