Faster Inference of Flow-Based Generative Models via Improved Data-Noise Coupling
Aram Davtyan, Leello Tadesse Dadi, Volkan Cevher, Paolo Favaro

TL;DR
LOOM-CFM improves data-noise coupling in flow-based models by optimizing across minibatches, leading to faster, higher-quality image and video generation, and better high-resolution synthesis.
Contribution
The paper introduces LOOM-CFM, a novel method extending minibatch optimal transport to optimize data-noise assignments across training, enhancing inference speed and quality.
Findings
Consistent speed-quality improvements across datasets
Enhanced high-resolution synthesis capabilities
Better initialization for distillation processes
Abstract
Conditional Flow Matching (CFM), a simulation-free method for training continuous normalizing flows, provides an efficient alternative to diffusion models for key tasks like image and video generation. The performance of CFM in solving these tasks depends on the way data is coupled with noise. A recent approach uses minibatch optimal transport (OT) to reassign noise-data pairs in each training step to streamline sampling trajectories and thus accelerate inference. However, its optimization is restricted to individual minibatches, limiting its effectiveness on large datasets. To address this shortcoming, we introduce LOOM-CFM (Looking Out Of Minibatch-CFM), a novel method to extend the scope of minibatch OT by preserving and optimizing these assignments across minibatches over training time. Our approach demonstrates consistent improvements in the sampling speed-quality trade-off across…
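The idea of keeping noise-data assignments and refining them across minibatches can be illustrated with a small sketch. This is not the authors' implementation. It assumes a fixed pool of pre-sampled noise (one vector per data point), a squared Euclidean cost, and an exact assignment solver inside each minibatch. The names `total_cost`, `reassign_minibatch`, and `assignment` are hypothetical, and the flow-matching training step itself is omitted.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def total_cost(data, noise, assignment):
    """Sum of squared distances between each data point and its assigned noise."""
    return float(np.sum((data - noise[assignment]) ** 2))


def reassign_minibatch(data, noise, assignment, batch_idx):
    """Solve the assignment problem inside one minibatch and store the result.

    Assumption: `assignment[i]` holds the index of the noise vector currently
    paired with data point i, and it persists between training steps.
    """
    noise_idx = assignment[batch_idx]  # noise currently paired with this batch
    diff = data[batch_idx][:, None, :] - noise[noise_idx][None, :, :]
    cost = np.sum(diff ** 2, axis=-1)  # pairwise squared Euclidean cost
    rows, cols = linear_sum_assignment(cost)  # optimal pairing within the batch
    assignment[batch_idx[rows]] = noise_idx[cols]  # write the new pairing back


# Toy data: hypothetical sizes chosen only for illustration.
rng = np.random.default_rng(0)
n, dim, batch_size = 256, 2, 32
data = rng.normal(loc=3.0, size=(n, dim))
noise = rng.normal(size=(n, dim))  # fixed pool of pre-sampled noise
assignment = rng.permutation(n)  # arbitrary initial one-to-one coupling

costs = [total_cost(data, noise, assignment)]
for step in range(200):
    batch_idx = rng.choice(n, size=batch_size, replace=False)
    reassign_minibatch(data, noise, assignment, batch_idx)
    # A real training loop would now regress the velocity field on the
    # pairs (data[batch_idx], noise[assignment[batch_idx]]).
    costs.append(total_cost(data, noise, assignment))
```

In this sketch each minibatch update only permutes the noise indices already held by the batch, so `assignment` stays one-to-one. The identity permutation is always a feasible solution of the local problem, so the global cost cannot increase from one step to the next.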
Peer Reviews
Decision·ICLR 2025 Poster
* This paper has a clear focus and motivation. Coupling methods with little computational cost that can improve the performance of CFMs, especially in the low-step regime, have the potential to become widely adopted. The proposed method is scalable and the extra I/O cost seems reasonable.
* The method is clearly presented and natural. While the method remains unable to guarantee an optimal pairing (a fact which holds for any method which does not scan the entire dataset at once), it provides a r
* It seems somewhat limited to use only a fixed set of pre-sampled noise to train the model, in contrast to the typical approach of drawing fresh noise at every iteration. I would guess that the need for using caching for small datasets is related to the fact that the noise is not refreshed, leaving the method with a bit of a loose end. The work claims that in practice this is not an issue for large enough datasets. Nonetheless, I am curious if it is possible to define a variant of the method wh
This paper introduces an iterative approach to more accurately approximate the global optimal assignment between noise and data samples, resulting in a more precise estimation of the global OT plan compared to minibatch-CFM. It also presents a convergence analysis of LOOM-CFM.
1. The paper lacks an analysis of the training time overhead and convergence rate of LOOM-CFM compared to minibatch-CFM. Additionally, while finite convergence is intuitive, providing a detailed analysis of the convergence rate is crucial, especially to understand the scalability of this method for large-scale text-to-image (T2I) and text-to-video (T2V) diffusion model training.
2. While the use of multiple noise caches empirically helps reduce overfitting, the one-to-many correspondence may aff
**Originality**: this paper introduces an alternative (LOOM) to the Hungarian algorithm or Sinkhorn for approximating the OT coupling between two empirical measures. LOOM is suitable for conditional flow matching / minibatch flow matching in the sense that it uses local OT coupling to improve the global OT coupling.
**Clarity**: method and experiment results are clearly presented. I had no problem following the writing.
**Quality**: proposed method is supported with both theory and experime
**Weak Performance** : the only weakness of the paper stopping me from giving Accept is the weak empirical performance of the proposed method compared to relevant baselines such as [1] and [2] (which also happen to be missing from Section 4). For instance, [1] achieves 1.97 FID on CIFAR10 with 35 NFEs, whereas LOOM achieves 4.41 FID with 134 NFEs. On ImageNet-64, [2] achieves around 1.4 FID with 63 NFEs while LOOM yields 2.75 FID with 133 NFEs. [1] Elucidating the Design Space of Diffusion-Base
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Model Reduction and Neural Networks · Lattice Boltzmann Simulation Studies
