Terminal Velocity Matching
Linqi Zhou, Mathias Parger, Ayaan Haque, Jiaming Song

TL;DR
Terminal Velocity Matching (TVM) is a generalization of flow matching for one- and few-step generation. It models the transition between any two diffusion timesteps and regularizes that transition at its terminal time rather than its initial time, reaching 3.29 FID with 1 NFE and 1.99 FID with 4 NFEs on ImageNet-256x256.
Contribution
TVM introduces a new framework for high-fidelity one- and few-step generative modeling, with architectural modifications for stability and a fused attention kernel for efficiency.
Findings
Achieves 3.29 FID with 1 NFE on ImageNet-256x256
Achieves 1.99 FID with 4 NFEs on ImageNet-256x256
State-of-the-art performance among one- and few-step models trained from scratch
Abstract
We propose Terminal Velocity Matching (TVM), a generalization of flow matching that enables high-fidelity one- and few-step generative modeling. TVM models the transition between any two diffusion timesteps and regularizes its behavior at its terminal time rather than at the initial time. We prove that TVM provides an upper bound on the 2-Wasserstein distance between data and model distributions when the model is Lipschitz continuous. However, since Diffusion Transformers lack this property, we introduce minimal architectural changes that achieve stable, single-stage training. To make TVM efficient in practice, we develop a fused attention kernel that supports backward passes on Jacobian-Vector Products, which scale well with transformer architectures. On ImageNet-256x256, TVM achieves 3.29 FID with a single function evaluation (NFE) and 1.99 FID with 4 NFEs. It similarly achieves…
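As a hedged illustration of the abstract's central idea (a model of the transition between two timesteps, constrained at its terminal time), the sketch below uses a toy linear ODE whose exact jump is known in closed form. It checks numerically that the derivative of the jump with respect to its end time equals the velocity field evaluated at the endpoint, and that one jump and four chained jumps agree when the jump is exact. The names `velocity`, `flow_map`, `terminal_velocity`, and `sample`, the constant `A`, and the convention that time runs from 1 down to 0 are all assumptions made for illustration; none of this is the paper's network or training loss.

```python
import math

A = -0.7  # hypothetical rate constant of the toy velocity field


def velocity(x, t):
    """Toy velocity field u(x, t) = A * x (it happens not to depend on t)."""
    return A * x


def flow_map(x, t, s):
    """Exact jump from time t to time s along dx/dt = A * x."""
    return x * math.exp(A * (s - t))


def terminal_velocity(x, t, s, eps=1e-6):
    """Central finite difference of the jump with respect to its end time s."""
    return (flow_map(x, t, s + eps) - flow_map(x, t, s - eps)) / (2 * eps)


def sample(x, times):
    """Chain jumps along a time grid; each jump counts as one function evaluation."""
    for t, s in zip(times[:-1], times[1:]):
        x = flow_map(x, t, s)
    return x


x_start = 2.0

# Terminal velocity condition: d/ds flow_map(x, t, s) == u(flow_map(x, t, s), s).
lhs = terminal_velocity(x_start, 1.0, 0.3)
rhs = velocity(flow_map(x_start, 1.0, 0.3), 0.3)

# One jump (1 NFE) versus four chained jumps (4 NFEs) over the same interval.
one_step = sample(x_start, [1.0, 0.0])
four_step = sample(x_start, [1.0, 0.75, 0.5, 0.25, 0.0])

print(f"terminal velocity: {lhs:.6f}  velocity at endpoint: {rhs:.6f}")
print(f"1 NFE: {one_step:.6f}  4 NFEs: {four_step:.6f}")
```

Because the toy jump is exact, the one-step and four-step results coincide. A learned jump is only approximate, so extra steps can still change the result, which is consistent with the abstract reporting different FID at 1 and 4 NFEs.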
Peer Reviews
Decision: ICLR 2026 Poster
1. TVM reframes the problem of learning long-horizon ODE jumps as a terminal velocity condition (Eq. 6–7), providing a clean theoretical link between displacement error and velocity matching. 2. Theorem 1 establishes a distribution-level guarantee ($W_2$ upper bound) without requiring multiple particles (unlike IMM). 3. Duality with MeanFlow is clearly articulated: The paper shows MeanFlow matches initial velocity while TVM matches terminal velocity (Appendix E.1), offering a compelling symmetry
1. While inference is fast, training cost (FLOPs, GPU-hours) vs. MeanFlow, sCT, or IMM is not reported.
1. The theoretical formulation is elegant and well-motivated, linking Flow Matching to a Wasserstein upper bound through terminal velocity constraints. 2. The method achieves excellent efficiency–quality trade-offs, outperforming existing one-step and few-step baselines (e.g., Consistency Models, MeanFlow) on ImageNet-256. 3. The architectural refinements (semi-Lipschitz normalization, FlashAttention JVP) are practically valuable contributions that could generalize to other diffusion or flow-bas
1. While theoretically grounded, the intuition behind “terminal velocity” could be elaborated further—especially how it differs in practice from midpoint or integral matching. 2. The scope of evaluation is limited to class-conditional ImageNet-256. Demonstrating robustness on higher-resolution or unconditional datasets (e.g., ImageNet-512, COCO) would strengthen generality claims. 3. The paper relies on a single architecture (DiT-XL/2). It is unclear whether TVM’s benefits extend to U-Net–based
1. **Boundary condition at terminal time** By enforcing velocity matching at the terminal rather than initial timestep, the method avoids evaluating JVPs involving guided velocities that often exhibit large norms and high variance during training. This could be more beneficial for stabilizing training when scaling to larger dimensionality, where the guided velocities often exhibit even larger norms and higher variance. 2. **Simple and stable training recipe** The paper achieves stable one-sta
1. **Backpropagation through JVP.** Unlike prior continuous-time consistency models where JVP terms are detached from the gradient graph (sCM, MeanFlow, etc.), TVM explicitly backpropagates through the JVP term, which introduces additional computational cost. This could become prohibitive for large-scale models. Providing quantitative analysis (e.g., runtime, memory, or gradient-cost overhead relative to MeanFlow) would help address this concern. 2. **Insufficient ablations.** While the design choice
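The distinction the reviewer draws between differentiating through a JVP term and detaching it can be made concrete with a toy example. The sketch below assumes a displacement parametrization f(x, t, s) = x + (s - t) * F(x, s), so that the derivative of f with respect to s is F + (s - t) * dF/ds, where dF/ds is a JVP with unit tangent in s. The one-parameter "network" F(x, s) = theta * x * s, the squared-residual loss, and every function name here are hypothetical; this is not the loss of TVM, MeanFlow, or sCM, only an illustration of how the parameter gradient changes when the JVP term is treated as a constant.

```python
def residual(theta, x, t, s, target):
    """Residual of a toy terminal-velocity-style loss for F(x, s) = theta * x * s."""
    out = theta * x * s   # network output F
    jvp = theta * x       # dF/ds, i.e. a JVP with unit tangent in s
    return out + (s - t) * jvp - target


def loss(theta, x, t, s, target):
    return residual(theta, x, t, s, target) ** 2


def grad_through_jvp(theta, x, t, s, target):
    """Gradient in theta when the JVP term stays in the graph."""
    r = residual(theta, x, t, s, target)
    return 2.0 * r * (x * s + (s - t) * x)


def grad_detached_jvp(theta, x, t, s, target):
    """Gradient in theta when the JVP term is treated as a constant."""
    r = residual(theta, x, t, s, target)
    return 2.0 * r * (x * s)


def finite_difference_grad(theta, x, t, s, target, eps=1e-5):
    """Central finite difference of the full loss, used as a reference."""
    up = loss(theta + eps, x, t, s, target)
    down = loss(theta - eps, x, t, s, target)
    return (up - down) / (2 * eps)


args = (0.5, 1.5, 1.0, 0.25, -0.2)  # theta, x, t, s, target (arbitrary values)
g_full = grad_through_jvp(*args)
g_detached = grad_detached_jvp(*args)
g_reference = finite_difference_grad(*args)

print(f"through JVP: {g_full:.6f}  detached: {g_detached:.6f}  reference: {g_reference:.6f}")
```

Only the gradient that keeps the JVP in the graph matches the finite-difference reference for this loss. In a real model, that extra term requires differentiating through the JVP computation, which is the additional cost the reviewer asks to see quantified.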
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · 3D Shape Modeling and Analysis · Model Reduction and Neural Networks
