TL;DR
Salt introduces a training method for fast, high-quality real-time video generation: it regularizes how denoising updates compose across timesteps and trains the model with awareness of its KV cache, improving output quality at very low inference budgets (2 to 4 NFEs).
Contribution
The paper proposes Self-Consistent Distribution Matching Distillation (SC-DMD) and cache-conditioned training to enhance low-NFE video generation quality across various models.
Findings
Improved video quality at low inference budgets across multiple backbones.
Effective regularization of denoising composition to prevent drift.
Compatibility with diverse cache memory mechanisms.
Abstract
Distilling video generation models to extremely low inference budgets (e.g., 2 to 4 NFEs) is crucial for real-time deployment, yet remains challenging. Trajectory-style consistency distillation often becomes conservative under complex video dynamics, yielding an over-smoothed appearance and weak motion. Distribution matching distillation (DMD) can recover sharp, mode-seeking samples, but its local training signals do not explicitly regularize how denoising updates compose across timesteps, making composed rollouts prone to drift. To overcome this challenge, we propose Self-Consistent Distribution Matching Distillation (SC-DMD), which explicitly regularizes the endpoint-consistent composition of consecutive denoising updates. For real-time autoregressive video generation, we further treat the KV cache as a quality-parameterized condition and propose Cache-Distribution-Aware training. This…
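To make "endpoint-consistent composition of consecutive denoising updates" concrete, here is a toy numerical sketch. It is not the paper's SC-DMD objective, which the abstract does not spell out. The linear interpolation x_t = (1 - t) * x0 + t * eps, the names `step` and `composition_gap`, and the two toy predictors are all assumptions made for illustration. The sketch compares the endpoint reached by two consecutive updates (t to m, then m to s) with the endpoint of one direct update (t to s); a gap between them is one way composed rollouts can drift.

```python
import numpy as np

def step(x_t, t, s, predict_x0):
    """One deterministic update from noise level t to s (0 <= s < t <= 1).

    Assumes (for illustration only) the interpolation
    x_t = (1 - t) * x0 + t * eps.
    """
    x0_hat = predict_x0(x_t, t)
    eps_hat = (x_t - (1.0 - t) * x0_hat) / t  # noise implied by the x0 estimate
    return (1.0 - s) * x0_hat + s * eps_hat

def composition_gap(x_t, t, m, s, predict_x0):
    """Mean squared gap between the two-hop endpoint (t -> m -> s)
    and the one-hop endpoint (t -> s)."""
    two_hop = step(step(x_t, t, m, predict_x0), m, s, predict_x0)
    one_hop = step(x_t, t, s, predict_x0)
    return float(np.mean((two_hop - one_hop) ** 2))

rng = np.random.default_rng(0)
x0 = np.full((4, 8), 2.0)            # toy "clean" data: every value is 2
eps = rng.standard_normal(x0.shape)  # Gaussian noise
t_hi, t_mid, t_lo = 1.0, 0.5, 0.0
x_t = (1.0 - t_hi) * x0 + t_hi * eps

# Hypothetical predictors: one exact, one whose error grows with noise level.
exact = lambda x, t: np.full_like(x, 2.0)
biased = lambda x, t: np.full_like(x, 2.0) + t

gap_exact = composition_gap(x_t, t_hi, t_mid, t_lo, exact)
gap_biased = composition_gap(x_t, t_hi, t_mid, t_lo, biased)
print(gap_exact, gap_biased)
```

With the exact predictor the two routes reach the same endpoint, so the gap vanishes. With the timestep-dependent bias the routes disagree, and a penalty on this gap would push the updates toward composing consistently. How SC-DMD actually defines and weights such a term alongside the distribution matching signal is not specified in the abstract.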
