TL;DR
This paper introduces Causal Forcing++, a scalable method for real-time interactive video generation that significantly reduces latency and improves quality in few-step autoregressive diffusion models.
Contribution
Proposes causal consistency distillation (causal CD) as an efficient, scalable way to initialize few-step AR students, and reports better speed and quality than prior state-of-the-art methods.
Findings
In the 2-step frame-wise setting, outperforms the state-of-the-art 4-step chunk-wise Causal Forcing.
Reduces first-frame latency by 50%.
Cuts Stage 2 training cost by approximately 4 times.
Abstract
Real-time interactive video generation requires low-latency, streaming, and controllable rollout. Existing autoregressive (AR) diffusion distillation methods have achieved strong results in the chunk-wise 4-step regime by distilling bidirectional base models into few-step AR students, but they remain limited by coarse response granularity and non-negligible sampling latency. In this paper, we study a more aggressive setting: frame-wise autoregression with only 1-2 sampling steps. In this regime, we identify the initialization of a few-step AR student as the key bottleneck: existing strategies are either target-misaligned, incapable of few-step generation, or too costly to scale. We propose Causal Forcing++, a principled and scalable pipeline that uses causal consistency distillation (causal CD) for few-step AR initialization. The core idea is that causal CD learns the…
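The difference between the chunk-wise 4-step regime and the frame-wise 2-step regime described in the abstract can be illustrated with a toy rollout loop. This is a minimal sketch under stated assumptions, not the paper's implementation: `toy_denoise`, `rollout`, and the chunk size of 3 frames are hypothetical names and values chosen for illustration. It only counts how many network evaluations happen before the first frame is available.

```python
from typing import List, Tuple

Frame = List[float]


def toy_denoise(noisy: List[Frame], context: List[Frame]) -> List[Frame]:
    """Hypothetical stand-in for one network evaluation (not a real model).

    It halves every value; `context` (the already generated frames) is
    accepted only to mirror the autoregressive interface and is unused.
    """
    return [[0.5 * x for x in frame] for frame in noisy]


def rollout(num_frames: int, frames_per_block: int, steps: int) -> Tuple[List[Frame], int]:
    """Generate frames block by block, running `steps` evaluations per block.

    Returns the generated frames and the number of network evaluations
    that ran before the first frame became available.
    """
    history: List[Frame] = []
    calls = 0
    calls_before_first = 0
    for start in range(0, num_frames, frames_per_block):
        size = min(frames_per_block, num_frames - start)
        block = [[1.0] for _ in range(size)]  # placeholder for sampled noise
        for _ in range(steps):
            block = toy_denoise(block, history)
            calls += 1
        if not history:
            # The first block has just finished: its frames can now be shown.
            calls_before_first = calls
        history.extend(block)
    return history, calls_before_first


# Chunk-wise, 4 steps (assumed chunk size: 3 frames).
chunk_frames, chunk_first = rollout(num_frames=6, frames_per_block=3, steps=4)
# Frame-wise, 2 steps.
frame_frames, frame_first = rollout(num_frames=6, frames_per_block=1, steps=2)
print(chunk_first, frame_first)  # prints: 4 2
```

The count reflects step count and block size only. Measured latency also depends on the cost of each evaluation, which this toy does not model.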
