TL;DR
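This paper introduces Causal Forcing, a method for distilling a bidirectional video diffusion model into a few-step autoregressive (AR) model for real-time video generation. It uses an AR teacher for ODE initialization, bridging the architectural gap between full attention and causal attention.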
Contribution
It shows that ODE distillation requires frame-level injectivity, which distilling an AR student from a bidirectional teacher violates, and proposes Causal Forcing, which initializes the AR student from an AR teacher instead.
Findings
Outperforms all baselines across all metrics.
Surpasses the SOTA Self Forcing by 19.3% in Dynamic Degree.
Improves on Self Forcing by 8.7% in VisionReward and 16.7% in Instruction Following.
Abstract
To achieve real-time interactive video generation, current methods distill pretrained bidirectional video diffusion models into few-step autoregressive (AR) models, facing an architectural gap when full attention is replaced by causal attention. However, existing approaches do not bridge this gap theoretically. They initialize the AR student via ODE distillation, which requires frame-level injectivity, where each noisy frame must map to a unique clean frame under the PF-ODE of an AR teacher. Distilling an AR student from a bidirectional teacher violates this condition, preventing recovery of the teacher's flow map and instead inducing a conditional-expectation solution, which degrades performance. To address this issue, we propose Causal Forcing, which uses an autoregressive teacher for ODE initialization to bridge the architectural gap, and then applies the same DMD procedure as in…
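The conditional-expectation failure mode described in the abstract can be illustrated with a toy regression. The sketch below is not the paper's method: it assumes scalar stand-ins for noisy and clean frames, and the helper `fit_lookup_regressor` is a hypothetical name introduced here. It relies only on the standard fact that the mean-squared-error minimizer for a fixed input is the mean of the targets paired with that input. When each input has one target (the injective case), the fit recovers the mapping; when the same input is paired with several targets, the fit collapses to their average.

```python
import numpy as np

def fit_lookup_regressor(inputs, targets):
    """Hypothetical helper: exact MSE fit of one constant per distinct input.

    For a fixed input value, the constant that minimizes squared error
    is the mean of the targets paired with that input.
    """
    table = {}
    for x in np.unique(inputs):
        table[float(x)] = float(targets[inputs == x].mean())
    return table

# Injective pairing: each "noisy" input maps to exactly one "clean" target.
inj_inputs = np.array([0.0, 0.0, 1.0, 1.0])
inj_targets = np.array([-1.0, -1.0, 1.0, 1.0])

# Non-injective pairing: the same input is paired with two different targets.
amb_inputs = np.array([0.0, 0.0, 1.0, 1.0])
amb_targets = np.array([-1.0, 1.0, -1.0, 1.0])

inj_fit = fit_lookup_regressor(inj_inputs, inj_targets)
amb_fit = fit_lookup_regressor(amb_inputs, amb_targets)

print("injective fit:    ", inj_fit)
print("non-injective fit:", amb_fit)
```

In the non-injective case every prediction is the average of the two candidate targets, a value that matches neither. This is the toy analogue of a student that learns a conditional expectation instead of the teacher's flow map.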
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Advanced Vision and Imaging · Human Motion and Animation
