Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion
Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, Eli Shechtman

TL;DR
Self Forcing presents a new training paradigm for autoregressive video diffusion models that reduces exposure bias and enables real-time high-quality video generation by conditioning on self-generated outputs during training.
Contribution
It introduces Self Forcing, a novel autoregressive training method with holistic sequence supervision, efficient KV caching, and stochastic gradient truncation for fast, high-quality video synthesis.
Findings
Achieves real-time streaming video generation with sub-second latency.
Matches or surpasses the quality of slower, non-causal diffusion models.
Balances training cost and performance by combining a few-step diffusion model with stochastic gradient truncation.
Abstract
We introduce Self Forcing, a novel training paradigm for autoregressive video diffusion models. It addresses the longstanding issue of exposure bias, where models trained on ground-truth context must generate sequences conditioned on their own imperfect outputs during inference. Unlike prior methods that denoise future frames based on ground-truth context frames, Self Forcing conditions each frame's generation on previously self-generated outputs by performing autoregressive rollout with key-value (KV) caching during training. This strategy enables supervision through a holistic loss at the video level that directly evaluates the quality of the entire generated sequence, rather than relying solely on traditional frame-wise objectives. To ensure training efficiency, we employ a few-step diffusion model along with a stochastic gradient truncation strategy, effectively balancing…
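The contrast the abstract draws between conditioning on ground-truth context and conditioning on self-generated context can be illustrated with a toy rollout. This is a minimal sketch under stated assumptions, not the authors' implementation: `toy_generate`, `rollout`, the scalar "frames", and the update rule are all hypothetical stand-ins, a plain Python list plays the role of the KV cache, and sampling one denoising step per frame to keep gradients is only one plausible reading of "stochastic gradient truncation".

```python
import random


def toy_generate(context, num_steps, rng):
    """Hypothetical stand-in for a few-step causal denoiser (scalar 'frames')."""
    # Assumed toy rule: the clean frame depends on the most recent cached frame.
    target = 0.5 * context[-1] + 1.0 if context else 1.0
    x = rng.gauss(0.0, 1.0)  # start each frame from noise
    # Assumption: only one randomly chosen denoising step would keep gradients.
    grad_step = rng.randrange(num_steps)
    for t in range(num_steps):
        # Move toward the target; the final step lands on it (up to rounding).
        x = x + (target - x) / (num_steps - t)
    return x, grad_step


def rollout(num_frames, num_steps, rng, ground_truth=None):
    """Autoregressive rollout; `cache` stands in for the KV cache.

    ground_truth=None -> context is the model's own output (Self Forcing style).
    ground_truth=list -> context is ground-truth frames (teacher-forcing style).
    """
    cache, frames, grad_steps = [], [], []
    for i in range(num_frames):
        frame, grad_step = toy_generate(cache, num_steps, rng)
        frames.append(frame)
        grad_steps.append(grad_step)
        cache.append(frame if ground_truth is None else ground_truth[i])
    return frames, grad_steps


if __name__ == "__main__":
    self_frames, steps = rollout(4, 3, random.Random(0))
    teacher_frames, _ = rollout(4, 3, random.Random(0), ground_truth=[0.0] * 4)
    print("self-generated context:", self_frames)
    print("ground-truth context:  ", teacher_frames)
    print("steps that would keep gradients:", steps)
```

In the self-generated mode each frame is produced from the same kind of context the model sees at inference, so a loss computed on the whole list `frames` evaluates the sequence as generated. In the ground-truth mode the outputs never feed back into the context, which is the train-test mismatch the abstract calls exposure bias.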
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning · Image and Video Quality Assessment
Methods: Diffusion
