Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion

Xun Huang; Zhengqi Li; Guande He; Mingyuan Zhou; Eli Shechtman

arXiv:2506.08009 · cs.CV · November 11, 2025

TL;DR

Self Forcing is a training paradigm for autoregressive video diffusion models. It reduces exposure bias by conditioning each frame on the model's own previously generated outputs during training, which enables real-time, high-quality video generation.

Contribution

The paper introduces Self Forcing, an autoregressive training method that combines holistic sequence-level supervision, efficient KV caching, and stochastic gradient truncation for fast, high-quality video synthesis.

Findings

01 Achieves real-time streaming video generation with sub-second latency.

02 Matches or surpasses the quality of slower, non-causal diffusion models.

03 Balances computational cost and performance by pairing a few-step diffusion model with stochastic gradient truncation.

Abstract

We introduce Self Forcing, a novel training paradigm for autoregressive video diffusion models. It addresses the longstanding issue of exposure bias, where models trained on ground-truth context must generate sequences conditioned on their own imperfect outputs during inference. Unlike prior methods that denoise future frames based on ground-truth context frames, Self Forcing conditions each frame's generation on previously self-generated outputs by performing autoregressive rollout with key-value (KV) caching during training. This strategy enables supervision through a holistic loss at the video level that directly evaluates the quality of the entire generated sequence, rather than relying solely on traditional frame-wise objectives. To ensure training efficiency, we employ a few-step diffusion model along with a stochastic gradient truncation strategy, effectively balancing…
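The contrast the abstract draws, between conditioning on ground-truth context and conditioning on self-generated context, can be illustrated with a toy sketch. This is not the authors' implementation. The scalar "frames", the `denoise` rule, the list used as a stand-in for the KV cache, the squared-error `video_level_loss`, and the specific form of the random truncation are all assumptions made for illustration only.

```python
import random


def denoise(noisy, context, steps):
    """Toy few-step denoiser (hypothetical): each step moves the sample
    halfway toward the mean of the context frames (0.0 with no context)."""
    target = sum(context) / len(context) if context else 0.0
    x = noisy
    for _ in range(steps):
        x = x + 0.5 * (target - x)
    return x


def rollout(num_frames, max_steps, rng, ground_truth=None):
    """Generate scalar 'frames' one at a time.

    ground_truth=None: the context is the model's own earlier output
    (the self-generated setting described in the abstract).
    ground_truth given: the context is the ground-truth prefix
    (the setting the abstract contrasts against).
    """
    cache = []      # stand-in for a KV cache: context reused across frames
    frames = []
    grad_step = []  # assumed: index of the one step that would keep gradients
    for i in range(num_frames):
        # Assumed form of stochastic truncation: draw a random step count
        # and treat only the last executed step as differentiable.
        steps = rng.randint(1, max_steps)
        frame = denoise(rng.gauss(0.0, 1.0), cache, steps)
        frames.append(frame)
        grad_step.append(steps - 1)
        cache.append(frame if ground_truth is None else ground_truth[i])
    return frames, grad_step


def video_level_loss(frames, reference):
    """Placeholder holistic loss: one scalar over the whole sequence.
    Squared error is used here only to keep the sketch runnable."""
    return sum((f - r) ** 2 for f, r in zip(frames, reference)) / len(frames)


if __name__ == "__main__":
    rng = random.Random(0)
    frames, grad_step = rollout(num_frames=4, max_steps=3, rng=rng)
    print(frames, grad_step)
    print(video_level_loss(frames, [0.0] * 4))
```

In the self-generated setting, any error in an early frame enters the cache and influences every later frame during training, so the loss over the whole sequence sees the same compounding that occurs at inference time.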

Code & Models

Models

gdhe17/Self-Forcing
model · 3.1k downloads · 126 likes

Datasets

PencilHu/SelfForcing-Instance
dataset · 179 downloads

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning · Image and Video Quality Assessment

Methods: Diffusion