End-to-End Training for Autoregressive Video Diffusion via Self-Resampling
Yuwei Guo, Ceyuan Yang, Hao He, Yang Zhao, Meng Wei, Zhenheng Yang, Weilin Huang, Dahua Lin

TL;DR
This paper introduces Resampling Forcing, an end-to-end, teacher-free training framework for autoregressive video diffusion models that improves temporal consistency and scalability by simulating inference errors and dynamically retrieving relevant history frames.
Contribution
The paper proposes a self-resampling scheme and a history routing mechanism that together enable scalable, end-to-end training of autoregressive video diffusion models without an external teacher.
Findings
Achieves comparable performance to distillation-based methods.
Exhibits superior temporal consistency on longer videos.
Supports efficient long-horizon video generation.
Abstract
Autoregressive video diffusion models hold promise for world simulation but are vulnerable to exposure bias arising from the train-test mismatch. While recent works address this via post-training, they typically rely on a bidirectional teacher model or online discriminator. To achieve an end-to-end solution, we introduce Resampling Forcing, a teacher-free framework that enables training autoregressive video models from scratch and at scale. Central to our approach is a self-resampling scheme that simulates inference-time model errors on history frames during training. Conditioned on these degraded histories, a sparse causal mask enforces temporal causality while enabling parallel training with frame-level diffusion loss. To facilitate efficient long-horizon generation, we further introduce history routing, a parameter-free mechanism that dynamically retrieves the top-k most relevant…
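The history routing idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract is truncated before it says how relevance is scored, so the dot-product score, the per-frame descriptor vectors, and the names `route_history` and `sparse_causal_mask` are all assumptions made for illustration. The sketch shows how picking the top-k past frames for each frame produces an attention mask that is both causal (no future frames) and sparse (at most k history frames).

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def route_history(query: Sequence[float],
                  history: Sequence[Sequence[float]],
                  k: int) -> List[int]:
    """Return the indices (ascending) of the k history frames that score
    highest against the query.

    ASSUMPTION: relevance is a dot product between per-frame descriptors.
    The source does not specify the scoring rule.
    """
    scores = [dot(query, h) for h in history]
    # sorted() is stable, so ties keep the earlier frame first.
    ranked = sorted(range(len(scores)), key=lambda j: -scores[j])
    return sorted(ranked[:k])


def sparse_causal_mask(frames: Sequence[Sequence[float]],
                       k: int) -> List[List[bool]]:
    """Build a frame-level mask where mask[i][j] is True if frame i may
    attend to frame j: itself plus at most k routed past frames."""
    n = len(frames)
    mask = [[j == i for j in range(n)] for i in range(n)]
    for i in range(1, n):
        # Only frames strictly before i are candidates, which keeps the
        # mask causal by construction.
        for j in route_history(frames[i], frames[:i], k):
            mask[i][j] = True
    return mask


# Hypothetical 2-D frame descriptors, chosen only to make the routing visible.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.1], [0.9, 0.5]]
mask = sparse_causal_mask(frames, k=1)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

With k=1 each frame attends to itself and to its single best-matching earlier frame, so the cost per frame stays bounded as the history grows. A real model would apply this selection per attention head or per token block, a detail the truncated abstract does not settle.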
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning
