History-Guided Video Diffusion

Kiwhan Song; Boyuan Chen; Max Simchowitz; Yilun Du; Russ Tedrake; Vincent Sitzmann

arXiv:2502.06764 · cs.LG · July 25, 2025

TL;DR

This paper introduces the Diffusion Forcing Transformer (DFoT), a video diffusion architecture and training objective that support conditioning on a flexible number of history frames. It also proposes History Guidance, a family of guidance methods that improve video quality, temporal consistency, and motion dynamics, enabling long and diverse video generation.

Contribution

The paper presents DFoT, a video diffusion architecture paired with a theoretically grounded training objective that together support variable-length history conditioning, and introduces History Guidance, a family of guidance methods enabled by DFoT that improve video diffusion performance.

Findings

01. Significant improvement in video quality and temporal consistency with vanilla history guidance.

02. Enhanced motion dynamics and out-of-distribution generalization with advanced guidance methods.

03. Stable generation of extremely long videos using the proposed techniques.

Abstract

Classifier-free guidance (CFG) is a key technique for improving conditional generation in diffusion models, enabling more accurate control while enhancing sample quality. It is natural to extend this technique to video diffusion, which generates video conditioned on a variable number of context frames, collectively referred to as history. However, we find two key challenges to guiding with variable-length history: architectures that only support fixed-size conditioning, and the empirical observation that CFG-style history dropout performs poorly. To address this, we propose the Diffusion Forcing Transformer (DFoT), a video diffusion architecture and theoretically grounded training objective that jointly enable conditioning on a flexible number of history frames. We then introduce History Guidance, a family of guidance methods uniquely enabled by DFoT. We show that its simplest form,…
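To make the guidance idea in the abstract concrete, here is a minimal sketch of the standard classifier-free guidance combination, with history frames playing the role of the condition. It is not the paper's implementation: the function name `guided_prediction`, the `weight` parameter, and the toy numbers are all hypothetical, and the two input lists stand in for denoiser outputs computed with and without history.

```python
from typing import List, Sequence


def guided_prediction(
    cond: Sequence[float], uncond: Sequence[float], weight: float
) -> List[float]:
    """Standard CFG combination: uncond + weight * (cond - uncond).

    `cond` is a (hypothetical) model prediction given history frames,
    `uncond` is the prediction with history dropped. A weight of 1.0
    returns the conditional prediction; larger weights extrapolate
    further away from the unconditional one.
    """
    if len(cond) != len(uncond):
        raise ValueError("cond and uncond must have the same length")
    return [u + weight * (c - u) for c, u in zip(cond, uncond)]


# Toy stand-ins for denoiser outputs (assumed values, not real model data).
with_history = [0.5, -1.0, 2.0]
without_history = [0.0, -0.5, 1.0]

print(guided_prediction(with_history, without_history, 1.0))  # [0.5, -1.0, 2.0]
print(guided_prediction(with_history, without_history, 2.0))  # [1.0, -1.5, 3.0]
```

In a real sampler this combination would be applied to the model's noise or score prediction at every denoising step. How DFoT obtains the two predictions for variable-length history is described in the paper, not in this sketch.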

Code & Models

Repositories

sensetime-fvg/opendwm
pytorch

Models

kiwhansong/DFoT
model · 66k downloads · 10 likes

Videos

History-Guided Video Diffusion · SlidesLive

Taxonomy

Topics: Video Coding and Compression Technologies · Advanced Image Processing Techniques · Advanced Vision and Imaging

Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Dense Connections · Adam · Dropout · Diffusion · Layer Normalization · Position-Wise Feed-Forward Layer · Byte Pair Encoding