SlowFast-VGen: Slow-Fast Learning for Action-Driven Long Video Generation

Yining Hong; Beide Liu; Maxine Wu; Yuanhao Zhai; Kai-Wei Chang; Linjie Li; Kevin Lin; Chung-Ching Lin; Jianfeng Wang; Zhengyuan Yang; Yingnian Wu; Lijuan Wang

arXiv:2410.23277·cs.CV·November 4, 2024

TL;DR

SlowFast-VGen introduces a dual-speed learning framework combining slow world dynamics modeling with fast episodic memory storage, significantly improving long video generation consistency and long-horizon planning.

Contribution

The paper proposes a slow-fast learning system that pairs a slow-learned world dynamics model with fast episodic memory updates, enabling more coherent long video generation.

Findings

01  Outperforms baselines on Fréchet Video Distance (FVD, lower is better), scoring 514 versus 782.

02  Achieves more consistent long videos with fewer scene cuts (0.37 vs. 0.89).

03  Enhances long-horizon planning performance.

Abstract

Human beings are endowed with a complementary learning system, which bridges the slow learning of general world dynamics with fast storage of episodic memory from a new experience. Previous video generation models, however, primarily focus on slow learning by pre-training on vast amounts of data, overlooking the fast learning phase crucial for episodic memory storage. This oversight leads to inconsistencies across temporally distant frames when generating longer videos, as these frames fall beyond the model's context window. To this end, we introduce SlowFast-VGen, a novel dual-speed learning system for action-driven long video generation. Our approach incorporates a masked conditional video diffusion model for the slow learning of world dynamics, alongside an inference-time fast learning strategy based on a temporal LoRA module. Specifically, the fast learning process updates its…
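
To make the low-rank adaptation idea named in the abstract more concrete, here is a minimal sketch under strong assumptions. It is not the paper's temporal LoRA module or its diffusion model: it uses a toy linear layer, a single stored (input, target) pair standing in for an "episode", and plain gradient descent. The names LoRALinear and fast_update are hypothetical and chosen only for illustration. The point it shows is the general LoRA pattern: the pre-trained ("slow") weights stay frozen while a small low-rank update is fitted at inference time.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


class LoRALinear:
    """Toy linear layer: frozen base weight W plus a trainable low-rank term B @ A.

    Hypothetical illustration only; not the paper's temporal LoRA module.
    """

    def __init__(self, W, rank=1):
        d_out, d_in = len(W), len(W[0])
        self.W = [row[:] for row in W]  # frozen "slow" weights, never updated
        # A gets a small constant init for reproducibility (random init is typical).
        self.A = [[0.1 * (k + 1)] * d_in for k in range(rank)]
        # B starts at zero, so the adapter is initially a no-op.
        self.B = [[0.0] * rank for _ in range(d_out)]

    def forward(self, x):
        base = matvec(self.W, x)
        low_rank = matvec(self.B, matvec(self.A, x))
        return [b + l for b, l in zip(base, low_rank)]

    def fast_update(self, x, y, lr=0.1):
        """One gradient step on 0.5 * ||forward(x) - y||^2, changing only A and B.

        Returns the loss measured before the step.
        """
        h = matvec(self.A, x)
        err = [p - t for p, t in zip(self.forward(x), y)]
        # Gradient w.r.t. h, computed before B is modified.
        grad_h = [sum(self.B[i][k] * err[i] for i in range(len(err)))
                  for k in range(len(h))]
        for i in range(len(self.B)):
            for k in range(len(h)):
                self.B[i][k] -= lr * err[i] * h[k]
        for k in range(len(self.A)):
            for j in range(len(x)):
                self.A[k][j] -= lr * grad_h[k] * x[j]
        return 0.5 * sum(e * e for e in err)


# Usage: fit the adapter to one stored "episode" while W stays untouched.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
layer = LoRALinear(W, rank=1)
x, y = [0.5, -0.5, 0.5], [1.0, 0.0]
before = layer.forward(x)  # equals the frozen base output, since B is zero
losses = [layer.fast_update(x, y) for _ in range(300)]
after = layer.forward(x)   # now close to y
```

In this toy setting the episode is absorbed entirely by the small A and B factors, which is the sense in which a LoRA-style update can act as a fast, lightweight memory on top of slowly learned weights. How SlowFast-VGen actually structures, trains, and stores its temporal LoRA parameters is not specified in the excerpt above.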

Taxonomy

Topics: Advanced Vision and Imaging · Human Pose and Action Recognition · Cell Image Analysis Techniques

Methods: Diffusion · Focus