SlowFast-VGen: Slow-Fast Learning for Action-Driven Long Video Generation
Yining Hong, Beide Liu, Maxine Wu, Yuanhao Zhai, Kai-Wei Chang, Linjie Li, Kevin Lin, Chung-Ching Lin, Jianfeng Wang, Zhengyuan Yang, Yingnian Wu, Lijuan Wang

TL;DR
SlowFast-VGen introduces a dual-speed learning framework combining slow world dynamics modeling with fast episodic memory storage, significantly improving long video generation consistency and long-horizon planning.
Contribution
It proposes a novel slow-fast learning system with a slow world dynamics model and fast episodic memory update, enabling more coherent long video generation.
Findings
Outperforms baselines on FVD, scoring 514 versus 782 (lower is better).
Achieves more consistent long videos with fewer scene cuts (0.37 vs. 0.89).
Enhances long-horizon planning performance.
Abstract
Human beings are endowed with a complementary learning system, which bridges the slow learning of general world dynamics with fast storage of episodic memory from a new experience. Previous video generation models, however, primarily focus on slow learning by pre-training on vast amounts of data, overlooking the fast learning phase crucial for episodic memory storage. This oversight leads to inconsistencies across temporally distant frames when generating longer videos, as these frames fall beyond the model's context window. To this end, we introduce SlowFast-VGen, a novel dual-speed learning system for action-driven long video generation. Our approach incorporates a masked conditional video diffusion model for the slow learning of world dynamics, alongside an inference-time fast learning strategy based on a temporal LoRA module. Specifically, the fast learning process updates its…
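The fast-learning idea in the abstract can be illustrated with a toy low-rank adapter: the pre-trained ("slow") weights stay frozen, and only a small low-rank update is adjusted at inference time on a newly observed sample. The sketch below is a minimal illustration under stated assumptions and is not the paper's implementation: the single linear layer, the squared-error objective, and the names `LoRALinear` and `fast_update` are all hypothetical stand-ins for the video diffusion model and the temporal LoRA module.

```python
import random


def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


class LoRALinear:
    """Toy linear layer: frozen base weight W plus a trainable low-rank update B @ A.

    Hypothetical illustration only; not the paper's temporal LoRA module.
    """

    def __init__(self, W, rank, seed=0):
        rng = random.Random(seed)
        self.W = [row[:] for row in W]  # frozen "slow" weights, never updated here
        d_out, d_in = len(W), len(W[0])
        # Common LoRA initialisation: A small random, B zero, so the update starts at 0.
        self.A = [[rng.gauss(0.0, 0.1) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]

    def forward(self, x):
        h = matvec(self.A, x)            # project input to the low-rank space
        base = matvec(self.W, x)         # output of the frozen weights
        delta = matvec(self.B, h)        # low-rank correction
        return [b + d for b, d in zip(base, delta)]

    def fast_update(self, x, target, lr=0.2):
        """One gradient step on 0.5 * ||y - target||^2 with respect to A and B only.

        Returns the loss measured before the step.
        """
        h = matvec(self.A, x)
        y = self.forward(x)
        err = [yi - ti for yi, ti in zip(y, target)]
        # Gradient flowing back into the low-rank activation h (uses B before its update).
        g_h = [sum(self.B[i][r] * err[i] for i in range(len(err)))
               for r in range(len(h))]
        for i in range(len(err)):
            for r in range(len(h)):
                self.B[i][r] -= lr * err[i] * h[r]
        for r in range(len(h)):
            for j in range(len(x)):
                self.A[r][j] -= lr * g_h[r] * x[j]
        return 0.5 * sum(e * e for e in err)


# Assumed toy data: one "episode" sample the frozen weights do not fit exactly.
W0 = [[1.0, 0.0, 0.0, 0.0],
      [0.0, 1.0, 0.0, 0.0],
      [0.0, 0.0, 1.0, 0.0]]
x = [1.0, 0.5, -0.5, 0.25]
target = [1.2, 0.3, -0.4]

layer = LoRALinear(W0, rank=2)
losses = [layer.fast_update(x, target) for _ in range(300)]
print("loss before adaptation:", losses[0])
print("loss after adaptation: ", losses[-1])
```

Only `A` and `B` change during the loop, so the adaptation to the new sample is stored entirely in the low-rank parameters while the base weights keep what was learned during pre-training.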
Taxonomy
Topics: Advanced Vision and Imaging · Human Pose and Action Recognition · Cell Image Analysis Techniques
Methods: Diffusion · Focus
