Mode Seeking meets Mean Seeking for Fast Long Video Generation

Shengqu Cai; Weili Nie; Chao Liu; Julius Berner; Lvmin Zhang; Nanye Ma; Hansheng Chen; Maneesh Agrawala; Leonidas Guibas; Gordon Wetzstein; Arash Vahdat

arXiv:2602.24289·cs.CV·March 2, 2026

Open Access

TL;DR

This paper introduces a training paradigm that combines mode seeking and mean seeking for fast long video generation, using a Decoupled Diffusion Transformer to improve long-range coherence and local fidelity in minute-scale videos.

Contribution

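It proposes a unified framework that decouples local fidelity from long-term coherence using two heads on a shared representation: a global Flow Matching head trained with supervised learning on long videos, and a local Distribution Matching head that aligns sliding windows to a frozen short-video teacher via a mode-seeking reverse-KL divergence.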
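As a rough illustration of the supervised flow matching half of this framework, the sketch below computes a flow matching regression loss in NumPy. The linear noise-to-data path, the `flow_matching_loss` helper, and the placeholder `velocity_fn` are assumptions made for illustration; none of them is taken from the paper's implementation.

```python
import numpy as np


def flow_matching_loss(velocity_fn, x1, rng):
    """Mean squared error between a predicted and a target velocity.

    Assumes a linear path x_t = (1 - t) * x0 + t * x1 from noise x0 to
    data x1, whose velocity is x1 - x0. This path is an illustrative
    choice, not necessarily the one used in the paper.
    """
    x0 = rng.standard_normal(x1.shape)  # noise endpoint
    # One time value per sample, broadcast over the remaining axes.
    t = rng.uniform(size=(x1.shape[0],) + (1,) * (x1.ndim - 1))
    xt = (1.0 - t) * x0 + t * x1  # point on the path at time t
    target = x1 - x0  # velocity of the linear path
    pred = velocity_fn(xt, t)
    return float(np.mean((pred - target) ** 2))


# Toy usage: a batch of 8 "clips" with 4 frames of 3 features each, and a
# placeholder model that always predicts zero velocity.
rng = np.random.default_rng(0)
clips = rng.standard_normal((8, 4, 3))
loss = flow_matching_loss(lambda xt, t: np.zeros_like(xt), clips, rng)
print(f"flow matching loss of the zero predictor: {loss:.3f}")
```

Because the loss is a plain regression onto a target velocity, its minimizer averages over all targets that are consistent with the same input, which is one way to read the "mean seeking" label.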

Findings

01. Effective long-range coherence in minute-scale videos

02. Improved local sharpness and motion realism

03. Significant reduction in generation time

Abstract

Scaling video generation from seconds to minutes faces a critical bottleneck: while short-video data is abundant and high-fidelity, coherent long-form data is scarce and limited to narrow domains. To address this, we propose a training paradigm where Mode Seeking meets Mean Seeking, decoupling local fidelity from long-term coherence based on a unified representation via a Decoupled Diffusion Transformer. Our approach uses a global Flow Matching head trained via supervised learning on long videos to capture narrative structure, while simultaneously employing a local Distribution Matching head that aligns sliding windows to a frozen short-video teacher via a mode-seeking reverse-KL divergence. This strategy enables the synthesis of minute-scale videos: the model learns long-range coherence and motions from limited long videos via supervised flow matching, while inheriting local realism by…
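The contrast between mean seeking and mode seeking can be seen on a one-dimensional toy problem. The sketch below fits a single Gaussian to a bimodal density by grid search, once under the forward KL (mass covering, often called mean seeking) and once under the reverse KL (mode seeking). The mixture, the grids, and the helper names `gaussian_pdf`, `kl`, and `fit_gaussian` are all assumptions chosen for illustration and are unrelated to the paper's video model.

```python
import numpy as np


def gaussian_pdf(x, mu, sigma):
    """Density of a normal distribution N(mu, sigma^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))


# Toy bimodal "data" density p with two well-separated modes at -3 and +3.
x = np.linspace(-10.0, 10.0, 4001)
dx = x[1] - x[0]
p = 0.5 * gaussian_pdf(x, -3.0, 0.5) + 0.5 * gaussian_pdf(x, 3.0, 0.5)


def kl(a, b, eps=1e-300):
    """Numerical KL(a || b) on the grid x; eps guards against log(0)."""
    return float(np.sum(a * (np.log(a + eps) - np.log(b + eps))) * dx)


def fit_gaussian(direction):
    """Grid search for the Gaussian q that minimizes the chosen divergence.

    direction="forward" minimizes KL(p || q); anything else minimizes
    the reverse divergence KL(q || p).
    """
    best = None
    for mu in np.linspace(-4.0, 4.0, 81):
        for sigma in np.linspace(0.2, 4.0, 77):
            q = gaussian_pdf(x, mu, sigma)
            loss = kl(p, q) if direction == "forward" else kl(q, p)
            if best is None or loss < best[0]:
                best = (loss, mu, sigma)
    return best[1], best[2]


mu_f, sigma_f = fit_gaussian("forward")
mu_r, sigma_r = fit_gaussian("reverse")
print(f"forward KL fit: mu={mu_f:.2f}, sigma={sigma_f:.2f}")
print(f"reverse KL fit: mu={mu_r:.2f}, sigma={sigma_r:.2f}")
```

In this toy setting the forward KL fit is a wide Gaussian centered between the two modes, while the reverse KL fit is a narrow Gaussian sitting on one of them. That is the usual intuition for why a reverse-KL objective favors sharp, realistic samples over blurry averages.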

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Model Reduction and Neural Networks · Advanced Vision and Imaging