TL;DR
FlowLong introduces a training-free, inference-time method for generating long videos by blending overlapping window predictions with Tweedie matching, ensuring temporal consistency and high visual quality.
Contribution
It proposes a novel, architecture-agnostic inference approach that generates longer videos without additional training, outperforming existing methods in quality and consistency.
Findings
Generates videos several times longer than the base model's native window length.
Outperforms training-free and autoregressive baselines in quality and temporal consistency.
Extends to audio-video joint generation and text-to-3DGS without fine-tuning.
Abstract
Extending the generation horizon of video diffusion models to long sequences remains a long-standing and important challenge. Existing training-free approaches fall into two categories: extensions of bidirectional models, which are tightly coupled to specific architectures and suffer from quality degradation over long horizons, and autoregressive models, which accumulate drift errors due to exposure bias and tend to produce repetitive motion patterns. To address these issues, we propose a novel but simple inference-time approach for long video generation that is architecture-agnostic and requires no additional training. Our method generates long videos via overlapping sliding windows, where predicted clean samples from adjacent windows are blended via Tweedie matching to enforce both manifold constraint and temporal consistency across overlap regions. Stochastic…
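The overlapping-window idea in the abstract can be illustrated with a small sketch. This is not the paper's Tweedie matching procedure, whose details are not given in the text above; it only shows the generic pattern of covering a long sequence with overlapping windows and averaging the per-window clean-sample predictions wherever windows overlap. The function names `window_starts` and `blend_windows`, the uniform averaging weights, and the array layout (frames on the first axis) are all assumptions made for illustration.

```python
import numpy as np


def window_starts(total_frames, window, stride):
    """Start indices of overlapping windows that cover [0, total_frames).

    Hypothetical helper: the windowing schedule is an assumption, not the
    schedule used in the paper.
    """
    if window > total_frames:
        raise ValueError("window must not exceed total_frames")
    if not 0 < stride <= window:
        raise ValueError("stride must be in (0, window]")
    starts = list(range(0, total_frames - window + 1, stride))
    if starts[-1] + window < total_frames:
        # Add a final window flush with the end so every frame is covered.
        starts.append(total_frames - window)
    return starts


def blend_windows(predictions, starts, total_frames):
    """Average per-window predictions over the frames where windows overlap.

    predictions: list of arrays shaped (window, ...), one per window.
    Uniform averaging is an illustrative stand-in for the paper's blending.
    """
    window = predictions[0].shape[0]
    frame_shape = predictions[0].shape[1:]
    acc = np.zeros((total_frames,) + frame_shape, dtype=np.float64)
    count = np.zeros(total_frames, dtype=np.float64)
    for pred, start in zip(predictions, starts):
        acc[start:start + window] += pred
        count[start:start + window] += 1.0
    # Reshape the per-frame counts so they broadcast over the frame dimensions.
    return acc / count.reshape((-1,) + (1,) * len(frame_shape))


if __name__ == "__main__":
    total, window, stride = 10, 4, 2
    starts = window_starts(total, window, stride)
    # Stand-in "predictions": random arrays in place of a denoiser's output.
    rng = np.random.default_rng(0)
    preds = [rng.normal(size=(window, 8, 8)) for _ in starts]
    video = blend_windows(preds, starts, total)
    print(starts, video.shape)
```

In an actual sampler, a step like this would presumably run at each denoising iteration on the predicted clean samples, with the blended result fed back into the next step; that loop is omitted here because the text above does not specify it.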
