FreeLong++: Training-Free Long Video Generation via Multi-band Spectral Fusion
Yu Lu, Yi Yang

TL;DR
FreeLong++ is a training-free, multi-scale spectral fusion framework that significantly improves the quality and temporal consistency of long video generation from existing short-video models without additional training.
Contribution
It introduces a novel multi-branch, multi-scale frequency fusion architecture that enhances long video generation quality without extra training.
Findings
Outperforms previous methods on longer video generation tasks.
Enables coherent multi-prompt video generation with smooth transitions.
Supports controllable video generation using depth or pose sequences.
Abstract
Recent advances in video generation models have enabled high-quality short video generation from text prompts. However, extending these models to longer videos remains a significant challenge, primarily due to degraded temporal consistency and visual fidelity. Our preliminary observations show that naively applying short-video generation models to longer sequences leads to noticeable quality degradation. Further analysis identifies a systematic trend where high-frequency components become increasingly distorted as video length grows, an issue we term high-frequency distortion. To address this, we propose FreeLong, a training-free framework designed to balance the frequency distribution of long video features during the denoising process. FreeLong achieves this by blending global low-frequency features, which capture holistic semantics across the full video, with local high-frequency…
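The frequency-blending idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes features are stored as (frames, channels) arrays, that the split between "low" and "high" frequencies is a hard cutoff along the temporal axis, and that a plain FFT is an acceptable stand-in for whatever filter the method uses. The function names and the cutoff_ratio parameter are hypothetical, chosen only for illustration.

```python
import numpy as np


def temporal_lowpass_mask(num_frames, cutoff_ratio):
    """Return a 0/1 mask over temporal FFT bins (hypothetical helper).

    cutoff_ratio is an assumed parameter: the fraction of the Nyquist
    frequency (0.5 cycles/frame) that counts as "low frequency".
    """
    freqs = np.fft.fftfreq(num_frames)  # cycles per frame, in [-0.5, 0.5)
    return (np.abs(freqs) <= cutoff_ratio * 0.5).astype(np.float64)


def blend_frequencies(global_feat, local_feat, cutoff_ratio=0.25):
    """Keep low temporal frequencies of global_feat and high ones of local_feat.

    Both inputs are real arrays of shape (frames, channels). This layout is
    an assumption of the sketch, not something stated in the source.
    """
    if global_feat.shape != local_feat.shape:
        raise ValueError("global and local features must have the same shape")

    # Broadcast the per-frequency mask across the channel axis.
    mask = temporal_lowpass_mask(global_feat.shape[0], cutoff_ratio)[:, None]

    # Move both feature sets to the temporal frequency domain.
    global_spec = np.fft.fft(global_feat, axis=0)
    local_spec = np.fft.fft(local_feat, axis=0)

    # Low band from the global branch, complementary high band from the local one.
    fused_spec = global_spec * mask + local_spec * (1.0 - mask)

    # The mask is symmetric in frequency, so the result is real up to rounding.
    return np.fft.ifft(fused_spec, axis=0).real
```

Called as blend_frequencies(global_feat, local_feat), the sketch returns an array of the same shape as its inputs. A hard mask is the simplest choice for showing the complementary split; a smoother filter could be substituted without changing the overall structure.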
