TL;DR
FreeSpec introduces a spectral reconstruction framework that enhances training-free long video generation by preserving spatial details and temporal dynamics, addressing issues like content drift and inconsistency.
Contribution
It proposes a novel spectral decomposition approach using singular value decomposition to improve long-video synthesis without additional training.
Findings
Improves temporal consistency and spatial detail preservation in long videos.
Effectively maintains visual quality and dynamic motion in generated videos.
Demonstrates superior performance on the Wan2.1 and LTX-Video video generation models.
Abstract
Video diffusion models perform well in short-video synthesis, but their training-free extension to long videos often suffers from content drift, temporal inconsistency, and over-smoothed dynamics. Existing methods improve temporal consistency by combining a global branch with a local branch, but they often further decompose appearance consistency and temporal dynamics within each branch using predefined criteria. This assignment is unreliable when appearance and action progression are tightly coupled, such as in camera motion and sequential motion. We analyze the video temporal extension issue from a singular-spectrum perspective and show that enlarged self-attention windows induce spectral concentration: spectral energy becomes dominated by a few low-rank singular directions, preserving coarse structure but suppressing high-rank spatial details and motion-rich temporal variations. To…
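The "spectral concentration" described in the abstract can be illustrated with a minimal NumPy sketch. This is not FreeSpec's implementation: the feature matrices below are synthetic stand-ins, and the helper name `top_k_energy_ratio` and the choice of `k` are assumptions made only for illustration. The sketch measures what fraction of the squared singular-value energy sits in the few largest singular directions.

```python
import numpy as np


def top_k_energy_ratio(features: np.ndarray, k: int) -> float:
    """Fraction of squared singular-value energy held by the k largest singular values."""
    # compute_uv=False returns only the singular values, sorted in descending order.
    singular_values = np.linalg.svd(features, compute_uv=False)
    energy = singular_values ** 2
    return float(energy[:k].sum() / energy.sum())


rng = np.random.default_rng(0)
frames, dim = 64, 32  # hypothetical sizes: 64 frames, 32 feature channels

# Synthetic "diverse" features: full-rank Gaussian noise, energy spread over many directions.
diverse = rng.standard_normal((frames, dim))

# Synthetic "concentrated" features: a rank-2 signal plus a small perturbation,
# mimicking energy dominated by a few low-rank directions.
low_rank = rng.standard_normal((frames, 2)) @ rng.standard_normal((2, dim))
concentrated = low_rank + 0.05 * rng.standard_normal((frames, dim))

ratio_diverse = top_k_energy_ratio(diverse, k=2)
ratio_concentrated = top_k_energy_ratio(concentrated, k=2)

print(f"top-2 energy ratio (diverse):      {ratio_diverse:.3f}")
print(f"top-2 energy ratio (concentrated): {ratio_concentrated:.3f}")
```

A ratio near 1 means nearly all energy lies in the leading directions, which is the regime the abstract associates with preserved coarse structure and suppressed detail and motion. A much lower ratio means the energy is spread across many singular directions.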
