FreeSwim: Revisiting Sliding-Window Attention Mechanisms for Training-Free Ultra-High-Resolution Video Generation
Yunfeng Wu, Jiayi Song, Zhenxiong Tan, Zihao He, Songhua Liu

TL;DR
FreeSwim is a training-free sliding-window attention method that lets pretrained video Diffusion Transformers efficiently generate detailed, coherent ultra-high-resolution videos.
Contribution
The paper proposes a training-free approach that combines sliding-window attention with cross-attention to generate higher-resolution videos from pretrained models.
Findings
Produces ultra-high-resolution videos with fine details
Achieves superior performance on VBench compared to training-based methods
Ensures global coherence through a dual-path attention strategy
Abstract
The quadratic time and memory complexity of the attention mechanism in modern Transformer-based video generators makes end-to-end training for ultra-high-resolution videos prohibitively expensive. Motivated by this limitation, we introduce a training-free approach that leverages video Diffusion Transformers pretrained at their native scale to synthesize higher-resolution videos without any additional training or adaptation. At the core of our method lies an inward sliding-window attention mechanism, which originates from a key observation: maintaining each query token's training-scale receptive field is crucial for preserving visual fidelity and detail. However, naive local-window attention often leads to repetitive content and a lack of global coherence in the generated results. To overcome this challenge, we devise a dual-path pipeline that backs up window…
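The abstract does not spell out how the inward sliding window is built, so the following is only a minimal 1D sketch under one plausible reading: each query attends to a fixed-size window centered on it, and near the sequence borders the window is shifted inward so that every query still sees exactly the same number of keys. The function names `inward_window_mask` and `masked_attention`, the 1D token layout, and the window size are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def inward_window_mask(n: int, w: int) -> np.ndarray:
    """Boolean (n, n) mask; row i marks the keys that query i may attend to.

    Assumption: "inward" means the window is clamped to stay inside the
    sequence, so every query keeps exactly w keys, even at the borders.
    """
    if not 0 < w <= n:
        raise ValueError("window size must satisfy 0 < w <= n")
    queries = np.arange(n)
    # Center the window on each query, then shift it inward at the borders.
    start = np.clip(queries - w // 2, 0, n - w)
    keys = np.arange(n)[None, :]
    return (keys >= start[:, None]) & (keys < start[:, None] + w)


def masked_attention(q, k, v, mask):
    """Scaled dot-product attention restricted to the positions in mask."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    # Subtract the row maximum for a numerically stable softmax.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


if __name__ == "__main__":
    mask = inward_window_mask(n=8, w=4)
    print(mask.astype(int))
```

In this sketch the window size stands in for the training-scale receptive field: the sequence can grow with resolution while each query's key count stays fixed. A real video model would apply the same idea over the temporal and spatial axes rather than a flat 1D sequence.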
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Image Enhancement Techniques · Advanced Image Processing Techniques
