FreqFormer: Hierarchical Frequency-Domain Attention with Adaptive Spectral Routing for Long-Sequence Video Diffusion Transformers
Haopeng Jin

TL;DR
FreqFormer introduces a frequency-aware attention framework for long-sequence video diffusion transformers. It reduces computational cost through spectrally structured heterogeneous attention and adaptive routing.
Contribution
It presents a novel spectral attention method with adaptive routing and a GPU execution plan, improving efficiency for long video sequences.
Findings
Substantially reduces attention FLOPs and memory traffic in simulations.
Supports spectrally structured heterogeneous attention as a practical approach.
Achieves efficient long-sequence video processing with a hardware-friendly pattern.
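To see why a heterogeneous design can cut attention FLOPs, here is a back-of-the-envelope count in Python. It is only a sketch under stated assumptions: every size below (sequence length, head dimension, head split, pooling stride, block size, window radius) is hypothetical and not taken from the paper, and the count ignores the spectral transform, the routing network, and any cross-band tokens. The three operators follow the abstract: dense attention over compressed keys, block-sparse attention, and sliding-window attention.

```python
def attention_madds(n_q, n_k, d):
    """Multiply-adds for QK^T plus AV when each of n_q queries sees n_k keys."""
    return 2 * n_q * n_k * d


def cost_report(n, d, heads, stride, block, window):
    """Compare dense attention with a three-band heterogeneous layout.

    heads  : (h_low, h_mid, h_high), a hypothetical head split across bands
    stride : pooling factor for low-band keys/values (assumed compression)
    block  : keys per query in the block-sparse band (assumed block-diagonal)
    window : one-sided radius of the sliding window (boundary effects ignored)
    """
    h_low, h_mid, h_high = heads
    dense = sum(heads) * attention_madds(n, n, d)
    hetero = (
        h_low * attention_madds(n, n // stride, d)
        + h_mid * attention_madds(n, block, d)
        + h_high * attention_madds(n, 2 * window + 1, d)
    )
    return dense, hetero, dense / hetero


# Hypothetical configuration, chosen only to make the arithmetic concrete.
dense, hetero, ratio = cost_report(
    n=1024, d=64, heads=(4, 4, 4), stride=16, block=64, window=8
)
print(f"dense={dense:,} hetero={hetero:,} ratio={ratio:.2f}x")
```

The dense term grows with n squared, while the block and window terms grow linearly in n and the low-band term is divided by the pooling stride, so the gap widens as the sequence gets longer. The actual savings depend on the paper's configuration, which this page does not give.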
Abstract
Long-sequence video diffusion transformers hit a quadratic self-attention cost that dominates runtime and memory for very long token sequences. Most efficient attention methods use one approximation everywhere, yet video features are spectrally structured: low frequencies carry global layout and coarse motion; high frequencies carry texture and fine detail. We present FreqFormer, a frequency-aware heterogeneous attention framework. Token features are split into spectral bands with different operators: dense global attention on compressed low-frequency content, structured block-sparse attention on mid frequencies, and sliding-window local attention on high frequencies. A lightweight spectral routing network allocates heads across bands using layer statistics and the diffusion timestep, shifting compute toward global structure early in denoising and detail later. Cross-band summary tokens…
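The band split and per-band operators described above can be sketched in NumPy as follows. This is a minimal illustration, not the paper's implementation. It makes several assumptions: bands are cut with an FFT along the token axis; queries, keys, and values are the raw band features with no learned projections; the block-sparse pattern is block-diagonal; low-band compression is average pooling; and band outputs are simply summed. All function names and cut-off values are hypothetical. The sparse patterns are applied as dense masks, so the sketch shows the attention pattern rather than the efficiency gain.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def split_bands(x, low_cut, mid_cut):
    """Split (n, d) features into low/mid/high bands along the token axis."""
    n = x.shape[0]
    spec = np.fft.rfft(x, axis=0)                 # shape (n // 2 + 1, d)
    freqs = np.arange(spec.shape[0])
    masks = [
        freqs < low_cut,
        (freqs >= low_cut) & (freqs < mid_cut),
        freqs >= mid_cut,
    ]
    # The masks partition the spectrum, so the three bands sum back to x.
    return [np.fft.irfft(spec * m[:, None], n=n, axis=0) for m in masks]


def masked_attention(q, k, v, mask=None):
    """Scaled dot-product attention; mask is True where attention is allowed."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)
    return softmax(scores) @ v


def heterogeneous_attention(x, low_cut=4, mid_cut=16, stride=4, block=8, window=2):
    n, d = x.shape                                # n must be divisible by stride
    low, mid, high = split_bands(x, low_cut, mid_cut)
    idx = np.arange(n)

    # Low band: dense attention from every query to average-pooled keys/values.
    pooled = low.reshape(n // stride, stride, d).mean(axis=1)
    out_low = masked_attention(low, pooled, pooled)

    # Mid band: block-diagonal mask as one simple block-sparse pattern.
    block_mask = (idx[:, None] // block) == (idx[None, :] // block)
    out_mid = masked_attention(mid, mid, mid, block_mask)

    # High band: sliding window of radius `window` around each token.
    window_mask = np.abs(idx[:, None] - idx[None, :]) <= window
    out_high = masked_attention(high, high, high, window_mask)

    return out_low + out_mid + out_high           # assumed merge: plain sum


rng = np.random.default_rng(0)
x = rng.standard_normal((64, 8))                  # 64 tokens, 8 channels
out = heterogeneous_attention(x)
print(out.shape)
```

A practical version would replace the dense masks with real block-sparse and windowed kernels, and would let a routing network choose how many heads each band receives at each diffusion timestep. Neither is attempted here.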
