FreqFormer: Hierarchical Frequency-Domain Attention with Adaptive Spectral Routing for Long-Sequence Video Diffusion Transformers

Haopeng Jin

arXiv:2604.22808·cs.CV·April 28, 2026

TL;DR

FreqFormer is a frequency-aware attention framework for long-sequence video diffusion transformers. It reduces attention cost by applying a different attention operator to each spectral band and by adaptively routing heads across those bands.

Contribution

The paper presents a novel heterogeneous spectral attention method with adaptive head routing, together with a GPU execution plan, to improve attention efficiency on long video sequences.

Findings

01. Substantially reduces attention FLOPs and memory traffic in simulations.

02. Supports spectrally structured heterogeneous attention as a practical approach.

03. Achieves efficient long-sequence video processing with a hardware-friendly pattern.
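
To make the FLOP claim concrete, here is a rough back-of-the-envelope cost model for the three operator types named in the abstract (dense attention on a compressed sequence, block-sparse attention, sliding-window attention). It is only a sketch: the function names, the compression factor, block size, blocks kept per row, window size, and the even split of heads are all hypothetical choices for illustration, not values or results from the paper.

```python
def dense_pairs(n):
    """Query-key pairs scored by full attention over n tokens."""
    return n * n


def block_sparse_pairs(n, block, blocks_per_row):
    """Pairs scored when each query block attends to a fixed number of key blocks."""
    num_blocks = n // block
    kept = min(blocks_per_row, num_blocks)
    return num_blocks * kept * block * block


def sliding_window_pairs(n, window):
    """Upper bound on pairs when each query sees at most `window` keys."""
    return n * min(window, n)


def attention_flops(pairs, head_dim):
    """Approximate FLOPs for one head: QK^T and AV each cost ~2*head_dim per pair."""
    return 4 * pairs * head_dim


def mixed_attention_flops(n, head_dim, heads_per_band,
                          compression, block, blocks_per_row, window):
    """Total FLOPs when heads are split across (low, mid, high) bands.

    All parameters are illustrative assumptions, not the paper's settings.
    """
    low_heads, mid_heads, high_heads = heads_per_band
    low = attention_flops(dense_pairs(n // compression), head_dim)
    mid = attention_flops(block_sparse_pairs(n, block, blocks_per_row), head_dim)
    high = attention_flops(sliding_window_pairs(n, window), head_dim)
    return low_heads * low + mid_heads * mid + high_heads * high


if __name__ == "__main__":
    n, head_dim, heads = 65536, 64, 24  # hypothetical sequence length and head setup
    baseline = heads * attention_flops(dense_pairs(n), head_dim)
    mixed = mixed_attention_flops(n, head_dim, (8, 8, 8),
                                  compression=16, block=128,
                                  blocks_per_row=8, window=512)
    print(f"mixed / dense FLOP ratio: {mixed / baseline:.4f}")
```

The model counts only attention score and value FLOPs. It ignores the cost of the spectral split, the routing network, and memory traffic, so it says nothing about wall-clock speed.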

Abstract

Long-sequence video diffusion transformers hit a quadratic self-attention cost that dominates runtime and memory for very long token sequences. Most efficient attention methods use one approximation everywhere, yet video features are spectrally structured: low frequencies carry global layout and coarse motion; high frequencies carry texture and fine detail. We present FreqFormer, a frequency-aware heterogeneous attention framework. Token features are split into spectral bands with different operators: dense global attention on compressed low-frequency content, structured block-sparse attention on mid frequencies, and sliding-window local attention on high frequencies. A lightweight spectral routing network allocates heads across bands using layer statistics and the diffusion timestep, shifting compute toward global structure early in denoising and detail later. Cross-band summary tokens…
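
The routing idea described in the abstract (head allocation driven by layer statistics and the diffusion timestep) can be illustrated with a small sketch. This is not the paper's routing network: the learned network is replaced here by a hand-written scoring rule, and `route_heads`, `band_energy`, `noise_level`, and `sharpness` are hypothetical names introduced for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def allocate_heads(weights, num_heads):
    """Turn fractional weights into integer head counts (largest-remainder rounding)."""
    raw = [w * num_heads for w in weights]
    counts = [math.floor(r) for r in raw]
    leftover = num_heads - sum(counts)
    # Hand the remaining heads to the bands with the largest fractional parts.
    order = sorted(range(len(raw)), key=lambda i: raw[i] - counts[i], reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return counts


def route_heads(noise_level, band_energy, num_heads, sharpness=2.0):
    """Allocate heads to (low, mid, high) bands.

    noise_level: 1.0 at the start of denoising, 0.0 at the end (assumed convention).
    band_energy: per-band feature energy, standing in for the layer statistics.
    The scoring rule below is a hand-set stand-in for a learned routing network.
    """
    eps = 1e-8
    low, mid, high = band_energy
    scores = [
        math.log(low + eps) + sharpness * noise_level,           # global structure early
        math.log(mid + eps),
        math.log(high + eps) + sharpness * (1.0 - noise_level),  # fine detail late
    ]
    return allocate_heads(softmax(scores), num_heads)


if __name__ == "__main__":
    for level in (1.0, 0.5, 0.0):
        print(level, route_heads(level, (1.0, 1.0, 1.0), num_heads=12))
```

With equal band energies, the low-frequency band receives the most heads at high noise levels and the high-frequency band receives the most near the end of denoising, which matches the qualitative behavior the abstract describes.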
