Beyond Homogeneous Attention: Memory-Efficient LLMs via Fourier-Approximated KV Cache

Xiaoran Liu; Siyang He; Qiqi Wang; Ruixiao Li; Yuerong Song; Zhigeng Liu; Linlin Li; Qun Liu; Zengfeng Huang; Qipeng Guo; Ziwei He; Xipeng Qiu

arXiv:2506.11886·cs.CL·June 16, 2025

Open Access · 3 Reviews

TL;DR

FourierAttention introduces a memory-efficient, training-free method for large language models that uses Fourier basis projections to better handle long-context dependencies, improving accuracy and deployment efficiency.

Contribution

The paper presents FourierAttention, a novel approach exploiting heterogeneous transformer head roles with Fourier basis projections, enhancing long-context modeling without training or accuracy loss.

Findings

01 · Achieves the best long-context accuracy on LongBench and Needle-In-A-Haystack (NIAH).

02 · Enables efficient deployment with a custom Triton kernel, FlashFourierAttention.

03 · Maintains performance while reducing memory usage.

Abstract

Large Language Models struggle with memory demands from the growing Key-Value (KV) cache as context lengths increase. Existing compression methods homogenize head dimensions or rely on attention-guided token pruning, often sacrificing accuracy or introducing computational overhead. We propose FourierAttention, a training-free framework that exploits the heterogeneous roles of transformer head dimensions: lower dimensions prioritize local context, while upper ones capture long-range dependencies. By projecting the long-context-insensitive dimensions onto orthogonal Fourier bases, FourierAttention approximates their temporal evolution with fixed-length spectral coefficients. Evaluations on LLaMA models show that FourierAttention achieves the best long-context accuracy on LongBench and Needle-In-A-Haystack (NIAH). Besides, a custom Triton kernel, FlashFourierAttention, is designed to…
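The core idea in the abstract, replacing a per-dimension time series with a fixed number of spectral coefficients, can be illustrated with a small sketch. This is not the authors' HiPPO-based formulation or their FlashFourierAttention kernel; it swaps in a plain truncated projection onto an orthonormal real Fourier basis, applied to a single cached dimension. The function names `fourier_basis`, `compress`, and `reconstruct` and the parameter `num_freqs` are hypothetical and chosen only for illustration.

```python
import math


def fourier_basis(T, num_freqs):
    """Rows of an orthonormal real Fourier basis on the grid t = 0..T-1.

    Assumption for this sketch: a plain discrete Fourier basis, not the
    HiPPO-derived basis mentioned in the reviews.
    """
    assert 2 * num_freqs < T, "need num_freqs < T / 2 for orthonormality"
    rows = [[1.0 / math.sqrt(T)] * T]  # constant (DC) component
    scale = math.sqrt(2.0 / T)
    for k in range(1, num_freqs + 1):
        rows.append([scale * math.cos(2 * math.pi * k * t / T) for t in range(T)])
        rows.append([scale * math.sin(2 * math.pi * k * t / T) for t in range(T)])
    return rows


def compress(series, num_freqs):
    """Project one cached dimension (length T) onto 2 * num_freqs + 1 coefficients."""
    basis = fourier_basis(len(series), num_freqs)
    return [sum(b * x for b, x in zip(row, series)) for row in basis]


def reconstruct(coeffs, T):
    """Rebuild an approximation of the length-T series from its coefficients."""
    num_freqs = (len(coeffs) - 1) // 2
    basis = fourier_basis(T, num_freqs)
    return [sum(c * row[t] for c, row in zip(coeffs, basis)) for t in range(T)]
```

With `num_freqs = 4`, `compress` returns 9 coefficients whether the input has 256 or 1024 positions, which is the sense in which the representation is fixed-length. A series that varies slowly over the context is recovered closely by `reconstruct`, while a rapidly varying one loses detail; this is consistent with the abstract's choice to approximate only the long-context-insensitive dimensions.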

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 4

Strengths

- The empirical finding that different head dimensions serve distinct roles (local vs. global context) is interesting and well-validated through ablation studies.
- Leveraging the HiPPO framework provides mathematical rigor, and the adaptation from complex to real-valued representations is sensible for practical implementation.
- The custom Triton kernel shows practical engineering effort and is very nice.

Weaknesses

- (General) The title and abstract claim this is "KV cache compression," but the actual mechanism is better described as lossy approximation or dimensionality reduction. For this reason, while comparing to Palu (dimensionality reduction) makes sense, comparing to SnapKV and PyramidKV is kind of strange as these perform KV Cache eviction in a different setting. The authors should (a) discuss the differences between KV Cache compression, quantization, reduction in a clear way and (b) include quant

Reviewer 02 · Rating 4 · Confidence 4

Strengths

- Interesting analysis about how latent dimensions can be characterised into long-context-sensitive/insensitive (also from a mechinterp point of view)
- Strong results compared with competitive baselines like SnapKV, PyramidKV, and Palu at comparable budgets on LongBench and NIAH
- Interesting custom FlashFourierAttention kernel

Weaknesses

- Absolute gains are *very* modest -- are they statistically significant?
- I was not able to find quantitative results on latency, please let me know if I missed those
- Experiments only on Llama3-based backbones

Reviewer 03 · Rating 8 · Confidence 3

Strengths

+ This paper discovers that the different dimensions of Q and K in attention computation play different roles. Initially, this finding seemed counterintuitive to me, as I typically assumed that different dimensions were homogeneous; this was because I had overlooked the effect of RoPE. The paper innovatively leverages this insight by applying different compression strategies to different dimensional ranges, achieving better compression efficiency.
+ The paper also implements the proposed method'

Weaknesses

+ The paper only conducts experiments on LLaMA. It would be better to include comparisons with other open-source models as well.

Taxonomy

Topics · Neural Networks and Applications · Network Packet Processing and Optimization

Methods · LLaMA