TL;DR
FreqKV introduces a frequency domain-based, parameter-free method for compressing key-value caches in large language models, significantly extending context windows while preserving performance.
Contribution
It proposes a novel, parameter-free and architecture-agnostic approach that compresses KV caches in the frequency domain, enabling efficient processing of ultra-long contexts with only minimal training at 8K length.
Findings
Extends the context window of LLaMA-2-7B up to 256K tokens.
Outperforms existing KV cache compression methods.
Maintains stable perplexity on long-context benchmarks.
Abstract
Existing key-value (KV) cache compression methods for large language models (LLMs) often rely on token eviction, which risks losing critical local information in both long prefilling and decoding scenarios. When extrapolating beyond the pretrained context length, their performance degrades sharply on long-context benchmarks. Motivated by the observation in the frequency domain that the context information is concentrated in the low-frequency components, we propose FreqKV, a parameter-free and architecture-agnostic approach. It iteratively compresses the increasing KV cache in the frequency domain, allowing models to process lengthy contexts efficiently. With minimal training at 8K length, FreqKV extends the context window of LLaMA-2-7B up to 256K tokens while maintaining stable perplexity. Extensive experiments across prefilling and decoding demonstrate that FreqKV enables robust…
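The compression step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`dct_matrix`, `compress_freq`, `append_with_compression`), the choice of an orthonormal DCT-II along the sequence axis, the `sqrt(m / n)` amplitude rescaling, and the `max_len` / `keep_ratio` parameters are all assumptions made for illustration. The idea shown is only the general one: transform the cached states to the frequency domain, keep the low-frequency coefficients, map them back to a shorter sequence, and repeat whenever the cache grows past a budget.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix of size n x n (rows are frequencies)."""
    k = np.arange(n)[:, None]          # frequency index
    t = np.arange(n)[None, :]          # position index
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (t + 0.5) * k / n)
    c[0] /= np.sqrt(2.0)               # scale the DC row so that c @ c.T == I
    return c

def compress_freq(x, keep_ratio=0.5):
    """Shorten x (seq_len, dim) by keeping only low-frequency DCT coefficients.

    Hypothetical helper: keep_ratio and the rescaling are illustrative choices.
    """
    n = x.shape[0]
    m = max(1, int(n * keep_ratio))    # number of retained low frequencies
    coeffs = dct_matrix(n) @ x         # DCT along the sequence axis
    low = coeffs[:m]                   # drop the high-frequency rows
    # Inverse DCT at the shorter length; sqrt(m / n) keeps amplitudes comparable.
    return np.sqrt(m / n) * (dct_matrix(m).T @ low)

def append_with_compression(cache, new_states, max_len=8, keep_ratio=0.5):
    """Append new states, compressing whenever the cache exceeds max_len."""
    cache = np.concatenate([cache, new_states], axis=0)
    while cache.shape[0] > max_len:
        cache = compress_freq(cache, keep_ratio)
    return cache

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    cache = np.zeros((0, 4))           # empty key (or value) cache, dim 4
    for _ in range(20):                # simulate 20 decoding steps
        cache = append_with_compression(cache, rng.normal(size=(1, 4)))
    print(cache.shape)
```

In this sketch the whole cache is compressed each time the budget is exceeded, so its length stays bounded regardless of how many tokens are appended. A real system would apply this per layer and per head to both keys and values, and may treat some tokens specially; those details are not specified in the text above.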
Peer Reviews
Decision·ICLR 2026 Poster
1. Clear idea with strong intuition about low-frequency energy concentration in KV states.
2. Comprehensive experiments on multiple benchmarks illustrate the effectiveness of the proposed method.
3. The ablation study provides more detailed information on frequency choice.
1. While the low-frequency concentration is a great motivation, the paper doesn't explore why this happens or what information is stored in which frequency bands. For instance, do the low frequencies carry "global context" and the high frequencies "local token-specific details"? The authors could provide a deeper analysis here to offer valuable insights.
2. Attention heads and layers may carry different spectral content. A per-head adaptive $\gamma$ or power-based cutoff might outperform fixed ratios. Do the aut
* The idea of applying frequency-domain compression to the KV cache is both intuitive and novel. The similarity to JPEG compression makes the concept easy to understand, and the empirical results demonstrate that this approach is competitive with or superior to existing baselines.
* The paper includes extensive comparisons with a wide range of prior methods on multiple datasets.
* The paper lacks a detailed analysis or intuitive explanation of why low-frequency components dominate in the KV cache. Moreover, the large magnitude of low-frequency components does not necessarily imply that they are semantically important, yet the paper seems to make this assumption. While the empirical evidence supports the method’s motivation and effectiveness, a more analytical approach would strengthen the argument.
* Although the authors claim speed improvements in both the prefill and
- KV cache compression is an important topic due to the efficiency gains and power consumption concerns of modern transformers.
- In addition to compressing the cache, the method also seems to yield an efficiency gain in the decoding attention operation.
- Llama 3 is used as a baseline model. This is important because I believe the only reason some baselines show poor performance is that they have exceeded the number of training positional embeddings. However, Llama 3 is already 1.5 years old at this point. There are already 3 releases of Llama 3, which go up to 3.3 and have 131K native positional embeddings. Can FreqKV be applied to these models and show the same good performance past 131K?
- There is no comparison of latency with baselines s
