TL;DR
TriAttention introduces a novel trigonometric series-based method to efficiently compress key-value caches in large language models, enabling long reasoning with reduced memory and increased throughput.
Contribution
It leverages Q/K concentration in pre-RoPE space and trigonometric series to estimate key importance, improving efficiency without sacrificing accuracy.
Findings
Achieves 2.5x higher throughput on AIME25 with 32K tokens
Reduces KV memory by 10.7x compared to full attention
Enables deployment on a single consumer GPU for long-context tasks
Abstract
Extended reasoning in large language models (LLMs) creates severe KV cache memory bottlenecks. Leading KV cache compression methods estimate KV importance using attention scores from recent post-RoPE queries. However, RoPE rotates queries with position, so very few queries are representative, which leads to poor top-key selection and unstable reasoning. To avoid this issue, we turn to the pre-RoPE space, where we observe that Q and K vectors are highly concentrated around fixed non-zero centers and remain stable across positions, a property we call Q/K concentration. We show that this concentration causes queries to preferentially attend to keys at specific distances (e.g., nearest keys), with the centers determining which distances are preferred via a trigonometric series. Based on this, we propose TriAttention to estimate key importance by leveraging these centers. Via the trigonometric series, we…
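The link between fixed pre-RoPE centers and a trigonometric series in distance follows from a standard RoPE identity, which the sketch below illustrates. It is not the paper's implementation. The function names (`rope_freqs`, `rope`, `trig_series_logit`), the interleaved pairing of dimensions, and the base of 10000 are assumptions chosen for illustration. Many model implementations pair dimensions differently, but the identity has the same form.

```python
import numpy as np

def rope_freqs(dim, base=10000.0):
    """Per-pair rotation frequencies (assumed: standard RoPE schedule)."""
    return base ** (-np.arange(0, dim, 2) / dim)

def rope(x, pos, freqs):
    """Rotate each consecutive pair (x[2i], x[2i+1]) by the angle pos * freqs[i]."""
    pairs = x.reshape(-1, 2)
    ang = pos * freqs
    cos, sin = np.cos(ang), np.sin(ang)
    out = np.empty_like(pairs)
    out[:, 0] = pairs[:, 0] * cos - pairs[:, 1] * sin
    out[:, 1] = pairs[:, 0] * sin + pairs[:, 1] * cos
    return out.reshape(-1)

def trig_series_logit(q_center, k_center, distance, freqs):
    """Attention logit written as a trigonometric series in
    distance = query_pos - key_pos, for fixed pre-RoPE vectors.

    logit(distance) = sum_i a_i * cos(freqs[i] * distance)
                            + b_i * sin(freqs[i] * distance)
    where a_i and b_i depend only on the pre-RoPE vectors.
    """
    q = q_center.reshape(-1, 2)
    k = k_center.reshape(-1, 2)
    a = q[:, 0] * k[:, 0] + q[:, 1] * k[:, 1]  # cosine coefficients
    b = q[:, 0] * k[:, 1] - q[:, 1] * k[:, 0]  # sine coefficients
    ang = distance * freqs
    return float(np.sum(a * np.cos(ang) + b * np.sin(ang)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dim = 16
    freqs = rope_freqs(dim)
    # Stand-ins for the Q and K centers; real centers would be measured from a model.
    q_center = rng.normal(size=dim)
    k_center = rng.normal(size=dim)
    direct = rope(q_center, 100, freqs) @ rope(k_center, 90, freqs)
    series = trig_series_logit(q_center, k_center, 100 - 90, freqs)
    print(np.isclose(direct, series))
```

Because the coefficients depend only on the pre-RoPE vectors, evaluating `trig_series_logit` over a range of distances for a fixed pair of centers shows which distances that pair favors, without needing any post-RoPE queries.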
