TriAttention: Efficient Long Reasoning with Trigonometric KV Compression

Weian Mao; Xi Lin; Wei Huang; Yuxin Xie; Tianfu Fu; Bohan Zhuang; Song Han; Yukang Chen

arXiv:2604.04921·cs.CL·April 7, 2026

TL;DR

TriAttention introduces a novel trigonometric series-based method to efficiently compress key-value caches in large language models, enabling long reasoning with reduced memory and increased throughput.

Contribution

TriAttention uses Q/K concentration in pre-RoPE space, together with a trigonometric series, to estimate key importance, improving efficiency without sacrificing accuracy.

Findings

01 Achieves 2.5x higher throughput on AIME25 with 32K tokens

02 Reduces KV memory by 10.7x compared to full attention

03 Enables deployment on a single consumer GPU for long context tasks

Abstract

Extended reasoning in large language models (LLMs) creates severe KV cache memory bottlenecks. Leading KV cache compression methods estimate KV importance using attention scores from recent post-RoPE queries. However, RoPE rotates queries according to their position, so very few queries are representative, which leads to poor top-key selection and unstable reasoning. To avoid this issue, we turn to the pre-RoPE space, where we observe that Q and K vectors are highly concentrated around fixed non-zero centers and remain stable across positions, a property we call Q/K concentration. We show that this concentration causes queries to preferentially attend to keys at specific distances (e.g., nearest keys), with the centers determining which distances are preferred via a trigonometric series. Based on this, we propose TriAttention to estimate key importance by leveraging these centers. Via the trigonometric series, we…
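The link between fixed pre-RoPE centers and distance preference follows from a standard property of RoPE, sketched below in NumPy. This is an illustration only, not the authors' implementation: the abstract is truncated before it describes the actual scoring rule, so the function names, the random "centers", the pair count, and the base of 10000 are all assumptions made for the example. With one complex number per rotated 2-D pair, the logit between a query at position m and a key at position n depends only on the distance m - n and equals a sum of cosines whose amplitudes and phases are set by the two vectors.

```python
import numpy as np

def rope_rotate(x, pos, freqs):
    """Apply RoPE to x, stored as one complex number per rotated 2-D pair."""
    return x * np.exp(1j * pos * freqs)

def logit_via_rope(q, k, m, n, freqs):
    """Attention logit computed directly: rotate q to position m, k to position n."""
    return np.sum(rope_rotate(q, m, freqs) * np.conj(rope_rotate(k, n, freqs))).real

def logit_via_series(q, k, delta, freqs):
    """Same logit as a trigonometric series in the distance delta = m - n."""
    amplitude = np.abs(q) * np.abs(k)       # per-frequency weight
    phase = np.angle(q) - np.angle(k)       # per-frequency phase offset
    return np.sum(amplitude * np.cos(delta * freqs + phase))

# Hypothetical setup: random vectors stand in for the Q and K centers.
rng = np.random.default_rng(0)
num_pairs = 32
freqs = 10000.0 ** (-np.arange(num_pairs) / num_pairs)  # common RoPE frequencies
q_center = rng.normal(size=num_pairs) + 1j * rng.normal(size=num_pairs)
k_center = rng.normal(size=num_pairs) + 1j * rng.normal(size=num_pairs)

distances = np.arange(64)
series = np.array([logit_via_series(q_center, k_center, d, freqs) for d in distances])
direct = np.array([logit_via_rope(q_center, k_center, 100 + d, 100, freqs) for d in distances])

# The two computations agree up to floating-point error.
max_abs_error = float(np.max(np.abs(series - direct)))

# Distance that these particular centers favor most within the window.
preferred_distance = int(distances[np.argmax(series)])
```

If queries and keys stay close to such centers, the curve `series` could in principle be evaluated once per head and reused to rank cached keys by their distance from the current position, without relying on a handful of recent post-RoPE queries. How TriAttention actually turns this into an importance estimate is described in the paper, not here.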

Code & Models

Repositories

weianmao/triattention (GitHub)
