Transactional Attention: Semantic Sponsorship for KV-Cache Retention
Abhinaba Basu

TL;DR
The paper introduces Transactional Attention (TA), a mechanism that keeps critical but rarely attended tokens, such as credentials, in the KV cache. It retains them at budgets where existing eviction methods fail, with minimal latency impact.
Contribution
Transactional Attention is a sponsorship mechanism in which structural anchor patterns (e.g., "key:", "password:") protect adjacent value-bearing tokens from KV-cache eviction, so credentials survive where prior approaches drop them.
Findings
TA achieves 100% credential retrieval at K=16, where six baselines (H2O, TOVA, SnapKV, StreamingLLM, PyramidKV, DynamicKV) achieve 0%.
TA-Fast, an attention-free variant, reduces memory overhead by 52% and is compatible with existing systems.
TA adds less than 1% latency overhead and is orthogonal to other compression methods.
Abstract
At K=16 tokens (0.4% of a 4K context), every existing KV-cache compression method achieves 0% on credential retrieval. The failure mode is dormant tokens: credentials, API keys, and configuration values that receive near-zero attention but become essential at generation time. Because these tokens lack the statistical signals that eviction policies rely on, no method based on attention scores, reconstruction loss, or learned retention gates retains them. We introduce Transactional Attention (TA), a sponsorship mechanism in which structural anchor patterns (e.g., "key:", "password:") protect adjacent value-bearing tokens from eviction. TA achieves 100% credential retrieval at K=16 where six baselines (H2O, TOVA, SnapKV, StreamingLLM, PyramidKV, DynamicKV) achieve 0%, and sustains 100% accuracy across 200 function-calling trials. TA-Fast, an attention-free variant, reduces memory overhead…
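To make the failure mode and the sponsorship idea concrete, here is a minimal sketch, not the paper's implementation. It assumes whitespace-level tokens, toy per-token importance scores standing in for accumulated attention, a score-only top-K eviction policy as the baseline, and a sponsorship rule where an anchor pins itself plus the next `span` tokens. The anchor set, `span`, and the helpers `sponsored_positions` and `select_kept` are hypothetical names chosen for illustration.

```python
from typing import Iterable, List, Set

# Hypothetical anchor set; the abstract cites "key:" and "password:" as examples.
ANCHORS = {"key:", "password:"}


def sponsored_positions(tokens: List[str], span: int = 1) -> Set[int]:
    """Positions protected by an anchor: the anchor itself plus the next `span` tokens (assumed rule)."""
    protected: Set[int] = set()
    for i, tok in enumerate(tokens):
        if tok.lower() in ANCHORS:
            protected.add(i)
            protected.update(range(i + 1, min(i + 1 + span, len(tokens))))
    return protected


def select_kept(scores: List[float], budget: int, pinned: Iterable[int] = ()) -> List[int]:
    """Keep `budget` positions: pinned ones first, then the highest-scoring remainder."""
    pinned_list = sorted(set(pinned))[:budget]
    pinned_set = set(pinned_list)
    rest = sorted(
        (i for i in range(len(scores)) if i not in pinned_set),
        key=lambda i: scores[i],
        reverse=True,
    )
    return sorted(pinned_list + rest[: budget - len(pinned_list)])


# Toy example: the credential "hunter2" is dormant (near-zero score).
tokens = ["The", "config", "says", "password:", "hunter2",
          "and", "the", "model", "should", "reply"]
scores = [0.9, 0.5, 0.4, 0.05, 0.01, 0.3, 0.6, 0.8, 0.7, 0.95]

baseline = select_kept(scores, budget=4)
sponsored = select_kept(scores, budget=4, pinned=sponsored_positions(tokens))
print([tokens[i] for i in baseline])   # ['The', 'model', 'should', 'reply']
print([tokens[i] for i in sponsored])  # ['The', 'password:', 'hunter2', 'reply']
```

The score-only policy evicts the credential because nothing in its statistics marks it as important. Pinning anchor-adjacent tokens before filling the rest of the budget keeps it. The pin-first ordering is a design choice of this sketch; how TA actually allocates the budget between sponsored and scored tokens is not specified here.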
