TL;DR
MAC-Attention is a novel method that accelerates long-context decoding in large language models by reusing prior attention computations, significantly reducing KV accesses and latency while preserving fidelity.
Contribution
It introduces a match-amend-complete scheme that reuses attention computations from semantically similar recent queries, speeding up decoding without degrading output quality.
Findings
Reduces KV accesses by up to 99%
Cuts token generation latency by over 60% at 128K context length
Achieves over 14.3x speedups in the attention phase
Abstract
Long-context decoding in LLMs is IO-bound: each token re-reads an ever-growing KV cache. Prior accelerations cut bytes via compression, which lowers fidelity, or selection/eviction, which restricts what remains accessible, and both can degrade delayed recall and long-form generation. We introduce MAC-Attention, a fidelity- and access-preserving alternative that accelerates decoding by reusing prior attention computations for semantically similar recent queries. It starts with a match stage that performs pre-RoPE L2 matching over a short local window; an amend stage rectifies the reused attention by recomputing a small band near the match boundary; and a complete stage fuses the rectified results with fresh attention computed on the KV tail through a numerically stable merge. On a match hit, the compute and bandwidth complexity is constant regardless of context length. The method is…
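The match and complete stages described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. The function names, the `threshold` parameter, and the choice to represent each partial result as an (output, log-sum-exp) pair are assumptions made for illustration. The merge shown is the standard log-sum-exp combination of two softmax partials, which is one common way to obtain a numerically stable merge. The amend stage is omitted.

```python
import math


def match_recent_query(query, recent_queries, threshold):
    """Return the index of the closest recent query by L2 distance,
    or None if no candidate is within `threshold` (hypothetical parameter).
    The abstract applies this matching to pre-RoPE queries in a short window."""
    best_idx, best_dist = None, threshold
    for idx, cand in enumerate(recent_queries):
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(query, cand)))
        if dist <= best_dist:
            best_idx, best_dist = idx, dist
    return best_idx


def partial_attention(query, keys, values):
    """Softmax attention of one query over one KV segment.
    Returns the normalized output and the log-sum-exp of the scores."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)  # subtract the max before exponentiating for stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    output = [
        sum(w * v[d] for w, v in zip(weights, values)) / total
        for d in range(dim)
    ]
    return output, top + math.log(total)


def merge_partials(out_a, lse_a, out_b, lse_b):
    """Combine two partial attention results over disjoint KV segments.
    Each partial is reweighted by its share of the total softmax mass."""
    top = max(lse_a, lse_b)
    w_a = math.exp(lse_a - top)
    w_b = math.exp(lse_b - top)
    norm = w_a + w_b
    merged = [(w_a * a + w_b * b) / norm for a, b in zip(out_a, out_b)]
    return merged, top + math.log(norm)
```

In this sketch, a reused result would play the role of `(out_a, lse_a)` for the KV prefix, and fresh attention over the KV tail would supply `(out_b, lse_b)`. For a fixed query, merging the two partials gives the same result as attention over the full cache, up to floating-point error.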
