MAC-Attention: a Match-Amend-Complete Scheme for Fast and Accurate Attention Computation

Jinghan Yao; Sam Adé Jacobs; Walid Krichene; Masahiro Tanaka; Dhabaleswar K Panda

arXiv:2604.00235 · cs.LG · April 2, 2026

TL;DR

MAC-Attention accelerates long-context decoding in large language models by reusing prior attention computations for similar recent queries, which reduces KV cache accesses and latency while preserving fidelity.

Contribution

It introduces a match-amend-complete scheme that reuses attention computations for similar recent queries, improving speed and efficiency without degrading quality.
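
As a rough illustration of the match step, the sketch below looks for the closest recent query under L2 distance, which is the metric the abstract names. The function name `find_match`, the `threshold` parameter, and the list-based window are assumptions made for this example and are not taken from the paper or its repository.

```python
import math


def find_match(query, recent_queries, threshold):
    """Return the index of the closest recent query within `threshold`.

    `recent_queries` stands in for a short local window of earlier
    (pre-RoPE) query vectors. Returns None when no candidate is close
    enough, meaning nothing is reused and attention is computed in full.
    The threshold value is a hypothetical tuning knob.
    """
    best_idx, best_dist = None, threshold
    for idx, candidate in enumerate(recent_queries):
        dist = math.dist(query, candidate)  # Euclidean (L2) distance
        if dist <= best_dist:
            best_idx, best_dist = idx, dist
    return best_idx


window = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
hit = find_match([1.0, 0.05], window, threshold=0.2)
miss = find_match([5.0, 5.0], window, threshold=0.2)
```

Here `hit` points at the nearest stored query and `miss` is `None`. A real implementation would presumably run this comparison in a batched GPU kernel rather than a Python loop.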

Findings

01 Reduces KV accesses by up to 99%

02 Cuts token generation latency by over 60% at 128K context length

03 Achieves over 14.3x speedups in attention phase

Abstract

Long-context decoding in LLMs is IO-bound: each token re-reads an ever-growing KV cache. Prior accelerations cut bytes via compression, which lowers fidelity, or selection/eviction, which restricts what remains accessible, and both can degrade delayed recall and long-form generation. We introduce MAC-Attention, a fidelity- and access-preserving alternative that accelerates decoding by reusing prior attention computations for semantically similar recent queries. It starts with a match stage that performs pre-RoPE L2 matching over a short local window; an amend stage rectifies the reused attention by recomputing a small band near the match boundary; and a complete stage fuses the rectified results with fresh attention computed on the KV tail through a numerically stable merge. On a match hit, the compute and bandwidth complexity is constant regardless of context length. The method is…
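
The abstract does not spell out the "numerically stable merge" used in the complete stage. One common way to fuse attention results computed over disjoint KV segments is to carry each segment's log-sum-exp alongside its output and reweight accordingly. The sketch below shows that standard technique, not the paper's kernel. The names `partial_attention` and `merge_partials` are hypothetical.

```python
import math


def partial_attention(q, keys, values):
    """Softmax attention of one query over one KV segment.

    Returns the segment's output vector and the log-sum-exp of its
    scaled scores, which is all that is needed to merge segments later.
    """
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max before exponentiating
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) / total
           for d in range(dim)]
    return out, m + math.log(total)


def merge_partials(out_a, lse_a, out_b, lse_b):
    """Combine two partial results as if attention ran over both segments."""
    m = max(lse_a, lse_b)
    w_a = math.exp(lse_a - m)
    w_b = math.exp(lse_b - m)
    merged = [(w_a * a + w_b * b) / (w_a + w_b)
              for a, b in zip(out_a, out_b)]
    return merged, m + math.log(w_a + w_b)


q = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [2.0, -1.0], [-0.5, 0.3]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

# Pretend the first three positions were reused and the last one is the tail.
reused, reused_lse = partial_attention(q, keys[:3], values[:3])
tail, tail_lse = partial_attention(q, keys[3:], values[3:])
merged, merged_lse = merge_partials(reused, reused_lse, tail, tail_lse)
```

Under these assumptions `merged` equals, up to floating-point error, the result of running attention over all four positions at once, so a reused prefix and a freshly computed tail can be combined without revisiting the prefix's keys and values.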

Code & Models

Repositories

YJHMITWEB/MAC-Attention.git (GitHub)