MKA: Memory-Keyed Attention for Efficient Long-Context Reasoning
Dong Liu, Yanxuan Yu, Ben Lengerich, Ying Nian Wu

TL;DR
The paper introduces Memory-Keyed Attention (MKA), a hierarchical attention mechanism that routes attention across multi-level KV caches for long-context language modeling, improving speed while keeping accuracy comparable.
Contribution
MKA is a novel hierarchical attention framework that dynamically routes attention across multi-level caches, enhancing efficiency in long-context modeling.
Findings
FastMKA achieves up to 5x higher training throughput.
FastMKA maintains comparable perplexity to MLA.
Experiments across sequence lengths show a favorable accuracy-efficiency trade-off for FastMKA.
Abstract
As long-context language modeling becomes increasingly important, the cost of maintaining and attending to large Key/Value (KV) caches grows rapidly, becoming a major bottleneck in both training and inference. While prior works such as Multi-Query Attention (MQA) and Multi-Latent Attention (MLA) reduce memory by sharing or compressing KV features, they often trade off representation quality or incur runtime overhead. We propose Memory-Keyed Attention (MKA), a hierarchical attention mechanism that integrates multi-level KV caches (local, session, and long-term) and learns to route attention across them dynamically. We further introduce Route-Fused MKA (FastMKA), a broadcast-routed variant that fuses memory sources before attention computation for improved efficiency. Experiments on different sequence lengths show that FastMKA achieves a favorable accuracy-efficiency trade-off: comparable…
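The abstract describes two ideas: routing attention across several memory levels (MKA), and fusing those levels before a single attention pass (FastMKA). The sketch below illustrates one possible reading of each for a single query vector. It is not the authors' implementation: the gate vectors, the softmax router, the function names, and the requirement that every level holds the same number of slots are all assumptions made for illustration, since the abstract gives no formulas.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(q, keys, values):
    """Scaled dot-product attention for one query vector."""
    scale = 1.0 / math.sqrt(len(q))
    weights = softmax([dot(q, k) * scale for k in keys])
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]

def route(q, gate):
    """Hypothetical router: one gate vector per memory level, softmax over levels."""
    return softmax([dot(q, g) for g in gate])

def routed_attention(q, memories, gate):
    """Assumed MKA-style path: attend to each level separately, then mix the outputs."""
    lam = route(q, gate)
    outs = [attend(q, keys, values) for keys, values in memories]
    return [sum(l * o[j] for l, o in zip(lam, outs)) for j in range(len(outs[0]))]

def mix_levels(lam, tensors):
    """Weighted sum of same-shaped slot lists, one per memory level."""
    n_slots, dim = len(tensors[0]), len(tensors[0][0])
    return [[sum(l * t[i][j] for l, t in zip(lam, tensors)) for j in range(dim)]
            for i in range(n_slots)]

def fused_attention(q, memories, gate):
    """Assumed FastMKA-style path: mix the levels first, then run one attention pass.
    Assumes every level holds the same number of slots (an illustrative simplification)."""
    lam = route(q, gate)
    keys = mix_levels(lam, [k for k, _ in memories])
    values = mix_levels(lam, [v for _, v in memories])
    return attend(q, keys, values)

# Toy data: three memory levels, each a (keys, values) pair with 2 slots of width 2.
local = ([[1.0, 0.0], [0.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]])
session = ([[0.5, 0.5], [1.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
long_term = ([[0.0, 1.0], [1.0, 1.0]], [[2.0, 2.0], [0.5, 0.5]])
memories = [local, session, long_term]
gate = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # hypothetical learned parameters
q = [1.0, 0.5]

print("routing weights:", route(q, gate))
print("routed output:  ", routed_attention(q, memories, gate))
print("fused output:   ", fused_attention(q, memories, gate))
```

In this toy version the routed path runs one attention pass per memory level, while the fused path runs a single pass over the mixed keys and values, which is the kind of saving the abstract attributes to fusing memory sources before attention. The two paths generally give different outputs; they coincide when all levels hold identical contents.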
Taxonomy
Topics: Natural Language Processing Techniques · Machine Learning in Healthcare · Topic Modeling
