MKA: Memory-Keyed Attention for Efficient Long-Context Reasoning

Dong Liu; Yanxuan Yu; Ben Lengerich; Ying Nian Wu

arXiv:2603.20586·cs.LG·March 25, 2026

Open Access

TL;DR

The paper introduces Memory-Keyed Attention (MKA), a hierarchical attention mechanism that learns to route attention across multi-level KV caches (local, session, and long-term) for long-context language modeling. Its route-fused variant, FastMKA, improves training throughput while keeping perplexity comparable to MLA.

Contribution

MKA is a novel hierarchical attention framework that dynamically routes attention across multi-level caches, enhancing efficiency in long-context modeling.

Findings

01  FastMKA achieves up to 5x faster training throughput.

02  FastMKA maintains perplexity comparable to MLA.

03  Across different sequence lengths, FastMKA achieves a favorable accuracy-efficiency trade-off.

Abstract

As long-context language modeling becomes increasingly important, the cost of maintaining and attending to large Key/Value (KV) caches grows rapidly, becoming a major bottleneck in both training and inference. While prior works such as Multi-Query Attention (MQA) and Multi-Latent Attention (MLA) reduce memory by sharing or compressing KV features, they often trade off representation quality or incur runtime overhead. We propose Memory-Keyed Attention (MKA), a hierarchical attention mechanism that integrates multi-level KV caches (local, session, and long-term) and learns to route attention across them dynamically. We further introduce Route-Fused MKA (FastMKA), a broadcast-routed variant that fuses memory sources before attention computation for improved efficiency. Experiments on different sequence lengths show that FastMKA achieves a favorable accuracy-efficiency trade-off: comparable…
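The route-then-fuse idea described in the abstract can be illustrated with a small NumPy sketch. This is only one possible reading of the abstract, not the authors' implementation: the function name `route_fused_attention`, the linear router `w_route`, the single-head layout, the causal mask, and the assumption that all three memory sources are aligned to the same length `T` are hypothetical choices made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax; rows may contain -inf (masked) entries."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def route_fused_attention(x, memories, w_route, w_q, w_k, w_v):
    """Single-head sketch: fuse memory sources per token, then attend once.

    x        : (T, d) token representations
    memories : list of n_mem arrays, each (T, d), e.g. local/session/long-term
               (ASSUMPTION: all sources are aligned to the same length T)
    w_route  : (d, n_mem) hypothetical linear router
    w_q, w_k, w_v : (d, d) projection matrices
    """
    # Per-token routing weights over the memory sources, shape (T, n_mem).
    gates = softmax(x @ w_route, axis=-1)

    # Fuse the sources BEFORE attention: weighted sum over sources, shape (T, d).
    stacked = np.stack(memories, axis=1)              # (T, n_mem, d)
    fused = (gates[:, :, None] * stacked).sum(axis=1)

    # Ordinary causal scaled dot-product attention over the fused memory.
    q, k, v = x @ w_q, fused @ w_k, fused @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    T = x.shape[0]
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    return softmax(scores, axis=-1) @ v, gates


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, d, n_mem = 6, 16, 3
    x = rng.normal(size=(T, d))
    memories = [rng.normal(size=(T, d)) for _ in range(n_mem)]
    w_route = rng.normal(size=(d, n_mem))
    w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))
    out, gates = route_fused_attention(x, memories, w_route, w_q, w_k, w_v)
    print(out.shape, gates.shape)
```

Under these assumptions, fusing first means attention runs once over a single fused key/value set rather than once per memory level, which is the kind of saving the abstract attributes to the broadcast-routed variant.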

Taxonomy

Topics: Natural Language Processing Techniques · Machine Learning in Healthcare · Topic Modeling