Multipole Attention for Efficient Long Context Reasoning

Coleman Hooper; Sebastian Zhao; Luca Manolache; Sehoon Kim; Michael W. Mahoney; Yakun Sophia Shao; Kurt Keutzer; Amir Gholami

arXiv:2506.13059 · cs.CL · December 16, 2025

TL;DR

This paper introduces Multipole Attention, a method that accelerates long-context reasoning in large reasoning models by computing exact attention only for the most important tokens and approximating attention for the rest, achieving up to 4.5× speedup in attention computation while maintaining accuracy on complex reasoning tasks.

Contribution

The paper proposes Multipole Attention, a clustering-based approach that accelerates autoregressive reasoning by computing exact attention for important tokens and approximate attention for the remaining tokens.
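
To make the idea of mixing exact and approximate attention concrete, here is a minimal pure-Python sketch for a single query. It is an illustration under stated assumptions, not the authors' implementation: the function names (`exact_attention`, `clustered_attention`), the precomputed `assignments` list, the use of mean centroids, and the weighting of each centroid by its cluster size are all assumptions introduced for this example. Clusters whose centroids score highest against the query are attended to exactly; every other cluster is represented by one centroid key and one mean value.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mean(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def exact_attention(q, keys, values):
    """Standard softmax attention for one query vector."""
    scores = [dot(q, k) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / z
            for d in range(dim)]


def clustered_attention(q, keys, values, assignments, num_exact):
    """Hypothetical sketch: exact attention for the `num_exact` clusters
    whose centroids score highest against q, centroid-based approximation
    for all other clusters. `assignments[i]` is the cluster id of key i
    (assumed to come from some clustering step not shown here)."""
    clusters = {}
    for i, c in enumerate(assignments):
        clusters.setdefault(c, []).append(i)
    centroids = {c: mean([keys[i] for i in idx])
                 for c, idx in clusters.items()}

    # Rank clusters by query-centroid score and keep the top ones exact.
    ranked = sorted(clusters, key=lambda c: dot(q, centroids[c]),
                    reverse=True)
    exact_ids = set(ranked[:num_exact])

    # Each term is (score, multiplicity, value vector).
    terms = []
    for c, idx in clusters.items():
        if c in exact_ids:
            terms.extend((dot(q, keys[i]), 1, values[i]) for i in idx)
        else:
            # Assumption: treat every key in the cluster as its centroid,
            # so the cluster contributes len(idx) copies of one score.
            terms.append((dot(q, centroids[c]), len(idx),
                          mean([values[i] for i in idx])))

    m = max(s for s, _, _ in terms)
    weights = [n * math.exp(s - m) for s, n, _ in terms]
    z = sum(weights)
    dim = len(values[0])
    return [sum(w * t[2][d] for w, t in zip(weights, terms)) / z
            for d in range(dim)]
```

With `num_exact` equal to the number of clusters, the sketch reduces to exact attention. Lowering it replaces per-token score computations with one computation per approximated cluster, which is where the savings would come from in this simplified setting.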

Findings

01. Achieves up to 4.5× speedup in attention computation.

02. Maintains high accuracy on complex reasoning tasks.

03. Demonstrates effectiveness on models like Qwen-8B.

Abstract

Large Reasoning Models (LRMs) have shown promising accuracy improvements on complex problem-solving tasks. While these models have attained high accuracy by leveraging additional computation at test time, they need to generate long chain-of-thought reasoning in order to think before answering, which requires generating thousands of tokens. While sparse attention methods can help reduce the KV cache pressure induced by this long autoregressive reasoning, these methods can introduce errors which disrupt the reasoning process. Additionally, prior methods often pre-process the input to make it easier to identify the important prompt tokens when computing attention during generation, and this pre-processing is challenging to perform online for newly generated reasoning tokens. Our work addresses these challenges by introducing Multipole Attention, which accelerates autoregressive reasoning…

Videos

Multipole Attention for Efficient Long Context Reasoning · SlidesLive

Taxonomy

Topics: Geophysical Methods and Applications · Indoor and Outdoor Localization Technologies · Speech Recognition and Synthesis