Accelerating Prefilling for Long-Context LLMs via Sparse Pattern Sharing

Dan Peng; Zhihui Fu; Zewen Ye; Zhuoran Song; Jun Wang

arXiv:2505.19578 · cs.LG · May 27, 2025

Open Access

TL;DR

This paper introduces a sparse attention mechanism for long-context LLM prefilling that computes accurate attention patterns for a small subset of heads and shares them with similar heads, improving efficiency without sacrificing accuracy.

Contribution

The paper proposes a sparse attention method that exploits inter-head pattern similarity to improve both speed and accuracy in the prefilling phase of long-context inference.

Findings

01. Achieves speedup comparable or superior to state-of-the-art methods.

02. Maintains high accuracy by capturing true attention dynamics.

03. Restricts full attention computation to a small subset of heads.

Abstract

Sparse attention methods exploit the inherent sparsity in attention to speed up the prefilling phase of long-context inference, mitigating the quadratic complexity of full attention computation. While existing sparse attention methods rely on predefined patterns or inaccurate estimations to approximate attention behavior, they often fail to fully capture the true dynamics of attention, resulting in reduced efficiency and compromised accuracy. Instead, we propose a highly accurate sparse attention mechanism that shares similar yet precise attention patterns across heads, enabling a more realistic capture of the dynamic behavior of attention. Our approach is grounded in two key observations: (1) attention patterns demonstrate strong inter-head similarity, and (2) this similarity remains remarkably consistent across diverse inputs. By strategically sharing computed accurate patterns across…
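
The sharing idea described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' algorithm: the block size, the cumulative-mass threshold `keep`, the always-on diagonal blocks, and every function name below are assumptions made for illustration. The code also masks a dense score matrix instead of skipping computation, so it shows the logic of pattern sharing but not the speedup.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_attention_probs(q, k):
    """Dense causal attention probabilities for one head; q, k have shape (n, d)."""
    n, d = q.shape
    causal = np.tril(np.ones((n, n), dtype=bool))
    scores = np.where(causal, q @ k.T / np.sqrt(d), -np.inf)
    return softmax(scores)

def block_pattern(probs, block=4, keep=0.9):
    """Block-level mask: per query block, keep the heaviest key blocks until
    they cover a `keep` fraction of that row's attention mass (assumed rule)."""
    n = probs.shape[0]
    assert n % block == 0, "sequence length must be a multiple of the block size"
    nb = n // block
    mass = probs.reshape(nb, block, nb, block).sum(axis=(1, 3))
    mass = mass / mass.sum(axis=1, keepdims=True)
    order = np.argsort(-mass, axis=1)                  # heaviest blocks first
    sorted_mass = np.take_along_axis(mass, order, axis=1)
    covered_before = np.cumsum(sorted_mass, axis=1) - sorted_mass
    mask = np.zeros((nb, nb), dtype=bool)
    np.put_along_axis(mask, order, covered_before < keep, axis=1)
    mask &= mass > 0                                   # never keep future (empty) blocks
    mask |= np.eye(nb, dtype=bool)                     # assumption: always keep local blocks
    return mask

def sparse_attention(q, k, v, mask, block=4):
    """Causal attention restricted to the key blocks selected in `mask`."""
    n, d = q.shape
    allowed = np.repeat(np.repeat(mask, block, axis=0), block, axis=1)
    allowed &= np.tril(np.ones((n, n), dtype=bool))
    scores = np.where(allowed, q @ k.T / np.sqrt(d), -np.inf)
    return softmax(scores) @ v

def jaccard(a, b):
    """Overlap between two boolean block masks, in [0, 1]."""
    return (a & b).sum() / (a | b).sum()

# Toy data: head 1 is a noisy copy of head 0, so their patterns should be similar.
rng = np.random.default_rng(0)
n, d, block = 64, 16, 4
q0, k0 = rng.standard_normal((n, d)), rng.standard_normal((n, d))
q1 = q0 + 0.1 * rng.standard_normal((n, d))
k1 = k0 + 0.1 * rng.standard_normal((n, d))
v1 = rng.standard_normal((n, d))

# Full attention is computed only for head 0; its pattern is reused for head 1.
shared = block_pattern(causal_attention_probs(q0, k0), block)
own = block_pattern(causal_attention_probs(q1, k1), block)
dense_out = causal_attention_probs(q1, k1) @ v1
shared_out = sparse_attention(q1, k1, v1, shared, block)

print("pattern overlap (Jaccard):", jaccard(shared, own))
print("max abs error from sharing:", np.abs(dense_out - shared_out).max())
```

In this toy setup the pattern is derived from one head and reused by another, which mirrors the two observations in the abstract only loosely: the similarity here is constructed by adding noise, whereas the paper reports it as an empirical property of real models.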

Taxonomy

Topics: Digital Rights Management and Security · Service-Oriented Architecture and Web Services

Methods: Softmax · Attention Is All You Need · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings