Accelerating Prefilling for Long-Context LLMs via Sparse Pattern Sharing
Dan Peng, Zhihui Fu, Zewen Ye, Zhuoran Song, Jun Wang

TL;DR
This paper introduces a novel sparse attention mechanism for long-context LLMs that shares accurate attention patterns across heads, significantly improving efficiency without sacrificing accuracy.
Contribution
The paper proposes a new sparse attention method that leverages inter-head pattern similarity to enhance speed and accuracy in long-context inference.
Findings
Achieves speedups comparable to or better than state-of-the-art methods.
Maintains high accuracy by capturing true attention dynamics.
Reduces full attention computations to a small subset of heads.
Abstract
Sparse attention methods exploit the inherent sparsity in attention to speed up the prefilling phase of long-context inference, mitigating the quadratic complexity of full attention computation. Existing sparse attention methods rely on predefined patterns or inaccurate estimations to approximate attention behavior, so they often fail to capture the true dynamics of attention, which reduces efficiency and compromises accuracy. We instead propose a highly accurate sparse attention mechanism that shares similar yet precise attention patterns across heads, enabling a more realistic capture of the dynamic behavior of attention. Our approach is grounded in two key observations: (1) attention patterns demonstrate strong inter-head similarity, and (2) this similarity remains remarkably consistent across diverse inputs. By strategically sharing computed accurate patterns across…
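To make the idea of sharing a pattern across heads concrete, here is a minimal NumPy sketch. It is not the paper's algorithm: the block size, the cumulative-mass threshold `keep_mass`, and every function name below are assumptions chosen for illustration. The sketch computes full causal attention for one "anchor" head, reduces it to a block-level boolean pattern, and reuses that pattern for another head; a Jaccard score gives one possible way to quantify inter-head pattern similarity.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax; exp(-inf) evaluates to 0 for masked entries.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def causal_attention_probs(q, k):
    """Full causal attention probabilities for one head, shape (n, n)."""
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    return softmax(np.where(future, -np.inf, scores))


def block_pattern(probs, block, keep_mass=0.95):
    """Hypothetical selection rule: for each query block, keep the fewest
    key blocks whose summed attention mass reaches `keep_mass`."""
    nb = probs.shape[0] // block
    mass = probs.reshape(nb, block, nb, block).sum(axis=(1, 3))
    mass = mass / mass.sum(axis=1, keepdims=True)
    order = np.argsort(-mass, axis=1)  # heaviest key blocks first
    sorted_mass = np.take_along_axis(mass, order, axis=1)
    mass_before = np.cumsum(sorted_mass, axis=1) - sorted_mass
    keep_sorted = mass_before < keep_mass
    pattern = np.zeros_like(keep_sorted)
    np.put_along_axis(pattern, order, keep_sorted, axis=1)
    # Always keep the diagonal so every query attends to at least itself.
    pattern[np.arange(nb), np.arange(nb)] = True
    return pattern


def pattern_similarity(a, b):
    """Jaccard similarity between two boolean block patterns."""
    return (a & b).sum() / (a | b).sum()


def sparse_attention(q, k, v, pattern, block):
    """Attention restricted to the blocks in `pattern`.
    For clarity this masks a dense score matrix; a real kernel would
    skip the unselected blocks instead of computing and masking them."""
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    allowed = np.kron(pattern, np.ones((block, block), dtype=bool)).astype(bool)
    allowed &= np.tril(np.ones((n, n), dtype=bool))  # enforce causality
    return softmax(np.where(allowed, scores, -np.inf)) @ v


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d, block = 128, 32, 16
    # Head B is a perturbed copy of head A, standing in for two similar heads.
    q_a, k_a = rng.standard_normal((n, d)), rng.standard_normal((n, d))
    q_b = q_a + 0.1 * rng.standard_normal((n, d))
    k_b = k_a + 0.1 * rng.standard_normal((n, d))
    v_b = rng.standard_normal((n, d))

    pattern_a = block_pattern(causal_attention_probs(q_a, k_a), block)
    pattern_b = block_pattern(causal_attention_probs(q_b, k_b), block)
    print("pattern similarity:", pattern_similarity(pattern_a, pattern_b))

    shared = sparse_attention(q_b, k_b, v_b, pattern_a, block)
    full = causal_attention_probs(q_b, k_b) @ v_b
    print("max abs error with shared pattern:", np.abs(shared - full).max())
```

In this toy setup only the anchor head pays for a full attention pass; the second head reuses the anchor's pattern. How the paper chooses which heads compute full attention, and how patterns are matched to heads, is not specified in the text above, so those choices are left out of the sketch.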
Taxonomy
Topics: Digital Rights Management and Security · Service-Oriented Architecture and Web Services
Methods: Softmax · Attention Is All You Need · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
