FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference

Xunhao Lai; Jianqiao Lu; Yao Luo; Yiyuan Ma; Xun Zhou

arXiv:2502.20766·cs.LG·March 3, 2025

TL;DR

FlexPrefill introduces a dynamic, context-aware sparse attention mechanism for large language models that adapts attention patterns in real time, improving both efficiency and accuracy during long-sequence inference.

Contribution

The paper proposes a flexible sparse pre-filling method that adjusts sparse attention patterns and the computational budget to the requirements of each input and each attention head.

Findings

01 Significant speed improvements over prior methods.

02 Enhanced inference accuracy with adaptive attention patterns.

03 Demonstrated effectiveness on long-sequence tasks.

Abstract

Large language models (LLMs) encounter computational challenges during long-sequence inference, especially in the attention pre-filling phase, where the complexity grows quadratically with the prompt length. Previous efforts to mitigate these challenges have relied on fixed sparse attention patterns or identifying sparse attention patterns based on limited cases. However, these methods lacked the flexibility to efficiently adapt to varying input demands. In this paper, we introduce FlexPrefill, a Flexible sparse Pre-filling mechanism that dynamically adjusts sparse attention patterns and computational budget in real-time to meet the specific requirements of each input and attention head. The flexibility of our method is demonstrated through two key innovations: 1) Query-Aware Sparse Pattern Determination: By measuring Jensen-Shannon divergence, this component adaptively switches between…
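The abstract names Jensen-Shannon divergence as the measure behind Query-Aware Sparse Pattern Determination, but the text here is cut off before it says what is compared or what the method switches between. As a minimal sketch of the measure itself, the following pure-Python example computes the Jensen-Shannon divergence between two discrete probability distributions, such as two normalized attention score vectors. The function names, the `threshold` value, and the `distributions_agree` helper are hypothetical illustrations, not taken from the paper or from the bytedance/FlexPrefill repository.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats; terms where p_i == 0 contribute zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, and bounded by ln(2) in nats."""
    # Mixture distribution; m_i > 0 wherever p_i > 0 or q_i > 0,
    # so the KL terms below never divide by zero.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def distributions_agree(p, q, threshold=0.1):
    """Hypothetical decision rule: treat two distributions as 'close'
    when their JS divergence falls below an assumed threshold."""
    return js_divergence(p, q) < threshold


if __name__ == "__main__":
    estimated = [0.70, 0.20, 0.10]   # made-up attention distribution
    reference = [0.65, 0.25, 0.10]   # made-up comparison distribution
    print(js_divergence(estimated, reference))
    print(distributions_agree(estimated, reference))
```

Because the divergence is symmetric and bounded, a single fixed threshold can be applied uniformly across attention heads, which is one plausible reason to prefer it over plain KL divergence for a per-head switching decision.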

Code & Models

Repositories

bytedance/FlexPrefill (PyTorch, official)

Taxonomy

Topics: Topic Modeling · Machine Learning in Healthcare · Data Quality and Management

Methods: Softmax · Attention Is All You Need · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings