MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention
Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Zhenhua Han, Amir H. Abdi, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, Lili Qiu

TL;DR
This paper introduces MInference, a dynamic sparse attention method that accelerates long-context LLM pre-filling by exploiting unique attention matrix patterns, achieving up to 10x speedup without accuracy loss.
Contribution
We propose a novel dynamic sparse attention technique that identifies and leverages specific patterns in long-context attention matrices for efficient GPU computation.
Findings
Up to 10x reduction in pre-filling latency on A100 GPUs.
Effective across various models and downstream tasks.
Maintains accuracy while significantly speeding up inference.
Abstract
The computational challenges of Large Language Model (LLM) inference remain a significant barrier to their widespread deployment, especially as prompt lengths continue to increase. Due to the quadratic complexity of the attention computation, it takes 30 minutes for an 8B LLM to process a prompt of 1M tokens (i.e., the pre-filling stage) on a single A100 GPU. Existing methods for speeding up pre-filling often fail to maintain acceptable accuracy or efficiency when applied to long-context LLMs. To address this gap, we introduce MInference (Milliontokens Inference), a sparse calculation method designed to accelerate pre-filling of long-sequence processing. Specifically, we identify three unique patterns in long-context attention matrices (the A-shape, Vertical-Slash, and Block-Sparse) that can be leveraged for efficient sparse computation on GPUs. We determine the optimal pattern for each…
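To make the three pattern names concrete, the sketch below builds small dense boolean masks that match one plausible reading of each name. It is an illustration only, not the paper's GPU kernels or its pattern-selection procedure: the function names (`a_shape_mask`, `vertical_slash_mask`, `block_sparse_mask`, `density`) and all parameters are hypothetical, and the definitions assume that A-shape means a few initial key columns plus a local window, Vertical-Slash means selected key columns plus selected diagonals, and Block-Sparse means selected tiles.

```python
def a_shape_mask(n, sink, window):
    """Causal mask keeping the first `sink` key columns plus a local window.

    Assumed reading of "A-shape": initial tokens + recent tokens.
    """
    return [[(k <= q) and (k < sink or q - k < window) for k in range(n)]
            for q in range(n)]


def vertical_slash_mask(n, vertical_cols, slash_offsets):
    """Causal mask keeping chosen key columns (vertical lines) and chosen
    diagonals at fixed query-key offsets (slash lines)."""
    cols, offsets = set(vertical_cols), set(slash_offsets)
    return [[(k <= q) and (k in cols or (q - k) in offsets) for k in range(n)]
            for q in range(n)]


def block_sparse_mask(n, block, kept_blocks):
    """Causal mask keeping chosen (query_block, key_block) tiles."""
    kept = set(kept_blocks)
    return [[(k <= q) and ((q // block, k // block) in kept) for k in range(n)]
            for q in range(n)]


def density(mask):
    """Fraction of the causal (k <= q) entries that the mask keeps."""
    n = len(mask)
    kept = sum(sum(row) for row in mask)
    return kept / (n * (n + 1) // 2)


if __name__ == "__main__":
    n = 8
    masks = {
        "A-shape": a_shape_mask(n, sink=1, window=2),
        "Vertical-Slash": vertical_slash_mask(n, [0, 3], [0]),
        "Block-Sparse": block_sparse_mask(n, 4, [(0, 0), (1, 1)]),
    }
    for name, mask in masks.items():
        print(name, f"density={density(mask):.2f}")
        for row in mask:
            print("".join("#" if kept else "." for kept in row))
```

In a sparse attention computation, only the kept entries would contribute to the softmax for each query row, so a lower density means less work than full causal attention. Real kernels would operate on index lists or tiles rather than materializing an n-by-n mask as done here.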
Code & Models
- 🤗 Qwen/Qwen3-30B-A3B-Instruct-2507 (model) · 1.0M dl · ♡ 795
- 🤗 Qwen/Qwen3-235B-A22B-Instruct-2507 (model) · 178k dl · ♡ 770
- 🤗 Qwen/Qwen3-235B-A22B-Thinking-2507 (model) · 78k dl · ♡ 403
- 🤗 Qwen/Qwen3-30B-A3B-Thinking-2507 (model) · 1.0M dl · ♡ 371
- 🤗 AIDXteam/Qwen3-235B-A22B-Thinking-2507-AWQ (model) · 4 dl
- 🤗 AmirHaz/Affine-yollloooo (model) · 19 dl
- 🤗 Mungert/Qwen3-30B-A3B-Thinking-2507-GGUF (model) · 234 dl
- 🤗 Mungert/Qwen3-30B-A3B-Instruct-2507-GGUF (model) · 145 dl · ♡ 2
- 🤗 Intellicia/Sullivan (model)
- 🤗 chutesai/Qwen3-235B-A22B-Instruct-2507-1M (model) · 2 dl · ♡ 1
Taxonomy
Topics: Advancements in Photolithography Techniques · Medical Imaging Techniques and Applications · Innovative Microfluidic and Catalytic Techniques Innovation
Methods: Softmax · Attention Is All You Need
