MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention

Huiqiang Jiang; Yucheng Li; Chengruidong Zhang; Qianhui Wu; Xufang Luo; Surin Ahn; Zhenhua Han; Amir H. Abdi; Dongsheng Li; Chin-Yew Lin; Yuqing Yang; Lili Qiu

arXiv:2407.02490·cs.CL·October 31, 2024·3 cites

TL;DR

This paper introduces MInference, a dynamic sparse attention method that accelerates long-context LLM pre-filling by exploiting unique attention matrix patterns, achieving up to 10x speedup without accuracy loss.

Contribution

We propose a novel dynamic sparse attention technique that identifies and leverages specific patterns in long-context attention matrices for efficient GPU computation.

Findings

01 Up to 10x reduction in pre-filling latency on A100 GPUs.

02 Effective across various models and downstream tasks.

03 Maintains accuracy while significantly speeding up inference.

Abstract

The computational challenges of Large Language Model (LLM) inference remain a significant barrier to their widespread deployment, especially as prompt lengths continue to increase. Due to the quadratic complexity of the attention computation, it takes 30 minutes for an 8B LLM to process a prompt of 1M tokens (i.e., the pre-filling stage) on a single A100 GPU. Existing methods for speeding up pre-filling often fail to maintain acceptable accuracy or efficiency when applied to long-context LLMs. To address this gap, we introduce MInference (Milliontokens Inference), a sparse calculation method designed to accelerate pre-filling of long-sequence processing. Specifically, we identify three unique patterns in long-context attention matrices (the A-shape, Vertical-Slash, and Block-Sparse) that can be leveraged for efficient sparse computation on GPUs. We determine the optimal pattern for each…
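The abstract names the three patterns but does not define them here, so the following is only an illustrative sketch of what such masks might look like, not the paper's implementation. It assumes that A-shape means a few initial "sink" tokens plus a local window, that Vertical-Slash means a set of key columns plus a set of diagonals, and that Block-Sparse means a chosen set of tiles. All function and parameter names (`a_shape_mask`, `n_sink`, `window`, `vertical_cols`, `slash_offsets`, `kept_blocks`) are hypothetical.

```python
import numpy as np

def causal_mask(n):
    """Lower-triangular mask: query i may attend to keys 0..i."""
    return np.tril(np.ones((n, n), dtype=bool))

def a_shape_mask(n, n_sink, window):
    """ASSUMED A-shape: first n_sink key columns plus a local window."""
    q = np.arange(n)[:, None]
    k = np.arange(n)[None, :]
    keep = (k < n_sink) | (q - k < window)
    return keep & causal_mask(n)

def vertical_slash_mask(n, vertical_cols, slash_offsets):
    """ASSUMED Vertical-Slash: chosen key columns plus chosen diagonals,
    where a diagonal is identified by its offset q - k."""
    q = np.arange(n)[:, None]
    k = np.arange(n)[None, :]
    vertical = np.isin(k, vertical_cols)
    slash = np.isin(q - k, slash_offsets)
    return (vertical | slash) & causal_mask(n)

def block_sparse_mask(n, block, kept_blocks):
    """ASSUMED Block-Sparse: keep only the listed (row, col) tiles."""
    mask = np.zeros((n, n), dtype=bool)
    for bi, bj in kept_blocks:
        mask[bi * block:(bi + 1) * block, bj * block:(bj + 1) * block] = True
    return mask & causal_mask(n)

def density(mask):
    """Fraction of causal attention entries that the mask keeps."""
    return mask.sum() / causal_mask(mask.shape[0]).sum()

if __name__ == "__main__":
    n = 1024
    masks = {
        "A-shape": a_shape_mask(n, n_sink=4, window=64),
        "Vertical-Slash": vertical_slash_mask(n, [0, 1, 500], [0, 1, 128]),
        "Block-Sparse": block_sparse_mask(n, 64, [(i, i) for i in range(16)]),
    }
    for name, m in masks.items():
        print(f"{name}: keeps {density(m):.1%} of causal entries")
```

Dense boolean masks like these still cost quadratic memory, so they only visualize the sparsity structure. Any real speedup would have to come from GPU kernels that skip the masked-out regions entirely, which this sketch does not attempt.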

Code & Models

Videos

MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention · SlidesLive

Taxonomy

Topics: Advancements in Photolithography Techniques · Medical Imaging Techniques and Applications · Innovative Microfluidic and Catalytic Techniques Innovation

Methods: Softmax · Attention Is All You Need