CritiPrefill: A Segment-wise Criticality-based Approach for Prefilling Acceleration in LLMs

Junlin Lv, Yuan Feng, Xike Xie, Xin Jia, Qirong Peng, and Guiming Xie

arXiv:2409.12490 · cs.CL · September 24, 2024

TL;DR

CritiPrefill introduces a segment-wise criticality-based method to accelerate the prefilling phase in long-context LLM inference, achieving up to 3x speedup with minimal quality loss.

Contribution

The paper proposes CritiPrefill, a novel segment-wise criticality estimation approach that prunes non-critical computations during prefilling in long-context LLMs.

Findings

01. Up to 2.7x speedup on Llama3-8B

02. Up to 3.0x speedup on Yi-9B

03. Minimal quality degradation observed

Abstract

Large language models have achieved notable success across various domains, yet efficient inference is still limited by the quadratic computation complexity of the attention mechanism. The inference consists of prefilling and decoding phases. Although several attempts have been made to accelerate decoding, the inefficiency of the prefilling phase, especially for long-context tasks, remains a challenge. In this paper, we observe a locality in query criticality during the prefilling phase of long-context processing: adjacent query tokens tend to focus on similar subsets of the past Key-Value (KV) cache. Based on this observation, we propose CritiPrefill, a criticality-based segment-wise prefilling method. This method partitions the input sequence's queries and KV cache into segments and blocks, utilizing a segment-wise algorithm to estimate the query criticality. By pruning non-critical…
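The segment and block partitioning described in the abstract can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions, not the authors' implementation: the function name `select_critical_blocks`, the use of mean vectors to summarize each query segment and KV block, the dot-product score, and the fixed per-segment `budget` are all hypothetical choices made for clarity. The sketch only shows the selection step, where each query segment picks the KV blocks it will attend to and all other blocks are skipped.

```python
from typing import List

Vector = List[float]


def mean_vector(rows: List[Vector]) -> Vector:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def select_critical_blocks(
    queries: List[Vector],
    keys: List[Vector],
    segment_size: int,
    block_size: int,
    budget: int,
) -> List[List[int]]:
    """For each query segment, return the sorted indices of KV blocks to keep.

    Assumptions (not taken from the paper): segments and blocks are
    summarized by their mean vector, and criticality is the dot product
    of the two summaries.
    """
    # Summarize every KV block once, so scoring is per block, not per token.
    block_reps = [
        mean_vector(keys[start:start + block_size])
        for start in range(0, len(keys), block_size)
    ]

    selected: List[List[int]] = []
    for seg_start in range(0, len(queries), segment_size):
        seg_end = min(seg_start + segment_size, len(queries))  # exclusive
        seg_rep = mean_vector(queries[seg_start:seg_end])

        # Causality: a block is visible only if it starts before the segment ends.
        visible = [b for b in range(len(block_reps)) if b * block_size < seg_end]

        # Rank visible blocks by estimated criticality; ties go to the lower index.
        ranked = sorted(visible, key=lambda b: (-dot(seg_rep, block_reps[b]), b))
        selected.append(sorted(ranked[:budget]))
    return selected


# Toy input: two segments of two queries, two blocks of two keys.
queries = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
keys = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
kept = select_critical_blocks(queries, keys, segment_size=2, block_size=2, budget=1)
print(kept)  # [[0], [1]]
```

In this toy input the first segment can only see block 0, while the second segment sees both blocks and keeps block 1, whose keys align with its queries. A real prefill kernel would then run attention for each segment over its kept blocks only, still applying the token-level causal mask inside them.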

Code & Models

Repositories

66ring/critiprefill (PyTorch, official)

Taxonomy

Topics: Semantic Web and Ontologies · Business Process Modeling and Analysis · Access Control and Trust

Methods: Softmax · Attention Is All You Need · Pruning · Focus