CritiPrefill: A Segment-wise Criticality-based Approach for Prefilling Acceleration in LLMs
Junlin Lv, Yuan Feng, Xike Xie, Xin Jia, Qirong Peng, and Guiming Xie

TL;DR
CritiPrefill introduces a segment-wise criticality-based method to accelerate the prefilling phase in long-context LLM inference, achieving up to 3x speedup with minimal quality loss.
Contribution
The paper proposes CritiPrefill, a novel segment-wise criticality estimation approach that prunes non-critical computations during prefilling in long-context LLMs.
Findings
Up to 2.7x speedup on Llama3-8B
Up to 3.0x speedup on Yi-9B
Minimal quality degradation observed
Abstract
Large language models have achieved notable success across various domains, yet efficient inference is still limited by the quadratic computation complexity of the attention mechanism. The inference consists of prefilling and decoding phases. Although several attempts have been made to accelerate decoding, the inefficiency of the prefilling phase, especially for long-context tasks, remains a challenge. In this paper, we observe a locality in query criticality during the prefilling phase of long-context processing: adjacent query tokens tend to focus on similar subsets of the past Key-Value (KV) cache. Based on this observation, we propose CritiPrefill, a criticality-based segment-wise prefilling method. This method partitions the input sequence's queries and KV cache into segments and blocks, utilizing a segment-wise algorithm to estimate the query criticality. By pruning non-critical…
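The abstract's core idea (group queries into segments, group the KV cache into blocks, score each block per segment, and skip low-scoring blocks) can be illustrated with a small sketch. This is not the authors' implementation. The function name `critical_blocks`, the mean-pooled representatives, the dot-product score, and the fixed per-segment `budget` are all assumptions chosen for illustration. The paper's actual estimator may differ.

```python
def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def critical_blocks(queries, keys, segment_size, block_size, budget):
    """Return, for each query segment, the sorted KV block indices to keep.

    Illustrative assumptions (not taken from the paper): mean pooling for
    segment/block representatives, dot-product scoring, and a fixed budget
    of past blocks per segment.
    """
    n = len(queries)
    block_starts = list(range(0, len(keys), block_size))
    block_reps = [mean_vector(keys[s:s + block_size]) for s in block_starts]

    plan = []
    for seg_start in range(0, n, segment_size):
        seg_end = min(seg_start + segment_size, n)
        seg_rep = mean_vector(queries[seg_start:seg_end])

        # Causality: a segment can only see blocks that start before it ends.
        visible = [b for b, s in enumerate(block_starts) if s < seg_end]
        # Blocks overlapping the segment itself are always kept.
        local = {b for b in visible if block_starts[b] + block_size > seg_start}
        # Rank the remaining past blocks by estimated criticality.
        past = [b for b in visible if b not in local]
        past.sort(key=lambda b: dot(seg_rep, block_reps[b]), reverse=True)

        plan.append(sorted(local | set(past[:budget])))
    return plan


# Toy input: 8 tokens, 2-dimensional vectors, segments of 4, blocks of 2.
toy_queries = [[1.0, 0.0] for _ in range(8)]
toy_keys = [[0.0, 1.0]] * 2 + [[5.0, 0.0]] * 2 + [[1.0, 0.0]] * 4
print(critical_blocks(toy_queries, toy_keys, segment_size=4, block_size=2, budget=1))
# prints [[0, 1], [1, 2, 3]]
```

In the toy run, the second segment keeps its own two blocks plus block 1, whose keys align with the queries, and drops block 0. Attention for that segment would then be computed over three blocks instead of four, and the savings grow as the context gets longer while the budget stays fixed.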
Taxonomy
Topics: Semantic Web and Ontologies · Business Process Modeling and Analysis · Access Control and Trust
Methods: Softmax · Attention Is All You Need · Pruning · Focus
