UniPrefill: Universal Long-Context Prefill Acceleration via Block-wise Dynamic Sparsification

Qihang Fan; Huaibo Huang; Zhiying Wu; Bingning Wang; Ran He

arXiv:2605.06221 · cs.CL · May 8, 2026

TL;DR

UniPrefill is a prefill acceleration framework for long-context inference that works across model architectures and achieves up to a 2.1x speedup in Time-To-First-Token (TTFT).

Contribution

The paper introduces a model-agnostic prefill acceleration method that is compatible with modern inference engines and architectures, and it extends vLLM to integrate the method.

Findings

01. Achieves up to 2.1x speedup in TTFT.

02. Effective across diverse model architectures.

03. Supports continuous batching and tensor parallelism.

Abstract

As large language models (LLMs) continue to advance rapidly, they are becoming increasingly capable while simultaneously demanding ever-longer context lengths. To improve the inference efficiency of long-context processing, several novel low-complexity hybrid architectures have recently been proposed, effectively alleviating the computational burden of long-context inference. However, existing research on long-context prefill acceleration remains predominantly focused on sparse attention mechanisms, which achieve their maximum speedup only on full-attention models. When transferred to emerging architectures--such as linear/full attention hybrids or sliding window/full attention hybrids--these prefill acceleration approaches suffer significant performance degradation. Furthermore, such methods are generally incompatible with continuous batching, making them difficult to integrate into…

Code & Models

Repositories

qhfan/UniPrefill (GitHub)