Zipage: Maintain High Request Concurrency for LLM Reasoning through Compressed PagedAttention

Mengqi Liao; Lu Wang; Chaoyun Zhang; Bo Qiao; Si Qin; Qingwei Lin; Saravan Rajmohan; Dongmei Zhang; Huaiyu Wan

arXiv:2603.08743·cs.DC·March 11, 2026

TL;DR

This paper introduces Zipage, a high-concurrency LLM inference engine that uses Compressed PagedAttention to efficiently manage memory and maintain high request throughput during reasoning tasks.

Contribution

The paper presents Compressed PagedAttention, which combines token-wise KV cache eviction with PagedAttention, together with a scheduling strategy. The combination enables high concurrency and memory efficiency in LLM inference and is aimed at industrial-grade deployment.

Findings

01

Achieves around 95% of the performance of Full KV inference engines on large-scale mathematical reasoning tasks.

02

Provides over 2.1× speedup in inference throughput.

03

Supports prefix caching and asynchronous compression for efficiency.

Abstract

With reasoning becoming the generative paradigm for large language models (LLMs), the memory bottleneck caused by KV cache during the decoding phase has become a critical factor limiting high-concurrency service. Although existing KV cache eviction methods address the memory issue, most of them are impractical for industrial-grade applications. This paper introduces Compressed PagedAttention, a method that combines token-wise KV cache eviction with PagedAttention. We propose a comprehensive scheduling strategy and support prefix caching and asynchronous compression for Compressed PagedAttention. Based on this, we have developed a high-concurrency LLM inference engine, Zipage. On large-scale mathematical reasoning tasks, Zipage achieves around 95% of the performance of Full KV inference engines while delivering over 2.1× speedup.
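
The general idea of pairing token-wise eviction with a paged KV cache can be illustrated with a toy sketch. This is not Zipage's implementation: the names `BlockPool`, `Sequence`, `compress`, and `BLOCK_SIZE` are hypothetical, token ids stand in for KV entries, and the score-based top-k selection is an assumed eviction policy chosen only for illustration.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 4  # tokens per KV block (hypothetical value for the demo)


@dataclass
class BlockPool:
    """Free list of physical block ids, in the spirit of a paged KV cache."""
    free: list

    def alloc(self):
        return self.free.pop()

    def release(self, block_id):
        self.free.append(block_id)


@dataclass
class Sequence:
    """One request. `tokens` stands in for cached KV entries;
    `block_table` lists the physical blocks the request holds."""
    tokens: list = field(default_factory=list)
    block_table: list = field(default_factory=list)

    def append(self, token, pool):
        # Allocate a new block only when the current ones are full.
        if len(self.tokens) % BLOCK_SIZE == 0:
            self.block_table.append(pool.alloc())
        self.tokens.append(token)


def compress(seq, scores, budget, pool):
    """Keep the `budget` highest-scoring tokens in their original order
    (assumed policy), then release blocks that are no longer needed."""
    ranked = sorted(range(len(seq.tokens)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:budget])
    seq.tokens = [seq.tokens[i] for i in keep]
    needed = -(-len(seq.tokens) // BLOCK_SIZE)  # ceiling division
    while len(seq.block_table) > needed:
        pool.release(seq.block_table.pop())
    return keep
```

In this sketch, the blocks released by `compress` return to the shared pool, where a scheduler could hand them to other requests. That is one plausible way eviction translates into higher concurrency, stated here as an assumption and not as a description of the paper's scheduler.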

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Natural Language Processing Techniques · Big Data and Digital Economy