Breaking the Boundaries of Long-Context LLM Inference: Adaptive KV Management on a Single Commodity GPU

He Sun; Li Li; Mingjun Xiao; Chengzhong Xu

arXiv:2506.20187 · cs.OS · July 3, 2025

Open Access

TL;DR

LeoAM is an adaptive, hierarchical GPU-CPU-disk KV management system for long-context LLM inference on a single commodity GPU. It reports a 3.46x average inference speedup (up to 5.47x with larger batch sizes) while keeping response quality comparable.

Contribution

The paper presents LeoAM, which the authors describe as the first importance-aware, adaptive KV management system for efficient long-context LLM inference on a single commodity GPU.

Findings

01. Achieves 3.46x average inference speedup
02. Up to 5.47x speedup with larger batch sizes
03. Maintains comparable LLM response quality

Abstract

Advanced large language models (LLMs) have achieved impressive performance across a wide range of complex, long-context natural language tasks. However, running long-context LLM inference locally on a commodity GPU (a PC) for privacy reasons remains challenging because of the growing memory demands of the key-value (KV) cache. Existing systems typically identify important tokens and selectively offload their KV data to GPU and CPU memory. Because memory on a commodity GPU is limited, KV data must also be offloaded to disk, but this process is bottlenecked by the overhead of evaluating token importance and by the disk's low bandwidth. In this paper, we present LeoAM, the first efficient importance-aware long-context LLM inference system for a single commodity GPU with adaptive hierarchical GPU-CPU-Disk KV management. Our system employs an adaptive KV management strategy that partitions KV…
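The tiering idea in the abstract (keep the most important KV data on the GPU, the next most important in CPU memory, and the rest on disk) can be illustrated with a minimal sketch. This is not LeoAM's algorithm, whose details are cut off above. The function name `assign_tiers`, the parameters `gpu_slots` and `cpu_slots`, and the importance scores are all hypothetical, and the sketch assumes importance scores per KV chunk are already available.

```python
from typing import Dict, List


def assign_tiers(importance: List[float], gpu_slots: int, cpu_slots: int) -> Dict[str, List[int]]:
    """Place KV chunks on GPU, CPU, or disk by descending importance.

    Hypothetical illustration only: capacities are counted in chunks,
    and scores are assumed to be given rather than computed.
    """
    # Rank chunk indices from most to least important (ties keep index order).
    ranked = sorted(range(len(importance)), key=lambda i: -importance[i])
    gpu = ranked[:gpu_slots]                        # fastest, smallest tier
    cpu = ranked[gpu_slots:gpu_slots + cpu_slots]   # middle tier
    disk = ranked[gpu_slots + cpu_slots:]           # everything that does not fit
    return {"gpu": sorted(gpu), "cpu": sorted(cpu), "disk": sorted(disk)}


# Made-up scores for five KV chunks.
placement = assign_tiers([0.05, 0.40, 0.10, 0.30, 0.15], gpu_slots=2, cpu_slots=2)
print(placement)  # {'gpu': [1, 3], 'cpu': [2, 4], 'disk': [0]}
```

A static ranking like this ignores the two bottlenecks the abstract names: the cost of evaluating importance and the disk's low bandwidth. An adaptive system would presumably have to account for both when deciding what to move.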

Taxonomy

Topics: Big Data and Digital Economy · Natural Language Processing Techniques · Topic Modeling