Breaking the Boundaries of Long-Context LLM Inference: Adaptive KV Management on a Single Commodity GPU
He Sun, Li Li, Mingjun Xiao, Chengzhong Xu

TL;DR
LeoAM is an adaptive hierarchical GPU-CPU-Disk KV management system for long-context LLM inference on a single commodity GPU. It achieves an average 3.46x inference speedup while maintaining comparable response quality.
Contribution
The paper presents LeoAM, which the authors describe as the first efficient importance-aware, adaptive KV management system for long-context LLM inference on a single commodity GPU.
Findings
Achieves 3.46x average inference speedup
Up to 5.47x speedup with larger batch sizes
Maintains comparable LLM response quality
Abstract
Advanced Large Language Models (LLMs) have achieved impressive performance across a wide range of complex and long-context natural language tasks. However, performing long-context LLM inference locally on a commodity GPU (a PC) with privacy concerns remains challenging due to the increasing memory demands of the key-value (KV) cache. Existing systems typically identify important tokens and selectively offload their KV data to GPU and CPU memory. The KV data needs to be offloaded to disk due to the limited memory on a commodity GPU, but the process is bottlenecked by token importance evaluation overhead and the disk's low bandwidth. In this paper, we present LeoAM, the first efficient importance-aware long-context LLM inference system for a single commodity GPU with adaptive hierarchical GPU-CPU-Disk KV management. Our system employs an adaptive KV management strategy that partitions KV…
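The abstract describes placing KV data across a GPU-CPU-Disk hierarchy according to token importance. As a rough illustration of that idea only, the sketch below greedily assigns KV chunks to the fastest tier that still has room, most important chunk first. It is not LeoAM's algorithm: the `Chunk` type, the `place_chunks` function, the capacity parameters, and the use of a single importance score per chunk are all hypothetical names and assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    chunk_id: int
    importance: float  # assumed score, e.g. summed attention weight of the chunk's tokens
    size: int          # number of tokens whose KV entries the chunk holds


def place_chunks(chunks, gpu_capacity, cpu_capacity):
    """Greedy tiering sketch: important chunks take the fastest tier with room.

    Capacities are in tokens. Anything that fits neither GPU nor CPU
    memory spills to disk, which is treated as unbounded here.
    """
    placement = {"gpu": [], "cpu": [], "disk": []}
    free = {"gpu": gpu_capacity, "cpu": cpu_capacity}
    for chunk in sorted(chunks, key=lambda c: c.importance, reverse=True):
        for tier in ("gpu", "cpu"):
            if chunk.size <= free[tier]:
                placement[tier].append(chunk.chunk_id)
                free[tier] -= chunk.size
                break
        else:
            # No in-memory tier had room for this chunk.
            placement["disk"].append(chunk.chunk_id)
    return placement


# Toy example with made-up importance scores.
chunks = [Chunk(0, 0.50, 4), Chunk(1, 0.05, 4), Chunk(2, 0.30, 4), Chunk(3, 0.15, 4)]
placement = place_chunks(chunks, gpu_capacity=4, cpu_capacity=8)
print(placement)
```

A real system would also have to account for the cost of scoring importance and for disk bandwidth, which the abstract names as the two bottlenecks; this sketch ignores both.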
Taxonomy
Topics: Big Data and Digital Economy · Natural Language Processing Techniques · Topic Modeling
