vTensor: Flexible Virtual Tensor Management for Efficient LLM Serving
Jiale Xu, Rui Zhang, Cong Guo, Weiming Hu, Zihan Liu, Feiyang Wu, Yu Feng, Shixuan Sun, Changxu Shao, Yuhong Guo, Junping Zhao, Ke Zhang, Minyi Guo, Jingwen Leng

TL;DR
vTensor introduces a GPU virtual memory-based tensor structure that decouples computation from memory management, significantly improving LLM inference speed and memory efficiency across various models and scenarios.
Contribution
The paper presents vTensor, a novel tensor framework that enhances LLM inference by decoupling computation from memory management using GPU virtual memory, enabling dynamic extensibility and fragmentation-free operation.
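To make the decoupling idea concrete, here is a minimal toy sketch in plain Python. It is not the paper's implementation and does not touch a GPU: it only models the general pattern of reserving a contiguous virtual range up front and attaching physical chunks lazily as a sequence grows. The names `PhysicalPool`, `VirtualTensor`, and `PAGE_TOKENS` are hypothetical and chosen for illustration.

```python
# Toy model only: a plain-Python stand-in for GPU virtual memory management.
# PhysicalPool, VirtualTensor, and PAGE_TOKENS are hypothetical names.

PAGE_TOKENS = 4  # tokens held by one physical chunk (illustrative value)


class PhysicalPool:
    """Hands out fixed-size physical chunks identified by integer handles."""

    def __init__(self, num_chunks):
        self.free = list(range(num_chunks))

    def alloc(self):
        if not self.free:
            raise MemoryError("physical pool exhausted")
        return self.free.pop()

    def release(self, handle):
        self.free.append(handle)


class VirtualTensor:
    """Contiguous virtual range whose physical backing grows on demand."""

    def __init__(self, pool, max_tokens):
        self.pool = pool
        # Reserve virtual slots up front; none is backed by memory yet.
        self.page_table = [None] * -(-max_tokens // PAGE_TOKENS)
        self.length = 0

    def append(self, num_tokens=1):
        """Grow the logical length, mapping new chunks only when needed."""
        new_length = self.length + num_tokens
        pages_needed = -(-new_length // PAGE_TOKENS)  # ceiling division
        if pages_needed > len(self.page_table):
            raise ValueError("exceeds reserved virtual range")
        for i in range(pages_needed):
            if self.page_table[i] is None:
                self.page_table[i] = self.pool.alloc()
        self.length = new_length

    def mapped_pages(self):
        return sum(handle is not None for handle in self.page_table)

    def free(self):
        """Unmap every chunk and return it to the pool."""
        for i, handle in enumerate(self.page_table):
            if handle is not None:
                self.pool.release(handle)
                self.page_table[i] = None
        self.length = 0
```

In this sketch a consumer sees one contiguous index space per request, while the pool decides which physical chunk backs each slot, so chunks released by one request can be reused by any other. On a real GPU the analogous steps would go through the driver's virtual memory management calls rather than a Python list.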
Findings
Achieves an average speedup of 1.86x across models.
Provides up to 2.42x speedup in multi-turn chat scenarios.
Frees approximately 71.25% of GPU memory compared to vLLM.
Abstract
Large Language Models (LLMs) are widely used across various domains, processing millions of daily requests. This surge in demand poses significant challenges in optimizing throughput and latency while keeping costs manageable. The Key-Value (KV) cache, a standard method for retaining previous computations, makes LLM inference highly bounded by memory. While batching strategies can enhance performance, they frequently lead to significant memory fragmentation. Even though cutting-edge systems like vLLM mitigate KV cache fragmentation using paged Attention mechanisms, they still suffer from inefficient memory and computational operations due to the tightly coupled page management and computation kernels. This study introduces the vTensor, an innovative tensor structure for LLM inference based on GPU virtual memory management (VMM). vTensor addresses existing limitations by decoupling…
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Lattice Boltzmann Simulation Studies · Computational Physics and Python Applications
Methods: Softmax · Attention Is All You Need · Fragmentation
