vTensor: Flexible Virtual Tensor Management for Efficient LLM Serving

Jiale Xu; Rui Zhang; Cong Guo; Weiming Hu; Zihan Liu; Feiyang Wu; Yu Feng; Shixuan Sun; Changxu Shao; Yuhong Guo; Junping Zhao; Ke Zhang; Minyi Guo; Jingwen Leng

arXiv:2407.15309·cs.DC·July 23, 2024

TL;DR

vTensor introduces a GPU virtual memory-based tensor structure that decouples computation from memory management, significantly improving LLM inference speed and memory efficiency across various models and scenarios.

Contribution

The paper presents vTensor, a novel tensor framework that enhances LLM inference by decoupling computation from memory management using GPU virtual memory, enabling dynamic extensibility and fragmentation-free operation.
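The decoupling described above can be illustrated with a small CPU-only simulation. This is a minimal sketch, not the paper's implementation: the names `PhysicalPool`, `VirtualTensor`, `extend`, and `release`, and the 2 MiB chunk size, are all hypothetical. The idea it shows is that a tensor reserves a contiguous virtual range up front while physical chunks are mapped only as the tensor grows, so compute code can treat the tensor as contiguous. On a real GPU this would roughly correspond to CUDA driver calls such as `cuMemAddressReserve`, `cuMemCreate`, and `cuMemMap`.

```python
CHUNK_BYTES = 2 * 1024 * 1024  # assumed physical chunk granularity (hypothetical)


class PhysicalPool:
    """Toy stand-in for a pool of GPU physical memory chunks."""

    def __init__(self, total_chunks):
        self.free = list(range(total_chunks))

    def alloc(self):
        if not self.free:
            raise MemoryError("out of physical chunks")
        return self.free.pop()

    def release(self, chunk_id):
        self.free.append(chunk_id)


class VirtualTensor:
    """Contiguous virtual range whose physical backing is mapped on demand."""

    def __init__(self, pool, reserved_bytes):
        self.pool = pool
        # Reserving virtual address space consumes no physical chunks.
        self.reserved_chunks = -(-reserved_bytes // CHUNK_BYTES)  # ceiling division
        self.mapped = []  # mapped[i] is the physical chunk backing virtual chunk i
        self.used_bytes = 0

    def extend(self, nbytes):
        """Grow the tensor, mapping new physical chunks only when needed."""
        new_used = self.used_bytes + nbytes
        needed = -(-new_used // CHUNK_BYTES)
        if needed > self.reserved_chunks:
            raise ValueError("growth exceeds the reserved virtual range")
        while len(self.mapped) < needed:
            self.mapped.append(self.pool.alloc())
        self.used_bytes = new_used

    def release(self):
        """Unmap every chunk and return it to the pool."""
        for chunk_id in self.mapped:
            self.pool.release(chunk_id)
        self.mapped.clear()
        self.used_bytes = 0


pool = PhysicalPool(total_chunks=8)
kv = VirtualTensor(pool, reserved_bytes=16 * CHUNK_BYTES)
kv.extend(1)            # the first byte maps one chunk
kv.extend(CHUNK_BYTES)  # crossing a chunk boundary maps a second chunk
```

Note that the reserved virtual range (16 chunks) is larger than the whole physical pool (8 chunks); in this toy model, only mapped chunks count against physical memory.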

Findings

01 Achieves an average speedup of 1.86x across models.

02 Provides up to 2.42x speedup in multi-turn chat scenarios.

03 Frees approximately 71.25% of GPU memory compared to vLLM.

Abstract

Large Language Models (LLMs) are widely used across various domains, processing millions of daily requests. This surge in demand poses significant challenges in optimizing throughput and latency while keeping costs manageable. The Key-Value (KV) cache, a standard method for retaining previous computations, makes LLM inference highly bounded by memory. While batching strategies can enhance performance, they frequently lead to significant memory fragmentation. Even though cutting-edge systems like vLLM mitigate KV cache fragmentation using paged Attention mechanisms, they still suffer from inefficient memory and computational operations due to the tightly coupled page management and computation kernels. This study introduces the vTensor, an innovative tensor structure for LLM inference based on GPU virtual memory management (VMM). vTensor addresses existing limitations by decoupling…
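To make the memory pressure from the KV cache concrete, the standard size estimate is 2 (keys and values) × layers × KV heads × head dimension × bytes per element × tokens. The sketch below applies it to a hypothetical model configuration and a hypothetical batch of request lengths; none of these numbers come from the paper. It also shows one common source of waste: reserving every request's cache at the maximum sequence length rather than at its actual length.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """KV cache size in bytes; the leading 2 accounts for keys and values."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * num_tokens


# Hypothetical configuration (assumed for illustration, fp16 elements).
cfg = dict(num_layers=32, num_kv_heads=32, head_dim=128)

per_token = kv_cache_bytes(1, **cfg)  # bytes of KV cache added per generated token
max_len = 2048                        # assumed maximum sequence length
actual_lens = [100, 500, 1200, 2048]  # assumed actual lengths in one batch

# Static reservation: every request gets room for max_len tokens.
reserved = sum(kv_cache_bytes(max_len, **cfg) for _ in actual_lens)
# What the batch actually needs.
used = sum(kv_cache_bytes(n, **cfg) for n in actual_lens)
wasted_fraction = 1 - used / reserved

print(f"per token: {per_token / 2**20:.2f} MiB")
print(f"reserved: {reserved / 2**30:.2f} GiB, used: {used / 2**30:.2f} GiB")
print(f"wasted: {wasted_fraction:.1%}")
```

Under these assumptions each token costs 0.5 MiB of cache, and more than half of the statically reserved memory goes unused, which is the kind of waste that paged or virtual-memory-based allocation aims to avoid.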

Code & Models

Repositories

intelligent-machine-learning/glake
PyTorch · Official

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Lattice Boltzmann Simulation Studies · Computational Physics and Python Applications

Methods: Softmax · Attention Is All You Need · Fragmentation