BatchLLM: Optimizing Large Batched LLM Inference with Global Prefix Sharing and Throughput-oriented Token Batching

Zhen Zheng; Xin Ji; Taosong Fang; Fanghao Zhou; Chuanjie Liu; Gang Peng

arXiv:2412.03594 · cs.CL · April 23, 2026

TL;DR

BatchLLM improves the throughput of large batched LLM inference by sharing prefixes globally across the batch, scheduling requests around those shared prefixes, and raising GPU utilization, outperforming vLLM and SGLang.

Contribution

It introduces an approach that identifies shared prefixes globally across the batch, reorders and schedules requests around them, and applies memory-centric token batching to improve GPU efficiency.
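As a rough illustration of what identifying shared prefixes across a whole batch and reordering requests around them can look like, here is a minimal Python sketch. It is not the paper's algorithm: the greedy sort-and-merge strategy, the function names (`group_by_shared_prefix`, `saved_prefill_tokens`), and the `min_shared` threshold are all assumptions made for this example.

```python
def common_prefix_len(a, b):
    """Number of leading tokens that two token lists have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def group_by_shared_prefix(prompts, min_shared=2):
    """Greedy illustration (hypothetical, not BatchLLM's method).

    Sorting token lists lexicographically places prompts with a common
    prefix next to each other. We then walk the sorted order and merge a
    prompt into the current group while it shares at least `min_shared`
    leading tokens with that group's prefix.
    """
    order = sorted(range(len(prompts)), key=lambda i: prompts[i])
    groups = []
    for i in order:
        if groups:
            group = groups[-1]
            shared = common_prefix_len(group["prefix"], prompts[i])
            if shared >= min_shared:
                # Shrink the group prefix to what all members still share.
                group["prefix"] = group["prefix"][:shared]
                group["members"].append(i)
                continue
        groups.append({"prefix": list(prompts[i]), "members": [i]})
    return groups


def saved_prefill_tokens(groups):
    """Prefix tokens computed once per group instead of once per member."""
    return sum(len(g["prefix"]) * (len(g["members"]) - 1) for g in groups)


prompts = [[1, 2, 3, 4], [9, 9, 1], [1, 2, 3, 5], [9, 9, 2], [7]]
groups = group_by_shared_prefix(prompts)
print([g["members"] for g in groups])  # [[0, 2], [4], [1, 3]]
print(saved_prefill_tokens(groups))    # 5
```

Because the grouping sees every prompt before anything is scheduled, requests 0 and 2 end up adjacent even though they arrive apart. A greedy merge like this can shrink a group's prefix more than is ideal, so treat it only as a picture of the idea.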

Findings

1. BatchLLM achieves 1.3x to 10.8x speedup over vLLM and SGLang.
2. It effectively reuses KV context through global prefix sharing.
3. The method improves GPU utilization in large batched LLM inference.

Abstract

Large language models (LLMs) play an increasingly important role in a wide range of information processing and management tasks in industry. Many of these tasks are performed in large batches or even offline, and their performance indicator is throughput. These tasks usually exhibit prefix sharing, where different prompt inputs partially share a common prefix. However, existing LLM inference engines tend to optimize for streaming requests and are limited in supporting large batched tasks with the prefix-sharing characteristic. Existing solutions use an LRU-based cache to reuse the KV context of a common prefix between requests. KV context that is about to be reused may be prematurely evicted under this implicit cache management. Besides, streaming-oriented systems do not leverage the request-batch information and cannot mix the…
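The premature-eviction problem described in the abstract can be seen in a toy model. The Python sketch below simulates an LRU cache keyed by a prefix identifier and counts how often a request finds its prefix still cached. The function name `count_prefix_hits`, the equal-sized cache entries, and the two request orders are assumptions chosen for illustration; they do not model vLLM, SGLang, or BatchLLM internals.

```python
from collections import OrderedDict


def count_prefix_hits(prefix_ids, capacity):
    """Count cache hits for a request stream under a toy LRU prefix cache.

    Each request is represented only by the id of its shared prefix, and
    every cached prefix is assumed to occupy one slot (a simplification).
    """
    cache = OrderedDict()
    hits = 0
    for pid in prefix_ids:
        if pid in cache:
            hits += 1
            cache.move_to_end(pid)  # mark as most recently used
        else:
            cache[pid] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict least recently used
    return hits


# Same six requests over three shared prefixes, in two arrival orders.
interleaved = ["A", "B", "C", "A", "B", "C"]
grouped = ["A", "A", "B", "B", "C", "C"]

print(count_prefix_hits(interleaved, capacity=2))  # 0
print(count_prefix_hits(grouped, capacity=2))      # 3
```

With room for only two prefixes, the interleaved order evicts each prefix just before it is needed again, while the grouped order reuses every prefix once. That gap is the motivation for scheduling requests that share a prefix together rather than relying on implicit cache management.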

Code & Models

Repositories

microsoft/MixLLM/tree/batchllm_vllm_064 (GitHub)
