TL;DR
BatchLLM enhances large batched LLM inference by globally sharing prefixes, optimizing request scheduling, and increasing GPU utilization, significantly outperforming existing solutions.
Contribution
It introduces a novel approach that identifies and schedules shared prefixes globally, reorders requests, and applies memory-centric token batching for better GPU efficiency.
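To make the idea of identifying shared prefixes across a whole batch concrete, here is a minimal sketch. It is not BatchLLM's actual algorithm: the function names (`common_prefix_len`, `group_by_shared_prefix`), the `min_shared` threshold, and the sort-then-greedy-merge strategy are all assumptions made for illustration. Prompts are represented as lists of token IDs.

```python
from typing import List, Sequence, Tuple


def common_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of leading tokens that a and b share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def group_by_shared_prefix(
    prompts: List[List[int]], min_shared: int = 2
) -> List[Tuple[List[int], List[int]]]:
    """Return (shared_prefix, request_indices) groups.

    Hypothetical helper, not the paper's method: sorting makes prompts
    with a common prefix adjacent; a new group starts whenever the
    prefix shared by the current group would drop below `min_shared`.
    A group with a single request keeps its full prompt as the prefix.
    """
    order = sorted(range(len(prompts)), key=lambda i: prompts[i])
    groups: List[Tuple[List[int], List[int]]] = []
    for idx in order:
        if groups:
            prefix, members = groups[-1]
            shared = common_prefix_len(prefix, prompts[idx])
            if shared >= min_shared:
                # Shrink the group's prefix to what all members share.
                groups[-1] = (prefix[:shared], members + [idx])
                continue
        groups.append((list(prompts[idx]), [idx]))
    return groups


batch = [[1, 2, 3, 4], [9, 9, 1], [1, 2, 3, 5], [1, 2, 7], [9, 9, 2]]
print(group_by_shared_prefix(batch))
# → [([1, 2], [0, 2, 3]), ([9, 9], [1, 4])]
```

Because the grouping sees the entire batch up front, requests that share a prefix can be scheduled together and the prefix's KV context computed once per group, rather than relying on a cache to still hold it when a later request arrives.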
Findings
BatchLLM achieves 1.3x to 10.8x speedup over vLLM and SGLang.
It effectively reuses KV context through global prefix sharing.
The method improves GPU utilization in large batched LLM inference.
Abstract
Large language models (LLMs) increasingly play an important role in a wide range of information processing and management tasks in industry. Many of these tasks are performed in large batches or even offline, and their key performance indicator is throughput. These tasks usually exhibit prefix sharing, where different prompt inputs partially share a common prefix. However, existing LLM inference engines tend to optimize for streaming requests and show limitations in supporting large batched tasks with the prefix sharing characteristic. The existing solutions use an LRU-based cache to reuse the KV context of a common prefix between requests. With this implicit cache management, KV context that is about to be reused may be prematurely evicted. Besides, streaming-oriented systems do not leverage the request-batch information and cannot mix the…
