BucketServe: Bucket-Based Dynamic Batching for Smart and Efficient LLM Inference Serving
Wanyi Zheng, Minxian Xu, Shengye Song, Kejiang Ye

TL;DR
BucketServe is a dynamic batching framework for LLM inference that groups requests by size, reducing padding and memory waste, and adapts in real-time to workload fluctuations to improve throughput and SLO compliance.
Contribution
The paper introduces BucketServe, a novel bucket-based dynamic batching method that optimizes GPU memory usage and adapts to workload changes for efficient LLM serving.
Findings
Achieves up to 3.58x throughput improvement over UELLM.
Handles 1.93x more request load at an SLO attainment of 80%.
Demonstrates 1.975x higher system load capacity compared to UELLM.
Abstract
Large language models (LLMs) have become increasingly popular in various areas, with traditional businesses gradually shifting from rule-based systems to LLM-based solutions. However, LLM inference is resource-intensive and latency-sensitive, posing significant challenges for serving systems. Existing LLM serving systems often use static or continuous batching strategies, which can lead to inefficient GPU memory utilization and increased latency, especially under heterogeneous workloads. These methods may also struggle to adapt to dynamic workload fluctuations, resulting in suboptimal throughput and potential service level objective (SLO) violations. In this paper, we introduce BucketServe, a bucket-based dynamic batching framework designed to optimize LLM inference performance. By grouping requests into size-homogeneous buckets based on sequence length, BucketServe minimizes padding…
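To make the padding argument concrete, here is a minimal Python sketch of length-based bucketing. It is an illustration under stated assumptions, not BucketServe's actual algorithm: the fixed bucket boundaries, the sample request lengths, and the helper names `assign_buckets` and `padding_tokens` are all hypothetical, and the real system adapts its batching to the workload at run time rather than using static boundaries.

```python
from bisect import bisect_left
from collections import defaultdict


def assign_buckets(lengths, boundaries):
    """Group sequence lengths under the smallest boundary that fits them.

    `boundaries` must be sorted ascending (hypothetical fixed bucket sizes).
    """
    buckets = defaultdict(list)
    for n in lengths:
        idx = bisect_left(boundaries, n)
        if idx == len(boundaries):
            raise ValueError(f"length {n} exceeds largest bucket {boundaries[-1]}")
        buckets[boundaries[idx]].append(n)
    return dict(buckets)


def padding_tokens(batch):
    """Padding needed to bring every sequence up to the longest one in the batch."""
    longest = max(batch)
    return sum(longest - n for n in batch)


# Made-up request lengths (in tokens) and bucket boundaries, for illustration only.
lengths = [12, 100, 15, 480, 90, 500, 10, 110]
boundaries = [16, 128, 512]

# One mixed batch: every request is padded to the longest (500 tokens).
mixed = padding_tokens(lengths)

# Bucketed batches: each request is padded only to the longest in its own bucket.
buckets = assign_buckets(lengths, boundaries)
bucketed = sum(padding_tokens(b) for b in buckets.values())

print(f"mixed batch padding:    {mixed} tokens")
print(f"bucketed batch padding: {bucketed} tokens")
```

With these sample numbers the single mixed batch needs 2683 padding tokens, while the three bucketed batches need 58 in total. The gap comes entirely from short requests no longer being padded up to the length of the longest request in the system.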
Taxonomy
Topics: Algorithms and Data Compression · Advanced Data Storage Technologies · Digital Rights Management and Security
