BucketServe: Bucket-Based Dynamic Batching for Smart and Efficient LLM Inference Serving

Wanyi Zheng; Minxian Xu; Shengye Song; Kejiang Ye

arXiv:2507.17120 · cs.DC · January 6, 2026

Open Access

TL;DR

BucketServe is a dynamic batching framework for LLM inference that groups requests by sequence length, reducing padding and memory waste, and adapts in real time to workload fluctuations to improve throughput and SLO compliance.

Contribution

The paper introduces BucketServe, a novel bucket-based dynamic batching method that optimizes GPU memory usage and adapts to workload changes for efficient LLM serving.

Findings

01

Achieves up to 3.58x throughput improvement over UELLM.

02

Handles 1.93x more request load at an SLO attainment of 80%.

03

Demonstrates 1.975x higher system load capacity compared to UELLM.
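
The findings above are stated in terms of SLO attainment. As a minimal sketch, assuming attainment means the fraction of requests whose end-to-end latency stays within a target, it could be computed as below. The function name, the latency values, and the 1.5 s target are illustrative assumptions and are not taken from the paper.

```python
def slo_attainment(latencies_s, slo_s):
    """Fraction of requests whose latency is within the SLO target.

    Assumption: the SLO is a single per-request latency bound in seconds.
    """
    if not latencies_s:
        raise ValueError("need at least one request latency")
    met = sum(1 for t in latencies_s if t <= slo_s)
    return met / len(latencies_s)


# Hypothetical per-request latencies in seconds (made up for illustration).
latencies = [0.8, 1.2, 0.5, 2.4, 0.9]
attainment = slo_attainment(latencies, slo_s=1.5)
print(f"SLO attainment: {attainment:.0%}")
```

Under this reading, a system "handles more load at 80% attainment" when it can accept a higher request rate before this fraction drops below 0.8.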

Abstract

Large language models (LLMs) have become increasingly popular in many areas, with traditional businesses gradually shifting from rule-based systems to LLM-based solutions. However, LLM inference is both resource-intensive and latency-sensitive, posing significant challenges for serving systems. Existing LLM serving systems often use static or continuous batching strategies, which can lead to inefficient GPU memory utilization and increased latency, especially under heterogeneous workloads. These methods may also struggle to adapt to dynamic workload fluctuations, resulting in suboptimal throughput and potential service level objective (SLO) violations. In this paper, we introduce BucketServe, a bucket-based dynamic batching framework designed to optimize LLM inference performance. By grouping requests into size-homogeneous buckets based on sequence length, BucketServe minimizes padding…
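
To illustrate why grouping by sequence length reduces padding, here is a minimal sketch. It is not BucketServe's actual algorithm: the fixed bucket boundaries, the helper names, and the request lengths are all hypothetical, and it assumes each batch is padded to its longest sequence.

```python
from collections import defaultdict


def assign_bucket(length, boundaries):
    """Return the smallest boundary that can hold a sequence of this length."""
    for bound in boundaries:  # boundaries must be sorted ascending
        if length <= bound:
            return bound
    raise ValueError(f"length {length} exceeds the largest bucket")


def bucket_requests(lengths, boundaries):
    """Group request lengths into size-homogeneous buckets."""
    buckets = defaultdict(list)
    for n in lengths:
        buckets[assign_bucket(n, boundaries)].append(n)
    return dict(buckets)


def padding_tokens(batch):
    """Padding needed when every sequence is padded to the batch maximum."""
    longest = max(batch)
    return sum(longest - n for n in batch)


# Hypothetical request lengths (tokens) and bucket boundaries.
lengths = [12, 500, 30, 480, 25, 510]
boundaries = [64, 512]

# One mixed batch versus one batch per bucket.
mixed_waste = padding_tokens(lengths)
bucketed = bucket_requests(lengths, boundaries)
bucketed_waste = sum(padding_tokens(b) for b in bucketed.values())

print(mixed_waste, bucketed_waste)
```

In this toy workload the short and long requests end up in separate batches, so short sequences are no longer padded to the length of the longest request in the system.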

Taxonomy

Topics: Algorithms and Data Compression · Advanced Data Storage Technologies · Digital Rights Management and Security