Multi-Bin Batching for Increasing LLM Inference Throughput
Ozgur Guldogan, Jackson Kunde, Kangwook Lee, Ramtin Pedarsani

TL;DR
This paper introduces Multi-Bin Batching, a novel method for grouping LLM inference requests by predicted execution times to improve throughput, supported by theoretical analysis and real-world experiments.
Contribution
It proposes a new batching strategy that groups requests into bins based on predicted durations, enhancing inference throughput for large language models.
Findings
Significant throughput improvements over standard batching methods.
Theoretical proof of throughput optimality of the proposed method.
Validated effectiveness through real-world LLM inference scenarios.
Abstract
As large language models (LLMs) grow in popularity for their diverse capabilities, improving the efficiency of their inference systems has become increasingly critical. Batching LLM requests is a key step in scheduling inference jobs on servers (e.g., GPUs), enabling the system to maximize throughput by processing multiple requests in parallel. However, requests often have varying generation lengths, causing resource underutilization, as hardware must wait for the longest-running request in the batch to complete before moving to the next batch. We formalize this problem from a queueing-theoretic perspective and aim to design a control policy that is throughput-optimal. We propose Multi-Bin Batching, a simple yet effective method that can provably improve LLM inference throughput by grouping requests with similar (predicted) execution times into predetermined…
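The batching idea in the abstract can be illustrated with a minimal offline sketch. It is not the authors' implementation: the function names, the bin boundaries, the Uniform(1, 20) service times, and the perfect length predictor are all assumptions made for illustration. The sketch also serves leftover partial batches at the end of each bin, which simplifies the queueing model the paper analyzes. The only property taken from the abstract is that a batch takes as long as its longest-running request.

```python
import random
from bisect import bisect_right


def standard_batching_time(times, batch_size):
    """Total service time when requests are batched in arrival order.

    Each batch occupies the server for as long as its slowest request.
    """
    return sum(
        max(times[i:i + batch_size]) for i in range(0, len(times), batch_size)
    )


def multi_bin_batching_time(times, predicted, batch_size, boundaries):
    """Total service time when requests are first routed to bins.

    `boundaries` is a sorted list of hypothetical cut points; a request with
    predicted time p goes to bin index bisect_right(boundaries, p).
    Batches are then formed inside each bin in arrival order.
    """
    bins = [[] for _ in range(len(boundaries) + 1)]
    for actual, guess in zip(times, predicted):
        bins[bisect_right(boundaries, guess)].append(actual)
    return sum(standard_batching_time(b, batch_size) for b in bins)


# Tiny hand-checkable case: two short and two long requests, batch size 2.
toy = [1, 10, 1, 10]
print(standard_batching_time(toy, 2))             # 20: each batch waits for a 10
print(multi_bin_batching_time(toy, toy, 2, [5]))  # 11: batches are [1, 1] and [10, 10]

# Larger synthetic case (assumed workload, perfect predictions).
rng = random.Random(0)
times = [rng.uniform(1.0, 20.0) for _ in range(8000)]
boundaries = [5.75, 10.5, 15.25]  # four equal-probability bins for Uniform(1, 20)
t_std = standard_batching_time(times, 8)
t_bin = multi_bin_batching_time(times, times, 8, boundaries)
print(f"throughput, standard:  {len(times) / t_std:.3f} requests per time unit")
print(f"throughput, multi-bin: {len(times) / t_bin:.3f} requests per time unit")
```

With an empty `boundaries` list the second function reduces to standard batching, which makes the single-bin case a convenient sanity check. Replacing `predicted` with a noisy copy of `times` is one way to probe how sensitive the gain is to prediction error.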
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
1. The proposed algorithm is simple and the paper is easy to follow. 2. The paper provides a good and comprehensive theoretical analysis of the algorithm.
1. Some assumptions made in the paper seem far from realistic scenarios. 2. Lack of evaluation on more realistic use cases of the multi-bin batching algorithm. 3. Lack of sensitivity analysis on the effect of length-prediction accuracy.
1 - The paper is well-written. 2 - Extensive theoretical analysis is provided. 3 - Both simulations and evaluations on testbeds are conducted.
1 - The proposed approach has a significant gap: how do you predict the output lengths or prompt service time? I can only find the term "Output Length Predictor" in Figure 2, but I have no idea how you implemented this module. Note that this is actually a significant challenge that most of the works in Paragraph 2, Section 2 focused on. 2 - In the introduction, the paper mentioned that "continuous batching requires fine-grained control of hardware, which is not always feasible." This assumption
1. Thank you for submitting your work to ICLR. The paper is generally very well written. 2. The paper has solid queueing-theoretic models for the performance of the proposed system.
I truly enjoyed reading the paper. However, I could not help but wonder about a few more practical questions: 1. When you do the latency derivation, you assume you are operating in the low-utilization regime. However, in your simulations, e.g., in Fig. 4, you show results mostly in the high-utilization regime (as \lambda increases). You also derive a model for when the system is at maximum throughput. I am afraid that as you basically get into the higher regimes, the assumptions do not hold and all of a
1. This paper conducts a theoretical analysis, demonstrating the proposed batching method's correctness, effectiveness, and limitations. 2. It provides insights into achieving responsive and efficient LLM inference serving by incorporating answer length estimation.
1. The experiments in this paper are limited and do not fully support the claims of "various settings" (L87) or "comprehensive experiments" (L103). The experiments are confined to a single model, a single device, and a fixed batch size, which limits the ability to demonstrate the robustness and generalization of the approach across diverse scenarios. The baseline for comparison is restricted to standard batching inference, without evaluating the proposed approach against existing methods that i
Taxonomy
Topics: Advanced Data Storage Technologies · Scientific Computing and Data Management · Digital Rights Management and Security
