Multi-Bin Batching for Increasing LLM Inference Throughput

Ozgur Guldogan; Jackson Kunde; Kangwook Lee; Ramtin Pedarsani

arXiv:2412.04504·cs.CL·December 9, 2024

Open Access · 4 Reviews

TL;DR

This paper introduces Multi-Bin Batching, a novel method for grouping LLM inference requests by predicted execution times to improve throughput, supported by theoretical analysis and real-world experiments.

Contribution

It proposes a new batching strategy that groups requests into bins based on predicted durations, enhancing inference throughput for large language models.

Findings

01 Significant throughput improvements over standard batching methods.

02 Theoretical proof of throughput optimality of the proposed method.

03 Validated effectiveness through real-world LLM inference scenarios.

Abstract

As large language models (LLMs) grow in popularity for their diverse capabilities, improving the efficiency of their inference systems has become increasingly critical. Batching LLM requests is a critical step in scheduling the inference jobs on servers (e.g. GPUs), enabling the system to maximize throughput by allowing multiple requests to be processed in parallel. However, requests often have varying generation lengths, causing resource underutilization, as hardware must wait for the longest-running request in the batch to complete before moving to the next batch. We formalize this problem from a queueing-theoretic perspective, and aim to design a control policy which is throughput-optimal. We propose Multi-Bin Batching, a simple yet effective method that can provably improve LLM inference throughput by grouping requests with similar (predicted) execution times into predetermined…
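The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes predicted execution times are exact, it works offline on a fixed list of requests rather than a live queue, and the helper names (`batch_makespan`, `standard_batches`, `multi_bin_batches`) and the bin boundary are hypothetical. A batch is modeled as taking as long as its slowest request, which is the source of underutilization the abstract describes.

```python
import bisect
from collections import defaultdict
from typing import Dict, List, Sequence


def batch_makespan(batches: Sequence[Sequence[float]]) -> float:
    """Total time to run batches sequentially; each batch lasts as long as its slowest request."""
    return sum(max(batch) for batch in batches if batch)


def standard_batches(times: Sequence[float], batch_size: int) -> List[List[float]]:
    """Form batches in arrival order, ignoring execution times."""
    return [list(times[i:i + batch_size]) for i in range(0, len(times), batch_size)]


def multi_bin_batches(
    times: Sequence[float], boundaries: Sequence[float], batch_size: int
) -> List[List[float]]:
    """Group requests into bins by (assumed exact) predicted time, then batch within each bin.

    `boundaries` must be sorted; a request with time t goes to the bin whose
    index equals the number of boundaries that are <= t.
    """
    bins: Dict[int, List[float]] = defaultdict(list)
    for t in times:
        bins[bisect.bisect_right(boundaries, t)].append(t)
    batches: List[List[float]] = []
    for idx in sorted(bins):
        batches.extend(standard_batches(bins[idx], batch_size))
    return batches


# Hypothetical workload: short and long requests arrive interleaved.
times = [1.0, 9.0, 2.0, 8.0, 1.0, 9.0, 2.0, 8.0]
standard_total = batch_makespan(standard_batches(times, batch_size=2))
multi_bin_total = batch_makespan(multi_bin_batches(times, boundaries=[5.0], batch_size=2))
print(standard_total, multi_bin_total)  # 34.0 22.0
```

In this toy workload every arrival-order batch pairs a short request with a long one, so each batch is held up by the long request. With two bins split at 5.0, short requests are batched together and finish quickly, and the total drops from 34.0 to 22.0 time units. With inaccurate predictions or a different arrival pattern the gap would differ.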

Peer Reviews

Decision·ICLR 2025 Conference Withdrawn Submission

Reviewer 01 · Rating 3 · Confidence 4

Strengths

1. The proposed algorithm is simple and the paper is easy to follow. 2. The paper provides a good and comprehensive theoretical analysis of the algorithm.

Weaknesses

1. Some assumptions made in the paper seem far from realistic scenarios. 2. Lack of evaluation on more realistic use cases of the multi-bin batching algorithm. 3. Lack of sensitivity analysis on the effect of length-prediction accuracy.

Reviewer 02 · Rating 3 · Confidence 4

Strengths

1 - The paper is well-written. 2 - Extensive theoretical analysis is provided. 3 - Both simulations and evaluations on testbeds are conducted.

Weaknesses

1 - The proposed approach has a significant gap: how do you predict the output lengths or prompt service time? I can only find the term "Output Length Predictor" in Figure 2, but I have no idea how you implemented this module. Note that this is actually a significant challenge that most of the works in Paragraph 2, Section 2 focused on. 2 - In the introduction, the paper mentioned that "continuous batching requires fine-grained control of hardware, which is not always feasible." This assumption

Reviewer 03 · Rating 5 · Confidence 4

Strengths

1. Thank you for submitting your work to ICLR. The paper is generally very well written. 2. The paper has solid queueing-theoretic models for the performance of the proposed system.

Weaknesses

I truly enjoyed reading the paper. However, I could not help but wonder about a few more practical questions: 1. When you do the latency derivation, you assume you are operating in the low-utilization regime. However, in your simulations, e.g., in Fig. 4, you show results mostly in the high-utilization regime (as \lambda increases). You also derive a model for when the system is at maximum throughput. I am afraid that as you get into the higher regimes, the assumptions do not hold and all of a

Reviewer 04 · Rating 3 · Confidence 4

Strengths

1. This paper conducts a theoretical analysis, demonstrating the proposed batching method's correctness, effectiveness, and limitations. 2. It provides insights into achieving responsive and efficient LLM inference serving by incorporating answer length estimation.

Weaknesses

1. The experiments in this paper are limited and do not fully support the claims of "various settings" (L87) or "comprehensive experiments" (L103). The experiments are confined to a single model, a single device, and a fixed batch size, which limits the ability to demonstrate the robustness and generalization of the approach across diverse scenarios. The baseline for comparison is restricted to standard batching inference, without evaluating the proposed approach against existing methods that i

Taxonomy

Topics: Advanced Data Storage Technologies · Scientific Computing and Data Management · Digital Rights Management and Security