BATON: Enhancing Batch-wise Inference Efficiency for Large Language Models via Dynamic Re-batching

Peizhuang Cong; Qizhi Chen; Haochen Zhao; Tong Yang

arXiv:2410.18701 · cs.LG · October 25, 2024


TL;DR

BATON introduces a dynamic re-batching method for large language model inference that minimizes idle computations and resource usage, significantly improving query processing efficiency.

Contribution

This paper presents BATON, a novel dynamic re-batching scheme that enhances batch-wise inference efficiency for LLMs without additional resource costs.

Findings

01. BATON achieves up to 1.75x faster query processing than Orca.

02. It reduces idle computations during batch processing.

03. The method maintains inference correctness while optimizing resource utilization.

Abstract

The advanced capabilities of Large Language Models (LLMs) have inspired the development of various interactive web services and applications, such as ChatGPT, which offer query inference services to users. Unlike traditional DNN models, LLM inference requires a different number of forward-computation iterations for each query, which creates efficiency challenges for existing run-to-completion batch-wise inference. Hence, some methods refine batch-wise inference to the iteration level by duplicating all nonlinear layers of the LLM. However, this approach not only increases resource usage but also introduces idle computations into the batch due to the prefilling of newly added queries. Therefore, we propose BATON, an efficient batch-wise LLM inference scheme that dynamically adjusts the processing batch and can achieve near-zero idle computations without incurring additional resource…
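To make the efficiency problem described in the abstract concrete, the following toy simulation counts wasted work under run-to-completion batching and compares it with an idealized scheme that refills a batch slot as soon as its query finishes. It is a minimal sketch under stated assumptions, not BATON's actual algorithm: every query is reduced to a hypothetical number of decoding iterations, prefill cost is ignored entirely, and the function names are invented for illustration.

```python
from collections import deque


def run_to_completion(lengths, batch_size):
    """Return (total_iterations, idle_slot_iterations) when each batch
    runs until its longest query finishes (toy model, no prefill cost)."""
    steps = 0
    idle = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        longest = max(batch)
        steps += longest
        # Queries that finish early still occupy their slot until the end.
        idle += sum(longest - n for n in batch)
    return steps, idle


def refill_on_finish(lengths, batch_size):
    """Return total iterations when a finished query's slot is refilled
    from the queue before the next iteration (idealized assumption)."""
    queue = deque(lengths)
    slots = []  # remaining iterations for each query currently in the batch
    steps = 0
    while queue or slots:
        # Fill any free slots with waiting queries.
        while queue and len(slots) < batch_size:
            slots.append(queue.popleft())
        steps += 1
        # Each active query advances one iteration; finished ones leave.
        slots = [n - 1 for n in slots if n > 1]
    return steps


if __name__ == "__main__":
    lengths = [2, 8, 3, 8]  # hypothetical iteration counts per query
    print(run_to_completion(lengths, batch_size=2))
    print(refill_on_finish(lengths, batch_size=2))
```

In this toy setting, short queries leave slots idle while they wait for the longest query in their batch, whereas refilling slots shortens the overall schedule. The abstract's point is that real iteration-level schemes pay for this with extra resources and prefill-induced idle computation, which this simplified model deliberately leaves out.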


Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis

Methods: Softmax · Attention Is All You Need · ALIGN