Batch Prompting: Efficient Inference with Large Language Model APIs

Zhoujun Cheng; Jungo Kasai; Tao Yu

arXiv:2301.08721 · cs.CL · October 25, 2023 · 1 citation



TL;DR

Batch prompting lets a large language model answer several samples in a single prompt instead of one at a time, cutting token and time costs (up to 5x with six samples per batch) while keeping downstream performance comparable or better across tasks and models.

Contribution

The paper introduces batch prompting, a simple method that runs inference on multiple samples in one prompt rather than one sample per prompt. This reduces token and time costs, and the method applies across different reasoning methods and models.
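A minimal sketch of what a batched few-shot prompt might look like, in Python. The `Q[i]`/`A[i]` index labels and the helper names `build_batch_prompt` and `parse_batch_answers` are assumptions made for illustration; this page does not specify the exact prompt layout, and no particular LLM API is called.

```python
import re
from typing import List, Optional, Sequence, Tuple


def build_batch_prompt(
    exemplars: Sequence[Tuple[str, str]], questions: Sequence[str]
) -> str:
    """Format few-shot exemplars and a batch of new questions as one prompt.

    Hypothetical layout: indexed questions first, then indexed answers,
    so the model sees the same batched shape it is asked to complete.
    """
    lines: List[str] = []
    for i, (question, _) in enumerate(exemplars, start=1):
        lines.append(f"Q[{i}]: {question}")
    for i, (_, answer) in enumerate(exemplars, start=1):
        lines.append(f"A[{i}]: {answer}")
    lines.append("")  # blank line between the exemplars and the new batch
    for i, question in enumerate(questions, start=1):
        lines.append(f"Q[{i}]: {question}")
    lines.append("A[1]:")  # the model continues from the first answer slot
    return "\n".join(lines)


def parse_batch_answers(completion: str, n: int) -> List[Optional[str]]:
    """Split a completion that continues after 'A[1]:' into n answers.

    Slots the model skipped are returned as None.
    """
    text = "A[1]:" + completion
    pattern = re.compile(r"A\[(\d+)\]:(.*?)(?=\nA\[\d+\]:|\Z)", re.DOTALL)
    answers: List[Optional[str]] = [None] * n
    for match in pattern.finditer(text):
        index = int(match.group(1))
        if 1 <= index <= n:
            answers[index - 1] = match.group(2).strip()
    return answers
```

The prompt string would be sent to whichever completion endpoint is in use, and the returned text passed to `parse_batch_answers` together with the batch size. Returning `None` for a missing slot keeps a skipped answer from silently shifting the ones after it.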

Findings

01. Up to 5x reduction in token and time costs, reached with six samples per batch.

02. Comparable or better performance across ten datasets covering commonsense QA, arithmetic reasoning, and NLI/NLU.

03. Effective for state-of-the-art chat-based LLMs such as GPT-3.5 and GPT-4.

Abstract

Performing inference on large volumes of samples with large language models (LLMs) can be computationally and financially costly in industry and real-world use. We propose batch prompting, a simple yet effective prompting approach that enables the LLM to run inference in batches, instead of one sample at a time. Our method reduces both token and time costs while retaining downstream performance. We theoretically demonstrate that under a few-shot in-context learning setting, the inference costs decrease almost inverse linearly with the number of samples in each batch. We extensively validate the effectiveness of batch prompting on ten datasets across commonsense QA, arithmetic reasoning, and NLI/NLU: batch prompting significantly (up to 5x with six samples in batch) reduces the LLM (Codex) inference token and time costs while achieving better or comparable performance. For…
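The inverse-linear claim can be illustrated with a simplified token count. The sketch below assumes K few-shot exemplars and b samples per batch, with every exemplar and every sample costing roughly the same number of tokens; the function names and the equal-length assumption are illustrative and not taken from this page.

```python
def tokens_per_sample(k: int, b: int, unit: float = 1.0) -> float:
    """Approximate prompt tokens spent per sample.

    One prompt holds k exemplars plus b samples, each assumed to cost
    `unit` tokens, and that prompt answers b samples at once.
    b = 1 corresponds to standard one-sample-per-prompt inference.
    """
    if k < 0 or b < 1:
        raise ValueError("k must be >= 0 and b must be >= 1")
    return (k + b) * unit / b


def cost_ratio(k: int, b: int) -> float:
    """Per-sample cost of batch size b relative to batch size 1."""
    return tokens_per_sample(k, b) / tokens_per_sample(k, 1)


if __name__ == "__main__":
    # Hypothetical setting: 12 exemplars, batch sizes 1 to 6.
    for batch_size in range(1, 7):
        print(batch_size, round(cost_ratio(12, batch_size), 3))
```

Under these assumptions the ratio is (K + b) / (b (K + 1)), which approaches 1/b as K grows relative to b. With K = 12 and b = 6, the per-sample cost falls from 13 units to 3.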


Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis

Methods: Multi-Head Attention · Attention Is All You Need · Cosine Annealing · Byte Pair Encoding · Dropout · Weight Decay · Layer Normalization · Linear Warmup With Cosine Annealing