BatchPrompt: Accomplish more with less

Jianzhe Lin; Maurice Diesendruck; Liang Du; Robin Abraham

arXiv:2309.00384·cs.CL·July 16, 2024·1 cites

Open Access · 1 Repo · 3 Reviews

TL;DR

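This paper introduces BatchPrompt, a method that improves the efficiency of large language model prompting by batching multiple data samples into one prompt. To mitigate the resulting performance loss, it proposes Batch Permutation and Ensembling (BPE) and Self-reflection-guided EArly Stopping (SEAS), achieving high accuracy with fewer LLM calls.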

Contribution

It presents the first technical approach to enhance prompting efficiency in large language models through batching and novel techniques to maintain performance.

Findings

01

BatchPrompt reduces the number of LLM calls to 9-16% of that required by single-data prompting.

02

BPE significantly improves batch prompting performance across NLP tasks.

03

BatchPrompt achieves comparable or better accuracy with fewer tokens and calls.
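The relationship between batching, repeated voting rounds, and the number of LLM calls can be illustrated with a small calculation. This is a minimal sketch, not the authors' accounting: the function name `call_ratio`, the batch size of 32, and the 3 to 5 voting rounds are illustrative assumptions, and the sketch ignores any savings from early stopping.

```python
import math


def call_ratio(n_samples: int, batch_size: int, rounds: int) -> float:
    """Ratio of batched LLM calls to single-sample LLM calls.

    Single-data prompting needs one call per sample. Batched prompting
    needs ceil(n_samples / batch_size) calls per voting round.
    All parameter values used below are assumptions for illustration.
    """
    batched_calls = rounds * math.ceil(n_samples / batch_size)
    return batched_calls / n_samples


# Hypothetical setting: 320 samples, 32 samples per batch.
for rounds in (3, 5):
    ratio = call_ratio(n_samples=320, batch_size=32, rounds=rounds)
    print(f"rounds={rounds}: {ratio:.1%} of single-prompting calls")
```

With these assumed values the ratio falls inside the reported 9-16% range, which shows how a large batch size can absorb the cost of several voting rounds.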

Abstract

As the ever-increasing token limits of large language models (LLMs) have enabled long context as input, prompting with single data samples might no longer be an efficient approach. A straightforward strategy for improving efficiency is to batch data within the token limit (e.g., 8k for gpt-3.5-turbo; 32k for GPT-4), which we call BatchPrompt. We have two initial observations for prompting with batched data. First, we find that prompting with batched data in longer contexts will inevitably lead to worse performance, compared to single-data prompting. Second, the performance of the language model is significantly correlated with the positions and order of the batched data, due to the corresponding change in decoder context. To retain efficiency and overcome performance loss, we propose Batch Permutation and Ensembling (BPE), and a novel Self-reflection-guided EArly Stopping (SEAS) technique. Our…
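The batching and permutation-with-voting idea described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `answer_batch` is a hypothetical callable standing in for one LLM call plus answer parsing, `toy_answer_batch` is a deterministic stub used only so the sketch runs offline, and plain majority voting is assumed as the ensembling rule. SEAS is not modeled here.

```python
import random
from collections import Counter
from typing import Callable, List, Sequence


def build_batch_prompt(instruction: str, samples: Sequence[str]) -> str:
    """Pack several samples into one prompt, each tagged with its position."""
    lines = [instruction, ""]
    for position, sample in enumerate(samples, start=1):
        lines.append(f"[{position}] {sample}")
    return "\n".join(lines)


def permute_and_vote(
    samples: Sequence[str],
    answer_batch: Callable[[List[str]], List[str]],  # hypothetical LLM wrapper
    batch_size: int,
    rounds: int,
    seed: int = 0,
) -> List[str]:
    """Run several rounds with shuffled sample order and majority-vote per sample."""
    rng = random.Random(seed)
    votes = [Counter() for _ in samples]
    for _ in range(rounds):
        # A fresh permutation each round puts every sample in a new position.
        order = list(range(len(samples)))
        rng.shuffle(order)
        for start in range(0, len(order), batch_size):
            chunk = order[start:start + batch_size]
            labels = answer_batch([samples[i] for i in chunk])  # one call per batch
            for index, label in zip(chunk, labels):
                votes[index][label] += 1
    # Majority vote across rounds (an assumed ensembling rule).
    return [vote.most_common(1)[0][0] for vote in votes]


def toy_answer_batch(batch: List[str]) -> List[str]:
    """Stub in place of a real LLM: labels a sample 'yes' if it ends with '?'."""
    return ["yes" if text.endswith("?") else "no" for text in batch]


questions = ["Is it raining?", "The sky is blue.", "Are we done?", "Stop."]
print(build_batch_prompt("Answer yes or no for each item.", questions[:2]))
print(permute_and_vote(questions, toy_answer_batch, batch_size=2, rounds=3))
```

With a real model, `answer_batch` would send the output of `build_batch_prompt` to the LLM and parse one label per tagged position; because the answers can then depend on position, the vote across permutations is what smooths out order effects.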

Peer Reviews

Decision·ICLR 2024 poster

Reviewer 01 · Rating 6 · marginally above the acceptance threshold · Confidence 4

Strengths

- The authors propose a robust method that uses a larger batch size, more voting rounds (e.g., 5+), and a self-reflection-guided early stopping approach.
- The early stopping method also uses a pruning strategy to prune away confident predictions, leaving fewer/harder samples for later rounds. In the process, the harder samples might also become easier to predict, due to the smaller effective batch size in later rounds.
- Via experiments, the authors show that voting is most successful when the baseline perfor

Weaknesses

- The authors chose a small number of tasks (only 3 simple ones: yes/no QnA, paraphrase detection, and entailment detection); these tasks may be too easy for gpt3.5 and gpt4 systems.
- Results are shown using few experiments (~300 dataset queries each for the 3 datasets); typically, validation on more tasks and more datasets would have helped build a more confident understanding of the approach.
- This is a nice applied research paper with good results and a principled approach for improving cost

Reviewer 02 · Rating 5 · marginally below the acceptance threshold · Confidence 4

Strengths

1. BatchPrompt could greatly improve token-resource utilization.
2. BPE could effectively reduce the error rate caused by different positions in a batch.
3. SEAS could effectively reduce the amount of unnecessary computation.

Weaknesses

1. It seems that the items in a batch (which share a single prompt) cannot be computed in parallel as they were originally. Will this increase the time cost? It might be better to add time and FLOPs metrics to the experiments.
2. I think BatchPrompt could be used in both the training and test phases, right?
3. In BPE, the weight for a confident answer is set directly to 1. What about having the LLM generate the weight scores directly, rather than only stating whether it is confident?

Reviewer 03 · Rating 8 · accept, good paper · Confidence 3

Strengths

The paper advocates the use of batching for prompting, and may be successful in setting a new trend in that direction.

Weaknesses

I worry about running so many experiments. The plots in Figure 3 suggest that there are patterns to the results, but even so, if we run lots and lots of experiments and report the best values, the best value could be the result of randomness. On the other hand, to make the case for trends, we may need to run even more experiments over more benchmarks, models, batch sizes and so on. It would be nice to fit some kind of smooth regression to the results to help with interpretation. Can you say

Code & Models

Repositories

microsoft/batchprompt (Official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis

Methods: Attention Is All You Need · Byte Pair Encoding · Early Stopping · Layer Normalization · Attention Dropout · WordPiece · Softmax · Dense Connections · Linear Warmup With Linear Decay