Efficient multi-prompt evaluation of LLMs

Felipe Maia Polo; Ronald Xu; Lucas Weber; Mírian Silva; Onkar Bhardwaj; Leshem Choshen; Allysson Flavio Melo de Oliveira; Yuekai Sun; Mikhail Yurochkin

arXiv:2405.17202 · cs.CL · November 1, 2024 · 3 citations

TL;DR

This paper introduces PromptEval, a method for efficiently estimating the performance distribution of large language models across many prompt variations, improving robustness and reproducibility of evaluations.

Contribution

PromptEval provides a novel approach to estimate performance distributions over numerous prompts, enabling more reliable and comprehensive LLM evaluation under limited budgets.

Findings

01

Accurately estimates performance quantiles across 100 prompts on MMLU

02

Consistently estimates performance distribution across benchmarks

03

Effective in LLM-as-a-judge and prompt selection applications

Abstract

Most popular benchmarks for comparing LLMs rely on a limited set of prompt templates, which may not fully capture the LLMs' abilities and can affect the reproducibility of results on leaderboards. Many recent works empirically verify prompt sensitivity and advocate for changes in LLM evaluation. In this paper, we consider the problem of estimating the performance distribution across many prompt variants instead of finding a single prompt to evaluate with. We introduce PromptEval, a method for estimating performance across a large set of prompts that borrows strength across prompts and examples to produce accurate estimates under practical evaluation budgets. The resulting distribution can be used to obtain performance quantiles and to construct various robust performance metrics (e.g., top 95% quantile or median). We prove that PromptEval consistently estimates the performance distribution and…
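To make the idea of "borrowing strength across prompts and examples" concrete, here is a minimal sketch, not the authors' implementation. It assumes a plain Rasch-style model, P(correct) = sigmoid(theta_prompt - beta_example), fitted by ridge-penalized gradient ascent on a sparsely observed prompt-by-example correctness matrix; the fitted probabilities fill in the unobserved cells, and quantiles are read off the resulting per-prompt accuracies. All function names, the synthetic data generator, and the hyperparameters (`n_steps`, `lr`, `l2`) are hypothetical choices made for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def simulate_correctness(n_prompts=50, n_examples=400, frac_observed=0.25, seed=0):
    """Hypothetical synthetic data: full 0/1 correctness matrix plus an observation mask."""
    rng = np.random.default_rng(seed)
    theta = 0.5 * rng.standard_normal(n_prompts)   # prompt "easiness" (assumed)
    beta = rng.standard_normal(n_examples)         # example difficulty (assumed)
    P = sigmoid(theta[:, None] - beta[None, :])
    Y = (rng.random((n_prompts, n_examples)) < P).astype(float)
    observed = rng.random((n_prompts, n_examples)) < frac_observed
    return Y, observed


def fit_rasch(Y, observed, n_steps=300, lr=1.0, l2=1e-2):
    """Fit P(Y[i, j] = 1) = sigmoid(theta[i] - beta[j]) using observed cells only."""
    theta = np.zeros(Y.shape[0])
    beta = np.zeros(Y.shape[1])
    # Per-row and per-column observation counts, used to scale the gradient steps.
    n_row = np.maximum(observed.sum(axis=1), 1)
    n_col = np.maximum(observed.sum(axis=0), 1)
    for _ in range(n_steps):
        P = sigmoid(theta[:, None] - beta[None, :])
        resid = (Y - P) * observed  # log-likelihood gradient w.r.t. the logit, masked
        theta = theta + lr * (resid.sum(axis=1) / n_row - l2 * theta)
        beta = beta + lr * (-resid.sum(axis=0) / n_col - l2 * beta)
    return theta, beta


def estimate_prompt_accuracies(Y, observed, theta, beta):
    """Keep observed outcomes, fill unobserved cells with model probabilities, average per prompt."""
    P = sigmoid(theta[:, None] - beta[None, :])
    filled = np.where(observed, Y, P)
    return filled.mean(axis=1)


if __name__ == "__main__":
    Y, observed = simulate_correctness()
    theta, beta = fit_rasch(Y, observed)
    acc = estimate_prompt_accuracies(Y, observed, theta, beta)
    # Performance quantiles across prompts (5%, median, 95%).
    print(np.quantile(acc, [0.05, 0.5, 0.95]))
```

Only about a quarter of the cells are treated as evaluated in this toy setup, yet every prompt receives an accuracy estimate because the example difficulties are shared across prompts. The actual method described in the paper may use richer parameterizations, so treat this only as an illustration of the general estimation strategy.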

Code & Models

Videos

Efficient multi-prompt evaluation of LLMs · SlidesLive

Taxonomy

Topics: Algorithms and Data Compression · Network Packet Processing and Optimization · VLSI and Analog Circuit Testing

Methods: Sparse Evolutionary Training