Efficient multi-prompt evaluation of LLMs
Felipe Maia Polo, Ronald Xu, Lucas Weber, Mírian Silva, Onkar Bhardwaj, Leshem Choshen, Allysson Flavio Melo de Oliveira, Yuekai Sun, Mikhail Yurochkin

TL;DR
This paper introduces PromptEval, a method for efficiently estimating the performance distribution of large language models across many prompt variations, improving robustness and reproducibility of evaluations.
Contribution
PromptEval provides a novel approach to estimate performance distributions over numerous prompts, enabling more reliable and comprehensive LLM evaluation under limited budgets.
Findings
Accurately estimates performance quantiles across 100 prompts on MMLU
Consistently estimates performance distribution across benchmarks
Effective in LLM-as-a-judge and prompt selection applications
Abstract
Most popular benchmarks for comparing LLMs rely on a limited set of prompt templates, which may not fully capture the LLMs' abilities and can affect the reproducibility of results on leaderboards. Many recent works empirically verify prompt sensitivity and advocate for changes in LLM evaluation. In this paper, we consider the problem of estimating the performance distribution across many prompt variants instead of finding a single prompt to evaluate with. We introduce PromptEval, a method for estimating performance across a large set of prompts, borrowing strength across prompts and examples to produce accurate estimates under practical evaluation budgets. The resulting distribution can be used to obtain performance quantiles to construct various robust performance metrics (e.g., top 95% quantile or median). We prove that PromptEval consistently estimates the performance distribution and…
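To make the idea of "borrowing strength across prompts and examples" concrete, here is a minimal sketch, not the authors' implementation. It assumes a Rasch-style logistic model in which the probability that prompt i answers example j correctly is sigmoid(theta_i - beta_j); that model choice, the function names (`fit_rasch`, `estimate_prompt_performance`), the simulated data, and the 5% sampling budget are all assumptions made for illustration. The sketch fits the model on a sparse random subset of (prompt, example) evaluations, fills in the unobserved cells with predicted probabilities, and reads quantiles off the estimated per-prompt accuracies.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_rasch(Y, observed, n_steps=300, lr=1.0, l2=1e-2):
    """Fit P(Y[i, j] = 1) = sigmoid(theta[i] - beta[j]) on observed cells.

    Hypothetical helper: plain gradient ascent on the penalized
    log-likelihood; theta is a per-prompt ability, beta a per-example
    difficulty. Unobserved cells are masked out of the gradient.
    """
    n_prompts, n_examples = Y.shape
    theta = np.zeros(n_prompts)
    beta = np.zeros(n_examples)
    row_n = np.maximum(observed.sum(axis=1), 1)  # observations per prompt
    col_n = np.maximum(observed.sum(axis=0), 1)  # observations per example
    for _ in range(n_steps):
        P = sigmoid(theta[:, None] - beta[None, :])
        resid = (Y - P) * observed
        theta += lr * (resid.sum(axis=1) / row_n - l2 * theta)
        beta -= lr * (resid.sum(axis=0) / col_n + l2 * beta)
    return theta, beta

def estimate_prompt_performance(Y, observed):
    """Per-prompt accuracy: observed outcomes where available, model predictions elsewhere."""
    theta, beta = fit_rasch(Y, observed)
    P = sigmoid(theta[:, None] - beta[None, :])
    filled = np.where(observed, Y, P)
    return filled.mean(axis=1)

# Simulated data (an assumption of this sketch, not results from the paper).
rng = np.random.default_rng(0)
n_prompts, n_examples = 100, 200
true_theta = 0.5 + 0.5 * rng.standard_normal(n_prompts)
true_beta = rng.standard_normal(n_examples)
true_P = sigmoid(true_theta[:, None] - true_beta[None, :])
Y = (rng.random((n_prompts, n_examples)) < true_P).astype(float)

# Evaluate only about 5% of all (prompt, example) pairs.
observed = rng.random((n_prompts, n_examples)) < 0.05

est = estimate_prompt_performance(Y, observed)
q = np.quantile(est, [0.05, 0.5, 0.95])
print("estimated 5%, 50%, 95% quantiles of per-prompt accuracy:", q)
```

The design choice worth noting is that each prompt's estimate draws on every observed cell through the shared example difficulties, rather than only on the handful of examples that prompt happened to be evaluated on.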
Taxonomy
Topics: Algorithms and Data Compression · Network Packet Processing and Optimization · VLSI and Analog Circuit Testing
Methods: Sparse Evolutionary Training
