Active Evaluation Acquisition for Efficient LLM Benchmarking
Yang Li, Jie Ma, Miguel Ballesteros, Yassine Benajiba, Graham Horwood

TL;DR
This paper proposes an RL-based method for selecting a subset of prompts in LLM benchmarking, reducing evaluation costs while maintaining accuracy by modeling dependencies among test examples.
Contribution
It introduces a novel reinforcement learning policy that leverages dependencies among test examples to improve evaluation efficiency in large language model benchmarks.
Findings
Significantly reduces the number of prompts needed for evaluation.
Maintains accurate performance estimates at lower evaluation cost.
Outperforms previous subset selection methods.
Abstract
As large language models (LLMs) become increasingly versatile, numerous large-scale benchmarks have been developed to thoroughly assess their capabilities. These benchmarks typically consist of diverse datasets and prompts to evaluate different aspects of LLM performance. However, comprehensive evaluations on hundreds or thousands of prompts incur tremendous costs in terms of computation, money, and time. In this work, we investigate strategies to improve evaluation efficiency by selecting a subset of examples from each benchmark using a learned policy. Our approach models the dependencies across test examples, allowing accurate prediction of the evaluation outcomes for the remaining examples based on the outcomes of the selected ones. Consequently, we only need to acquire the actual evaluation outcomes for the selected subset. We rigorously explore various subset selection policies and…
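The general idea in the abstract (acquire outcomes for a few prompts, predict the rest from dependencies across examples, then estimate the benchmark score) can be illustrated with a toy sketch. This is not the paper's method: the learned RL policy is replaced here by a random selection policy, and the dependency model by a simple agreement-weighted average over previously evaluated models. All names (`random_policy`, `estimate_score`, `history`, `acquired`) are hypothetical and chosen for illustration only.

```python
import random


def random_policy(n_prompts, budget, seed=0):
    """Stand-in selection policy: pick `budget` distinct prompt indices at random.

    Assumption: the paper learns this policy with RL; random choice is a placeholder.
    """
    rng = random.Random(seed)
    return rng.sample(range(n_prompts), budget)


def estimate_score(history, acquired):
    """Estimate a new model's mean accuracy over all prompts.

    history:  list of rows, one per previously evaluated model;
              row[j] is that model's 0/1 outcome on prompt j.
    acquired: non-empty dict {prompt index: observed 0/1 outcome} for the new model.
    """
    n_prompts = len(history[0])

    # Weight each past model by how often it agrees with the new model
    # on the prompts that were actually evaluated.
    weights = []
    for row in history:
        agree = sum(1 for j, y in acquired.items() if row[j] == y)
        weights.append(agree / len(acquired))
    total = sum(weights)

    predicted = []
    for j in range(n_prompts):
        if j in acquired:
            # Observed outcome: use it directly.
            predicted.append(float(acquired[j]))
        elif total == 0:
            # No past model agrees at all: fall back to the observed mean.
            predicted.append(sum(acquired.values()) / len(acquired))
        else:
            # Impute from past models, weighted by agreement.
            imputed = sum(w * row[j] for w, row in zip(weights, history)) / total
            predicted.append(imputed)
    return sum(predicted) / n_prompts


if __name__ == "__main__":
    # Toy data: three past models, six prompts (made-up values).
    history = [
        [1, 1, 1, 0, 1, 1],
        [1, 0, 1, 0, 0, 1],
        [0, 0, 1, 0, 0, 0],
    ]
    new_model_truth = [1, 0, 1, 0, 1, 1]  # normally unknown; used to simulate acquisition
    chosen = random_policy(n_prompts=6, budget=3)
    acquired = {j: new_model_truth[j] for j in chosen}
    print("acquired prompts:", sorted(acquired))
    print("estimated accuracy:", estimate_score(history, acquired))
```

In this sketch only `budget` prompts are ever evaluated on the new model; every other outcome is imputed. A better selection policy would choose the prompts whose outcomes are most informative about the remaining ones, which is the role the abstract assigns to the learned policy.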
Taxonomy
Topics: Digital Rights Management and Security · Advancements in Photolithography Techniques · Advanced Data Storage Technologies
