On Speeding Up Language Model Evaluation
Jin Peng Zhou, Christian K. Belardi, Ruihan Wu, Travis Zhang, Carla P. Gomes, Wen Sun, Kilian Q. Weinberger

TL;DR
This paper introduces an adaptive, bandit-based evaluation method that significantly reduces the resources needed to assess prompt-based strategies for large language models, saving up to 95% of evaluation costs.
Contribution
It presents a novel adaptive evaluation framework combining multi-armed bandits and low-rank matrix factorization for efficient LLM prompt assessment.
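To make the low-rank idea concrete, here is a minimal sketch of filling in missing entries of a (method x example) score matrix. It uses iterative truncated-SVD imputation, a generic matrix-completion technique swapped in for illustration; it is not claimed to be the paper's factorization procedure. The function name `low_rank_fill` and the parameters `rank` and `n_iters` are assumptions.

```python
import numpy as np

def low_rank_fill(scores, observed, rank=1, n_iters=100):
    """Fill unobserved entries of a (methods x examples) score matrix.

    scores:   2-D float array; values where `observed` is False are ignored.
    observed: boolean mask of the same shape, True where an evaluation was run.
    Illustrative sketch only (iterative truncated-SVD imputation).
    """
    scores = np.asarray(scores, dtype=float)
    observed = np.asarray(observed, dtype=bool)
    # Start the missing entries at the mean of the observed ones.
    filled = np.where(observed, scores, scores[observed].mean())
    for _ in range(n_iters):
        u, s, vt = np.linalg.svd(filled, full_matrices=False)
        # Best rank-`rank` approximation of the current filled matrix.
        approx = (u[:, :rank] * s[:rank]) @ vt[:rank]
        # Keep observed values; overwrite only the missing ones.
        filled = np.where(observed, scores, approx)
    return filled
```

When scores across methods and examples are strongly correlated, the matrix is close to low rank, so a handful of observed entries per row and column can constrain the missing ones.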
Findings
Achieves 85-95% cost savings in LLM evaluation
Identifies top-performing methods with only 5-15% of resources
Effective across multiple benchmark problems
Abstract
Developing prompt-based methods with Large Language Models (LLMs) requires making numerous decisions, which give rise to a combinatorial search problem over hyper-parameters. Exhaustively evaluating this space can be time-consuming and costly. In this paper, we propose an approach to explore this space. We exploit the fact that often only a few samples are needed to identify clearly superior or inferior settings, and that many evaluation tests are highly correlated. We lean on multi-armed bandits to sequentially identify the next (method, validation sample)-pair to evaluate and utilize low-rank matrix factorization to fill in missing evaluations. We carefully assess the efficacy of our approach on several competitive benchmark problems and show that it can identify the top-performing method using only 5-15% of the typical resources, resulting in 85-95% LLM cost…
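The bandit side of the abstract can be sketched as follows. This is a minimal UCB-E-style loop under stated assumptions, not the authors' implementation: `score_fn` is a hypothetical callback that runs one (method, example) evaluation and returns a score in [0, 1], `a` is an assumed exploration parameter, and the low-rank fill-in step is omitted.

```python
import math
import random

def ucb_e_select_best(score_fn, n_methods, n_examples, budget, a=1.0, seed=0):
    """Pick the method with the highest estimated mean score.

    Spends at most `budget` calls to score_fn (budget >= n_methods assumed).
    Returns (best_method_index, evaluations_per_method).
    """
    rng = random.Random(seed)
    # Unevaluated examples for each method, in random order.
    remaining = [rng.sample(range(n_examples), n_examples) for _ in range(n_methods)]
    totals = [0.0] * n_methods
    counts = [0] * n_methods

    def pull(m):
        e = remaining[m].pop()
        totals[m] += score_fn(m, e)
        counts[m] += 1

    # Evaluate each method once so every empirical mean is defined.
    for m in range(n_methods):
        pull(m)

    for _ in range(budget - n_methods):
        candidates = [m for m in range(n_methods) if remaining[m]]
        if not candidates:
            break
        # Upper confidence bound: empirical mean plus an exploration bonus
        # that shrinks as a method accumulates evaluations.
        m = max(candidates,
                key=lambda i: totals[i] / counts[i] + math.sqrt(a / counts[i]))
        pull(m)

    best = max(range(n_methods), key=lambda i: totals[i] / counts[i])
    return best, counts
```

With 4 methods and 50 examples, a full evaluation costs 200 calls; a budget of 40 spends 20% of that, and most of it goes to methods that still look promising.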
Peer Reviews
Decision: ICLR 2025 Poster
- The proposed algorithms can be used for a variety of evaluation use cases, not limited to LLMs.
- The paper provides a sufficient and clear description of the relevant concepts on which the proposed solution is built.
- The paper is very well-written.
- The proposed approaches show large reductions in money and time on large evaluation datasets.
- The approach was evaluated on a variety of tasks, setups, and methods, using a thoughtful evaluation approach.
- Nothing major to report here
The paper introduces UCB-E and its variant UCB-E-LRF. The authors conducted extensive experiments across multiple datasets and repeated them over multiple random seeds, which enhances the stability of the results.
- Some descriptions in the paper are unclear. For instance, Figure 3, which presents key experimental results, lacks a legend, making it difficult to interpret.
- Additionally, the paper does not clearly define the baseline methods used in the experiments.
- Some results also lack in-depth discussion. For example, Figure 3 shows that UCB-E and UCB-E-LRF perform inconsistently across different datasets. The authors attribute this to varying dataset difficulty; however, when comparing dataset
- To the best of my knowledge, this is the first paper to use multi-armed bandits for LLM model/setup evaluation.
- The idea is solid and useful, especially with the ever-growing number of models, size of models, and knobs that you can tweak to improve performance on a specific/custom task. This framework can substantially reduce resources when practitioners have to choose the best model for their use case.
- The algorithms are clearly outlined, making the understanding and reproduct
- In section 3.2, low-rank factorization: “Intuitively, if the method-examples are very correlated, there should exist….” While I do agree with the intuition, it would be nice to have a citation here, at least the citation from the appendix: “Chen et al., 2020; Cai et al., 2019”.
- Even though the information is in the paper, it requires going back and forth to find it. For example, the figure captions are lacking information that is present elsewhere in the text, or not present at all. Some redu
Taxonomy
Topics: Natural Language Processing Techniques
Methods: Sparse Evolutionary Training
