Precise Model Benchmarking with Only a Few Observations

Riccardo Fogliato; Pratik Patil; Nil-Jana Akpinar; Mathew Monfort

arXiv:2410.05222 · cs.LG · October 8, 2024

TL;DR

This paper introduces an empirical Bayes (EB) estimator that improves the precision of subgroup-level accuracy estimates for large language models. It avoids the high variance of the direct estimator on small subgroups and the bias of regression-based estimates.

Contribution

The paper proposes a simple empirical Bayes approach that balances the direct and regression estimates for each subgroup separately, improving subgroup performance estimation across multiple data modalities.
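As a rough illustration of what "balancing" the two estimates can look like, the block below writes out a generic shrinkage form. The symbols ($\hat{\theta}_g$, $\lambda_g$, $A$, $\sigma_g^2$) are assumptions introduced here for illustration and are not taken from the paper, whose exact estimator may differ.

```latex
% Hypothetical notation: g indexes subgroups.
% \hat{\theta}_g^{\mathrm{dir}}: direct estimate (subgroup average accuracy)
% \hat{\theta}_g^{\mathrm{reg}}: regression estimate borrowing from other subgroups
% \sigma_g^2: sampling variance of the direct estimate (shrinks as n_g grows)
% A: between-subgroup variance not explained by the regression
\[
  \hat{\theta}_g^{\mathrm{EB}}
    = \lambda_g \, \hat{\theta}_g^{\mathrm{dir}}
    + (1 - \lambda_g) \, \hat{\theta}_g^{\mathrm{reg}},
  \qquad
  \lambda_g = \frac{A}{A + \sigma_g^2}.
\]
```

Under these assumptions, a large subgroup has small $\sigma_g^2$, so $\lambda_g$ is close to 1 and the direct estimate dominates; a small subgroup has large $\sigma_g^2$, so the estimate leans on the regression.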

Findings

01. EB estimator reduces mean squared error in accuracy estimates

02. Confidence intervals with EB are narrower and have near-nominal coverage

03. Method generalizes well to tabular and vision datasets

Abstract

How can we precisely estimate a large language model's (LLM) accuracy on questions belonging to a specific topic within a larger question-answering dataset? The standard direct estimator, which averages the model's accuracy on the questions in each subgroup, may exhibit high variance for subgroups (topics) with small sample sizes. Synthetic regression modeling, which leverages the model's accuracy on questions about other topics, may yield biased estimates that are too unreliable for large subgroups. We prescribe a simple yet effective solution: an empirical Bayes (EB) estimator that balances direct and regression estimates for each subgroup separately, improving the precision of subgroup-level estimates of model performance. Our experiments on multiple datasets show that this approach consistently provides more precise estimates of the LLM performance compared to the direct and…
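The trade-off described in the abstract can be sketched in a few lines of Python. This is a minimal sketch under simplifying assumptions, not the paper's implementation: the "regression" estimate is replaced by the pooled accuracy over all questions (an intercept-only model), the sampling variance uses the pooled accuracy, and the between-subgroup variance is estimated with a simple method-of-moments rule. The function name `eb_subgroup_accuracy` and all variable names are hypothetical.

```python
import numpy as np


def eb_subgroup_accuracy(correct, groups):
    """Shrink per-subgroup accuracy toward a pooled estimate.

    Assumptions (not from the paper): intercept-only "regression"
    (the pooled mean), binomial sampling variance evaluated at the
    pooled accuracy, and a method-of-moments between-group variance.
    """
    correct = np.asarray(correct, dtype=float)  # 1.0 if answered correctly, else 0.0
    groups = np.asarray(groups)

    labels, idx = np.unique(groups, return_inverse=True)
    n = np.bincount(idx)                             # subgroup sizes
    direct = np.bincount(idx, weights=correct) / n   # direct estimator per subgroup

    pooled = correct.mean()                          # stand-in for the regression estimate
    samp_var = pooled * (1.0 - pooled) / n           # variance of each direct estimate

    # Method of moments: excess squared deviation beyond sampling noise.
    between = max(0.0, float(np.mean((direct - pooled) ** 2 - samp_var)))

    # Weight on the direct estimate; falls back to 0 when both variances are 0.
    denom = between + samp_var
    weight = np.divide(between, denom, out=np.zeros_like(denom), where=denom > 0)

    eb = weight * direct + (1.0 - weight) * pooled
    return labels, direct, eb, weight


if __name__ == "__main__":
    # Toy data: topic "a" has 2 questions, topic "b" has 100.
    correct = [1, 1] + [1] * 60 + [0] * 40
    groups = ["a"] * 2 + ["b"] * 100
    for label, d, e, w in zip(*eb_subgroup_accuracy(correct, groups)):
        print(f"{label}: direct={d:.3f} eb={e:.3f} weight_on_direct={w:.3f}")
```

In the toy data, the two-question topic receives a small weight on its direct estimate and is pulled strongly toward the pooled accuracy, while the 100-question topic stays close to its own average.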

Taxonomy

Topics: Reservoir Engineering and Simulation Methods · Rough Sets and Fuzzy Logic · Advanced Database Systems and Queries