Active Testing of Large Language Models via Approximate Neyman Allocation

Zeli Liu; Jiancheng Zhang; Cong Liu; Yinglun Zhu

arXiv:2605.10075·cs.AI·May 20, 2026

TL;DR

This paper presents an active testing algorithm for large language models that estimates evaluation metrics on generative tasks from a small, informative subset of the evaluation pool, lowering evaluation cost and estimation error relative to existing sampling methods.

Contribution

The authors introduce an active testing approach for generative tasks that stratifies the evaluation pool by semantic entropy computed from surrogate models, then applies approximate Neyman allocation across the strata. It outperforms baseline sampling methods.

Findings

01. Up to 28% MSE reduction over uniform sampling.

02. Average of 22.9% savings in evaluation budget.

03. Significant improvements across multiple benchmarks.

Abstract

Large language models (LLMs) require reliable evaluation from pre-training to test-time scaling, making evaluation a recurring rather than one-off cost. As model scales grow and target tasks increasingly demand expert annotators, both the compute and labeling costs needed for each evaluation rise rapidly. Active testing aims to alleviate this bottleneck by approximating the evaluation result from a small but informative subset of the evaluation pool. However, existing approaches primarily target classification and break down on generative tasks. We introduce a novel active testing algorithm tailored to generative tasks. Our method leverages semantic entropy from surrogate models to stratify the evaluation pool and then conducts approximate Neyman allocation based on signals extracted from these surrogates. Across multiple language and multimodal benchmarks and a range of…
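
To make the allocation step concrete, here is a minimal sketch of classical Neyman allocation, which assigns labels to stratum h in proportion to N_h * sigma_h (stratum size times within-stratum standard deviation of the metric). It is not the authors' approximate variant: the abstract does not specify how the standard deviations are estimated from surrogate signals, so this sketch assumes they are already given. The function names `neyman_allocation` and `stratified_mean` are hypothetical and chosen only for illustration.

```python
import numpy as np


def neyman_allocation(stratum_sizes, stratum_stds, budget):
    """Split `budget` labels across strata in proportion to N_h * sigma_h.

    Assumption: `stratum_stds` are per-stratum standard deviation estimates
    supplied by the caller (e.g. derived from surrogate models).
    """
    sizes = np.asarray(stratum_sizes, dtype=float)
    stds = np.asarray(stratum_stds, dtype=float)
    weights = sizes * stds
    if weights.sum() == 0:
        # No variance signal at all: fall back to proportional allocation.
        weights = sizes
    ideal = budget * weights / weights.sum()
    alloc = np.floor(ideal).astype(int)
    # Hand out the labels lost to rounding, largest fractional part first.
    leftover = int(budget - alloc.sum())
    order = np.argsort(-(ideal - alloc))
    alloc[order[:leftover]] += 1
    return alloc


def stratified_mean(stratum_sizes, stratum_sample_means):
    """Combine per-stratum sample means into a pool-level estimate."""
    sizes = np.asarray(stratum_sizes, dtype=float)
    means = np.asarray(stratum_sample_means, dtype=float)
    return float(np.sum(sizes / sizes.sum() * means))
```

In this sketch, strata where the metric is expected to vary more receive more of the labeling budget, and the final estimate reweights each stratum's sample mean by its share of the pool. A practical version would also cap each allocation at the stratum size and might reserve a minimum number of labels per stratum; both are omitted here for brevity.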
