Cost-Optimal Active AI Model Evaluation
Anastasios N. Angelopoulos, Jacob Eisenstein, Jonathan Berant, Alekh Agarwal, Adam Fisch

TL;DR
This paper introduces cost-aware, active evaluation strategies for generative AI that optimally balance cheap, inaccurate weak ratings with expensive, accurate strong ratings to improve evaluation efficiency.
Contribution
It develops novel, cost-optimal policies for allocating annotation budgets between weak and strong raters, enhancing evaluation efficiency in AI systems.
Findings
Policies outperform prior methods in high-variability tasks.
Significant reduction in annotation costs while maintaining accuracy.
Effective in synthetic and real-world data scenarios.
Abstract
The development lifecycle of generative AI systems requires continual evaluation, data acquisition, and annotation, which is costly in both resources and time. In practice, rapid iteration often makes it necessary to rely on synthetic annotation data because of the low cost, despite the potential for substantial bias. In this paper, we develop novel, cost-aware methods for actively balancing the use of a cheap, but often inaccurate, weak rater -- such as a model-based autorater that is designed to automatically assess the quality of generated content -- with a more expensive, but also more accurate, strong rater alternative such as a human. More specifically, the goal of our approach is to produce a low variance, unbiased estimate of the mean of the target "strong" rating, subject to some total annotation budget. Building on recent work in active and prediction-powered statistical…
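The estimation goal described in the abstract can be illustrated with a minimal sketch. It assumes the standard inverse-probability-weighted form used in the active and prediction-powered inference literature; the paper's exact estimator may differ. The function name `active_mean_estimate` and its arguments are hypothetical, chosen only for this illustration. Every item gets a weak rating, each item is sent to the strong rater with a known probability, and the weak-rater error on the sampled items is reweighted to debias the mean.

```python
import numpy as np


def active_mean_estimate(weak, strong, sampled, pi):
    """Estimate the mean strong rating from weak ratings plus a subsample.

    Hypothetical helper for illustration (not the paper's code).
    weak:    weak ratings G_i, available for every item
    strong:  strong ratings H_i (only read where `sampled` is True)
    sampled: boolean mask, True where the strong rater was queried
    pi:      probability with which each item was sent to the strong rater
    """
    weak = np.asarray(weak, dtype=float)
    strong = np.asarray(strong, dtype=float)
    sampled = np.asarray(sampled, dtype=bool)
    pi = np.asarray(pi, dtype=float)
    # Items that were not sent to the strong rater add no correction term.
    residual = np.where(sampled, strong - weak, 0.0)
    # Dividing by pi keeps the estimate unbiased as long as pi > 0 everywhere.
    return float(np.mean(weak + residual / pi))
```

With `pi` equal to 1 everywhere and every item sampled, this reduces to the plain mean of the strong ratings; when no item is sampled, it falls back to the (possibly biased) mean of the weak ratings.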
Peer Reviews
Decision · Submitted to ICLR 2026
1. The objective of minimizing estimator error subject to an annotation budget is formalized, yielding a closed-form $\pi_{\text{random}}$ in terms of costs and weak-rater MSE, and an adaptive $\pi_{\text{active}} \propto \sqrt{u(x)}$ with principled clipping to respect $\pi(x) \in (0,1]$; the derivation is clear.
2. The transfer and burn-in strategies provide workable recipes, and the paper reports effective budget and cost-savings curves that are easy to interpret.
3. Experiments on real-d
1. The method extends prediction-powered/active inference by optimizing cost-constrained policies and addressing clipping, but much of the estimator form and sequential setup follows prior work.
2. The active policy depends on a non-convex 1-D optimization over $\tau$, and the paper does not report sensitivity to $\tau$, mis-estimated $u(x)$, or misspecified cost ratios, which are likely in practice.
3. The burn-in approach assigns the first $n_b$ items to the strong rater to estimate
1. The paper provides a rigorous theoretical framework for active evaluation, extending beyond prior work. Instead of just improving efficiency for a fixed number of expensive annotations, it derives truly cost-optimal policies ($\pi_{\text{random}}$ and $\pi_{\text{active}}$) that explicitly solve for the best sampling strategy to minimize error given a fixed monetary or computational budget.
2. The work addresses a critical bottleneck in the GenAI lifecycle: the high cost of evaluation. By providing a princ
1. The theoretically derived policies, $\pi_{\text{random}}$ and $\pi_{\text{active}}$, depend on several distributional properties such as $\mathrm{Var}(H)$, $\mathrm{MSE}(H,G)$, and the conditional error $u(x)$. Since these are unknown in a real-world setting, the policies cannot be used out of the box. The paper's practical solutions (burn-in and transfer) are approximations that either require a separate, related dataset or incur an initial "burn-in" cost before any savings can be realized.
2. The benefit of the active policy
1. The problem of cost-aware AI evaluation is timely, practical, and underexplored. The authors correctly identify inefficiencies in current model evaluation practices that rely heavily on costly human or LLM raters.
2. The extension of prediction-powered inference with explicit cost constraints is technically sound. The derivation of closed-form policies (Propositions 1–2) is clear and builds on well-established statistical theory.
3. The Gaussian/Bernoulli experiments in Section 3 are carefully d
1. The real-world experiments are narrow. Most results are on one dataset (Chatbot Arena) with two scenarios, both focused on text-based preference evaluations. There is little diversity in task type or domain (e.g., no multimodal or structured data). The empirical results, while consistent, are modest, often showing ~40–50% budget savings under ideal transfer, which may shrink with realistic uncertainty estimation.
2. While theoretically elegant, the framework’s impact on real-world evaluation pi
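The clipped active policy mentioned in the reviews, $\pi_{\text{active}} \propto \sqrt{u(x)}$ with $\pi(x) \in (0,1]$ and a scale $\tau$, can be sketched as follows. This is only an illustration under stated assumptions: the reviews describe the paper's choice of $\tau$ as a non-convex 1-D optimization, whereas this sketch swaps in a simpler rule that picks $\tau$ by bisection so that the expected fraction of items sent to the strong rater matches a target. The names `clipped_sqrt_policy`, `target_rate`, and `floor` are hypothetical, and `u` is assumed to be an already estimated per-item error.

```python
import numpy as np


def clipped_sqrt_policy(u, target_rate, floor=1e-3, iters=60):
    """Return pi(x) = clip(tau * sqrt(u(x)), floor, 1) with mean close to target_rate.

    Hypothetical helper for illustration (not the paper's procedure).
    u:           estimated conditional weak-rater error per item, u(x) >= 0
    target_rate: desired expected fraction of items sent to the strong rater
    floor:       small lower bound that keeps every probability strictly positive
    """
    if not floor <= target_rate <= 1.0:
        raise ValueError("target_rate must lie in [floor, 1]")
    root = np.sqrt(np.maximum(np.asarray(u, dtype=float), 0.0))

    def policy(tau):
        return np.clip(tau * root, floor, 1.0)

    # The mean of policy(tau) is non-decreasing in tau, so bisection applies.
    lo, hi = 0.0, 1.0
    while policy(hi).mean() < target_rate and hi < 1e12:
        hi *= 2.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if policy(mid).mean() < target_rate:
            lo = mid
        else:
            hi = mid
    return policy(hi)


pi = clipped_sqrt_policy([0.01, 0.04, 0.25, 1.0], target_rate=0.5)
```

If the weak rater is run on every item at cost `c_weak` and the strong rater costs `c_strong` per queried item (both assumed names), the expected cost per item under this sketch is `c_weak + target_rate * c_strong`, so a total budget can be turned into a `target_rate` by simple arithmetic.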
Taxonomy
Topics: Ethics and Social Impacts of AI · Explainable Artificial Intelligence (XAI) · Machine Learning and Algorithms
