TL;DR
The paper introduces R-AutoEval+, an adaptive evaluation framework that provides finite-sample reliability guarantees and improves sample efficiency by adjusting how much it relies on synthetic versus real data. It is validated on multiple LLM evaluation tasks.
Contribution
It proposes an adaptive evaluation method with finite-sample guarantees that adjusts its reliance on synthetic data according to the autoevaluator's accuracy, improving sample efficiency over conventional approaches that use only real-world data.
Findings
R-AutoEval+ provides reliable model evaluation with finite-sample guarantees.
The framework improves sample efficiency over conventional methods.
Experiments confirm effectiveness across various LLM evaluation tasks.
Abstract
Selecting artificial intelligence (AI) models, such as large language models (LLMs), from multiple candidates requires accurate performance estimation. This is ideally achieved through empirical evaluations involving abundant real-world data. However, such evaluations are costly and impractical at scale. To address this challenge, autoevaluation methods leverage synthetic data produced by automated evaluators, such as LLMs-as-judges, reducing variance but potentially introducing bias. Recent approaches have employed semi-supervised prediction-powered inference (PPI) to correct for the bias of autoevaluators. However, the use of autoevaluators may lead in practice to a degradation in sample efficiency compared to conventional methods using only real-world data. In this paper, we propose R-AutoEval+, a novel framework that provides finite-sample reliability guarantees on the model…
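To make the bias-correction idea behind prediction-powered inference concrete, here is a minimal sketch of the standard PPI point estimate of a mean score. It is not the R-AutoEval+ procedure and carries no finite-sample guarantee. The function name `ppi_mean` and the toy scores are hypothetical and chosen only for illustration. The estimate takes the autoevaluator's average score on a large unlabeled pool and adds a correction term: the average gap between real labels and autoevaluator scores on a small labeled set.

```python
from statistics import fmean


def ppi_mean(real_labels, judge_on_real, judge_on_unlabeled):
    """Illustrative PPI point estimate of a mean score (hypothetical helper).

    real_labels        -- human/real scores on the small labeled set
    judge_on_real      -- autoevaluator scores on the same labeled examples
    judge_on_unlabeled -- autoevaluator scores on the large unlabeled pool
    """
    if len(real_labels) != len(judge_on_real):
        raise ValueError("labeled scores and judge scores must be paired")
    if not real_labels or not judge_on_unlabeled:
        raise ValueError("both data sets must be non-empty")

    # Average gap between real labels and judge scores: estimates the judge's bias.
    rectifier = fmean([y - f for y, f in zip(real_labels, judge_on_real)])

    # Judge mean on synthetic data, shifted by the measured bias.
    return fmean(judge_on_unlabeled) + rectifier


# Toy data (made up): the judge is over-optimistic on the labeled set.
real_labels = [1, 0, 1, 1]          # real accuracy 0.75
judge_on_real = [1, 1, 1, 1]        # judge says 1.0 on the same items
judge_on_unlabeled = [1, 1, 1, 0, 1, 1, 1, 1]  # judge mean 0.875

estimate = ppi_mean(real_labels, judge_on_real, judge_on_unlabeled)
print(estimate)  # 0.625 = 0.875 + (0.75 - 1.0)
```

When the autoevaluator is accurate, the correction term is small and low-variance, so the large synthetic pool does most of the work. When it is inaccurate, the correction term itself becomes noisy, which is one way to see why relying on an autoevaluator can hurt sample efficiency relative to using real data alone.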
