TL;DR
ProEval is a proactive evaluation framework for generative AI that uses transfer learning and Gaussian Processes to efficiently estimate performance and discover failure cases, reducing resource costs.
Contribution
It introduces a novel Bayesian quadrature approach with pre-trained GPs for efficient performance estimation and failure discovery in generative AI evaluation.
Findings
ProEval requires 8-65x fewer samples than baselines for accurate estimates.
It uncovers more diverse failure cases under limited evaluation budgets.
The pre-trained GP-based BQ estimator is proven to be unbiased and bounded.
Abstract
Evaluating generative AI models is increasingly resource-intensive due to slow inference, expensive raters, and a rapidly growing landscape of models and benchmarks. We propose ProEval, a proactive evaluation framework that leverages transfer learning to efficiently estimate performance and identify failure cases. ProEval employs pre-trained Gaussian Processes (GPs) as surrogates for the performance score function, mapping model inputs to metrics such as the severity of errors or safety violations. By framing performance estimation as Bayesian quadrature (BQ) and failure discovery as superlevel set sampling, we develop uncertainty-aware decision strategies that actively select or synthesize highly informative inputs for testing. Theoretically, we prove that our pre-trained GP-based BQ estimator is unbiased and bounded. Empirically, extensive experiments on reasoning, safety alignment,…
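The two framings in the abstract can be illustrated with a small NumPy sketch: a GP posterior over a finite pool of test inputs, a discrete Bayesian quadrature estimate of the pool-average score, and a per-input exceedance probability for ranking failure candidates. This is not the authors' implementation. The 1-D input space, the `score` function, the fixed kernel hyperparameters and prior mean (stand-ins for what a pre-trained GP would supply), and the maximum-variance selection rule are all assumptions made for illustration.

```python
import numpy as np
from math import erf, sqrt

def rbf_kernel(a, b, length_scale=0.2, variance=1.0):
    """Squared-exponential kernel between 1-D input arrays a and b."""
    d = a[:, None] - b[None, :]
    return variance * np.exp(-0.5 * (d / length_scale) ** 2)

def gp_posterior(x_obs, y_obs, x_pool, prior_mean=0.0, noise=1e-4):
    """Posterior mean and covariance of the GP over the pool."""
    K = rbf_kernel(x_obs, x_obs) + noise * np.eye(len(x_obs))
    Ks = rbf_kernel(x_pool, x_obs)
    Kss = rbf_kernel(x_pool, x_pool)
    mean = prior_mean + Ks @ np.linalg.solve(K, y_obs - prior_mean)
    cov = Kss - Ks @ np.linalg.solve(K, Ks.T)
    return mean, cov

def bq_estimate(mean, cov):
    """Discrete BQ: posterior mean and variance of the pool-average score."""
    n = len(mean)
    return mean.mean(), max(cov.sum() / n**2, 0.0)

def failure_probability(mean, cov, threshold):
    """P(score > threshold) per pool point, for ranking superlevel-set candidates."""
    std = np.sqrt(np.clip(np.diag(cov), 1e-12, None))
    z = (mean - threshold) / std
    return 0.5 * (1.0 + np.vectorize(erf)(z / sqrt(2.0)))

def run_demo(budget=12, seed=0):
    rng = np.random.default_rng(seed)
    x_pool = np.linspace(0.0, 1.0, 200)            # hypothetical pool of test inputs
    score = lambda x: 0.5 + 0.4 * np.sin(6.0 * x)  # hypothetical error-severity score
    true_mean = score(x_pool).mean()

    idx = [int(rng.integers(len(x_pool)))]         # first input chosen at random
    for _ in range(budget - 1):
        _, cov = gp_posterior(x_pool[idx], score(x_pool[idx]), x_pool, prior_mean=0.5)
        var = np.diag(cov).copy()
        var[idx] = -np.inf                         # never re-test an evaluated input
        idx.append(int(np.argmax(var)))            # assumed rule: most uncertain input

    mean, cov = gp_posterior(x_pool[idx], score(x_pool[idx]), x_pool, prior_mean=0.5)
    est, est_var = bq_estimate(mean, cov)
    p_fail = failure_probability(mean, cov, threshold=0.8)
    return true_mean, est, est_var, p_fail, x_pool, score

if __name__ == "__main__":
    true_mean, est, est_var, p_fail, x_pool, _ = run_demo()
    print(f"true mean {true_mean:.4f}, BQ estimate {est:.4f}, variance {est_var:.2e}")
    print("most likely failure input:", x_pool[int(np.argmax(p_fail))])
```

In this sketch only 12 of the 200 pool inputs are scored, and the GP fills in the rest. The variance returned by `bq_estimate` is the kind of uncertainty signal that an uncertainty-aware selection strategy could act on, though the selection rule here is a simple heuristic rather than the one the paper develops.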
