ProEval: Proactive Failure Discovery and Efficient Performance Estimation for Generative AI Evaluation

Yizheng Huang; Wenjun Zeng; Aditi Kumaresan; Zi Wang

arXiv:2604.23099·cs.LG·April 28, 2026

TL;DR

ProEval is a proactive evaluation framework for generative AI that uses transfer learning and Gaussian Processes to efficiently estimate performance and discover failure cases, reducing resource costs.

Contribution

ProEval frames performance estimation as Bayesian quadrature (BQ) over pre-trained Gaussian Process (GP) surrogates and failure discovery as superlevel set sampling, giving uncertainty-aware strategies for choosing which inputs to test.

Findings

01  ProEval requires 8-65x fewer samples than baselines for accurate estimates.

02  ProEval uncovers more diverse failure cases under limited evaluation budgets.

03  The pre-trained GP-based BQ estimator is proven to be unbiased and bounded.

Abstract

Evaluating generative AI models is increasingly resource-intensive due to slow inference, expensive raters, and a rapidly growing landscape of models and benchmarks. We propose ProEval, a proactive evaluation framework that leverages transfer learning to efficiently estimate performance and identify failure cases. ProEval employs pre-trained Gaussian Processes (GPs) as surrogates for the performance score function, mapping model inputs to metrics such as the severity of errors or safety violations. By framing performance estimation as Bayesian quadrature (BQ) and failure discovery as superlevel set sampling, we develop uncertainty-aware decision strategies that actively select or synthesize highly informative inputs for testing. Theoretically, we prove that our pre-trained GP-based BQ estimator is unbiased and bounded. Empirically, extensive experiments on reasoning, safety alignment,…
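
The two framings in the abstract can be illustrated with a small NumPy sketch. It is not the ProEval implementation or the API of the linked repository. Everything in it is an assumption made for illustration: a finite pool of 1-D inputs, an RBF kernel with fixed hyperparameters, a constant prior mean standing in for a pre-trained GP, a toy `score_fn`, and a greedy rule that queries the input expected to shrink the variance of the pool-average estimate the most. Performance estimation is the posterior mean of the pool average (a discrete form of Bayesian quadrature), and failure discovery ranks inputs by the posterior probability that their score exceeds a threshold (membership in the superlevel set).

```python
import numpy as np
from math import erf, sqrt

NOISE = 1e-4  # assumed observation noise / jitter, not a value from the paper


def rbf_kernel(a, b, lengthscale=0.5, variance=1.0):
    """Squared-exponential kernel between 1-D input arrays a and b."""
    d = a[:, None] - b[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)


def gp_posterior(pool, x_obs, y_obs, prior_mean):
    """Posterior mean vector and covariance matrix of the GP over the pool."""
    k_pp = rbf_kernel(pool, pool)
    if len(x_obs) == 0:
        return np.full(len(pool), prior_mean), k_pp
    k_oo = rbf_kernel(x_obs, x_obs) + NOISE * np.eye(len(x_obs))
    k_po = rbf_kernel(pool, x_obs)
    # Solve once for both the residual vector and the cross-covariance block.
    sol = np.linalg.solve(k_oo, np.column_stack([y_obs - prior_mean, k_po.T]))
    mean = prior_mean + k_po @ sol[:, 0]
    cov = k_pp - k_po @ sol[:, 1:]
    return mean, cov


def failure_probability(mean, cov, threshold):
    """P(score > threshold) per pool input under the Gaussian posterior."""
    std = np.sqrt(np.clip(np.diag(cov), 1e-12, None))
    z = (threshold - mean) / std
    return 0.5 * (1.0 - np.vectorize(erf)(z / sqrt(2.0)))


def run_sketch(budget=8, threshold=0.9, prior_mean=0.5):
    pool = np.linspace(0.0, 2.0, 200)

    def score_fn(x):
        # Hypothetical error-severity score in [0, 1]; stands in for a real rater.
        return 0.5 + 0.5 * np.sin(2.0 * x)

    queried = []
    mean, cov = gp_posterior(pool, np.array([]), np.array([]), prior_mean)
    prior_var = cov.mean()  # variance of the pool-average score before any query

    for _ in range(budget):
        # Greedy BQ step: observing input j reduces the variance of the pool
        # average by (mean_i cov[i, j])^2 / (cov[j, j] + noise).
        diag = np.clip(np.diag(cov), 1e-12, None)
        gain = cov.mean(axis=0) ** 2 / (diag + NOISE)
        if queried:
            gain[queried] = -np.inf  # do not re-query the same input
        queried.append(int(np.argmax(gain)))
        x_obs = pool[queried]
        mean, cov = gp_posterior(pool, x_obs, score_fn(x_obs), prior_mean)

    return {
        "estimate": mean.mean(),          # BQ estimate of the average score
        "estimate_var": cov.mean(),       # posterior variance of that estimate
        "prior_var": prior_var,
        "true_mean": score_fn(pool).mean(),
        "p_fail": failure_probability(mean, cov, threshold),
        "true_scores": score_fn(pool),
        "queried": queried,
    }


if __name__ == "__main__":
    out = run_sketch()
    print("estimate:", out["estimate"], "true mean:", out["true_mean"])
    print("inputs flagged as likely failures:", int((out["p_fail"] > 0.5).sum()))
```

In this toy setting the estimate uses 8 scored inputs out of a pool of 200. The greedy variance-reduction rule is one common choice for Bayesian quadrature; the paper's actual selection and synthesis strategies may differ.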

Code & Models

Repositories

google-deepmind/proeval (GitHub)