Distribution-Free Uncertainty Quantification for Continuous AI Agent Evaluation

Yuxuan Gao; Megan Wang; Yi Ling Yu

arXiv:2605.19779 · cs.AI · May 20, 2026

TL;DR

This paper applies distribution-free conformal methods to uncertainty quantification in continuous AI agent evaluation, giving coverage guarantees for forecasted quality scores and uncertainty bounds for multi-agent pipelines.

Contribution

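It adapts split conformal prediction and adaptive conformal inference (ACI) to continuous evaluation, develops compositional uncertainty bounds for multi-agent pipelines (validated via simulation), and introduces conformal abstention rules for rankings with FDR correction, evaluated on 50 agents using 18 real-time signals collected hourly.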
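As a rough illustration of the split conformal step named in the abstract, the sketch below computes an interval half-width from held-out calibration residuals. It is a minimal textbook version, not the authors' implementation: the function name `split_conformal_halfwidth`, the absolute-residual score, and the toy numbers are all assumptions made for this example.

```python
import math
import numpy as np

def split_conformal_halfwidth(cal_true, cal_pred, alpha=0.2):
    """Half-width q so that [pred - q, pred + q] has marginal coverage
    >= 1 - alpha, assuming calibration and test points are exchangeable."""
    # Nonconformity score: absolute forecast error on the calibration split.
    scores = np.sort(np.abs(np.asarray(cal_true, dtype=float)
                            - np.asarray(cal_pred, dtype=float)))
    n = scores.size
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: only the trivial
        # (infinite) interval carries the guarantee.
        return float("inf")
    return float(scores[k - 1])

# Hypothetical calibration data: observed vs. forecasted quality scores.
observed = [0.61, 0.58, 0.72, 0.66, 0.49, 0.70, 0.63, 0.55, 0.68]
forecast = [0.60, 0.62, 0.70, 0.60, 0.55, 0.69, 0.65, 0.50, 0.66]
q = split_conformal_halfwidth(observed, forecast, alpha=0.2)
new_forecast = 0.64  # hypothetical point forecast for the next horizon
interval = (new_forecast - q, new_forecast + q)
```

The guarantee here is marginal over agents and time; it says nothing on its own about coverage for any single agent, which is why the per-agent coverage finding is reported separately.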

Findings

01

Conformal intervals achieve calibration error below 0.02 across all nominal levels at the 24h horizon.

02

ACI widens intervals by 35% after agent releases, then reconverges.

03

Per-agent conditional coverage is concentrated around the nominal level (mean 80.4%, with 90% of agents within [72%, 90%]), with divergence predicting instability.
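The widen-then-reconverge behaviour in finding 02 follows from how ACI adjusts its working miscoverage level online. The sketch below uses the standard ACI update, alpha_{t+1} = alpha_t + gamma * (alpha - err_t), on a stream of absolute forecast errors. The function name `aci_halfwidths`, the sliding window, the step size `gamma`, and the synthetic stream are assumptions for illustration; the paper's actual configuration is not given in this listing.

```python
import numpy as np

def aci_halfwidths(scores, alpha=0.2, gamma=0.05, warmup=20, window=100):
    """Adaptive conformal inference over a stream of scores |y_t - yhat_t|.

    Returns (halfwidths, alphas, errs) for t = warmup .. len(scores) - 1.
    """
    scores = np.asarray(scores, dtype=float)
    alpha_t = alpha
    widths, alphas, errs = [], [], []
    for t in range(warmup, len(scores)):
        past = scores[max(0, t - window):t]  # recent calibration scores
        if alpha_t <= 0:
            q = float("inf")    # demand full coverage: infinite interval
        elif alpha_t >= 1:
            q = float("-inf")   # empty interval: always a miss
        else:
            q = float(np.quantile(past, 1 - alpha_t))
        err = 1.0 if scores[t] > q else 0.0  # 1 if the interval missed y_t
        widths.append(q)
        alphas.append(alpha_t)
        errs.append(err)
        # A miss lowers alpha_t (wider intervals next step);
        # a hit raises it slightly (narrower intervals).
        alpha_t += gamma * (alpha - err)
    return np.array(widths), np.array(alphas), np.array(errs)

# Synthetic stream: forecast errors triple in scale halfway through,
# mimicking a disruptive agent release (hypothetical data).
rng = np.random.default_rng(0)
stream = np.abs(np.concatenate([rng.normal(0, 1, 500), rng.normal(0, 3, 500)]))
widths, alphas, errs = aci_halfwidths(stream)
```

With this update the long-run miss rate stays close to `alpha` even when the score distribution shifts, at the cost of temporarily wide (possibly infinite) intervals right after the shift.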

Abstract

We adapt split conformal prediction and adaptive conformal inference (ACI) to continuous AI agent evaluation, providing distribution-free coverage guarantees for forecasted quality scores. Conformal intervals achieve calibration error below 0.02 across all nominal levels at the 24h horizon, while ACI correctly widens intervals by 35% following agent releases, then reconverges. We further develop compositional uncertainty bounds for multi-agent pipelines (validated via simulation across inter-stage correlations ρ ∈ [-0.5, 0.9]), a conformal abstention rule for pairwise rankings with a controlled false-ranking rate, and FDR-corrected abstention for leaderboard-scale multiple testing. Evaluating 50 agents via 18 real-time signals collected hourly, we show that per-agent conditional coverage is well-concentrated around the nominal level (mean 80.4%, 90% of agents within [72%, 90%]), and…
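The abstract does not say which FDR procedure backs the leaderboard-scale abstention rule. As one plausible reading, the sketch below applies the standard Benjamini-Hochberg step-up procedure to per-pair p-values: pairs that are rejected get a published ranking, and the rest are abstained on. The function name `bh_reject`, the level `q`, and the p-values are hypothetical; this is not claimed to be the authors' method.

```python
import numpy as np

def bh_reject(pvals, q=0.1):
    """Benjamini-Hochberg step-up: boolean mask of rejected hypotheses,
    controlling FDR at level q under independence (or PRDS) of p-values."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)                       # indices sorted by p-value
    thresholds = q * np.arange(1, m + 1) / m    # q * k / m for rank k
    passed = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.nonzero(passed)[0].max()         # largest rank that passes
        reject[order[:k + 1]] = True            # reject everything up to it
    return reject

# Hypothetical p-values for "agent A and agent B are tied", one per pair.
pair_pvals = [0.01, 0.04, 0.03, 0.5]
publish = bh_reject(pair_pvals, q=0.1)
abstain = ~publish  # pairs where no ranking is reported
```

Because the procedure is step-up, a pair whose p-value misses its own threshold can still be rejected when a larger p-value further down the sorted list passes.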
