TL;DR
This paper introduces distribution-free methods for uncertainty quantification in continuous AI agent evaluation, ensuring reliable coverage guarantees and robust multi-agent pipeline assessments.
Contribution
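It adapts split conformal prediction and adaptive conformal inference (ACI) to continuous agent evaluation, develops compositional uncertainty bounds for multi-agent pipelines, and introduces abstention rules for pairwise rankings with FDR correction at leaderboard scale. The compositional bounds are validated in simulation; coverage is evaluated on 50 agents using 18 real-time signals collected hourly.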
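As a rough illustration of FDR-corrected abstention, the sketch below applies the standard Benjamini-Hochberg step-up procedure to a list of per-comparison p-values. This is an assumption made for illustration: the summary does not say which FDR procedure the paper uses or how its p-values are built. The function name `bh_decisions` and the parameter `q` are hypothetical.

```python
def bh_decisions(p_values, q=0.1):
    """Benjamini-Hochberg step-up at target FDR level q.

    Returns a list of booleans aligned with p_values:
    True = report the ranking, False = abstain.
    Assumption: one p-value per pairwise comparison, supplied by the caller.
    """
    m = len(p_values)
    # Indices of the comparisons, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank k with p_(k) <= k * q / m.
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * q / m:
            k_max = rank
    # Report the k_max comparisons with the smallest p-values; abstain on the rest.
    report = [False] * m
    for i in order[:k_max]:
        report[i] = True
    return report
```

Comparisons that come back `False` would be shown as "no confident ordering" rather than ranked.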
Findings
Conformal intervals achieve calibration error below 0.02 across all nominal levels at the 24h horizon.
ACI widens intervals by 35% following agent releases, then reconverges.
Per-agent conditional coverage is concentrated around the nominal level (mean 80.4%; 90% of agents within [72%, 90%]), with divergence predicting instability.
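The widen-then-reconverge behaviour follows from the ACI update rule. A minimal sketch, assuming the standard update alpha_{t+1} = alpha_t + gamma * (alpha - err_t) from the adaptive conformal inference literature. The step size `gamma`, the target `target_alpha`, and both function names are illustrative choices, not values taken from the paper.

```python
import math

def aci_update(alpha_t, covered, target_alpha=0.2, gamma=0.05):
    """One ACI step. err_t is 1 when the last interval missed the truth."""
    err = 0.0 if covered else 1.0
    # A miss lowers alpha_t (wider intervals next step); a hit raises it slightly.
    return alpha_t + gamma * (target_alpha - err)

def half_width(scores, alpha_t):
    """Interval half-width: empirical quantile of past nonconformity scores."""
    # alpha_t can drift outside (0, 1), so handle both ends explicitly.
    if alpha_t <= 0:
        return math.inf
    if alpha_t >= 1:
        return 0.0
    s = sorted(scores)
    k = math.ceil((len(s) + 1) * (1 - alpha_t))
    return math.inf if k > len(s) else s[k - 1]
```

After a release, a run of misses pushes `alpha_t` down and the half-width up. Once intervals cover again, each hit nudges `alpha_t` back toward the target.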
Abstract
We adapt split conformal prediction and adaptive conformal inference (ACI) to continuous AI agent evaluation, providing distribution-free coverage guarantees for forecasted quality scores. Conformal intervals achieve calibration error below 0.02 across all nominal levels at the 24h horizon, while ACI correctly widens intervals by 35% following agent releases and then reconverges. We further develop compositional uncertainty bounds for multi-agent pipelines (validated via simulation across inter-stage correlations rho in [-0.5, 0.9]), a conformal abstention rule for pairwise rankings with controlled false-ranking rate, and FDR-corrected abstention for leaderboard-scale multiple testing. Evaluating 50 agents via 18 real-time signals collected hourly, we show that per-agent conditional coverage is well-concentrated around the nominal level (mean 80.4%, 90% of agents within [72%, 90%]), and…
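For readers unfamiliar with split conformal prediction, the following sketch shows the basic construction with absolute-residual nonconformity scores. The score function, the default `alpha=0.2`, and the name `split_conformal_interval` are assumptions for illustration; the abstract does not specify the paper's forecaster or score.

```python
import math

def split_conformal_interval(cal_preds, cal_truths, new_pred, alpha=0.2):
    """Return (lo, hi) around new_pred with nominal coverage 1 - alpha.

    cal_preds / cal_truths: forecasts and observed quality scores on a
    held-out calibration set (assumed exchangeable with the new point).
    """
    # Nonconformity scores: absolute residuals on the calibration set.
    scores = sorted(abs(y - p) for p, y in zip(cal_preds, cal_truths))
    n = len(scores)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: the interval is unbounded.
        return (-math.inf, math.inf)
    q = scores[k - 1]
    return (new_pred - q, new_pred + q)
```

The guarantee is marginal over the calibration draw and the new point, and it needs no distributional assumption beyond exchangeability.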
