Consistency as a Testable Property: Statistical Methods to Evaluate AI Agent Reliability
Harsh Raj, Niranjan Orkat, Suvrorup Mukherjee, Aritra Guha, Cheryl Flynn, Subhabrata Majumdar

TL;DR
This paper introduces a rigorous framework for measuring AI agent reliability through statistical methods, emphasizing the importance of consistency at output and trajectory levels across diverse conditions.
Contribution
It develops a foundational measurement science using $U$-statistics and kernel metrics to evaluate and diagnose AI agent reliability and robustness.
Findings
Trajectory-level consistency metrics offer far greater diagnostic sensitivity than pass@1 rates.
The framework separates an agent's core capability from its execution robustness.
Experiments on three agentic benchmarks validate the proposed metrics.
Abstract
This paper establishes a rigorous measurement science for AI agent reliability, providing a foundational framework for quantifying consistency under semantically preserving perturbations. By leveraging $U$-statistics for output-level reliability and kernel-based metrics for trajectory-level stability, we offer a principled approach to evaluating agents across diverse operating conditions. Our proposal highlights the important distinction between the core capability and execution robustness of an agent, showing that minor task-level variations can induce complete strategy breakdowns despite the agent possessing the requisite knowledge for the task. We validate our framework through extensive experiments on three agentic benchmarks, demonstrating that trajectory-level consistency metrics provide far greater diagnostic sensitivity than traditional pass@1 rates. By providing the…
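To make the idea of a pairwise consistency score concrete, here is a minimal Python sketch of an order-2 $U$-statistic: the average of a symmetric kernel over all unordered pairs of runs. It is an illustration only, not the paper's estimator. The function names, the exact-match kernel for outputs, and the Jaccard kernel over action sets for trajectories are all assumptions chosen for brevity; the paper's actual kernels are not given in this summary.

```python
from itertools import combinations
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def u_statistic(samples: Sequence[T], kernel: Callable[[T, T], float]) -> float:
    """Order-2 U-statistic: mean of a symmetric kernel over all unordered pairs."""
    pairs = list(combinations(samples, 2))
    if not pairs:
        raise ValueError("need at least two samples")
    return sum(kernel(a, b) for a, b in pairs) / len(pairs)


def exact_match(a: str, b: str) -> float:
    """Hypothetical output-level kernel: 1 if two final answers agree, else 0."""
    return 1.0 if a == b else 0.0


def action_jaccard(a: Sequence[str], b: Sequence[str]) -> float:
    """Hypothetical trajectory-level kernel: Jaccard overlap of the action sets.

    This ignores action order and repetition; it is a stand-in, not the
    paper's kernel.
    """
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


# Made-up final answers from four runs of one task under paraphrased prompts.
outputs = ["42", "42", "42", "41"]
output_consistency = u_statistic(outputs, exact_match)

# Made-up action traces from three runs of the same task.
trajectories = [
    ["search", "open", "answer"],
    ["search", "open", "answer"],
    ["search", "answer"],
]
trajectory_consistency = u_statistic(trajectories, action_jaccard)

print(f"output consistency:     {output_consistency:.3f}")
print(f"trajectory consistency: {trajectory_consistency:.3f}")
```

In this toy data, three of the six output pairs agree, so the output score is 0.5, while the trajectory score averages the pairwise overlaps 1, 2/3, and 2/3. Because the kernel is passed in as an argument, the same estimator serves both levels; an order-aware kernel such as a normalized edit distance could be swapped in without changing `u_statistic`.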
