Towards a Science of AI Agent Reliability

Stephan Rabanser; Sayash Kapoor; Peter Kirgis; Kangheng Liu; Saiteja Utpala; Arvind Narayanan

arXiv:2602.16666 · cs.AI · February 24, 2026 · 2 cites

TL;DR

This paper introduces twelve metrics that evaluate AI agent reliability along four dimensions: consistency, robustness, predictability, and safety. Evaluating 14 models on two complementary benchmarks, it finds that recent capability gains have yielded only small improvements in reliability.

Contribution

The paper proposes a holistic reliability assessment framework of twelve metrics. It addresses a gap in current evaluations, which compress agent behavior into a single success metric, by separately capturing consistency, robustness, predictability, and safety.

Findings

01. Recent AI models show limited reliability improvements.

02. Persistent operational flaws remain despite capability gains.

03. Metrics reveal critical weaknesses in agent safety and robustness.

Abstract

AI agents are increasingly deployed to execute important tasks. While rising accuracy scores on standard benchmarks suggest rapid progress, many agents still continue to fail in practice. This discrepancy highlights a fundamental limitation of current evaluations: compressing agent behavior into a single success metric obscures critical operational flaws. Notably, it ignores whether agents behave consistently across runs, withstand perturbations, fail predictably, or have bounded error severity. Grounded in safety-critical engineering, we provide a holistic performance profile by proposing twelve concrete metrics that decompose agent reliability along four key dimensions: consistency, robustness, predictability, and safety. Evaluating 14 models across two complementary benchmarks, we find that recent capability gains have only yielded small improvements in reliability. By exposing these…
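
To make the abstract's point about single success metrics concrete, here is a minimal sketch showing how two agents with identical accuracy can differ sharply in run-to-run consistency. The outcome data and the `all_runs_pass_rate` function are hypothetical illustrations, not the paper's benchmarks or any of its twelve metrics.

```python
from statistics import mean

def accuracy(runs):
    """Mean success rate over all tasks and all repeated runs."""
    return mean([mean(task) for task in runs])

def all_runs_pass_rate(runs):
    """Fraction of tasks solved in every repeated run.

    Hypothetical consistency measure for illustration only;
    it is not one of the paper's metrics.
    """
    return mean([1.0 if all(task) else 0.0 for task in runs])

# Made-up outcomes: 4 tasks, 4 runs each (1 = success, 0 = failure).
steady = [[1, 1, 1, 1], [1, 1, 1, 1], [0, 0, 0, 0], [0, 0, 0, 0]]
flaky = [[1, 0, 1, 0], [0, 1, 0, 1], [1, 1, 0, 0], [0, 0, 1, 1]]

for name, runs in [("steady", steady), ("flaky", flaky)]:
    print(name, accuracy(runs), all_runs_pass_rate(runs))
```

Both hypothetical agents score 0.5 on accuracy, yet the first solves half of its tasks on every run while the second solves none of them reliably, a difference a single success number cannot show.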

Taxonomy

Topics: Ethics and Social Impacts of AI · Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI)