Beyond pass@1: A Reliability Science Framework for Long-Horizon LLM Agents
Aaditya Khanal, Yangyang Tao, Junxiu Zhou

TL;DR
This paper introduces a framework for assessing the reliability of long-horizon LLM agents. It shows that reliability (consistent success across repeated attempts) diverges from capability (single-attempt success) as task duration grows, and proposes four metrics for evaluating reliability.
Contribution
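It presents a reliability science framework for evaluating long-horizon LLM agents, built on four metrics: Reliability Decay Curve (RDC), Variance Amplification Factor (VAF), Graceful Degradation Score (GDS), and Meltdown Onset Point (MOP). The framework addresses a gap left by existing benchmarks, which measure only single-attempt capability.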
Findings
Reliability decay varies significantly across domains.
A high Variance Amplification Factor (VAF) reflects a model's capability tier, not instability.
Reliability and capability rankings often diverge at long horizons.
Abstract
Existing benchmarks measure capability -- whether a model succeeds on a single attempt -- but production deployments require reliability -- consistent success across repeated attempts on tasks of varying duration. We show these properties diverge systematically as task duration grows, and that pass@1 on short tasks is structurally blind to this divergence. We introduce a reliability science framework for long-horizon LLM agents with four metrics: Reliability Decay Curve (RDC), Variance Amplification Factor (VAF), Graceful Degradation Score (GDS), and Meltdown Onset Point (MOP). We evaluate 10 models across 23,392 episodes on a 396-task benchmark spanning four duration buckets and three domains. Key findings: (1) reliability decay is domain-stratified -- SE GDS drops from 0.90 to 0.44 while document processing is nearly flat (0.74 to 0.71); (2) VAF bifurcates by…
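The capability/reliability distinction drawn in the abstract can be illustrated with a minimal sketch. This is not the paper's definition of RDC, VAF, GDS, or MOP, none of which are spelled out in the text above. The function names, the "every attempt succeeds" reliability proxy, and the episode data are all hypothetical and chosen only to show how a per-attempt success rate can hide inconsistency across repeated attempts.

```python
# Illustrative sketch only: hypothetical data and proxies, not the paper's metrics.

def pass_at_1(outcomes):
    """Mean per-attempt success rate over all tasks (a capability proxy).

    Assumes every task has the same number of attempts.
    """
    attempts = [ok for task in outcomes for ok in task]
    return sum(attempts) / len(attempts)


def all_pass_rate(outcomes):
    """Fraction of tasks on which every repeated attempt succeeded
    (an assumed reliability proxy)."""
    return sum(all(task) for task in outcomes) / len(outcomes)


# Hypothetical outcomes: one inner list per task, one boolean per repeated attempt.
episodes = {
    "short": [
        [True, True, True, True],
        [True, True, True, True],
        [True, True, True, False],
        [False, False, False, False],
    ],
    "long": [
        [True, False, True, True],
        [True, True, False, True],
        [False, True, True, True],
        [True, True, True, False],
    ],
}

for bucket, outcomes in episodes.items():
    print(bucket, pass_at_1(outcomes), all_pass_rate(outcomes))
```

In this made-up data the long bucket has the higher per-attempt success rate (0.75 versus 0.6875) yet no task on which all four attempts succeed, which is the kind of gap a single-attempt score cannot reveal.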
