TL;DR
This paper introduces HORIZON, a cross-domain benchmark for diagnosing long-horizon failures in LLM-based agents. It analyzes how agent performance degrades as task horizon grows and proposes a scalable failure attribution method.
Contribution
It presents a cross-domain diagnostic benchmark, a large dataset of agent trajectories, and a trajectory-grounded LLM-as-a-Judge pipeline for failure analysis.
Findings
Identified horizon-dependent degradation patterns in SOTA agents.
Validated the failure attribution method against human annotations, with strong agreement.
Provided practical guidance for building more reliable long-horizon agents.
Abstract
Large language model (LLM) agents perform strongly on short- and mid-horizon tasks, but often break down on long-horizon tasks that require extended, interdependent action sequences. Despite rapid progress in agentic systems, these long-horizon failures remain poorly characterized, hindering principled diagnosis and comparison across domains. To address this gap, we introduce HORIZON, an initial cross-domain diagnostic benchmark for systematically constructing tasks and analyzing long-horizon failure behaviors in LLM-based agents. Using HORIZON, we evaluate state-of-the-art (SOTA) agents from multiple model families (GPT-5 variants and Claude models), collecting 3100+ trajectories across four representative agentic domains to study horizon-dependent degradation patterns. We further propose a trajectory-grounded LLM-as-a-Judge pipeline for scalable and reproducible failure attribution,…
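One way to make "horizon-dependent degradation" concrete is to bucket trajectories by horizon length and compare success rates across buckets. The snippet below is a minimal sketch of that idea, not the paper's analysis code: the record fields (`domain`, `horizon`, `success`), the domain names, the bucket size, and the function name `success_rate_by_horizon` are all hypothetical, and the toy records are made up for illustration.

```python
from collections import defaultdict


def success_rate_by_horizon(trajectories, bucket_size=10):
    """Return {bucket_start: success_rate}, bucketing by horizon length.

    Each trajectory is assumed (hypothetically) to be a dict with an
    integer "horizon" (number of steps) and a boolean "success".
    """
    totals = defaultdict(int)
    successes = defaultdict(int)
    for t in trajectories:
        # Map a horizon such as 17 to the start of its bucket, here 10.
        bucket = (t["horizon"] // bucket_size) * bucket_size
        totals[bucket] += 1
        successes[bucket] += int(t["success"])
    return {b: successes[b] / totals[b] for b in sorted(totals)}


# Made-up toy records, not data from the paper.
toy_trajectories = [
    {"domain": "web", "horizon": 4, "success": True},
    {"domain": "web", "horizon": 8, "success": True},
    {"domain": "os", "horizon": 12, "success": True},
    {"domain": "os", "horizon": 17, "success": False},
    {"domain": "web", "horizon": 25, "success": False},
]

rates = success_rate_by_horizon(toy_trajectories)
print(rates)
```

A success rate that falls from one bucket to the next is the kind of pattern the abstract calls horizon-dependent degradation. Grouping on `domain` as well would give the per-domain view.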
