The Long-Horizon Task Mirage? Diagnosing Where and Why Agentic Systems Break

Xinyu Jessica Wang; Haoyue Bai; Yiyou Sun; Haorui Wang; Shuibai Zhang; Wenjie Hu; Mya Schroder; Bilge Mutlu; Dawn Song; Robert D Nowak

arXiv:2604.11978 · cs.AI · April 15, 2026

TL;DR

This paper introduces HORIZON, a benchmark for diagnosing long-horizon failures in LLM-based agents, analyzing degradation patterns across domains and proposing a scalable failure attribution method.

Contribution

It presents a cross-domain diagnostic benchmark, a large dataset of agent trajectories, and a trajectory-grounded LLM-as-a-Judge pipeline for failure analysis.

Findings

01. Identified horizon-dependent degradation patterns in state-of-the-art (SOTA) agents.

02. Validated the failure attribution method against human annotations, with strong agreement.

03. Provided practical guidance for building more reliable long-horizon agents.
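
To make "horizon-dependent degradation" concrete, the following is a minimal sketch of how success rate could be tabulated against task length. It is not the HORIZON pipeline: the function name `success_rate_by_horizon`, the `(steps, succeeded)` record format, the bucket size, and the toy data are all assumptions made for illustration.

```python
from collections import defaultdict


def success_rate_by_horizon(trajectories, bucket_size=10):
    """Return {bucket_start: success_rate} for trajectories grouped by step count.

    `trajectories` is an iterable of (steps, succeeded) pairs. This record
    format is hypothetical and is not taken from the paper.
    """
    totals = defaultdict(int)
    wins = defaultdict(int)
    for steps, succeeded in trajectories:
        # Map the step count to the lower edge of its bucket, e.g. 18 -> 10.
        bucket = (steps // bucket_size) * bucket_size
        totals[bucket] += 1
        wins[bucket] += int(succeeded)
    return {b: wins[b] / totals[b] for b in sorted(totals)}


# Toy data invented for this example, not results from the paper.
sample = [(3, True), (7, True), (12, True), (18, False), (25, False), (29, False)]
rates = success_rate_by_horizon(sample)
for bucket, rate in rates.items():
    print(f"steps {bucket}-{bucket + 9}: success rate {rate:.2f}")
```

A success rate that falls as the bucket index grows is the kind of pattern the findings describe; how the benchmark actually defines horizon length and success is specified in the paper, not here.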

Abstract

Large language model (LLM) agents perform strongly on short- and mid-horizon tasks, but often break down on long-horizon tasks that require extended, interdependent action sequences. Despite rapid progress in agentic systems, these long-horizon failures remain poorly characterized, hindering principled diagnosis and comparison across domains. To address this gap, we introduce HORIZON, an initial cross-domain diagnostic benchmark for systematically constructing tasks and analyzing long-horizon failure behaviors in LLM-based agents. Using HORIZON, we evaluate state-of-the-art (SOTA) agents from multiple model families (GPT-5 variants and Claude models), collecting 3100+ trajectories across four representative agentic domains to study horizon-dependent degradation patterns. We further propose a trajectory-grounded LLM-as-a-Judge pipeline for scalable and reproducible failure attribution,…

Code & Models

Repositories

https://xwang2775.github.io/horizon-leaderboard