Capable but Unreliable: Canonical Path Deviation as a Causal Mechanism of Agent Failure in Long-Horizon Tasks

Wilson Y. Lee

arXiv:2602.19008·cs.CL·February 24, 2026

Open Access

TL;DR

This paper investigates why language agents fail on complex tasks despite being capable, revealing that failures often stem from stochastic deviations from a canonical solution path, which can be mitigated by simple monitoring interventions.

Contribution

It introduces the concept of canonical path deviation as a causal mechanism of agent failure and demonstrates how adherence to this path influences success in long-horizon tasks.

Findings

01 Successful runs closely follow the canonical solution path.

02 Off-canonical tool calls significantly increase failure probability.

03 A simple restart monitor improves success rates by nearly 9%.
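
The summary above does not specify how the restart monitor works, so the following is only a minimal sketch of one plausible form: count tool calls that fall outside a known canonical set and abandon the trajectory for a fresh one once a budget is exceeded. The names `run_with_restart`, `attempt`, `max_off_canonical`, and `max_restarts`, as well as the threshold values and tool names, are hypothetical and are not taken from the paper.

```python
def run_with_restart(attempt, canonical, max_off_canonical=2, max_restarts=1):
    """Run trajectories until one stays within the off-canonical budget.

    attempt: hypothetical callable returning an iterable of tool names
             for one freshly sampled trajectory.
    canonical: set of tool names assumed to form the canonical path.
    Returns (tool_calls, restarts_used). If every attempt is aborted,
    the partial calls of the last attempt are returned.
    """
    calls = []
    for restart in range(max_restarts + 1):
        calls = []
        off_canonical = 0
        aborted = False
        for tool in attempt():
            calls.append(tool)
            if tool not in canonical:
                off_canonical += 1
                if off_canonical > max_off_canonical:
                    aborted = True  # drifted too far: give up on this trajectory
                    break
        if not aborted:
            return calls, restart
    return calls, max_restarts


# Toy usage: the first sampled trajectory drifts, the second stays on path.
_trajectories = iter([
    ["search", "shell", "shell", "shell", "write"],
    ["search", "read", "write"],
])
final_calls, restarts_used = run_with_restart(
    lambda: next(_trajectories), canonical={"search", "read", "write"}
)
```

In this toy run the first trajectory is cut off at its third off-canonical call and the second one is accepted, so one restart is used. A real monitor would also need a way to obtain the canonical set without access to successful runs of the same task, which this sketch simply assumes.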

Abstract

Why do language agents fail on tasks they are capable of solving? We argue that many such failures are reliability failures caused by stochastic drift from a task's latent solution structure, not capability failures. Every well-defined tool-use task imposes a canonical solution path (i.e., a convergent set of tool invocations shared across successful runs) and agent success depends critically on whether a trajectory stays within this path's operating envelope. We establish this causally using a natural experiment that holds model capability and task difficulty fixed by construction. We analyze trajectories from the Toolathlon benchmark: 22 frontier models each attempt 108 real-world tool-use tasks across 3 independent runs, yielding 515 model $\times$ task units where the same model succeeds on some runs and fails on others due to LLM sampling stochasticity alone. Within these units,…
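
The unit selection and path definition described in the abstract can be illustrated with a small sketch. It assumes each run is a record with a model name, a task name, a success flag, and an ordered list of tool names. This record layout, the helper names, and the `min_share` cutoff for deciding which tools count as "shared across successful runs" are all illustrative assumptions, not the paper's actual procedure or data format.

```python
from collections import defaultdict


def mixed_outcome_units(runs):
    """Keep (model, task) units that contain both a success and a failure."""
    units = defaultdict(list)
    for run in runs:
        units[(run["model"], run["task"])].append(run)
    return {
        key: group
        for key, group in units.items()
        if any(r["success"] for r in group) and not all(r["success"] for r in group)
    }


def canonical_tools(successful_runs, min_share=0.5):
    """Tools that appear in at least min_share of the successful runs (assumed cutoff)."""
    counts = defaultdict(int)
    for run in successful_runs:
        for tool in set(run["tools"]):
            counts[tool] += 1
    total = len(successful_runs)
    return {tool for tool, c in counts.items() if c / total >= min_share}


def adherence(run, canonical):
    """Fraction of a run's tool calls that fall inside the canonical set."""
    if not run["tools"]:
        return 0.0
    return sum(t in canonical for t in run["tools"]) / len(run["tools"])


# Toy data: model m1 has mixed outcomes on task t1, model m2 always succeeds.
runs = [
    {"model": "m1", "task": "t1", "success": True, "tools": ["search", "read", "write"]},
    {"model": "m1", "task": "t1", "success": False, "tools": ["search", "shell", "shell", "write"]},
    {"model": "m1", "task": "t1", "success": True, "tools": ["search", "read", "write"]},
    {"model": "m2", "task": "t1", "success": True, "tools": ["search", "read", "write"]},
]

units = mixed_outcome_units(runs)
canonical = canonical_tools([r for r in runs if r["task"] == "t1" and r["success"]])
failed_run = next(r for r in units[("m1", "t1")] if not r["success"])
failed_adherence = adherence(failed_run, canonical)
```

Only the `("m1", "t1")` unit qualifies here, because it is the only one where the same model both succeeds and fails on the same task. Comparing adherence between successful and failed runs inside such a unit holds the model and the task fixed, which is the point of the natural experiment.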

Taxonomy

Topics: Natural Language Processing Techniques · Language and cultural evolution · Mobile Crowdsensing and Crowdsourcing