Capable but Unreliable: Canonical Path Deviation as a Causal Mechanism of Agent Failure in Long-Horizon Tasks
Wilson Y. Lee

TL;DR
This paper investigates why language agents fail on complex tasks despite being capable, revealing that failures often stem from stochastic deviations from a canonical solution path, which can be mitigated by simple monitoring interventions.
Contribution
It introduces the concept of canonical path deviation as a causal mechanism of agent failure and demonstrates how adherence to this path influences success in long-horizon tasks.
Findings
Successful runs closely follow the canonical solution path.
Off-canonical tool calls significantly increase failure probability.
A simple restart monitor improves success rates by nearly 9%.
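The summary does not say how the restart monitor works, so the following is only a minimal sketch of one plausible design, not the paper's implementation. It assumes the canonical tool set for a task is known in advance, and it abandons and restarts an attempt once the number of off-canonical tool calls exceeds a budget. The names `run_with_restart_monitor`, `start_attempt`, `max_off`, and `max_restarts` are hypothetical.

```python
from typing import Callable, Iterable, List, Optional, Set, Tuple


def run_with_restart_monitor(
    start_attempt: Callable[[int], Iterable[str]],
    canonical: Set[str],
    max_off: int = 1,
    max_restarts: int = 2,
) -> Tuple[Optional[int], List[str]]:
    """Run attempts until one stays within the off-canonical budget.

    start_attempt(i) is a hypothetical hook that yields the tool names
    called during attempt i. Returns (index of the accepted attempt, its
    trace), or (None, last trace) if every attempt was aborted.
    """
    trace: List[str] = []
    for attempt_idx in range(max_restarts + 1):
        off_canonical = 0
        trace = []
        aborted = False
        for tool in start_attempt(attempt_idx):
            trace.append(tool)
            if tool not in canonical:
                off_canonical += 1
                if off_canonical > max_off:
                    aborted = True  # budget exceeded: drop this attempt
                    break
        if not aborted:
            return attempt_idx, trace
    return None, trace


# Toy, scripted attempts (illustrative data, not taken from the paper).
scripted = {
    0: ["search", "delete", "rm", "write"],  # drifts off the canonical path
    1: ["search", "read", "write"],          # stays on it
}
accepted, accepted_trace = run_with_restart_monitor(
    lambda i: scripted.get(i, []),
    canonical={"search", "read", "write"},
)
```

In this toy run the first attempt is cut off at its second off-canonical call and the second attempt is accepted. A real monitor would have to estimate the canonical set and choose the budget from data, which this sketch leaves open.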
Abstract
Why do language agents fail on tasks they are capable of solving? We argue that many such failures are reliability failures caused by stochastic drift from a task's latent solution structure, not capability failures. Every well-defined tool-use task imposes a canonical solution path (i.e., a convergent set of tool invocations shared across successful runs), and agent success depends critically on whether a trajectory stays within this path's operating envelope. We establish this causally using a natural experiment that holds model capability and task difficulty fixed by construction. We analyze trajectories from the Toolathlon benchmark: 22 frontier models each attempt 108 real-world tool-use tasks across 3 independent runs, yielding 515 model-task units where the same model succeeds on some runs and fails on others due to LLM sampling stochasticity alone. Within these units,…
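The natural-experiment setup described in the abstract can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's pipeline: the run-record fields (`model`, `task`, `success`, `tools`) are hypothetical, and the canonical path is approximated as the set of tools that appear in every successful run of a task, which is only one possible reading of "convergent set of tool invocations".

```python
from collections import defaultdict


def mixed_outcome_units(runs):
    """Group runs by (model, task); keep units with both a success and a failure."""
    units = defaultdict(list)
    for run in runs:
        units[(run["model"], run["task"])].append(run)
    return {
        key: unit
        for key, unit in units.items()
        if any(r["success"] for r in unit) and not all(r["success"] for r in unit)
    }


def canonical_tools(runs, task):
    """Assumed proxy: tools present in every successful run of the task."""
    tool_sets = [set(r["tools"]) for r in runs if r["task"] == task and r["success"]]
    return set.intersection(*tool_sets) if tool_sets else set()


def off_canonical_fraction(run, canonical):
    """Share of a run's tool calls that fall outside the canonical set."""
    calls = run["tools"]
    if not calls:
        return 0.0
    return sum(1 for tool in calls if tool not in canonical) / len(calls)


# Toy records (illustrative only; not Toolathlon data).
runs = [
    {"model": "m1", "task": "t1", "success": True, "tools": ["search", "read", "write"]},
    {"model": "m1", "task": "t1", "success": False, "tools": ["search", "delete", "write"]},
    {"model": "m1", "task": "t1", "success": True, "tools": ["search", "read", "write", "read"]},
    {"model": "m2", "task": "t1", "success": True, "tools": ["search", "read", "write"]},
    {"model": "m2", "task": "t1", "success": True, "tools": ["read", "search", "write"]},
    {"model": "m2", "task": "t1", "success": True, "tools": ["search", "read", "write"]},
]

mixed = mixed_outcome_units(runs)
canonical = canonical_tools(runs, "t1")
failed_run = next(r for r in mixed[("m1", "t1")] if not r["success"])
drift = off_canonical_fraction(failed_run, canonical)

# Scale implied by the abstract's figures: 22 models x 108 tasks, 3 runs each.
total_units = 22 * 108
total_runs = total_units * 3
```

Only `("m1", "t1")` qualifies as a mixed-outcome unit in the toy data, since `m2` succeeds on every run. Restricting the comparison to such units is what holds model and task fixed, so differences between the successful and failed runs come from sampling alone.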
Taxonomy
Topics: Natural Language Processing Techniques · Language and cultural evolution · Mobile Crowdsensing and Crowdsourcing
