Hell or High Water: Evaluating Agentic Recovery from External Failures
Andrew Wang, Sophia Hager, Adi Asija, Daniel Khashabi, Nicholas Andrews

TL;DR
This paper introduces a benchmark to evaluate how well language model agents recover from external failures during complex planning tasks, revealing current models' struggles to adapt and pursue backup strategies.
Contribution
The paper presents a novel agentic planning benchmark focused on external failures, providing systematic analysis of model performance and failure modes.
Findings
Models often identify correct functions but fail to adapt to feedback.
Performance degrades with larger search spaces and smaller models.
Scaling models improves some aspects but does not fully solve adaptation issues.
Abstract
As language model agents are applied to real-world problems of increasing complexity, they will be expected to formulate plans across large search spaces. If those plans fail for reasons beyond their control, how well do language agents search for alternative ways to achieve their goals? We devise a specialized agentic planning benchmark to study this question. Each planning problem is solved via combinations of function calls. The agent searches for relevant functions from a set of over four thousand possibilities, and observes environmental feedback in the form of function outputs or error messages. Our benchmark confronts the agent with external failures in its workflow, such as functions that suddenly become unavailable. At the same time, even with the introduction of these failures, we guarantee that the task remains solvable. Ideally, an agent's performance on the planning task…
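The setup described in the abstract can be illustrated with a toy sketch. This is not the paper's benchmark or API: the class `ToyEnvironment`, the helper `solve_with_fallback`, the function names, and the error-message format are all hypothetical, chosen only to show the general shape of "search for functions, observe outputs or errors, fall back when one function becomes unavailable while the task stays solvable".

```python
from typing import Callable, Dict, List, Optional, Set, Tuple


class ToyEnvironment:
    """Hypothetical environment: a function registry plus a set of disabled names."""

    def __init__(self, functions: Dict[str, Callable[[str], str]], disabled: Set[str]):
        self.functions = functions
        self.disabled = disabled

    def search(self, keyword: str) -> List[str]:
        # Return registered function names containing the keyword,
        # in registration order (dicts preserve insertion order).
        return [name for name in self.functions if keyword in name]

    def call(self, name: str, arg: str) -> str:
        # Feedback is either the function output or an error message.
        if name not in self.functions:
            return f"ERROR: unknown function {name}"
        if name in self.disabled:
            return f"ERROR: {name} is currently unavailable"
        return self.functions[name](arg)


def solve_with_fallback(
    env: ToyEnvironment, keyword: str, arg: str
) -> Tuple[Optional[str], List[Tuple[str, str]]]:
    """Try each candidate in turn; move on whenever the feedback is an error."""
    transcript: List[Tuple[str, str]] = []
    for name in env.search(keyword):
        feedback = env.call(name, arg)
        transcript.append((name, feedback))
        if not feedback.startswith("ERROR"):
            return feedback, transcript
    return None, transcript


# Assumed toy registry: the primary booking function is externally disabled,
# but a backup exists, so the task remains solvable.
functions = {
    "book_flight_primary": lambda city: f"flight to {city} booked via primary",
    "book_flight_backup": lambda city: f"flight to {city} booked via backup",
    "get_weather": lambda city: f"weather in {city}: sunny",
}
env = ToyEnvironment(functions, disabled={"book_flight_primary"})
result, transcript = solve_with_fallback(env, "book_flight", "Paris")
print(result)
for name, feedback in transcript:
    print(name, "->", feedback)
```

In this sketch the fallback is a fixed loop over search results; the benchmark instead asks whether a language model agent chooses to pursue such a backup on its own after reading the error feedback.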
