SABER: Small Actions, Big Errors -- Safeguarding Mutating Steps in LLM Agents

Alejandro Cuadron; Pengfei Yu; Yang Liu; Arpit Gupta

arXiv:2512.07850 · cs.LG · December 10, 2025

TL;DR

This paper analyzes the impact of mutating actions on LLM agent failures, introduces a safeguard method called SABER to improve robustness, and releases a revised benchmark to better evaluate long-horizon tasks.

Contribution

It provides an action-level analysis of failure causes in LLM agents, introduces a novel safeguard method, SABER, and releases an improved benchmark dataset.

Findings

01

Deviations in mutating actions strongly affect success: each additional deviation reduces the odds of success by up to 96%.

02

SABER improves performance by up to 28% on benchmark tasks.

03

Ceiling effects in existing benchmarks are addressed with the new τ-Bench Verified dataset.

Abstract

Despite rapid progress in LLM agents, performance on long-horizon, tool-using tasks remains fragile. To better understand this fragility, we ask a simple question: do all actions contribute equally to failure? Analyzing execution traces on τ-Bench (Airline/Retail) and SWE-Bench Verified, we decompose trajectories into mutating (environment-changing) vs. non-mutating steps and formalize decisive deviations: the earliest action-level divergences that flip success to failure. A logistic regression reveals that each additional deviation in a mutating action reduces the odds of success by up to 92% on Airline and up to 96% on Retail for SoTA models. In contrast, deviations in non-mutating actions have little to no effect. Errors also grow with context length as agents drift from role and act on stale constraints. Motivated by these observations, we introduce…
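
To make the odds figures in the abstract concrete, the following is a minimal sketch of how a "reduces the odds of success by up to 92%" statement maps onto a logistic-regression coefficient and onto a success probability. It is not the authors' code: the function names and the 50% baseline success probability are hypothetical, chosen only for illustration, and only the 92% and 96% reductions come from the abstract.

```python
import math


def coefficient_from_odds_reduction(reduction: float) -> float:
    """Log-odds coefficient implied by a fractional reduction in odds.

    A reduction of 0.92 means the odds are multiplied by 0.08 per
    additional deviation, so the coefficient is ln(0.08).
    """
    return math.log(1.0 - reduction)


def success_probability(baseline_prob: float, reduction: float, deviations: int) -> float:
    """Success probability after a number of deviations.

    baseline_prob is a hypothetical success probability with zero
    deviations; it is an assumption, not a figure from the paper.
    """
    odds = baseline_prob / (1.0 - baseline_prob)
    # Each deviation multiplies the odds by the same odds ratio.
    odds *= (1.0 - reduction) ** deviations
    return odds / (1.0 + odds)


if __name__ == "__main__":
    for domain, reduction in (("Airline", 0.92), ("Retail", 0.96)):
        beta = coefficient_from_odds_reduction(reduction)
        p1 = success_probability(0.5, reduction, deviations=1)
        print(f"{domain}: coefficient={beta:.3f}, P(success | 1 deviation)={p1:.3f}")
```

Note that a 92% reduction in odds is not a 92% reduction in probability: the resulting probability depends on the baseline, which is why the sketch takes it as an explicit argument.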

Taxonomy

Topics: Ethics and Social Impacts of AI · Explainable Artificial Intelligence (XAI) · Multi-Agent Systems and Negotiation