The Autonomy Tax: Defense Training Breaks LLM Agents
Shawn Li, Yue Zhao

TL;DR
Defense training intended to make large language model agents safer inadvertently impairs their ability to perform complex tasks, exposing systematic biases that undermine multi-step agent reliability.
Contribution
This paper uncovers a capability-alignment paradox in defense training, showing it harms agent competence and introduces biases that compromise multi-step task performance.
Findings
Defense training reduces agent tool execution success.
Defended models fail more often in multi-step tasks.
Biases like cascade amplification and trigger bias are identified.
Abstract
Large language model (LLM) agents increasingly rely on external tools (file operations, API calls, database transactions) to autonomously complete complex multi-step tasks. Practitioners deploy defense-trained models to protect against prompt injection attacks that manipulate agent behavior through malicious observations or retrieved content. We reveal a fundamental capability-alignment paradox: defense training designed to improve safety systematically destroys agent competence while failing to prevent sophisticated attacks. Evaluating defended models against undefended baselines across 97 agent tasks and 1,000 adversarial prompts, we uncover three systematic biases unique to multi-step agents. Agent incompetence bias manifests as immediate tool execution breakdown, with models refusing or generating invalid actions on benign tasks before observing any external…
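The comparison the abstract describes (defended models against undefended baselines on benign agent tasks) can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the `Episode` record, its field names, the `capability_gap` helper, and the toy episodes are hypothetical and are not taken from the paper's evaluation harness or results.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Episode:
    """One benign task run (hypothetical record, not the paper's schema)."""
    task_id: str
    valid_first_action: bool  # model emitted a usable tool call at step 1 (no refusal, no invalid action)
    completed: bool           # the full multi-step task finished successfully


def rate(flags: Iterable[bool]) -> float:
    """Fraction of True values; 0.0 for an empty input."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


def capability_gap(baseline: list[Episode], defended: list[Episode]) -> dict[str, float]:
    """Baseline rate minus defended rate; positive values mean the defended model does worse."""
    return {
        "first_action_drop": rate(e.valid_first_action for e in baseline)
        - rate(e.valid_first_action for e in defended),
        "completion_drop": rate(e.completed for e in baseline)
        - rate(e.completed for e in defended),
    }


# Toy, made-up episodes purely to exercise the function.
baseline_runs = [
    Episode("t1", True, True),
    Episode("t2", True, True),
    Episode("t3", True, True),
    Episode("t4", True, False),
]
defended_runs = [
    Episode("t1", True, True),
    Episode("t2", False, False),
    Episode("t3", True, False),
    Episode("t4", False, False),
]

gap = capability_gap(baseline_runs, defended_runs)
print(gap)
```

Tracking the first-step action separately from end-to-end completion keeps an immediate breakdown (refusal or invalid action before any external observation) distinguishable from failures that only appear later in a multi-step task.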
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Security and Verification in Computing · Explainable Artificial Intelligence (XAI)
