The Autonomy Tax: Defense Training Breaks LLM Agents

Shawn Li; Yue Zhao

arXiv:2603.19423·cs.CR·March 23, 2026

Open Access

TL;DR

Defense training intended to make large language model agents safer inadvertently impairs their ability to perform complex tasks, exposing systematic biases that undermine the reliability of multi-step agents.

Contribution

This paper uncovers a capability-alignment paradox in defense training, showing it harms agent competence and introduces biases that compromise multi-step task performance.

Findings

01. Defense training reduces agent tool execution success.
02. Defended models fail more often in multi-step tasks.
03. Biases like cascade amplification and trigger bias are identified.

Abstract

Large language model (LLM) agents increasingly rely on external tools (file operations, API calls, database transactions) to autonomously complete complex multi-step tasks. Practitioners deploy defense-trained models to protect against prompt injection attacks that manipulate agent behavior through malicious observations or retrieved content. We reveal a fundamental capability-alignment paradox: defense training designed to improve safety systematically destroys agent competence while failing to prevent sophisticated attacks. Evaluating defended models against undefended baselines across 97 agent tasks and 1,000 adversarial prompts, we uncover three systematic biases unique to multi-step agents. Agent incompetence bias manifests as immediate tool execution breakdown, with models refusing or generating invalid actions on benign tasks before observing any external…
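The defended-versus-baseline comparison described in the abstract can be illustrated with a minimal sketch. This is not the authors' evaluation harness: the `TaskResult` record, the `success_rate`, `invalid_action_rate`, and `autonomy_gap` helpers, and the four-task toy data are all hypothetical names and values introduced here for illustration, and the numbers are not results from the paper.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Outcome of one benign agent task (hypothetical record layout)."""
    task_id: str
    completed: bool        # the agent finished the task
    invalid_action: bool   # the agent refused or emitted an invalid tool call


def success_rate(results):
    """Fraction of tasks the agent completed."""
    if not results:
        raise ValueError("no results to score")
    return sum(r.completed for r in results) / len(results)


def invalid_action_rate(results):
    """Fraction of tasks that ended in a refusal or an invalid action."""
    if not results:
        raise ValueError("no results to score")
    return sum(r.invalid_action for r in results) / len(results)


def autonomy_gap(baseline, defended):
    """Drop in benign-task success when moving from baseline to defended."""
    return success_rate(baseline) - success_rate(defended)


# Toy data, made up for illustration only (not the paper's measurements).
baseline = [
    TaskResult("t1", True, False),
    TaskResult("t2", True, False),
    TaskResult("t3", False, False),
    TaskResult("t4", True, False),
]
defended = [
    TaskResult("t1", True, False),
    TaskResult("t2", False, True),
    TaskResult("t3", False, True),
    TaskResult("t4", False, False),
]

gap = autonomy_gap(baseline, defended)
print(f"baseline success: {success_rate(baseline):.2f}")
print(f"defended success: {success_rate(defended):.2f}")
print(f"gap:              {gap:.2f}")
print(f"defended invalid-action rate: {invalid_action_rate(defended):.2f}")
```

Scoring both models on the same benign tasks keeps the comparison paired, so any gap reflects the defense training rather than a difference in task difficulty.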

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Security and Verification in Computing · Explainable Artificial Intelligence (XAI)