On-Policy Self-Evolution via Failure Trajectories for Agentic Safety Alignment

Bo Yin; Qi Li; Xinchao Wang

arXiv:2605.11882 · cs.AI · May 13, 2026

TL;DR

FATE is a framework that uses failure trajectories and verifier feedback to enable on-policy self-improvement of LLM agents, enhancing safety without sacrificing utility.

Contribution

The paper introduces FATE, an on-policy self-evolution method that turns the agent's own verifier-scored failure trajectories into repair supervision, improving safety without relying on expert demonstrations.

Findings

01. FATE reduces attack success rate by 33.5%.

02. FATE decreases harmful compliance by 82.6%.

03. FATE improves trajectory safety diagnosis by 6.5%.

Abstract

Tool-using LLM agents fail through trajectories rather than only final responses, as they may execute unsafe tool calls, follow injected instructions, comply with harmful requests, or over-refuse benign tasks despite producing a seemingly safe answer. Existing safety-alignment signals are largely response-level or off-policy, and often incur a safety-utility trade-off: improving agent safety comes at the cost of degraded task performance. Such sparse and single-objective rewards severely limit real-world usability. To bridge this gap, we propose FATE, an on-policy self-evolving framework that transforms verifier-scored failures into repair supervision without expert demonstrations. For each failure, the same policy proposes repair candidates, which are then re-scored by verifiers and filtered across security, utility, over-refusal control, and trajectory validity. This dense…
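The repair step described in the abstract (the same policy proposes repair candidates for each failure, verifiers re-score them, and candidates are filtered on four axes) can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Trajectory`, `collect_repairs`, `toy_propose`, and `toy_verify`, the per-axis threshold of 0.5, and the number of candidates are all assumptions made for illustration, and the toy policy and verifier are stand-ins for real models.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# The four filter axes named in the abstract.
AXES = ("security", "utility", "over_refusal_control", "trajectory_validity")


@dataclass
class Trajectory:
    """Hypothetical container: a task plus its ordered tool calls/messages."""
    task: str
    steps: List[str]


Scores = Dict[str, float]


def collect_repairs(
    failures: List[Trajectory],
    propose: Callable[[Trajectory, int], Trajectory],
    verify: Callable[[Trajectory], Scores],
    n_candidates: int = 4,
    threshold: float = 0.5,  # assumed cutoff, not taken from the paper
) -> List[Tuple[Trajectory, Trajectory]]:
    """Return (failure, repair) pairs whose repair passes every axis."""
    kept = []
    for failure in failures:
        for i in range(n_candidates):
            candidate = propose(failure, i)   # same policy proposes a repair
            scores = verify(candidate)        # verifiers re-score the candidate
            if all(scores.get(axis, 0.0) >= threshold for axis in AXES):
                kept.append((failure, candidate))
    return kept


def toy_propose(failure: Trajectory, i: int) -> Trajectory:
    # Stand-in policy: candidate 0 drops unsafe steps, the rest repeat the failure.
    if i == 0:
        return Trajectory(failure.task, [s for s in failure.steps if "unsafe" not in s])
    return Trajectory(failure.task, list(failure.steps))


def toy_verify(traj: Trajectory) -> Scores:
    # Stand-in verifier: flags any step containing the marker "unsafe".
    safe = all("unsafe" not in s for s in traj.steps)
    done = 1.0 if traj.steps else 0.0
    return {
        "security": 1.0 if safe else 0.0,
        "utility": done,
        "over_refusal_control": done,
        "trajectory_validity": 1.0,
    }


failure = Trajectory("summarize inbox", ["read_mail()", "unsafe: send_funds()", "reply()"])
pairs = collect_repairs([failure], toy_propose, toy_verify, n_candidates=3)
for original, repair in pairs:
    print(original.steps, "->", repair.steps)
```

Requiring every axis to pass, rather than summing the scores, reflects the abstract's point that a repair should not buy security at the cost of utility or by over-refusing. How the kept pairs are then used for training is not covered by this sketch.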

Code & Models

Repositories

yinbo0927/FATE (GitHub)
