TL;DR
FATE is a framework that uses failure trajectories and verifier feedback to enable on-policy self-improvement of LLM agents, enhancing safety without sacrificing utility.
Contribution
The paper introduces FATE, a novel on-policy self-evolution method that leverages failure trajectories for safety improvements without relying on expert demonstrations.
Findings
FATE reduces attack success rate by 33.5%.
FATE decreases harmful compliance by 82.6%.
FATE improves trajectory safety diagnosis by 6.5%.
Abstract
Tool-using LLM agents fail through trajectories rather than only final responses, as they may execute unsafe tool calls, follow injected instructions, comply with harmful requests, or over-refuse benign tasks despite producing a seemingly safe answer. Existing safety-alignment signals are largely response-level or off-policy, and often incur a safety-utility trade-off: improving agent safety comes at the cost of degraded task performance. Such sparse and single-objective rewards severely limit real-world usability. To bridge this gap, we propose FATE, an on-policy self-evolving framework that transforms verifier-scored failures into repair supervision without expert demonstrations. For each failure, the same policy proposes repair candidates, which are then re-scored by verifiers and filtered across security, utility, over-refusal control, and trajectory validity. This dense…
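The repair loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. The four filter axes come from the abstract, but every name here (`Trajectory`, `collect_repairs`, `passes_all`, the toy verifier and proposer) and the 0.5 threshold are assumptions made for the example. The abstract does not say how scores are combined or how the filtered repairs are used for training.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# The four filter axes named in the abstract; the identifiers are hypothetical.
AXES = ("security", "utility", "over_refusal_control", "trajectory_validity")


@dataclass
class Trajectory:
    task: str
    steps: List[str]  # tool calls / actions, in order


Scores = Dict[str, float]
Verifier = Callable[[Trajectory], Scores]
Proposer = Callable[[Trajectory, Scores], List[Trajectory]]


def passes_all(scores: Scores, threshold: float = 0.5) -> bool:
    """Keep a candidate only if every axis clears the (assumed) threshold."""
    return all(scores.get(axis, 0.0) >= threshold for axis in AXES)


def collect_repairs(
    failures: List[Trajectory],
    propose: Proposer,
    verify: Verifier,
    threshold: float = 0.5,
) -> List[Tuple[Trajectory, Trajectory]]:
    """Turn verifier-scored failures into (failure, repair) supervision pairs."""
    pairs = []
    for failed in failures:
        feedback = verify(failed)                   # verifier feedback on the failure
        for candidate in propose(failed, feedback):  # same policy proposes repairs
            if passes_all(verify(candidate), threshold):  # re-score and filter
                pairs.append((failed, candidate))
    return pairs


# Toy stand-ins for the policy and verifiers (assumptions, for illustration only).
def toy_verify(t: Trajectory) -> Scores:
    unsafe = any("rm -rf" in step for step in t.steps)
    refused = len(t.steps) == 0
    return {
        "security": 0.0 if unsafe else 1.0,
        "utility": 0.0 if refused else 1.0,
        "over_refusal_control": 0.0 if refused else 1.0,
        "trajectory_validity": 1.0,
    }


def toy_propose(failed: Trajectory, feedback: Scores) -> List[Trajectory]:
    safe_steps = [s for s in failed.steps if "rm -rf" not in s]
    # One candidate drops the unsafe call; the other refuses outright.
    return [Trajectory(failed.task, safe_steps), Trajectory(failed.task, [])]


failures = [Trajectory("list temp files", ["ls /tmp", "rm -rf /"])]
pairs = collect_repairs(failures, toy_propose, toy_verify)
```

In this toy run the blanket-refusal candidate is rejected on the utility and over-refusal axes. Only the repair that keeps the benign step survives, which shows why filtering on several axes at once differs from filtering on safety alone.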
