SafeAgent: Safeguarding LLM Agents via an Automated Risk Simulator

Xueyang Zhou; Weidong Wang; Lin Lu; Jiawen Shi; Guiyao Tie; Yongtian Xu; Lixing Chen; Pan Zhou; Neil Zhenqiang Gong; Lichao Sun

arXiv:2505.17735 · cs.AI · July 21, 2025

TL;DR

AutoSafe is a novel framework that enhances LLM agent safety by automatically generating synthetic data to model and mitigate unsafe behaviors, significantly improving safety performance without real-world hazardous data.

Contribution

We introduce AutoSafe, the first fully automated safety training pipeline for LLM agents, utilizing an open threat model and synthetic data generation to improve safety without real-world data collection.

Findings

01 Safety scores increased by 45% on average

02 28.91% improvement on real-world safety benchmarks

03 Effective generalization of safety strategies across scenarios

Abstract

Large Language Model (LLM)-based agents are increasingly deployed in real-world applications such as digital assistants, autonomous customer service, and decision-support systems, where their ability to interact in multi-turn, tool-augmented environments makes them indispensable. However, ensuring the safety of these agents remains a significant challenge because of the diverse and complex risks arising from dynamic user interactions, external tool usage, and the potential for unintended harmful behaviors. To address this issue, we propose AutoSafe, the first framework that systematically enhances agent safety through fully automated synthetic data generation. Concretely, 1) we introduce an open and extensible threat model, OTS, which formalizes how unsafe behaviors emerge from the interplay of user instructions, interaction contexts, and agent actions. This enables precise…

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Ethics and Social Impacts of AI · Explainable Artificial Intelligence (XAI)