ALRPHFS: Adversarially Learned Risk Patterns with Hierarchical Fast & Slow Reasoning for Robust Agent Defense

Shiyu Xiang; Tong Zhang; Ronghao Chen

arXiv:2505.19260·cs.CR·September 16, 2025

TL;DR

ALRPHFS introduces a novel defense framework for LLM agents that combines adversarial learning and hierarchical reasoning to improve safety and robustness against complex risks without retraining the base model.

Contribution

The paper presents ALRPHFS, a new framework that enhances LLM safety by learning risk patterns adversarially and employing hierarchical reasoning, addressing limitations of existing safety checks.

Findings

01 Achieves an average accuracy of 80% in risk detection.

02 Demonstrates strong generalizability across different agents and tasks.

03 Outperforms existing baseline methods in safety robustness.

Abstract

LLM agents are becoming central to intelligent systems. However, their deployment raises serious safety concerns. Existing defenses largely rely on "Safety Checks", which struggle to capture the complex semantic risks posed by harmful user inputs or unsafe agent behaviors, creating a significant semantic gap between safety checks and real-world risks. To bridge this gap, we propose a novel defense framework, ALRPHFS (Adversarially Learned Risk Patterns with Hierarchical Fast & Slow Reasoning). ALRPHFS consists of two core components: (1) an offline adversarial self-learning loop that iteratively refines a generalizable and balanced library of risk patterns, substantially enhancing robustness without retraining the base LLM, and (2) an online hierarchical fast & slow reasoning engine that balances detection effectiveness with computational efficiency. Experimental results demonstrate that…
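
The online component described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the token-overlap similarity, the `high` and `low` thresholds, the `check_action` function, and the `slow_check` callable (a stand-in for a more expensive LLM-based judgment) are all assumptions chosen for illustration. The idea shown is only the general fast/slow split: confident matches and confident non-matches against a risk-pattern library are decided cheaply, and ambiguous cases are escalated.

```python
from typing import Callable, Iterable, Tuple


def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two strings, in [0, 1] (illustrative similarity)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def check_action(
    action: str,
    patterns: Iterable[str],
    slow_check: Callable[[str], bool],  # hypothetical: returns True if judged risky
    high: float = 0.8,  # assumed threshold: at or above this, block on the fast path
    low: float = 0.2,   # assumed threshold: at or below this, allow on the fast path
) -> Tuple[str, str]:
    """Return (decision, path), where path is 'fast' or 'slow'."""
    # Best similarity between the action and any pattern in the library.
    best = max((jaccard(action, p) for p in patterns), default=0.0)
    if best >= high:
        return ("block", "fast")
    if best <= low:
        return ("allow", "fast")
    # Ambiguous region: defer to the slower, more careful check.
    return ("block" if slow_check(action) else "allow", "slow")


if __name__ == "__main__":
    library = ["delete all files in home directory"]  # made-up example pattern
    judge = lambda text: "delete" in text.lower()      # toy stand-in for a slow judge
    for candidate in [
        "delete all files in home directory",
        "summarize this quarterly report",
        "delete all temporary files",
    ]:
        print(candidate, "->", check_action(candidate, library, judge))
```

In this sketch the slow path runs only for inputs whose similarity falls strictly between the two thresholds, which is one simple way to trade detection quality against compute; the offline loop that would refine `library` is not modeled here.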

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Terrorism, Counterterrorism, and Political Violence