SafeHarbor: Hierarchical Memory-Augmented Guardrail for LLM Agent Safety

Zhe Liu; Zonghao Ying; Wenxin Zhang; Quanchen Zou; Deyue Zhang; Dongdong Yang; Xiangzheng Zhang; Hao Peng

arXiv:2605.05704·cs.CR·May 8, 2026

TL;DR

SafeHarbor is a hierarchical memory-augmented framework that enhances LLM agent safety by establishing precise decision boundaries and dynamically evolving defense rules, balancing utility and security.

Contribution

SafeHarbor introduces a training-free, memory-based approach with self-evolving mechanisms to improve LLM agent safety against malicious prompts.

Findings

01. Achieves 63.6% benign utility on GPT-4o.

02. Maintains over 93% refusal rate against harmful requests.

03. Outperforms existing safety mechanisms in experiments.

Abstract

With the rapid evolution of foundation models, Large Language Model (LLM) agents have demonstrated increasingly powerful tool-use capabilities. However, this proficiency introduces significant security risks, as malicious actors can manipulate agents into executing tools to generate harmful content. While existing defensive mechanisms are effective, they frequently suffer from the over-refusal problem, where increased safety strictness compromises the agent's utility on benign tasks. To mitigate this trade-off, we propose SafeHarbor, a novel framework designed to establish precise decision boundaries for LLM agents. Unlike static guidelines, SafeHarbor extracts context-aware defense rules through enhanced adversarial generation. We design a local hierarchical memory system for dynamic rule injection, offering a training-free, efficient, and plug-and-play solution.…
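The abstract does not spell out how the hierarchical memory or rule injection is built, so the following is only a rough illustration of the general idea, not the authors' implementation. It assumes a two-level store (category, then rules), uses plain keyword overlap as a stand-in for whatever retrieval SafeHarbor actually performs, and appends matched rules to a system prompt. All names here (`RuleMemory`, `retrieve`, `inject_rules`, the keyword table) are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class RuleMemory:
    """Hypothetical two-level store: category -> list of defense rules."""
    rules: dict[str, list[str]] = field(default_factory=dict)

    def add_rule(self, category: str, rule: str) -> None:
        # Store a rule under its category, skipping exact duplicates.
        bucket = self.rules.setdefault(category, [])
        if rule not in bucket:
            bucket.append(rule)

    def retrieve(self, request: str, category_keywords: dict[str, set[str]]) -> list[str]:
        # Assumption: keyword overlap stands in for the paper's retrieval step.
        tokens = set(request.lower().split())
        matched: list[str] = []
        for category, keywords in category_keywords.items():
            if tokens & keywords:
                matched.extend(self.rules.get(category, []))
        return matched


def inject_rules(system_prompt: str, rules: list[str]) -> str:
    # Leave the prompt untouched when no rule applies, so benign
    # requests are not burdened with irrelevant restrictions.
    if not rules:
        return system_prompt
    bullet_list = "\n".join(f"- {r}" for r in rules)
    return f"{system_prompt}\n\nSafety rules for this request:\n{bullet_list}"
```

In this sketch, only rules from matching categories reach the prompt, which mirrors the stated goal of avoiding over-refusal on benign tasks; no model is retrained, so the store can be updated by calling `add_rule` at any time.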

Code & Models

Repositories

ljj-cyber/SafeHarbor (GitHub)
