TL;DR
SafeHarbor is a hierarchical memory-augmented framework that enhances LLM agent safety by establishing precise decision boundaries and dynamically evolving defense rules, balancing utility and security.
Contribution
SafeHarbor introduces a training-free, memory-based approach with self-evolving mechanisms that improves LLM agent safety against malicious prompts.
Findings
Achieves 63.6% benign utility on GPT-4o.
Maintains over 93% refusal rate against harmful requests.
Outperforms existing safety mechanisms in experiments.
Abstract
With the rapid evolution of foundation models, Large Language Model (LLM) agents have demonstrated increasingly powerful tool-use capabilities. However, this proficiency introduces significant security risks, as malicious actors can manipulate agents into executing tools to generate harmful content. While existing defensive mechanisms are effective, they frequently suffer from the over-refusal problem, where increased safety strictness compromises the agent's utility on benign tasks. To mitigate this trade-off, we propose SafeHarbor, a novel framework designed to establish precise decision boundaries for LLM agents. Unlike static guidelines, SafeHarbor extracts context-aware defense rules through enhanced adversarial generation. We design a local hierarchical memory system for dynamic rule injection, offering a training-free, efficient, and plug-and-play solution.…
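The abstract does not spell out how the hierarchical memory or the rule injection works, so the following is only a rough illustration of the general idea, not the authors' implementation. It assumes a two-level memory (risk categories holding defense rules), keyword-overlap retrieval, and injection by appending rules to the system prompt. All names here (`RuleCategory`, `retrieve_rules`, `build_system_prompt`) and the retrieval scheme are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class RuleCategory:
    """Hypothetical top level of the memory: a coarse risk category."""
    name: str
    keywords: set[str]
    rules: list[str] = field(default_factory=list)  # second level: defense rules


def tokenize(text: str) -> set[str]:
    """Lowercase the text and strip basic punctuation from each word."""
    return {w.strip(".,!?;:").lower() for w in text.split()} - {""}


def retrieve_rules(request: str, memory: list[RuleCategory], max_rules: int = 2) -> list[str]:
    """Return rules from categories whose keywords overlap the request.

    Keyword overlap is an assumed stand-in for whatever retrieval
    SafeHarbor actually uses.
    """
    tokens = tokenize(request)
    scored = [(len(tokens & c.keywords), c) for c in memory]
    matched = sorted((p for p in scored if p[0] > 0), key=lambda p: p[0], reverse=True)
    rules: list[str] = []
    for _, category in matched:
        rules.extend(category.rules)
    return rules[:max_rules]


def build_system_prompt(base_prompt: str, rules: list[str]) -> str:
    """Inject retrieved rules; benign requests leave the prompt unchanged."""
    if not rules:
        return base_prompt
    bullets = "\n".join(f"- {r}" for r in rules)
    return f"{base_prompt}\n\nDefense rules:\n{bullets}"
```

In this sketch, a request that matches no category receives the unmodified system prompt, which is one simple way to avoid adding refusal pressure on benign tasks. Because the rules live in a local data structure rather than in model weights, they can be added or edited without retraining, which matches the training-free, plug-and-play framing in the abstract.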
