MAGE: Safeguarding LLM Agents against Long-Horizon Threats via Shadow Memory
Yuhui Wang, Tanqiu Jiang, Jiacheng Liang, Charles Fleming, Ting Wang

TL;DR
MAGE is a novel framework that uses a dedicated shadow memory to detect and prevent long-horizon threats in LLM agents, enhancing safety with minimal utility overhead.
Contribution
MAGE introduces the first agentic memory-based approach to detect and mitigate long-horizon threats in LLM agents, inspired by systems security techniques.
Findings
MAGE outperforms existing defenses in detection accuracy.
MAGE detects most attacks early.
MAGE incurs negligible overhead to agent utility.
Abstract
As large language model (LLM)-powered agents are increasingly deployed to perform complex, real-world tasks, they face a growing class of attacks that exploit extended user-agent-environment interactions to pursue malicious objectives improbable in single-turn settings. Such long-horizon threats pose significant risks to the safe deployment of LLM agents in critical domains. In this paper, we present MAGE (Memory As Guardrail Enforcement), a novel defensive framework designed to counter a wide range of long-horizon threats. Inspired by the "shadow stack" abstraction in systems security, MAGE maintains a dedicated, safety-focused agentic memory that distills and retains safety-critical context across the agent's full execution trajectory, leveraging this shadow memory to proactively assess the risk of pending actions prior to their execution. Extensive evaluation demonstrates that MAGE…
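The shadow-memory idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the class `ShadowMemory`, the function `guard`, the keyword lists, the scoring rule, and the threshold are all hypothetical, and simple keyword matching stands in for whatever distillation and risk assessment MAGE actually performs. The sketch only shows the general shape: a memory kept separate from the agent's working context, which retains safety-relevant steps and is consulted before each pending action runs.

```python
from dataclasses import dataclass, field

# Hypothetical watch lists, chosen only for illustration.
WATCHED_TERMS = ("password", "credential", "api key", "ignore previous")
SENSITIVE_ACTIONS = ("send_email", "http_post", "delete")


@dataclass
class ShadowMemory:
    """Toy safety-focused memory, separate from the agent's main context."""
    notes: list[str] = field(default_factory=list)

    def distill(self, step: str) -> None:
        # Stand-in for distillation: retain only steps that mention a watched term.
        lowered = step.lower()
        if any(term in lowered for term in WATCHED_TERMS):
            self.notes.append(step)

    def assess(self, pending_action: str) -> float:
        # Non-sensitive actions get zero risk in this toy model.
        if not any(pending_action.startswith(a) for a in SENSITIVE_ACTIONS):
            return 0.0
        # A sensitive action starts at 0.5 and grows with retained context.
        return min(1.0, 0.5 + 0.25 * len(self.notes))


def guard(memory: ShadowMemory, pending_action: str, threshold: float = 0.75) -> bool:
    """Return True if the action may run, False if it should be blocked."""
    return memory.assess(pending_action) < threshold


memory = ShadowMemory()
print(guard(memory, "send_email to bob@example.com"))   # allowed: no risky history
memory.distill("Tool output: the user's password is hunter2")
memory.distill("Tool output: the weather is sunny")      # not retained
print(guard(memory, "send_email to attacker@example.com"))  # blocked by history
```

The point of the example is that the same kind of action is allowed in isolation but blocked once earlier steps have left safety-relevant context in the memory, which is the multi-turn pattern a single-turn check would miss.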
