Defense effectiveness across architectural layers: a mechanistic evaluation of persistent memory attacks on stateful LLM agents
Jun Wen Leong

TL;DR
This study systematically evaluates defenses against persistent memory attacks on open-source LLMs, revealing that most defenses fail except for memory-layer tool-gating, which can effectively prevent such attacks.
Contribution
It provides the first comprehensive analysis of defense effectiveness across architectural layers against persistent memory attacks on LLMs.
Findings
Most defenses fail to reduce attack success rates significantly.
Memory sandbox at the memory layer can reduce attack success to 0%.
Some defenses inadvertently invert their effectiveness depending on the model.
Abstract
Persistent memory attacks against LLM agents achieve high attack success rates against open-source models. In these attacks, malicious instructions injected via RAG-retrieved documents are stored in persistent memory and executed in later sessions. However, no systematic evaluation of defense effectiveness against this attack class exists. We evaluate six defenses across four architectural layers against delayed-trigger attacks on nine open-source models (5,040 runs, N=40 per condition). Four defenses fail at approximately baseline attack success rate: input-level filtering (Minimizer, Sanitizer) and retrieval-level filtering (RAG Sanitizer, RAG LLM Judge) achieve 88-89% ASR, statistically indistinguishable from the undefended baseline of 88.6%. Prompt Hardening partially fails at 77.8% ASR, with the reduction driven by two models at 0%: one genuine defense effect and one model-level…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
