MEMSAD: Gradient-Coupled Anomaly Detection for Memory Poisoning in Retrieval-Augmented Agents
Ishrith Gowda (University of California, Berkeley)

TL;DR
This paper introduces MEMSAD, a calibration-based anomaly detection defense against memory poisoning in retrieval-augmented LLM agents. It comes with formal guarantees and is evaluated against three attack classes with escalating access assumptions.
Contribution
The paper presents MEMSAD, a novel calibration-based defense grounded in a gradient coupling theorem, with theoretical optimality and practical effectiveness against memory poisoning attacks.
Findings
MEMSAD achieves a 100% true positive rate and a 0% false positive rate in the reported experiments.
Correcting the triggered-query evaluation protocol of Chen et al. (2024) increases measured attack success fourfold.
Online calibration bounds and a formal characterization of a synonym-invariance loophole are provided.
Abstract
Persistent external memory enables LLM agents to maintain context across sessions, yet its security properties remain formally uncharacterized. We formalize memory poisoning attacks on retrieval-augmented agents as a Stackelberg game with a unified evaluation framework spanning three attack classes with escalating access assumptions. Correcting an evaluation protocol inconsistency in the triggered-query specification of Chen et al. (2024), we show faithful evaluation increases measured attack success by (ASR-R: ). Our primary contribution is MEMSAD (Semantic Anomaly Detection), a calibration-based defense grounded in a gradient coupling theorem: under encoder regularity, the anomaly score gradient and the retrieval objective gradient are provably identical, so any continuous perturbation that reduces detection risk necessarily degrades retrieval rank. This…
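The coupling argument in the abstract can be illustrated with a toy sketch. This is not the paper's implementation. It assumes, purely for illustration, that the anomaly score and the retrieval score are the same cosine similarity to the query, so their gradients coincide trivially. It also assumes a calibration rule of mean plus k standard deviations over benign scores. The random vectors stand in for encoder outputs, and the names `calibrate_threshold`, `blend`, and `poison` are hypothetical.

```python
import math
import random
import statistics


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def calibrate_threshold(benign_scores, k=3.0):
    """Assumed calibration rule: flag scores above mean + k * std of benign scores."""
    return statistics.mean(benign_scores) + k * statistics.pstdev(benign_scores)


def blend(u, v, t):
    """Linear interpolation (1 - t) * u + t * v."""
    return [(1.0 - t) * a + t * b for a, b in zip(u, v)]


rng = random.Random(0)
dim = 64

# Stand-in embeddings; a real system would obtain these from an encoder.
query = [rng.gauss(0.0, 1.0) for _ in range(dim)]
benign = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(200)]
# Toy poisoned entry: crafted to sit very close to the triggered query.
poison = [q + rng.gauss(0.0, 0.05) for q in query]

# Toy assumption: one function serves as both anomaly score and retrieval
# score, so lowering one necessarily lowers the other.
benign_scores = [cosine(b, query) for b in benign]
threshold = calibrate_threshold(benign_scores)

# Move the poison toward the least query-like benign entry and track the score.
target = min(benign, key=lambda b: cosine(b, query))
trajectory = []
for step in range(11):
    t = step / 10
    score = cosine(blend(poison, target, t), query)
    trajectory.append((t, score, score > threshold))

for t, score, flagged in trajectory:
    print(f"t={t:.1f}  score={score:+.3f}  flagged={flagged}")
```

In this toy setting the unperturbed poison is flagged, and by the time the blended entry drops below the calibrated threshold its similarity to the query, and so its retrieval score, has dropped by the same amount. The paper's theorem makes the analogous claim for continuous perturbations under encoder regularity. The sketch only shows the intuition in the degenerate case where the two scores are literally the same function.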
