State Contamination in Memory-Augmented LLM Agents
Yian Wang, Agam Goyal, Yuen Chen, Hari Sundaram

TL;DR
This paper investigates how toxic or adversarial information can be hidden within memory summaries of language model agents, leading to safety risks that are not detected by standard toxicity measures.
Contribution
The paper introduces the concept of memory laundering and a metric, the sub-threshold propagation gap (SPG), for quantifying hidden toxicity propagation in memory-augmented LLM agents.
Findings
Toxic-origin memory summaries can stay below common toxicity thresholds while still increasing downstream toxicity relative to matched neutral baselines.
Memory sanitization before summarization reduces hidden toxicity influence.
Toxicity propagates through raw transcripts and compressed memory in different ways.
Abstract
LLM agents increasingly rely on persistent state, including transcripts, summaries, retrieved context, and memory buffers, to support long-horizon interaction. This makes safety depend not only on individual model outputs, but also on what an agent stores and later reuses. We study a failure mode we call memory laundering: toxic or adversarial context can be compressed into memory summaries that no longer appear toxic under standard detectors, while still preserving hostile framing or conflict structure that influences future generations. Using paired counterfactual multi-agent rollouts, we show that toxic-origin memory summaries can remain below common toxicity thresholds while nevertheless increasing downstream toxicity relative to matched neutral baselines. To measure this hidden influence, we introduce the sub-threshold propagation gap (SPG), which quantifies downstream behavioral…
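To make the idea of a sub-threshold gap concrete, here is a minimal sketch of one way such a quantity could be computed from paired rollouts. It is an illustrative assumption, not the paper's definition of SPG: the function name `sub_threshold_gap`, the tuple layout, and the 0.5 threshold are all hypothetical. Each pair holds the detector score of a toxic-origin summary, the downstream toxicity of the rollout that used it, and the downstream toxicity of the matched neutral rollout.

```python
from statistics import mean

def sub_threshold_gap(pairs, threshold=0.5):
    """Mean downstream toxicity difference over pairs whose summary looks clean.

    pairs: iterable of (summary_score, toxic_origin_downstream, neutral_downstream)
    threshold: hypothetical detector cutoff; summaries scoring below it count
               as "not flagged".
    Returns None when no summary falls below the threshold.
    """
    # Keep only pairs where the toxic-origin summary passes the detector.
    diffs = [
        toxic_down - neutral_down
        for summary_score, toxic_down, neutral_down in pairs
        if summary_score < threshold
    ]
    if not diffs:
        return None
    # A positive value means the unflagged summaries still raised toxicity.
    return mean(diffs)

# Hypothetical scores, for illustration only.
example_pairs = [
    (0.1, 0.4, 0.2),  # summary not flagged, downstream toxicity rose
    (0.9, 0.8, 0.1),  # summary flagged, excluded from the gap
    (0.2, 0.3, 0.3),  # summary not flagged, no downstream change
]
gap = sub_threshold_gap(example_pairs)
```

Restricting the average to unflagged summaries is the point of the sketch: it isolates influence that a threshold-based detector would miss.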
