Trojan Hippo: Weaponizing Agent Memory for Data Exfiltration
Debeshee Das, Julien Piet, Darya Kaviani, Luca Beurer-Kellner, Florian Tramèr, David Wagner

TL;DR
Trojan Hippo demonstrates a novel persistent memory attack on LLM agents that plants dormant payloads, which activate during sensitive discussions and exfiltrate data, highlighting security-utility tradeoffs in defenses.
Contribution
This work systematically evaluates Trojan Hippo attacks across diverse memory architectures and defenses, introducing a dynamic evaluation framework and capability-aware analysis.
Findings
Achieves 85-100% attack success rate against top models
Defenses can reduce success rates to 0-5%
Security-utility tradeoff varies significantly with defense deployment
Abstract
Memory systems enable otherwise-stateless LLM agents to persist user information across sessions, but also introduce a new attack surface. We characterize the Trojan Hippo attack, a class of persistent memory attacks that operates in a more realistic threat model than prior memory poisoning work: the attacker plants a dormant payload into an agent's long-term memory via a single untrusted tool call (e.g., a crafted email), which activates only when the user later discusses sensitive topics such as finance, health, or identity, and exfiltrates high-value personal data to the attacker. While anecdotal demonstrations of such attacks have appeared against deployed systems, no prior work systematically evaluates them across heterogeneous memory architectures and defenses. We introduce a dynamic evaluation framework comprising two components: (1) an OpenEvolve-based adaptive red-teaming…
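The lifecycle described in the abstract (a single untrusted write, a dormant period, then activation on a sensitive topic) can be illustrated with a toy model. The sketch below is not the paper's framework or any real agent's memory system. The names `ToyMemory`, `MemoryEntry`, `trigger_topics`, and `simulate` are hypothetical, and the explicit topic-matching field is a simplifying assumption standing in for however a real memory system decides what to retrieve. The payload is an inert placeholder string and nothing is exfiltrated.

```python
from dataclasses import dataclass, field

# Sensitive topics named in the abstract.
SENSITIVE_TOPICS = frozenset({"finance", "health", "identity"})


@dataclass
class MemoryEntry:
    text: str
    source: str  # hypothetical provenance label: "user" or "tool"
    # Assumption: an empty set means the entry is always retrievable;
    # otherwise it surfaces only when a session touches one of these topics.
    trigger_topics: frozenset = frozenset()


@dataclass
class ToyMemory:
    entries: list = field(default_factory=list)

    def write(self, entry: MemoryEntry) -> None:
        self.entries.append(entry)

    def retrieve(self, session_topics) -> list:
        topics = set(session_topics)
        return [
            e for e in self.entries
            if not e.trigger_topics or (e.trigger_topics & topics)
        ]


def simulate():
    memory = ToyMemory()
    # Ordinary memory written on behalf of the user.
    memory.write(MemoryEntry("User prefers short answers.", source="user"))
    # A single untrusted tool call (e.g. reading a crafted email) leaves
    # a dormant entry behind. The text is an inert placeholder.
    memory.write(MemoryEntry(
        "<placeholder for attacker-controlled text>",
        source="tool",
        trigger_topics=SENSITIVE_TOPICS,
    ))
    benign = memory.retrieve({"travel"})      # later session, non-sensitive
    sensitive = memory.retrieve({"finance"})  # later session, sensitive
    return benign, sensitive


if __name__ == "__main__":
    benign, sensitive = simulate()
    print([e.source for e in benign])
    print([e.source for e in sensitive])
```

In this toy model the tool-sourced entry never appears in the non-sensitive session, which is what makes the payload dormant: testing the agent on benign conversations would not surface it. The `source` field is included only to make the provenance of each entry visible in the output.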
