OEP: Poisoning Self-Evolving LLM Agents via Locally Correct but Non-Transferable Experiences
Kaixiang Wang, Jiong Lou, Zhaojiacheng Zhou, Jie Li

TL;DR
This paper introduces OEP, a novel black-box attack method that poisons LLM agent memories with locally correct yet harmful experiences, causing risky generalizations during reflection.
Contribution
The authors propose a new poisoning attack, OEP, that exploits the reflection process of LLM agents without requiring direct control over the system prompt or memory database and without explicit malicious content.
Findings
OEP achieves an attack success rate of over 50% on GPT-4o agents.
OEP outperforms existing attacks under LLM auditing defenses.
Poisoned experiences lead to over-generalized risky rules during reflection.
Abstract
Memory-augmented large language model (LLM) agents use iterative reflection and self-evolution to solve complex tasks, but these mechanisms introduce security risks. Existing agentic memory attacks require privileged access or explicit malicious content, making them detectable by advanced safety filters. This leaves a subtler attack surface underexplored: whether adversaries can induce an agent to generate experiences that appear locally correct and semantically plausible yet lead to harmful generalization during reflection. We find that reflective agents are vulnerable to such clean experiences, especially when paired with severe but plausible hypothetical consequences. Based on this observation, we introduce Obsessive Experience Poisoning (OEP), a low-privilege black-box attack requiring no direct control over the system prompt or memory database. OEP constructs adversarial clean…
