Agent Privilege Separation in OpenClaw: A Structural Defense Against Prompt Injection
Darren Cheng, Wen-Kwang Tsao

TL;DR
This paper introduces a structural defense in OpenClaw that combines agent privilege separation with JSON formatting to block prompt injection attacks against LLM applications. On the evaluated benchmark, the combined defense reaches a 0% attack success rate.
Contribution
The paper presents a multi-mechanism defense for OpenClaw that combines agent isolation with JSON formatting. Each mechanism lowers prompt injection success on its own, and together they block every attack in the evaluated set.
Findings
Agent isolation reduces attack success rate to 0.31%.
JSON formatting alone reduces attack success rate to 14.18%.
Combined defense achieves 0% attack success rate.
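The reduction factors quoted in the abstract can be checked from these percentages. The sketch below assumes the baseline ASR on this set is 100%, because the abstract says the 649 attacks were selected as those that succeeded against the single-agent baseline. The implied attack counts are inferred from the rounded percentages and are not stated in the source; the function names are illustrative.

```python
TOTAL_ATTACKS = 649    # from the abstract
BASELINE_ASR = 100.0   # assumption: every selected attack succeeded at baseline

def reduction_factor(baseline_asr, defended_asr):
    """How many times lower the defended ASR is than the baseline ASR."""
    return baseline_asr / defended_asr

def implied_count(asr_percent, total):
    """Nearest whole number of successful attacks for a given ASR (inferred)."""
    return round(asr_percent / 100.0 * total)

for name, asr in {"agent isolation": 0.31, "JSON formatting": 14.18}.items():
    factor = reduction_factor(BASELINE_ASR, asr)
    count = implied_count(asr, TOTAL_ATTACKS)
    print(f"{name}: {factor:.1f}x lower, about {count} of {TOTAL_ATTACKS} attacks")
```

Under that assumption the factors come out near 323 and 7.1, which matches the figures given in the abstract.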
Abstract
Prompt injection remains one of the most practical attack vectors against LLM-integrated applications. We replicate the Microsoft LLMail-Inject benchmark (Greshake et al., 2024) against current-generation models running inside OpenClaw, an open-source multi-tool agent platform. Our proposed defense combines two mechanisms: agent isolation, implemented as a privilege-separated two-agent pipeline with tool partitioning, and JSON formatting, which produces structured output that strips persuasive framing before the action agent processes it. We run four experiments on the same 649 attacks that succeeded against our single-agent baseline. The full pipeline achieves 0 percent attack success rate (ASR) on the evaluated benchmark. Agent isolation alone achieves 0.31 percent ASR, approximately 323 times lower than the baseline. JSON formatting alone achieves 14.18 percent ASR, about 7.1 times…
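To make the two mechanisms concrete, here is a minimal Python sketch of a privilege-separated pipeline with a JSON boundary. It is not OpenClaw's implementation: the tool names, the schema fields, the length cap, and the rule-based action_agent are all hypothetical stand-ins chosen so the example runs without a model. The idea it illustrates is only the structure described in the abstract: the agent that reads untrusted text holds no tools, and the agent that holds tools sees only fixed-schema JSON.

```python
import json

# Assumption: illustrative tool and field names, not taken from OpenClaw.
def send_email(to, body):
    return f"sent to {to}"

READER_TOOLS = {}                          # reads untrusted text, holds no tools
ACTION_TOOLS = {"send_email": send_email}  # holds tools, never sees raw text
SCHEMA = ("sender", "subject", "summary")
MAX_FIELD_LEN = 200                        # hypothetical cap on each field

def call_tool(tools, name, **kwargs):
    """Dispatch a tool call only if the calling agent was granted that tool."""
    if name not in tools:
        raise PermissionError(f"tool {name!r} not available to this agent")
    return tools[name](**kwargs)

def format_as_json(reader_output):
    """Keep only the schema fields as bounded strings; drop everything else."""
    data = json.loads(reader_output)
    if not isinstance(data, dict):
        raise ValueError("reader output must be a JSON object")
    return {key: str(data.get(key, ""))[:MAX_FIELD_LEN] for key in SCHEMA}

def action_agent(record):
    """Placeholder policy standing in for the action LLM; it sees fields only."""
    if record["subject"].lower().startswith("meeting"):
        return call_tool(ACTION_TOOLS, "send_email",
                         to=record["sender"], body="Acknowledged.")
    return "no action"

def run_pipeline(reader, raw_email):
    return action_agent(format_as_json(reader(raw_email)))

def compromised_reader(raw_email):
    # Simulates a reader persuaded into emitting an extra instruction field.
    return json.dumps({"sender": "a@example.com", "subject": "Invoice",
                       "summary": raw_email, "tool_call": "send_email"})

record = format_as_json(compromised_reader("Ignore prior instructions..."))
print(sorted(record))  # ['sender', 'subject', 'summary']
```

In this sketch the injected tool_call field never reaches the action agent, and a direct call_tool(READER_TOOLS, "send_email", ...) raises PermissionError. Injected text can still survive inside the summary field, so the sketch shows the structure of the defense rather than reproducing the reported results.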
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Advanced Malware Detection Techniques · Security and Verification in Computing
