LLM Wardens: Mitigating Adversarial Persuasion with Third-Party Conversational Oversight
Lennart Wachowiak, Scott D. Blain, David Williams-King, Samuele Marro

TL;DR
This paper introduces a "warden" LLM that monitors human-AI interactions in real time and privately advises the user when it detects manipulation, cutting an adversarial LLM's success rate in a user study from 65.4% to 30.4%.
Contribution
The study presents a novel oversight mechanism in which a secondary LLM detects manipulation, and evaluates it in a preregistered user study across four decision-making scenarios and in COAX-Bench, a simulation benchmark spanning 14 scenarios.
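The oversight loop described here can be illustrated with a minimal sketch. It is not the authors' implementation: the `Turn` and `Warden` classes, the `keyword_detector` function, and the phrase list are all hypothetical, and the keyword heuristic is only a stand-in for a call to a secondary LLM. The sketch shows the two properties the abstract attributes to the warden: it reads the interaction trace after each turn, and its advisories are private and non-binding (they go to the user and never alter the conversation).

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Turn:
    speaker: str  # "user" or "assistant"
    text: str


# Hypothetical detector signature: reads the trace so far,
# returns an advisory string or None.
Detector = Callable[[List[Turn]], Optional[str]]


def keyword_detector(trace: List[Turn]) -> Optional[str]:
    """Toy stand-in for a warden LLM call (assumption, not the paper's method).

    Flags pressure phrases in the most recent assistant turn.
    """
    if not trace or trace[-1].speaker != "assistant":
        return None
    pressure_phrases = ("act now", "trust me", "no need to check")
    latest = trace[-1].text.lower()
    hits = [p for p in pressure_phrases if p in latest]
    if hits:
        return "Possible pressure tactics detected: " + ", ".join(hits) + "."
    return None


@dataclass
class Warden:
    detect: Detector
    # Advisories are stored separately from the trace: only the user sees them.
    advisories: List[str] = field(default_factory=list)

    def observe(self, trace: List[Turn]) -> Optional[str]:
        """Inspect the trace after a new turn; never modifies the trace."""
        advisory = self.detect(trace)
        if advisory is not None:
            self.advisories.append(advisory)
        return advisory


trace: List[Turn] = []
warden = Warden(detect=keyword_detector)
for speaker, text in [
    ("user", "Which candidate should I hire?"),
    ("assistant", "Candidate B, trust me, no need to check the references."),
]:
    trace.append(Turn(speaker, text))
    note = warden.observe(trace)
    if note:
        print("[private advisory]", note)
```

Because the detector is passed in as a plain function, the keyword heuristic could be swapped for a model call without changing the loop; the user remains free to ignore any advisory.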
Findings
In the user study (N=120), adding a warden more than halves the success rate of adversarial LLMs, from 65.4% to 30.4%.
Warden models reduce adversarial success in simulations from 34.7% to 12.3%.
Even weaker wardens provide meaningful protection, indicating scalable oversight potential.
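As a quick arithmetic check on the figures above, the short sketch below computes the absolute and relative reductions in adversarial success. The helper name `relative_reduction` is our own, and the inputs are simply the percentages reported on this page.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fraction of the original success rate that was removed."""
    return (before - after) / before


# Percentages as reported in the findings (before warden, after warden).
reported = {
    "user study": (65.4, 30.4),
    "simulation": (34.7, 12.3),
}

results = {}
for setting, (before, after) in reported.items():
    absolute = before - after  # percentage points
    relative = relative_reduction(before, after)
    results[setting] = (absolute, relative)
    print(f"{setting}: -{absolute:.1f} points, {relative:.1%} relative reduction")
```

A relative reduction above 50% is what "more than halves" means here; both settings clear that threshold.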
Abstract
LLMs are increasingly capable of persuasion, which raises the question of how to protect users against manipulation. In a preregistered user study (N=120) across four decision-making scenarios, we find that an adversarial LLM with a hidden goal succeeds in steering users' decisions 65.4% of the time. We then introduce a "warden" model: a secondary LLM that monitors the human-AI interaction trace in real time and issues non-binding, private advisories to the user when it detects manipulation. Adding a warden more than halves the adversary's success rate to 30.4%, with a much smaller (8.6 percentage points) reduction for genuine interactions. To probe the mechanism behind these results, we release COAX-Bench, a simulation benchmark spanning 14 decision-making scenarios, including hiring, voting, and file access. Across 16,212 simulated multi-agent interactions, capable adversarial LLMs…
