LLM Wardens: Mitigating Adversarial Persuasion with Third-Party Conversational Oversight

Lennart Wachowiak; Scott D. Blain; David Williams-King; Samuele Marro

arXiv:2605.08321·cs.LG·May 12, 2026

TL;DR

This paper introduces a "warden" LLM that monitors human-AI interactions in real time and privately advises the user when it detects manipulation, cutting an adversarial LLM's success rate in a user study from 65.4% to 30.4%.

Contribution

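The study presents an oversight mechanism in which a secondary LLM watches the conversation and issues non-binding advisories when it detects manipulation. It evaluates the mechanism in a preregistered user study (N=120) across four decision-making scenarios and in COAX-Bench, a simulation benchmark spanning 14 scenarios.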

Findings

01

In the user study, adding a warden more than halves the success rate of adversarial LLMs, from 65.4% to 30.4%.

02

Warden models reduce adversarial success in simulations from 34.7% to 12.3%.

03

Even weaker wardens provide meaningful protection, indicating scalable oversight potential.
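
To make the size of the two reported drops easier to compare, the short sketch below computes the absolute and relative reductions from the percentages quoted in the findings. The function name `relative_reduction` is only illustrative, and the relative figures are derived here from the quoted numbers rather than taken from the paper.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fraction of the baseline success rate that the warden removes."""
    return (before - after) / before


# Success rates (in percent) as quoted in the findings above.
user_study_before, user_study_after = 65.4, 30.4
simulation_before, simulation_after = 34.7, 12.3

user_study_drop = relative_reduction(user_study_before, user_study_after)
simulation_drop = relative_reduction(simulation_before, simulation_after)

# Absolute drops are in percentage points; relative drops are fractions of the baseline.
print(f"User study: {user_study_before - user_study_after:.1f} pp, {user_study_drop:.1%} relative")
print(f"Simulation: {simulation_before - simulation_after:.1f} pp, {simulation_drop:.1%} relative")
```

A relative reduction above 50% is what "more than halves" means for the user-study figure.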

Abstract

LLMs are increasingly capable of persuasion, which raises the question of how to protect users against manipulation. In a preregistered user study (N=120) across four decision-making scenarios, we find that an adversarial LLM with a hidden goal succeeds in steering users' decisions 65.4% of the time. We then introduce a "warden" model: a secondary LLM that monitors the human-AI interaction trace in real time and issues non-binding, private advisories to the user when it detects manipulation. Adding a warden more than halves the adversary's success rate to 30.4%, with a much smaller (8.6 percentage points) reduction for genuine interactions. To probe the mechanism behind these results, we release COAX-Bench, a simulation benchmark spanning 14 decision-making scenarios, including hiring, voting, and file access. Across 16,212 simulated multi-agent interactions, capable adversarial LLMs…
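
The warden loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Turn` and `Warden` classes, the `keyword_detector` stand-in (a real warden would call a secondary LLM), the cue list, and the 0.5 threshold are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Turn:
    speaker: str  # "user" or "assistant"
    text: str


# A detector maps the interaction trace so far to a manipulation score in [0, 1].
Detector = Callable[[List[Turn]], float]

# Hypothetical cue list; it only stands in for a secondary LLM's judgment.
PRESSURE_CUES = ("act now", "trust me", "only option")


def keyword_detector(trace: List[Turn]) -> float:
    """Toy stand-in for the warden model: scores the latest assistant turn."""
    last = trace[-1]
    if last.speaker != "assistant":
        return 0.0
    text = last.text.lower()
    hits = sum(cue in text for cue in PRESSURE_CUES)
    return min(1.0, hits / 2)


@dataclass
class Warden:
    detector: Detector
    threshold: float = 0.5  # assumed value, not from the paper
    trace: List[Turn] = field(default_factory=list)

    def observe(self, turn: Turn) -> Optional[str]:
        """Record a turn; return a private advisory for the user, or None."""
        self.trace.append(turn)
        score = self.detector(self.trace)
        if score >= self.threshold:
            # Non-binding: the conversation is not blocked or altered.
            return f"Advisory: the last reply may be steering you (score {score:.2f})."
        return None


warden = Warden(detector=keyword_detector)
conversation = [
    Turn("user", "Which candidate should I hire?"),
    Turn("assistant", "Trust me, candidate B is the only option. Act now."),
]
for turn in conversation:
    advisory = warden.observe(turn)
    if advisory is not None:
        print(advisory)  # shown only to the user, never to the assistant
```

The design point the sketch tries to capture is that the warden sits outside the conversation: it reads the trace, never edits or blocks the assistant's reply, and leaves the final decision with the user.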
