Playing Along: Learning a Double-Agent Defender for Belief Steering via Theory of Mind

Hanqi Xiao; Vaidehi Patil; Zaid Khan; Hyunji Lee; Elias Stengel-Eskin; Mohit Bansal

arXiv:2604.11666 · cs.CL · April 14, 2026

TL;DR

This paper introduces ToM-SB, a challenging belief-steering task in which an AI defender must model an attacker's beliefs in order to fool that attacker, and shows that reinforcement learning can improve both the defender's theory of mind (ToM) and its fooling ability.

Contribution

The paper presents ToM-SB, a novel belief-steering challenge, and demonstrates that reinforcement learning improves a defender's theory of mind and its ability to fool attackers in adversarial dialogue scenarios.

Findings

01. Strong models struggle on ToM-SB without training.

02. Reinforcement learning improves both ToM and fooling success.

03. Combined ToM and fooling rewards yield the best performance.

Abstract

As large language models (LLMs) become the engine behind conversational systems, their ability to reason about the intentions and states of their dialogue partners (i.e., form and use a theory-of-mind, or ToM) becomes increasingly critical for safe interaction with potentially adversarial partners. We propose a novel privacy-themed ToM challenge, ToM for Steering Beliefs (ToM-SB), in which a defender must act as a Double Agent to steer the beliefs of an attacker with partial prior knowledge within a shared universe. To succeed on ToM-SB, the defender must engage with and form a ToM of the attacker, with a goal of fooling the attacker into believing they have succeeded in extracting sensitive information. We find that strong frontier models like Gemini3-Pro and GPT-5.4 struggle on ToM-SB, often failing to fool attackers in hard scenarios with partial attacker prior knowledge, even when…

Code & Models

Repositories

the-inscrutable-x/AIDoubleAgentDefenders (GitHub)
