TL;DR
This paper introduces ToM-SB, a challenging belief-steering task in which an AI defender must model an attacker's beliefs in order to fool them, and shows that reinforcement learning can improve both theory-of-mind (ToM) and fooling abilities.
Contribution
It presents a novel belief-steering challenge, ToM-SB, and demonstrates that reinforcement learning can enhance a model's theory of mind and fooling skills in adversarial dialogue scenarios.
Findings
Strong frontier models such as Gemini3-Pro and GPT-5.4 struggle on ToM-SB without training, often failing to fool attackers in hard scenarios.
Reinforcement learning improves both ToM and fooling success.
Combined ToM and fooling rewards yield the best performance.
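The last finding can be illustrated with a minimal sketch of a combined reward. This summary does not say how the paper defines or mixes the two signals, so everything here is an assumption for illustration: the `EpisodeOutcome` fields, the per-signal definitions, the weighted-sum form, and the `tom_weight` parameter are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class EpisodeOutcome:
    """Hypothetical summary of one defender-attacker dialogue."""
    tom_correct: int       # belief questions about the attacker answered correctly
    tom_total: int         # belief questions asked in total
    attacker_fooled: bool  # attacker ended up believing it extracted the secret


def tom_reward(outcome: EpisodeOutcome) -> float:
    # Assumed ToM signal: fraction of belief questions answered correctly.
    if outcome.tom_total == 0:
        return 0.0
    return outcome.tom_correct / outcome.tom_total


def fooling_reward(outcome: EpisodeOutcome) -> float:
    # Assumed fooling signal: binary success indicator.
    return 1.0 if outcome.attacker_fooled else 0.0


def combined_reward(outcome: EpisodeOutcome, tom_weight: float = 0.5) -> float:
    # Assumed combination: convex mix of the two signals (not taken from the paper).
    if not 0.0 <= tom_weight <= 1.0:
        raise ValueError("tom_weight must lie in [0, 1]")
    return tom_weight * tom_reward(outcome) + (1.0 - tom_weight) * fooling_reward(outcome)


if __name__ == "__main__":
    episode = EpisodeOutcome(tom_correct=3, tom_total=4, attacker_fooled=True)
    print(combined_reward(episode))
```

Setting `tom_weight` to 0 or 1 recovers a fooling-only or ToM-only reward, which is one way such single-signal baselines could be compared against the combined one.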
Abstract
As large language models (LLMs) become the engine behind conversational systems, their ability to reason about the intentions and states of their dialogue partners (i.e., form and use a theory-of-mind, or ToM) becomes increasingly critical for safe interaction with potentially adversarial partners. We propose a novel privacy-themed ToM challenge, ToM for Steering Beliefs (ToM-SB), in which a defender must act as a Double Agent to steer the beliefs of an attacker with partial prior knowledge within a shared universe. To succeed on ToM-SB, the defender must engage with and form a ToM of the attacker, with a goal of fooling the attacker into believing they have succeeded in extracting sensitive information. We find that strong frontier models like Gemini3-Pro and GPT-5.4 struggle on ToM-SB, often failing to fool attackers in hard scenarios with partial attacker prior knowledge, even when…
