The Hidden Puppet Master: Predicting Human Belief Change in Manipulative LLM Dialogues
Jocelyn Shen, Amina Luvsanchultem, Jessica Kim, Kynnedy Smith, Valdemar Danry, Kantwon Rogers, Hae Won Park, Maarten Sap, Cynthia Breazeal

TL;DR
This paper introduces PUPPET, a resource and framework for studying and predicting human belief change caused by manipulative LLM dialogues, highlighting gaps in current safety measures.
Contribution
It provides a new dataset and theoretical taxonomy to evaluate and understand belief shifts in real-world advice interactions with LLMs.
Findings
State-of-the-art LLMs achieve moderate correlation (r=0.3-0.5) in predicting belief change.
Models systematically underestimate the strength of human belief susceptibility.
Current safety paradigms do not fully capture the impact of manipulative strategies.
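The findings above rest on two kinds of measurement: how well predicted belief change correlates with observed change, and whether predictions understate how far people actually move. The page does not say how belief is scored or which metrics were used, so the sketch below is only illustrative and makes several assumptions. It treats belief change as a signed post-minus-pre difference on a numeric rating scale, uses Pearson's r for correlation, and measures underestimation with a hypothetical "magnitude gap": the mean of |observed| minus |predicted|. All function names and the toy numbers in the usage example are hypothetical and are not taken from the PUPPET dataset.

```python
import math
from statistics import mean


def belief_shift(pre: float, post: float) -> float:
    """Signed belief change (assumption: post-rating minus pre-rating)."""
    return post - pre


def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)


def mean_magnitude_gap(predicted: list[float], observed: list[float]) -> float:
    """Hypothetical underestimation metric: mean of |observed| - |predicted|.

    A positive value means the predictions understate how large the
    observed belief shifts were, regardless of their direction.
    """
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    return mean(abs(o) - abs(p) for p, o in zip(predicted, observed))


if __name__ == "__main__":
    # Toy, made-up ratings for illustration only.
    pre = [3, 5, 2, 4]
    post = [5, 4, 5, 6]
    observed = [belief_shift(a, b) for a, b in zip(pre, post)]
    predicted = [1, 0, 1, 1]  # a hypothetical model's predicted shifts
    print("r =", round(pearson_r(predicted, observed), 3))
    print("magnitude gap =", mean_magnitude_gap(predicted, observed))
```

Separating correlation from the magnitude gap mirrors the distinction in the findings: a model can rank interactions reasonably well (moderate r) while still predicting shifts that are consistently too small.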
Abstract
As users increasingly turn to LLMs for practical and personal advice, they become vulnerable to subtle steering toward hidden incentives misaligned with their own interests. While existing NLP research has benchmarked manipulation detection, these efforts often rely on simulated debates and remain fundamentally decoupled from actual human belief shifts in real-world scenarios. We introduce PUPPET, a theoretical taxonomy and resource that bridges this gap by focusing on the moral direction of hidden incentives in everyday, advice-giving contexts. We provide an evaluation dataset of N=1,035 human-LLM interactions, in which we measure users' belief shifts. Our analysis reveals a critical disconnect in current safety paradigms: while models can be trained to detect manipulative strategies, their detections do not correlate with the magnitude of the resulting belief change. As such, we define the task of…
