WOLF: Werewolf-based Observations for LLM Deception and Falsehoods
Mrinal Agarwal, Saad Rana, Theo Sundoro, Hermela Berhe, Spencer Kim, Vasu Sharma, Sean O'Brien, Kevin Zhu

TL;DR
WOLF is a multi-agent benchmark based on Werewolf that evaluates both deception production and deception detection in LLMs through dynamic, interactive gameplay, with structured logs and a role-specific deception taxonomy.
Contribution
It presents a novel, interactive benchmark for measuring deception and detection in LLMs, moving beyond static datasets to dynamic multi-agent interactions with detailed analysis.
Findings
Werewolves produce deceptive statements in 31% of turns.
Peer detection achieves 71-73% precision and ~52% accuracy.
Suspicion toward Werewolves increases over rounds, improving recall.
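The detection figures above are standard binary-classification metrics. The excerpt does not say how ground-truth labels are assigned, so the sketch below assumes each statement carries a boolean "deceptive" label and each peer rating is reduced to a boolean "flagged" decision. The function name, field names, and toy data are illustrative and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class DetectionMetrics:
    precision: float
    recall: float
    accuracy: float


def detection_metrics(is_deceptive: Sequence[bool],
                      flagged: Sequence[bool]) -> DetectionMetrics:
    """Score peer flags against assumed per-statement ground truth."""
    if len(is_deceptive) != len(flagged):
        raise ValueError("label and prediction sequences must have equal length")

    pairs = list(zip(is_deceptive, flagged))
    tp = sum(1 for truth, pred in pairs if truth and pred)
    fp = sum(1 for truth, pred in pairs if not truth and pred)
    fn = sum(1 for truth, pred in pairs if truth and not pred)
    tn = sum(1 for truth, pred in pairs if not truth and not pred)

    total = tp + fp + fn + tn
    return DetectionMetrics(
        # Of the statements flagged as deceptive, how many really were.
        precision=tp / (tp + fp) if (tp + fp) else 0.0,
        # Of the truly deceptive statements, how many were flagged.
        recall=tp / (tp + fn) if (tp + fn) else 0.0,
        # Share of all statements judged correctly.
        accuracy=(tp + tn) / total if total else 0.0,
    )


# Toy data (hypothetical): four deceptive and four honest statements.
truth = [True, True, True, True, False, False, False, False]
flags = [True, True, False, False, True, False, False, False]
metrics = detection_metrics(truth, flags)
```

On this toy data, precision (2/3) exceeds accuracy (5/8) because half of the deceptive statements go unflagged. That is one way precision can sit well above accuracy; whether it explains the reported figures cannot be determined from the excerpt.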
Abstract
Deception is a fundamental challenge for multi-agent reasoning: effective systems must strategically conceal information while detecting misleading behavior in others. Yet most evaluations reduce deception to static classification, ignoring the interactive, adversarial, and longitudinal nature of real deceptive dynamics. Large language models (LLMs) can deceive convincingly but remain weak at detecting deception in peers. We present WOLF, a multi-agent social deduction benchmark based on Werewolf that enables separable measurement of deception production and detection. WOLF embeds role-grounded agents (Villager, Werewolf, Seer, Doctor) in a programmable LangGraph state machine with strict night-day cycles, debate turns, and majority voting. Every statement is a distinct analysis unit, with self-assessed honesty from speakers and peer-rated deceptiveness from others. Deception is…
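The night-day cycle and majority vote described in the abstract can be illustrated with a minimal plain-Python sketch. It does not use LangGraph and is not the authors' implementation. The phase names, the `NEXT_PHASE` table, and the rule that a player is eliminated only with more than half of the votes are all assumptions made for illustration.

```python
from collections import Counter
from enum import Enum
from typing import Dict, Optional


class Phase(Enum):
    NIGHT = "night"    # night actions by special roles (assumed)
    DEBATE = "debate"  # each living player makes statements
    VOTE = "vote"      # living players vote on an elimination


# Strict cycle: night -> debate -> vote -> night (assumed ordering).
NEXT_PHASE = {
    Phase.NIGHT: Phase.DEBATE,
    Phase.DEBATE: Phase.VOTE,
    Phase.VOTE: Phase.NIGHT,
}


def majority_vote(votes: Dict[str, str]) -> Optional[str]:
    """Return the player named by more than half of the voters, else None.

    `votes` maps voter name to the name of the player they vote against.
    Returning None when no strict majority exists is an assumption; the
    excerpt does not specify how ties are handled.
    """
    if not votes:
        return None
    target, count = Counter(votes.values()).most_common(1)[0]
    return target if count > len(votes) / 2 else None


# Hypothetical round: two of three voters name "carol".
eliminated = majority_vote({"alice": "carol", "bob": "carol", "carol": "alice"})
```

A plurality rule (most votes wins, with some tie-breaker) would be an equally plausible reading of "majority voting"; the strict-majority variant is used here only because it needs no tie-breaking policy.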
Taxonomy
Topics: Deception detection and forensic psychology · Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning
