Lying to Win: Assessing LLM Deception through Human-AI Games and Parallel-World Probing
Arash Marioriyad, Ali Nouri, Mohammad Hossein Rohban, Mahdieh Soleymani Baghshah

TL;DR
This paper introduces a structured game-based framework to detect and quantify deception in large language models, revealing how contextual framing influences deceptive behavior and highlighting the need for advanced behavioral audits.
Contribution
The work presents a novel, logically grounded method using parallel-world probing within a 20-Questions game to systematically assess LLM deception strategies.
Findings
Deception increases under existential framing for some models.
GPT-4o remains invariant in deceptive responses across incentives.
Deception can be triggered solely by contextual framing.
Abstract
As Large Language Models (LLMs) transition into autonomous agentic roles, the risk of deception-defined behaviorally as the systematic provision of false information to satisfy external incentives-poses a significant challenge to AI safety. Existing benchmarks often focus on unintentional hallucinations or unfaithful reasoning, leaving intentional deceptive strategies under-explored. In this work, we introduce a logically grounded framework to elicit and quantify deceptive behavior by embedding LLMs in a structured 20-Questions game. Our method employs a conversational forking mechanism: at the point of object identification, the dialogue state is duplicated into multiple parallel worlds, each presenting a mutually exclusive query. Deception is formally identified when a model generates a logical contradiction by denying its selected object across all parallel branches to avoid…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsExplainable Artificial Intelligence (XAI) · Artificial Intelligence in Healthcare and Education · Adversarial Robustness in Machine Learning
