Debate with Images: Detecting Deceptive Behaviors in Multimodal Large Language Models
Sitong Fang, Shiyi Hou, Kaile Wang, Boyuan Chen, Donghai Hong, Jiayi Zhou, Josef Dai, Yaodong Yang, Jiaming Ji

TL;DR
This paper introduces a new benchmark and a debate-based monitoring framework to detect and evaluate deceptive behaviors in multimodal large language models, addressing a critical safety concern as models become more capable.
Contribution
It presents MM-DeceptionBench, the first benchmark for multimodal deception, and proposes a "debate with images" framework to improve detection of deceptive strategies in multimodal models.
Findings
MM-DeceptionBench effectively characterizes six deception categories.
Debate with images significantly improves deception detection accuracy.
Model agreement with human judgments increases by 1.25x with the proposed method.
Abstract
Are frontier AI systems becoming more capable? Certainly. Yet such progress is not an unalloyed blessing but rather a Trojan horse: behind their performance leaps lie more insidious and destructive safety risks, namely deception. Unlike hallucination, which arises from insufficient capability and leads to mistakes, deception represents a deeper threat in which models deliberately mislead users through complex reasoning and insincere responses. As system capabilities advance, deceptive behaviors have spread from textual to multimodal settings, amplifying their potential harm. First and foremost, how can we monitor these covert multimodal deceptive behaviors? Nevertheless, current research remains almost entirely confined to text, leaving the deceptive risks of multimodal large language models unexplored. In this work, we systematically reveal and quantify multimodal deception risks,…
Taxonomy
Topics: Deception detection and forensic psychology · Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI)
