Debate with Images: Detecting Deceptive Behaviors in Multimodal Large Language Models

Sitong Fang; Shiyi Hou; Kaile Wang; Boyuan Chen; Donghai Hong; Jiayi Zhou; Josef Dai; Yaodong Yang; Jiaming Ji

arXiv:2512.00349·cs.AI·December 2, 2025

Open Access

TL;DR

This paper introduces a new benchmark and a debate-based monitoring framework to detect and evaluate deceptive behaviors in multimodal large language models, addressing a critical safety concern as models become more capable.

Contribution

The paper presents MM-DeceptionBench, which the authors describe as the first benchmark for multimodal deception, and proposes a "debate with images" framework to improve detection of deceptive strategies in multimodal models.

Findings

01. MM-DeceptionBench effectively characterizes six deception categories.

02. Debate with images significantly improves deception detection accuracy.

03. Model agreement with human judgments increases by 1.25x with the proposed method.

Abstract

Are frontier AI systems becoming more capable? Certainly. Yet such progress is not an unalloyed blessing but rather a Trojan horse: behind their performance leaps lie more insidious and destructive safety risks, namely deception. Unlike hallucination, which arises from insufficient capability and leads to mistakes, deception represents a deeper threat in which models deliberately mislead users through complex reasoning and insincere responses. As system capabilities advance, deceptive behaviors have spread from textual to multimodal settings, amplifying their potential harm. First and foremost, how can we monitor these covert multimodal deceptive behaviors? Nevertheless, current research remains almost entirely confined to text, leaving the deceptive risks of multimodal large language models unexplored. In this work, we systematically reveal and quantify multimodal deception risks,…

Taxonomy

Topics: Deception detection and forensic psychology · Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI)