TL;DR
This paper evaluates the limits of Multi-Modal Large Language Models in visual perception and reasoning tasks, revealing their shortcomings in detecting subtle clues and proposing challenging scenarios to push their capabilities closer to human-level detective skills.
Contribution
It introduces the CaughtCheating scenario, a new challenging task for MLLMs that tests their ability to detect suspicious clues in images, highlighting current limitations and guiding future improvements.
Findings
MLLMs perform poorly in subtle visual reasoning tasks.
Existing models fail in the CaughtCheating scenario with near-zero accuracy.
The study provides insights into the gap between current MLLMs and human detective abilities.
Abstract
Recent agentic Multi-Modal Large Language Models (MLLMs) such as GPT-o3 have achieved near-ceiling scores on various existing benchmarks, motivating a demand for more challenging test tasks. These MLLMs have been reported to excel in a few expert-level tasks for humans, e.g., GeoGuessr, reflecting their potential as a detective who can notice minuscule cues in an image and weave them into coherent, situational explanations, leading to a reliable answer. But can they match the performance of excellent human detectives? To answer this question, we investigate some hard scenarios that GPT-o3 can still handle, and find a common scenario where o3's performance drops to nearly zero, which we name CaughtCheating. It is inspired by the social media requests that ask others to detect suspicious clues from photos shared by the poster's partner. We conduct extensive experiments and analysis to…
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
* The paper proposes a genuinely novel and highly challenging task that pushes MLLMs beyond standard VQA or reasoning benchmarks.
* The benchmark is thoughtfully constructed with two complementary subsets: a small, extremely difficult Real Subset that captures real-world subtlety and a large-scale Synthetic Subset for diversity. The annotation scheme is very thorough, distinguishing between "Deterministic" and "Non-deterministic" clues, and including "Decomposed" (Perception vs. Reasoning) que
* The Real Subset (100 images) is small, dominated by hotel scenes (69%), and biased toward cisgender heterosexual couples, failing to represent diverse scenarios (e.g., workplaces) or relationships, which limits conclusions about MLLMs’ performance across real-world contexts.
* Clues in the Synthetic Subset are overly prominent (e.g., obvious lipstick, two glasses) due to prompt design, probably making it less effective at mimicking the subtlety of real-world clues (e.g., faint reflections) and
1. Interesting, socially relevant task that focuses on subtle visual perception and context-sensitive reasoning on image-claim pairs.
2. The annotation schema distinguishes deterministic and non-deterministic clues in the detective task and decomposes visual reasoning questions, aiming to evaluate the performance of MLLMs.
1. The constructed real dataset is quite small, with only 100 samples. Such a small real dataset raises concerns about the reliability of the benchmarking results and weakens the benchmark's contribution.
2. The paper primarily constructs a benchmark and reports the limitations of current MLLMs. It does not discuss potential directions or methods for addressing the identified failures.
3. Synthetic subset generation relies on stereotypes and templated clues. The pape
- The dataset captures subtle, real-world visual clues that existing benchmarks overlook, providing a meaningful test of advanced multimodal reasoning.
- The analysis identifies why state-of-the-art models fail, linking errors to low-salience signals and lacking top-down guidance, offering actionable insights for future model design.
- The benchmark includes both real and synthetic subsets, plus automatic scoring procedures, enabling rigorous and scalable assessment of MLLM capabilities.
- The benchmark focuses on a specific “detective-style clue finding” scenario, which may be partially covered by other work and which limits its generalizability to other multimodal reasoning tasks.
- Some clues in real-world images may be open to interpretation, making the correctness labels somewhat subjective.
- The prompt design is too simple to cover the real-world requirements of fine-grained perception and reasoning.
1. The paper introduces CaughtCheating, a creative and practically relevant benchmark that targets subtle, context-dependent visual reasoning, an underexplored yet important aspect of real-world perception tasks.
2. The study systematically evaluates advanced MLLMs’ visual reasoning processes, revealing specific scenarios where even top-tier models like GPT-o3 fail, thus providing valuable diagnostic insight into current model limitations.
3. By connecting model failures to Guided Search theory,
1. Section 2.1 provides an informative overview of how modern MLLMs perform visual reasoning. However, the section is unnecessarily verbose (particularly lines 148–160) and contains excessive descriptions of specific model responses that distract from the main narrative. It would be more effective to distill these observations into general patterns or response tendencies across mainstream MLLMs, in line with the section title “Exploring the Boundary of Visual Perception and Reasoning.” Moreover,
