Hidden in Plain Sight: Reasoning in Underspecified and Misspecified Scenarios for Multimodal LLMs

Qianqi Yan; Hongquan Li; Shan Jiang; Yang Zhao; Xinze Guan; Ching-Chen Kuo; Xin Eric Wang

arXiv:2506.00258·cs.AI·August 26, 2025

TL;DR

This paper systematically evaluates how current multimodal large language models handle implicit reasoning in messy, real-world scenarios, revealing a gap between their reasoning abilities and behavioral compliance, and proposing simple interventions to improve trustworthiness.

Contribution

The paper introduces a diagnostic suite, spanning four categories of real-world failure modes, for assessing MLLMs on implicit reasoning tasks, and shows that simple inference-time prompts can significantly improve their performance in underconstrained environments.

Findings

01. Models often fail to detect hidden issues in ambiguous scenarios.

02. Explicit prompts can reveal underlying reasoning capabilities.

03. Simple interventions like clarifying questions improve model trustworthiness.
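
As a rough illustration of what an inference-time intervention of this kind might look like, the sketch below wraps a user instruction with a suffix that invites the model to ask a clarifying question. It is a minimal sketch under stated assumptions: the suffix wording, the function names, and the text-only `echo_model` stand-in are all hypothetical and are not taken from the paper, which works with image inputs as well.

```python
from typing import Callable

# Hypothetical wording, written for illustration only.
# The paper's actual prompts are not reproduced here.
CLARIFY_SUFFIX = (
    "Before answering, check whether the instruction can be carried out "
    "given the input. If something is missing, ambiguous, contradictory, "
    "or infeasible, ask a clarifying question instead of answering."
)


def with_clarification_prompt(instruction: str, suffix: str = CLARIFY_SUFFIX) -> str:
    """Append the clarification suffix to a user instruction."""
    return f"{instruction.strip()}\n\n{suffix}"


def ask(model: Callable[[str], str], instruction: str, intervene: bool = True) -> str:
    """Query a model with or without the intervention.

    `model` is any callable mapping a prompt string to a response string.
    A real multimodal model would also take an image; that is omitted here.
    """
    prompt = with_clarification_prompt(instruction) if intervene else instruction
    return model(prompt)


def echo_model(prompt: str) -> str:
    """Stand-in model that returns its prompt, used only to show the wiring."""
    return prompt


if __name__ == "__main__":
    print(ask(echo_model, "Hand me the red mug on the table."))
```

Keeping the model behind a plain callable makes it easy to compare responses with `intervene=True` and `intervene=False` on the same instruction, which is the kind of comparison the findings above describe.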

Abstract

Multimodal large language models (MLLMs) are increasingly deployed in open-ended, real-world environments where inputs are messy, underspecified, and not always trustworthy. Unlike curated benchmarks, these settings frequently involve instructions that refer to missing objects or contradictory facts, rely on ambiguous references, or request infeasible actions. In such cases, success hinges not on task execution alone, but on a model's ability to detect when something is silently wrong. This paper presents a systematic analysis of how current MLLMs handle such implicit reasoning scenarios: cases where the flaw is not explicitly stated but must be inferred from context. Using a curated diagnostic suite spanning four categories of real-world failure modes, we evaluate six MLLMs, including o3 and GPT-4o, and find that models frequently fail to surface hidden issues, even when they possess…

Code & Models

Datasets

rippleripple/iReason
dataset · 19 downloads