Unsolvable Problem Detection: Robust Understanding Evaluation for Large Multimodal Models
Atsuyuki Miyai, Jingkang Yang, Jingyang Zhang, Yifei Ming, Qing Yu, Go Irie, Yixuan Li, Hai Li, Ziwei Liu, Kiyoharu Aizawa

TL;DR
This paper proposes Unsolvable Problem Detection (UPD), a new evaluation task that tests whether Large Multimodal Models (LMMs) withhold an answer when a multiple-choice problem is unsolvable, and in doing so reveals limitations of current benchmarks.
Contribution
Introduces UPD as a novel task and MM-UPD Bench as a benchmark to evaluate LMMs' ability to recognize unsolvable problems, highlighting gaps in current understanding assessments.
Findings
Most LMMs struggle with UPD tasks, which suggests that existing benchmarks overestimate their understanding.
Chain-of-thought and self-reflection improve LMM performance on UPD.
Current benchmarks do not adequately measure models' trustworthiness and true understanding.
Abstract
This paper introduces a novel task to evaluate the robust understanding capability of Large Multimodal Models (LMMs), termed Unsolvable Problem Detection (UPD). Multiple-choice question answering (MCQA) is widely used to assess the understanding capability of LMMs, but it does not guarantee that LMMs truly comprehend the answer. UPD assesses the LMM's ability to withhold answers when encountering unsolvable MCQA problems, verifying whether the model truly understands the answer. UPD encompasses three problems: Absent Answer Detection (AAD), Incompatible Answer Set Detection (IASD), and Incompatible Visual Question Detection (IVQD), covering unsolvable cases such as a missing correct answer, incompatible choices, and image-question mismatches. For the evaluation, we introduce the MM-UPD Bench, a benchmark for assessing performance across various ability dimensions. Our experiments…
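The three UPD settings named in the abstract can be illustrated with a small sketch. This is not the MM-UPD Bench construction code. The MCQAItem class, the make_aad, make_iasd and make_ivqd helpers, and the is_correct scoring rule are all hypothetical names introduced here. The sketch assumes an unsolvable item can be derived from a standard MCQA item by removing the correct option (AAD), swapping in unrelated options (IASD), or pairing the question with a different image (IVQD), and that withholding is represented by a prediction of None.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class MCQAItem:
    """Hypothetical container for one multiple-choice question about an image."""
    image_id: str
    question: str
    options: List[str]
    answer_index: Optional[int]  # None marks the item as unsolvable


def make_aad(item: MCQAItem) -> MCQAItem:
    """Absent Answer Detection: drop the correct option (assumed construction)."""
    options = [o for i, o in enumerate(item.options) if i != item.answer_index]
    return replace(item, options=options, answer_index=None)


def make_iasd(item: MCQAItem, unrelated_options: List[str]) -> MCQAItem:
    """Incompatible Answer Set Detection: swap in options unrelated to the question."""
    return replace(item, options=list(unrelated_options), answer_index=None)


def make_ivqd(item: MCQAItem, other_image_id: str) -> MCQAItem:
    """Incompatible Visual Question Detection: pair the question with a mismatched image."""
    return replace(item, image_id=other_image_id, answer_index=None)


def is_correct(item: MCQAItem, predicted_index: Optional[int]) -> bool:
    """A prediction of None means the model withheld its answer.

    Solvable items require the right option; unsolvable items require withholding.
    """
    return predicted_index == item.answer_index


if __name__ == "__main__":
    base = MCQAItem("img_001", "What animal is shown?", ["cat", "dog", "bird"], 1)
    aad = make_aad(base)
    print(aad.options)            # the correct option "dog" has been removed
    print(is_correct(aad, None))  # withholding is the expected behaviour here
    print(is_correct(aad, 0))     # picking any remaining option counts as a failure
```

Under this assumed scoring rule, a model that always picks some option can do well on the original items yet fail every derived unsolvable item, which is the gap the task is meant to expose.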
Taxonomy
Topics: Semantic Web and Ontologies
Methods: Sparse Evolutionary Training
