Unsolvable Problem Detection: Robust Understanding Evaluation for Large Multimodal Models

Atsuyuki Miyai; Jingkang Yang; Jingyang Zhang; Yifei Ming; Qing Yu; Go Irie; Yixuan Li; Hai Li; Ziwei Liu; Kiyoharu Aizawa

arXiv:2403.20331 · cs.CV · June 10, 2025 · 1 citation

TL;DR

This paper proposes Unsolvable Problem Detection (UPD), a new evaluation task for Large Multimodal Models to assess their true understanding by identifying when they should withhold answers, revealing limitations of current benchmarks.

Contribution

Introduces UPD as a novel task and MM-UPD Bench as a benchmark to evaluate LMMs' ability to recognize unsolvable problems, highlighting gaps in current understanding assessments.

Findings

01. Most LMMs struggle with UPD tasks, indicating overestimation of their understanding.

02. Chain-of-thought and self-reflection improve LMM performance on UPD.

03. Current benchmarks do not adequately measure models' trustworthiness and true understanding.

Abstract

This paper introduces a novel task to evaluate the robust understanding capability of Large Multimodal Models (LMMs), termed Unsolvable Problem Detection (UPD). Multiple-choice question answering (MCQA) is widely used to assess the understanding capability of LMMs, but it does not guarantee that LMMs truly comprehend the answer. UPD assesses the LMM's ability to withhold answers when encountering unsolvable problems of MCQA, verifying whether the model truly understands the answer. UPD encompasses three problems: Absent Answer Detection (AAD), Incompatible Answer Set Detection (IASD), and Incompatible Visual Question Detection (IVQD), covering unsolvable cases like answer-lacking or incompatible choices and image-question mismatches. For the evaluation, we introduce the MM-UPD Bench, a benchmark for assessing performance across various ability dimensions. Our experiments…
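The three UPD settings named in the abstract can be illustrated with a minimal sketch. It assumes a standard MCQA item is turned into an unsolvable one by removing the correct option (AAD), swapping in an unrelated option set (IASD), or pairing the question with an unrelated image (IVQD). The class `MCQ`, the helper names, and the convention that a prediction of `None` means "the model withheld an answer" are all hypothetical and are not taken from the paper's code or from the MM-UPD Bench construction procedure.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class MCQ:
    """Hypothetical multiple-choice item; not the benchmark's real schema."""
    image_id: str
    question: str
    options: List[str]
    answer: Optional[int]  # index of the correct option; None = unsolvable


def make_aad(item: MCQ) -> MCQ:
    """Absent Answer Detection: drop the correct option from the choices."""
    remaining = [o for i, o in enumerate(item.options) if i != item.answer]
    return replace(item, options=remaining, answer=None)


def make_iasd(item: MCQ, unrelated_options: List[str]) -> MCQ:
    """Incompatible Answer Set Detection: options unrelated to the question."""
    return replace(item, options=list(unrelated_options), answer=None)


def make_ivqd(item: MCQ, unrelated_image_id: str) -> MCQ:
    """Incompatible Visual Question Detection: image does not match question."""
    return replace(item, image_id=unrelated_image_id, answer=None)


def is_correct(item: MCQ, prediction: Optional[int]) -> bool:
    """Assumed scoring rule: on a solvable item the model must pick the right
    index; on an unsolvable item it must withhold (prediction is None)."""
    return prediction == item.answer


# Usage: one solvable item and its three unsolvable variants.
standard = MCQ("img_001", "What animal is shown?", ["cat", "dog", "bird"], 1)
variants = {
    "AAD": make_aad(standard),
    "IASD": make_iasd(standard, ["red", "blue", "green"]),
    "IVQD": make_ivqd(standard, "img_999"),
}
for name, variant in variants.items():
    # A model that always picks option 0 fails every unsolvable variant.
    print(name, variant.options, is_correct(variant, 0))
```

Under this assumed scoring rule, a model that always commits to some option scores zero on every unsolvable variant, which is the failure mode the task is meant to expose.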

Code & Models

Repositories

atsumiyai/upd (PyTorch, official)

Datasets

MM-UPD/MM-UPD (dataset, 457 downloads)

Videos

Unsolvable Problem Detection: Robust Understanding Evaluation for Large Multimodal Models

Taxonomy

Topics: Semantic Web and Ontologies

Methods: Sparse Evolutionary Training