Teaching Vision-Language Models to Ask: Resolving Ambiguity in Visual Questions

Pu Jian; Donglei Yu; Wen Yang; Shuo Ren; Jiajun Zhang

arXiv:2507.13773 · cs.CV · September 17, 2025

TL;DR

This paper introduces ClearVQA, a benchmark for evaluating how well visual language models (VLMs) resolve ambiguities in visual questions through interactive clarification, a setting that prior work, which mainly rephrases ambiguous questions, leaves unaddressed.

Contribution

The paper presents the ClearVQA benchmark and explores training VLMs to ask clarifying questions, enhancing their ability to handle ambiguous visual queries.

Findings

01. ClearVQA effectively evaluates VLMs' ambiguity resolution capabilities.

02. Training VLMs to ask improves clarification success.

03. Benchmark covers diverse ambiguity categories.

Abstract

In the visual question answering (VQA) context, users often pose ambiguous questions to visual language models (VLMs) due to varying expression habits. Existing research addresses such ambiguities primarily by rephrasing questions. These approaches neglect the inherently interactive nature of user interactions with VLMs, where ambiguities can be clarified through user feedback. However, research on interactive clarification faces two major challenges: (1) no benchmarks exist to assess VLMs' capacity for resolving ambiguities through interaction; (2) VLMs are trained to prefer answering over asking, which prevents them from seeking clarification. To overcome these challenges, we introduce the ClearVQA benchmark, which targets three common categories of ambiguity in the VQA context and encompasses various VQA scenarios.
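
The interactive setting described in the abstract, where a model may ask for clarification before answering, can be illustrated with a small toy loop. This is only a sketch under stated assumptions and not the paper's method or the ClearVQA evaluation protocol: the `Turn` type, the `resolve` function, the `max_asks` budget, and the toy model and user are all hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Turn:
    kind: str  # "ask" (clarifying question) or "answer"
    text: str


# Hypothetical interfaces, assumed for this sketch only.
Model = Callable[[str, str, Optional[str]], Turn]  # (image_id, question, feedback)
User = Callable[[str], str]                        # clarifying question -> feedback


def resolve(model: Model, user: User, image_id: str, question: str,
            max_asks: int = 1) -> str:
    """Let the model ask up to `max_asks` clarifying questions, then answer."""
    feedback: Optional[str] = None
    for _ in range(max_asks):
        turn = model(image_id, question, feedback)
        if turn.kind == "answer":
            return turn.text
        # The model chose to ask, so collect user feedback and try again.
        feedback = user(turn.text)
    # Ask budget exhausted: return whatever the model produces on a final call.
    # A real system would need to force an answer at this point.
    return model(image_id, question, feedback).text


def toy_model(image_id: str, question: str, feedback: Optional[str]) -> Turn:
    """Toy stand-in for a VLM: asks when the question has an unresolved 'it'."""
    words = question.lower().rstrip("?").split()
    if feedback is None and "it" in words:
        return Turn("ask", "Which object do you mean by 'it'?")
    target = feedback or "the object"
    return Turn("answer", f"answer about {target}")


def toy_user(clarifying_question: str) -> str:
    """Toy stand-in for a user who always clarifies the same way."""
    return "the mug on the left"
```

With these toy components, `resolve(toy_model, toy_user, "img_001", "What color is it?")` takes one clarification round before answering, while an unambiguous question is answered directly.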

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Speech and dialogue systems