Teaching Vision-Language Models to Ask: Resolving Ambiguity in Visual Questions
Pu Jian, Donglei Yu, Wen Yang, Shuo Ren, Jiajun Zhang

TL;DR
This paper introduces ClearVQA, a benchmark for evaluating how well visual language models (VLMs) resolve ambiguous visual questions through interactive clarification, a capability that no existing benchmark assesses.
Contribution
The paper presents the ClearVQA benchmark and explores training VLMs to ask clarifying questions, enhancing their ability to handle ambiguous visual queries.
Findings
ClearVQA effectively evaluates VLMs' ambiguity resolution capabilities.
Training VLMs to ask improves clarification success.
The benchmark covers three common categories of ambiguity across various VQA scenarios.
Abstract
In the visual question answering (VQA) context, users often pose ambiguous questions to visual language models (VLMs) due to varying expression habits. Existing research addresses such ambiguities primarily by rephrasing questions. These approaches neglect the inherently interactive nature of user interactions with VLMs, where ambiguities can be clarified through user feedback. However, research on interactive clarification faces two major challenges: (1) no benchmarks exist to assess VLMs' capacity for resolving ambiguities through interaction; (2) VLMs are trained to prefer answering over asking, which prevents them from seeking clarification. To overcome these challenges, we introduce the ClearVQA benchmark, which targets three common categories of ambiguity in the VQA context and encompasses various VQA scenarios.
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Speech and dialogue systems
