AQuA: Toward Strategic Response Generation for Ambiguous Visual Questions
Jihyoung Jang, Hyounghun Kim

TL;DR
This paper introduces AQuA, a dataset for ambiguous visual question answering, enabling models to recognize ambiguity levels and select appropriate response strategies, thus improving their ability to handle real-world VQA challenges.
Contribution
The paper presents AQuA, a novel dataset that categorizes ambiguity in VQA, and demonstrates that fine-tuning models on it yields strategy-aware responses, advancing beyond existing benchmarks.
Findings
Models trained on AQuA better recognize ambiguity levels.
Fine-tuned models select context-appropriate response strategies.
Enhanced models outperform baselines in strategic response generation.
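One way to make the findings above concrete is a per-level strategy accuracy. The sketch below is an assumption about how such a score could be computed, not the paper's evaluation protocol: the function name, the record layout, and the placeholder strategy labels are all hypothetical.

```python
from collections import defaultdict


def strategy_accuracy_by_level(records):
    """Return {level: fraction of records whose predicted strategy matches gold}.

    `records` is an iterable of (level, gold_strategy, predicted_strategy)
    tuples. This layout is hypothetical, not AQuA's actual schema.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for level, gold, predicted in records:
        total[level] += 1
        if gold == predicted:
            correct[level] += 1
    return {level: correct[level] / total[level] for level in total}


# Placeholder labels; the real strategy names are not given in this summary.
example_records = [
    (1, "strategy_a", "strategy_a"),
    (1, "strategy_a", "strategy_b"),
    (3, "strategy_c", "strategy_c"),
]
scores = strategy_accuracy_by_level(example_records)
print(scores)
```

Reporting the score per level rather than as a single average keeps a model that always picks one strategy from looking strong overall.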
Abstract
Visual Question Answering (VQA) is a core task for evaluating the capabilities of Vision-Language Models (VLMs). Existing VQA benchmarks primarily feature clear and unambiguous image-question pairs, whereas real-world scenarios often involve varying degrees of ambiguity that require nuanced reasoning and context-appropriate response strategies. Although recent studies have begun to address ambiguity in VQA, they lack (1) a systematic categorization of ambiguity levels and (2) datasets and models that support strategy-aware responses. In this paper, we introduce Ambiguous Visual Question Answering (AQuA), a fine-grained dataset that classifies ambiguous VQA instances into four levels according to the nature and degree of ambiguity, along with the optimal response strategy for each case. Our evaluation of diverse open-source and proprietary VLMs shows that most models fail to adapt their…
Peer Reviews
Decision · ICLR 2026 Poster
- The paper identifies and attempts to tackle the issue of overconfident predictions by vision-language models for questions that are ambiguous.
- The data generation pipeline of AQuA is described in detail. Human filtering on eval split is performed to ensure clean samples.
- The paper is well written and easy to follow.

- The dataset is not meaningfully 'fine-grained'. There are only 4 categories of ambiguity, with real-world objects (also not from fine-grained categories)
- AQuA has a single fixed "correct" answer strategy for each level which is unrealistic. In real interactions multiple strategies (or combinations of them) are also appropriate. For instance AQuA says the only acceptable strategy for answering L3 questions is to ask for clarifications, whereas realistic answers could involve making a best-gue

- The core idea is well-motivated
- Getting 3B models to outperform 72B+ models shows this training approach works.

- Generation, filtering, and evaluation all use GPT-5 variants. This creates circular logic: you're essentially teaching models to mimic GPT-5's behavior and then using GPT-5 to judge success
- 3.6K training samples from COCO only. Will this generalize to other domains?
- Why 20% bounding box area for Level 1? Why not 15% or 25%? No ablation studies to justify these choices.
- How do humans perform on strategic selection?
- Performance drops from 92.22% to 77.0% (Fig. 5). The "redistribution" exp

1. The research problem is clearly framed, with 4 levels of categorization
2. The dataset construction pipeline has human validation

1. The importance of the problem in real-world settings. In Figure 1, the other models' answers still seem reasonable to me. So I wonder about the significance of the problem in the VQA setting.
2. The rationale/completeness behind the 4 different levels. How can you tell whether there aren't other ambiguous questions?
3. It seems that the difference between the levels is simply the number of salient objects, which can be quite subjective or prone to errors. You need to pre-define a size thresho
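The reviewers' questions about the size threshold can be illustrated with a small calculation. This is a minimal sketch, assuming COCO-style boxes given as (x, y, width, height) in pixels; the helper name and the comparison direction are hypothetical, since the reviews quote the 20% figure without saying how AQuA applies it.

```python
def bbox_area_fraction(bbox, image_width, image_height):
    """Fraction of the image area covered by a box.

    `bbox` is (x, y, width, height) in pixels, the COCO annotation convention.
    """
    if image_width <= 0 or image_height <= 0:
        raise ValueError("image dimensions must be positive")
    _, _, box_width, box_height = bbox
    return (box_width * box_height) / (image_width * image_height)


# Thresholds the review asks about: the quoted 20% plus the 15% and 25% alternatives.
CANDIDATE_THRESHOLDS = (0.15, 0.20, 0.25)

fraction = bbox_area_fraction((0, 0, 320, 240), image_width=640, image_height=480)

# Assumption: "meets the threshold" means the fraction is at least the threshold.
thresholds_met = [t for t in CANDIDATE_THRESHOLDS if fraction >= t]
print(fraction, thresholds_met)
```

Running the same comparison over a whole annotation set with each candidate threshold would show how many samples change level, which is the kind of ablation the review requests.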
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
