AQuA: Toward Strategic Response Generation for Ambiguous Visual Questions

Jihyoung Jang; Hyounghun Kim

arXiv:2603.07394 · cs.CV · March 10, 2026

TL;DR

This paper introduces AQuA, a dataset for ambiguous visual question answering, enabling models to recognize ambiguity levels and select appropriate response strategies, thus improving their ability to handle real-world VQA challenges.

Contribution

The paper presents AQuA, a novel dataset that categorizes ambiguity in VQA, and demonstrates that models fine-tuned on it produce strategy-aware responses, advancing beyond existing benchmarks.

Findings

1. Models trained on AQuA better recognize ambiguity levels.
2. Fine-tuned models select context-appropriate response strategies.
3. Enhanced models outperform baselines in strategic response generation.

Abstract

Visual Question Answering (VQA) is a core task for evaluating the capabilities of Vision-Language Models (VLMs). Existing VQA benchmarks primarily feature clear and unambiguous image-question pairs, whereas real-world scenarios often involve varying degrees of ambiguity that require nuanced reasoning and context-appropriate response strategies. Although recent studies have begun to address ambiguity in VQA, they lack (1) a systematic categorization of ambiguity levels and (2) datasets and models that support strategy-aware responses. In this paper, we introduce Ambiguous Visual Question Answering (AQuA), a fine-grained dataset that classifies ambiguous VQA instances into four levels according to the nature and degree of ambiguity, along with the optimal response strategy for each case. Our evaluation of diverse open-source and proprietary VLMs shows that most models fail to adapt their…

Peer Reviews

Decision · ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 4

Strengths

- The paper identifies and attempts to tackle the issue of overconfident predictions by vision-language models for questions that are ambiguous.
- The data generation pipeline of AQuA is described in detail. Human filtering on the eval split is performed to ensure clean samples.
- The paper is well written and easy to follow.

Weaknesses

- The dataset is not meaningfully 'fine-grained'. There are only 4 categories of ambiguity, with real-world objects (also not from fine-grained categories)
- AQuA has a single fixed "correct" answer strategy for each level, which is unrealistic. In real interactions multiple strategies (or combinations of them) are also appropriate. For instance AQuA says the only acceptable strategy for answering L3 questions is to ask for clarifications, whereas realistic answers could involve making a best-gue

Reviewer 02 · Rating 4 · Confidence 3

Strengths

- The core idea is well-motivated
- Getting 3B models to outperform 72B+ models shows this training approach works.

Weaknesses

- Generation, filtering, and evaluation all use GPT-5 variants. This creates circular logic: you're essentially teaching models to mimic GPT-5's behavior and then using GPT-5 to judge success
- 3.6K training samples from COCO only. Will this generalize to other domains?
- Why 20% bounding box area for Level 1? Why not 15% or 25%? No ablation studies to justify these choices.
- How do humans perform on strategic selection?
- Performance drops from 92.22% to 77.0% (Fig. 5). The "redistribution" exp

Reviewer 03 · Rating 6 · Confidence 3

Strengths

1. The research problem is clearly framed, with 4 levels of categorization
2. The dataset construction pipeline has human validation

Weaknesses

1. The importance of the problem in real-world settings. In Figure 1, the other models' answers still seem reasonable to me. So I wonder about the significance of the problem in the VQA setting.
2. The rationale/completeness behind the 4 different levels. How can you tell whether there aren't other ambiguous questions?
3. It seems that the difference between the levels is simply the number of salient objects, which can be quite subjective or prone to errors. You need to pre-define a size thresho

Code & Models

Models

Datasets

jihyoung/AQuA
dataset · 35 downloads

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning