Clarify or Answer: Reinforcement Learning for Agentic VQA with Context Under-specification
Zongwan Cao, Bingbing Wen, Lucy Lu Wang

TL;DR
This paper introduces CoA, a reinforcement learning-based agent that decides whether to ask for clarification or answer directly in context-dependent visual question answering, significantly improving accuracy over baseline methods.
Contribution
The paper presents an ask-or-answer framework for VQA that uses reinforcement learning to generate clarification questions when a question is under-specified, improving performance on such questions.
Findings
CoA improves VQA accuracy by an average of 15.3 points.
The approach effectively generates well-formed, focused clarification questions.
CoA outperforms prompting-based baselines across multiple datasets and models.
Abstract
Real-world visual question answering (VQA) is often context-dependent: an image-question pair may be under-specified, such that the correct answer depends on external information that is not observable in the image. In such cases, directly answering can lead to confident but incorrect predictions. We propose CoA (Clarify-or-Answer), an ask-or-answer agent that separately models the decision to ask or answer, and what to ask if needed. CoA first determines whether clarification is necessary; if so, it asks a single focused question and then incorporates the response to produce the final answer. We introduce CONTEXTCLARIFY, which pairs a set of ambiguous VQA questions with a non-ambiguous contrast set. We further introduce GRPO-CR (Clarification Reasoning), a reinforcement learning approach that optimizes clarification question generation with multiple reward signals encouraging…
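The ask-or-answer control flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function `clarify_or_answer`, the `Turn` record, and the four callables passed in (`needs_clarification`, `generate_clarification`, `get_user_reply`, `answer_fn`) are all hypothetical names standing in for whatever models and interfaces the paper actually uses.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Turn:
    """Outcome of one ask-or-answer episode (hypothetical record type)."""
    asked: bool                            # True if a clarification was requested
    clarification_question: Optional[str]  # the single question asked, if any
    answer: str                            # the final answer


def clarify_or_answer(
    image: object,
    question: str,
    needs_clarification: Callable[[object, str], bool],
    generate_clarification: Callable[[object, str], str],
    get_user_reply: Callable[[str], str],
    answer_fn: Callable[[object, str, Optional[str]], str],
) -> Turn:
    """Decide whether to ask, ask at most one question, then answer.

    All callables are assumptions made for illustration; in practice each
    would wrap a vision-language model or a user-facing interface.
    """
    # Step 1: the ask-or-answer decision is modeled separately.
    if not needs_clarification(image, question):
        return Turn(False, None, answer_fn(image, question, None))

    # Step 2: ask a single focused clarification question.
    clarification = generate_clarification(image, question)
    reply = get_user_reply(clarification)

    # Step 3: fold the reply into the context used for the final answer.
    context = f"Clarification: {clarification} Reply: {reply}"
    return Turn(True, clarification, answer_fn(image, question, context))
```

Keeping the decision step and the question-generation step as separate callables mirrors the abstract's statement that CoA models them separately; it also means either component could be swapped or trained independently in a sketch like this one.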
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning
