Rephrase, Augment, Reason: Visual Grounding of Questions for Vision-Language Models
Archiki Prasad, Elias Stengel-Eskin, Mohit Bansal

TL;DR
This paper introduces RepARe, a framework that improves zero-shot vision-language question answering by rephrasing questions with visual grounding, leading to significant accuracy gains across multiple datasets.
Contribution
RepARe is a novel, gradient-free method that enhances zero-shot VQA performance by automatically rephrasing questions with visual context using LVLMs.
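The rephrase-then-select idea can be illustrated with a minimal sketch. It is not the authors' implementation: `select_question`, `propose`, and `score` are hypothetical names, and the two callables are stand-ins for whatever the LVLM would provide (candidate rephrasings and a selection criterion). The toy versions below exist only to make the example runnable.

```python
from typing import Callable, Sequence


def select_question(
    image: object,
    question: str,
    propose: Callable[[object, str], Sequence[str]],
    score: Callable[[object, str], float],
) -> str:
    """Pick the best-scoring question among the original and its rephrasings.

    `propose` and `score` are hypothetical hooks; a real system would back
    them with a vision-language model. No gradients or training are involved.
    """
    # The original question is listed first, so it wins any tie in `max`.
    candidates = [question, *propose(image, question)]
    return max(candidates, key=lambda q: score(image, q))


# Toy stand-ins (assumptions for illustration only).
def toy_propose(image, question):
    # Append one visual detail per object to form a candidate rephrasing.
    return [f"{question} (the {obj} in the image)" for obj in image["objects"]]


def toy_score(image, question):
    # Reward a candidate only when it mentions the image's salient object.
    return 1.0 if image["salient"] in question else 0.0


image = {"objects": ["tower", "clock"], "salient": "clock"}
best = select_question(image, "What time is it?", toy_propose, toy_score)
print(best)
```

Keeping the original question in the candidate pool means the selection step can fall back to it whenever no rephrasing scores higher.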
Findings
RepARe increases zero-shot accuracy by up to 7.94% on VQA datasets.
Selecting among candidate questions using gold answers boosts accuracy by 14.41%.
RepARe's rephrased questions are more syntactically complex and improve vision-language reasoning.
Abstract
An increasing number of vision-language tasks can be handled with little to no training, i.e., in a zero- and few-shot manner, by marrying large language models (LLMs) to vision encoders, resulting in large vision-language models (LVLMs). While this has huge upsides, such as not requiring training data or custom architectures, how an input is presented to an LVLM can have a major impact on zero-shot model performance. In particular, inputs phrased in an underspecified way can result in incorrect answers due to factors like missing visual information, complex implicit reasoning, or linguistic ambiguity. Therefore, adding visually-grounded information to the input as a preemptive clarification should improve model performance by reducing underspecification, e.g., by localizing objects and disambiguating references. Similarly, in the VQA setting, changing the way questions are framed can…
Taxonomy
Topics: Multimodal Machine Learning Applications
