Rephrase, Augment, Reason: Visual Grounding of Questions for Vision-Language Models

Archiki Prasad; Elias Stengel-Eskin; Mohit Bansal

arXiv:2310.05861·cs.CL·April 3, 2024·1 cites

TL;DR

This paper introduces RepARe, a framework that improves zero-shot vision-language question answering by rephrasing questions with visual grounding, leading to significant accuracy gains across multiple datasets.

Contribution

RepARe is a novel, gradient-free method that enhances zero-shot VQA performance by automatically rephrasing questions with visual context using LVLMs.

Findings

01 Raises zero-shot accuracy on VQA datasets by up to 7.94 percentage points.

02 Selecting among question candidates with gold answers (an oracle setting) raises accuracy by up to 14.41%.

03 Rephrased questions are syntactically more complex and improve vision-language reasoning.
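The difference between ordinary question selection and the gold-answer (oracle) selection mentioned in the findings can be illustrated with a small sketch. Everything in it is hypothetical and is not taken from the paper's code: the `Candidate` class, the `confidence` field, the two selection functions, and the sample data are assumptions made only for illustration. It assumes each question candidate comes with the model's answer and some score for that answer.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    """One question candidate plus the model's answer to it (hypothetical fields)."""
    question: str
    answer: str
    confidence: float  # assumed: a score for the model's own answer, higher is better


def select_by_confidence(candidates: List[Candidate]) -> Candidate:
    """Selection without labels: keep the candidate answered with the highest score."""
    return max(candidates, key=lambda c: c.confidence)


def select_with_gold(candidates: List[Candidate], gold: str) -> Candidate:
    """Oracle selection: prefer candidates whose answer matches the gold answer.

    Falls back to score-based selection when no candidate matches.
    """
    target = gold.strip().lower()
    matches = [c for c in candidates if c.answer.strip().lower() == target]
    return select_by_confidence(matches or candidates)


# Made-up candidates for a single image-question pair.
candidates = [
    Candidate("What is on the table?", "plate", 0.41),
    Candidate("What is on the wooden table next to the window?", "vase", 0.62),
    Candidate("What object sits on the table in the foreground?", "cup", 0.55),
]

chosen = select_by_confidence(candidates)
oracle = select_with_gold(candidates, gold="cup")
print(chosen.answer, oracle.answer)
```

Under these assumptions the oracle can only do as well as or better than label-free selection on a given example, which is why a gold-answer result is best read as an upper bound on what better selection could achieve.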

Abstract

An increasing number of vision-language tasks can be handled with little to no training, i.e., in a zero- and few-shot manner, by marrying large language models (LLMs) to vision encoders, resulting in large vision-language models (LVLMs). While this has huge upsides, such as not requiring training data or custom architectures, how an input is presented to an LVLM can have a major impact on zero-shot model performance. In particular, inputs phrased in an underspecified way can result in incorrect answers due to factors like missing visual information, complex implicit reasoning, or linguistic ambiguity. Therefore, adding visually-grounded information to the input as a preemptive clarification should improve model performance by reducing underspecification, e.g., by localizing objects and disambiguating references. Similarly, in the VQA setting, changing the way questions are framed can…

Code & Models

Repositories

archiki/repare (PyTorch, official)

Videos

Rephrase, Augment, Reason: Visual Grounding of Questions for Vision-Language Models (SlidesLive)

Taxonomy

Topics: Multimodal Machine Learning Applications