Contrastive Region Guidance: Improving Grounding in Vision-Language Models without Training

David Wan; Jaemin Cho; Elias Stengel-Eskin; Mohit Bansal

arXiv:2403.02325 · cs.CV · March 5, 2024 · 1 citation

Open Access

TL;DR

Contrastive Region Guidance (CRG) is a training-free method that enhances open-source vision-language models by leveraging visual prompts, significantly improving performance across diverse tasks without additional training.

Contribution

CRG introduces a contrastive approach that lets open-source VLMs use visual prompts effectively without any training, broadening their applicability and improving their performance.

Findings

01. Up to 11.1% accuracy increase on ViP-Bench tasks

02. 10% improvement on the What'sUp spatial reasoning benchmark

03. Image-text alignment improved by up to 8.4 AUROC and 6.8 F1 points

Abstract

Highlighting particularly relevant regions of an image can improve the performance of vision-language models (VLMs) on various vision-language (VL) tasks by guiding the model to attend more closely to these regions of interest. For example, VLMs can be given a "visual prompt", where visual markers such as bounding boxes delineate key image regions. However, current VLMs that can incorporate visual guidance are either proprietary and expensive or require costly training on curated data that includes visual prompts. We introduce Contrastive Region Guidance (CRG), a training-free guidance method that enables open-source VLMs to respond to visual prompts. CRG contrasts model outputs produced with and without visual prompts, factoring out biases revealed by the model when answering without the information required to produce a correct answer (i.e., the model's prior). CRG achieves…
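
The contrast described in the abstract can be illustrated with a toy sketch. The abstract does not give the exact formula, so the combination below (a classifier-free-guidance-style weighting of two sets of answer logits) is an assumption, and the names `contrastive_guidance` and `alpha` and the logit values are hypothetical. The idea it shows: an answer the model already favors without the region information is down-weighted, while an answer whose score rises once the region is available is boosted.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_guidance(logits_with, logits_without, alpha=1.0):
    """Combine two sets of answer logits (assumed formula, not from the source).

    logits_with:    scores when the model sees the region information
    logits_without: scores when that information is missing (the prior)
    alpha:          guidance strength; alpha=0 returns the unguided distribution
    """
    if len(logits_with) != len(logits_without):
        raise ValueError("logit lists must have the same length")
    guided = [
        (1 + alpha) * w - alpha * wo
        for w, wo in zip(logits_with, logits_without)
    ]
    return softmax(guided)


# Toy example with made-up numbers: the prior strongly prefers "left".
answers = ["left", "right"]
logits_with = [2.0, 1.5]      # with the region: "right" gains support
logits_without = [2.0, 0.5]   # without the region: prior bias toward "left"

probs = contrastive_guidance(logits_with, logits_without, alpha=1.0)
best = answers[probs.index(max(probs))]
print(best, probs)
```

In this toy case the unguided scores still prefer "left", but subtracting the prior flips the choice to "right", because only "right" gained score once the region information was present.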

Taxonomy

Topics: Multimodal Machine Learning Applications · Speech and dialogue systems