Contrastive Region Guidance: Improving Grounding in Vision-Language Models without Training
David Wan, Jaemin Cho, Elias Stengel-Eskin, Mohit Bansal

TL;DR
Contrastive Region Guidance (CRG) is a training-free method that lets open-source vision-language models make use of visual prompts, improving performance across diverse vision-language tasks.
Contribution
CRG introduces a contrastive approach that enables open-source VLMs to use visual prompts effectively without training, broadening the range of tasks they can handle and improving their performance.
Findings
Up to 11.1% accuracy increase on ViP-Bench tasks
10% improvement on the What'sUp spatial reasoning benchmark
Image-text alignment improved by up to 8.4 AUROC and 6.8 F1 points
Abstract
Highlighting particularly relevant regions of an image can improve the performance of vision-language models (VLMs) on various vision-language (VL) tasks by guiding the model to attend more closely to these regions of interest. For example, VLMs can be given a "visual prompt", where visual markers such as bounding boxes delineate key image regions. However, current VLMs that can incorporate visual guidance are either proprietary and expensive or require costly training on curated data that includes visual prompts. We introduce Contrastive Region Guidance (CRG), a training-free guidance method that enables open-source VLMs to respond to visual prompts. CRG contrasts model outputs produced with and without visual prompts, factoring out biases revealed by the model when answering without the information required to produce a correct answer (i.e., the model's prior). CRG achieves…
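The contrast described in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `contrastive_guidance`, the guidance weight `alpha`, and the combination `(1 + alpha) * with_prompt - alpha * without_prompt` (borrowed from classifier-free-guidance-style methods) are all assumptions, since the truncated abstract does not give the exact formula. The logits are made-up numbers standing in for a VLM's scores over three candidate answers.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_guidance(logits_with, logits_without, alpha=1.0):
    """Combine two logit vectors so that evidence from the visual prompt is amplified.

    logits_with:    scores when the model sees the visually prompted input
    logits_without: scores when the model answers without that information
                    (an estimate of the model's prior)
    alpha:          hypothetical guidance weight; alpha=0 recovers the plain output
    """
    if len(logits_with) != len(logits_without):
        raise ValueError("logit vectors must have the same length")
    guided = [
        (1 + alpha) * w - alpha * wo
        for w, wo in zip(logits_with, logits_without)
    ]
    return softmax(guided)


# Toy logits (assumed values): answer 0 is favored by the prior in both cases,
# while answer 1 only gains support once the visual prompt is present.
with_prompt = [2.0, 1.0, 0.5]
without_prompt = [2.0, 0.0, 0.5]

plain_probs = softmax(with_prompt)
guided_probs = contrastive_guidance(with_prompt, without_prompt, alpha=1.0)
```

In this toy case the guided distribution shifts probability toward answer 1, the only answer whose score depends on the visual prompt, while the part of answer 0's score that comes from the prior is discounted.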
Taxonomy
Topics: Multimodal Machine Learning Applications · Speech and dialogue systems
