Symmetrical Visual Contrastive Optimization: Aligning Vision-Language Models with Minimal Contrastive Images
Shengguang Wu, Fan-Yun Sun, Kaiyue Wen, Nick Haber

TL;DR
This paper introduces S-VCO, a finetuning objective for vision-language models that improves visual grounding and reduces hallucinations by steering the model toward fine-grained image details and their alignment with text. It is trained with MVC, a new dataset of paired images and texts that differ by minimal visual contrasts.
Contribution
We propose S-VCO, a symmetrical contrastive optimization technique, and MVC, a challenging dataset for better visual-text alignment in vision-language models.
Findings
Up to 22% reduction in hallucinations.
Significant improvements in vision-centric tasks.
Enhanced performance on diverse benchmarks.
Abstract
Recent studies have shown that Large Vision-Language Models (VLMs) tend to neglect image content and over-rely on language-model priors, resulting in errors in visually grounded tasks and hallucinations. We hypothesize that this issue arises because existing VLMs are not explicitly trained to generate texts that are accurately grounded in fine-grained image details. To enhance visual feedback during VLM training, we propose S-VCO (Symmetrical Visual Contrastive Optimization), a novel finetuning objective that steers the model toward capturing important visual details and aligning them with corresponding text tokens. To further facilitate this detailed alignment, we introduce MVC, a paired image-text dataset built by automatically filtering and augmenting visual counterfactual data to challenge the model with hard contrastive cases involving Minimal Visual Contrasts. Experiments show…
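To make the idea of a symmetrical contrastive objective concrete, here is a minimal sketch of one possible form. It is not the paper's loss: the abstract does not state the formula, so the log-sigmoid shape, the `beta` scale, and all function and argument names below are assumptions chosen for illustration. The inputs are assumed to be sequence log-probabilities of each text under its matching image and under the minimally different contrastive image.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def contrast_term(logp_matched: float, logp_mismatched: float, beta: float = 1.0) -> float:
    """Penalty that shrinks as a text becomes more likely under its
    matching image than under the contrastive image (assumed form)."""
    return -log_sigmoid(beta * (logp_matched - logp_mismatched))


def symmetric_contrastive_loss(
    logp_t1_i1: float,  # log p(text_1 | image_1), matched
    logp_t1_i2: float,  # log p(text_1 | image_2), mismatched
    logp_t2_i2: float,  # log p(text_2 | image_2), matched
    logp_t2_i1: float,  # log p(text_2 | image_1), mismatched
    beta: float = 1.0,  # hypothetical scale on the log-probability margin
) -> float:
    """Average the contrast term over both sides of the pair, so that
    neither image is treated as the only 'preferred' one."""
    side_1 = contrast_term(logp_t1_i1, logp_t1_i2, beta)
    side_2 = contrast_term(logp_t2_i2, logp_t2_i1, beta)
    return 0.5 * (side_1 + side_2)
```

In this sketch, a model that ignores the image assigns each text the same log-probability under both images, and the loss stays at log 2. The loss falls only when each text becomes more likely under its own image, which is the kind of visual feedback the abstract describes.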
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications · Image Retrieval and Classification Techniques
