Same or Not? Enhancing Visual Perception in Vision-Language Models
Damiano Marsili, Aditya Mehta, Ryan Y. Lin, Georgia Gkioxari

TL;DR
This paper introduces TWIN, a large-scale dataset and task to improve the fine-grained visual perception of vision-language models, leading to better recognition of subtle visual details without sacrificing general performance.
Contribution
The authors present TWIN, a new training dataset and task for fine-tuning VLMs, which targets their limitations in fine-grained recognition and the perception of subtle visual detail.
Findings
Fine-tuning on TWIN improves VLMs' fine-grained recognition by up to 19.3%.
Models trained on TWIN perform well across diverse domains like art, animals, and landmarks.
TWIN's scale correlates with improved perceptual performance in VLMs.
Abstract
Vision-language models (VLMs) excel at broad visual understanding but remain coarse-grained, exhibit visual biases, and miss subtle visual details. Existing training corpora reinforce this limitation by emphasizing general recognition ("Is it a cat or a dog?") over fine-grained perception. To address this, we introduce a new training corpus and task designed to enhance the perceptual abilities of VLMs. TWIN is a large-scale dataset of 561,000 image-pair queries that task models to determine whether two visually similar images depict the same object, encouraging attention to nuanced visual cues. The dataset spans a diverse range of everyday objects across contexts, viewpoints, and appearances. Fine-tuning VLMs on TWIN yields notable gains in fine-grained recognition, even on unseen domains such as art, animals, plants, and landmarks. To quantify these gains, we introduce FGVQA, a…
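To make the same-or-not task concrete, here is a minimal sketch of how an image-pair query might be represented and scored. It is not the authors' code or the TWIN data format. The names `PairQuery`, `parse_answer`, and `pair_accuracy` are hypothetical, as are the field names and the assumption that the model replies in free text beginning with "yes" or "no".

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class PairQuery:
    """One same-or-not query (hypothetical schema, not the TWIN format)."""
    image_a: str        # path or ID of the first image
    image_b: str        # path or ID of the second, visually similar image
    same_object: bool   # ground truth: do both images show the same object?


def parse_answer(text: str) -> Optional[bool]:
    """Map a free-form reply to True (same), False (different), or None."""
    words = text.strip().lower().replace(".", " ").replace(",", " ").split()
    if not words:
        return None
    if words[0] == "yes":
        return True
    if words[0] == "no":
        return False
    return None  # unparseable replies are scored as incorrect below


def pair_accuracy(queries: Sequence[PairQuery], answers: Sequence[str]) -> float:
    """Fraction of queries whose parsed answer matches the ground truth."""
    if len(queries) != len(answers):
        raise ValueError("queries and answers must have the same length")
    if not queries:
        return 0.0
    correct = sum(parse_answer(a) == q.same_object for q, a in zip(queries, answers))
    return correct / len(queries)


# Illustrative data only; file names and replies are made up.
example_queries = [
    PairQuery("mug_front.jpg", "mug_side.jpg", True),
    PairQuery("mug_front.jpg", "other_mug.jpg", False),
    PairQuery("chair_a.jpg", "chair_a_dark.jpg", True),
]
example_answers = ["Yes, same mug.", "no", "maybe"]
example_score = pair_accuracy(example_queries, example_answers)
```

Treating unparseable replies as wrong is one possible design choice. An evaluation could instead report them separately, so that formatting failures are not mixed with perception errors.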
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications
