Perceptual Grouping in Contrastive Vision-Language Models

Kanchana Ranasinghe; Brandon McKinzie; Sachin Ravi; Yinfei Yang; Alexander Toshev; Jonathon Shlens

arXiv:2210.09996 · cs.CV · August 23, 2023 · 1 citation

TL;DR

This paper investigates how contrastive vision-language models understand object locations and grouping within images, proposing modifications to improve their spatial understanding and demonstrating state-of-the-art results in unsupervised segmentation.

Contribution

The authors introduce minimal modifications to contrastive models that enable simultaneous learning of semantic and spatial information, enhancing object localization and grouping capabilities.

Findings

01 Achieves state-of-the-art unsupervised segmentation performance.

02 Models become more robust to dataset biases and spurious correlations.

03 Improved understanding of object locations within images.

Abstract

Recent advances in zero-shot image recognition suggest that vision-language models learn generic visual representations with a high degree of semantic information that may be arbitrarily probed with natural language phrases. Understanding an image, however, is not just about understanding what content resides within an image, but importantly, where that content resides. In this work we examine how well vision-language models are able to understand where objects reside within an image and group together visually related parts of the imagery. We demonstrate how contemporary vision and language representation learning models based on contrastive losses and large web-based data capture limited object localization information. We propose a minimal set of modifications that results in models that uniquely learn both semantic and spatial information. We measure this performance in terms of…

Code & Models

Models

🤗 jongwoopark7978/LVNet · model · ♡ 3

Videos

Perceptual Grouping in Contrastive Vision-Language Models · YouTube

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning