Object-centric Binding in Contrastive Language-Image Pretraining

Rim Assouel; Pietro Astolfi; Florian Bordes; Michal Drozdzal; Adriana Romero-Soriano

arXiv:2502.14113 · cs.CV · February 21, 2025

TL;DR

This paper introduces a binding module that improves how contrastive vision-language models understand complex scenes with multiple objects and relationships. It connects a scene graph derived from the text with a slot-structured image representation, and it does not rely on hard-negative training.

Contribution

The work presents a new binding module that connects scene graphs with slot-structured image representations, improving compositional understanding in CLIP-like models without additional hard negatives.

Findings

01 Improved multi-object scene understanding performance.

02 Enhanced image-text matching accuracy for complex scenes.

03 Sample-efficient learning with better relational reasoning.

Abstract

Recent advances in vision-language models (VLMs) have been driven by contrastive models such as CLIP, which learn to associate visual information with its corresponding text descriptions. However, these models have limitations in understanding complex compositional scenes involving multiple objects and their spatial relationships. To address these challenges, we propose a novel approach that diverges from commonly used strategies, which rely on the design of hard-negative augmentations. Instead, our work focuses on integrating inductive biases into pre-trained CLIP-like models to improve their compositional understanding without using any additional hard negatives. To that end, we introduce a binding module that connects a scene graph, derived from a text description, with a slot-structured image representation, facilitating a structured similarity assessment between the two…
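
The idea of a structured similarity between scene-graph nodes and image slots can be illustrated with a toy sketch. This is not the paper's binding module: the function names (`bind_nodes_to_slots`, `structured_similarity`), the softmax-over-cosine assignment, and the temperature value are all assumptions chosen for illustration, and relations between objects are not modeled at all. Each text node softly selects image slots, and the image-text score is the average similarity between each node and the slot mixture bound to it.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def bind_nodes_to_slots(node_embs, slot_embs, temperature=0.1):
    """Hypothetical binding step: each scene-graph node gets a
    softmax-weighted mixture of the image slots."""
    dim = len(slot_embs[0])
    bound = []
    for node in node_embs:
        weights = softmax([cosine(node, slot) for slot in slot_embs], temperature)
        mixture = [sum(w * slot[d] for w, slot in zip(weights, slot_embs))
                   for d in range(dim)]
        bound.append(mixture)
    return bound

def structured_similarity(node_embs, slot_embs, temperature=0.1):
    """Average cosine similarity between each node and its bound slot mixture."""
    bound = bind_nodes_to_slots(node_embs, slot_embs, temperature)
    return sum(cosine(n, b) for n, b in zip(node_embs, bound)) / len(node_embs)

# Toy 2-D embeddings (made up): two text nodes, two image slots.
nodes = [[1.0, 0.0], [0.0, 1.0]]
matching_slots = [[1.0, 0.0], [0.0, 1.0]]    # one slot per described object
mismatched_slots = [[1.0, 0.0], [1.0, 0.0]]  # second object is missing

print(structured_similarity(nodes, matching_slots))
print(structured_similarity(nodes, mismatched_slots))
```

In this toy setup the image whose slots cover both described objects scores higher than the image that is missing one. A single global embedding comparison does not expose this per-object signal directly.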

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications · Image Retrieval and Classification Techniques

Methods: Contrastive Language-Image Pre-training