Object-centric Binding in Contrastive Language-Image Pretraining
Rim Assouel, Pietro Astolfi, Florian Bordes, Michal Drozdzal, Adriana Romero-Soriano

TL;DR
This paper introduces a novel binding module that enhances contrastive vision-language models' ability to understand complex scenes with multiple objects and relationships by integrating scene graphs and structured representations without hard-negative training.
Contribution
The work presents a new binding module that connects scene graphs with slot-structured image representations, improving compositional understanding in CLIP-like models without additional hard negatives.
Findings
Improved multi-object scene understanding performance.
Enhanced image-text matching accuracy for complex scenes.
Sample-efficient learning with better relational reasoning.
Abstract
Recent advances in vision-language models (VLMs) have been driven by contrastive models such as CLIP, which learn to associate visual information with their corresponding text descriptions. However, these models have limitations in understanding complex compositional scenes involving multiple objects and their spatial relationships. To address these challenges, we propose a novel approach that diverges from commonly used strategies, which rely on the design of hard-negative augmentations. Instead, our work focuses on integrating inductive biases into pre-trained CLIP-like models to improve their compositional understanding without using any additional hard negatives. To that end, we introduce a binding module that connects a scene graph, derived from a text description, with a slot-structured image representation, facilitating a structured similarity assessment between the two…
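To make the idea of a structured similarity between scene-graph nodes and image slots more concrete, here is a minimal NumPy sketch. It is not the paper's binding module, whose details the truncated abstract does not give. It assumes that each object node of the text scene graph and each image slot is already embedded in a shared D-dimensional space, that each node softly attends over the slots, and that the image-text score is the mean cosine similarity between each node and the slot mixture bound to it. The function names, the temperature value, and the averaging rule are all hypothetical, and relation edges are left out entirely.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to (approximately) unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def structured_similarity(node_emb, slot_emb, temperature=0.1):
    """Hypothetical node-to-slot binding score (an illustration, not the paper's method).

    node_emb: (N, D) embeddings of scene-graph object nodes from the caption.
    slot_emb: (K, D) embeddings of image slots.
    Returns a scalar score and the (N, K) soft assignment matrix.
    """
    nodes = l2_normalize(node_emb)
    slots = l2_normalize(slot_emb)
    sim = nodes @ slots.T                       # (N, K) cosine similarities
    attn = softmax(sim / temperature, axis=1)   # each node softly selects slots
    bound = l2_normalize(attn @ slots)          # (N, D) slot mixture bound to each node
    per_node = np.sum(nodes * bound, axis=1)    # cosine between node and its bound slots
    return float(per_node.mean()), attn

# Toy usage: four orthogonal slots and two nodes that match slots 2 and 0.
slots = np.eye(4)
nodes = np.eye(4)[[2, 0]]
score, attn = structured_similarity(nodes, slots)
print(score, attn.argmax(axis=1))
```

In a contrastive setup, a score of this kind could stand in for the single global image-text cosine similarity used by CLIP, so that a caption only scores highly when each of its objects finds a matching slot. That substitution is likewise an assumption here rather than something the abstract states.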
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications · Image Retrieval and Classification Techniques
Methods: Contrastive Language-Image Pre-training
