Adding simple structure at inference improves Vision-Language Compositionality
Imanol Miranda, Ander Salaberria, Eneko Agirre, Gorka Azkune

TL;DR
This paper introduces a simple inference-time method that improves vision-language compositionality in dual encoder models by matching image crops to text segments, raising retrieval performance without additional training.
Contribution
The authors propose a novel inference-time technique involving image cropping and text segmentation to improve compositionality in vision-language models, demonstrating consistent gains across datasets.
Findings
Improves VLM performance without retraining
Enhances attribute-object binding accuracy
Processing image crops is crucial for gains
Abstract
Dual encoder Vision-Language Models (VLMs) such as CLIP are widely used for image-text retrieval tasks. However, those models struggle with compositionality, showing a bag-of-words-like behavior that limits their retrieval performance. Many different training approaches have been proposed to improve the vision-language compositionality capabilities of those models. In comparison, inference-time techniques have received little attention. In this paper, we propose to add simple structure at inference, where, given an image and a caption: i) we divide the image into different smaller crops, ii) we extract text segments, capturing objects, attributes and relations, iii) using a VLM, we find the image crops that best align with the text segments, obtaining matches, and iv) we compute the final image-text similarity by aggregating the individual similarities of the matches. Based on various popular…
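The four steps in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The grid-based cropping, the inclusion of the full image as a crop, the per-segment maximum for matching, and the mean used for aggregation are all assumptions made for illustration, since the abstract does not specify them. The names `grid_crops` and `structured_similarity` are hypothetical, and the embeddings are assumed to come from any dual encoder such as CLIP (random or hand-written vectors work for trying it out).

```python
import numpy as np


def grid_crops(width, height, rows=2, cols=2):
    """Step i (assumed scheme): full image plus a rows x cols grid of boxes.

    Boxes are (left, top, right, bottom) in pixel coordinates.
    """
    boxes = [(0, 0, width, height)]
    for r in range(rows):
        for c in range(cols):
            boxes.append((
                c * width // cols,
                r * height // rows,
                (c + 1) * width // cols,
                (r + 1) * height // rows,
            ))
    return boxes


def l2_normalize(x):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def structured_similarity(crop_emb, segment_emb):
    """Steps iii and iv (assumed matching and aggregation rules).

    crop_emb:    (n_crops, d) image-crop embeddings from the VLM
    segment_emb: (n_segments, d) text-segment embeddings from the VLM
    Returns the aggregated score and the index of the matched crop per segment.
    """
    crops = l2_normalize(np.asarray(crop_emb, dtype=float))
    segments = l2_normalize(np.asarray(segment_emb, dtype=float))
    sims = segments @ crops.T            # (n_segments, n_crops) cosine similarities
    best_crop = sims.argmax(axis=1)      # matched crop for each text segment
    match_scores = sims.max(axis=1)      # similarity of each match
    return float(match_scores.mean()), best_crop  # mean is an assumed aggregator


# Toy usage: two crops and two text segments in a 2-d embedding space.
score, matches = structured_similarity([[1.0, 0.0], [0.0, 1.0]],
                                       [[1.0, 0.0], [0.0, 2.0]])
```

Step ii (extracting text segments for objects, attributes and relations) is left out of the sketch because the abstract does not say how segments are obtained. In a retrieval setting, the aggregated score would stand in for the single global image-caption similarity that a dual encoder normally produces.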
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Topic Modeling
Methods: Contrastive Language-Image Pre-training · ALIGN
