Adding simple structure at inference improves Vision-Language Compositionality

Imanol Miranda; Ander Salaberria; Eneko Agirre; Gorka Azkune

arXiv:2506.09691·cs.CV·June 12, 2025

TL;DR

This paper introduces a simple inference-time method that improves vision-language compositionality in dual encoder models: it matches image crops against text segments and aggregates the similarities of the matches, improving retrieval performance without additional training.

Contribution

The authors propose a novel inference-time technique involving image cropping and text segmentation to improve compositionality in vision-language models, demonstrating consistent gains across datasets.

Findings

01 Improves VLM performance without retraining

02 Enhances attribute-object binding accuracy

03 Processing image crops is crucial for gains

Abstract

Dual encoder Vision-Language Models (VLMs) such as CLIP are widely used for image-text retrieval tasks. However, these models struggle with compositionality, showing a bag-of-words-like behavior that limits their retrieval performance. Many different training approaches have been proposed to improve the vision-language compositionality capabilities of these models. In comparison, inference-time techniques have received little attention. In this paper, we propose to add simple structure at inference, where, given an image and a caption: i) we divide the image into different smaller crops, ii) we extract text segments, capturing objects, attributes and relations, iii) using a VLM, we find the image crops that best align with the text segments, obtaining matches, and iv) we compute the final image-text similarity by aggregating the individual similarities of the matches. Based on various popular…
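The four steps in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how crops are chosen, how text segments are extracted, or how match similarities are aggregated. The sketch below assumes a uniform grid of crops plus the full image, assumes the crop and segment embeddings have already been produced by some dual encoder (step ii, segment extraction, is left out), and assumes mean aggregation. The names `grid_crops`, `cosine_matrix`, and `structured_similarity` are hypothetical.

```python
import numpy as np


def grid_crops(image, rows=2, cols=2):
    """Step i (assumed scheme): the full image plus a rows x cols grid of tiles.

    `image` is an H x W x C array. The grid layout is an illustrative
    assumption, not the cropping strategy used in the paper.
    """
    h, w = image.shape[:2]
    crops = [image]
    for r in range(rows):
        for c in range(cols):
            top, bottom = r * h // rows, (r + 1) * h // rows
            left, right = c * w // cols, (c + 1) * w // cols
            crops.append(image[top:bottom, left:right])
    return crops


def cosine_matrix(a, b):
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T


def structured_similarity(crop_embs, segment_embs):
    """Steps iii and iv: match each text segment to its best crop, then aggregate.

    crop_embs:    (num_crops, dim) embeddings of the image crops.
    segment_embs: (num_segments, dim) embeddings of the text segments.
    Returns the aggregated score and the index of the matched crop per segment.
    Mean aggregation is an assumption made for this sketch.
    """
    sims = cosine_matrix(segment_embs, crop_embs)  # (num_segments, num_crops)
    best_crop = sims.argmax(axis=1)                # step iii: best crop per segment
    match_sims = sims.max(axis=1)                  # similarity of each match
    return float(match_sims.mean()), best_crop     # step iv: aggregate


if __name__ == "__main__":
    # Toy embeddings stand in for the outputs of a real image and text encoder.
    crops = np.array([[1.0, 0.0], [0.0, 1.0]])
    segments = np.array([[2.0, 0.0], [0.0, 3.0]])
    score, matches = structured_similarity(crops, segments)
    print(score, matches)
```

In a real pipeline the toy arrays would be replaced by embeddings from a model such as CLIP, and the structured score would be computed for every image-caption pair being ranked.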

Code & Models

Repositories

imirandam/structure-inference-compositionality (Official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Topic Modeling

Methods: Contrastive Language-Image Pre-training · ALIGN