Revisiting Compositionality in Dual-Encoder Vision-Language Models: The Role of Inference

Imanol Miranda; Ander Salaberria; Eneko Agirre; Gorka Azkune

arXiv:2604.11496 · cs.CV · April 17, 2026

TL;DR

This paper shows that enforcing fine-grained region-segment alignment at inference in dual-encoder vision-language models substantially improves compositional performance, particularly under distribution shift, without updating the pretrained encoders.

Contribution

It introduces a lightweight transformer that learns localized alignments from frozen patch and token embeddings, improving out-of-domain compositional performance over traditional fine-tuning methods.

Findings

01  Explicit region-segment alignment improves performance on compositional benchmarks.

02  Learning localized alignment from frozen embeddings matches full fine-tuning on in-domain retrieval.

03  Alignment mechanisms are crucial for robust compositional generalization.

Abstract

Dual-encoder Vision-Language Models (VLMs) such as CLIP are often characterized as bag-of-words systems due to their poor performance on compositional benchmarks. We argue that this limitation may stem less from deficient representations than from the standard inference protocol based on global cosine similarity. First, through controlled diagnostic experiments, we show that explicitly enforcing fine-grained region-segment alignment at inference dramatically improves compositional performance without updating pretrained encoders. We then introduce a lightweight transformer that learns such alignments directly from frozen patch and token embeddings. Comparing against full fine-tuning and prior end-to-end compositional training methods, we find that although these approaches improve in-domain retrieval, their gains do not consistently transfer under distribution shift. In contrast,…
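
To make the contrast between global cosine similarity and region-segment alignment concrete, here is a minimal toy sketch. It is not the authors' procedure: the one-hot concept vectors, the additive attribute-plus-object composition, mean pooling as a stand-in for a global embedding, and the max-over-regions scoring rule (in the style of late-interaction retrieval) are all assumptions chosen for illustration, and the names `global_score` and `aligned_score` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_pool(vectors):
    """Average a list of vectors into one (assumed stand-in for a global embedding)."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def global_score(regions, segments):
    """Single cosine similarity between pooled image and pooled text vectors."""
    return cosine(mean_pool(regions), mean_pool(segments))

def aligned_score(regions, segments):
    """Assumed alignment rule: match each text segment to its best region, then average."""
    best = [max(cosine(s, r) for r in regions) for s in segments]
    return sum(best) / len(best)

def add(u, v):
    """Toy composition: bind an attribute to an object by vector addition."""
    return [a + b for a, b in zip(u, v)]

# Hypothetical one-hot concept vectors (illustration only, not real encoder outputs).
red, blue = [1, 0, 0, 0], [0, 1, 0, 0]
cube, sphere = [0, 0, 1, 0], [0, 0, 0, 1]

# Image regions: a red cube and a blue sphere.
regions = [add(red, cube), add(blue, sphere)]

# Correct caption vs. a caption with the attributes swapped.
caption_ok = [add(red, cube), add(blue, sphere)]
caption_swapped = [add(blue, cube), add(red, sphere)]

g_ok = global_score(regions, caption_ok)
g_swapped = global_score(regions, caption_swapped)
a_ok = aligned_score(regions, caption_ok)
a_swapped = aligned_score(regions, caption_swapped)

print("global :", g_ok, g_swapped)
print("aligned:", a_ok, a_swapped)
```

In this toy setup both captions pool to the same vector, so the global score cannot tell them apart, while the aligned score drops for the swapped caption because no single region matches "blue cube" or "red sphere" well. Real patch and token embeddings are not additive one-hot vectors, so this only illustrates why the scoring rule, and not only the representation, can determine whether attribute binding is detected.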
