Advancing Compositional Awareness in CLIP with Efficient Fine-Tuning

Amit Peleg; Naman Deep Singh; Matthias Hein

arXiv:2505.24424 · cs.LG · October 29, 2025

TL;DR

This paper introduces CLIC, a fine-tuning method that improves compositional reasoning in CLIP models. It strengthens both lexical sensitivity and semantic understanding, and it improves retrieval performance, with state-of-the-art results reported.

Contribution

The paper presents a novel fine-tuning technique, CLIC, that significantly improves compositional awareness and retrieval performance in CLIP models across various architectures.

Findings

01. CLIC improves compositionality in CLIP models.

02. CLIC enhances retrieval performance, achieving SOTA results.

03. Short fine-tuning with CLIC yields the best compositional CLIP on SugarCrepe++.

Abstract

Vision-language models like CLIP have demonstrated remarkable zero-shot capabilities in classification and retrieval. However, these models often struggle with compositional reasoning, that is, the ability to understand the relationships between concepts. A recent benchmark, SugarCrepe++, reveals that previous works on improving compositionality have mainly improved lexical sensitivity but neglected semantic understanding. In addition, downstream retrieval performance often deteriorates, although one would expect that improving compositionality should enhance retrieval. In this work, we introduce CLIC (Compositionally-aware Learning in CLIP), a fine-tuning method based on a novel training technique combining multiple images and their associated captions. CLIC improves compositionality across architectures as well as differently pre-trained CLIP models, both in terms of lexical and semantic…
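
To make the distinction between lexical sensitivity and semantic understanding concrete, the following is a minimal sketch of a triplet-style compositionality check. It is not the paper's method and not the official SugarCrepe++ evaluation code. The triplet layout (two captions with the same meaning plus one hard negative), the function names, and the toy embedding vectors are all assumptions made for illustration; a real check would obtain the embeddings from a CLIP image encoder and text encoder.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def passes_triplet(image, positive_1, positive_2, negative):
    """Return True if the image scores higher with both positive captions
    than with the negative caption (hypothetical pass criterion)."""
    s_neg = cosine(image, negative)
    return cosine(image, positive_1) > s_neg and cosine(image, positive_2) > s_neg

# Toy, hand-made embeddings (hypothetical; not produced by any real model).
image = [1.0, 0.0, 0.2]
positive_1 = [0.9, 0.1, 0.2]  # caption that matches the image
positive_2 = [0.8, 0.0, 0.4]  # reworded caption with the same meaning
negative = [0.1, 1.0, 0.2]    # similar wording, different meaning

result = passes_triplet(image, positive_1, positive_2, negative)
print(result)
```

Under this assumed criterion, a model that only reacts to word overlap could rank the reworded caption below the hard negative and fail the triplet, which is the kind of gap between lexical and semantic behavior that the abstract describes.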

Videos

Advancing Compositional Awareness in CLIP with Efficient Fine-Tuning · SlidesLive

Taxonomy

Topics: Semantic Web and Ontologies

Methods: Contrastive Language-Image Pre-training