Advancing Compositional Awareness in CLIP with Efficient Fine-Tuning
Amit Peleg, Naman Deep Singh, Matthias Hein

TL;DR
This paper introduces CLIC, a fine-tuning method that improves compositional reasoning in CLIP models, strengthening both lexical and semantic understanding while also improving retrieval performance, with state-of-the-art results reported.
Contribution
The paper presents CLIC, a fine-tuning technique that improves compositional awareness and retrieval performance in CLIP models across architectures and differently pre-trained checkpoints.
Findings
CLIC improves compositionality in CLIP models.
CLIC enhances retrieval performance, achieving state-of-the-art (SOTA) results.
Short fine-tuning with CLIC yields the best compositional CLIP on SugarCrepe++.
Abstract
Vision-language models like CLIP have demonstrated remarkable zero-shot capabilities in classification and retrieval. However, these models often struggle with compositional reasoning - the ability to understand the relationships between concepts. A recent benchmark, SugarCrepe++, reveals that previous works on improving compositionality have mainly improved lexical sensitivity but neglected semantic understanding. In addition, downstream retrieval performance often deteriorates, although one would expect that improving compositionality should enhance retrieval. In this work, we introduce CLIC (Compositionally-aware Learning in CLIP), a fine-tuning method based on a novel training technique combining multiple images and their associated captions. CLIC improves compositionality across architectures as well as differently pre-trained CLIP models, both in terms of lexical and semantic…
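The distinction between lexical sensitivity and semantic understanding can be made concrete with a small scoring sketch. It assumes a SugarCrepe++-style sample made of one image, two positive captions that are paraphrases of each other, and one hard negative caption that is worded similarly but means something different; this triplet layout is an assumption here and is not spelled out in the abstract. The function names and the toy embedding vectors are hypothetical and are not outputs of CLIP or CLIC.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def image_prefers_positives(image, pos1, pos2, neg):
    """Image-to-text check (hypothetical helper): the image embedding must be
    closer to BOTH positive captions than to the hard negative."""
    s_neg = cosine(image, neg)
    return cosine(image, pos1) > s_neg and cosine(image, pos2) > s_neg


def text_prefers_paraphrase(pos1, pos2, neg):
    """Text-only check (hypothetical helper): the two paraphrases must be
    closer to each other than either one is to the hard negative."""
    s_pos = cosine(pos1, pos2)
    return s_pos > cosine(pos1, neg) and s_pos > cosine(pos2, neg)


# Toy, hand-made embeddings (assumption: illustrative values only).
image = [1.0, 0.0, 0.2]
pos1 = [0.9, 0.1, 0.2]   # e.g. "a dog chasing a cat"
pos2 = [0.8, 0.0, 0.4]   # e.g. "a cat being chased by a dog"
neg = [0.2, 0.9, 0.1]    # e.g. "a cat chasing a dog"

passes_image_check = image_prefers_positives(image, pos1, pos2, neg)
passes_text_check = text_prefers_paraphrase(pos1, pos2, neg)
```

A model that is only lexically sensitive can reject the negative while still scoring one paraphrase poorly, which is why this sketch requires both positives to beat the negative rather than just one.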
Taxonomy
Topics: Semantic Web and Ontologies
Methods: Contrastive Language-Image Pre-training
