TL;DR
This paper introduces a concept-centric learning approach for contrastive vision-language models that enhances compositionality without sacrificing zero-shot and retrieval performance, using simple modifications and auxiliary losses.
Contribution
It proposes a novel, straightforward method involving concept-centric captions and attention pooling to improve compositionality in V&L models without degrading their zero-shot capabilities.
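To make the pooling idea concrete, here is a minimal sketch contrasting global average pooling with single-query attention pooling over token embeddings. It is a generic illustration, not the paper's implementation: the function names `mean_pool` and `attention_pool`, the toy embeddings, and the fixed `query` vector are all assumptions, and the tokens serve as both keys and values with no learned projections.

```python
import math

def mean_pool(tokens):
    """Global average pooling: every token contributes equally."""
    n = len(tokens)
    dim = len(tokens[0])
    return [sum(tok[d] for tok in tokens) / n for d in range(dim)]

def attention_pool(tokens, query):
    """Single-query attention pooling (hypothetical, simplified).

    Scores each token against the query with a scaled dot product,
    normalizes the scores with a softmax, and returns the weighted
    sum of the tokens together with the weights.
    """
    dim = len(query)
    scores = [
        sum(q * t for q, t in zip(query, tok)) / math.sqrt(dim)
        for tok in tokens
    ]
    # Subtract the max score before exponentiating for numerical stability.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    pooled = [
        sum(w * tok[d] for w, tok in zip(weights, tokens))
        for d in range(dim)
    ]
    return pooled, weights

# Toy example: three 2-dimensional token embeddings (made-up values).
tokens = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
query = [4.0, 0.0]  # in a real model this would be a learned parameter

mean_vec = mean_pool(tokens)
attn_vec, attn_weights = attention_pool(tokens, query)
```

In this toy case the average weights all three tokens equally, while the attention weights concentrate on the first token because it aligns with the query. In a trained model the query (and usually key and value projections) would be learned rather than fixed.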
Findings
Achieved state-of-the-art results on compositionality benchmarks.
Maintained or improved zero-shot and retrieval performance.
Did not increase inference cost.
Abstract
Contrastive vision-language (V&L) models remain a popular choice for various applications. However, several limitations have emerged, most notably the limited ability of V&L models to learn compositional representations. Prior methods often addressed this limitation by generating custom training data to obtain hard negative samples. Hard negatives have been shown to improve performance on compositionality tasks, but are often specific to a single benchmark, do not generalize, and can cause substantial degradation of basic V&L capabilities such as zero-shot or retrieval performance, rendering them impractical. In this work we follow a different approach. We identify two root causes that limit compositionality performance of V&Ls: 1) Long training captions do not require a compositional representation; and 2) The final global pooling in the text and image encoders leads to a complete loss…
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
