No Hard Negatives Required: Concept Centric Learning Leads to Compositionality without Degrading Zero-shot Capabilities of Contrastive Models

Hai X. Pham; David T. Hoffmann; Ricardo Guerrero; Brais Martinez

arXiv:2603.25722·cs.CV·May 20, 2026

TL;DR

This paper introduces a concept-centric learning approach for contrastive vision-language models that enhances compositionality without sacrificing zero-shot and retrieval performance, using simple modifications and auxiliary losses.

Contribution

The paper proposes a simple method that combines concept-centric captions with attention pooling to improve compositionality in V&L models without degrading their zero-shot capabilities.
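To make the pooling distinction concrete, here is a minimal, generic sketch that contrasts global average pooling with single-query dot-product attention pooling over token embeddings. It is not taken from the paper or its repository: the function names `mean_pool` and `attention_pool`, the toy embeddings, and the query vector are all hypothetical, and the paper's actual pooling design may differ.

```python
import math

def mean_pool(tokens):
    """Global average pooling: every token contributes with equal weight."""
    n = len(tokens)
    dim = len(tokens[0])
    return [sum(tok[d] for tok in tokens) / n for d in range(dim)]

def attention_pool(tokens, query):
    """Single-query scaled dot-product attention pooling (generic sketch)."""
    dim = len(query)
    scale = 1.0 / math.sqrt(dim)
    scores = [scale * sum(q * t for q, t in zip(query, tok)) for tok in tokens]
    # Numerically stable softmax over the per-token scores.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of token embeddings using the attention weights.
    pooled = [sum(w * tok[d] for w, tok in zip(weights, tokens)) for d in range(dim)]
    return pooled, weights

# Toy token embeddings and query (hypothetical values, not from the paper).
tokens = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
query = [4.0, 0.0]  # aligned with the first token

avg = mean_pool(tokens)
pooled, weights = attention_pool(tokens, query)
print("mean pooled:", avg)
print("attention pooled:", pooled)
print("attention weights:", weights)
```

In this toy setup the average treats all three tokens identically, whereas the attention weights concentrate on the token that matches the query, so the pooled vector retains more of that token's information.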

Findings

01

Achieved state-of-the-art on compositionality benchmarks.

02

Maintained or improved zero-shot and retrieval performance.

03

Did not increase inference cost.

Abstract

Contrastive vision-language (V&L) models remain a popular choice for various applications. However, several limitations have emerged, most notably the limited ability of V&L models to learn compositional representations. Prior methods often addressed this limitation by generating custom training data to obtain hard negative samples. Hard negatives have been shown to improve performance on compositionality tasks, but are often specific to a single benchmark, do not generalize, and can cause substantial degradation of basic V&L capabilities such as zero-shot or retrieval performance, rendering them impractical. In this work we follow a different approach. We identify two root causes that limit compositionality performance of V&Ls: 1) Long training captions do not require a compositional representation; and 2) The final global pooling in the text and image encoders leads to a complete loss…

Code & Models

Repositories

saic-fi/concept_centric_clip (GitHub)

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques