$\beta$-CLIP: Text-Conditioned Contrastive Learning for Multi-Granular Vision-Language Alignment
Fatimah Zohra, Chen Zhao, Hani Itani, Bernard Ghanem

TL;DR
$\beta$-CLIP introduces a hierarchical, text-conditioned contrastive learning framework that enhances fine-grained vision-language alignment by leveraging multi-granular supervision and a novel contrastive loss, significantly improving dense image-text retrieval performance.
Contribution
The paper proposes $\beta$-CLIP, a novel multi-granular, text-conditioned contrastive learning approach with a hierarchical alignment mechanism and a new contrastive loss, improving fine-grained vision-language tasks.
Findings
Achieves state-of-the-art dense alignment on Urban1K and FG-OVD datasets.
Demonstrates that hierarchical supervision benefits both soft and hard contrastive losses.
Significantly improves zero-shot and fine-tuned image-text retrieval performance.
Abstract
CLIP achieves strong zero-shot image-text retrieval by aligning global vision and text representations, yet it falls behind on fine-grained tasks even when fine-tuned on long, detailed captions. In this work, we propose $\beta$-CLIP, a multi-granular text-conditioned contrastive learning framework designed to achieve hierarchical alignment between multiple textual granularities, from full captions to sentences and phrases, and their corresponding visual regions. For each level of granularity, $\beta$-CLIP utilizes cross-attention to dynamically pool image patches, producing contextualized visual embeddings. To address the semantic overlap inherent in this hierarchy, we introduce the $\beta$-Contextualized Contrastive Alignment Loss ($\beta$-CAL). This objective parameterizes the trade-off between strict query-specific matching and relaxed intra-image contextualization, supporting both…
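The text-conditioned pooling step described in the abstract can be illustrated with a minimal sketch. It assumes single-head scaled dot-product attention with no learned projections, and the names `text_conditioned_pool`, `patches`, and `queries` are hypothetical; the paper's actual architecture and the $\beta$-CAL loss are not reproduced here.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def text_conditioned_pool(text_query, patch_embeddings):
    """Pool image patches using attention weights derived from one text query.

    Assumption: the query and the patches already live in the same space,
    so no learned query/key/value projections are applied.
    """
    dim = len(text_query)
    scale = 1.0 / math.sqrt(dim)
    weights = softmax([dot(text_query, p) * scale for p in patch_embeddings])
    pooled = [
        sum(w * p[i] for w, p in zip(weights, patch_embeddings))
        for i in range(dim)
    ]
    return pooled, weights

# Toy data: three 2-D patch embeddings and two text queries that stand in
# for different granularities (a full caption and a short phrase).
patches = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
queries = {"caption": [0.5, 0.5], "phrase": [4.0, 0.0]}

# One contextualized visual embedding per text query.
pooled = {name: text_conditioned_pool(q, patches) for name, q in queries.items()}
```

Each query yields its own weighting over the same patches, so the pooled visual embedding changes with the text it is conditioned on. In this toy setup the phrase query concentrates its weight on the first patch, while the caption query favors the third.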
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning
