$\beta$-CLIP: Text-Conditioned Contrastive Learning for Multi-Granular Vision-Language Alignment

Fatimah Zohra; Chen Zhao; Hani Itani; Bernard Ghanem

arXiv:2512.12678 · cs.CV · March 3, 2026

TL;DR

$\beta$-CLIP introduces a hierarchical, text-conditioned contrastive learning framework that enhances fine-grained vision-language alignment by leveraging multi-granular supervision and a novel contrastive loss, significantly improving dense image-text retrieval performance.

Contribution

The paper proposes $\beta$-CLIP, a multi-granular, text-conditioned contrastive learning approach with a hierarchical alignment mechanism and a new contrastive loss, improving fine-grained vision-language tasks.

Findings

01. Achieves state-of-the-art dense alignment on Urban1K and FG-OVD datasets.

02. Demonstrates that hierarchical supervision benefits both soft and hard contrastive losses.

03. Significantly improves zero-shot and fine-tuned image-text retrieval performance.

Abstract

CLIP achieves strong zero-shot image-text retrieval by aligning global vision and text representations, yet it falls behind on fine-grained tasks even when fine-tuned on long, detailed captions. In this work, we propose $\beta$-CLIP, a multi-granular text-conditioned contrastive learning framework designed to achieve hierarchical alignment between multiple textual granularities (from full captions to sentences and phrases) and their corresponding visual regions. For each level of granularity, $\beta$-CLIP utilizes cross-attention to dynamically pool image patches, producing contextualized visual embeddings. To address the semantic overlap inherent in this hierarchy, we introduce the $\beta$-Contextualized Contrastive Alignment Loss ($\beta$-CAL). This objective parameterizes the trade-off between strict query-specific matching and relaxed intra-image contextualization, supporting both…
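
The text-conditioned pooling step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a single-head, projection-free scaled dot-product attention in which each text embedding (caption, sentence, or phrase) acts as a query over the image patch embeddings. The function name `text_conditioned_pool`, the `temperature` parameter, and the toy shapes are all hypothetical choices made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def text_conditioned_pool(text_queries, patches, temperature=1.0):
    """Pool image patches once per text query (illustrative assumption).

    text_queries: (Q, D) array, one embedding per caption/sentence/phrase.
    patches:      (P, D) array of image patch embeddings.
    Returns (pooled, weights): pooled is (Q, D) with unit-norm rows,
    weights is the (Q, P) attention map over patches.
    """
    d = patches.shape[-1]
    # Scaled dot-product scores between every query and every patch.
    scores = text_queries @ patches.T / (np.sqrt(d) * temperature)
    weights = softmax(scores, axis=-1)
    # Each query gets its own weighted average of the patches.
    pooled = weights @ patches
    pooled = pooled / np.linalg.norm(pooled, axis=-1, keepdims=True)
    return pooled, weights

# Toy example: 3 text queries at different granularities, 49 patches, dim 64.
rng = np.random.default_rng(0)
patches = rng.normal(size=(49, 64))
queries = rng.normal(size=(3, 64))
pooled, weights = text_conditioned_pool(queries, patches)
```

Because the attention weights depend on the query, the same image yields a different visual embedding for each caption, sentence, or phrase, which is the sense in which the pooled embeddings are "contextualized". A full model would likely add learned projections and multiple heads; those are omitted here.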

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning