HarmoCLIP: Harmonizing Global and Regional Representations in Contrastive Vision-Language Models
Haoxi Zeng, Haoxuan Li, Yi Bin, Pengpeng Zeng, Xing Xu, Yang Yang, Heng Tao Shen

TL;DR
HarmoCLIP introduces a framework that enhances CLIP by explicitly aligning local textual and visual semantics, improving fine-grained understanding without sacrificing global coherence, leading to state-of-the-art results.
Contribution
Proposes a semantic supervision strategy that harmonizes global and region-level representations in CLIP, addressing the trade-off in which improving local perception degrades global alignment.
Findings
Achieves up to 69.78% improvement in retrieval performance.
Improves Top-1 accuracy by 3.2% on bounding-box classification.
Outperforms prior methods on global and local vision-language tasks.
Abstract
Contrastive Language-Image Pre-training (CLIP) has demonstrated remarkable generalization ability and strong performance across a wide range of vision-language tasks. However, due to the lack of region-level supervision, CLIP exhibits limited fine-grained semantic understanding. Although several methods attempt to mitigate this issue, they unintentionally disrupt the global alignment, resulting in a persistent trade-off where improving local perception simultaneously degrades global coherence. In this paper, we propose HarmoCLIP, a novel framework designed to harmonize global and region representations within CLIP. We first identify that the absence of direct alignment between local textual and visual semantics is the fundamental cause of the trade-off. To address this, HarmoCLIP introduces an explicit fine-grained semantic supervision term that directly aligns textual segments with…
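The abstract is cut off before it defines the supervision term, so the sketch below is not the paper's loss. It is a generic illustration, under stated assumptions, of the idea described: keeping a global image-text contrastive objective while adding a second term that aligns region features with text-segment features. The function names, the `local_weight` parameter, the temperature of 0.07, and the random features are all hypothetical choices made for this example.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_contrastive_loss(a, b, temperature=0.07):
    # CLIP-style symmetric InfoNCE: row i of `a` is the positive for row i of `b`.
    # temperature=0.07 is an assumed value, not taken from the paper.
    a, b = l2_normalize(a), l2_normalize(b)
    logits = a @ b.T / temperature
    idx = np.arange(len(a))
    loss_a_to_b = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_b_to_a = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_a_to_b + loss_b_to_a)

def combined_loss(img_global, txt_global, region_feats, segment_feats,
                  local_weight=1.0):
    # Hypothetical combination: global alignment plus a weighted local term.
    # How HarmoCLIP actually forms and weights its local term is not stated
    # in the truncated abstract.
    global_term = symmetric_contrastive_loss(img_global, txt_global)
    local_term = symmetric_contrastive_loss(region_feats, segment_feats)
    return global_term + local_weight * local_term, global_term, local_term

# Toy data: 4 image/caption pairs and 6 region/segment pairs, 16-dim features.
rng = np.random.default_rng(0)
img_global = rng.normal(size=(4, 16))
txt_global = img_global + 0.1 * rng.normal(size=(4, 16))
region_feats = rng.normal(size=(6, 16))
segment_feats = region_feats + 0.1 * rng.normal(size=(6, 16))

total, global_term, local_term = combined_loss(
    img_global, txt_global, region_feats, segment_feats
)
print(f"global={global_term:.4f} local={local_term:.4f} total={total:.4f}")
```

In this toy setup the two terms are computed from separate feature sets, so the local term adds region-level supervision without altering the global objective itself. In a real model both sets of features come from shared encoders, which is where the trade-off described in the abstract arises.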
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications
