Contrastive Localized Language-Image Pre-Training

Hong-You Chen; Zhengfeng Lai; Haotian Zhang; Xinze Wang; Marcin Eichner; Keen You; Meng Cao; Bowen Zhang; Yinfei Yang; Zhe Gan

arXiv:2410.02746 · cs.CV · February 20, 2025 · 2 cites

Open Access

TL;DR

This paper introduces CLOC, an enhancement to CLIP that improves its ability to generate fine-grained, region-specific image representations for better localization and grounding in multimodal models.

Contribution

We propose a novel pre-training method called CLOC that incorporates region-text contrastive loss and promptable embeddings to enhance CLIP's localization capabilities.
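
As a rough illustration of what a region-text contrastive loss can look like, the sketch below applies a CLIP-style symmetric InfoNCE objective to paired region and text embeddings. This is an assumption made for illustration, not the paper's implementation: the summary here does not give the exact loss form, and the function names, the temperature value of 0.07, and the idea that region embeddings arrive precomputed are all hypothetical.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def region_text_contrastive_loss(region_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE over N (region, text) pairs; pair i is the positive.

    Hypothetical helper: region_embs[i] is assumed to be the embedding of an
    image region and text_embs[i] the embedding of its matching caption.
    """
    regions = [l2_normalize(r) for r in region_embs]
    texts = [l2_normalize(t) for t in text_embs]
    # Similarity matrix: rows are regions, columns are texts.
    logits = [
        [sum(a * b for a, b in zip(r, t)) / temperature for t in texts]
        for r in regions
    ]
    # Transpose to score each text against all regions.
    logits_t = [list(column) for column in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))
```

With correctly matched pairs the loss is close to zero, and it grows when region and text embeddings are misaligned, which is the signal that pushes each region embedding toward its own caption and away from the others in the batch.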

Findings

01. CLOC improves regional embedding quality for image recognition.

02. CLOC enhances performance on referring and grounding tasks.

03. Scaling to billions of images yields high-quality region representations.

Abstract

Contrastive Language-Image Pre-training (CLIP) has been a celebrated method for training vision encoders to generate image/text representations facilitating various applications. Recently, CLIP has been widely adopted as the vision backbone of multimodal large language models (MLLMs) to connect image inputs for language interactions. The success of CLIP as a vision-language foundation model relies on aligning web-crawled noisy text annotations at image levels. Nevertheless, such criteria may become insufficient for downstream tasks in need of fine-grained vision representations, especially when region-level understanding is demanding for MLLMs. In this paper, we improve the localization capability of CLIP with several advances. We propose a pre-training method called Contrastive Localized Language-Image Pre-training (CLOC) by complementing CLIP with region-text contrastive loss and…

Taxonomy

Topics: Second Language Learning and Teaching · EFL/ESL Teaching and Learning

Methods: Contrastive Language-Image Pre-training