Contrastive Localized Language-Image Pre-Training
Hong-You Chen, Zhengfeng Lai, Haotian Zhang, Xinze Wang, Marcin Eichner, Keen You, Meng Cao, Bowen Zhang, Yinfei Yang, Zhe Gan

TL;DR
This paper introduces CLOC, an enhancement to CLIP that improves its ability to generate fine-grained, region-specific image representations for better localization and grounding in multimodal models.
Contribution
We propose a novel pre-training method called CLOC that adds a region-text contrastive loss and promptable embeddings to CLIP to enhance its localization capabilities.
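As a rough illustration of what a region-text contrastive loss can look like, here is a minimal NumPy sketch. It assumes a generic symmetric InfoNCE objective over matched (region, caption) embedding pairs, in the style of CLIP's image-level loss. The function name, the fixed temperature of 0.07, and the one-caption-per-region pairing are all assumptions made for illustration; this summary does not specify the exact formulation used in CLOC.

```python
import numpy as np


def _log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def region_text_contrastive_loss(region_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over N matched (region, text) pairs.

    Hypothetical helper, not the paper's implementation.
    region_emb, text_emb: arrays of shape (N, D); row i of each is a positive pair.
    """
    # L2-normalize so the dot product is cosine similarity.
    r = region_emb / np.linalg.norm(region_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # Pairwise similarities; positives sit on the diagonal.
    logits = (r @ t.T) / temperature
    idx = np.arange(logits.shape[0])

    # Region-to-text: each row is a classification over all texts.
    loss_r2t = -_log_softmax(logits, axis=1)[idx, idx].mean()
    # Text-to-region: each column is a classification over all regions.
    loss_t2r = -_log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_r2t + loss_t2r)
```

With correctly paired embeddings the loss is close to zero, and it grows when the pairing is shuffled, which is the behavior a contrastive objective of this kind is meant to have.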
Findings
CLOC improves regional embedding quality for image recognition.
CLOC enhances performance on referring and grounding tasks.
Scaling to billions of images yields high-quality region representations.
Abstract
Contrastive Language-Image Pre-training (CLIP) has been a celebrated method for training vision encoders to generate image/text representations facilitating various applications. Recently, CLIP has been widely adopted as the vision backbone of multimodal large language models (MLLMs) to connect image inputs for language interactions. The success of CLIP as a vision-language foundation model relies on aligning web-crawled noisy text annotations at image levels. Nevertheless, such criteria may become insufficient for downstream tasks in need of fine-grained vision representations, especially when region-level understanding is demanding for MLLMs. In this paper, we improve the localization capability of CLIP with several advances. We propose a pre-training method called Contrastive Localized Language-Image Pre-training (CLOC) by complementing CLIP with region-text contrastive loss and…
Taxonomy
Topics: Second Language Learning and Teaching · EFL/ESL Teaching and Learning
Methods: Contrastive Language-Image Pre-training
