DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting

Yongming Rao; Wenliang Zhao; Guangyi Chen; Yansong Tang; Zheng Zhu; Guan Huang; Jie Zhou; Jiwen Lu

arXiv:2112.01518 · cs.CV · March 22, 2022 · 37 cites

TL;DR

DenseCLIP is a framework that transfers pre-trained CLIP knowledge to dense prediction tasks by converting image-text matching into pixel-text matching and using contextual prompts. The paper reports improved results on semantic segmentation, object detection, and instance segmentation.

Contribution

The paper presents a model-agnostic approach that adapts CLIP for dense prediction by pixel-text matching and contextual prompting, extending CLIP's capabilities beyond classification.

Findings

01 Superior performance on semantic segmentation
02 Effective on object detection and instance segmentation
03 Compatible with various pre-trained backbones

Abstract

Recent progress has shown that large-scale pre-training using contrastive image-text pairs can be a promising alternative for high-quality visual representation learning from natural language supervision. Benefiting from a broader source of supervision, this new paradigm exhibits impressive transferability to downstream classification tasks and datasets. However, the problem of transferring the knowledge learned from image-text pairs to more complex dense prediction tasks has barely been visited. In this work, we present a new framework for dense prediction by implicitly and explicitly leveraging the pre-trained knowledge from CLIP. Specifically, we convert the original image-text matching problem in CLIP to a pixel-text matching problem and use the pixel-text score maps to guide the learning of dense prediction models. By further using the contextual information from the image to…
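The pixel-text matching step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the score map is the cosine similarity between each per-pixel feature and each class text embedding, and it uses random arrays in place of real CLIP encoder outputs. The function names `l2_normalize` and `pixel_text_score_map`, the array shapes, and the argmax readout are all hypothetical choices made for the example.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors along `axis` to (approximately) unit L2 norm."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def pixel_text_score_map(pixel_feats, text_feats):
    """Cosine similarity between every pixel feature and every text embedding.

    pixel_feats: (H, W, C) per-pixel image features (assumed layout).
    text_feats:  (K, C) one embedding per class prompt (assumed layout).
    Returns an (H, W, K) score map.
    """
    z = l2_normalize(pixel_feats)  # (H, W, C)
    t = l2_normalize(text_feats)   # (K, C)
    return z @ t.T                 # (H, W, K)


# Stand-in features; a real pipeline would take these from image and text encoders.
rng = np.random.default_rng(0)
H, W, C, K = 4, 6, 16, 3
pixel_feats = rng.normal(size=(H, W, C))
text_feats = rng.normal(size=(K, C))

scores = pixel_text_score_map(pixel_feats, text_feats)
labels = scores.argmax(axis=-1)  # hypothetical readout: best-matching class per pixel
```

In image-text matching there is one similarity score per image and text pair; here the same comparison is made at every spatial location, which yields one score map per class. How those maps then guide the dense prediction model is described in the paper itself and is not reproduced here.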

Code & Models

Repositories

raoyongming/denseclip (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques

Methods: Contrastive Language-Image Pre-training