CRIS: CLIP-Driven Referring Image Segmentation

Zhaoqing Wang; Yu Lu; Qiang Li; Xunqiang Tao; Yandong Guo; Mingming Gong; Tongliang Liu

arXiv:2111.15174·cs.CV·March 16, 2022

TL;DR

CRIS leverages CLIP's multi-modal knowledge through vision-language decoding and contrastive learning to improve referring image segmentation, achieving state-of-the-art results without post-processing.

Contribution

This paper introduces a novel end-to-end framework that effectively transfers multi-modal knowledge for segmentation using CLIP, vision-language decoding, and contrastive learning.

Findings

01 Significantly outperforms existing methods on benchmark datasets.

02 No post-processing needed for high performance.

03 Demonstrates effective multi-modal knowledge transfer.

Abstract

Referring image segmentation aims to segment a referent via a natural linguistic expression. Due to the distinct data properties of text and images, it is challenging for a network to align text and pixel-level features well. Existing approaches use pretrained models to facilitate learning, yet they transfer the language and vision knowledge from pretrained models separately, ignoring the multi-modal correspondence information. Inspired by recent advances in Contrastive Language-Image Pretraining (CLIP), in this paper we propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS). To transfer the multi-modal knowledge effectively, CRIS uses vision-language decoding and contrastive learning to achieve text-to-pixel alignment. More specifically, we design a vision-language decoder to propagate fine-grained semantic information from textual representations to…
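The abstract is truncated here, so the following is only a minimal sketch of what text-to-pixel alignment through contrastive learning could look like, not the authors' implementation. It assumes one sentence-level text vector and one feature vector per pixel in a shared embedding space, scores each pixel by a dot product with the text vector, and trains those scores with a sigmoid and binary cross-entropy against the ground-truth mask. The function names (text_to_pixel_scores, text_to_pixel_loss, predict_mask), the loss form, and the 0.5 threshold are all assumptions made for illustration.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def text_to_pixel_scores(text_feat, pixel_feats):
    """Dot product between the text vector and each pixel vector.

    text_feat:   list of floats, length D (hypothetical sentence embedding)
    pixel_feats: list of N lists of floats, each of length D
    """
    return [sum(t * p for t, p in zip(text_feat, pix)) for pix in pixel_feats]


def text_to_pixel_loss(text_feat, pixel_feats, mask):
    """Assumed contrastive objective: pull referent pixels (mask == 1)
    toward the text vector and push all other pixels away from it."""
    scores = text_to_pixel_scores(text_feat, pixel_feats)
    total = 0.0
    for score, label in zip(scores, mask):
        prob = sigmoid(score)
        # Clamp to avoid log(0) for extreme scores.
        prob = min(max(prob, 1e-12), 1.0 - 1e-12)
        total += -math.log(prob) if label == 1 else -math.log(1.0 - prob)
    return total / len(scores)


def predict_mask(text_feat, pixel_feats, threshold=0.5):
    """Threshold the per-pixel probabilities directly to get a binary mask."""
    scores = text_to_pixel_scores(text_feat, pixel_feats)
    return [1 if sigmoid(s) > threshold else 0 for s in scores]
```

With a text vector [1.0, 0.0] and pixel vectors [[3.0, 0.0], [-3.0, 0.0], [2.0, 1.0]], predict_mask returns [1, 0, 1], and the loss is lower for that mask than for its complement. In this sketch the mask comes from thresholding the scores alone, which is one way to read the claim that no post-processing is needed.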

Peer Reviews

No public reviews on file for this paper yet.

Code & Models

Repositories

DerrickWang005/CRIS.pytorch (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning

Methods: Contrastive Learning