SegCLIP: Patch Aggregation with Learnable Centers for Open-Vocabulary Semantic Segmentation

Huaishao Luo; Junwei Bao; Youzheng Wu; Xiaodong He; Tianrui Li

arXiv:2211.14813 · cs.CV · June 21, 2023 · 27 cites

TL;DR

SegCLIP introduces a novel CLIP-based approach for open-vocabulary semantic segmentation that dynamically groups image patches into semantic regions using learnable centers, achieving competitive results without annotations.

Contribution

The paper proposes a ViT-based segmentation method that gathers patches into semantic regions with learnable centers, together with a reconstruction loss on masked patches and a superpixel-based KL loss, for open-vocabulary segmentation without annotations.

Findings

01 +0.3% mIoU over baselines on PASCAL VOC 2012

02 +2.3% mIoU over baselines on PASCAL Context

03 +2.2% mIoU over baselines on COCO
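
The findings above are reported in mIoU (mean intersection over union). As a rough illustration of the metric only, here is a minimal sketch that computes it on flat lists of class labels. The function name `mean_iou` and the toy labels are hypothetical, and benchmark evaluations typically accumulate intersections and unions over a whole dataset rather than a single image, so this is not the paper's evaluation code.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over the classes that appear in pred or target.

    pred and target are equal-length flat sequences of integer class labels.
    """
    ious = []
    for c in range(num_classes):
        # Pixels labelled c in both prediction and ground truth.
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        # Pixels labelled c in either prediction or ground truth.
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example: six "pixels", three classes.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, num_classes=3)
print(f"mIoU = {score:.3f}")
```

Read this way, a gain such as "+2.3% mIoU" is a difference between two such scores expressed in percentage points.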

Abstract

Recently, contrastive language-image pre-training, e.g., CLIP, has demonstrated promising results on various downstream tasks. The pre-trained model can capture enriched visual concepts for images by learning from large-scale text-image data. However, transferring the learned visual knowledge to open-vocabulary semantic segmentation is still under-explored. In this paper, we propose a CLIP-based model named SegCLIP for open-vocabulary segmentation in an annotation-free manner. SegCLIP performs segmentation based on ViT, and the main idea is to gather patches into semantic regions with learnable centers by training on text-image pairs. The gathering operation can dynamically capture the semantic groups, which can be used to generate the final segmentation results. We further propose a reconstruction loss on masked patches and a superpixel-based KL loss with…
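
To make the idea of gathering patches with centers concrete, the following is a minimal pure-Python sketch under stated assumptions. It uses a plain softmax over dot-product similarities as the assignment rule, which is a stand-in and not necessarily the mechanism used in SegCLIP, and the centers are fixed here rather than learned from text-image pairs. The names `gather_patches`, `patches`, and `centers` are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def gather_patches(patches, centers):
    """Assign patches to centers and pool them into one feature per center.

    Returns (assign, regions): assign[i] is the index of the center that
    patch i is most similar to; regions[k] is the assignment-weighted mean
    of all patch features for center k.
    """
    # weights[i][k]: soft assignment of patch i to center k.
    weights = [softmax([dot(p, c) for c in centers]) for p in patches]
    assign = [max(range(len(centers)), key=lambda k: w[k]) for w in weights]
    dim = len(patches[0])
    regions = []
    for k in range(len(centers)):
        total = sum(w[k] for w in weights)  # always > 0 for a softmax
        regions.append([
            sum(w[k] * p[d] for w, p in zip(weights, patches)) / total
            for d in range(dim)
        ])
    return assign, regions


# Toy example: four 2-D patch features and two centers (fixed, not learned).
patches = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
centers = [[1.0, 0.0], [0.0, 1.0]]
assign, regions = gather_patches(patches, centers)
print(assign)
```

In this sketch the hard assignment `assign` plays the role of a per-patch group label, while `regions` holds the pooled group features that could then be compared with text embeddings.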

Code & Models

Repositories

arrowluo/segclip · PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques

Methods: Contrastive Language-Image Pre-training