Convolutions Die Hard: Open-Vocabulary Segmentation with Single Frozen Convolutional CLIP

Qihang Yu; Ju He; Xueqing Deng; Xiaohui Shen; Liang-Chieh Chen

arXiv:2308.02487·cs.CV·November 16, 2023·30 cites

TL;DR

This paper introduces FC-CLIP, a single-stage open-vocabulary segmentation model using a frozen convolutional CLIP backbone, achieving superior accuracy and efficiency over traditional two-stage methods across multiple datasets.

Contribution

The paper presents a novel single-stage framework with a frozen convolutional CLIP backbone for open-vocabulary segmentation, simplifying the pipeline and improving performance and efficiency.

Findings

01. Achieves state-of-the-art results on multiple datasets.

02. Trains and runs inference significantly faster than prior two-stage methods.

03. Uses fewer parameters than prior methods.

Abstract

Open-vocabulary segmentation is a challenging task requiring segmenting and recognizing objects from an open set of categories. One way to address this challenge is to leverage multi-modal models, such as CLIP, to provide image and text features in a shared embedding space, which bridges the gap between closed-vocabulary and open-vocabulary recognition. Hence, existing methods often adopt a two-stage framework to tackle the problem, where the inputs first go through a mask generator and then through the CLIP model along with the predicted masks. This process involves extracting features from images multiple times, which can be ineffective and inefficient. By contrast, we propose to build everything into a single-stage framework using a shared Frozen Convolutional CLIP backbone, which not only significantly simplifies the current two-stage pipeline, but also remarkably yields a better…
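The recognition step the abstract describes, comparing image features with text features in a shared embedding space, can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the function names `mask_pool` and `classify_masks`, the temperature value, and the toy arrays are all hypothetical, and small hand-written arrays stand in for the frozen backbone's feature map and the text embeddings. The sketch shows one feature map being computed once and reused for every predicted mask, rather than running an image encoder again for each mask.

```python
import numpy as np


def mask_pool(features, masks):
    """Average a feature map inside each mask.

    features: (C, H, W) feature map (assumed to come from a frozen backbone).
    masks:    (N, H, W) soft or binary masks with values in [0, 1].
    Returns an (N, C) array holding one embedding per mask.
    """
    weights = masks.reshape(masks.shape[0], -1)        # (N, H*W)
    flat = features.reshape(features.shape[0], -1)     # (C, H*W)
    area = weights.sum(axis=1, keepdims=True) + 1e-6   # avoid division by zero
    return (weights @ flat.T) / area


def classify_masks(mask_embeddings, text_embeddings, temperature=0.01):
    """Score each mask embedding against each category text embedding.

    Uses cosine similarity followed by a softmax over categories.
    The temperature is an illustrative value, not one taken from the paper.
    Returns an (N, K) array of per-mask category probabilities.
    """
    a = mask_embeddings / (np.linalg.norm(mask_embeddings, axis=1, keepdims=True) + 1e-6)
    b = text_embeddings / (np.linalg.norm(text_embeddings, axis=1, keepdims=True) + 1e-6)
    logits = (a @ b.T) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)


# Toy stand-ins: a 2-channel 2x2 feature map whose left column "looks like"
# category 0 and whose right column "looks like" category 1.
features = np.array([
    [[1.0, 0.0], [1.0, 0.0]],   # channel 0
    [[0.0, 1.0], [0.0, 1.0]],   # channel 1
])
masks = np.array([
    [[1.0, 0.0], [1.0, 0.0]],   # mask covering the left column
    [[0.0, 1.0], [0.0, 1.0]],   # mask covering the right column
])
text_embeddings = np.array([[1.0, 0.0], [0.0, 1.0]])  # one row per category name

pooled = mask_pool(features, masks)            # features are computed once, pooled per mask
probs = classify_masks(pooled, text_embeddings)
predicted = probs.argmax(axis=1)
```

Because the category set enters only through `text_embeddings`, swapping in embeddings for a different list of category names changes the vocabulary without touching the image features.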

Code & Models

Repositories

bytedance/fc-clip (PyTorch, official)

Videos

Convolutions Die Hard: Open-Vocabulary Segmentation with Single Frozen Convolutional CLIP · SlidesLive

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques

Methods: Contrastive Language-Image Pre-training