FrozenSeg: Harmonizing Frozen Foundation Models for Open-Vocabulary Segmentation

Xi Chen; Haosen Yang; Sheng Jin; Xiatian Zhu; Hongxun Yao

arXiv:2409.03525 · cs.CV · September 6, 2024

TL;DR

FrozenSeg effectively combines spatial and semantic knowledge from pre-trained foundation models like SAM and CLIP to improve open-vocabulary segmentation, achieving state-of-the-art zero-shot performance with minimal training overhead.

Contribution

The paper introduces FrozenSeg, a novel framework that integrates frozen spatial and semantic foundation models for improved open-vocabulary segmentation.

Findings

01. Achieves state-of-the-art zero-shot segmentation results.

02. Utilizes a lightweight transformer decoder trained on COCO data.

03. Demonstrates significant performance gains across multiple benchmarks.

Abstract

Open-vocabulary segmentation poses significant challenges, as it requires segmenting and recognizing objects across an open set of categories in unconstrained environments. Building on the success of powerful vision-language (ViL) foundation models, such as CLIP, recent efforts have sought to harness their zero-shot capabilities to recognize unseen categories. Despite notable performance improvements, these models still struggle to generate precise mask proposals for unseen categories and scenarios, which ultimately results in inferior segmentation performance. To address this challenge, we introduce a novel approach, FrozenSeg, designed to integrate spatial knowledge from a localization foundation model (e.g., SAM) and semantic knowledge extracted from a ViL model (e.g., CLIP) in a synergistic framework. Taking the ViL model's visual encoder as the feature backbone, we…

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling

Methods: Sparse Evolutionary Training · Contrastive Language-Image Pre-training