FrozenSeg: Harmonizing Frozen Foundation Models for Open-Vocabulary Segmentation
Xi Chen, Haosen Yang, Sheng Jin, Xiatian Zhu, Hongxun Yao

TL;DR
FrozenSeg effectively combines spatial and semantic knowledge from pre-trained foundation models like SAM and CLIP to improve open-vocabulary segmentation, achieving state-of-the-art zero-shot performance with minimal training overhead.
Contribution
The paper introduces FrozenSeg, a novel framework that integrates frozen spatial and semantic foundation models for improved open-vocabulary segmentation.
Findings
Achieves state-of-the-art zero-shot segmentation results.
Utilizes a lightweight transformer decoder trained on COCO data.
Demonstrates significant performance gains across multiple benchmarks.
Abstract
Open-vocabulary segmentation poses significant challenges, as it requires segmenting and recognizing objects across an open set of categories in unconstrained environments. Building on the success of powerful vision-language (ViL) foundation models, such as CLIP, recent efforts have sought to harness their zero-shot capabilities to recognize unseen categories. Despite notable performance improvements, these models still struggle to generate precise mask proposals for unseen categories and scenarios, which ultimately results in inferior segmentation performance. To address this challenge, we introduce a novel approach, FrozenSeg, designed to integrate spatial knowledge from a localization foundation model (e.g., SAM) and semantic knowledge extracted from a ViL model (e.g., CLIP), in a synergistic framework. Taking the ViL model's visual encoder as the feature backbone, we…
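Illustration
The abstract pairs mask proposals with the zero-shot recognition ability of a ViL model. As a rough illustration of that general idea only, and not of FrozenSeg's actual architecture (which the truncated abstract does not fully specify), the sketch below pools per-pixel features inside a mask and assigns the category whose text embedding has the highest cosine similarity. The function names (`mask_pool`, `cosine`, `classify_mask`) and the toy feature values are hypothetical; a real system would take the features from a visual encoder such as CLIP's and the embeddings from its text encoder.

```python
import math


def mask_pool(features, mask):
    """Average the per-pixel feature vectors that fall inside a binary mask.

    features: H x W x D nested lists; mask: H x W nested lists of 0/1.
    """
    dim = len(features[0][0])
    total = [0.0] * dim
    count = 0
    for feat_row, mask_row in zip(features, mask):
        for vec, inside in zip(feat_row, mask_row):
            if inside:
                count += 1
                for d in range(dim):
                    total[d] += vec[d]
    if count == 0:
        raise ValueError("mask is empty")
    return [t / count for t in total]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def classify_mask(features, mask, text_embeddings):
    """Return the best-matching category name and all similarity scores."""
    pooled = mask_pool(features, mask)
    scores = {name: cosine(pooled, emb) for name, emb in text_embeddings.items()}
    return max(scores, key=scores.get), scores


# Toy 2x2 feature map with 2-dimensional features (hypothetical values).
features = [
    [[1.0, 0.0], [0.9, 0.1]],
    [[0.0, 1.0], [0.1, 0.9]],
]
top_mask = [[1, 1], [0, 0]]
bottom_mask = [[0, 0], [1, 1]]

# Stand-ins for text-encoder outputs; the category list can change at test time.
text_embeddings = {"cat": [1.0, 0.0], "grass": [0.0, 1.0]}

top_label, _ = classify_mask(features, top_mask, text_embeddings)
bottom_label, _ = classify_mask(features, bottom_mask, text_embeddings)
print(top_label, bottom_label)
```

Because the category set enters only through `text_embeddings`, new categories can be added at inference time without retraining, which is the property that makes this style of classifier open-vocabulary. The quality of the result still depends on the masks themselves, which is the weakness the abstract attributes to prior methods.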
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
Methods: Sparse Evolutionary Training · Contrastive Language-Image Pre-training
