Foundation Model Assisted Weakly Supervised Semantic Segmentation
Xiaobo Yang, Xiaojin Gong

TL;DR
This paper introduces a novel framework leveraging foundation models CLIP and SAM to generate high-quality seeds for weakly supervised semantic segmentation, achieving state-of-the-art results on PASCAL VOC 2012.
Contribution
The work proposes a coarse-to-fine seed generation method using frozen foundation models with learnable prompts, improving weakly supervised segmentation performance.
Findings
Achieves state-of-the-art on PASCAL VOC 2012
Competitive results on MS COCO 2014
Effective seed generation with CLIP and SAM modules
Abstract
This work aims to leverage pre-trained foundation models, such as contrastive language-image pre-training (CLIP) and segment anything model (SAM), to address weakly supervised semantic segmentation (WSSS) using image-level labels. To this end, we propose a coarse-to-fine framework based on CLIP and SAM for generating high-quality segmentation seeds. Specifically, we construct an image classification task and a seed segmentation task, which are jointly performed by CLIP with frozen weights and two sets of learnable task-specific prompts. A SAM-based seeding (SAMS) module is designed and applied to each task to produce either coarse or fine seed maps. Moreover, we design a multi-label contrastive loss supervised by image-level labels and a CAM activation loss supervised by the generated coarse seed map. These losses are used to learn the prompts, which are the only parts that need to be…
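To make the idea of supervising with image-level labels concrete, here is a minimal sketch of a multi-label loss over image-to-class similarities. It is not the paper's multi-label contrastive loss, whose exact form is not given in this excerpt. It swaps in a plain sigmoid cross-entropy over temperature-scaled cosine similarities as a stand-in. The function names, the `temperature` value, and the toy embeddings are all hypothetical. In the setup the abstract describes, the class embeddings would come from a frozen text encoder applied to learnable prompts; here they are plain vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def multilabel_similarity_loss(image_emb, class_embs, labels, temperature=0.1):
    """Stand-in multi-label loss (assumption, not the paper's formulation).

    image_emb:  embedding of one image
    class_embs: one embedding per class (would be prompt-derived text features)
    labels:     multi-hot image-level labels, 1 if the class is present
    """
    total = 0.0
    for emb, y in zip(class_embs, labels):
        logit = cosine(image_emb, emb) / temperature
        # Numerically stable binary cross-entropy with logits:
        # max(z, 0) - z * y + log(1 + exp(-|z|))
        total += max(logit, 0.0) - logit * y + math.log1p(math.exp(-abs(logit)))
    return total / len(labels)


# Toy example: the image aligns with class 0 and opposes class 1.
image = [1.0, 0.0]
classes = [[1.0, 0.0], [-1.0, 0.0]]
loss_matching = multilabel_similarity_loss(image, classes, [1, 0])
loss_mismatched = multilabel_similarity_loss(image, classes, [0, 1])
```

In this toy case the loss is much lower when the labels agree with the similarities than when they are flipped, which is the signal a gradient step on the prompts would follow while the encoders stay frozen.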
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · Advanced Neural Network Applications
Methods: Segment Anything Model · Class-activation map · Contrastive Language-Image Pre-training
