Enhancing Weakly Supervised Semantic Segmentation with Multi-modal Foundation Models: An End-to-End Approach
Elham Ravanbakhsh, Cheng Niu, Yongqing Liang, J. Ramanujam, Xin Li

TL;DR
This paper introduces an end-to-end weakly-supervised semantic segmentation framework that leverages multi-modal foundation models like SAM, Grounding-DINO, and CLIP to improve boundary delineation and achieve state-of-the-art results.
Contribution
The paper proposes a two-stage framework that uses foundation models for pseudo-label generation and segmentation, improving WSSS performance without requiring image-label supervision.
Findings
Achieves state-of-the-art results on PASCAL VOC 2012 and MS COCO 2014 datasets.
Effectively improves boundary accuracy in weakly-supervised segmentation.
Reduces reliance on extensive image annotations by leveraging foundation models.
Abstract
Semantic segmentation is a core computer vision problem, but the high costs of data annotation have hindered its wide application. Weakly-Supervised Semantic Segmentation (WSSS) offers a cost-efficient workaround to extensive labeling in comparison to fully-supervised methods by using partial or incomplete labels. Existing WSSS methods have difficulties in learning the boundaries of objects, leading to poor segmentation results. We propose a novel and effective framework that addresses these issues by leveraging visual foundation models inside the bounding box. Adopting a two-stage WSSS framework, our proposed network consists of a pseudo-label generation module and a segmentation module. The first stage leverages Segment Anything Model (SAM) to generate high-quality pseudo-labels. To alleviate the problem of delineating precise boundaries, we adopt SAM inside the bounding box with the…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications
Methods: Segment Anything Model · Contrastive Language-Image Pre-training
