Enhancing Weakly Supervised Semantic Segmentation with Multi-modal Foundation Models: An End-to-End Approach

Elham Ravanbakhsh; Cheng Niu; Yongqing Liang; J. Ramanujam; Xin Li

arXiv:2405.06586·cs.CV·May 13, 2024

TL;DR

This paper introduces an end-to-end weakly-supervised semantic segmentation framework that leverages multi-modal foundation models like SAM, Grounding-DINO, and CLIP to improve boundary delineation and achieve state-of-the-art results.

Contribution

It proposes a novel two-stage framework utilizing foundation models for pseudo-label generation and segmentation, significantly enhancing WSSS performance without requiring image label supervision.

Findings

01. Achieves state-of-the-art results on PASCAL VOC 2012 and MS COCO 2014 datasets.

02. Effectively improves boundary accuracy in weakly-supervised segmentation.

03. Reduces reliance on extensive image annotations by leveraging foundation models.

Abstract

Semantic segmentation is a core computer vision problem, but the high cost of data annotation has hindered its wide application. Weakly-Supervised Semantic Segmentation (WSSS) uses partial or incomplete labels, offering a cost-efficient alternative to the extensive labeling that fully-supervised methods require. Existing WSSS methods have difficulty learning object boundaries, which leads to poor segmentation results. We propose a novel and effective framework that addresses these issues by leveraging visual foundation models inside the bounding box. Adopting a two-stage WSSS framework, our proposed network consists of a pseudo-label generation module and a segmentation module. The first stage leverages the Segment Anything Model (SAM) to generate high-quality pseudo-labels. To alleviate the problem of delineating precise boundaries, we adopt SAM inside the bounding box with the…
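
To make the pseudo-label generation stage more concrete, the sketch below shows one plausible way per-object masks could be merged into a single semantic pseudo-label map. It is an illustration under stated assumptions, not the authors' implementation: the function name compose_pseudo_label, the background id 0, and the rule that higher-confidence objects overwrite lower-confidence ones are all hypothetical. The toy boolean masks stand in for masks a box-prompted segmenter such as SAM might return, and the class ids stand in for labels a classifier such as CLIP might assign.

```python
import numpy as np


def compose_pseudo_label(masks, class_ids, scores, background=0):
    """Merge per-object binary masks into one semantic pseudo-label map.

    masks:     list of (H, W) boolean arrays, one per detected object
    class_ids: list of int class ids, one per mask
    scores:    list of float confidences, one per mask
    All names and conventions here are illustrative assumptions.
    """
    if not masks:
        raise ValueError("need at least one mask")
    label = np.full(masks[0].shape, background, dtype=np.uint8)
    # Paint low-confidence objects first so that higher-confidence objects
    # overwrite them where masks overlap (an assumed tie-breaking rule).
    for i in np.argsort(scores):
        label[masks[i]] = class_ids[i]
    return label


# Toy 4x4 example with two overlapping objects.
mask_a = np.zeros((4, 4), dtype=bool)
mask_a[0:3, 0:3] = True  # object A: class 1, confidence 0.6
mask_b = np.zeros((4, 4), dtype=bool)
mask_b[2:4, 2:4] = True  # object B: class 2, confidence 0.9

pseudo_label = compose_pseudo_label([mask_a, mask_b], [1, 2], [0.6, 0.9])
print(pseudo_label)
```

In this toy case the single overlapping pixel at row 2, column 2 takes class 2, because object B has the higher confidence. A map of this kind is what the second stage would consume as training targets for a segmenter.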

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications

Methods: Segment Anything Model · Contrastive Language-Image Pre-training