PosSAM: Panoptic Open-vocabulary Segment Anything

Vibashan VS; Shubhankar Borse; Hyojin Park; Debasmit Das; Vishal M. Patel; Munawar Hayat; Fatih Porikli

arXiv:2403.09620 · cs.CV · March 15, 2024 · 1 citation

TL;DR

PosSAM introduces an end-to-end open-vocabulary panoptic segmentation model that combines SAM's spatial features with CLIP's semantic understanding, achieving state-of-the-art results across multiple datasets.

Contribution

The paper proposes PosSAM, a novel unified framework that integrates SAM and CLIP for open-vocabulary panoptic segmentation with new modules for improved classification and mask quality.

Findings

01. Achieves state-of-the-art performance on COCO and ADE20K datasets.

02. Outperforms previous state-of-the-art methods by 2.4 PQ in the COCO to ADE20K setting and 4.6 PQ in the ADE20K to COCO setting.

03. Demonstrates strong generalization across multiple datasets.
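
The PQ numbers in the findings refer to panoptic quality, the standard panoptic segmentation metric: the sum of IoUs over matched segment pairs divided by |TP| + 0.5|FP| + 0.5|FN|, where a predicted and a ground-truth segment count as a match when their IoU exceeds 0.5. The sketch below is a minimal illustration of that formula only. The function name and the toy inputs are hypothetical and are not taken from the paper or its repository.

```python
def panoptic_quality(matched_ious, num_false_positives, num_false_negatives):
    """Compute PQ from the IoUs of matched segments plus unmatched counts.

    matched_ious: IoU of each true-positive pair (each must be > 0.5).
    Returns a value in [0, 1]; papers usually report it multiplied by 100.
    """
    for iou in matched_ious:
        if not 0.5 < iou <= 1.0:
            raise ValueError("a matched pair must have 0.5 < IoU <= 1.0")
    num_true_positives = len(matched_ious)
    denominator = (num_true_positives
                   + 0.5 * num_false_positives
                   + 0.5 * num_false_negatives)
    if denominator == 0:
        return 0.0  # no segments predicted and none in the ground truth
    return sum(matched_ious) / denominator


# Toy example (made-up values): three matches, one spurious prediction,
# one missed ground-truth segment.
pq = panoptic_quality([0.9, 0.8, 0.7], num_false_positives=1, num_false_negatives=1)
print(f"PQ = {100 * pq:.1f}")
```

On the usual 0 to 100 scale, a gain of 2.4 PQ corresponds to a difference of 0.024 in the value this function returns.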

Abstract

In this paper, we introduce an open-vocabulary panoptic segmentation model that effectively unifies the strengths of the Segment Anything Model (SAM) with the vision-language CLIP model in an end-to-end framework. While SAM excels in generating spatially-aware masks, its decoder falls short in recognizing object class information and tends to oversegment without additional guidance. Existing approaches address this limitation by using multi-stage techniques and employing separate models to generate class-aware prompts, such as bounding boxes or segmentation masks. Our proposed method, PosSAM, is an end-to-end model which leverages SAM's spatially rich features to produce instance-aware masks and harnesses CLIP's semantically discriminative features for effective instance classification. Specifically, we address the limitations of SAM and propose a novel Local Discriminative Pooling…
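
The general idea of classifying a predicted mask against free-form class names can be sketched as follows: pool per-pixel features inside the mask, compare the pooled vector with one text embedding per class name, and apply a softmax. This is a generic mask-pooling illustration under stated assumptions, not the paper's Local Discriminative Pooling module, whose details are cut off in the abstract above. All function names are hypothetical, and the tiny hand-written vectors merely stand in for real image and text features.

```python
import math


def mask_pool(features, mask):
    """Average the feature vectors of all pixels where mask is 1."""
    dim = len(features[0][0])
    total = [0.0] * dim
    count = 0
    for feature_row, mask_row in zip(features, mask):
        for vector, inside in zip(feature_row, mask_row):
            if inside:
                count += 1
                for i, value in enumerate(vector):
                    total[i] += value
    if count == 0:
        raise ValueError("mask is empty")
    return [t / count for t in total]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify_mask(features, mask, text_embeddings, temperature=0.01):
    """Return a probability per class name for one mask.

    temperature is an assumed scaling value, not one from the paper.
    """
    pooled = mask_pool(features, mask)
    names = list(text_embeddings)
    logits = [cosine(pooled, text_embeddings[n]) / temperature for n in names]
    peak = max(logits)  # subtract the max for a numerically stable softmax
    exps = [math.exp(logit - peak) for logit in logits]
    normalizer = sum(exps)
    return {n: e / normalizer for n, e in zip(names, exps)}


# Toy 2x2 feature map with 3-dimensional features (made-up values).
features = [
    [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]],
    [[0.0, 1.0, 0.0], [0.0, 0.9, 0.1]],
]
mask_top = [[1, 1], [0, 0]]  # a mask covering the top row only
text_embeddings = {"cat": [1.0, 0.0, 0.0], "grass": [0.0, 1.0, 0.0]}

probs = classify_mask(features, mask_top, text_embeddings)
best = max(probs, key=probs.get)
print(best, probs)
```

Because the class list is just a dictionary of text embeddings, new categories can be added at inference time without retraining, which is what makes this style of classifier open-vocabulary.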

Code & Models

Repositories

Vibashan/PosSAM (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques

Methods: Contrastive Language-Image Pre-training · Segment Anything Model