PosSAM: Panoptic Open-vocabulary Segment Anything
Vibashan VS, Shubhankar Borse, Hyojin Park, Debasmit Das, Vishal Patel, Munawar Hayat, Fatih Porikli

TL;DR
PosSAM introduces an end-to-end open-vocabulary panoptic segmentation model that combines SAM's spatial features with CLIP's semantic understanding, achieving state-of-the-art results across multiple datasets.
Contribution
The paper proposes PosSAM, a novel unified framework that integrates SAM and CLIP for open-vocabulary panoptic segmentation with new modules for improved classification and mask quality.
Findings
Achieves state-of-the-art performance on COCO and ADE20K datasets.
Outperforms previous methods by 2.4 PQ and 4.6 PQ on these benchmarks, respectively.
Demonstrates strong generalization across multiple datasets.
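The gains above are reported in PQ (panoptic quality). As a reminder of what that number measures, here is a minimal sketch of the standard single-class PQ computation: predicted and ground-truth segments are matched when their IoU exceeds 0.5, and PQ divides the summed IoU of the matches by TP + 0.5 FP + 0.5 FN. The function name and arguments are illustrative only and are not taken from the paper's code. Benchmarks typically average this value over classes and report it multiplied by 100.

```python
def panoptic_quality(matched_ious, num_false_positives, num_false_negatives):
    """Single-class PQ from already-matched segments.

    matched_ious: IoU of each true-positive match (each assumed to be > 0.5).
    num_false_positives: predicted segments with no matching ground truth.
    num_false_negatives: ground-truth segments with no matching prediction.
    """
    true_positives = len(matched_ious)
    denominator = (
        true_positives
        + 0.5 * num_false_positives
        + 0.5 * num_false_negatives
    )
    if denominator == 0:
        # No segments at all for this class; return 0.0 by convention here.
        return 0.0
    return sum(matched_ious) / denominator
```

For example, two matches with IoU 0.8 and 0.6, plus one false positive and one false negative, give 1.4 / 3, or roughly 46.7 on the usual 0 to 100 scale.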
Abstract
In this paper, we introduce an open-vocabulary panoptic segmentation model that effectively unifies the strengths of the Segment Anything Model (SAM) with the vision-language CLIP model in an end-to-end framework. While SAM excels in generating spatially aware masks, its decoder falls short in recognizing object class information and tends to oversegment without additional guidance. Existing approaches address this limitation by using multi-stage techniques and employing separate models to generate class-aware prompts, such as bounding boxes or segmentation masks. Our proposed method, PosSAM, is an end-to-end model that leverages SAM's spatially rich features to produce instance-aware masks and harnesses CLIP's semantically discriminative features for effective instance classification. Specifically, we address the limitations of SAM and propose a novel Local Discriminative Pooling…
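To make the idea of classifying a mask with vision-language features concrete, the sketch below shows a generic open-vocabulary recipe: average the per-pixel features inside a mask, then pick the class whose text embedding has the highest cosine similarity. This is an assumption-laden illustration, not the paper's pooling module (whose description is truncated above). All function names are hypothetical, the features and text embeddings are toy values rather than real CLIP outputs, and plain Python lists are used to keep the example dependency-free.

```python
import math


def mask_pool(features, mask):
    """Average the feature vectors of pixels where mask is 1.

    features: H x W x C nested lists of floats.
    mask: H x W nested lists of 0/1.
    """
    channels = len(features[0][0])
    total = [0.0] * channels
    count = 0
    for feature_row, mask_row in zip(features, mask):
        for vector, inside in zip(feature_row, mask_row):
            if inside:
                count += 1
                for c in range(channels):
                    total[c] += vector[c]
    if count == 0:
        return total  # empty mask: zero vector
    return [value / count for value in total]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def classify_mask(features, mask, text_embeddings):
    """Return the class name whose embedding best matches the pooled feature."""
    pooled = mask_pool(features, mask)
    return max(
        text_embeddings,
        key=lambda name: cosine(pooled, text_embeddings[name]),
    )


# Toy 2 x 2 feature map with 2 channels; the mask covers the top row.
toy_features = [[[1.0, 0.0], [0.9, 0.1]], [[0.0, 1.0], [0.1, 0.9]]]
toy_mask = [[1, 1], [0, 0]]
toy_text = {"cat": [1.0, 0.0], "grass": [0.0, 1.0]}
print(classify_mask(toy_features, toy_mask, toy_text))  # prints "cat"
```

Because the class list is just a dictionary of text embeddings, new category names can be added at inference time without retraining, which is what makes this style of classifier open-vocabulary.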
Taxonomy
Topics: Natural Language Processing Techniques
Methods: Contrastive Language-Image Pre-training · Segment Anything Model
