Open-vocabulary Panoptic Segmentation with Embedding Modulation
Xi Chen, Shuang Li, Ser-Nam Lim, Antonio Torralba, Hengshuang Zhao

TL;DR
This paper introduces OPSNet, a data-efficient framework for open-vocabulary panoptic segmentation that leverages embedding modulation and CLIP to achieve state-of-the-art results across multiple datasets.
Contribution
The paper proposes OPSNet with an Embedding Modulation module, enabling effective open- and closed-vocabulary segmentation with less data and improved performance.
Findings
Achieves state-of-the-art results on multiple datasets
Effective embedding enhancement via Embedding Modulation
Superior performance with fewer additional data requirements
Abstract
Open-vocabulary image segmentation is attracting increasing attention due to its critical applications in the real world. Traditional closed-vocabulary segmentation methods are not able to characterize novel objects, whereas several recent open-vocabulary attempts obtain unsatisfactory results, i.e., notable performance reduction on the closed vocabulary and massive demand for extra data. To this end, we propose OPSNet, an omnipotent and data-efficient framework for Open-vocabulary Panoptic Segmentation. Specifically, the exquisitely designed Embedding Modulation module, together with several meticulous components, enables adequate embedding enhancement and information exchange between the segmentation model and the visual-linguistic well-aligned CLIP encoder, resulting in superior segmentation performance under both open- and closed-vocabulary settings with much fewer need of…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques
MethodsContrastive Language-Image Pre-training
