EOV-Seg: Efficient Open-Vocabulary Panoptic Segmentation
Hongwei Niu, Jie Hu, Jianghang Lin, Guannan Jiang, Shengchuan Zhang

TL;DR
EOV-Seg introduces a fast, efficient single-stage open-vocabulary panoptic segmentation framework that leverages novel modules for improved semantic understanding and spatial awareness, achieving state-of-the-art speed and competitive accuracy.
Contribution
The paper presents EOV-Seg, the first efficient single-stage open-vocabulary panoptic segmentation framework with novel modules for semantic and spatial feature integration.
Findings
EOV-Seg achieves 24.5 PQ and 11.6 FPS on ADE20K.
It is 4-19 times faster than previous methods.
Runs at 23.8 FPS with ResNet50 backbone on a single GPU.
Abstract
Open-vocabulary panoptic segmentation aims to segment and classify everything in diverse scenes across an unbounded vocabulary. Existing methods typically employ two-stage or single-stage framework. The two-stage framework involves cropping the image multiple times using masks generated by a mask generator, followed by feature extraction, while the single-stage framework relies on a heavyweight mask decoder to make up for the lack of spatial position information through self-attention and cross-attention in multiple stacked Transformer blocks. Both methods incur substantial computational overhead, thereby hindering the efficiency of model inference. To fill the gap in efficiency, we propose EOV-Seg, a novel single-stage, shared, efficient, and spatialaware framework designed for open-vocabulary panoptic segmentation. Specifically, EOV-Seg innovates in two aspects. First, a…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsAdvanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques · Multimodal Machine Learning Applications
MethodsAttention Is All You Need · Adam · Dropout · Position-Wise Feed-Forward Layer · Softmax · Dense Connections · Byte Pair Encoding · Linear Layer · Multi-Head Attention · Label Smoothing
