EOV-Seg: Efficient Open-Vocabulary Panoptic Segmentation

Hongwei Niu; Jie Hu; Jianghang Lin; Guannan Jiang; Shengchuan Zhang

arXiv:2412.08628·cs.CV·December 17, 2024

TL;DR

EOV-Seg is a single-stage open-vocabulary panoptic segmentation framework built for efficiency. It adds new modules for semantic understanding and spatial awareness, and it runs 4-19 times faster than previous methods while keeping accuracy competitive.

Contribution

The paper presents EOV-Seg, the first efficient single-stage open-vocabulary panoptic segmentation framework with novel modules for semantic and spatial feature integration.

Findings

01

EOV-Seg achieves 24.5 PQ and 11.6 FPS on ADE20K.

02

It is 4-19 times faster than previous methods.

03

With a ResNet50 backbone, EOV-Seg runs at 23.8 FPS on a single GPU.
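
The throughput figures above can be turned into per-image latencies with a little arithmetic. The sketch below is illustrative only: the helper names `latency_ms` and `implied_baseline_fps` are hypothetical, and it assumes the 4-19x speedup is measured against the 11.6 FPS ADE20K figure, which this page does not state explicitly.

```python
def latency_ms(fps: float) -> float:
    """Average per-image latency in milliseconds for a given throughput."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


def implied_baseline_fps(fps: float, speedup: float) -> float:
    """Throughput of a method that is `speedup` times slower than `fps`."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return fps / speedup


# Figures reported in the findings above.
ade20k_fps = 11.6    # EOV-Seg on ADE20K
resnet50_fps = 23.8  # EOV-Seg with a ResNet50 backbone

ade20k_latency = latency_ms(ade20k_fps)
resnet50_latency = latency_ms(resnet50_fps)

# ASSUMPTION: the 4-19x range is relative to the 11.6 FPS ADE20K result.
fastest = implied_baseline_fps(ade20k_fps, 4)
slowest = implied_baseline_fps(ade20k_fps, 19)

print(f"ADE20K latency:   {ade20k_latency:.1f} ms per image")
print(f"ResNet50 latency: {resnet50_latency:.1f} ms per image")
print(f"Implied baseline range: {slowest:.2f} to {fastest:.2f} FPS")
```

Under that assumption, 11.6 FPS corresponds to roughly 86 ms per image and 23.8 FPS to roughly 42 ms, while the compared methods would fall between about 0.6 and 2.9 FPS.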

Abstract

Open-vocabulary panoptic segmentation aims to segment and classify everything in diverse scenes across an unbounded vocabulary. Existing methods typically employ either a two-stage or a single-stage framework. The two-stage framework crops the image multiple times using masks produced by a mask generator and then extracts features from each crop. The single-stage framework relies on a heavyweight mask decoder that compensates for the lack of spatial position information through self-attention and cross-attention in multiple stacked Transformer blocks. Both frameworks incur substantial computational overhead, which slows model inference. To close this efficiency gap, we propose EOV-Seg, a novel single-stage, shared, efficient, and spatial-aware framework designed for open-vocabulary panoptic segmentation. Specifically, EOV-Seg innovates in two aspects. First, a…

Code & Models

Repositories

nhw649/eov-seg
PyTorch · Official

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques · Multimodal Machine Learning Applications

Methods: Attention Is All You Need · Adam · Dropout · Position-Wise Feed-Forward Layer · Softmax · Dense Connections · Byte Pair Encoding · Linear Layer · Multi-Head Attention · Label Smoothing