Open-Vocabulary Panoptic Segmentation Using BERT Pre-Training of Vision-Language Multiway Transformer Model

Yi-Chia Chen; Wei-Hua Li; Chu-Song Chen

arXiv:2412.18917·cs.CV·December 30, 2024

TL;DR

This paper introduces OMTSeg, a novel open-vocabulary panoptic segmentation model that leverages BEiT-3's vision-language pre-training and cross-modal attention to improve generalization to unlimited classes.

Contribution

It proposes using BEiT-3's cross-modal attention between visual and linguistic features for open-vocabulary segmentation, as an alternative to the CLIP-based foundation models used by recent methods.

Findings

01

OMTSeg performs favorably against state-of-the-art models on open-vocabulary segmentation tasks.

02

OMTSeg uses BEiT-3's vision-language pre-training to generalize better to unseen classes.

03

The experiments indicate that cross-modal attention between visual and linguistic features improves segmentation performance.

Abstract

Open-vocabulary panoptic segmentation remains a challenging problem. One of the biggest difficulties lies in training models to generalize to an unlimited number of classes using limited categorized training data. Recent popular methods involve large-scale vision-language pre-trained foundation models, such as CLIP. In this paper, we propose OMTSeg for open-vocabulary segmentation using another large-scale vision-language pre-trained model, BEiT-3, and leverage the cross-modal attention between visual and linguistic features in BEiT-3 to achieve better performance. Experimental results demonstrate that OMTSeg performs favorably against state-of-the-art models.
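
To make the idea of cross-modal attention concrete, here is a minimal sketch in which each visual token attends over a set of text tokens using scaled dot-product attention. It is a generic illustration only: the abstract does not describe how OMTSeg or BEiT-3 implement attention, so the function name `cross_attention`, the toy 2-D embeddings, and the absence of learned projection matrices are all assumptions made for this example.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(visual, text):
    """Let each visual token attend over all text tokens.

    Hypothetical helper for illustration: queries are the visual tokens,
    keys and values are the text tokens, with no learned projections.
    Returns (attention weights, attended outputs).
    """
    d = len(text[0])
    weights, outputs = [], []
    for q in visual:
        # Scaled dot-product score between this visual token and every text token.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in text]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of text tokens gives a text-aware visual feature.
        outputs.append([sum(w_j * k[i] for w_j, k in zip(w, text)) for i in range(d)])
    return weights, outputs


# Made-up toy embeddings: two visual tokens and two class-name tokens.
visual_tokens = [[1.0, 0.0], [0.0, 1.0]]
text_tokens = [[2.0, 0.0], [0.0, 2.0]]
weights, attended = cross_attention(visual_tokens, text_tokens)
```

In this toy setup each visual token places most of its attention on the text token it is aligned with, which is the intuition behind matching image regions to arbitrary class names supplied as text.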

Code & Models

Repositories

ai-application-and-integration-lab/omtseg
PyTorch · Official

Taxonomy

Topics: Image Retrieval and Classification Techniques

Methods: Softmax · Attention Is All You Need · Contrastive Language-Image Pre-training