Open-Vocabulary Panoptic Segmentation Using BERT Pre-Training of Vision-Language Multiway Transformer Model
Yi-Chia Chen, Wei-Hua Li, Chu-Song Chen

TL;DR
This paper introduces OMTSeg, a novel open-vocabulary panoptic segmentation model that leverages BEiT-3's vision-language pre-training and cross-modal attention to improve generalization to unlimited classes.
Contribution
The paper proposes using BEiT-3's cross-modal attention for open-vocabulary panoptic segmentation, as an alternative to the CLIP-based foundation models used by recent methods.
Findings
OMTSeg performs favorably against state-of-the-art models on open-vocabulary segmentation tasks.
OMTSeg uses BEiT-3's vision-language pre-training to improve generalization to unseen classes.
The experiments indicate that cross-modal attention between visual and linguistic features contributes to segmentation performance.
Abstract
Open-vocabulary panoptic segmentation remains a challenging problem. One of the biggest difficulties lies in training models to generalize to an unlimited number of classes using limited categorized training data. Recent popular methods involve large-scale vision-language pre-trained foundation models, such as CLIP. In this paper, we propose OMTSeg for open-vocabulary segmentation using another large-scale vision-language pre-trained model called BEiT-3, leveraging the cross-modal attention between visual and linguistic features in BEiT-3 to achieve better performance. Experimental results demonstrate that OMTSeg performs favorably against state-of-the-art models.
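Illustration
The abstract refers to cross-modal attention between visual and linguistic features. As a rough illustration only, the toy sketch below runs single-head self-attention over a concatenated sequence of visual and text tokens, so that each visual token can attend to each text token. It is not OMTSeg's or BEiT-3's implementation: the function names, the tiny 3-dimensional vectors, and the absence of learned projections are all assumptions made to keep the example short.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def joint_attention(visual_tokens, text_tokens):
    """Single-head self-attention over visual and text tokens together.

    Hypothetical helper: queries, keys, and values are the raw token
    vectors (no learned projections). Returns (outputs, weights), where
    weights[i][j] is how strongly token i attends to token j.
    """
    tokens = visual_tokens + text_tokens
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    outputs, weights = [], []
    for query in tokens:
        # Scaled dot-product score of this query against every token.
        scores = [scale * sum(a * b for a, b in zip(query, key)) for key in tokens]
        row = softmax(scores)
        weights.append(row)
        # Attention output is the weighted average of all token vectors.
        outputs.append(
            [sum(row[j] * tokens[j][d] for j in range(len(tokens))) for d in range(dim)]
        )
    return outputs, weights


# Made-up data: two "image patch" vectors and two "class name" vectors.
visual = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
text = [[0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]

outputs, weights = joint_attention(visual, text)

# The cross-modal block of the attention matrix: visual queries, text keys.
n_vis = len(visual)
visual_to_text = [row[n_vis:] for row in weights[:n_vis]]
```

In this toy setup, the first visual token places more weight on the first text token than on the second, because their vectors are closer. In a real model the tokens would come from learned embeddings and projections rather than hand-written vectors.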
Taxonomy
Topics: Image Retrieval and Classification Techniques
Methods: Softmax · Attention Is All You Need · Contrastive Language-Image Pre-training
