TL;DR
This paper introduces VPS-Transformer, a hybrid convolutional and Transformer-based model for video panoptic segmentation that improves temporal consistency and quality with efficient attention mechanisms.
Contribution
The paper presents a novel hybrid architecture combining CNNs and Transformers for video panoptic segmentation, with new attention schemes for efficiency and improved temporal modeling.
Findings
Improves video panoptic quality by 2.2% on Cityscapes-VPS
Enhances temporal consistency with minimal additional computation
Demonstrates effectiveness of Transformer-based modules in video segmentation
Abstract
We propose a novel solution for the task of video panoptic segmentation that simultaneously predicts pixel-level semantic and instance segmentation and generates clip-level instance tracks. Our network, named VPS-Transformer, has a hybrid architecture based on the state-of-the-art panoptic segmentation network Panoptic-DeepLab. It combines a convolutional architecture for single-frame panoptic segmentation with a novel video module based on an instantiation of the pure Transformer block. The Transformer, equipped with attention mechanisms, models spatio-temporal relations between backbone output features of current and past frames for more accurate and consistent panoptic estimates. As the pure Transformer block introduces a large computational overhead when processing high-resolution images, we propose a few design changes for more efficient computation. We study how to aggregate information…
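To make the idea of attending from current-frame features to past-frame features concrete, here is a minimal sketch in plain Python. It is not the VPS-Transformer video module: it assumes a single attention head, no learned query/key/value projections, and none of the efficiency changes the abstract mentions. The names `cross_frame_attention`, `current`, and `past_frames` are hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_frame_attention(current, past_frames):
    """Scaled dot-product attention from current-frame features to a memory.

    ASSUMPTION: illustrative only; single head, no learned projections.

    current:     list of N feature vectors, one per spatial location
    past_frames: list of frames, each a list of feature vectors
    Returns N vectors: each input plus its attention-weighted context
    (a residual connection).
    """
    # The memory holds every location from the past frames and the current frame.
    memory = [vec for frame in past_frames for vec in frame] + list(current)
    dim = len(current[0])
    scale = 1.0 / math.sqrt(dim)

    refined = []
    for query in current:
        # Similarity of this location to every memory location.
        scores = [scale * sum(a * b for a, b in zip(query, key)) for key in memory]
        weights = softmax(scores)
        # Weighted sum of memory vectors, computed per feature dimension.
        context = [
            sum(w * value[d] for w, value in zip(weights, memory))
            for d in range(dim)
        ]
        refined.append([x + c for x, c in zip(query, context)])
    return refined


if __name__ == "__main__":
    current = [[1.0, 0.0], [0.0, 1.0]]   # two locations, two channels
    past = [[[1.0, 0.0], [0.0, 1.0]]]    # one past frame with the same layout
    for vec in cross_frame_attention(current, past):
        print(vec)
```

Each current-frame location ends up as a mixture that leans toward the memory locations most similar to it, which is the basic mechanism by which attention can propagate information across frames. Note that the cost grows with the product of the number of query locations and memory locations, which is why full attention becomes expensive on high-resolution feature maps.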
Videos
Time-Space Transformers for Video Panoptic Segmentation · YouTube
Taxonomy
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Residual Connection · Label Smoothing · Adam · Dense Connections · Softmax · Byte Pair Encoding · Position-Wise Feed-Forward Layer
