Time-Space Transformers for Video Panoptic Segmentation

Andra Petrovai; Sergiu Nedevschi

arXiv:2210.03546·cs.CV·October 10, 2022

TL;DR

This paper introduces VPS-Transformer, a hybrid convolutional and Transformer-based model for video panoptic segmentation that uses efficient attention mechanisms to improve temporal consistency and video panoptic quality.

Contribution

The paper presents a novel hybrid architecture combining CNNs and Transformers for video panoptic segmentation, with new attention schemes for efficiency and improved temporal modeling.

Findings

1. Improves video panoptic quality by 2.2% on Cityscapes-VPS
2. Enhances temporal consistency with minimal additional computation
3. Demonstrates effectiveness of Transformer-based modules in video segmentation

Abstract

We propose a novel solution for the task of video panoptic segmentation that simultaneously predicts pixel-level semantic and instance segmentation and generates clip-level instance tracks. Our network, named VPS-Transformer, has a hybrid architecture based on the state-of-the-art panoptic segmentation network Panoptic-DeepLab: it combines a convolutional architecture for single-frame panoptic segmentation with a novel video module based on an instantiation of the pure Transformer block. The Transformer, equipped with attention mechanisms, models spatio-temporal relations between the backbone output features of the current and past frames for more accurate and consistent panoptic estimates. As the pure Transformer block introduces a large computational overhead when processing high-resolution images, we propose a few design changes for more efficient computation. We study how to aggregate information…
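
To make the idea of attending across current and past frame features concrete, here is a minimal NumPy sketch of single-head space-time attention: queries come from the current frame, while keys and values come from the current frame together with the past frames. This is plain full attention written for illustration only. It is not the authors' video module, and it omits the efficiency-oriented design changes the abstract mentions, which are not described in this excerpt. The function name `space_time_attention`, the projection matrices, and all shapes are assumptions.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def space_time_attention(current, past, w_q, w_k, w_v):
    """Illustrative single-head attention over space and time.

    current: (H, W, C) feature map of the current frame (hypothetical backbone output)
    past:    (T, H, W, C) feature maps of T past frames
    w_q, w_k, w_v: (C, D) projection matrices (hypothetical learned weights)
    Returns the attended features (H, W, D) and the attention weights
    (H*W, (T+1)*H*W).
    """
    h, w, c = current.shape

    # Queries: one per spatial location of the current frame.
    queries = current.reshape(h * w, c) @ w_q

    # Memory: every spatial location of the current and past frames.
    memory = np.concatenate([current[None], past], axis=0).reshape(-1, c)
    keys = memory @ w_k
    values = memory @ w_v

    # Scaled dot-product attention across all space-time positions.
    scores = queries @ keys.T / np.sqrt(w_q.shape[1])
    weights = softmax(scores, axis=-1)
    attended = weights @ values

    return attended.reshape(h, w, -1), weights


# Toy usage with random features: a 4x8 feature map, 16 channels, 2 past frames.
rng = np.random.default_rng(0)
current = rng.standard_normal((4, 8, 16))
past = rng.standard_normal((2, 4, 8, 16))
w_q, w_k, w_v = (rng.standard_normal((16, 16)) * 0.1 for _ in range(3))

out, weights = space_time_attention(current, past, w_q, w_k, w_v)
print(out.shape)      # (4, 8, 16)
print(weights.shape)  # (32, 96)
```

The attention matrix has H·W rows and (T+1)·H·W columns, so its size grows quadratically with spatial resolution. That growth is the overhead the abstract refers to when it says the pure Transformer block is expensive on high-resolution images.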

Taxonomy

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Residual Connection · Label Smoothing · Adam · Dense Connections · Softmax · Byte Pair Encoding · Position-Wise Feed-Forward Layer