PolyphonicFormer: Unified Query Learning for Depth-aware Video Panoptic Segmentation

Haobo Yuan; Xiangtai Li; Yibo Yang; Guangliang Cheng; Jing Zhang; Yunhai Tong; Lefei Zhang; Dacheng Tao

arXiv:2112.02582·cs.CV·December 29, 2022

TL;DR

PolyphonicFormer introduces a unified transformer-based approach for depth-aware video panoptic segmentation, effectively integrating depth prediction and segmentation to improve robustness and achieve state-of-the-art results.

Contribution

It proposes a novel paradigm of predicting instance-level depth maps with object queries, unifying depth estimation and panoptic segmentation under a transformer framework.
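
The idea of query-predicted, instance-level depth can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name, the pairing of one mask query with one depth query, the sigmoid scaling, the `max_depth` value of 80, and the argmax-based merging into a dense depth map are all assumptions made for illustration only.

```python
import numpy as np

def instance_depth_from_queries(mask_queries, depth_queries, features, max_depth=80.0):
    """Hypothetical sketch of query-based instance-level depth prediction.

    mask_queries:  (N, C) one embedding per object query, used for masks
    depth_queries: (N, C) one paired embedding per object query, used for depth
    features:      (C, H, W) per-pixel feature map from a backbone
    max_depth:     assumed upper bound on depth (illustrative value)
    """
    # Each query yields one mask-logit map via a dot product with pixel features.
    mask_logits = np.einsum("nc,chw->nhw", mask_queries, features)

    # Each paired depth query yields one instance-level depth map,
    # squashed into the range (0, max_depth) with a sigmoid.
    depth_logits = np.einsum("nc,chw->nhw", depth_queries, features)
    instance_depth = max_depth / (1.0 + np.exp(-depth_logits))

    # Assign every pixel to the query with the highest mask logit, then
    # read the depth of that pixel from the winning query's depth map.
    ids = mask_logits.argmax(axis=0)
    dense_depth = np.take_along_axis(instance_depth, ids[None], axis=0)[0]
    return ids, dense_depth, instance_depth

# Usage with random inputs (shapes only; the values carry no meaning).
rng = np.random.default_rng(0)
N, C, H, W = 3, 8, 4, 5
ids, depth, inst = instance_depth_from_queries(
    rng.normal(size=(N, C)), rng.normal(size=(N, C)), rng.normal(size=(C, H, W))
)
```

In this sketch, depth is never predicted as a single dense map; each query produces its own depth map, and the segmentation result decides which map is read at each pixel. That coupling is what the summary above means by unifying the two tasks through object queries.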

Findings

01. Achieves state-of-the-art results on Semantic KITTI and Cityscapes datasets.

02. Ranks 1st on the ICCV-2021 BMTT Challenge video + depth track.

03. Demonstrates benefits in both depth estimation and panoptic segmentation.

Abstract

Depth-aware Video Panoptic Segmentation (DVPS) is a new and challenging vision problem that aims to predict panoptic segmentation and depth in a video simultaneously. Previous work solves this task by extending an existing panoptic segmentation method with an extra dense depth prediction head and an instance tracking head. However, the relationship between depth and panoptic segmentation is not well explored: simply combining existing methods leads to competition between the tasks and requires careful weight balancing. In this paper, we present PolyphonicFormer, a vision transformer that unifies these sub-tasks under the DVPS task and leads to more robust results. Our principal insight is that depth can be harmonized with panoptic segmentation through our proposed paradigm of predicting instance-level depth maps with object queries. Then the relationship between the two tasks via query-based…

Code & Models

Repositories

harboryuan/polyphonicformer (PyTorch, official)

Taxonomy

Methods: Attention Is All You Need · Linear Layer · Softmax · Residual Connection · Layer Normalization · Dense Connections · Multi-Head Attention · Vision Transformer