TL;DR
PolyphonicFormer introduces a unified transformer-based approach for depth-aware video panoptic segmentation, effectively integrating depth prediction and segmentation to improve robustness and achieve state-of-the-art results.
Contribution
It proposes a novel paradigm of predicting instance-level depth maps with object queries, unifying depth estimation and panoptic segmentation under a transformer framework.
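To make the idea of instance-level depth concrete, here is a minimal sketch. It is not the authors' implementation. It assumes that each object query yields one mask-score map and one depth map over the image, and that the dense depth map is formed by reading, at each pixel, the depth of the query that wins that pixel. The function name `compose_depth`, the argmax merging rule, and the toy values are all illustrative assumptions.

```python
def compose_depth(mask_logits, instance_depths):
    """Merge per-query depth maps into one dense depth map (illustrative only).

    mask_logits[q][y][x]     : score that pixel (y, x) belongs to query q
    instance_depths[q][y][x] : depth predicted by query q at pixel (y, x)
    Returns (panoptic, depth), where panoptic[y][x] is the winning query index.
    """
    num_queries = len(mask_logits)
    height = len(mask_logits[0])
    width = len(mask_logits[0][0])

    panoptic = [[0] * width for _ in range(height)]
    depth = [[0.0] * width for _ in range(height)]

    for y in range(height):
        for x in range(width):
            # Assumed merging rule: the query with the highest mask score owns the pixel.
            best = max(range(num_queries), key=lambda q: mask_logits[q][y][x])
            panoptic[y][x] = best
            # The pixel's depth comes from the owning query's own depth map.
            depth[y][x] = instance_depths[best][y][x]
    return panoptic, depth


# Toy example: two hypothetical queries on a 1x2 image.
mask_logits = [[[2.0, -1.0]], [[0.5, 3.0]]]
instance_depths = [[[10.0, 10.5]], [[40.0, 42.0]]]
panoptic, depth = compose_depth(mask_logits, instance_depths)
print(panoptic)  # [[0, 1]]
print(depth)     # [[10.0, 42.0]]
```

Because segmentation and depth are both read out through the same query index, an error in mask assignment directly changes which depth value a pixel receives. This is one way to picture why the two tasks are coupled in a query-based design.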
Findings
Achieves state-of-the-art results on Semantic KITTI and Cityscapes datasets.
Ranks 1st on the ICCV-2021 BMTT Challenge video + depth track.
Demonstrates benefits in both depth estimation and panoptic segmentation.
Abstract
Depth-aware Video Panoptic Segmentation (DVPS) is a new and challenging vision problem that aims to predict panoptic segmentation and depth in a video simultaneously. Previous work solves this task by extending an existing panoptic segmentation method with an extra dense depth prediction head and an instance tracking head. However, the relationship between depth and panoptic segmentation is not well explored: simply combining existing methods leads to competition between the tasks and requires careful weight balancing. In this paper, we present PolyphonicFormer, a vision transformer that unifies these sub-tasks under the DVPS task and leads to more robust results. Our principal insight is that depth can be harmonized with panoptic segmentation through our proposed new paradigm of predicting instance-level depth maps with object queries. Then the relationship between the two tasks via query-based…
Taxonomy
Methods: Attention Is All You Need · Linear Layer · Softmax · Residual Connection · Layer Normalization · Dense Connections · Multi-Head Attention · Vision Transformer
