Balancing Shared and Task-Specific Representations: A Hybrid Approach to Depth-Aware Video Panoptic Segmentation

Kurt H.W. Stolle (Eindhoven University of Technology)

arXiv:2412.07966 · cs.CV · December 12, 2024

Open Access

TL;DR

This paper introduces Multiformer, a hybrid transformer-based model for depth-aware video panoptic segmentation that balances shared and task-specific representations and achieves state-of-the-art results on Cityscapes-DVPS and SemKITTI-DVPS.

Contribution

The paper proposes a hybrid multi-task decoder in which task-specific branches operate within each decoder block and are fused into a shared representation at the block interfaces.

Findings

01. Outperforms the previous best result by 3.0 DVPQ points with a ResNet-50 backbone.

02. Improves by a further 4.0 DVPQ points with a Swin-B backbone.

03. Achieves state-of-the-art performance on the Cityscapes-DVPS and SemKITTI-DVPS datasets.

Abstract

In this work, we present Multiformer, a novel approach to depth-aware video panoptic segmentation (DVPS) based on the mask transformer paradigm. Our method learns object representations that are shared across segmentation, monocular depth estimation, and object tracking subtasks. In contrast to recent unified approaches that progressively refine a common object representation, we propose a hybrid method using task-specific branches within each decoder block, ultimately fusing them into a shared representation at the block interfaces. Extensive experiments on the Cityscapes-DVPS and SemKITTI-DVPS datasets demonstrate that Multiformer achieves state-of-the-art performance across all DVPS metrics, outperforming previous methods by substantial margins. With a ResNet-50 backbone, Multiformer surpasses the previous best result by 3.0 DVPQ points while also improving depth estimation accuracy.…
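The hybrid decoding idea described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names (`init_block`, `hybrid_block`, `decode`), the linear-plus-tanh branches, and the concatenate-and-project fusion are all assumptions chosen for brevity, and a real mask-transformer decoder would use attention over image features instead. The sketch only shows the control flow: each block runs one branch per subtask on the shared object queries, then fuses the branch outputs back into a single shared representation before the next block.

```python
import numpy as np

TASKS = ("segmentation", "depth", "tracking")

def init_block(rng, dim):
    """Hypothetical parameters for one decoder block (illustrative only)."""
    return {
        # One weight matrix per task-specific branch.
        "branches": {t: rng.standard_normal((dim, dim)) / np.sqrt(dim) for t in TASKS},
        # Projection that fuses the concatenated branch outputs back to `dim`.
        "fuse": rng.standard_normal((len(TASKS) * dim, dim)) / np.sqrt(len(TASKS) * dim),
    }

def hybrid_block(queries, params):
    """Run task-specific branches, then fuse them at the block interface."""
    # Each branch refines its own view of the shared object queries.
    task_feats = {t: np.tanh(queries @ w) for t, w in params["branches"].items()}
    # Fusion (assumed here: concatenate, project, add residual).
    stacked = np.concatenate([task_feats[t] for t in TASKS], axis=-1)
    shared = queries + stacked @ params["fuse"]
    return shared, task_feats

def decode(queries, blocks):
    """Pass the shared queries through a stack of hybrid blocks."""
    task_feats = {}
    for params in blocks:
        queries, task_feats = hybrid_block(queries, params)
    return queries, task_feats

rng = np.random.default_rng(0)
num_queries, dim = 4, 8
blocks = [init_block(rng, dim) for _ in range(3)]
shared, per_task = decode(rng.standard_normal((num_queries, dim)), blocks)
```

After the call, `shared` holds one fused vector per object query, and `per_task` holds the last block's branch outputs, which task heads for masks, depth, and tracking could consume in a fuller model.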

Taxonomy

Topics: Video Analysis and Summarization · Visual Attention and Saliency Detection · Cinema and Media Studies