TubeFormer-DeepLab: Video Mask Transformer
Dahun Kim, Jun Xie, Huiyu Wang, Siyuan Qiao, Qihang Yu, Hong-Seok Kim, Hartwig Adam, In So Kweon, Liang-Chieh Chen

TL;DR
TubeFormer-DeepLab introduces a unified transformer-based approach for multiple video segmentation tasks, simplifying models and achieving state-of-the-art results across benchmarks by predicting video tubes with task-specific labels.
Contribution
It presents the first unified transformer model for diverse video segmentation tasks, linking segmentation masks into tubes and directly predicting task-specific labels.
Findings
Achieves state-of-the-art results on multiple benchmarks.
Simplifies video segmentation models by unifying tasks.
Effectively predicts video tubes with task-specific labels.
Abstract
We present TubeFormer-DeepLab, the first attempt to tackle multiple core video segmentation tasks in a unified manner. Different video segmentation tasks (e.g., video semantic/instance/panoptic segmentation) are usually considered as distinct problems. State-of-the-art models adopted in the separate communities have diverged, and radically different approaches dominate in each task. By contrast, we make a crucial observation that video segmentation tasks could be generally formulated as the problem of assigning different predicted labels to video tubes (where a tube is obtained by linking segmentation masks along the time axis) and the labels may encode different values depending on the target task. The observation motivates us to develop TubeFormer-DeepLab, a simple and effective video mask transformer model that is widely applicable to multiple video segmentation tasks.…
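The tube formulation in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function name `assign_tubes`, the nested-list score layout `[tube][frame][row][col]`, and the per-pixel "highest-scoring tube wins" rule are all assumptions made for illustration. It only shows how one set of tubes, each carrying a single label, yields a labeling for every pixel of every frame, with the meaning of the label depending on the task (a semantic class, an instance id, or both).

```python
from typing import List, Tuple

# Hypothetical layout: scores[tube][frame][row][col]
Scores = List[List[List[List[float]]]]
Volume = List[List[List[int]]]  # indexed as [frame][row][col]


def assign_tubes(scores: Scores, tube_labels: List[int]) -> Tuple[Volume, Volume]:
    """Give every pixel of every frame the id and label of its best-scoring tube.

    Because a tube spans the whole clip, a pixel region keeps the same tube id
    across frames, which is what links per-frame masks along the time axis.
    """
    num_tubes = len(scores)
    num_frames = len(scores[0])
    height = len(scores[0][0])
    width = len(scores[0][0][0])

    ids = [[[0] * width for _ in range(height)] for _ in range(num_frames)]
    labels = [[[0] * width for _ in range(height)] for _ in range(num_frames)]

    for t in range(num_frames):
        for y in range(height):
            for x in range(width):
                # Assumed rule: pick the tube with the highest score at this pixel.
                best = max(range(num_tubes), key=lambda n: scores[n][t][y][x])
                ids[t][y][x] = best
                labels[t][y][x] = tube_labels[best]
    return ids, labels


# Two tubes over a clip of 2 frames, each frame 1 x 2 pixels.
example_scores = [
    [[[0.9, 0.2]], [[0.8, 0.3]]],  # tube 0 covers the left pixel in both frames
    [[[0.1, 0.8]], [[0.2, 0.7]]],  # tube 1 covers the right pixel in both frames
]
example_labels = [3, 7]  # task-specific label per tube (here: arbitrary class ids)

tube_ids, pixel_labels = assign_tubes(example_scores, example_labels)
```

In this sketch, switching tasks would change only what `tube_labels` encodes, not the assignment step itself.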
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Visual Attention and Saliency Detection
