Improving Video Instance Segmentation via Temporal Pyramid Routing
Xiangtai Li, Hao He, Yibo Yang, Henghui Ding, Kuiyuan Yang, Guangliang Cheng, Yunhai Tong, Dacheng Tao

TL;DR
This paper introduces Temporal Pyramid Routing (TPR), a strategy that aligns and aggregates the multi-scale feature pyramids of two adjacent frames to improve video instance segmentation. It is packaged as a lightweight, plug-and-play module that can be added to existing methods.
Contribution
The paper proposes the TPR strategy with two components, Dynamic Aligned Cell Routing (DACR) and Cross Pyramid Routing (CPR), which align and aggregate features across time and scale for video segmentation.
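DACR is described as aligning and gating pyramid features across the temporal dimension. The following is a minimal NumPy sketch of that general idea for a single pyramid level, not the authors' implementation: the frame t-1 feature is warped toward frame t with per-pixel offsets (bilinear sampling), and a sigmoid gate decides, pixel by pixel, how much of the aligned feature to add to the current one. The function names, the gate parameters `gate_w`/`gate_b`, and the externally supplied `offsets` are hypothetical; in a trained network the offsets and the gate would be predicted by learned layers.

```python
import numpy as np


def bilinear_warp(feat, offsets):
    """Sample feat (C, H, W) at (y + dy, x + dx) with bilinear interpolation.

    offsets has shape (2, H, W): offsets[0] holds dy, offsets[1] holds dx.
    Sampling positions are clamped to the feature-map border.
    """
    _, h, w = feat.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    y = np.clip(ys + offsets[0], 0, h - 1)
    x = np.clip(xs + offsets[1], 0, w - 1)
    y0 = np.floor(y).astype(int)
    x0 = np.floor(x).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy, wx = y - y0, x - x0  # fractional parts act as interpolation weights
    top = feat[:, y0, x0] * (1 - wx) + feat[:, y0, x1] * wx
    bottom = feat[:, y1, x0] * (1 - wx) + feat[:, y1, x1] * wx
    return top * (1 - wy) + bottom * wy


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def gated_temporal_fuse(cur, prev, offsets, gate_w, gate_b=0.0):
    """Align `prev` to `cur`, then add it back through a per-pixel gate.

    cur, prev: (C, H, W) features of one pyramid level at frames t and t-1.
    gate_w: (2C,) hypothetical 1x1-conv weights producing one gate logit per pixel.
    """
    aligned = bilinear_warp(prev, offsets)
    stacked = np.concatenate([cur, aligned], axis=0)  # (2C, H, W)
    gate = sigmoid(np.tensordot(gate_w, stacked, axes=1) + gate_b)  # (H, W), in (0, 1)
    return cur + gate[None] * aligned


rng = np.random.default_rng(0)
cur = rng.standard_normal((4, 8, 8))
prev = rng.standard_normal((4, 8, 8))
offsets = np.zeros((2, 8, 8))  # hypothetical: a trained model would predict these
fused = gated_temporal_fuse(cur, prev, offsets, gate_w=np.zeros(8))
print(fused.shape)  # (4, 8, 8)
```

Adding the gated, aligned feature as a residual keeps the current frame's feature intact when the gate closes (values near 0), which is one simple way to keep poorly aligned history from overwriting the current frame.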
Findings
Improves performance on the YouTube-VIS and Cityscapes-VPS datasets.
Achieves state-of-the-art results with efficient computation.
Easily integrable into existing segmentation frameworks.
Abstract
Video Instance Segmentation (VIS) is a new and inherently multi-task problem, which aims to detect, segment, and track each instance in a video sequence. Existing approaches are mainly based on single-frame features or single-scale features of multiple frames, where either temporal information or multi-scale information is ignored. To incorporate both temporal and scale information, we propose a Temporal Pyramid Routing (TPR) strategy to conditionally align and conduct pixel-level aggregation from a feature pyramid pair of two adjacent frames. Specifically, TPR contains two novel components, including Dynamic Aligned Cell Routing (DACR) and Cross Pyramid Routing (CPR), where DACR is designed for aligning and gating pyramid features across temporal dimension, while CPR transfers temporally aggregated features across scale dimension. Moreover, our approach is a light-weight and…
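The abstract says CPR transfers temporally aggregated features across the scale dimension. Below is a minimal NumPy sketch of one generic way to move information between pyramid levels, assuming every level has the same channel count and the levels differ in size by powers of two: each level is rebuilt as a softmax-weighted sum of all levels resized to its resolution. The names `resize` and `cross_scale_route` and the routing `logits` are hypothetical; in a trained model the weights would come from learned layers, and this is a stand-in for illustration, not the paper's CPR design.

```python
import numpy as np


def resize(feat, out_hw):
    """Resize (C, H, W) to out_hw with 2x nearest-neighbour upsampling or 2x2 average pooling.

    Assumes pyramid levels differ by powers of two in both H and W.
    """
    while feat.shape[1] < out_hw[0]:
        feat = feat.repeat(2, axis=1).repeat(2, axis=2)
    while feat.shape[1] > out_hw[0]:
        c, h, w = feat.shape
        feat = feat.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))
    assert feat.shape[1:] == tuple(out_hw), "levels must differ by powers of two"
    return feat


def cross_scale_route(levels, logits):
    """Rebuild every pyramid level as a softmax-weighted sum of all (resized) levels.

    levels: list of L arrays (C, H_l, W_l), finest first.
    logits: (L, L) hypothetical routing scores; row i scores each source level for target i.
    """
    weights = np.exp(logits - logits.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)  # each row sums to 1
    routed = []
    for i, target in enumerate(levels):
        hw = target.shape[1:]
        routed.append(sum(weights[i, j] * resize(src, hw) for j, src in enumerate(levels)))
    return routed


rng = np.random.default_rng(0)
levels = [rng.standard_normal((4, 16 // 2**k, 16 // 2**k)) for k in range(3)]  # 16, 8, 4
routed = cross_scale_route(levels, logits=np.zeros((3, 3)))
print([r.shape for r in routed])  # [(4, 16, 16), (4, 8, 8), (4, 4, 4)]
```

Because each row of routing weights sums to 1, a level whose logits strongly favour itself is returned nearly unchanged, while flatter logits mix in more context from coarser and finer scales.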
Taxonomy
Topics: Video Analysis and Summarization · Visual Attention and Saliency Detection · Advanced Image and Video Retrieval Techniques
