STC: Spatio-Temporal Contrastive Learning for Video Instance Segmentation
Zhengkai Jiang, Zhangxuan Gu, Jinlong Peng, Hang Zhou, Liang Liu, Yabiao Wang, Ying Tai, Chengjie Wang, Liqing Zhang

TL;DR
This paper introduces a simple, efficient single-stage video instance segmentation framework that leverages spatio-temporal contrastive learning and temporal consistency to improve tracking accuracy and coherence.
Contribution
It proposes a novel bi-directional spatio-temporal contrastive learning strategy and an instance-wise temporal consistency scheme within a single-stage VIS framework.
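To make the idea of a bi-directional contrastive objective on tracking embeddings concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names, the InfoNCE form of the loss, the temperature value, and the assumption that ground-truth instance correspondences between two frames are given as an index list are all illustrative choices that this summary does not specify.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    # Scale a vector to unit length (assumes a non-zero vector).
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce(anchors, candidates, positives, temperature=0.1):
    """Mean InfoNCE loss.

    positives[i] is the index in `candidates` of the embedding that
    belongs to the same instance as anchors[i]; every other candidate
    acts as a negative.
    """
    anchors = [normalize(a) for a in anchors]
    candidates = [normalize(c) for c in candidates]
    total = 0.0
    for anchor, pos in zip(anchors, positives):
        logits = [dot(anchor, c) / temperature for c in candidates]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[pos]
    return total / len(anchors)


def bidirectional_contrastive_loss(emb_t, emb_ref, match, temperature=0.1):
    """Average of the frame-t -> reference and reference -> frame-t losses.

    match[i] is the index in emb_ref of the instance matching emb_t[i]
    (hypothetical input format; assumes a one-to-one correspondence).
    """
    forward = info_nce(emb_t, emb_ref, match, temperature)
    # Reverse direction: each matched reference embedding becomes an anchor.
    backward_anchors = [emb_ref[j] for j in match]
    backward_positives = list(range(len(match)))
    backward = info_nce(backward_anchors, emb_t, backward_positives, temperature)
    return 0.5 * (forward + backward)


# Two instances seen in two frames; embeddings are toy 2-D vectors.
frame_t = [[1.0, 0.0], [0.0, 1.0]]
frame_ref = [[0.9, 0.1], [0.1, 0.9]]
loss_correct = bidirectional_contrastive_loss(frame_t, frame_ref, [0, 1])
loss_swapped = bidirectional_contrastive_loss(frame_t, frame_ref, [1, 0])
```

With these toy vectors, the loss for the correct correspondence is much smaller than for the swapped one, which is the behavior such an objective is meant to encourage: embeddings of the same instance are pulled together across frames in both directions, while embeddings of different instances are pushed apart.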
Findings
Achieves state-of-the-art performance on the YouTube-VIS-2019, YouTube-VIS-2021, and OVIS-2021 datasets.
Demonstrates improved instance association accuracy and temporal coherence.
Outperforms complex multi-stage methods in efficiency and effectiveness.
Abstract
Video Instance Segmentation (VIS) is a task that simultaneously requires classification, segmentation, and instance association in a video. Recent VIS approaches rely on sophisticated pipelines to achieve this goal, including RoI-related operations or 3D convolutions. In contrast, we present a simple and efficient single-stage VIS framework based on the instance segmentation method CondInst by adding an extra tracking head. To improve instance association accuracy, a novel bi-directional spatio-temporal contrastive learning strategy for tracking embedding across frames is proposed. Moreover, an instance-wise temporal consistency scheme is utilized to produce temporally coherent results. Experiments conducted on the YouTube-VIS-2019, YouTube-VIS-2021, and OVIS-2021 datasets validate the effectiveness and efficiency of the proposed method. We hope the proposed framework can serve as a…
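As an illustration of how per-instance tracking embeddings can drive instance association, the following sketch greedily matches detections in a new frame to existing tracks by cosine similarity. This is a generic technique, not the paper's inference procedure, which the abstract does not describe; the `associate` function, the greedy strategy, and the `threshold` value are assumptions made for this example.

```python
import math


def cosine(u, v):
    # Cosine similarity between two non-zero vectors.
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den


def associate(track_embs, det_embs, threshold=0.5):
    """Greedy one-to-one matching of detections to tracks.

    Returns a list where entry d is the index of the track assigned to
    detection d, or -1 if no track is similar enough (a new instance).
    """
    scored = sorted(
        (
            (cosine(t, d), ti, di)
            for ti, t in enumerate(track_embs)
            for di, d in enumerate(det_embs)
        ),
        reverse=True,
    )
    assigned = [-1] * len(det_embs)
    used_tracks = set()
    for score, ti, di in scored:
        if score < threshold:
            break  # remaining pairs are even less similar
        if ti in used_tracks or assigned[di] != -1:
            continue  # track or detection already matched
        assigned[di] = ti
        used_tracks.add(ti)
    return assigned


tracks = [[1.0, 0.0], [0.0, 1.0]]
detections = [[0.1, 1.0], [1.0, 0.2], [-1.0, -1.0]]
result = associate(tracks, detections)
```

Here the first detection is assigned to track 1, the second to track 0, and the third starts a new track because it is dissimilar to both existing ones. A production system would more likely use an optimal assignment method and combine embedding similarity with other cues; the greedy version is used only to keep the example short.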
Taxonomy
Topics: Video Analysis and Summarization · Video Surveillance and Tracking Methods · Advanced Vision and Imaging
Methods: Contrastive Learning · Conditional Convolutions for Instance Segmentation
