STC: Spatio-Temporal Contrastive Learning for Video Instance Segmentation

Zhengkai Jiang; Zhangxuan Gu; Jinlong Peng; Hang Zhou; Liang Liu; Yabiao Wang; Ying Tai; Chengjie Wang; Liqing Zhang

arXiv:2202.03747 · cs.CV · August 23, 2022 · 1 citation

TL;DR

This paper introduces a simple, efficient single-stage video instance segmentation framework that leverages spatio-temporal contrastive learning and temporal consistency to improve tracking accuracy and coherence.

Contribution

It proposes a novel bi-directional spatio-temporal contrastive learning strategy and an instance-wise temporal consistency scheme within a single-stage VIS framework.

Findings

01. Achieves state-of-the-art performance on YouTube-VIS and OVIS datasets.

02. Demonstrates improved instance association accuracy and temporal coherence.

03. Outperforms complex multi-stage methods in efficiency and effectiveness.

Abstract

Video Instance Segmentation (VIS) is a task that simultaneously requires classification, segmentation, and instance association in a video. Recent VIS approaches rely on sophisticated pipelines to achieve this goal, including RoI-related operations or 3D convolutions. In contrast, we present a simple and efficient single-stage VIS framework based on the instance segmentation method CondInst by adding an extra tracking head. To improve instance association accuracy, a novel bi-directional spatio-temporal contrastive learning strategy for tracking embedding across frames is proposed. Moreover, an instance-wise temporal consistency scheme is utilized to produce temporally coherent results. Experiments conducted on the YouTube-VIS-2019, YouTube-VIS-2021, and OVIS-2021 datasets validate the effectiveness and efficiency of the proposed method. We hope the proposed framework can serve as a…
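
The abstract does not spell out the loss, so the following is only a minimal sketch of what a bi-directional contrastive objective over tracking embeddings could look like, assuming a symmetric InfoNCE form. The function names (`info_nce`, `bidirectional_contrastive_loss`), the temperature value, and the assumption that row i in both frames belongs to the same instance are illustrative choices, not the authors' implementation.

```python
import numpy as np


def info_nce(queries, keys, temperature=0.1):
    """Mean InfoNCE loss, assuming queries[i] and keys[i] are the same instance."""
    # L2-normalise so the dot product is a cosine similarity.
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    k = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    logits = (q @ k.T) / temperature
    # Subtract the row-wise max for a numerically stable log-softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The positive pair for row i sits on the diagonal.
    return float(-np.mean(np.diag(log_prob)))


def bidirectional_contrastive_loss(emb_t, emb_ref, temperature=0.1):
    """Average the frame-t -> reference and reference -> frame-t losses (assumed form)."""
    forward = info_nce(emb_t, emb_ref, temperature)
    backward = info_nce(emb_ref, emb_t, temperature)
    return 0.5 * (forward + backward)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    emb_t = rng.normal(size=(4, 16))  # 4 instances, 16-d embeddings (hypothetical sizes)
    emb_ref = emb_t + 0.01 * rng.normal(size=(4, 16))  # same instances, slightly perturbed
    print("aligned loss:", bidirectional_contrastive_loss(emb_t, emb_ref))
    print("misaligned loss:", bidirectional_contrastive_loss(emb_t, np.roll(emb_ref, 1, axis=0)))
```

Under this sketch, correctly associated embeddings give a low loss and mismatched ones a high loss, which is the property a tracking head would rely on when linking instances across frames.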

Taxonomy

Topics: Video Analysis and Summarization · Video Surveillance and Tracking Methods · Advanced Vision and Imaging

Methods: Contrastive Learning · Conditional Convolutions for Instance Segmentation