Solve the Puzzle of Instance Segmentation in Videos: A Weakly Supervised Framework with Spatio-Temporal Collaboration
Liqi Yan, Qifan Wang, Siqi Ma, Jingang Wang, Changbin Yu

TL;DR
This paper introduces STC-Seg, a weakly supervised framework for video instance segmentation that leverages spatio-temporal collaboration, pseudo-labels, and a novel puzzle loss to achieve competitive results with less annotation effort.
Contribution
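The paper proposes a weakly supervised video instance segmentation framework that combines pseudo-labels from unsupervised depth estimation and optical flow with a puzzle loss, enabling end-to-end training from box-level annotations and outperforming the fully supervised TrackR-CNN and MaskTrack R-CNN.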
Findings
Outperforms the fully supervised TrackR-CNN and MaskTrack R-CNN on the KITTI MOTS and YT-VIS datasets.
Effectively utilizes pseudo-labels from depth and optical flow for training.
Enhances robustness with a spatio-temporal tracking module.
Abstract
Instance segmentation in videos, which aims to segment and track multiple objects in video frames, has garnered a flurry of research attention in recent years. In this paper, we present a novel weakly supervised framework with Spatio-Temporal Collaboration for instance Segmentation in videos, namely STC-Seg. Concretely, STC-Seg demonstrates four contributions. First, we leverage the complementary representations from unsupervised depth estimation and optical flow to produce effective pseudo-labels for training deep networks and predicting high-quality instance masks. Second, to enhance the mask generation, we devise a puzzle loss, which enables end-to-end training using box-level annotations. Third, our tracking module jointly utilizes bounding-box diagonal points with spatio-temporal discrepancy to model movements, which largely improves the…
