Weakly Supervised Instance Segmentation for Videos with Temporal Mask Consistency
Qing Liu, Vignesh Ramanathan, Dhruv Mahajan, Alan Yuille, Zhenheng Yang

TL;DR
This paper introduces a novel approach for weakly supervised video instance segmentation by leveraging temporal consistency and motion cues, significantly improving segmentation accuracy over image-based methods.
Contribution
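The authors state that this is the first work to use video signals for weakly supervised instance segmentation. They propose two ways to incorporate motion and temporal consistency into training: an adaptation of the inter-pixel relation network (IRN) that uses motion information, and a new MaskConsist module.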
Findings
Improved $AP_{50}$ by 5% on the YouTube-VIS dataset.
Improved $AP_{50}$ by 3% on the Cityscapes dataset.
Demonstrated that temporal cues are effective for weakly supervised segmentation.
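For readers unfamiliar with the metric, $AP_{50}$ counts a predicted mask as a correct detection only when its intersection-over-union (IoU) with a ground-truth mask is at least 0.5. The sketch below illustrates that matching test on toy binary masks. It is not the evaluation code used in the paper: the function names `mask_iou` and `is_match_at_50` are hypothetical, the masks are made up, and the official benchmark protocols differ in detail (for example, how masks are compared across video frames and how precision is averaged).

```python
def mask_iou(pred, gt):
    """Intersection-over-union of two binary masks given as equal-sized 2D lists of 0/1."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Two empty masks have no overlap to measure; returning 0.0 here is an assumption.
    return inter / union if union else 0.0


def is_match_at_50(pred, gt, threshold=0.5):
    """True when the predicted mask overlaps the ground truth with IoU >= threshold."""
    return mask_iou(pred, gt) >= threshold


# Toy ground-truth object occupying a 2x2 block.
gt = [[1, 1, 0],
      [1, 1, 0],
      [0, 0, 0]]

# Prediction that covers only one of the four object pixels.
partial = [[1, 0, 0],
           [0, 0, 0],
           [0, 0, 0]]

# Prediction that covers the whole object plus one extra pixel.
full = [[1, 1, 0],
        [1, 1, 1],
        [0, 0, 0]]

print(mask_iou(partial, gt), is_match_at_50(partial, gt))
print(mask_iou(full, gt), is_match_at_50(full, gt))
```

The partially segmented prediction fails the 0.5 threshold while the nearly complete one passes, which is why partial segmentations lower $AP_{50}$.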
Abstract
Weakly supervised instance segmentation reduces the cost of annotations required to train models. However, existing approaches which rely only on image-level class labels predominantly suffer from errors due to (a) partial segmentation of objects and (b) missing object predictions. We show that these issues can be better addressed by training with weakly labeled videos instead of images. In videos, motion and temporal consistency of predictions across frames provide complementary signals which can help segmentation. We are the first to explore the use of these video signals to tackle weakly supervised instance segmentation. We propose two ways to leverage this information in our model. First, we adapt inter-pixel relation network (IRN) to effectively incorporate motion information during training. Second, we introduce a new MaskConsist module, which addresses the problem of missing…
Taxonomy
Topics: Advanced Neural Network Applications · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning
