FlowCut: Unsupervised Video Instance Segmentation via Temporal Mask Matching
Alp Eren Sari, Paolo Favaro

TL;DR
FlowCut introduces an unsupervised video instance segmentation method that constructs a pseudo-labeled dataset through a three-stage process, enabling training without manual annotations and achieving state-of-the-art results.
Contribution
To the authors' knowledge, it is the first work to curate a pseudo-labeled video dataset for unsupervised video instance segmentation, and it demonstrates the dataset's effectiveness through state-of-the-art benchmark performance.
Findings
Achieves state-of-the-art results on multiple benchmarks, including YouTubeVIS-2019, YouTubeVIS-2021, DAVIS-2017, and DAVIS-2017 Motion.
Successfully constructs high-quality pseudo-labels for training.
Demonstrates the effectiveness of temporal mask matching.
Abstract
We propose FlowCut, a simple and capable method for unsupervised video instance segmentation consisting of a three-stage framework to construct a high-quality video dataset with pseudo labels. To our knowledge, our work is the first attempt to curate a video dataset with pseudo-labels for unsupervised video instance segmentation. In the first stage, we generate pseudo-instance masks by exploiting the affinities of features from both images and optical flows. In the second stage, we construct short video segments containing high-quality, consistent pseudo-instance masks by temporally matching them across the frames. In the third stage, we use the YouTubeVIS-2021 video dataset to extract our training instance segmentation set, and then train a video segmentation model. FlowCut achieves state-of-the-art performance on the YouTubeVIS-2019, YouTubeVIS-2021, DAVIS-2017, and DAVIS-2017 Motion…
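The abstract does not specify how the second stage matches masks across frames. The sketch below is one plausible reading, not the authors' method: it assumes masks are pixel sets, matches masks in consecutive frames greedily one-to-one by intersection-over-union (IoU), and keeps only chains that persist for several frames. The helper names (`iou`, `match_frames`, `build_tracks`, `square`) and the parameters `tau` and `min_len` are hypothetical, and the actual pipeline may, for example, warp masks with optical flow before comparing them.

```python
from typing import Dict, FrozenSet, List, Tuple

# Hypothetical representation: a mask is a frozenset of (row, col) pixels.
Mask = FrozenSet[Tuple[int, int]]
Track = List[Tuple[int, int]]  # list of (frame_index, mask_index)


def iou(a: Mask, b: Mask) -> float:
    """Intersection-over-union of two pixel-set masks."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def match_frames(prev: List[Mask], curr: List[Mask], tau: float) -> List[Tuple[int, int]]:
    """Greedy one-to-one matching of masks in consecutive frames by IoU.

    Assumption: pairs are accepted in descending IoU order while IoU >= tau.
    """
    candidates = sorted(
        ((iou(p, c), i, j) for i, p in enumerate(prev) for j, c in enumerate(curr)),
        reverse=True,
    )
    used_prev, used_curr, pairs = set(), set(), []
    for score, i, j in candidates:
        if score < tau:
            break  # remaining candidates are all below the threshold
        if i not in used_prev and j not in used_curr:
            used_prev.add(i)
            used_curr.add(j)
            pairs.append((i, j))
    return pairs


def build_tracks(frames: List[List[Mask]], tau: float = 0.5, min_len: int = 3) -> List[Track]:
    """Chain per-frame masks into tracks and keep those spanning >= min_len frames."""
    if not frames:
        return []
    finished: List[Track] = []
    # Maps a mask index in the latest frame to the track it currently ends.
    open_tracks: Dict[int, Track] = {k: [(0, k)] for k in range(len(frames[0]))}
    for t in range(1, len(frames)):
        next_open: Dict[int, Track] = {}
        for i, j in match_frames(frames[t - 1], frames[t], tau):
            next_open[j] = open_tracks.pop(i) + [(t, j)]
        finished.extend(open_tracks.values())  # unmatched tracks end here
        for j in range(len(frames[t])):
            next_open.setdefault(j, [(t, j)])  # unmatched masks start new tracks
        open_tracks = next_open
    finished.extend(open_tracks.values())
    return [tr for tr in finished if len(tr) >= min_len]


def square(r: int, c: int, s: int) -> Mask:
    """Toy helper: an s x s square mask with top-left corner (r, c)."""
    return frozenset((r + i, c + j) for i in range(s) for j in range(s))


if __name__ == "__main__":
    # A square drifting right by one pixel per frame, plus two one-frame blobs.
    frames = [
        [square(0, 0, 4), square(10, 10, 2)],
        [square(0, 1, 4)],
        [square(0, 2, 4), square(20, 20, 2)],
    ]
    print(build_tracks(frames, tau=0.5, min_len=3))  # [[(0, 0), (1, 0), (2, 0)]]
```

In the toy run, consecutive squares overlap with IoU 0.6, so they form one three-frame track, while the short-lived blobs are discarded as inconsistent. Filtering by track length is one simple way to favor temporally stable pseudo-masks.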
Taxonomy
Topics: Video Analysis and Summarization · Visual Attention and Saliency Detection · Generative Adversarial Networks and Image Synthesis
