MinVIS: A Minimal Video Instance Segmentation Framework without Video-based Training
De-An Huang, Zhiding Yu, Anima Anandkumar

TL;DR
MinVIS is a minimal video instance segmentation framework that achieves state-of-the-art results without video-based training. It uses only a query-based image instance segmentation model and tracks instances by matching queries across frames.
Contribution
MinVIS trains a query-based image instance segmentation model on video frames treated as independent images, then tracks instances by simple bipartite matching of the queries. This removes the need for video-based training and video-specific architectures.
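The tracking step described above can be illustrated with a minimal sketch. It assumes each frame yields the same number of query embeddings, that similarity is measured by cosine similarity, and that queries are matched between consecutive frames. The function names (`cosine`, `match_queries`, `track`) are hypothetical and not taken from the authors' code. For brevity the optimal assignment is found by brute force over permutations, which is only practical for a handful of queries; a Hungarian-algorithm solver would be the usual choice at scale.

```python
import math
from itertools import permutations


def cosine(u, v):
    """Cosine similarity between two embedding vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def match_queries(prev, curr):
    """Bipartite matching between two frames' query embeddings.

    Returns a list `assign` where curr[assign[i]] is matched to prev[i].
    Brute force over all permutations: illustrative only, O(n!).
    """
    n = len(prev)
    sim = [[cosine(p, c) for c in curr] for p in prev]
    best = max(
        permutations(range(n)),
        key=lambda perm: sum(sim[i][perm[i]] for i in range(n)),
    )
    return list(best)


def track(frames):
    """Reorder each frame's queries so that index i follows one instance over time.

    `frames` is a list of per-frame lists of embeddings (assumed equal length).
    """
    tracked = [list(frames[0])]
    for curr in frames[1:]:
        assign = match_queries(tracked[-1], curr)
        tracked.append([curr[j] for j in assign])
    return tracked


# Two frames, two queries each; the query order is swapped in the second frame.
frames = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.1, 0.9], [0.9, 0.1]],
]
tracked = track(frames)
```

In this toy input the matching undoes the swap, so `tracked[1]` lists the embedding closest to `[1.0, 0.0]` first. No learned association module or video-level training is involved; the sketch relies entirely on the per-frame embeddings being consistent over time.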
Findings
Outperforms the previous best result on the Occluded VIS dataset by over 10% AP.
With only 1% of labeled frames, outperforms or is comparable to fully supervised state-of-the-art approaches on YouTube-VIS 2019/2021.
Reduces labeling costs and memory requirements without sacrificing performance.
Abstract
We propose MinVIS, a minimal video instance segmentation (VIS) framework that achieves state-of-the-art VIS performance with neither video-based architectures nor training procedures. By only training a query-based image instance segmentation model, MinVIS outperforms the previous best result on the challenging Occluded VIS dataset by over 10% AP. Since MinVIS treats frames in training videos as independent images, we can drastically sub-sample the annotated frames in training videos without any modifications. With only 1% of labeled frames, MinVIS outperforms or is comparable to fully-supervised state-of-the-art approaches on YouTube-VIS 2019/2021. Our key observation is that queries trained to be discriminative between intra-frame object instances are temporally consistent and can be used to track instances without any manually designed heuristics. MinVIS thus has the following…
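Because frames are treated as independent images, sub-sampling the annotations reduces to choosing which frames to keep. The sketch below is one possible way to do this, not the authors' procedure: random selection per video, keeping at least one frame per video, and the name `subsample_frames` are all assumptions made for illustration.

```python
import math
import random


def subsample_frames(video_frames, ratio=0.01, seed=0):
    """Keep roughly `ratio` of the annotated frames in each video.

    `video_frames` maps a video id to a list of annotated frame ids.
    Assumption: at least one frame is kept per video, chosen at random.
    Returns a flat list of (video_id, frame_id) pairs, i.e. independent images.
    """
    rng = random.Random(seed)
    kept = []
    for video_id, frame_ids in video_frames.items():
        k = max(1, math.ceil(ratio * len(frame_ids)))
        for frame_id in sorted(rng.sample(frame_ids, k)):
            kept.append((video_id, frame_id))
    return kept


# Hypothetical dataset: two videos with 50 and 30 annotated frames.
videos = {"video_a": list(range(50)), "video_b": list(range(30))}
training_images = subsample_frames(videos, ratio=0.01)
```

The resulting list can be fed to an ordinary image-level training loop, since no temporal ordering or cross-frame identity is needed at training time.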
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning
