UVIS: Unsupervised Video Instance Segmentation
Shuaiyi Huang, Saksham Suri, Kamal Gupta, Sai Saketh Rambhatla, Ser-nam Lim, Abhinav Shrivastava

TL;DR
UVIS introduces an unsupervised framework for video instance segmentation that leverages self-supervised and vision-language models, achieving competitive results without relying on annotated video data.
Contribution
The paper presents UVIS, a novel unsupervised video instance segmentation method utilizing dense shape priors and open-set recognition, eliminating the need for annotated video data.
Findings
Achieves 21.1 AP on YouTube-VIS 2019 without annotations.
Uses a dual-memory design for improved pseudo-label quality.
Demonstrates competitive performance on standard benchmarks.
Abstract
Video instance segmentation requires classifying, segmenting, and tracking every object across video frames. Unlike existing approaches that rely on masks, boxes, or category labels, we propose UVIS, a novel Unsupervised Video Instance Segmentation framework that can perform video instance segmentation without any video annotations or dense label-based pretraining. Our key insight comes from leveraging the dense shape prior from the self-supervised vision foundation model DINO and the open-set recognition ability from the image-caption-supervised vision-language model CLIP. Our UVIS framework consists of three essential steps: frame-level pseudo-label generation, transformer-based VIS model training, and query-based tracking. To improve the quality of VIS predictions in the unsupervised setup, we introduce a dual-memory design. This design includes a semantic memory bank for…
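The open-set recognition idea behind the frame-level pseudo-label step can be illustrated with a small sketch. This is not the authors' implementation. It assumes that each class-agnostic mask has already been turned into an embedding and that each candidate class name has a text embedding in the same space, as a CLIP-style model would provide. The function name `label_masks`, the `temperature` value, and the toy two-dimensional vectors are all hypothetical and chosen only for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def label_masks(mask_embeddings, text_embeddings, class_names, temperature=0.01):
    """Assign each mask the class whose text embedding is most similar.

    Hypothetical helper: returns a list of (class_name, confidence) pairs,
    where confidence is a softmax over temperature-scaled cosine similarities.
    """
    labels = []
    for emb in mask_embeddings:
        logits = [cosine(emb, t) / temperature for t in text_embeddings]
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(v - peak) for v in logits]
        total = sum(exps)
        probs = [e / total for e in exps]
        best = max(range(len(probs)), key=probs.__getitem__)
        labels.append((class_names[best], probs[best]))
    return labels


# Toy inputs standing in for real image and text embeddings (assumed values).
class_names = ["dog", "car"]
text_embeddings = [[1.0, 0.0], [0.0, 1.0]]
mask_embeddings = [[0.9, 0.1], [0.2, 0.8]]

pseudo_labels = label_masks(mask_embeddings, text_embeddings, class_names)
for name, confidence in pseudo_labels:
    print(name, round(confidence, 3))
```

In a real pipeline the confidence could be used to discard low-quality pseudo-labels before training. Any such threshold would be a design choice that the truncated abstract does not specify.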
Taxonomy
Topics: Video Analysis and Summarization · Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition
Methods: Attention Is All You Need · Residual Connection · Softmax · Layer Normalization · Linear Layer · Multi-Head Attention · Dense Connections · Vision Transformer · Contrastive Language-Image Pre-training · self-DIstillation with NO labels
