UVIS: Unsupervised Video Instance Segmentation

Shuaiyi Huang; Saksham Suri; Kamal Gupta; Sai Saketh Rambhatla; Ser-nam Lim; Abhinav Shrivastava

arXiv:2406.06908·cs.CV·June 12, 2024


Open Access

TL;DR

UVIS introduces an unsupervised framework for video instance segmentation that leverages self-supervised and vision-language models, achieving competitive results without relying on annotated video data.

Contribution

The paper presents UVIS, a novel unsupervised video instance segmentation method utilizing dense shape priors and open-set recognition, eliminating the need for annotated video data.

Findings

1. Achieves 21.1 AP on YoutubeVIS-2019 without annotations.

2. Uses a dual-memory design for improved pseudo-label quality.

3. Demonstrates competitive performance on standard benchmarks.

Abstract

Video instance segmentation (VIS) requires classifying, segmenting, and tracking every object across video frames. Unlike existing approaches that rely on masks, boxes, or category labels, we propose UVIS, a novel Unsupervised Video Instance Segmentation framework that performs video instance segmentation without any video annotations or dense label-based pretraining. Our key insight is to leverage the dense shape prior from the self-supervised vision foundation model DINO and the open-set recognition ability of the image-caption-supervised vision-language model CLIP. Our UVIS framework consists of three essential steps: frame-level pseudo-label generation, transformer-based VIS model training, and query-based tracking. To improve the quality of VIS predictions in the unsupervised setup, we introduce a dual-memory design. This design includes a semantic memory bank for…
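As a rough illustration of the query-based tracking step named in the abstract, the sketch below links per-frame query embeddings across two consecutive frames by cosine similarity. It is not the authors' implementation: the greedy one-to-one matching, the `min_sim` threshold, the function names, and the toy 2-D embeddings are all assumptions chosen to keep the example dependency-free. A real system might use Hungarian matching over transformer query embeddings instead.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def match_queries(prev, curr, min_sim=0.5):
    """Greedily link each current-frame query to at most one previous-frame query.

    Returns {curr_index: prev_index}. Current queries with no link are omitted;
    a tracker would typically start new tracks for them.
    `min_sim` is a hypothetical threshold, not a value from the paper.
    """
    # Score every (previous, current) pair and visit the best pairs first.
    pairs = sorted(
        ((cosine(p, c), i, j) for i, p in enumerate(prev) for j, c in enumerate(curr)),
        reverse=True,
    )
    used_prev, links = set(), {}
    for sim, i, j in pairs:
        if sim < min_sim:
            break  # all remaining pairs are even less similar
        if i in used_prev or j in links:
            continue  # keep the assignment one-to-one
        used_prev.add(i)
        links[j] = i
    return links


# Toy embeddings: two instances swap query slots between frames,
# and a third, dissimilar query appears in the current frame.
prev_frame = [[1.0, 0.0], [0.0, 1.0]]
curr_frame = [[0.1, 0.9], [0.9, 0.1], [-1.0, -1.0]]
links = match_queries(prev_frame, curr_frame)
```

Here `links` maps current query 0 to previous query 1 and current query 1 to previous query 0, while current query 2 stays unmatched because its similarity to both previous queries is below the threshold.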


Taxonomy

Topics: Video Analysis and Summarization · Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition

Methods: Attention Is All You Need · Residual Connection · Softmax · Layer Normalization · Linear Layer · Multi-Head Attention · Dense Connections · Vision Transformer · Contrastive Language-Image Pre-training · self-DIstillation with NO labels