Boosting Unsupervised Video Instance Segmentation with Automatic Quality-Guided Self-Training

Kaixuan Lu; Mehmet Onurcan Kaya; Dim P. Papadopoulos

arXiv:2512.06864·cs.CV·December 9, 2025

TL;DR

AutoQ-VIS introduces a quality-guided self-training framework for unsupervised video instance segmentation, effectively bridging the synthetic-to-real domain gap and achieving state-of-the-art results without human annotations.

Contribution

The paper proposes AutoQ-VIS, an unsupervised framework that uses quality-guided self-training to adapt a video instance segmentation model from synthetic to real videos.

Findings

01

Achieves 52.6 AP50 on the YouTubeVIS-2019 val set.

02

Surpasses the previous state of the art, VideoCutLER, by 4.4% in AP50.

03

Requires no human annotations.

Abstract

Video Instance Segmentation (VIS) faces significant annotation challenges due to its dual requirements of pixel-level masks and temporal consistency labels. While recent unsupervised methods like VideoCutLER eliminate optical flow dependencies through synthetic data, they remain constrained by the synthetic-to-real domain gap. We present AutoQ-VIS, a novel unsupervised framework that bridges this gap through quality-guided self-training. Our approach establishes a closed-loop system between pseudo-label generation and automatic quality assessment, enabling progressive adaptation from synthetic to real videos. Experiments demonstrate state-of-the-art performance with 52.6 $AP_{50}$ on the YouTubeVIS-2019 $val$ set, surpassing the previous state-of-the-art VideoCutLER by 4.4%, while requiring no human annotations. This demonstrates the viability of quality-aware self-training…
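
The closed loop between pseudo-label generation and quality assessment described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the quality predictor, the selection rule, or the training schedule, so the `PseudoLabel` type, the fixed `threshold`, and the `generate` and `retrain` callbacks below are all hypothetical stand-ins chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Set, Tuple


@dataclass(frozen=True)
class PseudoLabel:
    """A predicted instance track with an automatic quality score (hypothetical type)."""
    video_id: str
    mask_id: int
    quality: float  # assumed to lie in [0, 1]; higher means more trustworthy


def select_confident(labels: Sequence[PseudoLabel], threshold: float) -> List[PseudoLabel]:
    """Keep only the pseudo-labels whose quality score reaches the threshold."""
    return [label for label in labels if label.quality >= threshold]


def self_training_rounds(
    generate: Callable[[int], Sequence[PseudoLabel]],
    retrain: Callable[[List[PseudoLabel]], None],
    rounds: int,
    threshold: float,
) -> List[PseudoLabel]:
    """Run a generic quality-filtered self-training loop.

    generate(r) stands in for running the current model on real videos in
    round r; retrain(labels) stands in for updating the model on the
    accumulated pseudo-labels. Both are assumptions, not the paper's API.
    """
    training_set: List[PseudoLabel] = []
    seen: Set[Tuple[str, int]] = set()
    for round_index in range(rounds):
        candidates = generate(round_index)
        for label in select_confident(candidates, threshold):
            key = (label.video_id, label.mask_id)
            if key not in seen:  # avoid adding the same track twice
                seen.add(key)
                training_set.append(label)
        # Pass a copy so the callback cannot mutate the accumulated set.
        retrain(list(training_set))
    return training_set


if __name__ == "__main__":
    def toy_generate(round_index: int) -> List[PseudoLabel]:
        # One high-quality and one low-quality candidate per round (toy data).
        return [
            PseudoLabel("video_a", round_index, 0.9),
            PseudoLabel("video_b", round_index, 0.2),
        ]

    kept = self_training_rounds(toy_generate, lambda labels: None, rounds=3, threshold=0.5)
    print(len(kept))
```

In this toy run only the high-scoring candidate from each round is kept, so the training set grows by one label per round. A real system would replace the fixed threshold with whatever selection rule the quality assessor supports.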

Taxonomy

Topics: Visual Attention and Saliency Detection · Video Analysis and Summarization · Generative Adversarial Networks and Image Synthesis