AutoQ-VIS: Improving Unsupervised Video Instance Segmentation via Automatic Quality Assessment
Kaixuan Lu, Mehmet Onurcan Kaya, Dim P. Papadopoulos

TL;DR
AutoQ-VIS introduces an unsupervised video instance segmentation framework that uses automatic quality assessment to improve pseudo-labels, enabling effective domain adaptation from synthetic to real videos without human annotations.
Contribution
It proposes a novel quality-guided self-training approach that bridges the synthetic-to-real domain gap in unsupervised VIS, achieving state-of-the-art results.
Findings
Achieves 52.6% AP50 on the YouTubeVIS-2019 val set.
Surpasses the previous state of the art, VideoCutLER, by 4.4% AP50 on the same benchmark.
Requires no human annotations for training.
Abstract
Video Instance Segmentation (VIS) faces significant annotation challenges due to its dual requirements of pixel-level masks and temporal consistency labels. While recent unsupervised methods like VideoCutLER eliminate optical flow dependencies through synthetic data, they remain constrained by the synthetic-to-real domain gap. We present AutoQ-VIS, a novel unsupervised framework that bridges this gap through quality-guided self-training. Our approach establishes a closed-loop system between pseudo-label generation and automatic quality assessment, enabling progressive adaptation from synthetic to real videos. Experiments demonstrate state-of-the-art performance with 52.6 AP50 on the YouTubeVIS-2019 val set, surpassing the previous state-of-the-art VideoCutLER by 4.4%, while requiring no human annotations. This demonstrates the viability of quality-aware self-training for…
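The closed loop between pseudo-label generation and automatic quality assessment described in the abstract can be pictured with a minimal Python sketch. This is an illustration under stated assumptions, not the authors' implementation: the names `PseudoLabel`, `quality_guided_self_training`, `predict`, `assess`, and `retrain`, the fixed `threshold`, the number of `rounds`, and the choice to rebuild the selected set every round are all hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence


@dataclass
class PseudoLabel:
    video_id: str
    mask_track: Any   # per-frame masks for one instance (placeholder type)
    quality: float    # score assigned by the automatic quality assessor


def quality_guided_self_training(
    videos: Sequence[str],
    predict: Callable[[str], List[Any]],          # hypothetical: VIS model inference
    assess: Callable[[str, Any], float],          # hypothetical: quality predictor
    retrain: Callable[[List[PseudoLabel]], None], # hypothetical: one training round
    rounds: int = 3,                              # assumed value, not from the paper
    threshold: float = 0.75,                      # assumed value, not from the paper
) -> List[PseudoLabel]:
    """Repeat predict -> assess -> filter -> retrain for `rounds` iterations."""
    selected: List[PseudoLabel] = []
    for _ in range(rounds):
        selected = []
        for video_id in videos:
            # 1. Generate candidate pseudo-labels on unlabeled real videos.
            for track in predict(video_id):
                # 2. Score each candidate without any human annotation.
                score = assess(video_id, track)
                # 3. Keep only candidates the assessor rates highly enough.
                if score >= threshold:
                    selected.append(PseudoLabel(video_id, track, score))
        # 4. Update the model on the filtered pseudo-labels, closing the loop.
        retrain(selected)
    return selected
```

In this sketch the callbacks are passed in so the loop stays independent of any particular model or quality predictor; a caller would supply its own `predict`, `assess`, and `retrain` functions.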
