VVitCutLER: Towards Unsupervised Object Detection and Segmentation in Videos
Zhijing Lu, Khurram Azeem Hashmi, Didier Stricker, Muhammad Zeshan Afzal

TL;DR
VVitCutLER is an unsupervised framework that enhances video object detection and segmentation by ensuring temporal consistency, reducing errors, and improving robustness in challenging real-world scenarios.
Contribution
It introduces VitCut, a pseudo-label generator with temporal stability, and integrates cross-frame feature aggregation for improved video-level robustness.
Findings
Significant improvement in detection and segmentation accuracy.
Reduction in temporal flickering and instability.
Enhanced robustness in real-world video scenarios.
Abstract
Unsupervised pixel-level video understanding remains challenging in real-world scenarios, where motion blur, occlusion, and fast object dynamics often cause temporal drift and flickering pseudo-labels. We propose VVitCutLER, an unsupervised framework for video object detection and instance segmentation, which improves the quality of pseudo-labels through temporal consistency. Our core contribution is VitCut, a temporally stable pseudo-label generator that reduces error accumulation during field degradation through cross-frame region consistency. VitCut also uses a distillation decoder to achieve effective instance mask prediction. Building on VitCut, VVitCutLER further integrates cross-frame feature aggregation to enhance video-level robustness. Extensive experiments on standard video benchmarks demonstrate that VVitCutLER significantly improves detection and segmentation…
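To make the idea of "flickering pseudo-labels" concrete, here is a minimal sketch of one way temporal stability of per-frame masks could be scored: the mean intersection-over-union (IoU) between masks of consecutive frames. This is an illustration only, not the paper's method or metric; the function names `mask_iou` and `temporal_consistency`, the binary nested-list mask format, and the convention that two empty masks count as perfectly consistent are all assumptions made for this example.

```python
from typing import List, Sequence

# Assumed format: a binary mask is a sequence of rows holding 0/1 values.
Mask = Sequence[Sequence[int]]


def mask_iou(a: Mask, b: Mask) -> float:
    """Intersection-over-union of two binary masks of equal shape."""
    inter = 0
    union = 0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            inter += 1 if (pa and pb) else 0
            union += 1 if (pa or pb) else 0
    # Assumed convention: two empty masks agree perfectly.
    return 1.0 if union == 0 else inter / union


def temporal_consistency(masks: List[Mask]) -> float:
    """Mean IoU between each pair of consecutive frame masks.

    Values near 1.0 indicate stable pseudo-labels; values near 0.0
    indicate flickering (the mask appears, vanishes, or jumps).
    """
    if len(masks) < 2:
        return 1.0
    ious = [mask_iou(m0, m1) for m0, m1 in zip(masks[:-1], masks[1:])]
    return sum(ious) / len(ious)


if __name__ == "__main__":
    obj = [[0, 1, 1], [0, 1, 1]]
    empty = [[0, 0, 0], [0, 0, 0]]
    print(temporal_consistency([obj, obj, obj]))    # stable track
    print(temporal_consistency([obj, empty, obj]))  # flickering track
```

A score like this only compares masks at the same pixel locations, so it penalizes genuine object motion as well as label flicker; a practical variant would first align frames (for example with optical flow) before comparing.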
