VVitCutLER: Towards Unsupervised Object Detection and Segmentation in Videos

Zhijing Lu; Khurram Azeem Hashmi; Didier Stricker; Muhammad Zeshan Afzal

arXiv:2605.17584·cs.CV·May 19, 2026

TL;DR

VVitCutLER is an unsupervised framework that enhances video object detection and segmentation by ensuring temporal consistency, reducing errors, and improving robustness in challenging real-world scenarios.

Contribution

It introduces VitCut, a pseudo-label generator with temporal stability, and integrates cross-frame feature aggregation for improved video-level robustness.
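
As a rough illustration of what cross-frame feature aggregation can look like, the sketch below blends a reference frame's feature vector with features from neighbouring frames, weighting each by its cosine similarity to the reference. This is a common pattern in video detection and is only an assumption here: the listing does not describe the aggregation VVitCutLER actually uses, and the names `cosine`, `aggregate`, and `temperature` are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def aggregate(reference: Sequence[float],
              supports: Sequence[Sequence[float]],
              temperature: float = 0.1) -> List[float]:
    """Softmax-weighted average of the reference and support features.

    Weights come from cosine similarity to the reference, so support frames
    that look unlike the reference (e.g. blurred or occluded) contribute less.
    Hypothetical helper, not the paper's aggregation module.
    """
    candidates = [list(reference)] + [list(s) for s in supports]
    logits = [cosine(reference, c) / temperature for c in candidates]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    return [sum(w * c[i] for w, c in zip(weights, candidates))
            for i in range(len(reference))]


# A support feature identical to the reference leaves it unchanged,
# while an orthogonal one is almost entirely ignored.
same = aggregate([1.0, 0.0], [[1.0, 0.0]])
mixed = aggregate([1.0, 0.0], [[0.0, 1.0]])
```

A lower `temperature` makes the weighting sharper, so dissimilar frames are suppressed more strongly; a higher value moves the result toward a plain average.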

Findings

01 Significant improvement in detection and segmentation accuracy.

02 Reduction in temporal flickering and instability.

03 Enhanced robustness in real-world video scenarios.

Abstract

Unsupervised pixel-level video understanding remains challenging in real-world scenarios, where motion blur, occlusion, and fast object dynamics often cause temporal drift and flickering pseudo-labels. We propose VVitCutLER, an unsupervised framework for video object detection and instance segmentation, which improves the quality of pseudo-labels through temporal consistency. Our core contribution is VitCut, a temporally stable pseudo-label generator that reduces error accumulation during field degradation through cross-frame region consistency. Meanwhile, VitCut uses a distillation decoder to achieve effective instance mask prediction. Then, based on VitCut, VVitCutLER further integrates cross-frame feature aggregation to enhance video-level robustness. Extensive experiments on standard video benchmarks demonstrate that VVitCutLER significantly improves detection and segmentation…
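
To make the idea of cross-frame region consistency concrete, here is a minimal sketch that drops a pseudo-label mask when no mask in an adjacent frame overlaps it sufficiently. It is an assumption-laden toy, not VitCut itself: the abstract does not specify the consistency criterion, so the IoU test, the `iou_threshold` value, and the helpers `box`, `mask_iou`, and `filter_flickering` are all hypothetical.

```python
from typing import List, Set, Tuple

Pixel = Tuple[int, int]
Mask = Set[Pixel]  # a binary mask stored as its set of foreground (row, col) pixels


def box(r0: int, c0: int, r1: int, c1: int) -> Mask:
    """Rectangular mask covering rows r0..r1-1 and columns c0..c1-1."""
    return {(r, c) for r in range(r0, r1) for c in range(c0, c1)}


def mask_iou(a: Mask, b: Mask) -> float:
    """Intersection-over-union of two masks; 0.0 when both are empty."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def filter_flickering(frames: List[List[Mask]],
                      iou_threshold: float = 0.5) -> List[List[Mask]]:
    """Keep a mask only if some mask in the previous or next frame overlaps it.

    Hypothetical consistency rule: a pseudo-label that appears in a single
    frame with no counterpart next to it is treated as flicker and removed.
    """
    kept = []
    for t, masks in enumerate(frames):
        neighbours: List[Mask] = []
        if t > 0:
            neighbours.extend(frames[t - 1])
        if t + 1 < len(frames):
            neighbours.extend(frames[t + 1])
        kept.append([m for m in masks
                     if any(mask_iou(m, n) >= iou_threshold for n in neighbours)])
    return kept


# One object drifting right by a pixel per frame, plus a spurious
# detection that appears only in the middle frame.
frames = [
    [box(0, 0, 4, 4)],
    [box(0, 1, 4, 5), box(10, 10, 12, 12)],
    [box(0, 2, 4, 6)],
]
stable = filter_flickering(frames)
```

In this toy input the drifting object survives in every frame, while the isolated detection in the middle frame is removed. A real pipeline would more likely compare masks after motion compensation or in feature space rather than by raw pixel overlap.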
