Dynamic in Static: Hybrid Visual Correspondence for Self-Supervised Video Object Segmentation
Gensheng Pei, Yazhou Yao, Jianbo Jiao, Wenguan Wang, Liqiang Nie, and Jinhui Tang

TL;DR
This paper introduces HVC, a self-supervised hybrid static-dynamic visual correspondence framework for video object segmentation that efficiently learns from static images, reducing training time and memory while achieving state-of-the-art results.
Contribution
HVC is the first to combine static and dynamic visual correspondence in a self-supervised manner for VOS, requiring only one training session on static images.
Findings
Achieves state-of-the-art results on self-supervised VOS benchmarks.
Reduces training time to approximately 2 hours and memory use to 16 GB.
Effectively propagates video labels using static image data.
Abstract
Conventional video object segmentation (VOS) methods usually necessitate a substantial volume of pixel-level annotated video data for fully supervised learning. In this paper, we present HVC, a hybrid static-dynamic visual correspondence framework for self-supervised VOS. HVC extracts pseudo-dynamic signals from static images, enabling an efficient and scalable VOS model. Our approach utilizes a minimalist fully-convolutional architecture to capture static-dynamic visual correspondence in image-cropped views. To achieve this objective, we present a unified self-supervised approach to learn visual representations of static-dynamic feature similarity. Firstly, we establish static correspondence by utilizing a priori coordinate information between cropped views to guide the formation of consistent static feature representations. Subsequently, we devise a concise…
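The abstract's static correspondence step relies on the fact that two crops of the same image have known coordinates, so it is known in advance which feature cells cover the same image region. The following is a minimal sketch of that general idea, not the authors' implementation: the function names `grid_centers` and `static_positive_mask`, the square feature grid, and the distance threshold `radius` are all assumptions made for illustration.

```python
import numpy as np

def grid_centers(box, grid):
    """Image-space (x, y) centers of a grid x grid feature map over a crop.

    box is (x0, y0, x1, y1) in pixels of the original image (hypothetical layout).
    """
    x0, y0, x1, y1 = box
    # Cell centers sit at half-cell offsets inside the crop.
    xs = x0 + (np.arange(grid) + 0.5) * (x1 - x0) / grid
    ys = y0 + (np.arange(grid) + 0.5) * (y1 - y0) / grid
    gx, gy = np.meshgrid(xs, ys)  # both (grid, grid); rows index y
    return np.stack([gx.ravel(), gy.ravel()], axis=1)  # (grid * grid, 2)

def static_positive_mask(box_a, box_b, grid, radius):
    """Boolean (N, N) mask: True where a cell of view A and a cell of view B
    have centers closer than `radius` pixels in the original image."""
    ca = grid_centers(box_a, grid)
    cb = grid_centers(box_b, grid)
    # Pairwise Euclidean distances via broadcasting.
    dist = np.linalg.norm(ca[:, None, :] - cb[None, :, :], axis=-1)
    return dist < radius

# Two overlapping 64x64 crops of one image, each mapped to a 4x4 feature grid.
mask = static_positive_mask((0, 0, 64, 64), (32, 32, 96, 96), grid=4, radius=8.0)
```

A mask like this could select which feature pairs across the two views are pulled together by a similarity loss; how HVC actually forms and uses such pairs is not specified in the text above.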
Taxonomy
Topics: Visual Attention and Saliency Detection · Advanced Vision and Imaging
Methods: VOS
