Boosting Video Object Segmentation via Space-time Correspondence Learning
Yurong Zhang, Liulei Li, Wenguan Wang, Rong Xie, Li Song, Wenjun Zhang

TL;DR
This paper introduces a correspondence-aware training framework for matching-based video object segmentation. It explicitly enforces robust space-time correspondence matching during training, which improves segmentation performance without extra annotation or architectural changes.
Contribution
The paper proposes a training method that adds contrastive correspondence learning to matching-based VOS models, exploiting the intrinsic coherence of videos instead of additional annotation.
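As a rough illustration of what a contrastive correspondence objective can look like, the sketch below computes an InfoNCE-style loss over pixel embeddings from two frames. It is not the authors' formulation: the function name `pixel_infonce_loss`, the temperature value, and the assumption that row i of one frame is already known to correspond to row i of the other are all hypothetical choices made for this example.

```python
import numpy as np


def pixel_infonce_loss(feat_a, feat_b, temperature=0.07):
    """InfoNCE-style loss over paired pixel embeddings (illustrative only).

    feat_a, feat_b: arrays of shape (N, C). Row i of feat_a is assumed to
    correspond to row i of feat_b (the positive pair); every other row of
    feat_b acts as a negative. How positives are found is outside this sketch.
    """
    # L2-normalise so the dot product is a cosine similarity.
    a = feat_a / np.linalg.norm(feat_a, axis=1, keepdims=True)
    b = feat_b / np.linalg.norm(feat_b, axis=1, keepdims=True)

    # (N, N) similarity matrix scaled by the temperature.
    logits = a @ b.T / temperature

    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # The positive for row i sits on the diagonal.
    return float(-np.mean(np.diag(log_prob)))
```

The loss is small when each pixel embedding is most similar to its own counterpart in the other frame and grows when the pairing is wrong, which is the behaviour a correspondence constraint is meant to reward.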
Findings
Achieves performance gains on DAVIS and YouTube-VOS benchmarks.
No extra annotation cost or architectural modifications required.
Improves robustness of space-time correspondence matching in VOS.
Abstract
Current leading solutions for video object segmentation (VOS) typically follow a matching-based regime: for each query frame, the segmentation mask is inferred according to its correspondence to the previously processed frames and the first annotated frame. They simply exploit the supervisory signals from the ground-truth masks for learning mask prediction only, without posing any constraint on the space-time correspondence matching, which, however, is the fundamental building block of such a regime. To alleviate this crucial yet commonly ignored issue, we devise a correspondence-aware training framework, which boosts matching-based VOS solutions by explicitly encouraging robust correspondence matching during network learning. Through comprehensively exploring the intrinsic coherence in videos on pixel and object levels, our algorithm reinforces the standard, fully supervised training of mask…
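The matching-based regime described in the abstract can be sketched as label propagation through a feature affinity matrix: each query pixel takes a similarity-weighted average of the labels of the reference pixels. This is a minimal sketch under stated assumptions, not the architecture of any particular VOS model; `propagate_mask`, the flattened `(pixels, channels)` layout, and the temperature are hypothetical choices for illustration.

```python
import numpy as np


def propagate_mask(query_feat, ref_feat, ref_mask, temperature=0.07):
    """Propagate reference labels to query pixels via feature matching.

    query_feat: (Nq, C) embeddings of the query-frame pixels.
    ref_feat:   (Nr, C) embeddings of pixels from reference frames
                (e.g. the first annotated frame and earlier frames).
    ref_mask:   (Nr, K) one-hot (or soft) labels of the reference pixels.
    Returns:    (Nq, K) soft label distribution for each query pixel.
    """
    # Cosine similarity between every query pixel and every reference pixel.
    q = query_feat / np.linalg.norm(query_feat, axis=1, keepdims=True)
    r = ref_feat / np.linalg.norm(ref_feat, axis=1, keepdims=True)
    affinity = q @ r.T / temperature

    # Row-wise softmax turns similarities into matching weights.
    affinity = affinity - affinity.max(axis=1, keepdims=True)
    weights = np.exp(affinity)
    weights = weights / weights.sum(axis=1, keepdims=True)

    # Each query pixel receives a weighted average of the reference labels.
    return weights @ ref_mask
```

Because the predicted mask depends entirely on the affinity weights, unreliable correspondences translate directly into segmentation errors, which is why constraining the matching itself during training is attractive.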
Taxonomy
Topics: Visual Attention and Saliency Detection · Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · VOS
