One-shot Training for Video Object Segmentation
Baiyu Chen, Sixian Chan, Xiaoqin Zhang

TL;DR
This paper introduces a novel one-shot training framework for video object segmentation that requires only a single labeled frame per training video, significantly reducing annotation effort while maintaining competitive performance.
Contribution
The paper presents what the authors describe as the first one-shot training method for VOS. It relies on bi-directional mask inference and reconstruction, and is applicable to most state-of-the-art VOS networks.
Findings
Achieves results comparable to fully supervised methods using only one labeled frame per training video.
Simple end-to-end approach that is easy to implement.
Reduces annotation cost significantly while maintaining performance.
Abstract
Video Object Segmentation (VOS) aims to track objects across frames in a video and segment them based on the initial annotated frame of the target objects. Previous VOS works typically rely on fully annotated videos for training. However, acquiring fully annotated training videos for VOS is labor-intensive and time-consuming. Meanwhile, self-supervised VOS methods have attempted to build VOS systems through correspondence learning and label propagation. Still, the absence of mask priors harms their robustness to complex scenarios, and the label propagation paradigm makes them impractical in terms of efficiency. To address these issues, we propose, for the first time, a general one-shot training framework for VOS, requiring only a single labeled frame per training video and applicable to a majority of state-of-the-art VOS networks. Specifically, our algorithm consists of: i) Inferring…
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques
Methods: VOS
