Box Supervised Video Segmentation Proposal Network
Tanveer Hannan, Rajat Koner, Jonathan Kobold, Matthias Schubert

TL;DR
This paper introduces a box-supervised video object segmentation proposal network that leverages motion cues and a novel motion-aware affinity loss. It outperforms existing self-supervised methods and achieves results competitive with fully-supervised ones, without changing the network architecture.
Contribution
It proposes a new box-supervised approach for video segmentation that incorporates motion analysis and a motion-aware affinity loss, bridging the gap between self-supervised and fully-supervised methods.
Findings
Outperforms state-of-the-art self-supervised benchmarks by 16.4% in J&F scores.
Achieves results competitive with fully-supervised methods on the DAVIS and YouTube-VOS datasets.
Demonstrates robustness through extensive testing and ablation studies.
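For readers unfamiliar with the J&F score cited above: in the DAVIS benchmark convention, J is region similarity (intersection over union between predicted and ground-truth masks), F is a boundary accuracy measure, and J&F is their mean. The sketch below is illustrative only. The function names are hypothetical, the boundary measure F is not implemented and is taken as a given number, and returning 1.0 when both masks are empty is an assumed convention.

```python
import numpy as np


def region_similarity(pred_mask, gt_mask):
    """J: intersection over union of two binary masks (hypothetical helper)."""
    pred = np.asarray(pred_mask, dtype=bool)
    gt = np.asarray(gt_mask, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    return float(np.logical_and(pred, gt).sum() / union)


def j_and_f(j_mean, f_mean):
    """J&F: arithmetic mean of the region (J) and boundary (F) scores."""
    return 0.5 * (j_mean + f_mean)


# Toy example: the prediction covers two pixels, the ground truth one of them.
pred = np.array([[1, 1], [0, 0]])
gt = np.array([[1, 0], [0, 0]])
j = region_similarity(pred, gt)  # intersection 1, union 2
```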
Abstract
Video Object Segmentation (VOS) has been targeted by various fully-supervised and self-supervised approaches. While fully-supervised methods demonstrate excellent results, self-supervised ones, which do not use pixel-level ground truth, attract much attention. However, self-supervised approaches pose a significant performance gap. Box-level annotations provide a balanced compromise between labeling effort and result quality for image segmentation but have not been exploited for the video domain. In this work, we propose a box-supervised video object segmentation proposal network, which takes advantage of intrinsic video properties. Our method incorporates object motion in the following way: first, motion is computed using a bidirectional temporal difference and a novel bounding box-guided motion compensation. Second, we introduce a novel motion-aware affinity loss that encourages the…
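The bidirectional temporal difference mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name is hypothetical, the frames are assumed to be grayscale arrays of equal shape, and combining the backward and forward differences with an element-wise minimum is an assumption borrowed from classical three-frame differencing. The bounding box-guided motion compensation step is not covered here.

```python
import numpy as np


def bidirectional_temporal_difference(prev_frame, cur_frame, next_frame):
    """Motion cue for the current frame from its two temporal neighbours.

    Hypothetical sketch: takes the absolute difference to the previous and
    to the next frame, then keeps the element-wise minimum so that only
    pixels that change in both directions remain.
    """
    prev_f = np.asarray(prev_frame, dtype=np.float32)
    cur_f = np.asarray(cur_frame, dtype=np.float32)
    next_f = np.asarray(next_frame, dtype=np.float32)
    backward = np.abs(cur_f - prev_f)  # change since the previous frame
    forward = np.abs(next_f - cur_f)   # change towards the next frame
    return np.minimum(backward, forward)


# Toy example: a single bright pixel moving one step to the right per frame.
prev_frame = np.array([[0, 1, 0, 0, 0]])
cur_frame = np.array([[0, 0, 1, 0, 0]])
next_frame = np.array([[0, 0, 0, 1, 0]])
motion = bidirectional_temporal_difference(prev_frame, cur_frame, next_frame)
```

In the toy example, the minimum suppresses the "ghost" responses at the object's previous and next positions, which appear in only one of the two differences, and keeps the response at its current position.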
Taxonomy
Topics: Visual Attention and Saliency Detection · Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques
