UVOSAM: A Mask-free Paradigm for Unsupervised Video Object Segmentation via Segment Anything Model
Zhenghao Zhang, Shengfan Zhang, Zhichao Wei, Zuozhuo Dai, Siyu Zhu

TL;DR
This paper introduces UVOSAM, a novel mask-free approach for unsupervised video object segmentation that leverages the Segment Anything Model and a specialized tracker to outperform existing methods without requiring mask annotations.
Contribution
UVOSAM is the first mask-free UVOS method utilizing SAM with a new tracker and attention mechanism, achieving superior results over mask-supervised approaches.
Findings
Outperforms existing mask-supervised UVOS methods on DAVIS2017-unsupervised and YoutubeVIS datasets.
Demonstrates strong generalization to weakly-annotated video datasets.
Uses a novel spatial-temporal decoupled deformable attention mechanism.
Abstract
The current state-of-the-art methods for unsupervised video object segmentation (UVOS) require extensive training on video datasets with mask annotations, limiting their effectiveness in handling challenging scenarios. However, the Segment Anything Model (SAM) introduces a new prompt-driven paradigm for image segmentation, offering new possibilities. In this study, we investigate SAM's potential for UVOS through different prompt strategies. We then propose UVOSAM, a mask-free paradigm for UVOS that utilizes the STD-Net tracker. STD-Net incorporates a spatial-temporal decoupled deformable attention mechanism to establish an effective correlation between intra- and inter-frame features, remarkably enhancing the quality of box prompts in complex video scenes. Extensive experiments on the DAVIS2017-unsupervised and YoutubeVIS19&21 datasets demonstrate the superior performance of UVOSAM…
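As a rough illustration of the box-prompt paradigm the abstract describes, the sketch below wires a tracker that emits per-frame boxes into a box-prompted segmenter. All names here (`segment_video`, `TrackedBox`, `toy_tracker`, `toy_segmenter`) are hypothetical and do not come from the paper; the two toy functions are simple thresholding stand-ins used only to make the data flow runnable, not STD-Net or SAM.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Box = Tuple[int, int, int, int]   # (x0, y0, x1, y1), x1/y1 exclusive
Frame = List[List[int]]           # toy grayscale frame: rows of pixel values
Mask = List[List[bool]]


@dataclass
class TrackedBox:
    track_id: int  # identity kept consistent across frames by the tracker
    box: Box


def segment_video(
    frames: List[Frame],
    tracker: Callable[[List[Frame]], List[List[TrackedBox]]],
    segmenter: Callable[[Frame, Box], Mask],
) -> List[Dict[int, Mask]]:
    """Run the tracker over the clip, then prompt the segmenter with each box.

    No mask annotations are involved: the tracker only produces boxes,
    and the segmenter turns each box prompt into a mask.
    """
    boxes_per_frame = tracker(frames)
    results: List[Dict[int, Mask]] = []
    for frame, tracked in zip(frames, boxes_per_frame):
        results.append({tb.track_id: segmenter(frame, tb.box) for tb in tracked})
    return results


def toy_tracker(frames: List[Frame]) -> List[List[TrackedBox]]:
    """Stand-in tracker (assumption): one object, box around non-zero pixels."""
    out: List[List[TrackedBox]] = []
    for frame in frames:
        pts = [(x, y) for y, row in enumerate(frame)
               for x, v in enumerate(row) if v > 0]
        if not pts:
            out.append([])
            continue
        xs = [p[0] for p in pts]
        ys = [p[1] for p in pts]
        out.append([TrackedBox(0, (min(xs), min(ys), max(xs) + 1, max(ys) + 1))])
    return out


def toy_segmenter(frame: Frame, box: Box) -> Mask:
    """Stand-in for a box-prompted segmenter (assumption): non-zero pixels in the box."""
    x0, y0, x1, y1 = box
    return [[x0 <= x < x1 and y0 <= y < y1 and v > 0
             for x, v in enumerate(row)]
            for y, row in enumerate(frame)]
```

In a real setup the `segmenter` argument would wrap a box-prompted model and the `tracker` argument would wrap a learned tracker; keeping both behind plain callables is only a design choice of this sketch, so either piece can be swapped without touching the loop.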
Taxonomy
Topics: Visual Attention and Saliency Detection · Advanced Neural Network Applications · Video Surveillance and Tracking Methods
Methods: Segment Anything Model
