UVOSAM: A Mask-free Paradigm for Unsupervised Video Object Segmentation via Segment Anything Model

Zhenghao Zhang; Shengfan Zhang; Zhichao Wei; Zuozhuo Dai; Siyu Zhu

arXiv:2305.12659 · cs.CV · July 9, 2025 · 6 cites

TL;DR

This paper introduces UVOSAM, a novel mask-free approach for unsupervised video object segmentation that leverages the Segment Anything Model and a specialized tracker to outperform existing methods without requiring mask annotations.

Contribution

UVOSAM is the first mask-free UVOS method utilizing SAM with a new tracker and attention mechanism, achieving superior results over mask-supervised approaches.

Findings

1. Outperforms existing mask-supervised UVOS methods on DAVIS2017-unsupervised and YoutubeVIS datasets.

2. Demonstrates strong generalization to weakly-annotated video datasets.

3. Uses a novel spatial-temporal decoupled deformable attention mechanism.

Abstract

The current state-of-the-art methods for unsupervised video object segmentation (UVOS) require extensive training on video datasets with mask annotations, limiting their effectiveness in handling challenging scenarios. However, the Segment Anything Model (SAM) introduces a new prompt-driven paradigm for image segmentation, offering new possibilities. In this study, we investigate SAM's potential for UVOS through different prompt strategies. We then propose UVOSAM, a mask-free paradigm for UVOS that utilizes the STD-Net tracker. STD-Net incorporates a spatial-temporal decoupled deformable attention mechanism to establish an effective correlation between intra- and inter-frame features, remarkably enhancing the quality of box prompts in complex video scenes. Extensive experiments on the DAVIS2017-unsupervised and YoutubeVIS19&21 datasets demonstrate the superior performance of UVOSAM…
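
The data flow the abstract describes (a tracker supplies per-frame box prompts, and a promptable segmenter turns each box into a mask) can be illustrated with a minimal sketch. This is not the authors' implementation: `rectangle_segmenter` is a hypothetical stand-in for SAM that simply fills the box, the tracker output is supplied by hand rather than produced by STD-Net, and every name below is an assumption introduced for illustration.

```python
from typing import Callable, Dict, List, Tuple

Box = Tuple[int, int, int, int]   # (x0, y0, x1, y1), with x1 and y1 exclusive
Mask = List[List[bool]]           # height x width binary mask
TrackedBoxes = Dict[int, Box]     # track id -> box prompt for one frame


def rectangle_segmenter(height: int, width: int, box: Box) -> Mask:
    """Hypothetical stand-in for a box-prompted segmenter: fills the box."""
    x0, y0, x1, y1 = box
    return [[x0 <= x < x1 and y0 <= y < y1 for x in range(width)]
            for y in range(height)]


def boxes_to_masks(
    frame_sizes: List[Tuple[int, int]],
    tracks_per_frame: List[TrackedBoxes],
    segment: Callable[[int, int, Box], Mask] = rectangle_segmenter,
) -> Dict[int, List[Tuple[int, Mask]]]:
    """Turn per-frame tracked boxes into per-object mask sequences.

    Returns {track_id: [(frame_index, mask), ...]}, in frame order.
    """
    results: Dict[int, List[Tuple[int, Mask]]] = {}
    for frame_idx, ((h, w), tracks) in enumerate(zip(frame_sizes, tracks_per_frame)):
        for track_id, box in tracks.items():
            # Each tracked box acts as the prompt for the segmenter.
            results.setdefault(track_id, []).append((frame_idx, segment(h, w, box)))
    return results


def mask_area(mask: Mask) -> int:
    """Count the foreground pixels of a mask."""
    return sum(sum(row) for row in mask)


if __name__ == "__main__":
    sizes = [(4, 6), (4, 6)]                        # two frames of height 4, width 6
    tracks = [{1: (0, 0, 2, 2)},                    # frame 0: one tracked object
              {1: (1, 1, 3, 3), 2: (3, 0, 6, 4)}]   # frame 1: a second object appears
    for track_id, seq in boxes_to_masks(sizes, tracks).items():
        print(track_id, [(idx, mask_area(m)) for idx, m in seq])
```

Because the segmenter is passed in as a callable, the stub could in principle be swapped for a wrapper around a real box-prompted model without touching the grouping logic; no mask annotations enter the pipeline at any point, which is the sense in which the paradigm is mask-free.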

Code & Models

Repositories

alibaba/uvosam · PyTorch · Official

Taxonomy

Topics: Visual Attention and Saliency Detection · Advanced Neural Network Applications · Video Surveillance and Tracking Methods

Methods: Segment Anything Model