UnSAMv2: Self-Supervised Learning Enables Segment Anything at Any Granularity
Junwei Yu, Trevor Darrell, XuDong Wang

TL;DR
UnSAMv2 is a self-supervised approach that lets the Segment Anything Model (SAM) control segmentation granularity precisely and continuously at any scale, without human annotations, and improves segmentation metrics across 11 benchmarks.
Contribution
The paper proposes a novel granularity control embedding and a self-supervised learning method that together unlock multi-scale segmentation in SAM without requiring dense annotations.
Findings
Achieves improved segmentation metrics across 11 benchmarks.
Enables continuous control over segmentation scale.
Uses only 6K unlabeled images with minimal additional parameters.
Abstract
The Segment Anything Model (SAM) family has become a widely adopted vision foundation model, but its ability to control segmentation granularity remains limited. Users often need to refine results manually - by adding more prompts or selecting from pre-generated masks - to achieve the desired level of detail. This process can be ambiguous, as the same prompt may correspond to several plausible masks, and collecting dense annotations across all granularities is prohibitively expensive, making supervised solutions infeasible. To address this limitation, we introduce UnSAMv2, which enables segment anything at any granularity without human annotations. UnSAMv2 extends the divide-and-conquer strategy of UnSAM by discovering abundant mask-granularity pairs and introducing a novel granularity control embedding that enables precise, continuous control over segmentation scale. Remarkably, with…
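To make the idea of a granularity control embedding concrete, here is a minimal sketch of how a scalar granularity value could be turned into an extra prompt token. It is not the authors' implementation: the sinusoidal encoding, the [0, 1] range, the token dimension, and the function names `granularity_embedding` and `append_granularity_token` are all assumptions made for illustration, and this summary does not say which end of the range corresponds to coarser masks.

```python
import math


def granularity_embedding(g, dim=8):
    """Encode a scalar granularity g in [0, 1] as sinusoidal features.

    Hypothetical sketch: the actual encoding used by UnSAMv2 is not
    described in this summary, so a generic Fourier-style encoding is
    assumed here.
    """
    if not 0.0 <= g <= 1.0:
        raise ValueError("granularity must lie in [0, 1]")
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    features = []
    for k in range(dim // 2):
        freq = (2.0 ** k) * math.pi  # assumed geometric frequency ladder
        features.append(math.sin(freq * g))
        features.append(math.cos(freq * g))
    return features


def append_granularity_token(prompt_tokens, g, dim=8):
    """Return a new token list with one granularity token appended.

    The input list is not modified.
    """
    return prompt_tokens + [granularity_embedding(g, dim)]


# Placeholder for an already-encoded point prompt (assumed 8-dimensional).
point_token = [0.0] * 8

# Same point prompt, two different granularity settings.
tokens_low = append_granularity_token([point_token], g=0.1)
tokens_high = append_granularity_token([point_token], g=0.9)
```

Because the granularity enters as a continuous scalar rather than a choice among a fixed set of pre-generated masks, the same point prompt can in principle be paired with any value in the range, which is the kind of continuous control the abstract describes.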
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Visual Attention and Saliency Detection
