The SAM2-to-SAM3 Gap in the Segment Anything Model Family: Why Prompt-Based Expertise Fails in Concept-Driven Image Segmentation
Ranjan Sapkota, Konstantinos I. Roumeliotis, Manoj Karkee

TL;DR
This paper analyzes the fundamental differences between SAM2 and SAM3, explaining why prompt-based expertise in SAM2 does not transfer to SAM3's concept-driven, multimodal segmentation paradigm, and highlights future research directions.
Contribution
It provides a detailed comparison of SAM2 and SAM3 architectures, datasets, training, and evaluation, clarifying why their segmentation approaches are fundamentally different.
Findings
SAM2 uses spatial prompts for geometric segmentation
SAM3 employs vision-language models for semantic, open-vocabulary segmentation
Prompt-based expertise in SAM2 does not transfer to SAM3's multimodal paradigm
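The contrast in the findings above can be made concrete with a minimal sketch. The class and function names below (`SpatialPrompt`, `ConceptPrompt`, `prompt_kind`) are hypothetical and exist only for illustration. They are not the SAM2 or SAM3 API. The sketch assumes only what the summary states: SAM2 prompts carry geometry (points, boxes), while SAM3 prompts name a concept in text, optionally with exemplars.

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union

Point = Tuple[float, float]               # (x, y) pixel coordinates
Box = Tuple[float, float, float, float]   # (x0, y0, x1, y1)


@dataclass
class SpatialPrompt:
    """Hypothetical SAM2-style prompt: says WHERE an object is, not WHAT it is."""
    points: List[Point] = field(default_factory=list)
    boxes: List[Box] = field(default_factory=list)


@dataclass
class ConceptPrompt:
    """Hypothetical SAM3-style prompt: names a concept, optionally with exemplars."""
    text: str
    exemplar_boxes: List[Box] = field(default_factory=list)


def prompt_kind(prompt: Union[SpatialPrompt, ConceptPrompt]) -> str:
    """Label a prompt by the kind of information it carries."""
    if isinstance(prompt, SpatialPrompt):
        return "geometric"   # locations only, no semantic label
    if isinstance(prompt, ConceptPrompt):
        return "semantic"    # open-vocabulary concept, no fixed location
    raise TypeError(f"unsupported prompt type: {type(prompt).__name__}")


click = SpatialPrompt(points=[(120.0, 84.0)])
concept = ConceptPrompt(text="red apple")
print(prompt_kind(click), prompt_kind(concept))
```

A spatial prompt has no field from which a concept label could be recovered, which is one way to read the claim that prompt-based expertise does not carry over directly.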
Abstract
This paper investigates the fundamental discontinuity between the two latest Segment Anything Models: SAM2 and SAM3. We explain why expertise in the prompt-based segmentation of SAM2 does not transfer to the multimodal, concept-driven paradigm of SAM3. SAM2 operates through spatial prompts (points, boxes, and masks), yielding purely geometric and temporal segmentation. In contrast, SAM3 introduces a unified vision-language architecture capable of open-vocabulary reasoning, semantic grounding, contrastive alignment, and exemplar-based concept understanding. We structure this analysis through five core components: (1) a Conceptual Break Between Prompt-Based and Concept-Based Segmentation, contrasting the spatial prompt semantics of SAM2 with the multimodal fusion and text-conditioned mask generation of SAM3; (2) Architectural Divergence, detailing the pure vision-temporal design of SAM2 versus…
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Visual Attention and Saliency Detection
