The SAM2-to-SAM3 Gap in the Segment Anything Model Family: Why Prompt-Based Expertise Fails in Concept-Driven Image Segmentation

Ranjan Sapkota; Konstantinos I. Roumeliotis; Manoj Karkee

arXiv:2512.06032 · cs.CV · December 9, 2025

Open Access

TL;DR

This paper analyzes the fundamental differences between SAM2 and SAM3, explaining why prompt-based expertise in SAM2 does not transfer to SAM3's concept-driven, multimodal segmentation paradigm, and highlights future research directions.

Contribution

It provides a detailed comparison of SAM2 and SAM3 architectures, datasets, training, and evaluation, clarifying why their segmentation approaches are fundamentally different.

Findings

01. SAM2 uses spatial prompts for geometric segmentation.
02. SAM3 employs vision-language models for semantic, open-vocabulary segmentation.
03. Prompt-based expertise in SAM2 does not transfer to SAM3's multimodal paradigm.

Abstract

This paper investigates the fundamental discontinuity between the two latest Segment Anything Models, SAM2 and SAM3. We explain why expertise in the prompt-based segmentation of SAM2 does not transfer to the multimodal, concept-driven paradigm of SAM3. SAM2 operates through spatial prompts (points, boxes, and masks), yielding purely geometric and temporal segmentation. In contrast, SAM3 introduces a unified vision-language architecture capable of open-vocabulary reasoning, semantic grounding, contrastive alignment, and exemplar-based concept understanding. We structure this analysis through five core components: (1) a Conceptual Break Between Prompt-Based and Concept-Based Segmentation, contrasting the spatial prompt semantics of SAM2 with the multimodal fusion and text-conditioned mask generation of SAM3; (2) Architectural Divergence, detailing the pure vision-temporal design of SAM2 versus…

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Visual Attention and Saliency Detection