SAM 2: Segment Anything in Images and Videos

Nikhila Ravi; Valentin Gabeur; Yuan-Ting Hu; Ronghang Hu; Chaitanya Ryali; Tengyu Ma; Haitham Khedr; Roman Rädle; Chloe Rolland; Laura Gustafson; Eric Mintun; Junting Pan; Kalyan Vasudev Alwala; Nicolas Carion; Chao-Yuan Wu; Ross Girshick; Piotr Dollár; Christoph Feichtenhofer

arXiv:2408.00714 · cs.CV · October 29, 2024 · 221 cites

TL;DR

SAM 2 is a foundation model for promptable visual segmentation in images and videos. It combines a transformer with streaming memory for real-time video processing with the largest video segmentation dataset to date, and it is more accurate and 6x faster than SAM on image segmentation.

Contribution

The paper introduces SAM 2, a transformer-based model with streaming memory, together with a large video segmentation dataset collected through a data engine, and demonstrates improvements in accuracy and efficiency over prior approaches.

Findings

01. Better accuracy in video segmentation with 3x fewer interactions than prior approaches

02. Image segmentation that is more accurate and 6x faster than SAM

03. Largest video segmentation dataset to date

Abstract

We present Segment Anything Model 2 (SAM 2), a foundation model towards solving promptable visual segmentation in images and videos. We build a data engine, which improves model and data via user interaction, to collect the largest video segmentation dataset to date. Our model is a simple transformer architecture with streaming memory for real-time video processing. SAM 2 trained on our data provides strong performance across a wide range of tasks. In video segmentation, we observe better accuracy, using 3x fewer interactions than prior approaches. In image segmentation, our model is more accurate and 6x faster than the Segment Anything Model (SAM). We believe that our data, model, and insights will serve as a significant milestone for video segmentation and related perception tasks. We are releasing our main model, dataset, as well as code for model training and our demo.
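The abstract states only that the model uses streaming memory for real-time video processing. As a rough illustration of that general idea, and not of SAM 2's actual architecture, the sketch below processes frames one at a time and conditions each prediction on a bounded memory of earlier outputs. The names `stream_segment`, `toy_predict`, and `max_memory` are hypothetical, and the lists of floats are stand-ins for image and mask tensors.

```python
from collections import deque
from typing import Callable, Deque, Iterable, Iterator, List, Tuple

Frame = List[float]  # stand-in for an image; a real model would use tensors


def stream_segment(
    frames: Iterable[Frame],
    predict: Callable[[Frame, List[float]], float],
    max_memory: int = 4,  # assumed memory size, chosen only for illustration
) -> Iterator[Tuple[int, float, int]]:
    """Process frames one at a time, conditioning each on a bounded memory."""
    # deque(maxlen=...) drops the oldest entry automatically once it is full,
    # so memory use stays constant no matter how long the video is.
    memory: Deque[float] = deque(maxlen=max_memory)
    for index, frame in enumerate(frames):
        output = predict(frame, list(memory))  # current frame plus remembered outputs
        memory.append(output)                  # remember this result for later frames
        yield index, output, len(memory)


def toy_predict(frame: Frame, remembered: List[float]) -> float:
    """Hypothetical predictor: frame mean, smoothed by the remembered outputs."""
    current = sum(frame) / len(frame)
    if not remembered:
        return current
    return 0.5 * current + 0.5 * (sum(remembered) / len(remembered))


if __name__ == "__main__":
    video = [[1.0, 1.0], [3.0, 3.0], [5.0, 5.0]]
    for index, output, memory_size in stream_segment(video, toy_predict, max_memory=2):
        print(index, output, memory_size)
```

Because the generator yields a result as soon as each frame arrives and never revisits earlier frames, it can run on a live stream. The fixed-size memory is the design choice that keeps per-frame cost bounded.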

Peer Reviews

Decision · ICLR 2025 Oral

Reviewer 01 · Rating 10 · Confidence 4

Strengths

1. With the data engine pipeline, the paper provides an extremely large-scale video segmentation dataset compared to previous datasets. This will allow the researchers to tackle much more challenging tasks in video segmentation.
2. Based on the experimental results, the trained SAM2 model outperforms the combination of SAM and existing state-of-the-art trackers by a large margin. Therefore, the data scaling-up with the data engine is effective, as described by the authors. The results also impl

Weaknesses

1. Although this paper uses a simpler structure and performs well, it is still possible to use previous structures, such as Cutie [R1], to achieve even better performance with SAM2 data. It would be better if the structure could be explored.
2. SAM2 cannot recognize segmented objects like previous models [R2, R3]. It would be better to discuss this since it may limit the application of this paper. It would also be better to discuss the difference with [R4], which supports image and video segmen

Reviewer 02 · Rating 8 · Confidence 4

Strengths

1. This paper proposes a strong foundation model for video and image segmentation. The data, model, and insights will serve as a significant milestone for video segmentation.
2. The paper is well written and easy to understand.

Weaknesses

1. More experiments should be conducted. For example, more interactive VOS methods should be compared.
   [1*] Modular interactive video object segmentation: Interaction-to-mask, propagation and difference-aware fusion. CVPR 2021
   [2*] Memory aggregation networks for efficient interactive video object segmentation. CVPR 2020
2. More VOS datasets (e.g., VIPOSeg [4*]) should be included in this paper.
   [4*] Video Object Segmentation in Panoptic Wild Scenes. IJCAI 2023

Reviewer 03 · Rating 8 · Confidence 4

Strengths

* Compared to the original SAM model, SAM 2 improves segmentation accuracy, enabling more precise identification and segmentation of objects in images and videos.
* The processing speed is approximately six times faster than its predecessor. This allows SAM 2 to generate segmentation masks more quickly, making it suitable for real-time applications.
* SAM 2 exhibits strong zero-shot transfer capability.
* The training dataset includes 11 million images and 11 billion masks, providing a robust fo

Weaknesses

From my perspective, there is no obvious weakness in this work. If I had to name any:
1. The claimed improvement in running speed is mainly due to the use of the Hiera image encoder, which may not be viewed as a unique contribution of this study.
2. The primary contribution lies in a large-scale dataset and pre-trained models, while the technical contribution is relatively limited.

Code & Models

Videos

SAM 2: Segment Anything in Images and Videos · SlidesLive

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis

Methods: Segment Anything Model