SAM 2: Segment Anything in Images and Videos
Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár

TL;DR
SAM 2 is a foundation model for promptable visual segmentation in images and videos. It pairs a transformer with streaming memory for real-time video processing with the largest video segmentation dataset to date, and it is more accurate and 6x faster than SAM on image segmentation.
Contribution
The paper introduces SAM 2, a transformer-based model with streaming memory, builds a large video segmentation dataset through a data engine, and demonstrates improvements in both accuracy and efficiency over prior approaches.
Findings
Better accuracy in video segmentation with 3x fewer interactions than prior approaches
6x faster image segmentation compared to SAM
Largest video segmentation dataset to date
Abstract
We present Segment Anything Model 2 (SAM 2), a foundation model towards solving promptable visual segmentation in images and videos. We build a data engine, which improves model and data via user interaction, to collect the largest video segmentation dataset to date. Our model is a simple transformer architecture with streaming memory for real-time video processing. SAM 2 trained on our data provides strong performance across a wide range of tasks. In video segmentation, we observe better accuracy, using 3x fewer interactions than prior approaches. In image segmentation, our model is more accurate and 6x faster than the Segment Anything Model (SAM). We believe that our data, model, and insights will serve as a significant milestone for video segmentation and related perception tasks. We are releasing our main model, dataset, as well as code for model training and our demo.
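The "streaming memory" idea in the abstract can be illustrated with a toy sketch. Frames are processed one at a time, each frame reads from a bounded store of features from earlier frames, and the store is then updated. This is not SAM 2's actual architecture: the class `StreamingMemory`, the function `process_stream`, the FIFO eviction policy, and the plain dot-product attention are all assumptions chosen for illustration.

```python
from collections import deque
import math


class StreamingMemory:
    """Toy FIFO memory bank (hypothetical): keeps features of recent frames only."""

    def __init__(self, capacity):
        # deque with maxlen drops the oldest entry automatically when full
        self.bank = deque(maxlen=capacity)

    def read(self, query):
        """Return a softmax-weighted average of stored features (dot-product attention)."""
        if not self.bank:
            return list(query)  # nothing stored yet: pass the query through
        scale = math.sqrt(len(query))
        scores = [sum(q * m for q, m in zip(query, mem)) / scale for mem in self.bank]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        return [
            sum(w * mem[i] for w, mem in zip(weights, self.bank)) / total
            for i in range(len(query))
        ]

    def write(self, feature):
        self.bank.append(list(feature))


def process_stream(frames, capacity=3):
    """Process frames in order; each output depends only on earlier frames."""
    memory = StreamingMemory(capacity)
    outputs = []
    for feature in frames:
        outputs.append(memory.read(feature))  # condition on the past
        memory.write(feature)                 # then store the current frame
    return outputs, memory


if __name__ == "__main__":
    frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0], [0.0, 2.0]]
    outputs, memory = process_stream(frames, capacity=3)
    print(len(outputs), len(memory.bank))
```

Because the bank has a fixed capacity, per-frame cost stays bounded regardless of video length, which is the property that makes a streaming design suitable for real-time processing.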
Peer Reviews
Decision: ICLR 2025 Oral
1. With the data engine pipeline, the paper provides an extremely large-scale video segmentation dataset compared to previous datasets. This will allow the researchers to tackle much more challenging tasks in video segmentation. 2. Based on the experimental results, the trained SAM2 model outperforms the combination of SAM and existing state-of-the-art trackers by a large margin. Therefore, the data scaling-up with the data engine is effective, as described by the authors. The results also impl
1. Although this paper uses a simpler structure and performs well, it is still possible to use previous structures, such as Cutie [R1], to achieve even better performance with SAM2 data. It would be better if the structure could be explored. 2. SAM2 cannot recognize segmented objects like previous models [R2, R3]. It would be better to discuss this since it may limit the application of this paper. It would also be better to discuss the difference with [R4], which supports image and video segmen
1. This paper proposes a strong foundation model for video and image segmentation. The data, model, and insights will serve as a significant milestone for video segmentation. 2. The paper is well written and easy to understand.
1. More experiments should be conducted. For example, more interactive VOS methods should be compared. [1*] Modular interactive video object segmentation: Interaction-to-mask, propagation and difference-aware fusion. CVPR 2021 [2*] Memory aggregation networks for efficient interactive video object segmentation. CVPR 2020 2. More VOS datasets (e.g., VIPOSeg[4*]) should be included in this paper. [4*] Video Object Segmentation in Panoptic Wild Scenes. IJCAI 2023
* Compared to the original SAM model, SAM 2 improves segmentation accuracy, enabling more precise identification and segmentation of objects in images and videos. * The processing speed is approximately six times faster than its predecessor. This allows SAM 2 to generate segmentation masks more quickly, making it suitable for real-time applications. * SAM 2 exhibits strong zero-shot transfer capability. * The training dataset includes 11 million images and 11 billion masks, providing a robust fo
From my perspective, there is no obvious weakness in this work. If I must name some: 1. The claimed improvement in running speed is mainly due to the use of the Hiera image encoder, which may not be viewed as a unique contribution of this study. 2. The primary contribution lies in a large-scale dataset and pre-trained models, while the technical contribution is relatively limited.
Code & Models
- 🤗 facebook/sam2.1-hiera-large · model · 30k dl · ♡ 132
- 🤗 facebook/sam2-hiera-large · model · 39k dl · ♡ 131
- 🤗 facebook/sam2-hiera-base-plus · model · 3.7k dl · ♡ 12
- 🤗 facebook/sam2-hiera-tiny · model · 6.0k dl · ♡ 26
- 🤗 facebook/sam2-hiera-small · model · 4.9k dl · ♡ 15
- 🤗 facebook/sam2-hiera-base-plus-hf · model · 9 dl · ♡ 2
- 🤗 facebook/sam2-hiera-large-hf · model · 18 dl · ♡ 9
- 🤗 facebook/sam2-hiera-small-hf · model · 51 dl · ♡ 1
- 🤗 facebook/sam2-hiera-tiny-hf · model · 38 dl · ♡ 4
- 🤗 shubham0204/sam2-onnx-models · model · ♡ 7
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis
Methods: Segment Anything Model
