SAM2S: Segment Anything in Surgical Videos via Semantic Long-term Tracking
Haofeng Liu, Ziyue Wang, Sudhanshu Mishra, Mingqi Gao, Guanyi Qin, Chang Han Low, Alex Y. W. Kong, Yueming Jin

TL;DR
This paper introduces SAM2S, a foundation model for surgical video segmentation that leverages a new benchmark and novel memory and learning mechanisms to improve long-term tracking and zero-shot generalization in surgical scenarios.
Contribution
The paper presents SAM2S, a foundation model for surgical interactive video object segmentation (iVOS), built on the SA-SV surgical benchmark and incorporating DiveMem, temporal semantic learning, and ambiguity-resilient training.
Findings
Fine-tuning SAM2 on SA-SV improves it by 12.99 average J&F points over vanilla SAM2.
SAM2S achieves 80.42 average J&F, surpassing both vanilla and fine-tuned SAM2.
SAM2S runs at 68 FPS with strong zero-shot generalization.
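The findings above are reported in J&F, the mean of region similarity (J, the Jaccard index or IoU of two masks) and contour accuracy (F, a boundary F-measure), as commonly used in video object segmentation benchmarks. The following is a minimal sketch of that metric for a single frame and a single object. It assumes binary NumPy masks, and the helper names (`jaccard`, `boundary_f`, `j_and_f`) and the pixel tolerance `tol` are illustrative choices, not the evaluation code used for SAM2S, whose exact protocol is not given in this summary.

```python
import numpy as np

def jaccard(pred, gt):
    """Region similarity J: intersection over union of two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return np.logical_and(pred, gt).sum() / union

def _dilate(mask, radius):
    """Dilate a boolean mask with a (2r+1)x(2r+1) square, zero-padded at the border."""
    h, w = mask.shape
    padded = np.pad(mask, radius, mode="constant", constant_values=False)
    out = np.zeros_like(mask)
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            out |= padded[dy:dy + h, dx:dx + w]
    return out

def boundary(mask):
    """Foreground pixels with at least one background pixel in their 8-neighbourhood."""
    mask = mask.astype(bool)
    return mask & _dilate(~mask, 1)

def boundary_f(pred, gt, tol=1):
    """Contour accuracy F: F-measure of boundary pixels matched within `tol` pixels.

    Assumption: a simplified matching rule (square dilation), not a benchmark's
    official boundary-matching implementation.
    """
    pb, gb = boundary(pred), boundary(gt)
    if pb.sum() == 0 and gb.sum() == 0:
        return 1.0
    if pb.sum() == 0 or gb.sum() == 0:
        return 0.0
    precision = (pb & _dilate(gb, tol)).sum() / pb.sum()
    recall = (gb & _dilate(pb, tol)).sum() / gb.sum()
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def j_and_f(pred, gt, tol=1):
    """J&F: the mean of region similarity and contour accuracy."""
    return 0.5 * (jaccard(pred, gt) + boundary_f(pred, gt, tol))

# Toy example: the prediction covers the left half of a 4x4 ground-truth square.
gt = np.zeros((8, 8), dtype=bool)
gt[2:6, 2:6] = True
half = np.zeros((8, 8), dtype=bool)
half[2:6, 2:4] = True
score = j_and_f(half, gt)
```

A reported average such as 80.42 would correspond to this score scaled to 0-100 and averaged over frames, objects, and datasets; how that averaging is done for SA-SV is not stated here.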
Abstract
Surgical video segmentation is crucial for computer-assisted surgery, enabling precise localization and tracking of instruments and tissues. Interactive Video Object Segmentation (iVOS) models such as Segment Anything Model 2 (SAM2) provide prompt-based flexibility beyond methods with predefined categories, but face challenges in surgical scenarios due to the domain gap and limited long-term tracking. To address these limitations, we construct SA-SV, the largest surgical iVOS benchmark with instance-level spatio-temporal annotations (masklets) spanning eight procedure types (61k frames, 1.6k masklets), enabling comprehensive development and evaluation for long-term tracking and zero-shot generalization. Building on SA-SV, we propose SAM2S, a foundation model enhancing SAM2 for Surgical iVOS through: (1) DiveMem, a trainable diverse memory mechanism for robust long-term…
Taxonomy
Topics: Surgical Simulation and Training · Advanced Neural Network Applications · Multimodal Machine Learning Applications
