Segment Anything for Video: A Comprehensive Review of Video Object Segmentation and Tracking from Past to Future

Guoping Xu; Jayaram K. Udupa; Yajun Yu; Hua-Chieh Shao; Songlin Zhao; Wei Liu; You Zhang

arXiv:2507.22792 · cs.CV · August 5, 2025

TL;DR

This paper reviews the evolution of video object segmentation and tracking (VOST) methods, emphasizing how foundation models such as SAM and SAM2 have changed the field, and outlines directions for future research.

Contribution

It provides a comprehensive survey of SAM/SAM2-based VOST methods, structured along three temporal dimensions (past, present, and future), highlighting recent innovations and remaining challenges.

Findings

01  SAM and SAM2 enable prompt-driven segmentation with strong generalization.

02  Recent methods incorporate motion-aware memory and trajectory-guided prompting.

03  The survey identifies open challenges such as memory redundancy and prompt inefficiency.
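
To make the idea of trajectory-guided prompting concrete, here is a minimal sketch under stated assumptions. It is not taken from the paper or from any SAM/SAM2 API: the `Box` type and the `predict_next_box` helper are hypothetical names introduced for illustration, and the constant-velocity motion model is one simple choice among many. The sketch extrapolates the object's bounding box from its two most recent positions, which could then serve as a box prompt for a promptable segmenter on the next frame.

```python
from typing import List, Tuple

# Hypothetical box type: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def box_center(box: Box) -> Tuple[float, float]:
    """Return the (x, y) center of a box."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2.0, (y0 + y1) / 2.0)


def predict_next_box(history: List[Box]) -> Box:
    """Extrapolate the next box with a constant-velocity assumption.

    The box size is kept from the latest frame; only its position is
    shifted by the center displacement between the last two frames.
    """
    if not history:
        raise ValueError("history must contain at least one box")
    if len(history) == 1:
        # No motion estimate is available yet, so reuse the last box.
        return history[-1]
    (px, py) = box_center(history[-2])
    (cx, cy) = box_center(history[-1])
    dx, dy = cx - px, cy - py
    x0, y0, x1, y1 = history[-1]
    return (x0 + dx, y0 + dy, x1 + dx, y1 + dy)


# Example: the object moved 4 px right and 2 px down between frames.
history = [(10.0, 20.0, 50.0, 60.0), (14.0, 22.0, 54.0, 62.0)]
prompt_box = predict_next_box(history)
print(prompt_box)  # (18.0, 24.0, 58.0, 64.0)
```

In a real pipeline the predicted box would only be a starting prompt; the methods covered by the survey may use richer motion models or combine the prediction with memory features.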

Abstract

Video Object Segmentation and Tracking (VOST) presents a complex yet critical challenge in computer vision, requiring robust integration of segmentation and tracking across temporally dynamic frames. Traditional methods have struggled with domain generalization, temporal consistency, and computational efficiency. The emergence of foundation models like the Segment Anything Model (SAM) and its successor, SAM2, has introduced a paradigm shift, enabling prompt-driven segmentation with strong generalization capabilities. Building upon these advances, this survey provides a comprehensive review of SAM/SAM2-based methods for VOST, structured along three temporal dimensions: past, present, and future. We examine strategies for retaining and updating historical information (past), approaches for extracting and optimizing discriminative features from the current frame (present), and motion…
