Segment Anything Across Shots: A Method and Benchmark
Hengrui Hu, Kaining Ying, Henghui Ding

TL;DR
This paper introduces a method and a benchmark for multi-shot semi-supervised video object segmentation (MVOS). It addresses shot discontinuities with a data augmentation strategy that mimics shot transitions and a model that detects those transitions and segments the target object across them.
Contribution
The paper proposes the transition mimicking data augmentation (TMA) strategy and the Segment Anything Across Shots (SAAS) model for improved cross-shot generalization in MVOS, along with the new Cut-VOS benchmark for evaluation.
Findings
SAAS achieves state-of-the-art performance on the YouMVOS and Cut-VOS datasets.
TMA enables cross-shot generalization from single-shot data, easing the scarcity of annotated multi-shot data.
Cut-VOS provides a diverse and challenging benchmark for MVOS.
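The summary does not say how TMA is implemented, so the following is only an illustrative sketch of the general idea: turning a single-shot clip into a pseudo multi-shot training sample. It assumes a hard cut can be imitated by applying one abrupt crop and horizontal flip to every frame after a chosen index. The function name `mimic_transition`, the crop range, and the choice of transforms are all hypothetical and are not taken from the paper.

```python
import numpy as np


def mimic_transition(frames, masks, cut_index, rng):
    """Simulate a hard cut inside a single-shot clip (hypothetical sketch).

    frames: (T, H, W, C) array; masks: (T, H, W) array.
    Frames from cut_index onward share one random crop (resized back with
    nearest-neighbour sampling) and a horizontal flip, so the target's
    position and scale jump abruptly at cut_index.
    """
    T, H, W = masks.shape
    if not 0 < cut_index < T:
        raise ValueError("cut_index must fall strictly inside the clip")

    # Crop window covering 60-90% of each side (an assumed range).
    scale = rng.uniform(0.6, 0.9)
    ch, cw = max(1, int(H * scale)), max(1, int(W * scale))
    top = rng.integers(0, H - ch + 1)
    left = rng.integers(0, W - cw + 1)

    # Nearest-neighbour index maps from output pixels back into the crop.
    rows = top + (np.arange(H) * ch // H)
    cols = left + (np.arange(W) * cw // W)
    cols = cols[::-1]  # horizontal flip

    out_frames = frames.copy()
    out_masks = masks.copy()
    # Apply the same geometric change to images and masks after the cut.
    out_frames[cut_index:] = frames[cut_index:][:, rows][:, :, cols]
    out_masks[cut_index:] = masks[cut_index:][:, rows][:, :, cols]
    return out_frames, out_masks
```

Frames before `cut_index` are left untouched, so the initial mask still refers to the original view. The crop may push the target partly or fully out of frame after the simulated cut, which a real pipeline might either allow or filter out.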
Abstract
This work focuses on multi-shot semi-supervised video object segmentation (MVOS), which aims at segmenting the target object indicated by an initial mask throughout a video with multiple shots. Existing VOS methods mainly focus on single-shot videos and struggle with shot discontinuities, which limits their real-world applicability. We propose a transition mimicking data augmentation strategy (TMA), which enables cross-shot generalization with single-shot data to alleviate the severe sparsity of annotated multi-shot data, and the Segment Anything Across Shots (SAAS) model, which can detect and comprehend shot transitions effectively. To support evaluation and future study in MVOS, we introduce Cut-VOS, a new MVOS benchmark with dense mask annotations, diverse object categories, and high-frequency transitions. Extensive experiments on YouMVOS and Cut-VOS demonstrate that the proposed…
Taxonomy
Topics: Visual Attention and Saliency Detection · Video Analysis and Summarization · Human Pose and Action Recognition
