Training-Free Robust Interactive Video Object Segmentation
Xiaoli Wei, Zhaoqing Wang, Yandong Guo, Chunxia Zhang, Tongliang Liu, Mingming Gong

TL;DR
This paper introduces a training-free, robust framework for interactive video object segmentation built on SAM. It combines sparse point tracking with box tracking and adds a cross-round module to improve stability and performance across diverse datasets.
Contribution
The proposed I-PT framework integrates training-free prompt tracking with a cross-round module to make interactive video segmentation more robust.
Findings
Achieves strong zero-shot segmentation on DAVIS 2017, YouTube-VOS 2018, and MOSE 2023 datasets.
Maintains a good balance between segmentation accuracy and interaction time.
Outperforms existing methods in robustness and efficiency.
Abstract
Interactive video object segmentation is a crucial video task, with applications ranging from video editing to data annotation. However, current approaches struggle to accurately segment objects across diverse domains. Recently, the Segment Anything Model (SAM) introduced interactive visual prompts and demonstrated impressive performance across different domains. In this paper, we propose a training-free prompt tracking framework for interactive video object segmentation (I-PT), leveraging the powerful generalization of SAM. Although point tracking efficiently captures the pixel-wise information of objects in a video, points tend to become unstable when tracked over a long period, resulting in incorrect segmentation. For fast and robust interaction, we jointly adopt sparse point tracking and box tracking, filtering out unstable points and capturing object-wise information. To better integrate…
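The abstract's idea of using a tracked box to discard unstable tracked points can be illustrated with a minimal sketch. The abstract does not state the filtering criterion, so the rule below (keep a point only if it lies inside the tracked box, with an optional margin) is an assumption for illustration, and the names filter_points_by_box, tracked_points, and tracked_box are hypothetical rather than taken from the paper.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def filter_points_by_box(points: List[Point], box: Box, margin: float = 0.0) -> List[Point]:
    """Keep only tracked points inside the tracked box.

    Assumed rule for illustration; the paper's actual criterion may differ.
    """
    x_min, y_min, x_max, y_max = box
    return [
        (x, y)
        for x, y in points
        if x_min - margin <= x <= x_max + margin
        and y_min - margin <= y <= y_max + margin
    ]


# Hypothetical tracker outputs for one frame: two points stay on the
# object, one has drifted far outside the tracked box.
tracked_points = [(12.0, 20.0), (15.5, 22.0), (90.0, 5.0)]
tracked_box = (10.0, 18.0, 30.0, 40.0)

stable_points = filter_points_by_box(tracked_points, tracked_box)
```

In a pipeline of this kind, the surviving points and the box would then presumably be passed to SAM as prompts for that frame; that step is omitted here because it depends on the specific SAM interface in use.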
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Visual Attention and Saliency Detection · Multimodal Machine Learning Applications
Methods: Segment Anything Model · VOS
