Training-Free Robust Interactive Video Object Segmentation

Xiaoli Wei; Zhaoqing Wang; Yandong Guo; Chunxia Zhang; Tongliang Liu; Mingming Gong

arXiv:2406.05485 · cs.CV · June 11, 2024

TL;DR

This paper introduces a training-free, robust interactive video object segmentation framework built on SAM. It combines sparse point and box tracking with a cross-round module to improve stability and performance across diverse datasets.

Contribution

The proposed I-PT framework is novel in integrating training-free prompt tracking with a cross-round module for enhanced robustness in interactive video segmentation.

Findings

01. Achieves strong zero-shot segmentation on the DAVIS 2017, YouTube-VOS 2018, and MOSE 2023 datasets.
02. Maintains a good balance between segmentation accuracy and interaction time.
03. Outperforms existing methods in robustness and efficiency.

Abstract

Interactive video object segmentation is a crucial video task, with applications ranging from video editing to data annotation. However, current approaches struggle to segment objects accurately across diverse domains. Recently, the Segment Anything Model (SAM) introduced interactive visual prompts and demonstrated impressive performance across different domains. In this paper, we propose a training-free prompt tracking framework for interactive video object segmentation (I-PT), leveraging the powerful generalization of SAM. Although point tracking efficiently captures the pixel-wise information of objects in a video, points tend to become unstable when tracked over long periods, resulting in incorrect segmentation. Towards fast and robust interaction, we jointly adopt sparse point and box tracking, filtering out unstable points and capturing object-wise information. To better integrate…
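
The abstract's idea of tracking sparse points, discarding unstable ones, and using a box to capture object-level extent can be illustrated with a small sketch. This is not the authors' I-PT implementation. The `TrackedPoint` record, the confidence and jump thresholds, and the choice to derive the box from the surviving points are all hypothetical and exist only for illustration. The resulting points and box correspond to the two prompt types SAM accepts.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


@dataclass
class TrackedPoint:
    """Hypothetical record for one tracked query point in the current frame."""
    prev: Point        # position in the previous frame
    curr: Point        # position in the current frame
    confidence: float  # tracker's visibility/confidence score in [0, 1]


def filter_stable_points(points: List[TrackedPoint],
                         min_conf: float = 0.5,
                         max_jump: float = 40.0) -> List[Point]:
    """Keep points that are tracked confidently and did not jump too far.

    Both thresholds are illustrative assumptions, not values from the paper.
    """
    stable = []
    for p in points:
        dx = p.curr[0] - p.prev[0]
        dy = p.curr[1] - p.prev[1]
        jump = (dx * dx + dy * dy) ** 0.5  # frame-to-frame displacement
        if p.confidence >= min_conf and jump <= max_jump:
            stable.append(p.curr)
    return stable


def box_from_points(points: List[Point], pad: float = 0.0) -> Optional[Box]:
    """Axis-aligned box around the given points, padded by `pad` pixels."""
    if not points:
        return None
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs) - pad, min(ys) - pad, max(xs) + pad, max(ys) + pad)


# Usage: two well-behaved tracks, one that jumps, one with low confidence.
tracks = [
    TrackedPoint(prev=(100, 100), curr=(104, 102), confidence=0.9),
    TrackedPoint(prev=(150, 120), curr=(153, 125), confidence=0.8),
    TrackedPoint(prev=(120, 140), curr=(300, 40), confidence=0.9),   # large jump
    TrackedPoint(prev=(130, 110), curr=(131, 111), confidence=0.2),  # low confidence
]
stable = filter_stable_points(tracks)
box = box_from_points(stable, pad=5)
print(stable)  # [(104, 102), (153, 125)]
print(box)     # (99, 97, 158, 130)
```

Here the box acts as an object-level prompt that does not depend on any single point surviving, which is the intuition the abstract gives for pairing boxes with points.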

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Visual Attention and Saliency Detection · Multimodal Machine Learning Applications

Methods: Segment Anything Model · VOS