X-Prompt: Multi-modal Visual Prompt for Video Object Segmentation

Pinxue Guo; Wanyun Li; Hao Huang; Lingyi Hong; Xinyu Zhou; Zhaoyu Chen; Jinglun Li; Kaixun Jiang; Wei Zhang; Wenqiang Zhang

arXiv:2409.19342·cs.CV·October 1, 2024

TL;DR

The paper introduces X-Prompt, a universal multi-modal video object segmentation framework that adapts a pre-trained RGB model to various modalities using prompts, achieving state-of-the-art results across multiple benchmarks.

Contribution

It proposes a novel prompt-based framework with modality-specific adaptation experts for efficient multi-modal VOS, reducing the need for full fine-tuning and improving generalization.
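The general idea of adapting a frozen pre-trained model through a small set of prompt parameters can be sketched with a toy example. This is a minimal illustration under stated assumptions, not the X-Prompt architecture: the "foundation model" here is a fixed linear scorer, the prompt is a per-channel weight applied to a second modality, and every name (`FROZEN_W`, `prompted_forward`, `train_prompt`) is hypothetical.

```python
import random

random.seed(0)
DIM = 4

# Hypothetical stand-in for a pre-trained RGB model: fixed weights, never updated.
FROZEN_W = [0.5, -0.3, 0.8, 0.1]


def frozen_model(tokens):
    """Frozen scorer: weighted sum of the token features."""
    return sum(w * t for w, t in zip(FROZEN_W, tokens))


def prompted_forward(rgb, x, prompt_w):
    """Add a per-channel projection of the extra modality X to the RGB tokens,
    then run the unchanged frozen model on the result."""
    tokens = [r + p * xi for r, p, xi in zip(rgb, prompt_w, x)]
    return frozen_model(tokens)


def mse(data, prompt_w):
    """Mean squared error of the prompted model over the dataset."""
    return sum((prompted_forward(r, x, prompt_w) - t) ** 2 for r, x, t in data) / len(data)


def train_prompt(data, steps=200, lr=0.05):
    """Gradient descent on squared error; only the prompt weights are updated."""
    prompt_w = [0.0] * DIM
    for _ in range(steps):
        for rgb, x, target in data:
            err = prompted_forward(rgb, x, prompt_w) - target
            # d(err^2)/d(prompt_w[i]) = 2 * err * FROZEN_W[i] * x[i]
            for i in range(DIM):
                prompt_w[i] -= lr * 2 * err * FROZEN_W[i] * x[i]
    return prompt_w


def make_data(n, true_prompt):
    """Synthetic (rgb, x, target) triples whose targets depend on modality X."""
    data = []
    for _ in range(n):
        rgb = [random.uniform(-1, 1) for _ in range(DIM)]
        x = [random.uniform(-1, 1) for _ in range(DIM)]
        data.append((rgb, x, prompted_forward(rgb, x, true_prompt)))
    return data


data = make_data(32, true_prompt=[1.0, -2.0, 0.5, 1.5])
frozen_before = list(FROZEN_W)
loss_before = mse(data, [0.0] * DIM)   # frozen model with no prompt signal
learned_prompt = train_prompt(data)
loss_after = mse(data, learned_prompt)  # frozen model plus trained prompt
print(loss_before, loss_after)
```

Only `DIM` prompt weights are trained while `FROZEN_W` stays untouched, which is the sense in which prompt-based adaptation avoids full-parameter fine-tuning on a small multi-modal dataset.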

Findings

01. X-Prompt outperforms full fine-tuning methods on multiple benchmarks.

02. The framework achieves state-of-the-art performance across 3 tasks and 4 datasets.

03. Prompt-based adaptation effectively leverages limited multi-modal data.

Abstract

Multi-modal Video Object Segmentation (VOS), including RGB-Thermal, RGB-Depth, and RGB-Event, has garnered attention due to its capability to address challenging scenarios where traditional VOS methods struggle, such as extreme illumination, rapid motion, and background distraction. Existing approaches often involve designing specific additional branches and performing full-parameter fine-tuning for fusion in each task. However, this paradigm not only duplicates research efforts and hardware costs but also risks model collapse given the limited multi-modal annotated data. In this paper, we propose a universal framework named X-Prompt for all multi-modal video object segmentation tasks, designated as RGB+X. The X-Prompt framework first pre-trains a video object segmentation foundation model using RGB data, and then utilizes the additional modality as the prompt to adapt it to downstream…

Code & Models

Repositories

pinxueguo/x-prompt (PyTorch, official)

Taxonomy

Topics: Visual Attention and Saliency Detection · Advanced Image and Video Retrieval Techniques · Video Surveillance and Tracking Methods

Methods: Softmax · Attention Is All You Need · VOS