X-Prompt: Multi-modal Visual Prompt for Video Object Segmentation
Pinxue Guo, Wanyun Li, Hao Huang, Lingyi Hong, Xinyu Zhou, Zhaoyu Chen, Jinglun Li, Kaixun Jiang, Wei Zhang, Wenqiang Zhang

TL;DR
The paper introduces X-Prompt, a universal multi-modal video object segmentation framework that adapts a pre-trained RGB model to various modalities using prompts, achieving state-of-the-art results across multiple benchmarks.
Contribution
It proposes a novel prompt-based framework with modality-specific adaptation experts for efficient multi-modal VOS, reducing the need for full fine-tuning and improving generalization.
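The summary above does not say how the adaptation experts are built, so the following is only a hypothetical sketch of the general idea of prompt-based adaptation: a frozen RGB-pretrained layer is left untouched while a small trainable module turns the extra modality (X) into a prompt that is added to the RGB features. The class names `FrozenLayer` and `PromptExpert`, the low-rank design, the zero initialization, and the sizes (`dim=64`, `rank=4`) are all assumptions made for illustration and are not taken from the paper.

```python
import random


def matvec(weights, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def random_matrix(rows, cols, rng):
    """Small random matrix standing in for learned weights."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class FrozenLayer:
    """Stand-in for one layer of the RGB-pretrained model; never updated."""

    def __init__(self, dim, rng):
        self.weights = random_matrix(dim, dim, rng)

    def __call__(self, feat):
        return matvec(self.weights, feat)

    def num_params(self):
        return len(self.weights) * len(self.weights[0])


class PromptExpert:
    """Hypothetical low-rank expert (dim -> rank -> dim); the only trainable part."""

    def __init__(self, dim, rank, rng):
        self.down = random_matrix(rank, dim, rng)
        # Assumption: zero-initialise the up-projection so that, before any
        # training, the adapted model behaves exactly like the RGB model.
        self.up = [[0.0] * rank for _ in range(dim)]

    def __call__(self, x_feat):
        return matvec(self.up, matvec(self.down, x_feat))

    def num_params(self):
        return sum(len(r) for r in self.down) + sum(len(r) for r in self.up)


def adapted_forward(layer, expert, rgb_feat, x_feat):
    """Add the X-modality prompt to the RGB features, then run the frozen layer."""
    prompt = expert(x_feat)
    fused = [r + p for r, p in zip(rgb_feat, prompt)]
    return layer(fused)


rng = random.Random(0)
layer = FrozenLayer(dim=64, rng=rng)
expert = PromptExpert(dim=64, rank=4, rng=rng)

# Share of parameters that would be trained in this toy setup.
trainable_fraction = expert.num_params() / (expert.num_params() + layer.num_params())
```

In this toy configuration the expert holds 512 parameters against 4096 in the frozen layer, which illustrates why adapting through a prompt can be cheaper than full fine-tuning. Swapping in a different `PromptExpert` per modality (thermal, depth, event) while reusing the same `FrozenLayer` would mirror the RGB+X framing, again under the assumptions stated above.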
Findings
X-Prompt outperforms full fine-tuning methods on multiple benchmarks.
The framework achieves state-of-the-art performance across 3 tasks and 4 datasets.
Prompt-based adaptation effectively leverages limited multi-modal data.
Abstract
Multi-modal Video Object Segmentation (VOS), including RGB-Thermal, RGB-Depth, and RGB-Event, has garnered attention due to its capability to address challenging scenarios where traditional VOS methods struggle, such as extreme illumination, rapid motion, and background distraction. Existing approaches often involve designing specific additional branches and performing full-parameter fine-tuning for fusion in each task. However, this paradigm not only duplicates research efforts and hardware costs but also risks model collapse with the limited multi-modal annotated data. In this paper, we propose a universal framework named X-Prompt for all multi-modal video object segmentation tasks, designated as RGB+X. The X-Prompt framework first pre-trains a video object segmentation foundation model using RGB data, and then utilizes the additional modality of the prompt to adapt it to downstream…
Taxonomy
Topics: Visual Attention and Saliency Detection · Advanced Image and Video Retrieval Techniques · Video Surveillance and Tracking Methods
Methods: Softmax · Attention Is All You Need · VOS
