GroPrompt: Efficient Grounded Prompting and Adaptation for Referring Video Object Segmentation
Ci-Siang Lin, I-Jieh Liu, Min-Hung Chen, Chien-Yi Wang, Sifei Liu, Yu-Chiang Frank Wang

TL;DR
This paper introduces GroPrompt, a framework that efficiently adapts foundation segmentation models for referring video object segmentation using weak supervision, achieving competitive results without dense mask annotations.
Contribution
It proposes a novel Grounded Prompting framework with Text-Aware Prompt Contrastive Learning to generate temporal-consistent, text-aware prompts from weak supervision for RVOS.
Findings
Achieves competitive performance on standard RVOS benchmarks.
Effectively generates temporal-consistent, text-aware prompts from box supervision.
Reduces reliance on dense mask annotations for training.
Abstract
Referring Video Object Segmentation (RVOS) aims to segment the object referred to by the query sentence throughout the entire video. Most existing methods require end-to-end training with dense mask annotations, which could be computation-consuming and less scalable. In this work, we aim to efficiently adapt foundation segmentation models for addressing RVOS from weak supervision with the proposed Grounded Prompting (GroPrompt) framework. More specifically, we propose Text-Aware Prompt Contrastive Learning (TAP-CL) to enhance the association between the position prompts and the referring sentences with only box supervisions, including Text-Contrastive Prompt Learning (TextCon) and Modality-Contrastive Prompt Learning (ModalCon) at frame level and video level, respectively. With the proposed TAP-CL, our GroPrompt framework can generate temporal-consistent yet text-aware position prompts…
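The abstract says TAP-CL strengthens the association between position prompts and referring sentences through contrastive learning, but it does not give the loss. As a rough illustration only, the sketch below uses a generic symmetric InfoNCE objective over matched (prompt, sentence) embedding pairs. It is not the authors' TextCon or ModalCon formulation. The function name `info_nce`, the embedding shapes, and the temperature value are all assumptions made for this example.

```python
import numpy as np


def info_nce(prompt_emb, text_emb, temperature=0.07):
    """Generic symmetric InfoNCE loss (illustrative, not the paper's loss).

    Assumes row i of prompt_emb is the positive match for row i of text_emb,
    and every other row in the batch acts as a negative.
    """
    # L2-normalize so the dot product is a cosine similarity.
    p = prompt_emb / np.linalg.norm(prompt_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # Pairwise similarity matrix, scaled by the (assumed) temperature.
    logits = p @ t.T / temperature

    def cross_entropy_diag(z):
        # Numerically stable log-softmax per row; positives lie on the diagonal.
        z = z - z.max(axis=1, keepdims=True)
        log_prob = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Average the prompt-to-text and text-to-prompt directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    text = rng.normal(size=(4, 8))                    # hypothetical sentence embeddings
    prompts = text + 0.01 * rng.normal(size=(4, 8))   # hypothetical prompt embeddings
    print("matched pairs:   ", info_nce(prompts, text))
    print("mismatched pairs:", info_nce(prompts, np.roll(text, 1, axis=0)))
```

In this toy setup the loss is lower when each prompt embedding is paired with its own sentence than when the pairing is shifted. That is the behaviour a text-aware prompt objective is meant to reward. How the paper builds positives and negatives at frame level and video level is not stated in the abstract and is not modelled here.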
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization · Advanced Data Compression Techniques
Methods: Contrastive Learning
