GroPrompt: Efficient Grounded Prompting and Adaptation for Referring Video Object Segmentation

Ci-Siang Lin; I-Jieh Liu; Min-Hung Chen; Chien-Yi Wang; Sifei Liu; Yu-Chiang Frank Wang

arXiv:2406.12834·cs.CV·June 25, 2024

TL;DR

This paper introduces GroPrompt, a framework that efficiently adapts foundation segmentation models for referring video object segmentation using weak supervision, achieving competitive results without dense mask annotations.

Contribution

The paper proposes the Grounded Prompting (GroPrompt) framework, which uses Text-Aware Prompt Contrastive Learning (TAP-CL) to generate temporally consistent, text-aware position prompts for RVOS from weak (box-level) supervision.

Findings

01 Achieves competitive performance on standard RVOS benchmarks.

02 Effectively generates temporally consistent, text-aware prompts from box supervision.

03 Reduces reliance on dense mask annotations for training.

Abstract

Referring Video Object Segmentation (RVOS) aims to segment the object referred to by the query sentence throughout the entire video. Most existing methods require end-to-end training with dense mask annotations, which could be computation-consuming and less scalable. In this work, we aim to efficiently adapt foundation segmentation models for addressing RVOS from weak supervision with the proposed Grounded Prompting (GroPrompt) framework. More specifically, we propose Text-Aware Prompt Contrastive Learning (TAP-CL) to enhance the association between the position prompts and the referring sentences with only box supervisions, including Text-Contrastive Prompt Learning (TextCon) and Modality-Contrastive Prompt Learning (ModalCon) at frame level and video level, respectively. With the proposed TAP-CL, our GroPrompt framework can generate temporal-consistent yet text-aware position prompts…
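The abstract names a contrastive objective that ties position prompts to referring sentences but, as excerpted here, does not give its form. As a rough illustration only, the sketch below shows a generic InfoNCE-style loss in which row i of a prompt-embedding matrix is pulled toward row i of a sentence-embedding matrix and pushed away from the other rows. This is not the authors' TAP-CL, TextCon, or ModalCon formulation; the function name `info_nce`, the argument names, and the temperature value are all assumptions made for the example.

```python
import numpy as np


def info_nce(prompt_emb, text_emb, temperature=0.07):
    """Generic InfoNCE loss (hypothetical helper, not from the paper).

    prompt_emb, text_emb: arrays of shape (N, D); row i of each is treated
    as a matching (positive) pair, all other rows as negatives.
    """
    # L2-normalize so the dot product below is a cosine similarity.
    p = prompt_emb / np.linalg.norm(prompt_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # Pairwise similarities, scaled by the (assumed) temperature.
    logits = p @ t.T / temperature  # shape (N, N)

    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Positives sit on the diagonal; average their negative log-probability.
    return float(-np.mean(np.diag(log_prob)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    text = rng.normal(size=(4, 8))
    # Prompts identical to their sentences should score better than
    # prompts paired with the wrong sentences.
    print("aligned:   ", info_nce(text, text))
    print("misaligned:", info_nce(text[::-1], text))
```

Under this toy setup, embeddings paired with their own sentence yield a lower loss than embeddings paired in reversed order, which is the behavior a text-aware prompt objective would be expected to reward. How the paper applies such a loss at frame level versus video level is not specified in this excerpt.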

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization · Advanced Data Compression Techniques

Methods: Contrastive Learning