Harnessing Vision-Language Pretrained Models with Temporal-Aware Adaptation for Referring Video Object Segmentation

Zikun Zhou; Wentao Xiong; Li Zhou; Xin Li; Zhenyu He; Yaowei Wang

arXiv:2405.10610 · cs.CV · September 24, 2024

TL;DR

This paper introduces VLP-RVOS, a framework that adapts vision-language pretrained (VLP) models to referring video object segmentation through temporal-aware adaptation, modeling dense text-video relations at the pixel level.

Contribution

It proposes temporal-aware prompt-tuning and a cube-frame attention mechanism to adapt VLP models, which are pretrained for static image/region-level prediction, to the dynamic pixel-level prediction that RVOS requires.

Findings

01

Outperforms state-of-the-art RVOS methods.

02

Exhibits strong generalization across datasets.

03

Effectively models temporal and spatial relations in videos.

Abstract

The crux of Referring Video Object Segmentation (RVOS) lies in modeling dense text-video relations to associate abstract linguistic concepts with dynamic visual contents at pixel-level. Current RVOS methods typically use vision and language models pretrained independently as backbones. As images and texts are mapped to uncoupled feature spaces, they face the arduous task of learning Vision-Language (VL) relation modeling from scratch. Witnessing the success of Vision-Language Pretrained (VLP) models, we propose to learn relation modeling for RVOS based on their aligned VL feature space. Nevertheless, transferring VLP models to RVOS is a deceptively challenging task due to the substantial gap between the pretraining task (static image/region-level prediction) and the RVOS task (dynamic pixel-level prediction). To address this transfer challenge, we introduce a framework named VLP-RVOS…

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques