3rd Place Solution for MeViS Track in CVPR 2024 PVUW workshop: Motion Expression guided Video Segmentation

Feiyu Pan; Hao Fang; Xiankai Lu

arXiv:2406.04842 · cs.CV · June 10, 2024

TL;DR

This paper presents an approach to referring video object segmentation (RVOS) that uses a frozen pre-trained vision-language model as the backbone and adds cross-modal feature fusion. It scores 51.5 J&F on the MeViS test set and places 3rd in the MeViS track of the CVPR 2024 PVUW workshop.

Contribution

The work uses a frozen CLIP backbone to produce aligned vision and text features and introduces a new video query initialization method to improve RVOS performance.

Findings

01  Achieved a J&F score of 51.5 on the MeViS test set

02  Ranked 3rd in the MeViS track of the CVPR 2024 PVUW workshop

03  Enhanced cross-modal feature fusion improves segmentation quality
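
For readers unfamiliar with the metric: J&F is conventionally the mean of region similarity J (the Jaccard index, or IoU, between predicted and ground-truth masks) and contour accuracy F (a boundary F-measure). The sketch below illustrates the J term and the final averaging on a toy mask pair. The function names, the empty-mask convention, and the sample F value are assumptions for illustration; they are not taken from the paper or from the MeViS evaluation code, and the boundary-matching step that produces F is omitted.

```python
def region_similarity(pred, gt):
    """Jaccard index (IoU) between two binary masks given as nested lists of 0/1."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def j_and_f(j_score, f_score):
    """J&F as the arithmetic mean of region similarity J and contour accuracy F."""
    return (j_score + f_score) / 2.0


# Toy 2x3 masks (hypothetical data, not from MeViS).
pred = [[1, 1, 0],
        [0, 1, 0]]
gt = [[1, 0, 0],
      [0, 1, 1]]

j = region_similarity(pred, gt)   # intersection = 2 pixels, union = 4 pixels
f = 0.7                           # placeholder F value; computing F needs boundary matching
score = j_and_f(j, f)
```

Benchmarks usually report these values on a 0 to 100 scale, so the 51.5 above would correspond to 0.515 in the 0 to 1 range used in this sketch.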

Abstract

Referring video object segmentation (RVOS) relies on natural language expressions to segment target objects in video, emphasizing the modeling of dense text-video relations. Current RVOS methods typically use independently pre-trained vision and language models as backbones, resulting in a significant domain gap between video and text. In cross-modal feature interaction, text features are used only for query initialization, so important information in the text is not fully utilized. In this work, we propose using frozen pre-trained vision-language models (VLM) as backbones, with a specific emphasis on enhancing cross-modal feature interaction. Firstly, we use a frozen convolutional CLIP backbone to generate feature-aligned vision and text features, alleviating the domain gap and reducing training costs. Secondly, we add more cross-modal feature fusion in the pipeline to enhance the…
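
To make the idea of cross-modal feature fusion concrete, here is a minimal sketch of one generic form it can take: each vision token attends over the text tokens and adds the resulting text context back as a residual. This is plain scaled dot-product cross-attention with no learned projections, written under the assumption that both token sets are precomputed by a frozen backbone and share one embedding dimension. It is not the authors' fusion module, and `softmax` and `cross_attend` are hypothetical helper names.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(vision_tokens, text_tokens):
    """Fuse text context into each vision token (residual cross-attention).

    Both inputs are lists of equal-length float vectors, assumed to come
    from a frozen backbone, so nothing here is trained or updated.
    """
    dim = len(text_tokens[0])
    fused = []
    for v in vision_tokens:
        # Similarity of this vision token to every text token.
        scores = [sum(a * b for a, b in zip(v, t)) / math.sqrt(dim)
                  for t in text_tokens]
        weights = softmax(scores)
        # Attention-weighted average of the text tokens.
        context = [sum(w * t[k] for w, t in zip(weights, text_tokens))
                   for k in range(dim)]
        # Residual connection keeps the original visual feature.
        fused.append([vk + ck for vk, ck in zip(v, context)])
    return fused


# Toy features (hypothetical): 2 vision tokens, 3 text tokens, dimension 2.
vision = [[1.0, 0.0], [0.0, 1.0]]
text = [[0.5, 0.5], [1.0, 0.0], [0.0, 1.0]]
fused = cross_attend(vision, text)
```

The output keeps the shape of the vision input, so a step like this can be repeated at several points in a pipeline without changing downstream interfaces.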

Taxonomy

Topics: Industrial Vision Systems and Defect Detection

Methods: Sparse Evolutionary Training · Contrastive Language-Image Pre-training