3rd Place Solution for MeViS Track in CVPR 2024 PVUW workshop: Motion Expression guided Video Segmentation
Feiyu Pan, Hao Fang, Xiankai Lu

TL;DR
This paper presents an approach to referring video object segmentation (RVOS) that uses a frozen pre-trained vision-language model as the backbone and adds more cross-modal feature fusion to the pipeline. It placed 3rd in the MeViS track of the CVPR 2024 PVUW workshop.
Contribution
The work uses a frozen convolutional CLIP backbone to produce feature-aligned vision and text features, and introduces a new video query initialization method to improve RVOS performance.
Findings
Achieved a J&F score of 51.5 on the MeViS test set
Ranked 3rd in the MeViS track of the CVPR 2024 PVUW workshop
Enhanced cross-modal feature fusion improves segmentation quality
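For readers unfamiliar with the metric, J&F is the mean of region similarity J (mask intersection-over-union) and contour accuracy F (a boundary F-measure). The sketch below is a minimal illustration, not the official evaluation code: it computes J for a single pair of binary masks and averages it with an F value that is assumed to be computed elsewhere. The names `region_similarity` and `j_and_f` are hypothetical, and treating two empty masks as a perfect match is an assumed convention.

```python
from typing import List

Mask = List[List[int]]  # binary mask as nested lists, 1 = object pixel


def region_similarity(pred: Mask, gt: Mask) -> float:
    """J: intersection-over-union between two binary masks of equal size."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if p and g:
                inter += 1
            if p or g:
                union += 1
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def j_and_f(j: float, f: float) -> float:
    """J&F: the arithmetic mean of region similarity and contour accuracy."""
    return (j + f) / 2.0


if __name__ == "__main__":
    pred = [[1, 1], [0, 0]]
    gt = [[1, 0], [1, 0]]
    j = region_similarity(pred, gt)  # 1 shared pixel out of 3 covered pixels
    # The F value here is a made-up placeholder, not a result from the paper.
    print(j, j_and_f(j, 0.5))
```

In a benchmark setting these per-frame values are averaged over frames and video-expression pairs; that aggregation is omitted here.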
Abstract
Referring video object segmentation (RVOS) relies on natural language expressions to segment target objects in video, emphasizing the modeling of dense text-video relations. Current RVOS methods typically use independently pre-trained vision and language models as backbones, resulting in a significant domain gap between video and text. In cross-modal feature interaction, text features are used only for query initialization, so important information in the text is not fully utilized. In this work, we propose using frozen pre-trained vision-language models (VLMs) as backbones, with a specific emphasis on enhancing cross-modal feature interaction. First, we use a frozen convolutional CLIP backbone to generate feature-aligned vision and text features, alleviating the domain gap and reducing training costs. Second, we add more cross-modal feature fusion in the pipeline to enhance the…
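The abstract is truncated here and does not say how the additional fusion is built, so the following is only a generic illustration of one common way to fuse modalities: scaled dot-product cross-attention, in which each visual feature attends over the text token features and the attended text is added back as a residual. It is not the authors' module. The function name `cross_attend` is hypothetical, there are no learned projections, and plain Python lists stand in for tensors to keep the sketch dependency-free.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def cross_attend(visual: List[Vector], text: List[Vector]) -> List[Vector]:
    """Fuse text into visual features with scaled dot-product attention.

    Assumption: both modalities already share one embedding dimension,
    which is the situation an aligned vision-language backbone provides.
    """
    dim = len(visual[0])
    scale = 1.0 / math.sqrt(dim)
    fused = []
    for v in visual:
        # Attention weights of this visual feature over all text tokens.
        weights = softmax([dot(v, t) * scale for t in text])
        # Weighted sum of text features, one value per channel.
        attended = [
            sum(w * t[i] for w, t in zip(weights, text)) for i in range(dim)
        ]
        # Residual connection keeps the original visual information.
        fused.append([vi + ai for vi, ai in zip(v, attended)])
    return fused


if __name__ == "__main__":
    visual_feats = [[1.0, 0.0], [0.0, 1.0]]  # two toy visual features
    text_feats = [[1.0, 0.0], [1.0, 0.0]]    # two identical toy text tokens
    print(cross_attend(visual_feats, text_feats))
```

A practical model would add learned query, key, and value projections, multiple heads, and normalization; those are left out so the attention step itself stays visible.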
Taxonomy
Topics: Industrial Vision Systems and Defect Detection
Methods: Sparse Evolutionary Training · Contrastive Language-Image Pre-training
