UNINEXT-Cutie: The 1st Solution for LSVOS Challenge RVOS Track

Hao Fang; Feiyu Pan; Xiankai Lu; Wei Zhang; Runmin Cong

arXiv:2408.10129·cs.CV·August 27, 2024

TL;DR

This paper presents UNINEXT-Cutie, a pipeline that combines RVOS and VOS models with semi-supervised learning for referring video object segmentation on the challenging MeViS benchmark.

Contribution

It introduces a simple, effective approach that integrates RVOS and VOS models, leveraging high-quality key frames and semi-supervised learning for improved performance.

Findings

1. Achieved 62.57 J&F on MeViS test set
2. Ranked 1st in the 6th LSVOS Challenge RVOS Track
3. Demonstrated effectiveness of combining RVOS and VOS models
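
For readers unfamiliar with the J&F figure quoted in the findings: in video object segmentation it is conventionally the average of region similarity J (mask intersection over union) and contour accuracy F (a boundary F-measure). The sketch below is a minimal illustration of that convention, not the challenge's evaluation code. The function names are hypothetical, masks are assumed to be nested lists of 0/1 values, and F is taken as a precomputed input because boundary matching is outside the scope of this sketch.

```python
def region_similarity(pred, gt):
    """Region similarity J: intersection over union of two binary masks.

    Masks are nested lists of 0/1 values with identical shapes (an assumption
    of this sketch). Two empty masks are treated as a perfect match.
    """
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if p and g:
                inter += 1
            if p or g:
                union += 1
    return 1.0 if union == 0 else inter / union


def j_and_f(j_scores, f_scores):
    """J&F: average of the mean J score and the mean F score."""
    mean_j = sum(j_scores) / len(j_scores)
    mean_f = sum(f_scores) / len(f_scores)
    return 0.5 * (mean_j + mean_f)


# One overlapping pixel out of two pixels in the union gives J = 0.5.
j = region_similarity([[1, 1], [0, 0]], [[1, 0], [0, 0]])
```

Scores such as 62.57 are this quantity expressed as a percentage.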

Abstract

Referring video object segmentation (RVOS) relies on natural language expressions to segment target objects in video. This year, the LSVOS Challenge RVOS Track replaced the original YouTube-RVOS benchmark with MeViS. MeViS refers to the target object in a video through descriptions of its motion rather than its static attributes, which makes the RVOS task more challenging. In this work, we integrate the strengths of leading RVOS and VOS models to build a simple and effective pipeline for RVOS. First, we fine-tune a state-of-the-art RVOS model to obtain mask sequences that are correlated with the language descriptions. Second, starting from reliable, high-quality key frames, we use a VOS model to improve the quality and temporal consistency of the mask results. Finally, we further improve the performance of the RVOS model using semi-supervised learning. Our solution achieved 62.57…
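
The second stage described in the abstract, refining RVOS masks by propagating from a key frame with a VOS model, can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the abstract does not say how key frames are chosen or in which direction masks are propagated, so picking the frame with the highest RVOS confidence and propagating both forward and backward are assumptions made here. `select_key_frame`, `refine_with_vos`, and the `propagate` callable (standing in for a VOS model such as Cutie) are hypothetical names.

```python
from typing import Any, Callable, List, Sequence


def select_key_frame(scores: Sequence[float]) -> int:
    """Return the index of the frame with the highest RVOS confidence.

    Assumption: the RVOS model exposes one confidence score per frame.
    """
    return max(range(len(scores)), key=lambda i: scores[i])


def refine_with_vos(
    frames: List[Any],
    rvos_masks: List[Any],
    scores: Sequence[float],
    propagate: Callable[[List[Any], Any], List[Any]],
) -> List[Any]:
    """Re-segment a clip by propagating the key-frame mask with a VOS model.

    `propagate(clip, init_mask)` is a hypothetical stand-in for a VOS model:
    it must return one mask per frame in `clip`, where `init_mask` belongs
    to the first frame of `clip`.
    """
    k = select_key_frame(scores)
    key_mask = rvos_masks[k]

    # Forward pass: key frame to the end of the video.
    forward = propagate(frames[k:], key_mask)

    # Backward pass: key frame back to the first frame (reversed order).
    backward = propagate(frames[k::-1], key_mask)

    # Restore chronological order and drop the duplicated key-frame mask.
    return backward[:0:-1] + forward


# Usage with a dummy propagator that simply copies the initial mask.
dummy = lambda clip, mask: [(frame, mask) for frame in clip]
refined = refine_with_vos(
    ["f0", "f1", "f2", "f3"], ["m0", "m1", "m2", "m3"], [0.1, 0.9, 0.3, 0.2], dummy
)
```

The returned list has one mask per input frame, all derived from the key-frame mask rather than from the per-frame RVOS predictions, which is one way such a stage could improve temporal consistency.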

Taxonomy

Topics: Industrial Vision Systems and Defect Detection · Retinal Imaging and Analysis

Methods: Sparse Evolutionary Training · VOS