The 2nd Solution for LSVOS Challenge RVOS Track: Spatial-temporal Refinement for Consistent Semantic Segmentation

Tuyen Tran

arXiv:2408.12447·cs.CV·August 23, 2024

TL;DR

This paper presents a method that uses the tracking capabilities of SAM-v2 to improve temporal consistency in referring video object segmentation. It scored 60.40 J&F on the MeViS test set and placed 2nd in the RVOS Track of the ECCV 2024 LSVOS Challenge.

Contribution

It introduces a novel approach combining SAM-v2 with existing models to enhance temporal consistency in RVOS tasks.

Findings

01

Achieved a score of 60.40 J&F on the MeViS test set

02

Placed 2nd in the ECCV 2024 LSVOS Challenge

03

Demonstrated improved temporal segmentation consistency

Abstract

Referring Video Object Segmentation (RVOS) is a challenging task because it requires temporal understanding. Because of computational cost, many state-of-the-art models are trained on short time intervals. At test time these models process information effectively over short time steps, but they struggle to maintain consistent perception over long sequences, which leads to inconsistencies in the resulting segmentation masks. To address this challenge, we leverage the tracking capabilities of the newly introduced Segment Anything Model version 2 (SAM-v2) to enhance the temporal consistency of the referring object segmentation model. Our method achieved a score of 60.40 $\mathcal{J}\&\mathcal{F}$ on the test set of the MeViS dataset, placing 2nd in the final ranking of the RVOS Track at the ECCV 2024…
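For readers unfamiliar with the reported metric: in video object segmentation benchmarks, J&F is conventionally the average of region similarity J (the Jaccard index, or intersection over union, between predicted and ground-truth masks) and contour accuracy F (a boundary F-measure). The sketch below is only an illustration of that averaging, not the challenge's evaluation code. The mask representation (nested lists of 0/1), the function names `region_similarity` and `j_and_f`, and the assumption that per-frame F scores are already computed elsewhere are all hypothetical choices made for this example.

```python
from typing import Sequence

# Hypothetical representation: a binary mask as rows of 0/1 integers.
Mask = Sequence[Sequence[int]]


def region_similarity(pred: Mask, gt: Mask) -> float:
    """Jaccard index J = |pred AND gt| / |pred OR gt| for a single frame."""
    intersection = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if p and g:
                intersection += 1
            if p or g:
                union += 1
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else intersection / union


def j_and_f(j_scores: Sequence[float], f_scores: Sequence[float]) -> float:
    """Average of mean J and mean F over frames.

    F (contour accuracy) is assumed to be computed by a separate
    boundary-matching routine that is not shown here.
    """
    mean_j = sum(j_scores) / len(j_scores)
    mean_f = sum(f_scores) / len(f_scores)
    return (mean_j + mean_f) / 2.0


if __name__ == "__main__":
    pred = [[1, 1], [0, 0]]
    gt = [[1, 0], [0, 0]]
    print(region_similarity(pred, gt))
```

A score such as 60.40 corresponds to this average expressed as a percentage. Because J is computed per frame, masks that drift or flicker over a long sequence lower the mean directly, which is why temporal consistency matters for this metric.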

Taxonomy

Topics: Automated Road and Building Extraction · Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques

Methods: Sparse Evolutionary Training