The Instance-centric Transformer for the RVOS Track of LSVOS Challenge: 3rd Place Solution

Bin Cao; Yisi Zhang; Hanyi Wang; Xingjian He; Jing Liu

arXiv:2408.10541·cs.CV·August 21, 2024

Open Access

TL;DR

This paper presents an instance-centric approach to referring video object segmentation (RVOS) that combines a DETR-based model, SAM refinement, and an instance retrieval model, reaching 60.36 J&F on the test set and third place in the RVOS Track of the 6th LSVOS Challenge.

Contribution

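It introduces two instance-centric models and fuses their frame-level and instance-level predictions: a DETR-based model that uses instance masks for query initialization (temporal enhancement) with SAM for spatial refinement, and an instance retrieval model that classifies whether each instance mask is the referred object.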

Findings

01 Achieved 52.67 J&F on the validation set and 60.36 J&F on the test set

02 Secured 3rd place in the RVOS Track of the 6th LSVOS Challenge

03 Fused frame-level and instance-level predictions for RVOS

Abstract

Referring Video Object Segmentation (RVOS) is an emerging multi-modal task that aims to segment objects in a video given a natural language expression. In this work, we build two instance-centric models and fuse their predictions at the frame level and the instance level. First, we introduce instance masks into a DETR-based model for query initialization to achieve temporal enhancement, and we employ SAM for spatial refinement. Second, we build an instance retrieval model that performs binary classification of each instance mask, deciding whether the instance is the referred one. Finally, we fuse the predicted results. Our method achieves a score of 52.67 J&F in the validation phase and 60.36 J&F in the test phase, securing 3rd place in the RVOS Track of the 6th LSVOS Challenge.
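For readers unfamiliar with the J&F score reported above: in the standard video object segmentation benchmarks it is the average of region similarity J (the Jaccard index, or intersection over union, between predicted and ground-truth masks) and contour accuracy F (a boundary F-measure). The sketch below is illustrative only and is not taken from the paper or the challenge toolkit. The function names `region_similarity` and `j_and_f` are hypothetical, masks are assumed to be nested lists of 0/1 values, and the boundary measure F is assumed to be computed elsewhere.

```python
def region_similarity(pred, gt):
    """Jaccard index (IoU) between two binary masks.

    Both masks are assumed to be same-shaped nested lists of 0/1 values.
    """
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if p and g:
                inter += 1
            if p or g:
                union += 1
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def j_and_f(j_mean, f_mean):
    """Combine mean region similarity J and mean contour accuracy F.

    F (the boundary F-measure) is not implemented in this sketch; it is
    assumed to be supplied by a separate evaluation routine.
    """
    return (j_mean + f_mean) / 2.0


# Toy example: the prediction covers two pixels, the ground truth one of them.
pred_mask = [[1, 1], [0, 0]]
gt_mask = [[1, 0], [0, 0]]
j = region_similarity(pred_mask, gt_mask)  # intersection 1, union 2
```

Scores such as 52.67 and 60.36 are these values averaged over all evaluated objects and frames, expressed as percentages.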

Taxonomy

Topics: Optical Systems and Laser Technology · Advanced Fiber Optic Sensors · Geophysics and Sensor Technology

Methods: Segment Anything Model