The Instance-centric Transformer for the RVOS Track of LSVOS Challenge: 3rd Place Solution
Bin Cao, Yisi Zhang, Hanyi Wang, Xingjian He, Jing Liu

TL;DR
This paper presents an instance-centric approach that combines DETR-based models with SAM for referring video object segmentation, placing third in the 6th LSVOS Challenge RVOS Track with 60.36 J&F in the test phase.
Contribution
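It introduces two instance-centric models and fuses their frame-level and instance-level predictions: a DETR-based model that uses instance masks for query initialization (temporal enhancement) with SAM for spatial refinement, and an instance retrieval model that classifies whether each instance mask is the referred one.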
Findings
Achieved 52.67 J&F in the validation phase and 60.36 J&F in the test phase
Secured 3rd place in the 6th LSVOS Challenge RVOS Track
Demonstrated effective fusion of models for RVOS
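For readers unfamiliar with the metric, J&F is commonly reported as the mean of region similarity J (mask intersection-over-union) and contour accuracy F (a boundary F-measure). The sketch below computes J for one pair of binary masks and averages it with an F value supplied by the caller. The function names are illustrative, the boundary F-measure itself is not implemented, and the empty-mask convention is an assumption; this is not the challenge's official evaluation code.

```python
import numpy as np

def region_similarity(pred, gt):
    """J: intersection-over-union of two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        # Both masks empty: treated as perfect agreement (assumed convention).
        return 1.0
    return float(np.logical_and(pred, gt).sum() / union)

def j_and_f(j, f):
    """Mean of region similarity J and contour accuracy F (F computed elsewhere)."""
    return 0.5 * (j + f)

pred = [[1, 1], [0, 0]]
gt = [[1, 0], [0, 0]]
j = region_similarity(pred, gt)  # one shared pixel out of two in the union
score = j_and_f(j, 0.7)          # 0.7 is a placeholder F value, not a result from the paper
```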
Abstract
Referring Video Object Segmentation (RVOS) is an emerging multi-modal task that aims to segment objects in a video given a natural language expression. In this work, we build two instance-centric models and fuse their predictions at the frame level and the instance level. First, we introduce instance masks into a DETR-based model for query initialization to achieve temporal enhancement, and employ SAM for spatial refinement. Second, we build an instance retrieval model that performs binary classification on instance masks, deciding whether each instance is the referred one. Finally, we fuse the predicted results. Our method achieved 52.67 J&F in the validation phase and 60.36 J&F in the test phase, securing 3rd place in the 6th LSVOS Challenge RVOS Track.
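The abstract does not spell out the fusion rule, so the following is only a minimal sketch of one plausible way to combine a frame-level soft mask with an instance-level soft mask. It assumes the instance branch is gated by the retrieval model's "is referred" score and that the two branches are blended with a fixed weight before thresholding. The names fuse_masks, referred_score, weight, and threshold are hypothetical and do not come from the paper.

```python
import numpy as np

def fuse_masks(frame_probs, instance_probs, referred_score,
               weight=0.5, threshold=0.5):
    """Blend two per-pixel probability maps into one binary mask.

    frame_probs:    soft mask from the frame-level (DETR-style) branch
    instance_probs: soft mask of a candidate instance
    referred_score: classifier confidence in [0, 1] that the instance is
                    the one named by the expression (assumed gating)
    """
    frame_probs = np.asarray(frame_probs, dtype=float)
    instance_probs = np.asarray(instance_probs, dtype=float)
    # Down-weight the instance mask when the classifier doubts it is referred.
    gated = referred_score * instance_probs
    # Weighted average of the two branches (assumed rule, not the authors').
    fused = weight * frame_probs + (1.0 - weight) * gated
    return fused >= threshold

frame = [[0.9, 0.2]]
instance = [[0.8, 0.9]]
mask = fuse_masks(frame, instance, referred_score=0.5)
```

With a low referred_score the result leans on the frame-level branch alone; with a score near 1 the instance mask can recover pixels the frame-level branch missed.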
Taxonomy
Topics: Optical Systems and Laser Technology · Advanced Fiber Optic Sensors · Geophysics and Sensor Technology
Methods: Segment Anything Model
