Long-RVOS: A Comprehensive Benchmark for Long-term Referring Video Object Segmentation

Tianming Liang; Haichao Jiang; Yuting Yang; Chaolei Tan; Shuai Li; Wei-Shi Zheng; Jian-Fang Hu

arXiv:2505.12702·cs.CV·October 29, 2025

TL;DR

This paper introduces Long-RVOS, a large-scale benchmark dataset for long-term referring video object segmentation, highlighting the challenges of long videos and proposing a new baseline method, ReferMo, to improve performance.

Contribution

The paper presents Long-RVOS, a comprehensive long-duration video dataset with new evaluation metrics, and introduces ReferMo, a baseline method that effectively captures long-term dependencies in RVOS.

Findings

01. Current methods perform poorly on long videos.

02. ReferMo significantly outperforms existing approaches in long-term scenarios.

03. Long-RVOS enables more realistic evaluation of RVOS models.

Abstract

Referring video object segmentation (RVOS), which aims to identify, track and segment the objects in a video based on language descriptions, has received great attention in recent years. However, existing datasets remain focused on short video clips lasting only several seconds, with salient objects visible in most frames. To advance the task towards more practical scenarios, we introduce Long-RVOS, a large-scale benchmark for long-term referring video object segmentation. Long-RVOS contains 2,000+ videos with an average duration exceeding 60 seconds, covering a variety of objects that undergo occlusion, disappearance-reappearance and shot changes. The objects are manually annotated with three different types of descriptions to individually evaluate the understanding of static attributes, motion patterns and spatiotemporal relationships. Moreover, unlike previous benchmarks that rely…

Code & Models

Datasets

iSEE-Laboratory/Long-RVOS
dataset· 134 dl

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Visual Attention and Saliency Detection

Methods: Softmax · Attention Is All You Need · Focus