Long-RVOS: A Comprehensive Benchmark for Long-term Referring Video Object Segmentation
Tianming Liang, Haichao Jiang, Yuting Yang, Chaolei Tan, Shuai Li, Wei-Shi Zheng, Jian-Fang Hu

TL;DR
This paper introduces Long-RVOS, a large-scale benchmark dataset for long-term referring video object segmentation, highlighting the challenges of long videos and proposing a new baseline method, ReferMo, to improve performance.
Contribution
The paper presents Long-RVOS, a comprehensive long-duration video dataset with new evaluation metrics, and introduces ReferMo, a baseline method designed to capture long-term dependencies in RVOS.
Findings
Current methods perform poorly on long videos.
ReferMo significantly outperforms existing approaches in long-term scenarios.
Long-RVOS enables more realistic evaluation of RVOS models.
Abstract
Referring video object segmentation (RVOS), which aims to identify, track and segment the objects in a video based on language descriptions, has received great attention in recent years. However, existing datasets remain focused on short video clips lasting only several seconds, with salient objects visible in most frames. To advance the task towards more practical scenarios, we introduce Long-RVOS, a large-scale benchmark for long-term referring video object segmentation. Long-RVOS contains 2,000+ videos with an average duration exceeding 60 seconds, covering a variety of objects that undergo occlusion, disappearance-reappearance and shot changes. The objects are manually annotated with three different types of descriptions to individually evaluate the understanding of static attributes, motion patterns and spatiotemporal relationships. Moreover, unlike previous benchmarks that rely…
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Visual Attention and Saliency Detection
Methods: Softmax · Attention Is All You Need · Focus
