Show Me When and Where: Towards Referring Video Object Segmentation in the Wild

Mingqi Gao; Jinyu Yang; Jingnan Luo; Xiantong Zhen; Jungong Han; Giovanni Montana; Feng Zheng

arXiv:2603.14300 · cs.CV · March 17, 2026

Open Access

TL;DR

This paper introduces a new in-the-wild RVOS setting and dataset, YoURVOS, challenging methods to identify when and where objects appear in untrimmed videos, and proposes OMFormer as a baseline solution.

Contribution

The paper presents a new challenging RVOS benchmark dataset from untrimmed videos and a novel Object-level Multimodal TransFormer (OMFormer) model for improved localization.

Findings

01. Previous VOS methods perform poorly on YoURVOS, especially on videos with more target-absent frames.

02. OMFormer consistently outperforms existing methods on the YoURVOS benchmark.

03. YoURVOS provides a more realistic and challenging testbed for RVOS research.

Abstract

Referring video object segmentation (RVOS) has recently gained great popularity in computer vision due to its widespread applications. The existing RVOS setting contains elaborately trimmed videos in which the text-referred objects appear in every frame, and therefore fails to fully reflect the realistic challenges of this task. This simplified setting requires RVOS methods to predict only where objects are, with no need to show when the objects appear. In this work, we introduce a new setting towards in-the-wild RVOS. To this end, we collect a new benchmark dataset using YouTube Untrimmed videos for RVOS (YoURVOS), which contains 1,120 in-the-wild videos with 7 times more duration and scenes than existing datasets. Our new benchmark challenges RVOS methods to show not only where but also when objects appear in videos. To set a baseline, we propose Object-level Multimodal TransFormers…

Taxonomy

Topics: Advanced Neural Network Applications · Multimodal Machine Learning Applications · Visual Attention and Saliency Detection