Show Me When and Where: Towards Referring Video Object Segmentation in the Wild
Mingqi Gao, Jinyu Yang, Jingnan Luo, Xiantong Zhen, Jungong Han, Giovanni Montana, Feng Zheng

TL;DR
This paper introduces a new in-the-wild RVOS setting and dataset, YoURVOS, challenging methods to identify when and where objects appear in untrimmed videos, and proposes OMFormer as a baseline solution.
Contribution
The paper presents a new challenging RVOS benchmark dataset from untrimmed videos and a novel Object-level Multimodal TransFormer (OMFormer) model for improved localization.
Findings
Previous VOS methods perform poorly on YoURVOS, especially as the number of target-absent frames increases.
OMFormer consistently outperforms existing methods on the YoURVOS benchmark.
YoURVOS provides a more realistic and challenging testbed for RVOS research.
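To make the "when and where" requirement concrete, here is a minimal sketch of how a per-video score might penalize predictions on target-absent frames. It is an illustration only, not the paper's evaluation protocol: the `Mask` representation, the function names `frame_iou` and `video_score`, and the convention that an empty prediction on a target-absent frame scores 1.0 are all assumptions made for this example.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical mask representation: the set of (row, col) foreground pixels.
# An empty set means "target absent in this frame".
Mask = Set[Tuple[int, int]]


def frame_iou(pred: Mask, gt: Mask) -> float:
    """Spatial IoU for one frame ("where").

    Assumed convention: if both masks are empty, the target is absent
    and was correctly predicted absent, so the frame scores 1.0.
    """
    if not pred and not gt:
        return 1.0
    return len(pred & gt) / len(pred | gt)


def video_score(preds: List[Mask], gts: List[Mask]) -> Dict[str, float]:
    """Average spatial IoU plus a frame-level temporal IoU ("when")."""
    if not gts or len(preds) != len(gts):
        raise ValueError("preds and gts must be non-empty and equally long")

    ious = [frame_iou(p, g) for p, g in zip(preds, gts)]

    # Frames in which each side claims the target is visible.
    pred_present = {t for t, p in enumerate(preds) if p}
    gt_present = {t for t, g in enumerate(gts) if g}
    union = pred_present | gt_present
    temporal_iou = len(pred_present & gt_present) / len(union) if union else 1.0

    return {"mean_iou": sum(ious) / len(ious), "temporal_iou": temporal_iou}


# Toy 4-frame video: the target is visible only in frames 1 and 2.
target = {(0, 0), (0, 1)}
gts = [set(), target, target, set()]

always_on = [target] * 4  # segments something in every frame
when_aware = list(gts)    # stays empty when the target is absent

print(video_score(always_on, gts))
print(video_score(when_aware, gts))
```

In this toy case the always-on predictor is perfect on the two target-present frames yet loses half of both scores to the two target-absent frames. A benchmark built from trimmed videos, where the target appears in every frame, cannot expose this kind of failure.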
Abstract
Referring video object segmentation (RVOS) has recently gained great popularity in computer vision due to its widespread applications. The existing RVOS setting uses elaborately trimmed videos in which the text-referred object appears in every frame, and so fails to fully reflect the realistic challenges of the task. This simplified setting requires RVOS methods to predict only where objects are, with no need to show when they appear. In this work, we introduce a new setting towards in-the-wild RVOS. To this end, we collect a new benchmark dataset using YouTube Untrimmed videos for RVOS (YoURVOS), which contains 1,120 in-the-wild videos whose duration and number of scenes are 7 times larger than those of existing datasets. Our new benchmark challenges RVOS methods to show not only where but also when objects appear in videos. To set a baseline, we propose Object-level Multimodal TransFormers…
Taxonomy
Topics: Advanced Neural Network Applications · Multimodal Machine Learning Applications · Visual Attention and Saliency Detection
