MOSEv2: A More Challenging Dataset for Video Object Segmentation in Complex Scenes
Henghui Ding, Kaining Ying, Chang Liu, Shuting He, Xudong Jiang, Yu-Gang Jiang, Philip H.S. Torr, Song Bai

TL;DR
MOSEv2 is a new, more challenging video object segmentation dataset built from complex real-world scenarios. It exposes the limitations of current methods and points to directions for future research.
Contribution
We introduce MOSEv2, a significantly more difficult VOS dataset with diverse challenging scenarios, to better evaluate and advance real-world VOS methods.
Findings
Performance drops significantly on MOSEv2 compared to MOSEv1.
Current VOS methods struggle with complex scenes and adverse conditions.
Practical tricks can improve model performance on challenging data.
Abstract
Video object segmentation (VOS) aims to segment specified target objects throughout a video. Although state-of-the-art methods have achieved impressive performance (e.g., 90+% J&F) on benchmarks such as DAVIS and YouTube-VOS, these datasets primarily contain salient, dominant, and isolated objects, limiting their generalization to real-world scenarios. To bridge this gap, the coMplex video Object SEgmentation (MOSEv1) dataset was introduced to facilitate VOS research in complex scenes. Building on the foundations and insights of MOSEv1, we present MOSEv2, a significantly more challenging dataset designed to further advance VOS methods under real-world conditions. MOSEv2 consists of 5,024 videos and 701,976 high-quality masks for 10,074 objects across 200 categories. Compared to its predecessor, MOSEv2 introduces much greater scene complexity, including {more frequent object…
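The J&F score quoted in the abstract is conventionally the average of region similarity J (the Jaccard index, or intersection over union, between predicted and ground-truth masks) and boundary accuracy F. The following is a minimal sketch of J and of the final averaging step only, assuming binary masks stored as nested lists of 0/1. The function names `region_similarity` and `j_and_f` are hypothetical, the boundary measure F is not computed here, and this is not the official evaluation code of MOSEv2 or any other benchmark.

```python
def region_similarity(pred, gt):
    """Jaccard index J between two binary masks (nested lists of 0/1)."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0  # pixel is in both masks
            union += 1 if (p or g) else 0   # pixel is in at least one mask
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def j_and_f(j_mean, f_mean):
    """Average a mean J score and a mean F score into a single J&F value."""
    return (j_mean + f_mean) / 2.0


# Toy 2x3 masks: 2 pixels overlap, 4 pixels are covered by either mask.
pred = [[1, 1, 0],
        [0, 1, 0]]
gt = [[1, 0, 0],
      [0, 1, 1]]
print(region_similarity(pred, gt))  # 0.5
```

In a full evaluation, J and F would each be averaged over all annotated objects and frames before being combined.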
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Advanced Neural Network Applications · Visual Attention and Saliency Detection
