MOSEv2: A More Challenging Dataset for Video Object Segmentation in Complex Scenes

Henghui Ding; Kaining Ying; Chang Liu; Shuting He; Xudong Jiang; Yu-Gang Jiang; Philip H.S. Torr; Song Bai

arXiv:2508.05630 · cs.CV · September 23, 2025

TL;DR

MOSEv2 is a new, more challenging video object segmentation dataset that includes complex real-world scenarios, revealing current methods' limitations and guiding future research improvements.

Contribution

We introduce MOSEv2, a significantly more difficult VOS dataset with diverse challenging scenarios, to better evaluate and advance real-world VOS methods.

Findings

01 Performance drops significantly on MOSEv2 compared to MOSEv1.

02 Current VOS methods struggle with complex scenes and adverse conditions.

03 Practical tricks can improve model performance on challenging data.

Abstract

Video object segmentation (VOS) aims to segment specified target objects throughout a video. Although state-of-the-art methods have achieved impressive performance (e.g., 90+% J&F) on benchmarks such as DAVIS and YouTube-VOS, these datasets primarily contain salient, dominant, and isolated objects, limiting their generalization to real-world scenarios. To bridge this gap, the coMplex video Object SEgmentation (MOSEv1) dataset was introduced to facilitate VOS research in complex scenes. Building on the foundations and insights of MOSEv1, we present MOSEv2, a significantly more challenging dataset designed to further advance VOS methods under real-world conditions. MOSEv2 consists of 5,024 videos and 701,976 high-quality masks for 10,074 objects across 200 categories. Compared to its predecessor, MOSEv2 introduces much greater scene complexity, including more frequent object…
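
The J&F score cited in the abstract is conventionally the mean of a region similarity J (the Jaccard index, or intersection over union, between predicted and ground-truth masks) and a boundary accuracy F. The following is a minimal sketch of the J half only, not the benchmark's official evaluation code. It assumes masks are nested lists of 0/1 values, and the function names `region_similarity` and `mean_region_similarity` are hypothetical, chosen for illustration.

```python
def region_similarity(pred, gt):
    """Jaccard index (IoU) between two binary masks given as nested lists of 0/1."""
    intersection = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            intersection += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Assumption: two empty masks are treated as a perfect match.
    return 1.0 if union == 0 else intersection / union


def mean_region_similarity(pred_frames, gt_frames):
    """Average the per-frame Jaccard index over one object's mask sequence."""
    scores = [region_similarity(p, g) for p, g in zip(pred_frames, gt_frames)]
    return sum(scores) / len(scores)
```

For example, a prediction covering two pixels against a ground truth covering one of them shares one pixel out of a union of two, so `region_similarity([[1, 1], [0, 0]], [[1, 0], [0, 0]])` evaluates to 0.5. A full J&F score would additionally require the boundary measure F, which this sketch omits.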

Code & Models

Datasets

FudanCVL/MOSEv2
dataset · 620 dl

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Advanced Neural Network Applications · Visual Attention and Saliency Detection