MOSE: A New Dataset for Video Object Segmentation in Complex Scenes
Henghui Ding, Chang Liu, Shuting He, Xudong Jiang, Philip H.S. Torr, Song Bai

TL;DR
The paper introduces MOSE, a challenging new video object segmentation dataset built around complex scenes, and benchmarks existing methods on it, revealing significant performance gaps in such environments.
Contribution
It presents MOSE, a large-scale dataset with complex scenes for VOS, and evaluates 18 methods, highlighting the need for improved algorithms in real-world scenarios.
Findings
Current VOS methods reach at most 59.4% J&F on MOSE, far below the 90+% J&F they achieve on existing datasets.
Existing algorithms struggle with occlusion and crowded scenes in MOSE.
The gap between results on MOSE and on existing datasets indicates that VOS in complex environments remains an open problem.
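The J&F score quoted in these findings averages two per-mask measures: region similarity J (the intersection-over-union of predicted and ground-truth masks) and boundary accuracy F (an F-measure over mask contours). The sketch below illustrates the idea in Python with NumPy. The function names are hypothetical, and the boundary score is a deliberate simplification that counts only exact pixel matches, whereas standard VOS evaluation toolkits match contours within a small distance tolerance. It is not the MOSE evaluation code.

```python
import numpy as np


def region_similarity(pred: np.ndarray, gt: np.ndarray) -> float:
    """J: intersection-over-union of two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # assumption: two empty masks count as a perfect match
    return float(np.logical_and(pred, gt).sum() / union)


def boundary(mask: np.ndarray) -> np.ndarray:
    """Foreground pixels with at least one background 4-neighbour."""
    mask = mask.astype(bool)
    padded = np.pad(mask, 1, constant_values=False)
    # A pixel is interior when its up, down, left and right neighbours
    # are all foreground.
    interior = (
        padded[:-2, 1:-1] & padded[2:, 1:-1]
        & padded[1:-1, :-2] & padded[1:-1, 2:]
    )
    return mask & ~interior


def boundary_f(pred: np.ndarray, gt: np.ndarray) -> float:
    """Simplified F: boundary precision/recall with zero pixel tolerance."""
    bp, bg = boundary(pred), boundary(gt)
    n_pred, n_gt = bp.sum(), bg.sum()
    if n_pred == 0 and n_gt == 0:
        return 1.0  # assumption: no boundary on either side is a match
    matched = np.logical_and(bp, bg).sum()
    precision = matched / n_pred if n_pred else 0.0
    recall = matched / n_gt if n_gt else 0.0
    if precision + recall == 0:
        return 0.0
    return float(2 * precision * recall / (precision + recall))


def j_and_f(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean of region similarity J and boundary accuracy F for one mask."""
    return 0.5 * (region_similarity(pred, gt) + boundary_f(pred, gt))
```

A dataset-level score would average these per-mask values over all annotated objects and frames. The exact averaging order is a detail of the evaluation protocol and is not shown here.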
Abstract
Video object segmentation (VOS) aims at segmenting a particular object throughout the entire video clip sequence. The state-of-the-art VOS methods have achieved excellent performance (e.g., 90+% J&F) on existing datasets. However, since the target objects in these existing datasets are usually relatively salient, dominant, and isolated, VOS under complex scenes has rarely been studied. To revisit VOS and make it more applicable in the real world, we collect a new VOS dataset called coMplex video Object SEgmentation (MOSE) to study tracking and segmenting objects in complex environments. MOSE contains 2,149 video clips and 5,200 objects from 36 categories, with 431,725 high-quality object segmentation masks. The most notable feature of the MOSE dataset is complex scenes with crowded and occluded objects. The target objects in the videos are commonly occluded by others and disappear in…
Taxonomy
Topics: Visual Attention and Saliency Detection · Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques
Methods: VOS · Contrastive Language-Image Pre-training
