YouTube-VOS: A Large-Scale Video Object Segmentation Benchmark
Ning Xu, Linjie Yang, Yuchen Fan, Dingcheng Yue, Yuchen Liang, Jianchao Yang, and Thomas Huang

TL;DR
This paper introduces YouTube-VOS, the largest video object segmentation dataset to date, enabling better development of end-to-end spatial-temporal segmentation algorithms and establishing new benchmarks.
Contribution
The creation of YouTube-VOS, a large-scale, diverse video object segmentation dataset with 4,453 YouTube video clips and 94 object categories, large enough to support end-to-end sequential learning of spatial-temporal features.
Findings
Established baseline performances of existing algorithms on YouTube-VOS.
Demonstrated the dataset's potential to improve video segmentation methods.
Provided a publicly available benchmark for future research.
Abstract
Learning long-term spatial-temporal features is critical for many video analysis tasks. However, existing video segmentation methods predominantly rely on static image segmentation techniques, and methods capturing temporal dependency for segmentation have to depend on pretrained optical flow models, leading to suboptimal solutions for the problem. End-to-end sequential learning to explore spatial-temporal features for video segmentation is largely limited by the scale of available video segmentation datasets, i.e., even the largest video segmentation dataset only contains 90 short video clips. To solve this problem, we build a new large-scale video object segmentation dataset called YouTube Video Object Segmentation dataset (YouTube-VOS). Our dataset contains 4,453 YouTube video clips and 94 object categories. This is by far the largest video object segmentation dataset to our…
Taxonomy
Topics: Visual Attention and Saliency Detection · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization
