YouTube-VOS: Sequence-to-Sequence Video Object Segmentation
Ning Xu, Linjie Yang, Yuchen Fan, Jianchao Yang, Dingcheng Yue, Yuchen Liang, Brian Price, Scott Cohen, and Thomas Huang

TL;DR
This paper introduces a large-scale video object segmentation dataset, YouTube-VOS, and proposes a sequence-to-sequence network that effectively captures long-term spatial-temporal features, achieving state-of-the-art results.
Contribution
The paper presents the largest video object segmentation dataset to date and a novel sequence-to-sequence model for improved long-term video segmentation.
Findings
Achieved top performance on YouTube-VOS test set
Results on DAVIS 2016 comparable to the current state of the art
Large dataset significantly improves model effectiveness
Abstract
Learning long-term spatial-temporal features is critical for many video analysis tasks. However, existing video segmentation methods predominantly rely on static image segmentation techniques, and methods capturing temporal dependency for segmentation have to depend on pretrained optical flow models, leading to suboptimal solutions for the problem. End-to-end sequential learning to explore spatial-temporal features for video segmentation is largely limited by the scale of available video segmentation datasets, i.e., even the largest video segmentation dataset only contains 90 short video clips. To solve this problem, we build a new large-scale video object segmentation dataset called the YouTube Video Object Segmentation dataset (YouTube-VOS). Our dataset contains 3,252 YouTube video clips and 78 categories including common objects and human activities. This is by far the largest video…
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Visual Attention and Saliency Detection · Video Surveillance and Tracking Methods
