YouTube-8M: A Large-Scale Video Classification Benchmark
Sami Abu-El-Haija, Nisarg Kothari, Joonseok Lee, Paul Natsev, George Toderici, Balakrishnan Varadarajan, Sudheendra Vijayanarasimhan

TL;DR
YouTube-8M is a large-scale, multi-label video classification dataset with 8 million videos and 4800 labels, enabling rapid training and benchmarking of video understanding models.
Contribution
The paper introduces YouTube-8M, the largest multi-label video classification dataset, with machine-generated but high-precision labels and a scalable framework for training and evaluating video classification models.
Findings
Models trained on YouTube-8M achieve competitive performance.
Despite the dataset's size, some models train to convergence in less than a day on a single machine.
The dataset facilitates rapid development and benchmarking of video classification algorithms.
Abstract
Many recent advancements in Computer Vision are attributed to large datasets. Open-source software packages for Machine Learning and inexpensive commodity hardware have reduced the barrier of entry for exploring novel approaches at scale. It is possible to train models over millions of examples within a few days. Although large-scale datasets exist for image understanding, such as ImageNet, there are no comparable size video classification datasets. In this paper, we introduce YouTube-8M, the largest multi-label video classification dataset, composed of ~8 million videos (500K hours of video), annotated with a vocabulary of 4800 visual entities. To get the videos and their labels, we used a YouTube video annotation system, which labels videos with their main topics. While the labels are machine-generated, they have high-precision and are derived from a variety of human-based signals…
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning
