Encoding Video and Label Priors for Multi-label Video Classification on YouTube-8M dataset
Seil Na, Youngjae Yu, Sangho Lee, Jisung Kim, Gunhee Kim

TL;DR
This paper presents a deep neural network approach to multi-label video classification on the YouTube-8M dataset. It addresses temporal modeling of videos, label imbalance, and label correlations, and an ensemble of the proposed models placed 8th in the Kaggle competition.
Contribution
It introduces a neural network model built from four components (frame encoder, classification layer, label processing layer, and loss function), with methods for each component tailored to multi-label video classification on a large-scale dataset.
Findings
Most of the proposed models score well above the baseline models.
An ensemble of the models placed 8th in the Kaggle competition.
The model design explicitly targets label correlations and label imbalance.
Abstract
YouTube-8M is the largest video dataset for multi-label video classification. To tackle multi-label classification on this challenging dataset, it is necessary to solve several issues such as temporal modeling of videos, label imbalances, and correlations between labels. We develop a deep neural network model, which consists of four components: the frame encoder, the classification layer, the label processing layer, and the loss function. We introduce our newly proposed methods and discuss how existing models operate in the YouTube-8M Classification Task, what insights they offer, and why they succeed (or fail) in achieving good performance. Most of the models we propose score much higher than the baseline models, and the ensemble of the models we used placed 8th in the Kaggle Competition.
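To make the four-component structure named in the abstract concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not say which techniques fill each component, so mean pooling as the frame encoder, a co-occurrence-based score refinement as the label processing step, and a positively weighted binary cross-entropy as the loss are all illustrative assumptions. Every function name, parameter, and value below is hypothetical.

```python
import math
import random


def encode_frames(frames):
    """Frame encoder stand-in (assumption): mean-pool frame features into one vector."""
    n = len(frames)
    dim = len(frames[0])
    return [sum(frame[d] for frame in frames) / n for d in range(dim)]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def classify(video_vec, weights, biases):
    """Classification layer: one independent sigmoid score per label."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, video_vec)) + b)
        for row, b in zip(weights, biases)
    ]


def refine_with_label_prior(probs, cooccurrence, alpha=0.5):
    """Label processing stand-in (assumption): blend each score with a
    co-occurrence-weighted average of all label scores."""
    refined = []
    for i, p in enumerate(probs):
        row = cooccurrence[i]
        prior = sum(c * q for c, q in zip(row, probs)) / sum(row)
        refined.append((1 - alpha) * p + alpha * prior)
    return refined


def weighted_bce(probs, targets, pos_weights, eps=1e-7):
    """Loss stand-in (assumption): binary cross-entropy that up-weights
    positive examples of rare labels."""
    total = 0.0
    for p, t, w in zip(probs, targets, pos_weights):
        p = min(max(p, eps), 1 - eps)  # keep log() away from 0
        total += -(w * t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(probs)


# Toy data: 5 frames with 4 features each, 3 labels. All values are made up.
random.seed(0)
frames = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(5)]
weights = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(3)]
biases = [0.0, 0.0, 0.0]
cooccurrence = [[1.0, 0.6, 0.1], [0.6, 1.0, 0.2], [0.1, 0.2, 1.0]]

video_vec = encode_frames(frames)
probs = classify(video_vec, weights, biases)
refined = refine_with_label_prior(probs, cooccurrence)
loss = weighted_bce(refined, targets=[1, 0, 0], pos_weights=[2.0, 1.0, 1.0])
print(refined, loss)
```

Keeping the components as separate functions mirrors the abstract's decomposition: each one can be swapped out (for example, a recurrent encoder in place of mean pooling) without touching the others.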
Taxonomy
Topics: Music and Audio Processing · Video Analysis and Summarization · Multimodal Machine Learning Applications
