The Monkeytyping Solution to the YouTube-8M Video Understanding Challenge
He-Da Wang, Teng Zhang, Ji Wu

TL;DR
This paper presents the final solution by team monkeytyping for the YouTube-8M video understanding challenge, featuring novel network structures, multi-scale and attention mechanisms, and ensemble strategies to improve multi-label video classification.
Contribution
The paper introduces the Chaining network structure, multi-scale and attention pooling techniques, and a stacking algorithm called attention weighted stacking for enhanced video understanding.
Findings
The final submission, an ensemble of 74 sub-models, finished in second place in the challenge.
Chaining network improves label interaction modeling.
Using the output of a model ensemble as a side target in training boosts single-model performance.
Abstract
This article describes the final solution of team monkeytyping, who finished in second place in the YouTube-8M video understanding challenge. The dataset used in this challenge is a large-scale benchmark for multi-label video classification. We extend the work in [1] and propose several improvements for frame sequence modeling. We propose a network structure called Chaining that can better capture the interactions between labels. We also report our approaches to handling multi-scale information and attention pooling. In addition, we find that using the output of a model ensemble as a side target in training can boost single-model performance. We report our experiments in bagging, boosting, cascade, and stacking, and propose a stacking algorithm called attention weighted stacking. Our final submission is an ensemble that consists of 74 sub models, all of which are listed in the…
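The idea of training a single model against both the ground-truth labels and an ensemble's output can be illustrated with a minimal sketch. The abstract does not give the loss formulation, so everything here is an assumption made for illustration: the function names, the use of binary cross-entropy for both terms, and the mixing coefficient `side_weight` are hypothetical and are not taken from the paper.

```python
import math


def binary_cross_entropy(probs, targets, eps=1e-7):
    """Mean binary cross-entropy over labels; targets may be soft values in [0, 1]."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(probs)


def side_target_loss(probs, labels, ensemble_probs, side_weight=0.5):
    """Blend the loss against ground-truth labels with a loss against ensemble output.

    side_weight is a hypothetical mixing coefficient, not a value from the paper.
    """
    hard = binary_cross_entropy(probs, labels)          # main target: true labels
    soft = binary_cross_entropy(probs, ensemble_probs)  # side target: ensemble predictions
    return (1.0 - side_weight) * hard + side_weight * soft


# Illustrative values only: three labels for one video.
probs = [0.9, 0.2, 0.6]            # single-model predictions
labels = [1.0, 0.0, 1.0]           # ground-truth multi-label vector
ensemble_probs = [0.95, 0.1, 0.8]  # predictions from a previously trained ensemble
loss = side_target_loss(probs, labels, ensemble_probs)
```

With `side_weight=0.0` the sketch reduces to ordinary multi-label training, and with `side_weight=1.0` the model is fit to the ensemble's predictions alone. The weighting the authors actually used would need to be taken from the full paper.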
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Video Analysis and Summarization
