The Monkeytyping Solution to the YouTube-8M Video Understanding Challenge

He-Da Wang; Teng Zhang; Ji Wu

arXiv:1706.05150 · cs.CV · June 19, 2017 · 21 citations

TL;DR

This paper presents the final solution of team monkeytyping, which placed second in the YouTube-8M video understanding challenge. It combines new network structures, multi-scale and attention pooling, and ensemble strategies to improve multi-label video classification.
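As a rough illustration of attention pooling over frame-level features, here is a minimal sketch that assumes pooling means a softmax-weighted average of frame vectors. It is not the authors' implementation: the scoring vector `w` and the helper names `softmax` and `attention_pool` are hypothetical, and a trained model would learn the scores rather than take them as input.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    # Subtract the max score first so math.exp cannot overflow.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(frames: List[Vector], w: Vector) -> Vector:
    """Collapse T frame vectors of size D into one D-dim video vector.

    Hypothetical scoring: each frame x_t gets the scalar score w . x_t,
    and the softmax of those scores weights the average over frames.
    """
    scores = [sum(wi * xi for wi, xi in zip(w, x)) for x in frames]
    alphas = softmax(scores)
    dim = len(frames[0])
    return [sum(a * x[d] for a, x in zip(alphas, frames)) for d in range(dim)]


if __name__ == "__main__":
    frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    # A zero scoring vector gives uniform weights, i.e. plain mean pooling.
    print(attention_pool(frames, w=[0.0, 0.0]))
    # A larger w concentrates the weight on the highest-scoring frame.
    print(attention_pool(frames, w=[5.0, 5.0]))
```

With `w` at zero the result equals mean pooling, which makes the attention version a strict generalisation of averaging frames.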

Contribution

The paper introduces the Chaining network structure for capturing interactions between labels, reports multi-scale and attention pooling techniques for frame sequence modeling, and proposes a stacking algorithm called attention weighted stacking for combining sub-models.
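The Chaining idea can be sketched under one stated assumption: the model is a sequence of stages, and each later stage receives the original input concatenated with the previous stage's label predictions, so it can condition on other labels. The logistic stages, the names `make_logistic_stage` and `chain_predict`, and the toy weights are all hypothetical; the paper's actual stage design (sub-model type, any projection of the predictions) may differ.

```python
import math
from typing import Callable, List

Vector = List[float]
Stage = Callable[[Vector], Vector]  # features -> per-label probabilities


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def make_logistic_stage(weights: List[Vector], bias: Vector) -> Stage:
    """Hypothetical stage: one independent logistic unit per label."""
    def stage(features: Vector) -> Vector:
        return [sigmoid(sum(w * f for w, f in zip(row, features)) + b)
                for row, b in zip(weights, bias)]
    return stage


def chain_predict(x: Vector, stages: List[Stage]) -> List[Vector]:
    """Run the stages in order; each later stage sees x plus the previous
    stage's label predictions, so it can condition on other labels.

    Returns every stage's output, since each one could get its own loss term.
    """
    outputs: List[Vector] = []
    features = x
    for stage in stages:
        preds = stage(features)
        outputs.append(preds)
        features = x + preds  # list concatenation: original input, then predictions
    return outputs


if __name__ == "__main__":
    x = [1.0, -1.0]  # D = 2 input features, L = 2 labels
    first = make_logistic_stage([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
    # Second-stage rows have D + L = 4 weights; label 0 reads label 1's
    # first-stage prediction and vice versa.
    second = make_logistic_stage([[0.0, 0.0, 0.0, 2.0],
                                  [0.0, 0.0, 2.0, 0.0]], [0.0, 0.0])
    for k, preds in enumerate(chain_predict(x, [first, second])):
        print(k, [round(p, 3) for p in preds])
```

Returning every stage's output reflects the assumption that intermediate stages are supervised too; only the last stage's output would serve as the final prediction.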

Findings

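01. An ensemble of 74 sub-models formed the team's final, second-place submission.

02. The Chaining network structure better captures interactions between labels.

03. Attention weighted stacking is proposed for combining sub-models; separately, using the ensemble's output as a side target in training boosts single-model performance.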
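Attention weighted stacking is only named here, so the following is one plausible reading rather than the paper's algorithm. It assumes that for each class the stacker computes a softmax over the M sub-models and averages their probabilities with those weights. The attention logits are passed in directly; in a learned stacker they would come from a small gating network, which is hypothetical and not shown.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    # Subtract the max score first so math.exp cannot overflow.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weighted_stack(preds: List[Vector], scores: List[Vector]) -> Vector:
    """Combine M sub-models' class probabilities into one prediction.

    preds[m][c]  : sub-model m's probability for class c.
    scores[m][c] : unnormalised attention logit for model m on class c
                   (hypothetical input; a trained gate would produce it).
    For each class, the softmax over models gives weights that sum to 1.
    """
    num_classes = len(preds[0])
    combined = []
    for c in range(num_classes):
        weights = softmax([s[c] for s in scores])
        combined.append(sum(w * p[c] for w, p in zip(weights, preds)))
    return combined


if __name__ == "__main__":
    preds = [[0.2, 0.9], [0.6, 0.1]]  # M = 2 models, C = 2 classes
    # Equal logits reduce to plain averaging of the sub-models.
    print(attention_weighted_stack(preds, [[0.0, 0.0], [0.0, 0.0]]))
    # Class-specific logits let each class trust a different model.
    print(attention_weighted_stack(preds, [[4.0, -4.0], [-4.0, 4.0]]))
```

Because the weights are per class, one sub-model can dominate the classes it predicts well while another dominates elsewhere, which a single global weight per model cannot express.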

Abstract

This article describes the final solution of team monkeytyping, who finished in second place in the YouTube-8M video understanding challenge. The dataset used in this challenge is a large-scale benchmark for multi-label video classification. We extend the work in [1] and propose several improvements for frame sequence modeling. We propose a network structure called Chaining that can better capture the interactions between labels. Also, we report our approaches to dealing with multi-scale information and attention pooling. In addition, we find that using the output of the model ensemble as a side target in training can boost single model performance. We report our experiments in bagging, boosting, cascade, and stacking, and propose a stacking algorithm called attention weighted stacking. Our final submission is an ensemble that consists of 74 sub-models, all of which are listed in the…
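The abstract says the ensemble's output serves as a side target during training but does not give the loss. A minimal sketch, assuming a convex blend of two binary cross-entropy terms (one against the ground-truth labels, one against the ensemble's soft probabilities), in the spirit of knowledge distillation. The `mix` weight and the function names are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def bce(targets: Vector, probs: Vector, eps: float = 1e-7) -> float:
    """Mean binary cross-entropy over labels; targets may be soft in [0, 1]."""
    total = 0.0
    for t, p in zip(targets, probs):
        p = min(max(p, eps), 1.0 - eps)  # clip so log never sees 0
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(targets)


def side_target_loss(labels: Vector, ensemble_probs: Vector,
                     model_probs: Vector, mix: float = 0.5) -> float:
    """Blend the usual label loss with a loss against the ensemble's output.

    `mix` is a hypothetical weighting knob: 0 ignores the ensemble,
    1 trains purely towards the ensemble's soft predictions.
    """
    hard = bce(labels, model_probs)
    soft = bce(ensemble_probs, model_probs)
    return (1.0 - mix) * hard + mix * soft


if __name__ == "__main__":
    labels = [1.0, 0.0, 0.0]     # ground-truth multi-label vector
    ensemble = [0.9, 0.4, 0.05]  # ensemble's predicted probabilities
    model = [0.7, 0.2, 0.1]      # single model being trained
    print(side_target_loss(labels, ensemble, model, mix=0.5))
```

The soft term passes on the ensemble's graded confidence (for example, the 0.4 on the second label) that the hard 0/1 labels cannot express.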

Code & Models

Repositories

wangheda/youtube-8m
TensorFlow · Official

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Video Analysis and Summarization