Temporal Modeling Approaches for Large-scale Youtube-8M Video Understanding
Fu Li, Chuang Gan, Xiao Liu, Yunlong Bian, Xiang Long, Yandong Li, Zhichao Li, Jie Zhou, Shilei Wen

TL;DR
This paper explores temporal modeling techniques for aggregating pre-extracted frame-level visual and audio features in large-scale multi-label video recognition on YouTube-8M. The resulting system ranked 3rd in the Google Cloud and YouTube-8M Video Understanding Challenge.
Contribution
It introduces a combination of two-stream sequence models, fast-forward sequence models, and temporal residual neural networks for improved video understanding.
Findings
Fast-forward LSTM with a depth of 7 layers achieves 82.75% GAP@20 on the Kaggle Public test set.
Proposed approaches significantly outperform existing temporal models.
System ranks 3rd in the YouTube-8M Video Understanding Challenge.
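The GAP@20 figure above is a Global Average Precision computed over the top 20 predictions per video. A minimal sketch of that metric follows, to make the number easier to interpret. It is not the challenge's official evaluation code: the function name `gap_at_k`, the input layout, and the choice to normalize by the total number of ground-truth labels are assumptions of this sketch.

```python
def gap_at_k(scores, labels, k=20):
    """Global Average Precision over the top-k predictions of every video.

    scores: list of per-video lists of class confidences (hypothetical layout).
    labels: list of per-video sets holding the positive class indices.
    """
    pooled = []           # (confidence, is_correct) pairs from all videos
    total_positives = 0   # assumption: recall is measured against all true labels
    for video_scores, video_labels in zip(scores, labels):
        total_positives += len(video_labels)
        # Keep only the k most confident classes for this video.
        top = sorted(range(len(video_scores)),
                     key=lambda c: video_scores[c], reverse=True)[:k]
        pooled.extend((video_scores[c], c in video_labels) for c in top)

    if total_positives == 0:
        return 0.0

    # Rank all pooled predictions globally by confidence.
    pooled.sort(key=lambda pair: pair[0], reverse=True)

    hits = 0
    precision_sum = 0.0
    for rank, (_, correct) in enumerate(pooled, start=1):
        if correct:
            hits += 1
            # Precision at this rank; each hit raises recall by 1/total_positives.
            precision_sum += hits / rank
    return precision_sum / total_positives


if __name__ == "__main__":
    # Two toy videos, three classes, top-2 predictions per video.
    toy_scores = [[0.9, 0.1, 0.8], [0.2, 0.7, 0.3]]
    toy_labels = [{0}, {1, 2}]
    print(gap_at_k(toy_scores, toy_labels, k=2))
```

Because predictions from all videos are pooled before ranking, the metric rewards confidences that are calibrated across videos, not just a correct ordering within each video.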
Abstract
This paper describes our solution for the video recognition task of the Google Cloud and YouTube-8M Video Understanding Challenge, which ranked 3rd. Because the challenge provides pre-extracted visual and audio features instead of the raw videos, we mainly investigate various temporal modeling approaches to aggregate the frame-level features for multi-label video recognition. Our system contains three major components: a two-stream sequence model, a fast-forward sequence model, and temporal residual neural networks. Experimental results on the challenging YouTube-8M dataset demonstrate that our proposed temporal modeling approaches significantly improve on existing temporal modeling approaches in large-scale video recognition tasks. Notably, our fast-forward LSTM with a depth of 7 layers achieves 82.75% in terms of GAP@20 on the Kaggle Public test set.
Taxonomy
Topics: Video Analysis and Summarization · Human Pose and Action Recognition · Generative Adversarial Networks and Image Synthesis
Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory
