Temporal Modeling Approaches for Large-scale Youtube-8M Video Understanding

Fu Li; Chuang Gan; Xiao Liu; Yunlong Bian; Xiang Long; Yandong Li; Zhichao Li; Jie Zhou; Shilei Wen

arXiv:1707.04555·cs.CV·July 17, 2017·49 cites

TL;DR

This paper explores temporal modeling techniques that aggregate frame-level features for large-scale multi-label video recognition on YouTube-8M. The resulting system ranked 3rd in the Google Cloud and YouTube-8M Video Understanding Challenge.

Contribution

It introduces a combination of two-stream sequence models, fast-forward sequence models, and temporal residual neural networks for improved video understanding.

Findings

01

A fast-forward LSTM with a depth of 7 layers achieves 82.75% GAP@20 on the Kaggle Public test set.

02

Proposed approaches significantly outperform existing temporal models.

03

System ranks 3rd in the YouTube-8M Video Understanding Challenge.

Abstract

This paper describes our solution for the video recognition task of the Google Cloud and YouTube-8M Video Understanding Challenge, which ranked 3rd. Because the challenge provides pre-extracted visual and audio features instead of the raw videos, we mainly investigate various temporal modeling approaches to aggregate the frame-level features for multi-label video recognition. Our system contains three major components: a two-stream sequence model, a fast-forward sequence model, and temporal residual neural networks. Experimental results on the challenging Youtube-8M dataset demonstrate that our proposed temporal modeling approaches significantly improve on existing temporal modeling approaches in large-scale video recognition tasks. Notably, our fast-forward LSTM with a depth of 7 layers achieves 82.75% in terms of GAP@20 on the Kaggle Public test set.
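The GAP@20 figure quoted in the abstract is the challenge's Global Average Precision metric. As a rough illustration only, the sketch below follows the metric as the challenge describes it: keep each video's 20 highest-scoring labels, pool them across all videos, rank the pool by confidence, and average the precision at every correct prediction over the total number of ground-truth labels. The function name `gap_at_k` and the dict/set data layout are assumptions made for this example; this is not the authors' evaluation code.

```python
from typing import Dict, List, Sequence, Set, Tuple


def gap_at_k(
    predictions: Sequence[Dict[str, float]],
    labels: Sequence[Set[str]],
    k: int = 20,
) -> float:
    """Global Average Precision over the top-k predictions of each video.

    predictions[i] maps label -> confidence for video i (hypothetical layout).
    labels[i] is the set of ground-truth labels for video i.
    """
    pooled: List[Tuple[float, bool]] = []
    total_positives = 0
    for scores, truth in zip(predictions, labels):
        total_positives += len(truth)
        # Keep only the k most confident labels for this video.
        top = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]
        pooled.extend((conf, label in truth) for label, conf in top)

    if total_positives == 0:
        return 0.0

    # Rank every kept prediction from all videos in one global list.
    pooled.sort(key=lambda pair: pair[0], reverse=True)

    hits = 0
    precision_sum = 0.0
    for rank, (_, is_hit) in enumerate(pooled, start=1):
        if is_hit:
            hits += 1
            # Precision at this rank; recall rises by 1/total_positives here.
            precision_sum += hits / rank
    return precision_sum / total_positives
```

Called as `gap_at_k(predictions, labels)`, the function returns a value between 0 and 1, so a score reported as 82.75% corresponds to 0.8275. Because all videos share one ranked list, the metric rewards confidences that are calibrated across videos, not only a correct ordering within each video.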

Code & Models

Repositories

baidu/Youtube-8M
paddle · Official

Taxonomy

Topics: Video Analysis and Summarization · Human Pose and Action Recognition · Generative Adversarial Networks and Image Synthesis

Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory