Hierarchical Deep Recurrent Architecture for Video Understanding
Luming Tang, Boyang Deng, Haiyu Zhao, Shuai Yi

TL;DR
This paper presents a hierarchical deep recurrent architecture for large-scale multi-label video classification. It incorporates attention pooling methods and ensemble techniques, and reaches a GAP of 0.84346 on the public test set of the YouTube-8M benchmark.
Contribution
It introduces a hierarchical deep recurrent framework with attention pooling and ensemble methods for improved video understanding and classification performance.
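The abstract is cut off before it describes the ensemble procedure. The sketch below therefore shows only a generic baseline for combining models: a weighted arithmetic mean of per-model class scores. The function name `weighted_ensemble` and the weighting scheme are illustrative assumptions, not the authors' method.

```python
import numpy as np


def weighted_ensemble(model_scores, weights=None):
    """Combine per-model class scores with a (weighted) arithmetic mean.

    Generic sketch (assumption): not the ensemble procedure from the paper.
    model_scores: array of shape (num_models, num_videos, num_classes)
    weights: optional non-negative array of shape (num_models,); uniform if None
    """
    model_scores = np.asarray(model_scores, dtype=float)
    num_models = model_scores.shape[0]
    if weights is None:
        weights = np.full(num_models, 1.0 / num_models)
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()  # normalise so the weights sum to 1
    # Contract the model axis: result has shape (num_videos, num_classes).
    return np.tensordot(weights, model_scores, axes=1)


if __name__ == "__main__":
    a = [[[0.2, 0.8]]]  # toy scores from a hypothetical model A
    b = [[[0.6, 0.4]]]  # toy scores from a hypothetical model B
    print(weighted_ensemble([a[0], b[0]]))  # uniform average
```

Per-model weights would typically be tuned on a held-out validation split; uniform averaging is the simplest default.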
Findings
Achieved a Global Average Precision (GAP) of 0.84346 on the public test set.
Developed attention pooling methods that weight frames by their estimated importance.
An ensemble of 18 models outperforms the individual models.
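GAP, the challenge metric cited above, can be illustrated with a short sketch. Assuming the YouTube-8M convention, each video contributes its top-k predictions (k = 20 in the challenge). All (confidence, label) pairs are then pooled and sorted by confidence, and GAP = Σ p(i) Δr(i), where p(i) is the precision at rank i and Δr(i) is the change in recall at rank i. The recall denominator here is taken to be the total number of ground-truth positive labels. The function name is hypothetical.

```python
import numpy as np


def global_average_precision(scores, labels, top_k=20):
    """GAP over a batch of videos (illustrative sketch, not official code).

    scores: (num_videos, num_classes) predicted confidences
    labels: (num_videos, num_classes) binary ground truth
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=float)
    k = min(top_k, scores.shape[1])
    # Indices of the k highest-scoring classes for each video.
    top = np.argsort(-scores, axis=1)[:, :k]
    rows = np.arange(scores.shape[0])[:, None]
    conf = scores[rows, top].ravel()  # pooled confidences
    hits = labels[rows, top].ravel()  # 1 where the prediction is correct
    total_pos = labels.sum()          # recall denominator
    if total_pos == 0:
        return 0.0
    # Sort all pooled predictions by descending confidence.
    hits = hits[np.argsort(-conf, kind="stable")]
    precision = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    # Delta recall at rank i is hits[i] / total_pos.
    return float(np.sum(precision * hits) / total_pos)


if __name__ == "__main__":
    s = [[0.9, 0.1, 0.5], [0.2, 0.8, 0.3]]
    y = [[1, 0, 0], [0, 1, 1]]
    print(round(global_average_precision(s, y, top_k=2), 4))  # 0.9167
```

In the toy example, the pooled ranking is hit, hit, miss, hit, so GAP = (1 + 1 + 3/4) / 3 ≈ 0.9167.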
Abstract
This paper introduces the system we developed for the YouTube-8M Video Understanding Challenge, which uses a large-scale benchmark dataset for multi-label video classification. The proposed framework has a hierarchical deep architecture consisting of a frame-level sequence modeling part and a video-level classification part. In the frame-level sequence modeling part, we explore a set of methods, including Pooling-LSTM (PLSTM), Hierarchical-LSTM (HLSTM), and Random-LSTM (RLSTM), to handle the large number of frames in a video. We also introduce two attention pooling methods, single attention pooling (ATT) and multiple attention pooling (Multi-ATT), so that the model can focus on the informative frames in a video and ignore the uninformative ones. In the video-level classification part, two methods are proposed to increase the classification performance, i.e.…
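The abstract does not give the exact form of ATT or Multi-ATT, so the following is only a minimal sketch of a common additive attention pooling formulation. It assumes a sequence of per-frame vectors (raw frame features or recurrent hidden states). Each frame t gets a score e_t = vᵀ tanh(W x_t + b). A softmax over frames turns the scores into weights, and the pooled vector is the weighted sum of the frames. The multi-head variant simply runs several independent pools and concatenates them. The function names and the parameters `W`, `b`, `v` are hypothetical.

```python
import numpy as np


def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def attention_pool(frames, W, b, v):
    """Single attention pooling, ATT-style sketch (assumed formulation).

    frames: (T, D) per-frame vectors
    W: (D, A), b: (A,), v: (A,) hypothetical learned parameters
    Returns the pooled (D,) vector and the (T,) frame weights.
    """
    hidden = np.tanh(frames @ W + b)  # (T, A)
    scores = hidden @ v               # (T,) one relevance score per frame
    weights = softmax(scores)         # (T,) non-negative, sums to 1
    return weights @ frames, weights


def multi_attention_pool(frames, heads):
    """Multi-ATT-style sketch: one independent pool per head, concatenated."""
    pooled = [attention_pool(frames, W, b, v)[0] for (W, b, v) in heads]
    return np.concatenate(pooled)     # (H * D,)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, D, A, H = 300, 8, 4, 3
    frames = rng.standard_normal((T, D))
    heads = [(rng.standard_normal((D, A)), np.zeros(A), rng.standard_normal(A))
             for _ in range(H)]
    pooled, w = attention_pool(frames, *heads[0])
    multi = multi_attention_pool(frames, heads)
    print(pooled.shape, multi.shape)  # (8,) (24,)
```

Frames with low weights contribute little to the pooled vector, which is how such a pool can down-weight uninformative frames. Using several heads lets different pools attend to different subsets of frames.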
Taxonomy
Topics: Video Analysis and Summarization · Multimodal Machine Learning Applications · Music and Audio Processing
