Learnable Pooling Methods for Video Classification
Sebastian Kmiec, Juhan Bae, Ruijian An

TL;DR
This paper proposes learnable pooling methods with attention mechanisms for video classification, offering new architectures that achieve competitive accuracy within budget constraints, demonstrated on the YouTube-8M challenge.
Contribution
It introduces novel learnable pooling architectures using attention and function approximation for improved video descriptor aggregation.
Findings
Achieved testing accuracy similar to the state of the art while meeting budget constraints
Demonstrated in the 2nd YouTube-8M Video Understanding Challenge using frame-level video and audio descriptors
Provided open-source implementations
Abstract
We introduce modifications to state-of-the-art approaches to aggregating local video descriptors by using attention mechanisms and function approximations. Rather than using ensembles of existing architectures, we provide insight into creating new architectures. We demonstrate our solutions in "The 2nd YouTube-8M Video Understanding Challenge", using frame-level video and audio descriptors. We obtain testing accuracy similar to the state of the art while meeting budget constraints, and we touch upon strategies to improve the state of the art. Model implementations are available at https://github.com/pomonam/LearnablePoolingMethods.
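Illustration
To make the idea of attention-based aggregation concrete, here is a minimal sketch of attention-weighted pooling over frame-level descriptors. It is not the authors' architecture: the function names, the single `query` vector, and the toy data are all assumptions made for illustration, and in a real model the query would be learned rather than fixed.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(frames, query):
    """Aggregate T frame-level descriptors into one video-level descriptor.

    frames: list of T descriptors, each a list of D floats.
    query:  hypothetical attention vector of D floats (learned in practice,
            fixed here for illustration).
    Returns (pooled descriptor of D floats, attention weights of length T).
    """
    # One relevance score per frame: dot product of the frame with the query.
    scores = [sum(f * q for f, q in zip(frame, query)) for frame in frames]
    # Normalize the scores so the weights are positive and sum to 1.
    weights = softmax(scores)
    dim = len(frames[0])
    # Weighted sum of the frames, computed one dimension at a time.
    pooled = [sum(w * frame[d] for w, frame in zip(weights, frames))
              for d in range(dim)]
    return pooled, weights


# Toy input (assumed): three frames with two-dimensional descriptors.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# An all-zero query scores every frame equally, so pooling reduces to a mean.
query = [0.0, 0.0]
pooled, weights = attention_pool(frames, query)
```

With an all-zero query every frame gets the same weight, so the result is plain average pooling. A non-zero query shifts weight toward the frames that align with it, which is what lets the pooling step adapt to the data.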
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques
