Constrained-size Tensorflow Models for YouTube-8M Video Understanding Challenge
Tianqi Liu, Bo Liu

TL;DR
This paper describes a constrained-size TensorFlow ensemble for YouTube-8M video classification. It builds on the Gated NetVLAD architecture and reaches 88.324% GAP on the private leaderboard while reducing model size by 48.5% through float16 weights.
Contribution
It introduces a compressed ensemble approach: single models are trained in float32, then their weights are stored in float16 and combined into one graph for efficient video classification under a model-size constraint.
Findings
Achieved 88.324% GAP on the private leaderboard
Realized a 48.5% model size reduction with no accuracy loss
Used an ensemble of four models based on the Gated NetVLAD architecture
Abstract
This paper presents our 7th place solution to the second YouTube-8M video understanding competition, which challenges participants to build a constrained-size model to classify millions of YouTube videos into thousands of classes. Our final model consists of four single models aggregated into one TensorFlow graph. For each single model, we use the same network architecture as in the winning solution of the first YouTube-8M video understanding competition, namely Gated NetVLAD. We train the single models separately in TensorFlow's default float32 precision, then replace the weights with float16 precision and ensemble them in the evaluation and inference stages, achieving a 48.5% compression rate without loss of precision. Our best model achieved 88.324% GAP on the private leaderboard. The code is publicly available at https://github.com/boliu61/youtube-8m
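The float16 replacement and ensembling steps described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' TensorFlow code: the toy single-layer classifier, the dimensions, the helper names (`cast_weights_to_float16`, `ensemble_predict`), and the use of an unweighted mean to combine models are all assumptions made for illustration. In this toy the stored weight bytes are halved exactly; the figure reported in the abstract for the full model is 48.5%.

```python
import numpy as np

def cast_weights_to_float16(weights):
    """Return a copy of a {name: array} dict with every array stored as float16."""
    return {name: w.astype(np.float16) for name, w in weights.items()}

def total_bytes(weights):
    """Total storage taken by the weight arrays, in bytes."""
    return sum(w.nbytes for w in weights.values())

def predict(weights, features):
    """Toy single-layer sigmoid classifier standing in for one trained model.

    Weights are upcast to float32 for the arithmetic, so float16 is used
    only as a storage format here (an assumption of this sketch).
    """
    W = weights["W"].astype(np.float32)
    b = weights["b"].astype(np.float32)
    logits = features @ W + b
    return 1.0 / (1.0 + np.exp(-logits))

def ensemble_predict(models, features):
    """Combine models by an unweighted mean of per-class probabilities (assumed)."""
    return np.mean([predict(m, features) for m in models], axis=0)

# Hypothetical toy sizes; the real models and feature dimensions are far larger.
rng = np.random.default_rng(0)
num_features, num_classes, num_videos = 64, 10, 5

# Four independently "trained" float32 models (random weights as stand-ins).
models_fp32 = [
    {
        "W": rng.normal(0.0, 0.1, (num_features, num_classes)).astype(np.float32),
        "b": np.zeros(num_classes, dtype=np.float32),
    }
    for _ in range(4)
]
models_fp16 = [cast_weights_to_float16(m) for m in models_fp32]

features = rng.normal(0.0, 1.0, (num_videos, num_features)).astype(np.float32)
probs_fp32 = ensemble_predict(models_fp32, features)
probs_fp16 = ensemble_predict(models_fp16, features)

size_fp32 = sum(total_bytes(m) for m in models_fp32)
size_fp16 = sum(total_bytes(m) for m in models_fp16)
max_abs_diff = float(np.max(np.abs(probs_fp32 - probs_fp16)))

print(f"float32 bytes: {size_fp32}, float16 bytes: {size_fp16}")
print(f"largest change in an ensemble probability: {max_abs_diff:.2e}")
```

Training in float32 and casting only afterwards keeps optimization in full precision, so the only error introduced is the one-time rounding of each stored weight. The sketch lets you check how small that rounding effect is on the averaged predictions.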
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning
