Multi-attention Networks for Temporal Localization of Video-level Labels
Lijun Zhang, Srinath Nizampatnam, Ahana Gangopadhyay, and Marcos V. Conde

TL;DR
This paper introduces an ensemble of attention-based neural networks for temporal localization of video labels, effectively handling noisy data and improving accuracy in large-scale video understanding tasks.
Contribution
The work proposes a novel multi-attention network architecture with multiple attention mechanisms and ensemble strategies for segment-level video classification.
Findings
Achieved competitive ranking in the YouTube-8M challenge
Effectively handled noisy labels with attention mechanisms
Improved performance through model ensemble and fine-tuning
Abstract
Temporal localization remains an important challenge in video understanding. In this work, we present our solution to the 3rd YouTube-8M Video Understanding Challenge organized by Google Research. Participants were required to build a segment-level classifier using a large-scale training data set with noisy video-level labels and a relatively small-scale validation data set with accurate segment-level labels. We formulated the problem as a multiple-instance multi-label learning problem and developed an attention-based mechanism that selectively emphasizes the important frames through attention weights. The model performance is further improved by constructing multiple sets of attention networks. We further fine-tuned the model using the segment-level data set. Our final model consists of an ensemble of attention/multi-attention networks, deep bag of frames models, recurrent neural networks and…
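The attention pooling idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes that each attention set scores frames with a single linear layer, that the scores are normalized with a softmax over the frame axis, and that a sigmoid output gives one independent probability per label. The function name `multi_attention_pool` and all parameter names and shapes are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_attention_pool(frames, W_att, W_cls, b_cls):
    """Pool frame features with K attention sets, then classify (illustrative only).

    frames: (T, D) frame-level features for one video or segment
    W_att:  (D, K) one linear scoring vector per attention set (assumed form)
    W_cls:  (K * D, C) classifier weights over the concatenated pooled vectors
    b_cls:  (C,) classifier bias
    """
    scores = frames @ W_att             # (T, K): one score per frame per attention set
    weights = softmax(scores, axis=0)   # normalize over frames, so each column sums to 1
    pooled = weights.T @ frames         # (K, D): one weighted average of frames per set
    logits = pooled.reshape(-1) @ W_cls + b_cls
    probs = 1.0 / (1.0 + np.exp(-logits))  # sigmoid: independent per-label probabilities
    return probs, weights

# Example with random inputs: 5 frames, 8-dim features, 3 attention sets, 4 labels.
rng = np.random.default_rng(0)
frames = rng.normal(size=(5, 8))
probs, weights = multi_attention_pool(
    frames,
    W_att=rng.normal(size=(8, 3)),
    W_cls=rng.normal(size=(24, 4)),
    b_cls=np.zeros(4),
)
print(probs.shape, weights.shape)
```

In this sketch the returned `weights` show which frames each attention set emphasizes, which is the kind of signal that can be used to localize a video-level label in time.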
Taxonomy
Topics: Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques · Video Surveillance and Tracking Methods
Methods: Attention Is All You Need · Softmax · Linear Layer · Long Short-Term Memory · Gated Recurrent Unit · Multi-Attention Network · Multi-Head Attention
