Enhancing Transformer for Video Understanding Using Gated Multi-Level Attention and Temporal Adversarial Training
Saurabh Sahu, Palash Goyal

TL;DR
This paper introduces Gated Adversarial Transformer (GAT), a novel attention-based model for video understanding that employs multi-level attention gating and adversarial training to improve robustness and accuracy on large-scale datasets.
Contribution
The paper proposes GAT, which combines multi-level attention gating with adversarial training to enhance the video understanding capabilities of Transformer models.
Findings
GAT outperforms existing models on the YouTube-8M dataset.
Multi-level attention gating improves relevance modeling.
Adversarial training enhances model robustness.
Abstract
The introduction of the Transformer model has led to tremendous advancements in sequence modeling, especially in the text domain. However, the use of attention-based models for video understanding is still relatively unexplored. In this paper, we introduce the Gated Adversarial Transformer (GAT) to enhance the applicability of attention-based models to videos. GAT uses a multi-level attention gate to model the relevance of a frame based on local and global contexts. This enables the model to understand the video at various granularities. Further, GAT uses adversarial training to improve model generalization. We propose a temporal attention regularization scheme to improve the robustness of attention modules to adversarial examples. We illustrate the performance of GAT on the large-scale YouTube-8M dataset on the task of video categorization. We further show ablation studies along with quantitative…
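The abstract's idea of gating each frame by its relevance to local and global context can be illustrated with a small sketch. This is not the authors' implementation: the function name `multi_level_gate`, the weight vectors `w_local` and `w_global`, the mean-pooled contexts, and the `window` parameter are all assumptions chosen for illustration, since the abstract does not specify how the gate is computed.

```python
import numpy as np


def sigmoid(x):
    """Logistic function, maps real values into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def multi_level_gate(frames, w_local, w_global, window=2):
    """Hypothetical gate: scale each frame by a relevance score in (0, 1).

    frames:   (T, D) array of per-frame features
    w_local:  (D,) weights for the frame/local-context interaction (assumed)
    w_global: (D,) weights for the frame/global-context interaction (assumed)
    window:   number of neighbouring frames on each side used as local context
    """
    num_frames = frames.shape[0]

    # Global context: mean feature over the whole video (an assumption).
    global_ctx = frames.mean(axis=0)

    # Local context: mean feature over a window centred on each frame.
    local_ctx = np.stack([
        frames[max(0, t - window): min(num_frames, t + window + 1)].mean(axis=0)
        for t in range(num_frames)
    ])

    # One relevance logit per frame, combining both context levels.
    logits = (frames * local_ctx) @ w_local + (frames * global_ctx) @ w_global
    gate = sigmoid(logits)

    # Broadcast the scalar gate over the feature dimension.
    return frames * gate[:, None], gate


# Usage with random stand-in features (6 frames, 4 dimensions).
rng = np.random.default_rng(0)
frames = rng.normal(size=(6, 4))
w_local = rng.normal(size=4)
w_global = rng.normal(size=4)
gated, gate = multi_level_gate(frames, w_local, w_global)
```

In this sketch a frame that agrees with both its neighbourhood and the video as a whole receives a gate value near 1, while an outlier frame is attenuated. A learned model would train `w_local` and `w_global` jointly with the rest of the network rather than sampling them at random.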
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Adversarial Robustness in Machine Learning
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Adam · Attention Is All You Need · Byte Pair Encoding · Layer Normalization · Residual Connection · Label Smoothing
