Enhancing Transformer for Video Understanding Using Gated Multi-Level Attention and Temporal Adversarial Training

Saurabh Sahu; Palash Goyal

arXiv:2103.10043·cs.CV·March 19, 2021·1 cites

Open Access

TL;DR

This paper introduces the Gated Adversarial Transformer (GAT), an attention-based model for video understanding that uses multi-level attention gating and adversarial training to improve generalization and robustness, evaluated on video categorization with the large-scale YouTube-8M dataset.

Contribution

The paper proposes GAT, which combines multi-level attention gating with adversarial training to make attention-based (Transformer) models more applicable to video understanding.

Findings

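01 GAT outperforms existing models on the YouTube-8M dataset.

02 Multi-level attention gating models the relevance of each frame from local and global contexts.

03 Adversarial training improves model generalization, and temporal attention regularization improves the robustness of the attention modules to adversarial examples.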

Abstract

The introduction of the Transformer model has led to tremendous advancements in sequence modeling, especially in the text domain. However, the use of attention-based models for video understanding is still relatively unexplored. In this paper, we introduce the Gated Adversarial Transformer (GAT) to enhance the applicability of attention-based models to videos. GAT uses a multi-level attention gate to model the relevance of a frame based on local and global contexts. This enables the model to understand the video at various granularities. Further, GAT uses adversarial training to improve model generalization. We propose a temporal attention regularization scheme to improve the robustness of attention modules to adversarial examples. We illustrate the performance of GAT on the large-scale YouTube-8M dataset on the task of video categorization. We further show ablation studies along with quantitative…
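The idea of gating a frame by its local and global context can be illustrated with a minimal sketch. This is not the paper's architecture: the scalar sigmoid gate, the mean-pooled contexts, and the names `gated_frames`, `w_local`, `w_global`, `bias`, and `window` are all assumptions made for illustration only.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def mean_vector(frames):
    """Element-wise mean of a non-empty list of equal-length feature vectors."""
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dim)]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def gated_frames(frames, w_local, w_global, bias=0.0, window=1):
    """Scale every frame by a scalar relevance gate in (0, 1).

    Hypothetical gate for frame t (an assumption, not the paper's formula):
        sigmoid(w_local . local_ctx + w_global . global_ctx + bias)
    where local_ctx is the mean of frames t-window .. t+window and
    global_ctx is the mean of all frames in the video.
    """
    global_ctx = mean_vector(frames)
    gated, gates = [], []
    for t, frame in enumerate(frames):
        # Clip the local window at the start and end of the video.
        lo, hi = max(0, t - window), min(len(frames), t + window + 1)
        local_ctx = mean_vector(frames[lo:hi])
        g = sigmoid(dot(w_local, local_ctx) + dot(w_global, global_ctx) + bias)
        gates.append(g)
        gated.append([g * x for x in frame])
    return gated, gates


# Toy video: three frames with two features each (made-up values).
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
gated, gates = gated_frames(frames, w_local=[1.0, 0.0], w_global=[0.0, 0.0], window=0)
```

With `window=0` the local context is the frame itself, so in this toy setup frames with a larger first feature receive a larger gate. A learned model would fit `w_local`, `w_global`, and `bias` rather than fixing them by hand.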

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Adversarial Robustness in Machine Learning

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Adam · Attention Is All You Need · Byte Pair Encoding · Layer Normalization · Residual Connection · Label Smoothing