Attention in Attention: Modeling Context Correlation for Efficient Video Classification
Yanbin Hao, Shuo Wang, Pei Cao, Xinjian Gao, Tong Xu, Jinmeng Wu and, Xiangnan He

TL;DR
This paper introduces an efficient attention-in-attention (AIA) mechanism that models the correlation between different context types in video classification, improving feature refinement with minimal computational overhead.
Contribution
It proposes a novel AIA module that incorporates context correlation into attention learning, enhancing video feature refinement without significant computational cost.
Findings
AIA improves classification accuracy on standard benchmarks.
The method introduces less than 0.02% additional computational cost.
Extensive experiments validate the effectiveness of AIA in various backbones.
Abstract
Attention mechanisms have significantly boosted the performance of video classification neural networks thanks to the utilization of perspective contexts. However, the current research on video attention generally focuses on adopting a specific aspect of contexts (e.g., channel, spatial/temporal, or global context) to refine the features and neglects their underlying correlation when computing attentions. This leads to incomplete context utilization and hence bears the weakness of limited performance improvement. To tackle the problem, this paper proposes an efficient attention-in-attention (AIA) method for element-wise feature refinement, which investigates the feasibility of inserting the channel context into the spatio-temporal attention learning module, referred to as CinST, and also its reverse variant, referred to as STinC. Specifically, we instantiate the video feature contexts…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsAdvanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Visual Attention and Saliency Detection
MethodsMax Pooling
