Weakly-Supervised Action Localization and Action Recognition using Global-Local Attention of 3D CNN
Novanto Yudistira, Muthu Subash Kavitha, Takio Kurita

TL;DR
This paper introduces a weakly-supervised method for action localization and recognition in videos using global-local attention mechanisms in 3D CNNs, improving interpretability and accuracy.
Contribution
It proposes a novel global-local gradient aggregation and attention gating approach for enhanced visual explanations and action recognition in 3D CNNs.
Findings
Improved visual attribution and localization accuracy.
Enhanced action recognition performance over baseline.
Effective use of layer-wise attention for video analysis.
Abstract
A 3D Convolutional Neural Network (3D CNN) captures spatial and temporal information from 3D data such as video sequences. However, the convolution and pooling mechanisms make some information loss seem unavoidable. To improve visual explanation and classification in 3D CNNs, we propose two approaches: i) aggregating layer-wise global-to-local (global-local) discrete gradients using a trained 3DResNext network, and ii) implementing an attention gating network to improve the accuracy of action recognition. The proposed approach aims to show the usefulness of every layer, termed global-local attention, in a 3D CNN through visual attribution, weakly-supervised action localization, and action recognition. First, the 3DResNext is trained and applied to action classification, with backpropagation performed with respect to the maximum predicted class. The gradients and activations of every layer are then…
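As a rough illustration of how per-layer gradients and activations might be combined into a single attention volume, the following is a minimal sketch under stated assumptions. It uses a Grad-CAM-style weighting (channel weights taken as the mean gradient) and a plain average across layers; the abstract is truncated before it describes the actual aggregation, so this is a stand-in and not the authors' method. The function names `layer_attention`, `upsample_nearest`, and `aggregate_layers` are hypothetical, and the inputs are assumed to be arrays of shape (channels, frames, height, width) already extracted from a trained network.

```python
import numpy as np


def layer_attention(activations, gradients):
    """Grad-CAM-style map for one layer (assumed stand-in, not the paper's rule).

    Both inputs have shape (C, T, H, W). Returns a (T, H, W) map in [0, 1].
    """
    # One weight per channel: the gradient averaged over time and space.
    weights = gradients.mean(axis=(1, 2, 3))
    # Weighted sum over channels, contracting the channel axis.
    cam = np.tensordot(weights, activations, axes=1)
    # Keep only evidence that supports the predicted class.
    cam = np.maximum(cam, 0.0)
    peak = cam.max()
    return cam / peak if peak > 0 else cam


def upsample_nearest(volume, target_shape):
    """Nearest-neighbour upsampling by integer factors along every axis."""
    for axis, (src, dst) in enumerate(zip(volume.shape, target_shape)):
        if dst % src != 0:
            raise ValueError("target size must be a multiple of source size")
        volume = np.repeat(volume, dst // src, axis=axis)
    return volume


def aggregate_layers(per_layer, target_shape):
    """Average the per-layer maps after resizing them to a common shape.

    per_layer is a list of (activations, gradients) pairs, ordered from
    shallow (local) to deep (global) layers. Averaging is an assumption.
    """
    maps = [
        upsample_nearest(layer_attention(acts, grads), target_shape)
        for acts, grads in per_layer
    ]
    return np.mean(maps, axis=0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic stand-ins for two layers of a 3D CNN.
    shallow = (rng.random((8, 4, 8, 8)), rng.standard_normal((8, 4, 8, 8)))
    deep = (rng.random((16, 2, 4, 4)), rng.standard_normal((16, 2, 4, 4)))
    attention = aggregate_layers([shallow, deep], target_shape=(4, 8, 8))
    print(attention.shape)
```

In a weakly-supervised localization setting, a volume like `attention` could then be thresholded to propose the frames and regions where the action occurs; the threshold and any post-processing would need to follow the full paper.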
Taxonomy
Methods: 3 Dimensional Convolutional Neural Network · Global-Local Attention · Convolution
