Human Action Recognition: Pose-based Attention draws focus to Hands
Fabien Baradel, Christian Wolf, Julien Mille

TL;DR
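This paper introduces a pose-conditioned spatio-temporal attention mechanism for human action recognition that automatically attends to the hands most involved in an action and to its most discriminative moments. The authors report state-of-the-art results on NTU-RGB+D, and the attention weights can be visualized to help interpret the model.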
Contribution
The paper presents a recurrent, fully differentiable attention model whose attention distributions are computed from articulated human pose rather than from the RNN hidden state, which is what standard soft-attention mechanisms use as input.
Findings
Achieved state-of-the-art results on the NTU-RGB+D dataset.
Demonstrated improved focus on hands and key action moments.
Provided insights into model interpretability through attention visualization.
Abstract
We propose a new spatio-temporal attention-based mechanism for human action recognition that is able to automatically attend to the hands most involved in the studied action and to detect the most discriminative moments in an action. Attention is handled in a recurrent manner, employing a Recurrent Neural Network (RNN), and is fully differentiable. In contrast to standard soft-attention-based mechanisms, our approach does not use the hidden RNN state as input to the attention model. Instead, attention distributions are extracted using external information: human articulated pose. We performed an extensive ablation study to show the strengths of this approach, and we particularly studied the conditioning aspect of the attention mechanism. We evaluate the method on the largest currently available human action recognition dataset, NTU-RGB+D, and report state-of-the-art results. Other advantages of our…
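To make the conditioning idea concrete, here is a minimal NumPy sketch of soft attention over hand features in which the attention weights are computed from pose alone, not from an RNN hidden state. It is an illustration under stated assumptions, not the authors' implementation: the function name `pose_conditioned_attention`, the one-hidden-layer scoring network, the number of hands, and all dimensions are hypothetical choices made for the example, and the recurrent and temporal-attention parts of the model are omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def pose_conditioned_attention(pose, hand_feats, W1, b1, W2, b2):
    """Soft attention over hands, conditioned only on pose (hypothetical sketch).

    pose:       (T, P) pose descriptor per frame (e.g. flattened joint coordinates)
    hand_feats: (T, H, D) one feature vector per hand crop per frame
    Returns the attended features (T, D) and the attention weights (T, H).
    """
    hidden = np.tanh(pose @ W1 + b1)     # (T, K) assumed small scoring network
    scores = hidden @ W2 + b2            # (T, H) one unnormalized score per hand
    weights = softmax(scores, axis=-1)   # (T, H) distribution over hands
    # Weighted sum of hand features; differentiable w.r.t. all parameters.
    attended = np.einsum("th,thd->td", weights, hand_feats)
    return attended, weights


# Toy dimensions, all assumed for illustration only.
rng = np.random.default_rng(0)
T, P, H, D, K = 5, 30, 4, 8, 16
pose = rng.normal(size=(T, P))
hand_feats = rng.normal(size=(T, H, D))
W1 = rng.normal(size=(P, K)) * 0.1
b1 = np.zeros(K)
W2 = rng.normal(size=(K, H)) * 0.1
b2 = np.zeros(H)

attended, weights = pose_conditioned_attention(pose, hand_feats, W1, b1, W2, b2)
```

Each row of `weights` is a probability distribution over the hand crops for one frame, which is also what one would plot to visualize where the model is looking.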
