Regional Attention Network (RAN) for Head Pose and Fine-grained Gesture Recognition
Ardhendu Behera, Zachary Wharton, Morteza Ghahremani, Swagat Kumar, Nik Bessis

TL;DR
This paper introduces a Regional Attention Network (RAN), a CNN-based model that leverages attention mechanisms over semantic regions to improve fine-grained gesture and head pose recognition across multiple datasets.
Contribution
The paper proposes a novel end-to-end CNN architecture, RAN, that uses attention over adaptive semantic regions for robust recognition of gestures, head poses, and expressions.
Findings
Outperforms state-of-the-art methods on ten diverse datasets.
Effective in recognizing head pose, driver states, and facial expressions.
Utilizes attention to focus on relevant image regions, reducing reliance on precise body part detection.
Abstract
Affect is often expressed via non-verbal body language such as actions/gestures, which are vital indicators for human behaviors. Recent studies on recognition of fine-grained actions/gestures in monocular images have mainly focused on modeling spatial configuration of body parts representing body pose, human-objects interactions and variations in local appearance. The results show that this is a brittle approach since it relies on accurate body parts/objects detection. In this work, we argue that there exist local discriminative semantic regions, whose "informativeness" can be evaluated by the attention mechanism for inferring fine-grained gestures/actions. To this end, we propose a novel end-to-end Regional Attention Network (RAN), which is a fully Convolutional Neural Network (CNN) to combine multiple contextual regions through attention mechanism, focusing on parts of the…
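Illustration
The abstract's idea of scoring the "informativeness" of regions and combining them through attention can be sketched generically as attention-weighted pooling. This is not the RAN architecture from the paper: the function names `softmax` and `attend_regions`, the dot-product scoring, the `query` vector, and the toy region features are all assumptions made for illustration, and a real model would learn these quantities from CNN feature maps.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend_regions(
    region_feats: Sequence[Sequence[float]],
    query: Sequence[float],
) -> Tuple[List[float], List[float]]:
    """Pool region feature vectors with dot-product attention weights.

    region_feats: one feature vector per image region (hypothetical inputs).
    query: scoring vector (hypothetical; it would be learned in a real model).
    Returns (attention_weights, pooled_feature_vector).
    """
    # Score each region by its dot product with the query vector.
    scores = [sum(f * q for f, q in zip(feat, query)) for feat in region_feats]
    # Normalise the scores so the weights are positive and sum to one.
    weights = softmax(scores)
    # Weighted sum of region features, computed one dimension at a time.
    dim = len(region_feats[0])
    pooled = [
        sum(w * feat[d] for w, feat in zip(weights, region_feats))
        for d in range(dim)
    ]
    return weights, pooled


# Toy example: three regions with 2-D features (made-up values).
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, pooled = attend_regions(regions, query=[2.0, 0.0])
```

In this toy setup the first and third regions receive equal, higher weights because they align with the query, while the second region is down-weighted. That mirrors, in a very simplified way, how attention can emphasise relevant regions without requiring explicit body part detection.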
