Action Recognition using Visual Attention
Shikhar Sharma, Ryan Kiros, Ruslan Salakhutdinov

TL;DR
This paper introduces a soft attention mechanism integrated with deep RNNs for action recognition in videos, enabling the model to focus on relevant frame regions and improve classification accuracy.
Contribution
It presents a novel attention-based deep RNN model that learns to selectively focus on important parts of video frames for action recognition.
Findings
Soft attention over frame regions improves classification accuracy.
The model shifts its focus depending on the scene and the action being performed.
Evaluation covers the UCF-11 (YouTube Action), HMDB-51 and Hollywood2 datasets.
Abstract
We propose a soft attention based model for the task of action recognition in videos. We use multi-layered Recurrent Neural Networks (RNNs) with Long Short-Term Memory (LSTM) units which are deep both spatially and temporally. Our model learns to focus selectively on parts of the video frames and classifies videos after taking a few glimpses. The model essentially learns which parts in the frames are relevant for the task at hand and attaches higher importance to them. We evaluate the model on UCF-11 (YouTube Action), HMDB-51 and Hollywood2 datasets and analyze how the model focuses its attention depending on the scene and the action being performed.
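To make the idea of "focusing selectively on parts of the video frames" concrete, here is a minimal sketch of one soft attention step, assuming each frame is split into a grid of regions with one feature vector per region. The function names (`softmax`, `soft_attention`), the toy 2x2 grid, and the hand-set scores are illustrative assumptions, not taken from the paper. In the actual model the scores would come from the recurrent network's state rather than being fixed by hand.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_attention(region_features, scores):
    """Weight each region's feature vector by its attention weight and sum.

    region_features: list of equal-length feature vectors, one per region.
    scores: one raw (unnormalized) attention score per region.
    Returns (weights, attended), where attended is the expected feature vector.
    """
    if len(region_features) != len(scores):
        raise ValueError("need exactly one score per region")
    weights = softmax(scores)
    dim = len(region_features[0])
    attended = [
        sum(w * feat[d] for w, feat in zip(weights, region_features))
        for d in range(dim)
    ]
    return weights, attended


# Hypothetical toy frame: a 2x2 grid (4 regions) with 3-dimensional features.
regions = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 1.0],
]

# Equal scores: every region contributes equally.
weights, attended = soft_attention(regions, [0.0, 0.0, 0.0, 0.0])

# A much higher score on region 0: the result is dominated by that region.
focused_weights, focused = soft_attention(regions, [10.0, 0.0, 0.0, 0.0])
```

Because the weights are a smooth function of the scores, this kind of weighting is differentiable and can be trained end to end with standard backpropagation, which is what distinguishes soft attention from a hard, sampled choice of a single region.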
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Anomaly Detection Techniques and Applications
