Attention is All We Need: Nailing Down Object-centric Attention for Egocentric Activity Recognition
Swathikiran Sudhakaran, Oswald Lanz

TL;DR
This paper introduces a novel end-to-end deep learning model for egocentric activity recognition that uses object-centric spatial attention mechanisms, achieving state-of-the-art accuracy without strong supervision.
Contribution
The paper presents a weakly supervised, object-centric attention model that outperforms existing methods that rely on strong supervision such as hand segmentation.
Findings
Improves recognition accuracy by up to 6 percentage points over the previous best methods on standard egocentric activity benchmarks.
Effectively identifies relevant objects in video frames through learned attention maps.
Demonstrates strong performance in weakly supervised egocentric activity recognition.
Abstract
In this paper we propose an end-to-end trainable deep neural network model for egocentric activity recognition. Our model is built on the observation that egocentric activities are highly characterized by the objects and their locations in the video. Based on this, we develop a spatial attention mechanism that enables the network to attend to regions containing objects that are correlated with the activity under consideration. We learn highly specialized attention maps for each frame using class-specific activations from a CNN pre-trained for generic image recognition, and use them for spatio-temporal encoding of the video with a convolutional LSTM. Our model is trained in a weakly supervised setting using raw video-level activity-class labels. Nonetheless, on standard egocentric activity benchmarks our model surpasses by up to +6% points recognition accuracy the currently best…
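The attention step described in the abstract can be illustrated with a minimal NumPy sketch. It assumes that the "class-specific activations" are computed in the style of class activation mapping (CAM): the final convolutional feature maps of a pre-trained CNN are weighted by the classifier weights of the top-scoring class, and the resulting map is normalised with a spatial softmax. The function names, tensor shapes, and the softmax normalisation are assumptions made for illustration, not the authors' implementation, and the convolutional LSTM stage is omitted.

```python
import numpy as np


def class_activation_map(features, fc_weights, class_idx=None):
    """Compute a CAM-style map for one frame (hypothetical helper).

    features:   (C, H, W) feature maps from the last conv layer.
    fc_weights: (K, C) weights of a linear classifier applied after
                global average pooling, for K generic image classes.
    Returns the (H, W) activation map and the class index used.
    """
    pooled = features.mean(axis=(1, 2))        # global average pooling -> (C,)
    logits = fc_weights @ pooled               # class scores -> (K,)
    if class_idx is None:
        class_idx = int(np.argmax(logits))     # use the top-scoring class
    # Weight each channel by the chosen class's classifier weight and sum.
    cam = np.tensordot(fc_weights[class_idx], features, axes=([0], [0]))
    return cam, class_idx


def spatial_attention(features, fc_weights):
    """Turn the CAM into a spatial attention map and apply it to the features."""
    cam, class_idx = class_activation_map(features, fc_weights)
    flat = cam.reshape(-1)
    flat = flat - flat.max()                   # stabilise the softmax numerically
    weights = np.exp(flat)
    attn = (weights / weights.sum()).reshape(cam.shape)  # sums to 1 over H x W
    attended = features * attn[None, :, :]     # broadcast over channels
    return attended, attn, class_idx


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frame_features = rng.standard_normal((8, 7, 7))   # toy sizes: C=8, H=W=7
    classifier_weights = rng.standard_normal((5, 8))  # toy size: K=5 classes
    attended, attn, idx = spatial_attention(frame_features, classifier_weights)
    print(attended.shape, attn.shape, idx)
```

In a full pipeline, the attended features of each frame would be passed in sequence to a recurrent encoder (a convolutional LSTM in the paper), and only the video-level activity label would be used for training.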
Taxonomy
Topics: Human Pose and Action Recognition · Hand Gesture Recognition Systems · Advanced Neural Network Applications
Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory
