Learning to Recognize Actions on Objects in Egocentric Video with Attention Dictionaries
Swathikiran Sudhakaran, Sergio Escalera, Oswald Lanz

TL;DR
EgoACO is a deep neural network architecture that uses attention-based pooling and structured label decoding to improve action recognition in egocentric video, achieving state-of-the-art results on the EPIC-KITCHENS and EGTEA datasets.
Contribution
The paper introduces EgoACO, which combines class activation pooling (CAP), a self-attention-based pooling operation, with a recurrent module (LSTA) for temporal modeling, and explicitly decodes action, object, and context descriptors.
Findings
Achieves state-of-the-art performance on EPIC-KITCHENS and EGTEA datasets.
Effectively decodes action, object, and context descriptors from video features.
Provides built-in visual explanations for interpretability.
Abstract
We present EgoACO, a deep neural architecture for video action recognition that learns to pool action-context-object descriptors from frame level features by leveraging the verb-noun structure of action labels in egocentric video datasets. The core component of EgoACO is class activation pooling (CAP), a differentiable pooling operation that combines ideas from bilinear pooling for fine-grained recognition and from feature learning for discriminative localization. CAP uses self-attention with a dictionary of learnable weights to pool from the most relevant feature regions. Through CAP, EgoACO learns to decode object and scene context descriptors from video frame features. For temporal modeling in EgoACO, we design a recurrent version of class activation pooling termed Long Short-Term Attention (LSTA). LSTA extends convolutional gated LSTM with built-in spatial attention and a…
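The pooling idea described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `class_activation_pool`, the tensor shapes, the choice of the highest-scoring dictionary entry, and the spatial softmax are all assumptions made for illustration. The sketch scores every spatial location of a frame feature map against each entry of a weight dictionary, turns the winning entry's activation map into an attention map, and uses that map to pool a single descriptor.

```python
import numpy as np

def class_activation_pool(features, dictionary):
    """Attention pooling driven by a dictionary of weights (illustrative only).

    features:   array of shape (C, H, W), one frame's feature map.
    dictionary: array of shape (K, C), K weight vectors (assumed learnable).
    Returns (pooled descriptor of shape (C,), attention map (H, W), chosen index).
    """
    C, H, W = features.shape
    flat = features.reshape(C, H * W)        # (C, HW)

    # One activation map per dictionary entry: (K, HW).
    cams = dictionary @ flat

    # Assumption: pick the entry with the highest spatially averaged activation.
    scores = cams.mean(axis=1)               # (K,)
    k = int(np.argmax(scores))

    # Spatial softmax over the chosen map (shifted for numerical stability).
    logits = cams[k] - cams[k].max()
    attn = np.exp(logits)
    attn /= attn.sum()                       # non-negative, sums to 1

    # Attention-weighted sum of the features over all locations.
    pooled = flat @ attn                     # (C,)
    return pooled, attn.reshape(H, W), k

# Usage with random stand-in data (shapes are arbitrary).
rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 4, 5))
weights = rng.normal(size=(3, 8))
descriptor, attention, index = class_activation_pool(feats, weights)
print(descriptor.shape, attention.shape, index)
```

Because the attention weights form a convex combination over locations, each channel of the pooled descriptor lies between that channel's minimum and maximum over the frame, and the attention map itself can be inspected as a visual explanation of where the descriptor was pooled from.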
Taxonomy
Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory
