Learning to Recognize Actions on Objects in Egocentric Video with Attention Dictionaries

Swathikiran Sudhakaran; Sergio Escalera; Oswald Lanz

arXiv:2102.08065 · cs.CV · February 17, 2021

TL;DR

EgoACO is a deep neural network architecture that uses attention-based pooling and structured label decoding for action recognition in egocentric video, reporting state-of-the-art results on the EPIC-KITCHENS and EGTEA datasets.

Contribution

The paper introduces EgoACO, which combines class activation pooling (CAP), a self-attention pooling operation, with a recurrent module for temporal modeling, Long Short-Term Attention (LSTA), and explicitly decodes action, object, and context descriptors.

Findings

01  Achieves state-of-the-art performance on EPIC-KITCHENS and EGTEA datasets.

02  Effectively decodes action, object, and context descriptors from video features.

03  Provides built-in visual explanations for interpretability.

Abstract

We present EgoACO, a deep neural architecture for video action recognition that learns to pool action-context-object descriptors from frame level features by leveraging the verb-noun structure of action labels in egocentric video datasets. The core component of EgoACO is class activation pooling (CAP), a differentiable pooling operation that combines ideas from bilinear pooling for fine-grained recognition and from feature learning for discriminative localization. CAP uses self-attention with a dictionary of learnable weights to pool from the most relevant feature regions. Through CAP, EgoACO learns to decode object and scene context descriptors from video frame features. For temporal modeling in EgoACO, we design a recurrent version of class activation pooling termed Long Short-Term Attention (LSTA). LSTA extends convolutional gated LSTM with built-in spatial attention and a…
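
The pooling idea described in the abstract can be illustrated with a small sketch. This is only one plausible reading of "self-attention with a dictionary of learnable weights", not the authors' implementation: the function name `class_activation_pooling`, the use of a global-average descriptor to select a dictionary entry, and the softmax over spatial locations are all assumptions made for illustration, and the paper's exact formulation may differ.

```python
import math


def class_activation_pooling(features, dictionary):
    """Hypothetical sketch of attention pooling with a weight dictionary.

    features:   list of N spatial locations, each a list of C channel values
                (a flattened H*W feature map from one frame).
    dictionary: list of K weight vectors, each of length C (learnable in a
                real model; fixed here for illustration).
    Returns (pooled_descriptor, attention_map, selected_index).
    """
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    n = len(features)
    channels = len(features[0])

    # Assumption: score each dictionary entry against the globally
    # averaged feature and keep the best-matching entry.
    gap = [sum(x[c] for x in features) / n for c in range(channels)]
    scores = [dot(w, gap) for w in dictionary]
    k_star = max(range(len(dictionary)), key=lambda k: scores[k])

    # Class activation map: response of the selected weights at each location.
    cam = [dot(dictionary[k_star], x) for x in features]

    # Softmax over locations turns the map into spatial attention.
    peak = max(cam)
    exps = [math.exp(v - peak) for v in cam]
    total = sum(exps)
    attention = [e / total for e in exps]

    # Pool features from the most relevant regions via the attention weights.
    pooled = [
        sum(a * x[c] for a, x in zip(attention, features))
        for c in range(channels)
    ]
    return pooled, attention, k_star


# Toy input: 4 spatial locations, 2 channels, dictionary with 2 entries.
features = [[1.0, 0.0], [0.0, 1.0], [3.0, 0.0], [0.0, 0.5]]
dictionary = [[1.0, 0.0], [0.0, 1.0]]
pooled, attention, k_star = class_activation_pooling(features, dictionary)
```

In this toy input the first channel dominates the averaged feature, so the first dictionary entry is selected and the attention concentrates on the location with the strongest first-channel response. The same attention map is what would serve as a visual explanation of where the descriptor was pooled from.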

Taxonomy

Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory