Multi-Modal Temporal Convolutional Network for Anticipating Actions in Egocentric Videos

Olga Zatsarynna; Yazan Abu Farha; Juergen Gall

arXiv:2107.09504 · cs.CV · July 21, 2021

Open Access

TL;DR

This paper introduces a multi-modal temporal convolutional network for egocentric video action anticipation, emphasizing both high accuracy and fast inference to meet real-time application needs.

Contribution

It presents a novel multi-modal architecture based on temporal convolutions that avoids recurrent layers, enabling faster predictions while maintaining competitive accuracy.
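
To illustrate what a recurrence-free stack of temporal convolutions can look like, here is a minimal pure-Python sketch. It is not the authors' implementation: the kernel size of 2, the stride of 2, the ReLU activation, and the names `temporal_conv`, `conv_stack`, and `averaging_layer` are all assumptions chosen for illustration. Each output frame depends only on a fixed window of input frames, so no hidden state has to be carried from one time step to the next as in a recurrent layer.

```python
from typing import List, Tuple

Frames = List[List[float]]            # indexed as [time][channel]
Kernel = List[List[List[float]]]      # indexed as [tap][channel_in][channel_out]


def temporal_conv(x: Frames, weights: Kernel, bias: List[float], stride: int = 1) -> Frames:
    """Valid 1D convolution over the time axis, followed by ReLU (assumed activation)."""
    taps = len(weights)
    out: Frames = []
    for t in range(0, len(x) - taps + 1, stride):
        frame = list(bias)
        for k in range(taps):
            for c_in, value in enumerate(x[t + k]):
                for c_out in range(len(bias)):
                    frame[c_out] += value * weights[k][c_in][c_out]
        out.append([max(0.0, v) for v in frame])
    return out


def conv_stack(x: Frames, layers: List[Tuple[Kernel, List[float]]]) -> Frames:
    """Apply the layers in order; stride 2 halves the temporal length each time (assumption)."""
    for weights, bias in layers:
        x = temporal_conv(x, weights, bias, stride=2)
    return x


def averaging_layer(channels: int) -> Tuple[Kernel, List[float]]:
    """Hypothetical fixed layer that averages two neighbouring frames per channel."""
    tap = [[0.5 if i == j else 0.0 for j in range(channels)] for i in range(channels)]
    return [tap, tap], [0.0] * channels


# Eight observed frames with two feature channels each.
features = [[float(t), 1.0] for t in range(8)]
summary = conv_stack(features, [averaging_layer(2)] * 3)
print(summary)  # [[3.5, 1.0]]
```

Three stride-2 layers reduce the eight input frames to a single summary frame (8, then 4, then 2, then 1). In a trained network the weights would be learned rather than fixed averages.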

Findings

01

Achieves accuracy comparable to state-of-the-art methods.

02

Runs inference significantly faster by avoiding recurrent layers.

03

Fuses modalities effectively by capturing their pairwise interactions.
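
As a rough illustration of what "pairwise interactions" between modalities can mean, the sketch below builds one interaction vector for every unordered pair of modalities and then averages them. This is an assumption-heavy toy, not the fusion described in the paper: the elementwise product as the interaction, the averaging step, the function names, and the modality names `rgb`, `flow`, and `obj` are all hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def pairwise_fusion(features: Dict[str, List[float]]) -> Dict[Tuple[str, str], List[float]]:
    """Return one interaction vector per unordered modality pair.

    The interaction is an elementwise product, chosen here only for illustration.
    """
    fused = {}
    for a, b in combinations(sorted(features), 2):
        fused[(a, b)] = [u * v for u, v in zip(features[a], features[b])]
    return fused


def average_pairs(fused: Dict[Tuple[str, str], List[float]]) -> List[float]:
    """Average the pairwise vectors into a single fused representation."""
    vectors = list(fused.values())
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(len(vectors[0]))]


# Hypothetical per-modality feature vectors of equal length.
features = {"rgb": [1.0, 2.0], "flow": [3.0, 0.0], "obj": [2.0, 2.0]}
fused = pairwise_fusion(features)
print(sorted(fused))         # [('flow', 'obj'), ('flow', 'rgb'), ('obj', 'rgb')]
print(fused[("obj", "rgb")])  # [2.0, 4.0]
```

With three modalities there are three pairs, so the cost of considering every pair grows quadratically with the number of modalities.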

Abstract

Anticipating human actions is an important task that needs to be addressed for the development of reliable intelligent agents, such as self-driving cars or robot assistants. While the ability to make future predictions with high accuracy is crucial for designing anticipation approaches, the speed at which inference is performed is no less important. Methods that are accurate but not sufficiently fast introduce high latency into the decision process, which increases the reaction time of the underlying system. This poses a problem for domains such as autonomous driving, where the reaction time is crucial. In this work, we propose a simple and effective multi-modal architecture based on temporal convolutions. Our approach stacks a hierarchy of temporal convolutional layers and does not rely on recurrent layers, which ensures fast prediction. We further introduce a…

Taxonomy

Topics: Human Pose and Action Recognition · Video Surveillance and Tracking Methods · Advanced Vision and Imaging