Interaction Region Visual Transformer for Egocentric Action Anticipation

Debaditya Roy; Ramanathan Rajendiran; Basura Fernando

arXiv:2211.14154·cs.CV·January 12, 2024

TL;DR

This paper introduces InAViT, a transformer-based model that represents human-object interactions through changes in the appearance of hands and objects, achieving state-of-the-art egocentric action anticipation results on EK100 and EGTEA Gaze+.

Contribution

The paper presents a new transformer variant with Spatial and Trajectory Cross-Attention for modeling interactions, improving egocentric action anticipation performance.

Findings

01

Achieves state-of-the-art results on EK100 and EGTEA Gaze+ datasets.

02

Outperforms other transformer-based methods in action anticipation.

03

Top-ranked on the EK100 leaderboard, surpassing the second-best method by 3.3%.

Abstract

Human-object interaction is one of the most important visual cues, and we propose a novel way to represent human-object interactions for egocentric action anticipation. We propose a novel transformer variant that models interactions by computing the change in the appearance of objects and human hands caused by the execution of actions, and uses those changes to refine the video representation. Specifically, we model interactions between hands and objects using Spatial Cross-Attention (SCA) and further infuse contextual information using Trajectory Cross-Attention to obtain environment-refined interaction tokens. Using these tokens, we construct an interaction-centric video representation for action anticipation. We term our model InAViT; it achieves state-of-the-art action anticipation performance on the large-scale egocentric datasets EPIC-KITCHENS-100 (EK100) and EGTEA Gaze+. InAViT…
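
As a rough illustration of the cross-attention idea the abstract describes, the sketch below lets hand tokens attend over object tokens with generic scaled dot-product attention. It is not the authors' implementation: the function name `cross_attention`, the token counts, the embedding size, the random projection matrices, and the residual update are all assumptions made for illustration, and the actual SCA and Trajectory Cross-Attention modules in InAViT may differ in every detail.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context, w_q, w_k, w_v):
    # Generic scaled dot-product cross-attention (hypothetical helper):
    # queries come from one token set, keys and values from another.
    q = queries @ w_q
    k = context @ w_k
    v = context @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

rng = np.random.default_rng(0)
dim = 8                                     # assumed embedding size
hand_tokens = rng.normal(size=(2, dim))     # assumed: one token per hand
object_tokens = rng.normal(size=(4, dim))   # assumed: four object regions

# Random projections stand in for learned weights.
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) / np.sqrt(dim) for _ in range(3))

attended, weights = cross_attention(hand_tokens, object_tokens, w_q, w_k, w_v)

# Assumed residual update: each hand token is refined by the object
# features it attends to, giving a stand-in for an "interaction token".
interaction_tokens = hand_tokens + attended
```

Each row of `weights` sums to one, so every refined hand token is its original embedding plus a convex combination of projected object features. The same helper could be reused with a different context set (for example, tokens from the surrounding scene) to mimic the step where contextual information is infused.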

Code & Models

Repositories

lahaproject/inavit
pytorch

Videos

Interaction Region Visual Transformer for Egocentric Action Anticipation · youtube

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Advanced Vision and Imaging

Methods: Multi-Head Attention · Attention Is All You Need · Layer Normalization · Adam · Softmax · Dropout · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Label Smoothing · Absolute Position Encodings