Interaction Region Visual Transformer for Egocentric Action Anticipation
Debaditya Roy, Ramanathan Rajendiran, Basura Fernando

TL;DR
This paper introduces InAViT, a transformer-based model for egocentric action anticipation that represents human-object interactions through changes in the appearance of hands and objects, achieving state-of-the-art results on EK100 and EGTEA Gaze+.
Contribution
The paper presents a new transformer variant with Spatial and Trajectory Cross-Attention for modeling interactions, improving egocentric action anticipation performance.
Findings
Achieves state-of-the-art results on EK100 and EGTEA Gaze+ datasets.
Outperforms other transformer-based methods in action anticipation.
Top-ranked on EK100 leaderboard, surpassing second-best by 3.3%.
Abstract
Human-object interaction is one of the most important visual cues, and we propose a novel way to represent human-object interactions for egocentric action anticipation. We propose a novel transformer variant that models interactions by computing the change in the appearance of objects and human hands due to the execution of actions, and uses those changes to refine the video representation. Specifically, we model interactions between hands and objects using Spatial Cross-Attention (SCA) and further infuse contextual information using Trajectory Cross-Attention to obtain environment-refined interaction tokens. Using these tokens, we construct an interaction-centric video representation for action anticipation. We term our model InAViT, which achieves state-of-the-art action anticipation performance on the large-scale egocentric datasets EPIC-KITCHENS-100 (EK100) and EGTEA Gaze+. InAViT…
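As a rough illustration of the cross-attention idea in the abstract, the sketch below lets hand tokens attend over object tokens with single-head scaled dot-product attention. It is not the authors' implementation: the token counts, the feature size `dim`, the random projection matrices, and the residual update that forms `interaction_tokens` are all assumptions chosen for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, context, w_q, w_k, w_v):
    """Single-head scaled dot-product cross-attention.

    queries: (n_q, dim) tokens that ask the question (here: hand tokens).
    context: (n_c, dim) tokens that are attended over (here: object tokens).
    """
    q = queries @ w_q
    k = context @ w_k
    v = context @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])  # (n_q, n_c) similarity scores
    weights = softmax(scores, axis=-1)       # each query row sums to 1
    return weights @ v, weights


# Hypothetical sizes: 2 hand tokens, 4 object tokens, 16-dim features.
rng = np.random.default_rng(0)
dim = 16
hand_tokens = rng.normal(size=(2, dim))
object_tokens = rng.normal(size=(4, dim))

# Random projections stand in for learned weights (assumption).
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) / np.sqrt(dim) for _ in range(3))

attended, weights = cross_attention(hand_tokens, object_tokens, w_q, w_k, w_v)

# Assumed residual update: refine each hand token with what it attended to.
interaction_tokens = hand_tokens + attended
```

Under the same assumptions, the function could be reused with context tokens passed as `context` to mimic the step of infusing contextual information, although the abstract does not specify how Trajectory Cross-Attention is computed.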
Videos
Interaction Region Visual Transformer for Egocentric Action Anticipation · YouTube
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Advanced Vision and Imaging
Methods: Multi-Head Attention · Attention Is All You Need · Layer Normalization · Adam · Softmax · Dropout · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Label Smoothing · Absolute Position Encodings
