AFF-ttention! Affordances and Attention models for Short-Term Object Interaction Anticipation
Lorenzo Mur-Labadia, Ruben Martinez-Cantin, Josechu Guerrero, Giovanni Maria Farinella, Antonino Furnari

TL;DR
This paper introduces STAformer, an attention-based model with affordance grounding modules, significantly improving short-term object interaction anticipation in egocentric videos for better human-robot interaction.
Contribution
The paper presents a novel attention-based architecture and two modules for modeling affordances, enhancing the accuracy of short-term interaction predictions from image-video pairs.
Findings
Up to +45% relative improvement in Overall Top-5 mAP on Ego4D
Up to +42% relative improvement on a curated set of EPIC-Kitchens STA labels
Effective grounding of predictions using affordance modeling
Abstract
Short-Term object-interaction Anticipation (STA) consists of detecting the location of the next-active objects, the noun and verb categories of the interaction, and the time to contact from the observation of egocentric video. This ability is fundamental for wearable assistants and human-robot interaction to understand the user's goals, but there is still room for improvement in performing STA precisely and reliably. In this work, we improve the performance of STA predictions with two contributions: 1. We propose STAformer, a novel attention-based architecture integrating frame-guided temporal pooling, dual image-video attention, and multiscale feature fusion to support STA predictions from an image-input video pair. 2. We introduce two novel modules to ground STA predictions on human behavior by modeling affordances. First, we integrate an environment affordance model which acts as a…
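To make the task definition above concrete, here is a minimal Python sketch of what a single STA prediction might contain: a box for the next-active object, a noun, a verb, and a time to contact. The `STAPrediction` class, its field names, the `score` field, and the `top_k` helper are all hypothetical names chosen for illustration; they are not taken from the paper or its code, and the ranking shown is not the Top-5 mAP evaluation protocol.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class STAPrediction:
    """Hypothetical container for one short-term anticipation prediction."""
    box: Tuple[float, float, float, float]  # (x1, y1, x2, y2) of the next-active object
    noun: str                # object category, e.g. "cup"
    verb: str                # interaction category, e.g. "take"
    time_to_contact: float   # seconds until the interaction starts
    score: float             # model confidence (assumed field)


def top_k(predictions: List[STAPrediction], k: int = 5) -> List[STAPrediction]:
    """Return the k most confident predictions, highest score first."""
    return sorted(predictions, key=lambda p: p.score, reverse=True)[:k]


if __name__ == "__main__":
    # Made-up values, for illustration only.
    candidates = [
        STAPrediction((10.0, 20.0, 110.0, 140.0), "cup", "take", 0.8, 0.62),
        STAPrediction((200.0, 40.0, 320.0, 180.0), "knife", "cut", 1.5, 0.31),
        STAPrediction((15.0, 25.0, 105.0, 135.0), "cup", "move", 0.9, 0.47),
    ]
    for p in top_k(candidates, k=2):
        print(p.verb, p.noun, p.time_to_contact, p.score)
```

Keeping the four task outputs in one record mirrors the definition in the abstract, where a prediction is only useful if location, noun, verb, and time to contact are all available together.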
Taxonomy
Topics: Social Robot Interaction and HRI · Reinforcement Learning in Robotics · Human Pose and Action Recognition
Methods: Sparse Evolutionary Training
