Hand-Object Interaction Reasoning

Jian Ma; Dima Damen

arXiv:2201.04906·cs.CV·January 14, 2022

TL;DR

This paper introduces an interaction reasoning network that leverages Transformer modules and positionally-encoded trajectories to improve understanding of hand-object interactions in egocentric videos, enhancing action recognition accuracy.

Contribution

It presents a novel interaction unit with Transformer-based reasoning for modeling spatio-temporal hand-object relationships in videos.

Findings

01. Improved action recognition on EPIC-KITCHENS and Something-Else datasets.

02. Modeling two-handed interactions is crucial for accurate recognition.

03. Positionally-encoded trajectories enhance interaction understanding.

Abstract

This paper proposes an interaction reasoning network for modelling spatio-temporal relationships between hands and objects in video. The proposed interaction unit utilises a Transformer module to reason about each acting hand and its spatio-temporal relation to the other hand, as well as to the objects being interacted with. We show that modelling two-handed interactions is critical for action recognition in egocentric video, and demonstrate that by using positionally-encoded trajectories, the network can better recognise observed interactions. We evaluate our proposal on the EPIC-KITCHENS and Something-Else datasets, with an ablation study.
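To make the idea of reasoning over positionally-encoded trajectories more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not specify the encoding or the layer design. The sketch assumes sinusoidal encodings of bounding-box coordinates and a single attention head with no learned projections. All function names and box values are hypothetical.

```python
import math

def sinusoidal_encoding(value, dim=8, max_period=10000.0):
    """Encode one scalar (here, a pixel coordinate) as sin/cos features.
    Assumption: the sinusoidal scheme of the original Transformer."""
    features = []
    for i in range(dim // 2):
        freq = 1.0 / (max_period ** (2 * i / dim))
        features.append(math.sin(value * freq))
        features.append(math.cos(value * freq))
    return features

def encode_trajectory(boxes, dim_per_coord=8):
    """Concatenate the encodings of every (x1, y1, x2, y2) box over all frames."""
    token = []
    for box in boxes:
        for coord in box:
            token.extend(sinusoidal_encoding(coord, dim_per_coord))
    return token

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Single-head scaled dot-product attention without learned projections."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return output, weights

# Toy trajectories: 3 frames of (x1, y1, x2, y2) pixel boxes; values are made up.
left_hand = [(100, 200, 150, 260), (110, 205, 160, 265), (120, 210, 170, 270)]
right_hand = [(300, 220, 350, 280), (290, 215, 340, 275), (280, 210, 330, 270)]
held_object = [(130, 190, 200, 250), (140, 195, 210, 255), (150, 200, 220, 260)]

tokens = [encode_trajectory(t) for t in (left_hand, right_hand, held_object)]

# The acting hand (left) queries the other hand and the object.
context, weights = attend(tokens[0], tokens[1:], tokens[1:])
```

Here `weights` holds one attention weight for the other hand and one for the object, and `context` is their weighted mixture. A trained model would add learned query, key, and value projections, multiple heads, and visual features alongside the trajectory tokens.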

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Hand Gesture Recognition Systems

Methods: Attention Is All You Need · Linear Layer · Dense Connections · Position-Wise Feed-Forward Layer · Multi-Head Attention · Softmax · Absolute Position Encodings · Byte Pair Encoding · Residual Connection · Layer Normalization