Hand-Object Interaction Reasoning
Jian Ma, Dima Damen

TL;DR
This paper introduces an interaction reasoning network that leverages Transformer modules and positionally-encoded trajectories to improve understanding of hand-object interactions in egocentric videos, enhancing action recognition accuracy.
Contribution
It presents a novel interaction unit with Transformer-based reasoning for modeling spatio-temporal hand-object relationships in videos.
Findings
Improved action recognition on EPIC-KITCHENS and Something-Else datasets.
Modeling two-handed interactions is crucial for accurate recognition.
Positionally-encoded trajectories enhance interaction understanding.
Abstract
This paper proposes an interaction reasoning network for modelling spatio-temporal relationships between hands and objects in video. The proposed interaction unit utilises a Transformer module to reason about each acting hand, and its spatio-temporal relation to the other hand as well as objects being interacted with. We show that modelling two-handed interactions is critical for action recognition in egocentric video, and demonstrate that by using positionally-encoded trajectories, the network can better recognise observed interactions. We evaluate our proposal on EPIC-KITCHENS and Something-Else datasets, with an ablation study.
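The abstract does not give implementation details, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes each hand or object is tracked as a sequence of normalised (x, y, w, h) boxes, adds a standard sinusoidal time encoding to each box, and lets the acting hand attend to the other hand and the object with single-head scaled dot-product attention. The names `encode_trajectory` and `attend`, the box format, and the toy coordinates are all hypothetical.

```python
import math

def sinusoidal_encoding(t, dim):
    """Standard sinusoidal position encoding for time index t (dim must be even)."""
    enc = []
    for i in range(0, dim, 2):
        freq = 1.0 / (10000 ** (i / dim))
        enc.extend([math.sin(t * freq), math.cos(t * freq)])
    return enc

def encode_trajectory(boxes):
    """Hypothetical helper: add a time encoding to each frame's box, then flatten."""
    feats = []
    for t, box in enumerate(boxes):
        pe = sinusoidal_encoding(t, len(box))
        feats.extend(b + p for b, p in zip(box, pe))
    return feats

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query token."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return out, weights

# Toy trajectories (assumed data): 3 frames of normalised (x, y, w, h) boxes.
left_hand = [(0.2, 0.5, 0.1, 0.1), (0.25, 0.5, 0.1, 0.1), (0.3, 0.5, 0.1, 0.1)]
right_hand = [(0.7, 0.5, 0.1, 0.1), (0.65, 0.5, 0.1, 0.1), (0.6, 0.5, 0.1, 0.1)]
obj = [(0.45, 0.55, 0.2, 0.2)] * 3

tokens = [encode_trajectory(t) for t in (left_hand, right_hand, obj)]

# The acting hand (left) queries the other hand and the object.
context, weights = attend(tokens[0], tokens[1:], tokens[1:])
```

Here `weights` holds two attention weights (other hand, object) that sum to one, and `context` is the resulting mixture of their trajectory tokens. A full Transformer module would add learned projections, multiple heads, residual connections, and layer normalisation, which are omitted to keep the sketch short.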
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Hand Gesture Recognition Systems
Methods: Attention Is All You Need · Linear Layer · Dense Connections · Position-Wise Feed-Forward Layer · Multi-Head Attention · Softmax · Absolute Position Encodings · Byte Pair Encoding · Residual Connection · Layer Normalization
