Diff-IP2D: Diffusion-Based Hand-Object Interaction Prediction on Egocentric Videos
Junyi Ma, Jingyi Xu, Xieyuanli Chen, Hesheng Wang

TL;DR
This paper introduces Diff-IP2D, a diffusion-based model that jointly predicts future hand trajectories and object affordances in egocentric videos. It targets two weaknesses of prior work: autoregressive prediction, which accumulates errors along the time axis, and the neglect of camera egomotion in first-person views.
Contribution
The novel diffusion-based approach enables non-autoregressive, joint prediction of hand-object interactions, incorporating camera motion awareness for enhanced performance.
Findings
Outperforms state-of-the-art baselines on multiple metrics
Effectively incorporates camera egomotion into predictions
Demonstrates significant accuracy improvements in experiments
Abstract
Understanding how humans would behave during hand-object interaction is vital for applications in service robot manipulation and extended reality. To achieve this, some recent works have been proposed to simultaneously forecast hand trajectories and object affordances on human egocentric videos. The joint prediction serves as a comprehensive representation of future hand-object interactions in 2D space, indicating potential human motion and motivation. However, the existing approaches mostly adopt the autoregressive paradigm for unidirectional prediction, which lacks mutual constraints within the holistic future sequence, and accumulates errors along the time axis. Meanwhile, these works basically overlook the effect of camera egomotion on first-person view predictions. To address these limitations, we propose a novel diffusion-based interaction prediction method, namely Diff-IP2D, to…
Taxonomy
Topics: Human Pose and Action Recognition · Hand Gesture Recognition Systems · Video Analysis and Summarization
Methods: Diffusion · Attentive Walk-Aggregating Graph Neural Network
