Diff-IP2D: Diffusion-Based Hand-Object Interaction Prediction on Egocentric Videos

Junyi Ma; Jingyi Xu; Xieyuanli Chen; Hesheng Wang

arXiv:2405.04370 · cs.CV · November 17, 2025 · 2 citations

Open Access · 1 Repo

TL;DR

This paper introduces Diff-IP2D, a diffusion-based model for predicting future hand trajectories and object interactions in egocentric videos, addressing autoregressive limitations and camera motion effects for improved accuracy.

Contribution

The novel diffusion-based approach enables non-autoregressive, joint prediction of hand-object interactions, incorporating camera motion awareness for enhanced performance.

Findings

01 Outperforms state-of-the-art baselines on multiple metrics

02 Effectively incorporates camera egomotion into predictions

03 Demonstrates significant accuracy improvements in experiments

Abstract

Understanding how humans would behave during hand-object interaction is vital for applications in service robot manipulation and extended reality. To achieve this, some recent works have been proposed to simultaneously forecast hand trajectories and object affordances on human egocentric videos. The joint prediction serves as a comprehensive representation of future hand-object interactions in 2D space, indicating potential human motion and motivation. However, the existing approaches mostly adopt the autoregressive paradigm for unidirectional prediction, which lacks mutual constraints within the holistic future sequence, and accumulates errors along the time axis. Meanwhile, these works basically overlook the effect of camera egomotion on first-person view predictions. To address these limitations, we propose a novel diffusion-based interaction prediction method, namely Diff-IP2D, to…
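The abstract's point about autoregressive prediction accumulating errors along the time axis can be illustrated with a toy rollout. This is only a minimal sketch under stated assumptions, not the paper's model: `rollout_autoregressive`, `true_step`, `biased_step`, and `displacement_errors` are hypothetical names, the "hand" moves at constant velocity in 2D pixel coordinates, and the predictor is assumed to make a small constant error at every step.

```python
import math


def rollout_autoregressive(start, step_fn, horizon):
    """Predict `horizon` future 2D waypoints, feeding each output back as the next input."""
    trajectory = []
    current = start
    for _ in range(horizon):
        current = step_fn(current)
        trajectory.append(current)
    return trajectory


def true_step(point):
    """Ground-truth motion (assumed): the hand moves 1 pixel to the right per step."""
    x, y = point
    return (x + 1.0, y)


def biased_step(point):
    """Hypothetical one-step predictor that is off by 0.5 pixels in y at every step."""
    x, y = point
    return (x + 1.0, y + 0.5)


def displacement_errors(predicted, truth):
    """Per-step Euclidean distance between predicted and true waypoints."""
    return [math.hypot(px - tx, py - ty)
            for (px, py), (tx, ty) in zip(predicted, truth)]


start = (0.0, 0.0)
truth = rollout_autoregressive(start, true_step, horizon=4)
predicted = rollout_autoregressive(start, biased_step, horizon=4)
errors = displacement_errors(predicted, truth)
print(errors)  # the error grows with each step instead of staying at 0.5
```

Because each step consumes the previous prediction, the constant 0.5-pixel one-step error compounds into a drift that grows with the horizon. Predicting the whole future sequence jointly, as the abstract describes, is motivated by avoiding this feedback loop.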

Code & Models

Repositories

irmvlab/diff-ip2d (PyTorch, official)

Taxonomy

Topics: Human Pose and Action Recognition · Hand Gesture Recognition Systems · Video Analysis and Summarization

Methods: Diffusion · Attentive Walk-Aggregating Graph Neural Network