MADiff: Motion-Aware Mamba Diffusion Models for Hand Trajectory Prediction on Egocentric Videos
Junyi Ma, Xieyuanli Chen, Wentao Bao, Jingyi Xu, Hesheng Wang

TL;DR
MADiff is a novel diffusion-based model that predicts hand trajectories in egocentric videos by integrating egomotion and high-level semantics, improving accuracy and real-time performance for applications in robotics and extended reality.
Contribution
The paper introduces MADiff, a motion-aware diffusion model that incorporates egomotion and semantic features for improved hand trajectory prediction without explicit affordance labels.
Findings
Achieves accuracy comparable to state-of-the-art methods.
Operates in real-time for practical applications.
Demonstrates effectiveness across five public datasets.
Abstract
Understanding human intentions and actions through egocentric videos is important on the path to embodied artificial intelligence. As a branch of egocentric vision techniques, hand trajectory prediction plays a vital role in comprehending human motion patterns, benefiting downstream tasks in extended reality and robot manipulation. However, capturing high-level human intentions consistent with reasonable temporal causality is challenging when only egocentric videos are available. This difficulty is exacerbated under camera egomotion interference and the absence of affordance labels to explicitly guide the optimization of hand waypoint distribution. In this work, we propose a novel hand trajectory prediction method dubbed MADiff, which forecasts future hand waypoints with diffusion models. The devised denoising operation in the latent space is achieved by our proposed motion-aware Mamba,…
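As a rough illustration of what "forecasting hand waypoints with diffusion models" involves, the sketch below applies the standard DDPM forward noising step to a short sequence of 2D waypoints and then inverts it using the known noise. It is not MADiff's implementation: it works directly on coordinates rather than in a latent space, it contains no Mamba denoiser or egomotion input, and the linear beta schedule, the function names, and the waypoint values are all assumptions made for the example. In a trained model, a learned network would supply the noise estimate that is passed in directly here.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances, one per diffusion step (assumed schedule)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t: running product of (1 - beta_s) up to and including step t."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(waypoints, alpha_bar_t, rng):
    """Forward process: x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps."""
    scale = math.sqrt(alpha_bar_t)
    sigma = math.sqrt(1.0 - alpha_bar_t)
    noise = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in waypoints]
    noisy = [(scale * x + sigma * ex, scale * y + sigma * ey)
             for (x, y), (ex, ey) in zip(waypoints, noise)]
    return noisy, noise


def recover_clean(noisy, noise_estimate, alpha_bar_t):
    """Invert the forward step given a noise estimate (a denoiser's job in practice)."""
    scale = math.sqrt(alpha_bar_t)
    sigma = math.sqrt(1.0 - alpha_bar_t)
    return [((x - sigma * ex) / scale, (y - sigma * ey) / scale)
            for (x, y), (ex, ey) in zip(noisy, noise_estimate)]


rng = random.Random(0)
alpha_bars = cumulative_alphas(linear_beta_schedule(1000))

# Hypothetical future hand waypoints in normalized image coordinates.
future_hand_waypoints = [(0.42, 0.61), (0.45, 0.58), (0.49, 0.55), (0.52, 0.53)]

t = 500
noisy, noise = add_noise(future_hand_waypoints, alpha_bars[t], rng)
# Using the true noise stands in for a perfect denoiser.
restored = recover_clean(noisy, noise, alpha_bars[t])
```

With the true noise supplied, `restored` matches the original waypoints up to floating-point error. Prediction quality in a real system depends on how well the denoiser estimates that noise from the observed video context.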
Taxonomy
Topics: Human Pose and Action Recognition · Hand Gesture Recognition Systems · Stroke Rehabilitation and Recovery
Methods: Mamba: Linear-Time Sequence Modeling with Selective State Spaces · Diffusion
