iDiT-HOI: Inpainting-based Hand Object Interaction Reenactment via Video Diffusion Transformer

Zhelun Shen; Chenming Wu; Junsheng Zhou; Chen Zhao; Kaisiyuan Wang; Hang Zhou; Yingying Li; Haocheng Feng; Wei He; Jingdong Wang

arXiv:2506.12847·cs.GR·June 17, 2025

TL;DR

This paper introduces iDiT-HOI, a novel inpainting-based video diffusion transformer framework for realistic hand-object interaction reenactment, capable of generalizing to unseen objects and supporting long video generation.

Contribution

The paper proposes a unified inpainting-based token process (Inp-TPU) and a two-stage video diffusion transformer that together improve realism and generalization in HOI reenactment without adding extra parameters.

Findings

01. Outperforms existing methods in real-world scenes
02. Enables strong generalization to unseen objects and scenarios
03. Supports long-duration video generation

Abstract

Digital human video generation is gaining traction in fields like education and e-commerce, driven by advancements in head-body animation and lip-syncing technologies. However, realistic Hand-Object Interaction (HOI), the complex dynamics between human hands and objects, continues to pose challenges. Generating natural and believable HOI reenactments is difficult because of occlusion between hands and objects, variations in object shapes and orientations, the need for precise physical interactions, and, importantly, the need to generalize to unseen humans and objects. This paper presents iDiT-HOI, a novel framework that enables in-the-wild HOI reenactment generation. Specifically, we propose a unified inpainting-based token process method, called Inp-TPU, with a two-stage video diffusion transformer (DiT) model. The first stage generates a key frame by inserting…

Taxonomy

Topics: Human Pose and Action Recognition · Video Analysis and Summarization · Human Motion and Animation

Methods: Diffusion