Open-world Hand-Object Interaction Video Generation Based on Structure and Contact-aware Representation

Haodong Yan; Hang Yu; Zhide Zhong; Weilin Yuan; Xin Gong; Zehang Luo; Chengxi Heyu; Junfeng Li; Wenxuan Song; Shunbo Zhou; Haoang Li

arXiv:2512.01677·cs.CV·December 2, 2025

TL;DR

This paper introduces a novel scalable representation for realistic hand-object interaction video generation that captures contact and occlusion without requiring 3D annotations, enabling better generalization and interaction fidelity.

Contribution

It proposes a structure and contact-aware representation combined with a joint-generation paradigm to improve realism and scalability in HOI video synthesis.

Findings

01. Outperforms state-of-the-art on real-world datasets

02. Generates physics-realistic and temporally coherent videos

03. Shows strong generalization to open-world scenarios

Abstract

Generating realistic hand-object interaction (HOI) videos is a significant challenge due to the difficulty of modeling physical constraints (e.g., contact and occlusion between hands and manipulated objects). Current methods use an HOI representation as an auxiliary generative objective to guide video synthesis. However, they face a dilemma: neither 2D nor 3D representations can simultaneously guarantee scalability and interaction fidelity. To address this limitation, we propose a structure and contact-aware representation that captures hand-object contact, hand-object occlusion, and holistic structure context without 3D annotations. This interaction-oriented and scalable supervision signal enables the model to learn fine-grained interaction physics and generalize to open-world scenarios. To fully exploit the proposed representation, we introduce a joint-generation paradigm with…

Taxonomy

Topics: Human Motion and Animation · Generative Adversarial Networks and Image Synthesis · 3D Shape Modeling and Analysis