GenHOI: Generalizing Text-driven 4D Human-Object Interaction Synthesis for Unseen Objects

Shujia Li; Haiyu Zhang; Xinyuan Chen; Yaohui Wang; Yutong Ban

arXiv:2506.15483·cs.CV·June 19, 2025

TL;DR

GenHOI introduces a two-stage framework combining object reconstruction and contact-aware diffusion to synthesize high-fidelity 4D human-object interactions, generalizing to unseen objects with state-of-the-art results.

Contribution

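The paper presents a two-stage approach: Object-AnchorNet reconstructs sparse 3D HOI keyframes for unseen objects, and ContactDM interpolates those keyframes into 4D HOI sequences. Because the first stage learns solely from 3D HOI datasets, the method reduces the dependence on large-scale 4D HOI datasets.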

Findings

1. Achieves state-of-the-art results on OMOMO and 3D-FUTURE datasets.
2. Demonstrates strong generalization to unseen objects.
3. Produces high-fidelity, temporally coherent 4D HOI sequences.

Abstract

While diffusion models and large-scale motion datasets have advanced text-driven human motion synthesis, extending these advances to 4D human-object interaction (HOI) remains challenging, mainly due to the limited availability of large-scale 4D HOI datasets. In our study, we introduce GenHOI, a novel two-stage framework aimed at achieving two key objectives: 1) generalization to unseen objects and 2) the synthesis of high-fidelity 4D HOI sequences. In the first stage of our framework, we employ an Object-AnchorNet to reconstruct sparse 3D HOI keyframes for unseen objects, learning solely from 3D HOI datasets and thereby mitigating the dependence on large-scale 4D HOI datasets. In the second stage, we introduce a Contact-Aware Diffusion Model (ContactDM) to seamlessly interpolate the sparse 3D HOI keyframes into dense, temporally coherent 4D HOI sequences. To enhance the quality…
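
The abstract describes the second stage as turning sparse keyframes into a dense sequence. As a rough illustration of that input/output contract only, the sketch below densifies keyframes with plain linear interpolation. It is a naive stand-in, not ContactDM: the paper uses a contact-aware diffusion model, and the names here (`Pose`, `densify_keyframes`, the flat pose vector layout) are hypothetical and do not come from the paper.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: one flat vector per frame holding whatever
# human and object parameters a real system would use.
Pose = List[float]


def densify_keyframes(
    keyframes: Dict[int, Pose], num_frames: int
) -> Tuple[List[Pose], List[bool]]:
    """Fill a dense sequence from sparse keyframes by linear interpolation.

    Returns the dense sequence and a mask that is True at keyframe indices.
    This is a naive baseline standing in for a learned interpolator.
    """
    if not keyframes:
        raise ValueError("at least one keyframe is required")
    times = sorted(keyframes)
    if times[0] < 0 or times[-1] >= num_frames:
        raise ValueError("keyframe index out of range")

    dense: List[Pose] = []
    mask: List[bool] = []
    for t in range(num_frames):
        mask.append(t in keyframes)
        if t <= times[0]:
            # Before the first keyframe: hold the first pose.
            dense.append(list(keyframes[times[0]]))
        elif t >= times[-1]:
            # After the last keyframe: hold the last pose.
            dense.append(list(keyframes[times[-1]]))
        else:
            # Find the keyframes that bracket frame t.
            left = max(k for k in times if k <= t)
            right = min(k for k in times if k >= t)
            if left == right:
                dense.append(list(keyframes[left]))
            else:
                w = (t - left) / (right - left)
                dense.append(
                    [(1 - w) * a + w * b
                     for a, b in zip(keyframes[left], keyframes[right])]
                )
    return dense, mask


if __name__ == "__main__":
    sequence, is_keyframe = densify_keyframes(
        {0: [0.0, 0.0], 4: [4.0, 8.0]}, num_frames=5
    )
    for frame, flag in zip(sequence, is_keyframe):
        print(frame, "keyframe" if flag else "interpolated")
```

The returned mask marks which frames are given and which are filled in, which is the kind of conditioning signal a learned interpolator would typically receive. Linear blending ignores contact entirely, which is one reason a contact-aware model is of interest for this task.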

Taxonomy

Topics: Speech and dialogue systems · Natural Language Processing Techniques · Human Motion and Animation