Multimodal Priors-Augmented Text-Driven 3D Human-Object Interaction Generation
Yin Wang, Ziyao Zhang, Zhiying Leng, Haitian Liu, Frederick W. B. Li, Mu Li, Xiaohui Liang

TL;DR
This paper introduces MP-HOI, a multimodal priors-augmented framework for text-driven 3D human-object interaction generation, addressing key limitations of previous methods by leveraging multimodal data, enhanced object representations, and a cascaded diffusion process.
Contribution
The paper presents a multimodal priors-augmented framework that combines a modality-aware mixture-of-experts (MoE) model with cascaded diffusion to improve text-driven 3D HOI motion generation.
Findings
MP-HOI outperforms existing methods in the fidelity and detail of generated motions.
Multimodal priors let it model human-object interactions effectively.
It produces more natural and accurate human-object interaction motions.
Abstract
We address the challenging task of text-driven 3D human-object interaction (HOI) motion generation. Existing methods primarily rely on a direct text-to-HOI mapping, which suffers from three key limitations due to the significant cross-modality gap: (Q1) sub-optimal human motion, (Q2) unnatural object motion, and (Q3) weak interaction between humans and objects. To address these challenges, we propose MP-HOI, a novel framework grounded in four core insights: (1) Multimodal Data Priors: We leverage multimodal data (text, image, pose/object) from large multimodal models as priors to guide HOI generation, which tackles Q1 and Q2 in data modeling. (2) Enhanced Object Representation: We improve existing object representations by incorporating geometric keypoints, contact features, and dynamic properties, enabling expressive object representations, which tackles Q2 in data representation. (3)…
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Motion and Animation · Generative Adversarial Networks and Image Synthesis
