Multimodal Priors-Augmented Text-Driven 3D Human-Object Interaction Generation

Yin Wang; Ziyao Zhang; Zhiying Leng; Haitian Liu; Frederick W. B. Li; Mu Li; Xiaohui Liang

arXiv:2602.10659 · cs.CV · February 12, 2026

TL;DR

This paper introduces MP-HOI, a multimodal priors-augmented framework for text-driven 3D human-object interaction generation, addressing key limitations of previous methods by leveraging multimodal data, enhanced object representations, and a cascaded diffusion process.

Contribution

The paper presents a novel multimodal priors-augmented framework with a modality-aware MoE model and cascaded diffusion for improved 3D HOI motion generation from text.

Findings

01. Outperforms existing methods in fidelity and detail of generated motions.

02. Effectively models human and object interactions with multimodal priors.

03. Achieves more natural and accurate human-object interaction motions.

Abstract

We address the challenging task of text-driven 3D human-object interaction (HOI) motion generation. Existing methods primarily rely on a direct text-to-HOI mapping, which suffers from three key limitations due to the significant cross-modality gap: (Q1) sub-optimal human motion, (Q2) unnatural object motion, and (Q3) weak interaction between humans and objects. To address these challenges, we propose MP-HOI, a novel framework grounded in four core insights: (1) Multimodal Data Priors: We leverage multimodal data (text, image, pose/object) from large multimodal models as priors to guide HOI generation, which tackles Q1 and Q2 in data modeling. (2) Enhanced Object Representation: We improve existing object representations by incorporating geometric keypoints, contact features, and dynamic properties, enabling expressive object representations, which tackles Q2 in data representation. (3)…
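
To make insight (2) concrete, the following is a minimal sketch of how a per-frame object feature vector might combine geometric keypoints, contact features, and dynamic properties. It is an illustration only, not the MP-HOI implementation: the function name `object_frame_features`, the distance-threshold contact rule, the finite-difference velocity, and the defaults `dt` and `contact_radius` are all assumptions that do not come from the paper.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def object_frame_features(
    keypoints: Sequence[Point],
    prev_keypoints: Sequence[Point],
    human_joints: Sequence[Point],
    dt: float = 1.0 / 30.0,        # assumed frame interval in seconds
    contact_radius: float = 0.05,  # assumed contact threshold in metres
) -> List[float]:
    """Hypothetical per-frame object descriptor (not the paper's definition)."""
    features: List[float] = []

    # Geometric part: raw 3D coordinates of each object keypoint.
    for point in keypoints:
        features.extend(point)

    # Contact part: 1.0 if any human joint lies within contact_radius
    # of the keypoint, else 0.0 (an assumed, simplistic contact rule).
    for point in keypoints:
        nearest = min(math.dist(point, joint) for joint in human_joints)
        features.append(1.0 if nearest <= contact_radius else 0.0)

    # Dynamic part: finite-difference velocity of each keypoint.
    for point, previous in zip(keypoints, prev_keypoints):
        features.extend((a - b) / dt for a, b in zip(point, previous))

    return features


# Usage: two keypoints, two human joints, one of them near the first keypoint.
frame = object_frame_features(
    keypoints=[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    prev_keypoints=[(0.0, 0.0, 0.0), (0.9, 0.0, 0.0)],
    human_joints=[(0.0, 0.0, 0.01), (5.0, 5.0, 5.0)],
    dt=0.1,
)
print(len(frame))  # 2 keypoints * (3 position + 1 contact + 3 velocity) = 14
```

Concatenating the three groups into one flat vector keeps the sketch simple. A real system would more likely use learned contact features and a richer notion of dynamics than a one-step velocity.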

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Motion and Animation · Generative Adversarial Networks and Image Synthesis