HunyuanVideo-HOMA: Generic Human-Object Interaction in Multimodal Driven Human Animation

Ziyao Huang; Zixiang Zhou; Juan Cao; Yifeng Ma; Yi Chen; Zejing Rao; Zhiyong Xu; Hongmei Wang; Qin Lin; Yuan Zhou; Qinglin Lu; Fan Tang

arXiv:2506.08797 · cs.CV · June 11, 2025

Open Access

TL;DR

HunyuanVideo-HOMA is a multimodal framework for human-object interaction video generation that improves controllability, generalization, and accessibility by using weak supervision, a diffusion transformer, and adapters for audio and appearance guidance.

Contribution

It introduces a novel weakly conditioned multimodal-driven approach with adapters and a diffusion transformer for versatile and realistic human-object interaction video synthesis.

Findings

01. Achieves state-of-the-art interaction naturalness and generalization.

02. Demonstrates effective text-conditioned generation and object manipulation.

03. Provides a user-friendly demo interface.

Abstract

To address key limitations in human-object interaction (HOI) video generation -- specifically the reliance on curated motion data, limited generalization to novel objects/scenarios, and restricted accessibility -- we introduce HunyuanVideo-HOMA, a weakly conditioned multimodal-driven framework. HunyuanVideo-HOMA enhances controllability and reduces dependency on precise inputs through sparse, decoupled motion guidance. It encodes appearance and motion signals into the dual input space of a multimodal diffusion transformer (MMDiT), fusing them within a shared context space to synthesize temporally consistent and physically plausible interactions. To optimize training, we integrate a parameter-space HOI adapter initialized from pretrained MMDiT weights, preserving prior knowledge while enabling efficient adaptation, and a facial cross-attention adapter for anatomically accurate…
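The abstract mentions an adapter that is initialized from pretrained weights so that prior knowledge is preserved while adaptation stays efficient. The excerpt does not specify the actual adapter design, so the following is only a toy illustration of the general idea, not the authors' implementation. The `Linear` class and the `make_adapter_from_pretrained` helper are hypothetical names introduced here, and a two-by-two weight matrix stands in for a transformer block.

```python
import copy


class Linear:
    """Tiny dense layer y = W x (no bias); a stand-in for a pretrained block."""

    def __init__(self, weight):
        self.weight = [list(row) for row in weight]
        self.trainable = True

    def __call__(self, x):
        return [sum(w * v for w, v in zip(row, x)) for row in self.weight]


def make_adapter_from_pretrained(pretrained):
    """Hypothetical helper: freeze the pretrained layer, return a trainable copy."""
    pretrained.trainable = False          # the original weights stay fixed
    adapter = copy.deepcopy(pretrained)   # the adapter starts from the same weights
    adapter.trainable = True
    return adapter


pretrained = Linear([[1.0, 0.0], [0.5, 2.0]])
adapter = make_adapter_from_pretrained(pretrained)

x = [3.0, 4.0]
# At initialization the adapter reproduces the pretrained mapping exactly.
same_at_init = adapter(x) == pretrained(x)

# A mock "training update" touches only the adapter's copy of the weights.
adapter.weight[0][0] += 0.1
pretrained_unchanged = pretrained.weight[0][0] == 1.0
```

Because the adapter starts as an exact copy, its output at step zero matches the pretrained layer's, which is one simple way to read "preserving prior knowledge"; only the copied weights move during adaptation.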

Taxonomy

Topics: Face recognition and analysis · Speech and Audio Processing · Generative Adversarial Networks and Image Synthesis

Methods: Diffusion · Adapter