UniHM: Universal Human Motion Generation with Object Interactions in Indoor Scenes

Zichen Geng, Zeeshan Hayder, Wei Liu, and Ajmal Mian

arXiv:2505.12774 · cs.GR · May 20, 2025

Open Access

TL;DR

UniHM is a novel diffusion-based model that enables realistic, scene-aware human motion synthesis from text prompts, supporting complex object interactions in indoor environments.

Contribution

UniHM introduces a unified framework that supports both Text-to-Motion and Text-to-Human-Object Interaction (HOI) generation, together with a mixed-motion representation and a dataset enhancement.

Findings

01 Achieves competitive results on the OMOMO benchmark for HOI synthesis.

02 Yields strong performance on HumanML3D for general motion generation.

03 Outperforms traditional VQ-VAEs in motion reconstruction and generation.

Abstract

Human motion synthesis in complex scenes presents a fundamental challenge, extending beyond conventional Text-to-Motion tasks by requiring the integration of diverse modalities such as static environments, movable objects, natural language prompts, and spatial waypoints. Existing language-conditioned motion models often struggle with scene-aware motion generation due to limitations in motion tokenization, which leads to information loss and fails to capture the continuous, context-dependent nature of 3D human movement. To address these issues, we propose UniHM, a unified motion language model that leverages diffusion-based generation for synthesizing scene-aware human motion. UniHM is the first framework to support both Text-to-Motion and Text-to-Human-Object Interaction (HOI) in complex 3D scenes. Our approach introduces three key contributions: (1) a mixed-motion representation that…

Taxonomy

Topics: Human Motion and Animation · Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications