Uni-HOI: A Unified framework for Learning the Joint distribution of Text and Human-Object Interaction
Mengfei Zhang, Jinlu Zhang, Zhigang Tu

TL;DR
Uni-HOI is a comprehensive framework that models the joint distribution of text, human motion, and object motion, enabling versatile HOI tasks with a unified approach using LLMs and VQ-VAEs.
Contribution
The paper introduces a single model that uses large language models and vector quantized autoencoders to handle diverse HOI tasks.
Findings
Achieves state-of-the-art results on multiple HOI tasks.
Effectively integrates heterogeneous motion data into LLM-compatible tokens.
Demonstrates versatility across text-driven and motion-driven HOI generation.
Abstract
Modeling 4D human-object interaction (HOI) is a compelling challenge in computer vision and an essential technology powering virtual and mixed-reality applications. While existing works have achieved promising results on specific HOI tasks, such as text-conditioned HOI generation and human motion generation from object motion, they typically rely on task-specific architectures and lack a unified framework capable of handling diverse conditional inputs. Building on this, we propose Uni-HOI, a unified framework that learns the joint distribution among text, human motion, and object motion. By leveraging large language models (LLMs) and two motion-specific vector quantized variational autoencoders (VQ-VAEs), we convert heterogeneous motion data into token sequences compatible with LLM inputs, enabling seamless integration and joint modeling of all three modalities. We introduce a two-stage…
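The abstract describes converting motion into token sequences an LLM can consume. The following is a minimal illustrative sketch of that general idea, not the authors' implementation: it assumes encoder latents are already available, uses tiny made-up codebooks, and invents the token naming scheme (`<human_i>`, `<object_i>`) and all function names purely for illustration. It shows only the nearest-codebook lookup at the core of vector quantization and how the resulting indices could be placed alongside text tokens.

```python
from typing import List, Sequence


def quantize(latents: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each latent vector to the index of its nearest codebook entry
    (squared Euclidean distance), as in the VQ step of a VQ-VAE."""
    indices = []
    for latent in latents:
        dists = [sum((x - c) ** 2 for x, c in zip(latent, code))
                 for code in codebook]
        indices.append(dists.index(min(dists)))
    return indices


def to_tokens(indices: Sequence[int], prefix: str) -> List[str]:
    """Render codebook indices as modality-specific token strings.
    The '<prefix_i>' format is a hypothetical naming choice."""
    return [f"<{prefix}_{i}>" for i in indices]


# Hypothetical toy codebooks: one per motion modality, mirroring the
# "two motion-specific VQ-VAEs" mentioned in the abstract.
human_codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
object_codebook = [[0.0, 1.0], [1.0, 0.0]]

# Stand-ins for encoder outputs (assumed; a real encoder is omitted here).
human_latents = [[0.1, -0.1], [0.9, 1.2], [1.8, 0.1]]
object_latents = [[0.2, 0.9], [0.8, 0.1]]

human_tokens = to_tokens(quantize(human_latents, human_codebook), "human")
object_tokens = to_tokens(quantize(object_latents, object_codebook), "object")

# Text and both motion streams end up in one flat token sequence.
text_tokens = ["a", "person", "lifts", "a", "box"]
sequence = text_tokens + human_tokens + object_tokens
print(sequence)
```

Keeping a separate codebook and token prefix per modality means human and object tokens never collide in the shared vocabulary; how the actual framework orders or interleaves the three modalities is not stated in the text above.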
