Uni-HOI: A Unified framework for Learning the Joint distribution of Text and Human-Object Interaction

Mengfei Zhang; Jinlu Zhang; Zhigang Tu

arXiv:2604.27491·cs.CV·May 1, 2026

TL;DR

Uni-HOI is a unified framework that models the joint distribution of text, human motion, and object motion. It uses an LLM and two motion-specific VQ-VAEs so that a single model can handle diverse HOI tasks.

Contribution

A single model, built on a large language model and two motion-specific vector quantized variational autoencoders, replaces task-specific architectures across diverse HOI tasks.

Findings

01

Achieves state-of-the-art results on multiple HOI tasks.

02

Converts heterogeneous human and object motion data into LLM-compatible token sequences.

03

Demonstrates versatility across text-driven and motion-driven HOI generation.

Abstract

Modeling 4D human-object interaction (HOI) is a compelling challenge in computer vision and an essential technology powering virtual and mixed-reality applications. While existing works have achieved promising results on specific HOI tasks, such as text-conditioned HOI generation and human motion generation from object motion, they typically rely on task-specific architectures and lack a unified framework capable of handling diverse conditional inputs. To address this, we propose Uni-HOI, a unified framework that learns the joint distribution among text, human motion, and object motion. By leveraging large language models (LLMs) and two motion-specific vector quantized variational autoencoders (VQ-VAEs), we convert heterogeneous motion data into token sequences compatible with LLM inputs, enabling seamless integration and joint modeling of all three modalities. We introduce a two-stage…
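
The tokenization idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy codebooks, the feature vectors standing in for VQ-VAE encoder outputs, the vocabulary size, the token-ID layout, the boundary markers, and the simple concatenation order are all assumptions made for illustration. It shows only the general technique of nearest-codebook quantization followed by shifting code indices into ID ranges that do not collide with text tokens.

```python
from typing import List, Sequence

Vector = Sequence[float]


def quantize(features: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Map each feature vector to the index of its nearest codebook entry."""
    def sq_dist(a: Vector, b: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sq_dist(f, codebook[k]))
        for f in features
    ]


def to_llm_tokens(indices: List[int], offset: int, start_id: int, end_id: int) -> List[int]:
    """Shift codebook indices into a reserved ID range and wrap them in markers."""
    return [start_id] + [offset + i for i in indices] + [end_id]


# Hypothetical toy codebooks; real ones would be learned by the two VQ-VAEs.
human_codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
object_codebook = [[0.0], [0.5], [1.0]]

# Hypothetical ID layout: text IDs first, then human codes, object codes, markers.
TEXT_VOCAB_SIZE = 1000
HUMAN_OFFSET = TEXT_VOCAB_SIZE
OBJECT_OFFSET = HUMAN_OFFSET + len(human_codebook)
FIRST_MARKER = OBJECT_OFFSET + len(object_codebook)
HUMAN_START, HUMAN_END, OBJECT_START, OBJECT_END = range(FIRST_MARKER, FIRST_MARKER + 4)

# Stand-ins for per-frame encoder outputs (assumed, not from the paper).
human_features = [[0.1, -0.1], [0.9, 0.2], [0.8, 0.9]]
object_features = [[0.1], [0.6], [0.95]]

human_codes = quantize(human_features, human_codebook)
object_codes = quantize(object_features, object_codebook)

text_ids = [12, 57, 3]  # placeholder for a tokenized text prompt
sequence = (
    text_ids
    + to_llm_tokens(human_codes, HUMAN_OFFSET, HUMAN_START, HUMAN_END)
    + to_llm_tokens(object_codes, OBJECT_OFFSET, OBJECT_START, OBJECT_END)
)
print(sequence)
```

Keeping the human and object codes in separate ID ranges lets one sequence model tell the modalities apart. How Uni-HOI actually orders or interleaves the three modalities is not stated in the text available here.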
