HINT: Hierarchical Interaction Modeling for Autoregressive Multi-Human Motion Generation

Mengge Liu; Yan Di; Gu Wang; Yun Qu; Dekai Zhu; Yanyan Li; Xiangyang Ji

arXiv:2601.20383·cs.CV·January 29, 2026

Open Access · 3 Reviews

TL;DR

HINT is an autoregressive framework that models complex multi-human interactions for text-guided motion generation, effectively handling variable agent counts and long sequences with hierarchical interaction modeling.

Contribution

HINT introduces hierarchical interaction modeling within a diffusion-based autoregressive framework, enabling flexible, long-horizon multi-human motion generation with a variable number of agents.

Findings

01. HINT matches strong offline models in performance.

02. HINT surpasses autoregressive baselines on benchmarks.

03. HINT achieves a significant FID improvement on the InterHuman dataset.

Abstract

Text-driven multi-human motion generation with complex interactions remains a challenging problem. Despite progress in performance, existing offline methods, which generate fixed-length motions for a fixed number of agents, are inherently limited in handling long or variable text and varying agent counts. These limitations naturally encourage autoregressive formulations, which predict future motions step by step, conditioned on all past trajectories and the current text guidance. In this work, we introduce HINT, the first autoregressive framework for multi-human motion generation with Hierarchical INTeraction modeling in diffusion. First, HINT leverages a disentangled motion representation within a canonicalized latent space, decoupling local motion semantics from inter-person interactions. This design facilitates direct adaptation to varying numbers of human participants without requiring…
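The autoregressive formulation described in the abstract can be pictured with a small sketch. This is only an illustration under stated assumptions, not HINT's implementation: `rollout`, `toy_predict`, the `Tracks` layout, and the `history_len` parameter are hypothetical names, and `toy_predict` is a stand-in for a diffusion sampler. It shows that a window-by-window loop places no fixed limit on sequence length or on the number of agents.

```python
from typing import Callable, Dict, List, Sequence

Frame = List[float]             # one pose vector per frame (hypothetical layout)
Tracks = Dict[str, List[Frame]]  # agent name -> frames generated so far


def rollout(predict: Callable, seed: Tracks, prompts: Sequence[str],
            history_len: int) -> Tracks:
    """Generate motion window by window, using one text prompt per window."""
    tracks = {name: list(frames) for name, frames in seed.items()}
    for prompt in prompts:
        # Snapshot the recent past first so every agent conditions on the same history.
        past = {name: frames[-history_len:] for name, frames in tracks.items()}
        for name in tracks:
            partners = [past[other] for other in past if other != name]
            tracks[name].extend(predict(past[name], partners, prompt))
    return tracks


def toy_predict(own: List[Frame], partners: List[List[Frame]], prompt: str,
                window: int = 4) -> List[Frame]:
    """Placeholder for a learned sampler: drift toward the partners' mean pose."""
    last = own[-1]
    if partners:
        target = [sum(p[-1][i] for p in partners) / len(partners)
                  for i in range(len(last))]
    else:
        target = last
    return [[x + (t - x) * (k + 1) / (2 * window) for x, t in zip(last, target)]
            for k in range(window)]


seed = {"a": [[0.0, 0.0]] * 2, "b": [[4.0, 0.0]] * 2, "c": [[0.0, 4.0]] * 2}
motion = rollout(toy_predict, seed, ["walk", "wave", "hug"], history_len=2)
print({name: len(frames) for name, frames in motion.items()})
```

Adding a fourth entry to `seed` or a fourth prompt needs no other change, which is the flexibility the abstract attributes to autoregressive generation.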

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 4

Strengths

- Clear decomposition (canonicalization + hierarchical conditioning) that makes variable-length, multi-agent generation straightforward to implement.
- Solid quantitative improvements on realism (FID) with extensive ablations isolating the contribution of each condition.
- Simple path to >2 agents without retraining (shared weights; partner-history concatenation).

Weaknesses

- The paper’s problem setting seems to be addressed already by prior MDM extensions (e.g., PriorMDM).
- Novelty feels incremental: canonicalization + sliding window + standard diffusion conditioning.

Reviewer 02 · Rating 6 · Confidence 3

Strengths

1. This paper insightfully identifies a potential obstacle to generalizing two-person interaction motion generation to larger groups, namely the entanglement of global information, and proposes and validates a decoupling strategy, thereby demonstrating strong originality and clearly establishing the significance of the problem.
2. The exposition is clear; the authors’ ideas are easy to grasp.
3. The supplementary videos present qualitative multi-person generation results and experiments that co

Weaknesses

1. Extracting interaction information into the global diffusion pipeline may lengthen the conditioning vector. Extending a two-person scenario to three is still manageable, but scaling to larger groups becomes problematic. Taking agent A’s viewpoint as an example, the paper encodes partner history by using B’s rotation $R$ and translation $T$ relative to A. Section 3.5 further suggests that expanding from two to N people simply means concatenating the partner-history embeddings of all additional
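The scaling concern raised here can be made concrete with a small sketch. It is a simplification under stated assumptions rather than the paper's encoding: poses are reduced to planar `(x, y, yaw)` triples instead of full 3D rotation $R$ and translation $T$, raw relative poses stand in for learned partner-history embeddings, and `relative_pose` and `partner_condition` are hypothetical helper names.

```python
import math


def relative_pose(a, b):
    """Pose of b expressed in a's local frame; poses are planar (x, y, yaw)."""
    ax, ay, ayaw = a
    bx, by, byaw = b
    dx, dy = bx - ax, by - ay
    c, s = math.cos(-ayaw), math.sin(-ayaw)
    # Rotate the world-frame offset by -yaw_a to express it in a's frame.
    rx = c * dx - s * dy
    ry = s * dx + c * dy
    # Wrap the heading difference into [-pi, pi).
    ryaw = (byaw - ayaw + math.pi) % (2 * math.pi) - math.pi
    return (rx, ry, ryaw)


def partner_condition(poses, ego):
    """Concatenate the relative pose of every partner, seen from `ego`."""
    cond = []
    for name, pose in poses.items():
        if name != ego:
            cond.extend(relative_pose(poses[ego], pose))
    return cond


for n in (2, 3, 8):
    poses = {f"p{i}": (float(i), 0.0, 0.0) for i in range(n)}
    # Conditioning length is 3 * (n - 1): linear in the number of partners.
    print(n, len(partner_condition(poses, "p0")))
```

Under plain concatenation the conditioning input for each agent grows linearly with the group size, so the total across all agents grows quadratically, which is the cost the reviewer is pointing at.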

Reviewer 03 · Rating 4 · Confidence 4

Strengths

* Clear motivation for addressing multi-human motion, a topic that demands richer spatial/temporal modeling than single-human generation.
* A principled decoupling/conditioning design intended to preserve local motion quality while handling inter-person relations.
* Practical use of a sliding window to extend diffusion beyond fixed-length clips, enabling continuous rollouts.
* Competitive results on standard benchmarks.

Weaknesses

* **Unspecified text-to-length mapping.** Global conditioning uses the “total frame number” $T_N$ *according to the textual description*, but the paper does not explain how $T_N$ is inferred or parsed from text at test time. This is central to the **compositional command** narrative and should be made explicit.
* **Canonicalization and drift across windows.** The approach removes absolute placement and re-injects pairwise transforms. The paper should discuss **how relative transforms are ob

Taxonomy

Topics: Human Motion and Animation · Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications