Disentangled Hierarchical VAE for 3D Human-Human Interaction Generation

Zichen Geng; Zeeshan Hayder; Bo Miao; Jian Liu; Wei Liu; Ajmal Mian

arXiv:2603.00144·cs.CV·March 3, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces DHVAE, a hierarchical VAE with disentangled latent spaces and diffusion-based synthesis, to generate realistic, physically plausible 3D human interactions with improved fidelity and control.

Contribution

It proposes a novel disentangled hierarchical VAE with a diffusion process and contrastive learning to enhance 3D human interaction generation.

Findings

01

Achieves superior motion fidelity and physical plausibility.

02

Improves computational efficiency in interaction synthesis.

03

Enhances control over interaction semantics.

Abstract

Generating realistic 3D Human-Human Interaction (HHI) requires coherent modeling of the physical plausibility of the agents and their interaction semantics. Existing methods compress all motion information into a single latent representation, limiting their ability to capture fine-grained actions and inter-agent interactions. This often leads to semantic misalignment and physically implausible artifacts, such as penetration or missed contact. We propose Disentangled Hierarchical Variational Autoencoder (DHVAE) based latent diffusion for structured and controllable HHI generation. DHVAE explicitly disentangles the global interaction context and individual motion patterns into a decoupled latent structure by employing a CoTransformer module. To mitigate implausible and physically inconsistent contacts in HHI, we incorporate contrastive learning constraints with our DHVAE to promote a more…

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

1. The paper is well written and easy to follow. The motivation underlying the proposed methodology is clearly articulated and logically developed.
2. The approach is conceptually simple yet effective. Both the disentanglement of representations and the incorporation of contrastive learning to promote physical plausibility are well justified.
3. The experimental evaluation is thorough, and the ablation studies effectively demonstrate the contribution of each component.

Weaknesses

1. The design of the contrastive learning component requires further clarification. It is intuitive to generate negative samples for contact cases; however, for non-contact cases, Algorithm 1 suggests that negative samples are synthesized by artificially increasing the distance between agents. In this scenario, the agents become even more separated. Why should such samples be considered valid negatives for non-contact interactions?
2. It would be informative to report inference latency for inter…
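The perturbation questioned in Weakness 1 above can be made concrete with a small sketch. This is not the paper's Algorithm 1, which is not reproduced on this page; it only illustrates "increasing the distance between agents" under stated assumptions: each agent is a list of frames, each frame is a list of `(x, y, z)` joint positions with joint 0 as the root, and the helper names `push_apart` and `min_joint_distance` are hypothetical.

```python
import math

def mean_root_offset(agent_a, agent_b):
    """Average offset from A's root (joint 0) to B's root over all frames."""
    sums = [0.0, 0.0, 0.0]
    for frame_a, frame_b in zip(agent_a, agent_b):
        for k in range(3):
            sums[k] += frame_b[0][k] - frame_a[0][k]
    return [s / len(agent_a) for s in sums]

def push_apart(agent_a, agent_b, extra_distance):
    """Return a copy of agent_b translated `extra_distance` farther from agent_a
    along the mean root-to-root direction (hypothetical negative sample)."""
    offset = mean_root_offset(agent_a, agent_b)
    norm = math.sqrt(sum(c * c for c in offset))
    if norm == 0.0:
        direction = [1.0, 0.0, 0.0]  # arbitrary fallback when the roots coincide
    else:
        direction = [c / norm for c in offset]
    shift = [extra_distance * d for d in direction]
    return [[tuple(joint[k] + shift[k] for k in range(3)) for joint in frame]
            for frame in agent_b]

def min_joint_distance(agent_a, agent_b):
    """Smallest joint-to-joint distance between the two agents over all frames."""
    return min(math.dist(ja, jb)
               for frame_a, frame_b in zip(agent_a, agent_b)
               for ja in frame_a for jb in frame_b)

# Toy non-contact pair: two frames, two joints per agent, roots 1.0 apart.
agent_a = [[(0.0, 0.0, 0.0), (0.0, 1.0, 0.0)]] * 2
agent_b = [[(1.0, 0.0, 0.0), (1.0, 1.0, 0.0)]] * 2
negative_b = push_apart(agent_a, agent_b, 0.5)

print(min_joint_distance(agent_a, agent_b))     # separation before the push
print(min_joint_distance(agent_a, negative_b))  # separation after the push
```

In this toy case the pair is non-contact both before and after the push, only farther apart afterwards, which is the situation the reviewer asks the authors to justify as a valid negative.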

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1. The figures and tables are well-presented, and the appendix is relatively comprehensive.
2. The method shows advantages in quantitative metrics on both InterHuman and InterX datasets.

Weaknesses

**Major**
1. **The contribution of the main component, DHVAE, has not been sufficiently discussed.** Compared to "a flat, unified latent representation" VAE, DHVAE uses three latent variables to represent an HHI sequence. Does this mean that the dimensionality of DHVAE's latent variables is three times that of a standard VAE? For example, on the InterHuman dataset, is D(DHVAE) : D(VAE) = 3×256 : 1×256? If so, are the comparisons in Table 2 and Table 3 fair?
2. **The paper's presentation and form…**
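The ratio raised in Major point 1 above works out as follows. This follows the reviewer's hypothetical that each of the three DHVAE latents has the same width as the baseline's single 256-dimensional latent; the page does not confirm the actual latent sizes.

$$
D(\text{DHVAE}) : D(\text{VAE}) = (3 \times 256) : (1 \times 256) = 768 : 256 = 3 : 1
$$

Under that assumption, DHVAE would have three times the latent capacity of the flat baseline, which is why the reviewer asks whether the comparisons in Table 2 and Table 3 are fair.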

Reviewer 03 · Rating 4 · Confidence 3

Strengths

1. The use of contrastive learning for two-person interaction generation is theoretically sound and has been validated in other motion generation domains (e.g., sign-language synthesis).
2. The paper’s core contributions, the global interaction latent variable $z_{o}$ and the CoTransformer, are both empirically verified through experiments.
3. The manuscript is clearly written; the authors’ intentions can be readily followed from the text, and the Contrastive Learning algorithm is presented in a…

Weaknesses

1. The core contribution of the paper is rather narrow: encoding two-person motions into a single latent variable $z_{o}$ improves generation quality, but it does not yield creative or novel motion combinations.
2. The effectiveness of the contrastive-loss term is not reflected in the numerical metrics; the authors attempt to demonstrate its role through PCA visualizations in Appendix 6.5 (Figs. 7 & 8) and qualitative renderings. **Nevertheless, further evidence is required, as the high-quality…**

Taxonomy

Topics: Human Motion and Animation · Human Pose and Action Recognition · 3D Shape Modeling and Analysis