InterMask: 3D Human Interaction Generation via Collaborative Masked Modeling
Muhammad Gohar Javed, Chuan Guo, Li Cheng, Xingyu Li

TL;DR
InterMask generates 3D human-human interactions from text descriptions by applying collaborative masked modeling to discrete motion tokens, and it reports state-of-the-art FID scores on the InterHuman and InterX datasets.
Contribution
The paper introduces InterMask, which pairs a VQ-VAE that encodes each motion sequence as a 2D discrete token map with a transformer that jointly models the tokens of both interacting individuals through generative masked modeling.
Findings
Achieves state-of-the-art FID scores on InterHuman and InterX datasets.
Produces diverse and high-fidelity human interaction sequences.
Supports reaction generation without additional fine-tuning.
Abstract
Generating realistic 3D human-human interactions from textual descriptions remains a challenging task. Existing approaches, typically based on diffusion models, often produce results lacking realism and fidelity. In this work, we introduce InterMask, a novel framework for generating human interactions using collaborative masked modeling in discrete space. InterMask first employs a VQ-VAE to transform each motion sequence into a 2D discrete motion token map. Compared with traditional 1D VQ token maps, this 2D map better preserves fine-grained spatio-temporal details and promotes spatial awareness within each token. Building on this representation, InterMask utilizes a generative masked modeling framework to collaboratively model the tokens of two interacting individuals. This is achieved by employing a transformer architecture specifically designed to capture complex spatio-temporal inter-dependencies.…
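
The two ideas in the abstract, a 2D discrete token map and joint masking over both individuals, can be illustrated with a small NumPy sketch. It is not the authors' implementation. It assumes the two axes of the token map are time and a spatial (body-part) axis, replaces the learned VQ-VAE encoder with pre-computed feature vectors, and uses plain nearest-codebook lookup. The names `quantize_to_token_map`, `mask_two_person_tokens`, and `MASK_ID` are hypothetical, and the transformer that would predict the masked tokens is omitted.

```python
import numpy as np

MASK_ID = -1  # hypothetical sentinel marking a masked token position


def quantize_to_token_map(features, codebook):
    """Map (T, J, D) features to a (T, J) map of nearest-codebook indices.

    T = time steps, J = spatial positions (assumed axes), D = feature size.
    codebook has shape (K, D).
    """
    # Squared Euclidean distance from every feature to every codebook entry.
    diffs = features[:, :, None, :] - codebook[None, None, :, :]
    dists = np.sum(diffs ** 2, axis=-1)   # shape (T, J, K)
    return np.argmin(dists, axis=-1)      # shape (T, J)


def mask_two_person_tokens(tokens_a, tokens_b, mask_ratio, rng):
    """Mask a random fraction of positions drawn jointly from both token maps."""
    stacked = np.stack([tokens_a, tokens_b])          # shape (2, T, J)
    n_total = stacked.size
    n_mask = int(round(mask_ratio * n_total))
    flat_idx = rng.choice(n_total, size=n_mask, replace=False)
    mask = np.zeros(n_total, dtype=bool)
    mask[flat_idx] = True
    mask = mask.reshape(stacked.shape)
    masked = np.where(mask, MASK_ID, stacked)
    return masked[0], masked[1], mask


# Toy usage with random stand-ins for encoder outputs and a learned codebook.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(16, 4))                   # K=16 codes, D=4
person_a = quantize_to_token_map(rng.normal(size=(8, 5, 4)), codebook)
person_b = quantize_to_token_map(rng.normal(size=(8, 5, 4)), codebook)
masked_a, masked_b, mask = mask_two_person_tokens(person_a, person_b, 0.5, rng)
print(masked_a.shape, masked_b.shape, int(mask.sum()))
```

Drawing the masked positions from the stacked pair, rather than from each person separately, means a training step can hide most of one person's tokens while leaving the other's visible. This is one way to picture how a single masked model could also serve reaction generation, although the exact masking schedule used by InterMask is not given in this summary.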
Taxonomy
Topics: Human Motion and Animation · Human Pose and Action Recognition · Virtual Reality Applications and Impacts
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Label Smoothing · Dropout · Byte Pair Encoding · Adam · Dense Connections · Softmax
