InterActHuman: Multi-Concept Human Animation with Layout-Aligned Audio Conditions
Zhenzhi Wang, Jiaqi Yang, Jianwen Jiang, Chao Liang, Gaojie Lin, Zerong Zheng, Ceyuan Yang, Yuan Zhang, Mingyuan Gao, Dahua Lin

TL;DR
InterActHuman introduces a novel framework for multi-concept human animation that enables precise, region-specific control of multiple identities and objects in videos using layout-aligned multi-modal conditions, improving realism and customization.
Contribution
The paper presents a new method for multi-identity human animation that enforces region-specific condition binding and layout inference, allowing for multi-person dialogue videos and detailed scene customization.
Findings
Effective multi-concept animation with region-specific control.
Improved layout-aligned multi-modal condition matching.
Validated through empirical results and ablation studies.
Abstract
End-to-end human animation with rich multi-modal conditions (e.g., text, image, and audio) has advanced remarkably in recent years. However, most existing methods can only animate a single subject and inject conditions globally, ignoring scenarios where multiple concepts appear in the same video with rich human-human and human-object interactions. Such a global assumption prevents precise, per-identity control of multiple concepts, including humans and objects, and therefore hinders applications. In this work, we discard the single-entity assumption and introduce a novel framework that enforces strong, region-specific binding of conditions from modalities to each identity's spatiotemporal footprint. Given reference images of multiple concepts, our method can automatically infer layout information by leveraging a mask predictor to match…
Peer Reviews
Decision: ICLR 2026 Poster
1: The framework introduces the capability for multi-person, audio-driven animation, correctly assigning distinct audio streams to specific individuals in the generated video. 2: The paper proposes a practical iterative mask-caching strategy to solve the "chicken-and-egg" problem of local conditioning, using the mask predicted at step $k$ to guide the local audio injection at step $k+1$. 3: Experiments show that the method significantly outperforms existing baselines (like Kling 1.6 w/ lip-syn
1: **Over-reliance on a Private, Curated Dataset.** The model is trained and evaluated on a new, large-scale dataset (2.6M pairs) curated by the authors. The multi-person test set is also newly collected by the authors. This makes it difficult to assess robustness and generalizability. Since the baselines were not trained on this specific, mask-annotated dataset, it's unclear if the performance gap is due to the model's architecture or its specialized training data, which may be perfectly tailor
1. The paper provides relatively comprehensive quantitative evaluations using mainstream avatar-related metrics. 2. The authors offer detailed descriptions of how they collected and cleaned a large-scale dataset combining reference images, audio, per-frame masks, and captions for diverse multi-human/object interactions (Section 3.3), which could be a valuable resource for empirical studies if released publicly.
1. While using layout-based guidance for multi-concept conditioning is a reasonable design, the claimed novelty of this framework is questionable. Similar strategies are already standard in both multi-concept image and video generation. Even within avatar-related tasks, several prior works, including MultiTalk [1], have adopted comparable layout-guided conditioning. Moreover, dynamic layout prediction was first introduced in Ingredients [2], where it serves a clear purpose in the ipt2v task. However,
1. The paper tackles a practically important and underexplored setting: multi-person, audio-conditioned human animation where each identity must keep its own appearance and voice, instead of the common single-identity assumption in prior audio-driven portrait or OmniHuman-style models. 2. The proposed iterative mask prediction and cached, layout-guided audio injection mechanism is elegant. By predicting per-identity spatiotemporal masks using cross-attention between reference appearance tokens
1. Runtime cost and scalability claims are mostly deferred to the appendix. The main text asserts minimal overhead and compatibility with long-video generation, but does not quantify inference speed or memory usage when conditioning on multiple identities. 2. Multi-speaker audio assignment appears to rely on injecting each audio stream only into the spatial region indicated by that speaker’s cached mask at inference time. It is unclear whether the model is ever explicitly trained on multi speak
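The reviews above describe an iterative mask-caching strategy: the mask predicted at denoising step $k$ gates the per-identity audio injection at step $k+1$. The following is a minimal NumPy sketch of that control flow only, not the authors' implementation. The function names (`predict_masks`, `inject_audio`, `denoise_with_mask_cache`), the softmax-similarity mask predictor, the uniform initial mask, and the `0.9 * latent` stand-in for a real denoising update are all assumptions made for illustration.

```python
import numpy as np


def predict_masks(latent, ref_tokens):
    """Hypothetical mask predictor: soft-assign each latent position to an
    identity via softmax over dot-product similarity with reference tokens.
    latent: (N, D), ref_tokens: (K, D) -> masks: (N, K), rows sum to 1."""
    logits = latent @ ref_tokens.T
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)


def inject_audio(latent, masks, audio_feats, scale=0.1):
    """Add each identity's audio feature only where its mask is active.
    masks: (N, K), audio_feats: (K, D)."""
    return latent + scale * (masks @ audio_feats)


def denoise_with_mask_cache(latent, ref_tokens, audio_feats, num_steps=4):
    """Toy loop: audio injection at each step uses the mask cached from the
    previous step, then a fresh mask is predicted and cached for the next."""
    n_positions, n_ids = latent.shape[0], ref_tokens.shape[0]
    # Assumption: no mask exists before the first step, so start uniform.
    cached_masks = np.full((n_positions, n_ids), 1.0 / n_ids)
    for _ in range(num_steps):
        latent = inject_audio(latent, cached_masks, audio_feats)
        latent = 0.9 * latent  # placeholder for the real denoising update
        cached_masks = predict_masks(latent, ref_tokens)
    return latent, cached_masks


rng = np.random.default_rng(0)
latent = rng.normal(size=(6, 4))       # 6 latent positions, 4 channels
ref_tokens = rng.normal(size=(2, 4))   # 2 identities
audio_feats = rng.normal(size=(2, 4))  # one audio feature per identity
out, masks = denoise_with_mask_cache(latent, ref_tokens, audio_feats)
```

Caching the previous step's mask avoids the circular dependency the reviewers mention: local injection needs a layout, but the layout is only known after the model has processed the latent.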
Taxonomy
Topics: Human Motion and Animation · Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications
