MotionWeaver: Holistic 4D-Anchored Framework for Multi-Humanoid Image Animation
Xirui Hu, Yanbo Ding, Jiahao Wang, Tingting Shi, Yali Wang, Guo Zhi Zhi, Weizhan Zhang

TL;DR
MotionWeaver introduces a holistic 4D-anchored framework for multi-humanoid image animation, enabling generalization across diverse humanoid forms and complex interactions, with state-of-the-art results on a new benchmark.
Contribution
The paper presents unified motion representations and a holistic 4D-anchored paradigm that extend character image animation from single-human settings to multi-humanoid scenarios involving diverse humanoid forms, complex interactions, and frequent occlusions.
Findings
Achieves state-of-the-art results on the new multi-humanoid benchmark.
Effectively generalizes across diverse humanoid forms and interactions.
Handles occlusions and complex multi-human scenarios robustly.
Abstract
Character image animation, which synthesizes videos of reference characters driven by pose sequences, has advanced rapidly but remains largely limited to single-human settings. Existing methods struggle to generalize to multi-humanoid scenarios, which involve diverse humanoid forms, complex interactions, and frequent occlusions. We address this gap with two key innovations. First, we introduce unified motion representations that extract identity-agnostic motions and explicitly bind them to corresponding characters, enabling generalization across diverse humanoid forms and seamless extension to multi-humanoid scenarios. Second, we propose a holistic 4D-anchored paradigm that constructs a shared 4D space to fuse motion representations with video latents, and further reinforces this process with hierarchical 4D-level supervision to better handle interactions and occlusions. We instantiate…
Peer Reviews
Decision·ICLR 2026 Poster
1. The paper clearly argues that 2D control entangles appearance and motion and lacks explicit depth reasoning in multi-person scenes, and responds with a fully 4D-anchored pipeline (UCC, HSI and H4S) that separates motion from morphology, injects depth ordering, and supervises motion explicitly. 2. UCC standardizes SMPL joints to strip appearance cues and binds per-person motion with group attention. HSI adds depth-aware cross-attention plus Dynamic Cross-RoPE to align (t, x, y) between camera
My major concern is that the paper lacks a detailed comparison with existing methods that would make its contribution clear. Multi-person interaction and 4D motion tokens have already been adopted by previous methods. I hope the authors can clarify the differences from existing methods. 1. MTVCrafter (Ding et al., 2025) also models raw 4D motion via a tokenizer (4DMoT) and conditions a DiT with 4D positional encodings, reporting large gains on open-world human animation. MotionWeaver likewise tr
This work is well motivated. Character video generation involving two characters remains very challenging. The paper presents a clear analysis of the issues in motion representation and 4D modeling. The proposed technical approaches, i.e., UCC, HSI, and H4S, are technically novel and are sensible ways to improve multi-character video generation. The experiments on the new DualDynamics benchmark show clear improvements, substantially outperforming recent related works.
The paper reports performance comparisons only on the new DualDynamics benchmark. Why will only part of the constructed benchmark be released? How, then, can subsequent works compare with MotionWeaver? All the technical approaches, i.e., UCC, HSI, and H4S, are also applicable to the single-character case. Please report the performance of MotionWeaver on Fashion and TikTok so that readers have a clear understanding of the advantages of these modules.
1. Multi-humanoid image animation is an important, practical, and interesting task. 2. The proposed method designs several components that extract motion to improve generalization, which is well motivated and reasonable. 3. The results show that the proposed method outperforms previous methods.
1. The training setting of the compared methods seems to differ from that of the proposed method. Are the compared methods trained on the same data? 2. The proposed method consists of several steps. How does its inference time compare with that of previous methods? How would error propagation affect the final results? 3. The evaluation is conducted only on the proposed benchmark. It is unclear whether the proposed method outperforms the compared methods on existing benchmarks.
Taxonomy
Topics: Human Motion and Animation · Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition
