AnyCrowd: Instance-Isolated Identity-Pose Binding for Arbitrary Multi-Character Animation

Zhenyu Xie; Ji Xia; Michael Kampffmeyer; Panwen Hu; Zehua Ma; Yujian Zheng; Jing Wang; Zheng Chong; Xujie Zhang; Xianhang Cheng; Xiaodan Liang; Hao Li

arXiv:2603.15415 · cs.CV · March 17, 2026

Open Access

TL;DR

AnyCrowd introduces a Diffusion Transformer (DiT)-based framework that uses instance-isolated encoding and decoupled attention to enable scalable multi-character animation with improved identity control and reduced identity entanglement.

Contribution

The paper presents a new framework combining instance-isolated latent representations and decoupled attention mechanisms for multi-character animation.

Findings

01. Scales to an arbitrary number of characters.

02. Reduces identity entanglement and identity bleeding.

03. Achieves spatio-temporally consistent animations.

Abstract

Controllable character animation has advanced rapidly in recent years, yet multi-character animation remains underexplored. As the number of characters grows, multi-character reference encoding becomes more susceptible to latent identity entanglement, resulting in identity bleeding and reduced controllability. Moreover, learning precise and spatio-temporally consistent correspondences between reference identities and driving pose sequences becomes increasingly challenging, often leading to identity-pose mis-binding and inconsistency in generated videos. To address these challenges, we propose AnyCrowd, a Diffusion Transformer (DiT)-based video generation framework capable of scaling to an arbitrary number of characters. Specifically, we first introduce an Instance-Isolated Latent Representation (IILR), which encodes character instances independently prior to DiT processing to prevent…

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Human Motion and Animation · Human Pose and Action Recognition