DanceTogether! Identity-Preserving Multi-Person Interactive Video Generation

Junhao Chen; Mingjin Chen; Jianjin Xu; Xiang Li; Junting Dong; Mingze Sun; Puhua Jiang; Hongxiang Li; Yuhang Yang; Hao Zhao; Xiaoxiao Long; Ruqi Huang

arXiv:2505.18078·cs.CV·May 26, 2025

TL;DR

DanceTogether introduces a novel diffusion-based framework for multi-person, identity-preserving video generation from a single reference image and pose streams, enabling realistic, interactive multi-actor videos.

Contribution

The paper presents what the authors describe as the first end-to-end diffusion framework for multi-actor video synthesis, built around a MaskPoseAdapter for identity preservation, along with two new datasets (PairFS-4K and HumanRob-300) and the TogetherVideoBench benchmark.

Findings

01. Outperforms prior methods on the TogetherVideoBench benchmark.

02. Achieves convincing human-robot interaction videos with minimal fine-tuning.

03. Demonstrates broad generalization to embodied-AI and HRI tasks.

Abstract

Controllable video generation (CVG) has advanced rapidly, yet current systems falter when more than one actor must move, interact, and exchange positions under noisy control signals. We address this gap with DanceTogether, the first end-to-end diffusion framework that turns a single reference image plus independent pose-mask streams into long, photorealistic videos while strictly preserving every identity. A novel MaskPoseAdapter binds "who" and "how" at every denoising step by fusing robust tracking masks with semantically rich (but noisy) pose heat-maps, eliminating the identity drift and appearance bleeding that plague frame-wise pipelines. To train and evaluate at scale, we introduce (i) PairFS-4K, 26 hours of dual-skater footage with 7,000+ distinct IDs, (ii) HumanRob-300, a one-hour humanoid-robot interaction set for rapid cross-domain transfer, and (iii) TogetherVideoBench, a…
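
To make the idea of binding "who" (tracking masks) to "how" (pose heat-maps) concrete, here is a minimal sketch, not the paper's MaskPoseAdapter. The abstract does not specify the adapter's architecture, so this example assumes the simplest possible fusion: each person's pose heat-map is multiplied element-wise by that person's binary mask, so that noisy pose responses cannot leak into another person's region. The function name `gate_pose_by_mask` and the toy 1x4 grids are hypothetical and exist only for illustration.

```python
from typing import List

Grid = List[List[float]]  # one H x W map stored as nested lists


def gate_pose_by_mask(masks: List[Grid], heatmaps: List[Grid]) -> List[Grid]:
    """Zero each person's pose heat-map outside that person's tracking mask.

    Assumption: a fixed element-wise product stands in for the learned
    fusion described in the abstract; the real adapter is not specified here.
    """
    if len(masks) != len(heatmaps):
        raise ValueError("need exactly one mask per pose heat-map")
    gated = []
    for mask, heat in zip(masks, heatmaps):
        gated.append(
            [[m * h for m, h in zip(mask_row, heat_row)]
             for mask_row, heat_row in zip(mask, heat)]
        )
    return gated


# Toy example: two people on a 1 x 4 strip of pixels.
# Person 0 owns the left half, person 1 owns the right half.
masks = [[[1, 1, 0, 0]], [[0, 0, 1, 1]]]
# Noisy heat-maps: each has a spurious response inside the other person's region.
heatmaps = [[[0.9, 0.2, 0.4, 0.0]], [[0.0, 0.3, 0.1, 0.8]]]

gated = gate_pose_by_mask(masks, heatmaps)
```

After gating, the spurious responses (0.4 for person 0, 0.3 for person 1) are zeroed, while each person's in-mask pose signal is left unchanged. A learned adapter would presumably do something richer than a hard product, but the sketch shows why pairing a mask with a pose map per identity limits appearance bleeding between actors.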
