ReactDance: Hierarchical Representation for High-Fidelity and Coherent Long-Form Reactive Dance Generation
Jingzhong Lin, Xinru Li, Yuanyuan Qi, Bohao Zhang, Wenxiang Liu, Kecheng Tang, Wenxuan Huang, Xiangfeng Xu, Bangyan Li, Changbo Wang, Gaoqi He

TL;DR
ReactDance introduces a hierarchical diffusion framework with novel multi-scale motion representation and blockwise sampling to generate high-fidelity, coherent long-form reactive dance conditioned on a lead dancer's motion.
Contribution
The paper presents ReactDance, a hierarchical diffusion model that combines Hierarchical Finite Scalar Quantization (HFSQ) with a Blockwise Local Context (BLC) sampling strategy, enabling detailed spatial control and efficient long-term temporal coherence in reactive dance generation.
Findings
Outperforms state-of-the-art in motion quality
Achieves superior long-term coherence
Enhances sampling efficiency
Abstract
Reactive dance generation (RDG), the task of generating a dance conditioned on a lead dancer's motion, holds significant promise for enhancing human-robot interaction and immersive digital entertainment. Despite progress in duet synchronization and motion-music alignment, two key challenges remain: generating fine-grained spatial interactions and ensuring long-term temporal coherence. In this work, we introduce ReactDance, a diffusion framework that operates on a novel hierarchical latent space to address these spatiotemporal challenges in RDG. First, for high-fidelity spatial expression and fine-grained control, we propose Hierarchical Finite Scalar Quantization (HFSQ). This multi-scale motion representation effectively disentangles coarse body posture from subtle limb dynamics, enabling independent and detailed control over both aspects through a layered guidance…
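The idea of a coarse-to-fine scalar quantizer can be illustrated with a small sketch. This is not the paper's HFSQ: it is a generic residual variant of finite scalar quantization, written under the assumption that each stage rounds the leftover error of the previous stage onto a finer grid. The function name `residual_fsq` and the parameter `levels_per_stage` are hypothetical.

```python
import math

def residual_fsq(vec, levels_per_stage):
    """Quantize tanh-bounded values coarse-to-fine.

    Returns (per-stage contributions, final reconstruction).
    Illustrative only; assumes every entry of levels_per_stage is odd and >= 3.
    """
    target = [math.tanh(v) for v in vec]   # FSQ-style bounding to (-1, 1)
    recon = [0.0] * len(vec)
    stages = []
    scale = 1.0                            # range covered by the current stage
    for levels in levels_per_stage:
        half = (levels - 1) / 2            # grid points on each side of zero
        stage = []
        for t, r in zip(target, recon):
            # Residual still unexplained, normalized to [-1, 1].
            resid = max(-1.0, min(1.0, (t - r) / scale))
            # Round onto the stage's grid, then map back to the original scale.
            stage.append(round(resid * half) / half * scale)
        recon = [r + q for r, q in zip(recon, stage)]
        stages.append(stage)
        scale /= (levels - 1)              # worst-case leftover error after this stage
    return stages, recon
```

In this sketch the first stage alone gives a coarse reconstruction, and each later stage adds a smaller correction, which is the sense in which a residual quantizer separates coarse from fine detail.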
Peer Reviews
Decision: ICLR 2026 Poster
1. Hierarchical Finite Scalar Quantization (HFSQ) provides a stable, continuous multi-scale latent that disentangles coarse global posture from fine local articulations, enabling high-fidelity, layerwise control of motion detail. 2. Blockwise Local Context (BLC) provides a stable, continuous multi-scale latent that disentangles coarse global posture from fine local articulations, enabling high-fidelity, layerwise control of motion detail. 3. The paper is well-organized with a clear structure a
1. The HFSQ design does not specify the number of residual stages R, leaving unclear the representational capacity and the coarse–fine trade-off, which hinders reproducibility and complicates tuning of per-stage guidance weights s_r. 2. The BLC training/inference protocol is underspecified (e.g., whether sliding-window stride m can cross T), so it is unclear how periodic causal masking affects cross‑block continuity.
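For readers unfamiliar with per-stage guidance weights, the following sketch shows one way such weights s_r could enter a classifier-free guidance step. It assumes the standard guidance formula applied independently to each stage of a multi-stage latent; whether the paper's layered guidance works this way is not stated in these excerpts. The function name `layered_cfg` and its argument layout are hypothetical.

```python
def layered_cfg(cond, uncond, weights):
    """Apply classifier-free guidance with a separate weight per stage.

    cond, uncond: lists of per-stage predictions (each a list of floats).
    weights: one guidance weight s_r per stage (assumed layout).
    """
    if not (len(cond) == len(uncond) == len(weights)):
        raise ValueError("cond, uncond and weights need one entry per stage")
    guided = []
    for c_stage, u_stage, s in zip(cond, uncond, weights):
        # Standard CFG: uncond + s * (cond - uncond), applied to this stage only.
        guided.append([u + s * (c - u) for c, u in zip(c_stage, u_stage)])
    return guided
```

With a weight of 1 a stage reduces to the conditional prediction, and with a weight of 0 to the unconditional one, so different stages can be pushed toward the condition by different amounts.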
**Technical Comments:** Overall, this paper is quite novel in its design and methodology. 1. The paper identifies the limitations of FSQ in motion representation space and explores hierarchical and residual variants of FSQ for richer motion encoding. 2. Instead of following the conventional Quantize → GPT / autoregressive pipeline, the paper investigates a diffusion model targeting the dequantized FSQ features. This is particularly interesting because it implicitly raises an important quest
1. Although the Discussion section provides some interpretation of LDCFG, the explanation remains insufficiently clear. For example, how is {s_coarse,s_fine} sampled during training? What exactly is defined as the coarse part and what as the fine part? 2. As mentioned in the Strengths section, the authors found that both HFSQ and PM contribute positively to the subsequent diffusion process. However, no further exploration was conducted. For instance, how would a VAE perform as the diffusion en
1. The proposed HFSQ is technically sound for fine-grained motion generation. 2. The proposed BLC seems to effectively solve the long-term generation issue with improved sampling efficiency. 3. The method shows impressive results according to the quantitative metrics and the video provided in the supplementary material.
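The reviews mention a dense sliding window and periodic causal masking without giving the exact protocol. As a rough illustration only, the sketch below assumes that training windows of a fixed block length are taken at every stride, and that attention is causal within a block and blocked across block boundaries. The helper names `sliding_windows` and `periodic_causal_mask` and the parameters `block_len` and `stride` are hypothetical.

```python
def sliding_windows(num_frames, block_len, stride=1):
    """(start, end) frame indices of every training window; dense when stride == 1."""
    return [(s, s + block_len)
            for s in range(0, num_frames - block_len + 1, stride)]

def periodic_causal_mask(num_frames, block_len):
    """mask[i][j] is True when frame i may attend to frame j.

    Assumed rule: j must be in the same block as i and must not lie in the future.
    """
    return [[(i // block_len == j // block_len) and j <= i
             for j in range(num_frames)]
            for i in range(num_frames)]
```

Under these assumptions each block can be sampled independently and therefore in parallel, which is consistent with the efficiency claim but leaves open the reviewers' question of how continuity across blocks is obtained.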
1. It is still unclear to me why continuity is ensured in parallel inference. Why is it that the model "implicitly learns how to naturally begin and end motion phrases from any temporal point"? How does this relate to the dense sliding window? Could the authors explain this further? 2. To provide a more robust assessment and validate the effectiveness of the proposed kinematic loss, physical plausibility metrics such as PFC [1] should be added. 3. I think HFSQ is proved to be effective for genera
Taxonomy
Topics: Human Motion and Animation · Human Pose and Action Recognition · Music Technology and Sound Studies
Methods: Diffusion
