How Animals Dance (When You're Not Looking)
Xiaojuan Wang, Aleksander Holynski, Brian Curless, Ira Kemelmacher, Steve Seitz

TL;DR
This paper introduces a novel framework for generating animal dance videos synchronized with music, utilizing choreography patterns, keyframe estimation, and diffusion models to produce realistic, long-duration dance sequences.
Contribution
It presents a new high-level control signal for dance video synthesis through choreography patterns and a method for automatic keyframe estimation from human dance videos.
Findings
Can generate up to 30 seconds of animal dance videos from six keyframes
Works across various animals and music tracks
Uses graph optimization and diffusion models for realistic synthesis
Abstract
We present a framework for generating music-synchronized, choreography-aware animal dance videos. Our framework introduces choreography patterns -- structured sequences of motion beats that define the long-range structure of a dance -- as a novel high-level control signal for dance video generation. These patterns can be automatically estimated from human dance videos. Starting from a few keyframes representing distinct animal poses, generated via text-to-image prompting or GPT-4o, we formulate dance synthesis as a graph optimization problem that seeks the optimal keyframe structure to satisfy a specified choreography pattern of beats. We also introduce an approach for mirrored pose image generation, essential for capturing symmetry in dance. In-between frames are synthesized using a video diffusion model. With as few as six input keyframes, our method can produce up to 30 seconds…
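To make the "graph optimization" idea in the abstract concrete, here is a minimal toy sketch in Python. It is not the authors' formulation: the function name `assign_keyframes`, the label-string encoding of a choreography pattern (e.g. "ABAB"), the pairwise transition-cost matrix, and the brute-force search are all assumptions chosen for illustration. The sketch treats keyframes as graph nodes and picks, for each distinct beat label, a distinct keyframe so that the summed transition cost between consecutive beats is as small as possible.

```python
from itertools import permutations


def assign_keyframes(pattern, cost):
    """Toy keyframe assignment (hypothetical, not the paper's method).

    pattern: string of beat labels, e.g. "ABAB"; equal labels reuse a pose.
    cost:    square matrix, cost[i][j] = assumed cost of moving from
             keyframe i to keyframe j (e.g. a pose or flow distance).
    Returns (mapping from label to keyframe index, total transition cost).
    """
    labels = sorted(set(pattern))
    n_keyframes = len(cost)
    if len(labels) > n_keyframes:
        raise ValueError("pattern needs more distinct poses than keyframes")

    best_total, best_map = float("inf"), None
    # Exhaustive search is fine for a handful of keyframes (six in the abstract).
    for perm in permutations(range(n_keyframes), len(labels)):
        mapping = dict(zip(labels, perm))
        total = sum(
            cost[mapping[a]][mapping[b]]
            for a, b in zip(pattern, pattern[1:])
        )
        if total < best_total:
            best_total, best_map = total, mapping
    return best_map, best_total


# Illustrative, made-up costs between three keyframes.
toy_cost = [
    [0, 5, 1],
    [5, 0, 4],
    [1, 4, 0],
]
mapping, total = assign_keyframes("ABAB", toy_cost)
sequence = [mapping[label] for label in "ABAB"]
print(mapping, total, sequence)
```

In this toy setting the pattern "ABAB" alternates between the two keyframes with the cheapest mutual transition. A real system would also need music-beat timing and a video model to synthesize the in-between frames, which this sketch leaves out.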
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
The paper presents a creative application of "dancing animals" and addresses a practical limitation of current models, which lack intuitive controls for creating coherent, long-duration motion. Decomposing the complex task into a graph optimization problem is an effective strategy. It decouples the high-level temporal structure (the choreography) from the low-level frame generation (the diffusion model), making the overall problem more tractable. The paper also includes a well-designed percep
1. The core innovation in "choreography patterns" for long-range control has been previously explored in the 3D skeleton animation domain (e.g., ChoreoMaster). The main contribution here appears to be a combination of "automated choreography extraction from video" (also a relatively established task) with "keyframe-based video generation", which lacks a fundamental methodological breakthrough. 2. The demo reveals less fluid generated motion compared to the baseline with a distinct "pose-to-pose
1. The idea of first generating keyframes and then arranging dance units according to the music through a graph-based warping strategy is quite interesting. This design helps maintain stable image quality while producing coherent video sequences. 2. The writing is clear and easy to follow, and the presented demos are very cute and engaging. 3. The experimental results are comprehensive and well-presented.
1. From the results, the generated dances still appear limited to humanoid structures and behaviors. However, this paper claims to focus on animals. Why, then, does the framework rely on a human-based model (SMPL)? Can it be applied to non-humanoid species such as fish, snakes, reptiles (e.g., dinosaurs), or spiders? This leads to a more fundamental question: where does this framework actually generalize? If the intended motion domain remains largely human-like, why not directly adopt existing m
* Combining animal motion generation, musical beat synchronization, and keyframe optimization is original and creative. * The mirrored pose generation step addresses a real aesthetic issue in animal dancing (natural symmetry).
* The mirrored pose generation step makes the generated results look unrealistic. * I have watched the demos on the website, and I find that the generated results look too stiff and lack detail.
1. I think it is overall a nicely written graphics paper with a clear and concise introduction. The combined use of dynamic time warping, pose clustering, RAFT, graph optimization and generative video models shows that the authors have an in-depth understanding of the graphics pipeline and of both traditional and newly emerging generative approaches. 2. The authors are honest about limitations and future work directions. They discuss the failure cases (motion artifacts from video diffusion, background in
1. The motion and animation quality is not ideal. There are visually noticeable artifacts and jumps in the motion. While it may not be fair to compare individual researchers to industrial product teams, which have far bigger budgets and resources, I think that in terms of visual quality and motion smoothness for music-conditioned video generation, this demo is not as good as the online entertainment applications released on social media platforms like TikTok. 2. The keyframe formul
Taxonomy
Topics: Human Motion and Animation · Generative Adversarial Networks and Image Synthesis · Music Technology and Sound Studies
Methods: Diffusion · Attentive Walk-Aggregating Graph Neural Network
