TL;DR
DuetGen is a hierarchical masked modeling framework that generates synchronized two-person dances from music by encoding motions into discrete tokens and using transformers to produce realistic, interactive dance sequences.
Contribution
It introduces a novel hierarchical token-based approach with masked transformers for music-driven two-person dance generation, capturing complex interactions effectively.
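The masked-modeling idea named above can be illustrated with a minimal sketch of the generic corruption step. It assumes the standard recipe of hiding a random subset of discrete tokens and training a model to recover them; the summary does not describe DuetGen's actual masking schedule, music conditioning, or transformer. `MASK_ID`, `mask_tokens`, and the 50% ratio are hypothetical names and values chosen for illustration.

```python
import random

MASK_ID = -1  # hypothetical sentinel for a masked position (not from the paper)

def mask_tokens(tokens, mask_ratio, rng):
    """Hide a random subset of tokens and return (masked sequence, targets).

    `targets` maps each masked position to the original token, which is what
    a masked model would be trained to predict at that position.
    """
    n_mask = max(1, round(mask_ratio * len(tokens)))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    for pos in positions:
        masked[pos] = MASK_ID
    targets = {pos: tokens[pos] for pos in positions}
    return masked, targets

# Toy motion-token sequence; the values are arbitrary codebook indices.
motion_tokens = [5, 2, 7, 7, 1, 0, 3, 4]
masked, targets = mask_tokens(motion_tokens, mask_ratio=0.5, rng=random.Random(0))
```

In a full system, a transformer would take `masked` (plus music features) as input and be trained to predict the entries of `targets`; that part is omitted here.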
Findings
Achieves state-of-the-art motion realism and synchronization.
Effectively models intricate partner interactions.
Demonstrates versatility across various dance genres.
Abstract
We present DuetGen, a novel framework for generating interactive two-person dances from music. The key challenge of this task lies in the inherent complexities of two-person dance interactions, where the partners need to synchronize both with each other and with the music. Inspired by recent advances in motion synthesis, we propose a two-stage solution: encoding two-person motions into discrete tokens and then generating these tokens from music. To effectively capture intricate interactions, we represent both dancers' motions as a unified whole to learn the necessary motion tokens, and adopt a coarse-to-fine learning strategy in both stages. Our first stage utilizes a VQ-VAE that hierarchically separates high-level semantic features at a coarse temporal resolution from low-level details at a finer resolution, producing two discrete token sequences at different abstraction…
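The first-stage idea (one joint representation for both dancers, quantized into a coarse and a fine token sequence) can be sketched as a toy example. This is a sketch under stated assumptions, not the paper's implementation: average pooling stands in for a learned encoder, the codebooks are random rather than trained, and quantizing the residual left after the coarse codes is one possible way to separate detail from coarse structure. All function names, strides, and codebook sizes are hypothetical.

```python
import random

def nearest_code(vec, codebook):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    best_idx, best_dist = 0, float("inf")
    for idx, code in enumerate(codebook):
        dist = sum((v - c) ** 2 for v, c in zip(vec, code))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx

def average_pool(frames, stride):
    """Downsample in time by averaging non-overlapping windows of `stride` frames."""
    dim = len(frames[0])
    pooled = []
    for start in range(0, len(frames) - stride + 1, stride):
        window = frames[start:start + stride]
        pooled.append([sum(f[d] for f in window) / stride for d in range(dim)])
    return pooled

def hierarchical_tokens(dancer_a, dancer_b, coarse_book, fine_book,
                        coarse_stride=4, fine_stride=2):
    """Quantize a joint two-dancer motion into coarse and fine token sequences."""
    # Treat both dancers as one unified motion: concatenate features per frame.
    joint = [a + b for a, b in zip(dancer_a, dancer_b)]

    # Coarse level: heavy temporal downsampling, then nearest-code lookup.
    coarse = [nearest_code(f, coarse_book)
              for f in average_pool(joint, coarse_stride)]

    # Fine level (assumption): quantize what the coarse codes fail to explain.
    residual = []
    for t, frame in enumerate(joint):
        code = coarse_book[coarse[min(t // coarse_stride, len(coarse) - 1)]]
        residual.append([x - c for x, c in zip(frame, code)])
    fine = [nearest_code(f, fine_book)
            for f in average_pool(residual, fine_stride)]
    return coarse, fine

# Toy data: 16 frames, 2 features per dancer, random (untrained) codebooks.
rng = random.Random(0)

def random_vectors(count, dim):
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(count)]

dancer_a = random_vectors(16, 2)
dancer_b = random_vectors(16, 2)
coarse_book = random_vectors(8, 4)   # 8 codes over the 4-dim joint feature
fine_book = random_vectors(32, 4)    # 32 codes for the finer level
coarse_tokens, fine_tokens = hierarchical_tokens(
    dancer_a, dancer_b, coarse_book, fine_book)
```

With these strides, 16 frames yield 4 coarse tokens and 8 fine tokens, which mirrors the "two discrete token sequences" at different temporal resolutions described in the abstract.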
Taxonomy
Methods: VQ-VAE · ADOPT (ADaptive gradient method with the OPTimal convergence rate)
