Mitigating Error Accumulation in Co-Speech Motion Generation via Global Rotation Diffusion and Multi-Level Constraints
Xiangyue Zhang, Jianfang Li, Jianqiang Ren, Jiaxu Zhang

TL;DR
This paper introduces GlobalDiff, a diffusion-based framework that generates co-speech motion directly in global joint rotation space, reducing hierarchical errors and improving motion quality with multi-level constraints.
Contribution
The work is the first to apply diffusion models directly to global joint rotations for co-speech motion, and it introduces multi-level constraints to maintain structural and temporal consistency.
Findings
Achieves a 46.0% improvement over the state of the art (SOTA) in motion accuracy.
Generates smoother and more stable co-speech motions.
Effectively reduces error accumulation in hierarchical motion generation.
Abstract
Reliable co-speech motion generation requires precise motion representation and consistent structural priors across all joints. Existing generative methods typically operate on local joint rotations, which are defined hierarchically based on the skeleton structure. This leads to cumulative errors during generation, manifesting as unstable and implausible motions at end-effectors. In this work, we propose GlobalDiff, a diffusion-based framework that operates directly in the space of global joint rotations for the first time, fundamentally decoupling each joint's prediction from upstream dependencies and alleviating hierarchical error accumulation. To compensate for the absence of structural priors in global rotation space, we introduce a multi-level constraint scheme. Specifically, a joint structure constraint introduces virtual anchor points around each joint to better capture…
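To make the error-accumulation argument concrete, the following is a minimal sketch and not the paper's implementation. It assumes a hypothetical four-joint chain (shoulder, elbow, wrist, finger), rotations about a single axis, and the same small prediction error `eps` at every joint. The helper names `rot_z`, `local_to_global`, and `geodesic_angle` are illustrative only. It compares the error in each joint's global orientation when the error is made in local rotation space (and then composed down the chain) with the error when each global rotation is predicted directly.

```python
import numpy as np

def rot_z(theta):
    """3x3 rotation matrix about the z axis by theta radians."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def local_to_global(local_rots, parents):
    """Compose local rotations down the tree: G_j = G_parent(j) @ L_j.

    Assumes parents are listed before their children; a parent index
    of -1 marks the root.
    """
    global_rots = []
    for j, p in enumerate(parents):
        if p < 0:
            global_rots.append(local_rots[j])
        else:
            global_rots.append(global_rots[p] @ local_rots[j])
    return global_rots

def geodesic_angle(Ra, Rb):
    """Angle in radians of the relative rotation between Ra and Rb."""
    cos = (np.trace(Ra.T @ Rb) - 1.0) / 2.0
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))

# Hypothetical chain: shoulder -> elbow -> wrist -> finger.
parents = [-1, 0, 1, 2]
local_true = [rot_z(a) for a in (0.3, 0.5, -0.2, 0.1)]
global_true = local_to_global(local_true, parents)

eps = 0.05  # assumed per-joint prediction error, in radians

# Case 1: the error is made on every LOCAL rotation, then composed.
local_pred = [R @ rot_z(eps) for R in local_true]
global_from_local = local_to_global(local_pred, parents)
errors_local = [geodesic_angle(a, b)
                for a, b in zip(global_true, global_from_local)]

# Case 2: the same error is made directly on every GLOBAL rotation.
global_pred = [R @ rot_z(eps) for R in global_true]
errors_global = [geodesic_angle(a, b)
                 for a, b in zip(global_true, global_pred)]

print("local-space errors: ", np.round(errors_local, 3))
print("global-space errors:", np.round(errors_global, 3))
```

Because every rotation in this toy example shares one axis, the local-space errors add exactly and grow as eps, 2·eps, 3·eps, 4·eps toward the end of the chain, while the global-space errors stay at eps for every joint. With arbitrary 3D rotations the growth is not exactly linear, but the end-effector still inherits every upstream error, which is the dependency that predicting global rotations removes.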
Taxonomy
Topics: Phonetics and Phonology Research · Human Motion and Animation · Speech and Audio Processing
