SmoothSync: Dual-Stream Diffusion Transformers for Jitter-Robust Beat-Synchronized Gesture Generation from Quantized Audio
Yujiao Jiang, Qingmin Liao, Zongqing Lu

TL;DR
SmoothSync introduces a dual-stream diffusion transformer framework utilizing quantized audio tokens to generate synchronized, smooth, and diverse co-speech gestures, effectively reducing jitter and foot sliding issues.
Contribution
The paper proposes a novel dual-stream diffusion transformer architecture with jitter-suppression and probabilistic quantization for improved gesture synchronization and diversity.
Findings
Outperforms the state of the art on BEAT2, reducing FGD by 30.6%
Reduces jitter and foot sliding by over 60%
Enhances gesture diversity by 8.4%
Abstract
Co-speech gesture generation is a critical area of research aimed at synthesizing speech-synchronized, human-like gestures. Existing methods often suffer from rhythmic inconsistency, motion jitter, foot sliding, and limited multi-sampling diversity. In this paper, we present SmoothSync, a novel framework that leverages quantized audio tokens in a dual-stream Diffusion Transformer (DiT) architecture to synthesize holistic gestures and enhance sampling variation. Specifically, we (1) fuse audio-motion features via complementary transformer streams to achieve superior synchronization, (2) introduce a jitter-suppression loss to improve temporal smoothness, and (3) implement probabilistic audio quantization to generate distinct gesture sequences from identical inputs. To reliably evaluate beat synchronization under jitter, we introduce Smooth-BC, a robust variant of the beat…
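The abstract names a jitter-suppression loss and probabilistic audio quantization without giving their formulas. The sketch below is only an illustration of the general ideas, not the paper's method: it assumes jitter can be penalized through the mean squared third-order finite difference (jerk) of a joint trajectory, and that a codebook index can be sampled with probability proportional to exp(-distance / temperature) instead of always taking the nearest code. The names `jerk_penalty`, `sample_code`, `dt`, and `temperature` are hypothetical.

```python
import math
import random


def jerk_penalty(positions, dt=1.0):
    """Mean squared third-order finite difference of a 1-D trajectory.

    Assumption: jerk is used here as a stand-in for "jitter"; the paper's
    actual loss may be defined differently.
    """
    if len(positions) < 4:
        return 0.0
    jerks = [
        (positions[t + 3] - 3 * positions[t + 2]
         + 3 * positions[t + 1] - positions[t]) / dt ** 3
        for t in range(len(positions) - 3)
    ]
    return sum(j * j for j in jerks) / len(jerks)


def sample_code(distances, temperature=1.0, rng=random):
    """Sample a codebook index with probability ~ exp(-distance / temperature).

    Assumption: a softmax over negative distances; deterministic quantization
    would instead return the index of the smallest distance.
    """
    smallest = min(distances)  # subtract the minimum for numerical stability
    weights = [math.exp(-(d - smallest) / temperature) for d in distances]
    return rng.choices(range(len(distances)), weights=weights, k=1)[0]


# A smooth (quadratic) trajectory versus the same trajectory with alternating noise.
smooth = [t * t for t in range(10)]
jittery = [p + (0.5 if t % 2 else -0.5) for t, p in enumerate(smooth)]
print(jerk_penalty(smooth), jerk_penalty(jittery))

# Repeated sampling over the same distances can yield different codes.
rng = random.Random(0)
print([sample_code([0.1, 0.2, 0.9], temperature=0.5, rng=rng) for _ in range(5)])
```

Under these assumptions, a lower temperature concentrates sampling on the nearest code, while a higher one spreads it across the codebook, which is one way identical audio could map to different token sequences.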
Taxonomy
Topics: Music Technology and Sound Studies · Human Motion and Animation · Interactive and Immersive Displays
