SmoothSync: Dual-Stream Diffusion Transformers for Jitter-Robust Beat-Synchronized Gesture Generation from Quantized Audio

Yujiao Jiang; Qingmin Liao; Zongqing Lu

arXiv:2601.04236 · cs.SD · January 9, 2026

TL;DR

SmoothSync is a dual-stream diffusion transformer framework that uses quantized audio tokens to generate synchronized, smooth, and diverse co-speech gestures while reducing jitter and foot sliding.

Contribution

The paper proposes a dual-stream diffusion transformer architecture, combined with a jitter-suppression loss and probabilistic audio quantization, to improve gesture synchronization and diversity.

Findings

1. Lowers FGD (Fréchet Gesture Distance) by 30.6% relative to the state of the art on BEAT2
2. Reduces jitter and foot sliding by over 60%
3. Increases gesture diversity by 8.4%

Abstract

Co-speech gesture generation is a critical area of research aimed at synthesizing human-like gestures that are synchronized with speech. Existing methods often suffer from rhythmic inconsistency, motion jitter, foot sliding, and limited diversity across multiple samples. In this paper, we present SmoothSync, a novel framework that leverages quantized audio tokens in a dual-stream Diffusion Transformer (DiT) architecture to synthesize holistic gestures and enhance sampling variation. Specifically, we (1) fuse audio-motion features via complementary transformer streams to achieve superior synchronization, (2) introduce a jitter-suppression loss to improve temporal smoothness, and (3) implement probabilistic audio quantization to generate distinct gesture sequences from identical inputs. To reliably evaluate beat synchronization under jitter, we introduce Smooth-BC, a robust variant of the beat…
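
The abstract names a jitter-suppression loss and probabilistic audio quantization but does not give their exact form. The sketch below is only an illustration under stated assumptions, not the authors' implementation: jitter is taken to be the mean magnitude of the third time-derivative (jerk) of joint positions, and probabilistic quantization is taken to mean sampling a codebook index from a softmax over negative squared distances. The names `jitter_penalty`, `sample_tokens`, `fps`, and `temperature` are hypothetical.

```python
import numpy as np


def jitter_penalty(motion, fps=30.0):
    """Mean jerk magnitude of joint positions (assumed jitter definition).

    motion: array of shape (T, J, 3) with T >= 4 frames and J joints.
    """
    motion = np.asarray(motion, dtype=float)
    # Third-order finite difference along time approximates the jerk.
    jerk = np.diff(motion, n=3, axis=0) * fps**3  # shape (T-3, J, 3)
    return float(np.linalg.norm(jerk, axis=-1).mean())


def sample_tokens(features, codebook, temperature=1.0, rng=None):
    """Sample one codebook index per frame instead of taking the nearest.

    features: (T, D) audio features; codebook: (K, D) code vectors.
    Assumed scheme: softmax over negative squared distances.
    """
    rng = np.random.default_rng() if rng is None else rng
    features = np.asarray(features, dtype=float)
    codebook = np.asarray(codebook, dtype=float)
    # Squared Euclidean distance from every frame to every code: (T, K).
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    logits = -d2 / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    return np.array([rng.choice(len(codebook), p=p) for p in probs])
```

Under these assumptions, a constant-velocity motion has zero penalty, and lowering `temperature` toward zero recovers ordinary nearest-neighbor quantization, while higher values let identical audio features map to different token sequences across samples.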

Taxonomy

Topics: Music Technology and Sound Studies · Human Motion and Animation · Interactive and Immersive Displays