CoordSpeaker: Exploiting Gesture Captioning for Coordinated Caption-Empowered Co-Speech Gesture Generation
Fengyi Fang, Sicheng Yang, Wenming Yang

TL;DR
CoordSpeaker introduces a novel framework that uses captioning and diffusion models to generate coordinated, semantically coherent gestures synchronized with speech, addressing key challenges in multimodal gesture synthesis.
Contribution
The paper pioneers the use of gesture captioning and bidirectional gesture-text mapping for co-speech gesture generation, bridging the semantic prior gap and enabling coordinated, caption-controlled gestures that stay synchronized with speech.
Findings
Produces high-quality, rhythmically synchronized gestures
Achieves superior performance and efficiency over existing methods
Demonstrates effective semantic coherence with arbitrary captions
Abstract
Co-speech gesture generation has significantly advanced human-computer interaction, yet speaker movements remain constrained due to the omission of text-driven non-spontaneous gestures (e.g., bowing while talking). Existing methods face two key challenges: 1) the semantic prior gap due to the lack of descriptive text annotations in gesture datasets, and 2) the difficulty in achieving coordinated multimodal control over gesture generation. To address these challenges, this paper introduces CoordSpeaker, a comprehensive framework that enables coordinated caption-empowered co-speech gesture synthesis. Our approach first bridges the semantic prior gap through a novel gesture captioning framework, leveraging a motion-language model to generate descriptive captions at multiple granularities. Building upon this, we propose a conditional latent diffusion model with unified cross-dataset motion…
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Motion and Animation · Speech and dialogue systems
