CoordSpeaker: Exploiting Gesture Captioning for Coordinated Caption-Empowered Co-Speech Gesture Generation

Fengyi Fang; Sicheng Yang; Wenming Yang

arXiv:2511.22863 · cs.CV · December 1, 2025

Open Access

TL;DR

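CoordSpeaker is a framework that combines gesture captioning with a conditional latent diffusion model to generate co-speech gestures that are synchronized with speech and semantically consistent with text captions. It targets two problems in gesture synthesis: the lack of descriptive text annotations in gesture datasets, and the difficulty of coordinated multimodal control.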

Contribution

CoordSpeaker pioneers the use of gesture captioning and bidirectional gesture-text mapping to improve gesture generation, bridging the semantic prior gap and enabling controlled, speech-synchronized gestures.

Findings

01. Produces high-quality, rhythmically synchronized gestures

02. Achieves superior performance and efficiency over existing methods

03. Demonstrates effective semantic coherence with arbitrary captions

Abstract

Co-speech gesture generation has significantly advanced human-computer interaction, yet speaker movements remain constrained due to the omission of text-driven non-spontaneous gestures (e.g., bowing while talking). Existing methods face two key challenges: 1) the semantic prior gap due to the lack of descriptive text annotations in gesture datasets, and 2) the difficulty in achieving coordinated multimodal control over gesture generation. To address these challenges, this paper introduces CoordSpeaker, a comprehensive framework that enables coordinated caption-empowered co-speech gesture synthesis. Our approach first bridges the semantic prior gap through a novel gesture captioning framework, leveraging a motion-language model to generate descriptive captions at multiple granularities. Building upon this, we propose a conditional latent diffusion model with unified cross-dataset motion…

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Motion and Animation · Speech and dialogue systems