Motion-example-controlled Co-speech Gesture Generation Leveraging Large Language Models

Bohong Chen; Yumeng Li; Youyi Zheng; Yao-Xiang Ding; Kun Zhou

arXiv:2507.20220 · cs.CV · July 29, 2025

Open Access

TL;DR

This paper introduces MECo, a novel framework that leverages large language models to generate co-speech gestures controlled by motion examples, preserving detailed motion characteristics and enabling diverse, granular, multimodal inputs.

Contribution

The paper presents an LLM-based approach to motion-example-controlled gesture generation that replaces pseudo-labeling with explicit query-based guidance from the motion examples themselves, and reports state-of-the-art results.

Findings

01 State-of-the-art performance on Fréchet Gesture Distance (FGD), diversity, and similarity metrics

02 Supports granular control of individual body parts

03 Handles multiple input modalities, including text and video
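
For readers unfamiliar with the FGD metric named in the first finding, the following is a minimal sketch of the Fréchet distance between two sets of feature vectors, each modeled as a Gaussian. It assumes the gesture features have already been extracted by some motion encoder. The function name `frechet_distance` and the array layout (one row per clip) are illustrative assumptions, not the paper's evaluation code.

```python
import numpy as np


def frechet_distance(real_feats, gen_feats):
    """Frechet distance between Gaussians fitted to two feature sets.

    Both inputs are arrays of shape (num_samples, feature_dim).
    The feature extractor itself is assumed to exist elsewhere.
    """
    mu_r = real_feats.mean(axis=0)
    mu_g = gen_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)

    # Tr((cov_r cov_g)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_r @ cov_g, which are real and non-negative for
    # covariance matrices; clip tiny negative values from round-off.
    eigvals = np.linalg.eigvals(cov_r @ cov_g).real
    trace_sqrt = np.sqrt(np.clip(eigvals, 0.0, None)).sum()

    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r) + np.trace(cov_g) - 2.0 * trace_sqrt)
```

Lower values mean the generated feature distribution is closer to the real one. Identical feature sets give a distance of approximately zero.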

Abstract

The automatic generation of controllable co-speech gestures has recently gained growing attention. While existing systems typically achieve gesture control through predefined categorical labels or implicit pseudo-labels derived from motion examples, these approaches often compromise the rich details present in the original motion examples. We present MECo, a framework for motion-example-controlled co-speech gesture generation by leveraging large language models (LLMs). Our method capitalizes on LLMs' comprehension capabilities through fine-tuning to simultaneously interpret speech audio and motion examples, enabling the synthesis of gestures that preserve example-specific characteristics while maintaining speech congruence. Departing from conventional pseudo-labeling paradigms, we position motion examples as explicit query contexts within the prompt structure to guide gesture…

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Motion and Animation · Speech and dialogue systems