Motion-example-controlled Co-speech Gesture Generation Leveraging Large Language Models
Bohong Chen, Yumeng Li, Youyi Zheng, Yao-Xiang Ding, Kun Zhou

TL;DR
This paper introduces MECo, a framework that uses large language models to generate co-speech gestures controlled by motion examples. It preserves the detailed characteristics of those examples while supporting control of individual body parts and multimodal inputs such as text and video.
Contribution
The paper presents an LLM-based approach to motion-example-controlled gesture generation. Instead of deriving pseudo-labels from motion examples, it places the examples in the prompt as explicit query contexts, and it reports state-of-the-art results on FGD, diversity, and similarity metrics.
Findings
State-of-the-art performance on Fréchet Gesture Distance (FGD), diversity, and similarity metrics
Supports granular control of individual body parts
Handles multiple input modalities including text and video
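For context on the first finding: FGD is commonly defined as the Fréchet distance between two Gaussians fitted to feature embeddings of real and generated gestures. The formula below is that standard definition, not something stated in this summary, and the symbols are assumptions: \mu_r, \Sigma_r are the mean and covariance of real-gesture features, and \mu_g, \Sigma_g are those of generated-gesture features. Lower values indicate generated motion that is distributionally closer to real motion.

```latex
\mathrm{FGD} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```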
Abstract
The automatic generation of controllable co-speech gestures has recently gained growing attention. While existing systems typically achieve gesture control through predefined categorical labels or implicit pseudo-labels derived from motion examples, these approaches often compromise the rich details present in the original motion examples. We present MECo, a framework for motion-example-controlled co-speech gesture generation by leveraging large language models (LLMs). Our method capitalizes on LLMs' comprehension capabilities through fine-tuning to simultaneously interpret speech audio and motion examples, enabling the synthesis of gestures that preserve example-specific characteristics while maintaining speech congruence. Departing from conventional pseudo-labeling paradigms, we position motion examples as explicit query contexts within the prompt structure to guide gesture…
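The idea of placing a motion example in the prompt as explicit context, rather than compressing it into a pseudo-label, can be illustrated with a minimal sketch. Everything here is hypothetical: the special tokens, the assumption that motion and audio are already quantized into integer codes, and the `build_prompt` helper are illustrative only and are not taken from the paper.

```python
from typing import List

# Hypothetical delimiter tokens; the paper's actual vocabulary is not given here.
EXAMPLE_START, EXAMPLE_END = "<motion_example>", "</motion_example>"
SPEECH_START, SPEECH_END = "<speech>", "</speech>"
GESTURE_START = "<gesture>"


def build_prompt(example_codes: List[int], audio_codes: List[int]) -> List[str]:
    """Assemble a token sequence with the motion example as explicit context.

    Assumes (for illustration) that motion and audio were already quantized
    into discrete integer codes by separate tokenizers.
    """
    example_tokens = [f"<m{c}>" for c in example_codes]
    audio_tokens = [f"<a{c}>" for c in audio_codes]
    # The example stays in the prompt token by token, so no detail is
    # discarded the way it would be if it were reduced to a single label.
    return [
        EXAMPLE_START, *example_tokens, EXAMPLE_END,
        SPEECH_START, *audio_tokens, SPEECH_END,
        GESTURE_START,  # a fine-tuned model would continue generating from here
    ]


prompt = build_prompt(example_codes=[12, 7, 7], audio_codes=[3, 41])
print(" ".join(prompt))
```

In this sketch the model would be asked to continue the sequence after the final `<gesture>` token, conditioned on both the full example and the speech tokens.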
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Motion and Animation · Speech and dialogue systems
