LLM Gesticulator: Leveraging Large Language Models for Scalable and Controllable Co-Speech Gesture Synthesis

Haozhou Pang; Tianwei Ding; Lanshan He; Ming Tao; Lu Zhang; Qi Gan

arXiv:2410.10851·cs.GR·October 23, 2024

TL;DR

This paper introduces LLM Gesticulator, a scalable, controllable framework that uses large language models to generate natural, full-body co-speech gestures synchronized with audio, outperforming previous methods.

Contribution

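To the best of the authors' knowledge, it is the first work to apply large language models to co-speech gesture synthesis. It demonstrates scalability with the size of the backbone LLM and controllability of gesture content and style through text prompts.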

Findings

01 Model performance improves proportionally with LLM size.

02 Framework achieves natural, rhythmically aligned gestures.

03 Outperforms prior methods in objective metrics and user studies.

Abstract

In this work, we present LLM Gesticulator, an LLM-based, audio-driven co-speech gesture generation framework that synthesizes full-body animations that are rhythmically aligned with the input audio while exhibiting natural movements and editability. Compared to previous work, our model demonstrates substantial scalability: as the size of the backbone LLM increases, our framework shows proportional improvements in evaluation metrics (i.e., it follows a scaling law). Our method also exhibits strong controllability: the content and style of the generated gestures can be controlled by a text prompt. To the best of our knowledge, LLM Gesticulator is the first work that uses an LLM for the co-speech gesture generation task. Evaluation with existing objective metrics and user studies indicates that our framework outperforms prior work.

Taxonomy

Topics: Speech and dialogue systems · Speech Recognition and Synthesis · Natural Language Processing Techniques