FastTalker: Jointly Generating Speech and Conversational Gestures from Text

Zixin Guo; Jian Zhang

arXiv:2409.16404·cs.MM·September 26, 2024

Open Access

TL;DR

FastTalker is a framework that generates synchronized speech and 3D gestures from text in real time by reusing intermediate speech-synthesis features for gesture generation and by optimizing the network architecture.

Contribution

It introduces an end-to-end model that jointly produces speech and gestures, reusing intermediate speech features and applying neural architecture search (NAS) to improve speed and quality.

Findings

01

Reports state-of-the-art performance in both speech synthesis and gesture generation.

02

Generates speech and gestures at 0.17 seconds of inference per second of output on an NVIDIA 3090 GPU.

03

Reuses intermediate speech features for better gesture alignment.
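
The speed figure in finding 02 can be read as a real-time factor. The following is a minimal sketch of that arithmetic, assuming "0.17 seconds per second" means seconds of compute per second of generated output. The function names and the 10-second clip length are hypothetical and chosen only for illustration.

```python
def real_time_factor(compute_seconds: float, output_seconds: float) -> float:
    """Return compute time divided by the duration of the generated output.

    A value below 1.0 means generation runs faster than real time.
    """
    if output_seconds <= 0:
        raise ValueError("output_seconds must be positive")
    return compute_seconds / output_seconds


def compute_time_for(output_seconds: float, rtf: float) -> float:
    """Estimate the compute time needed for a clip at a given real-time factor."""
    return output_seconds * rtf


# Figure quoted in the findings (assumed to be compute seconds per output second).
REPORTED_RTF = 0.17

# Hypothetical clip length, for illustration only.
clip_seconds = 10.0

needed = compute_time_for(clip_seconds, REPORTED_RTF)
speedup = 1.0 / REPORTED_RTF

print(f"compute time for a {clip_seconds:.0f} s clip: {needed:.2f} s")
print(f"speed relative to real time: {speedup:.1f}x")
```

Under this reading, any value below 1.0 leaves headroom for streaming playback, since output is produced faster than it is consumed.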

Abstract

Generating 3D human gestures and speech from a text script is critical for creating realistic talking avatars. One solution is to leverage separate pipelines for text-to-speech (TTS) and speech-to-gesture (STG), but this approach suffers from poor alignment of speech and gestures and slow inference times. In this paper, we introduce FastTalker, an efficient and effective framework that simultaneously generates high-quality speech audio and 3D human gestures at high inference speeds. Our key insight is reusing the intermediate features from speech synthesis for gesture generation, as these features contain more precise rhythmic information than features re-extracted from generated speech. Specifically, 1) we propose an end-to-end framework that concurrently generates speech waveforms and full-body gestures, using intermediate speech features such as pitch, onset, energy, and duration…
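
To make the feature-reuse idea concrete, here is a minimal sketch of how phoneme-level features could be expanded to frame level with predicted durations, in the style of duration-based TTS models, so that a gesture decoder sees the same timing as the speech decoder. This is an illustration under stated assumptions, not the paper's implementation: the function names, the two-value feature layout, and all numbers are hypothetical.

```python
from typing import List, Sequence


def length_regulate(
    phoneme_feats: Sequence[Sequence[float]],
    durations: Sequence[int],
) -> List[List[float]]:
    """Repeat each phoneme-level feature vector for its duration in frames."""
    if len(phoneme_feats) != len(durations):
        raise ValueError("need exactly one duration per phoneme")
    frames: List[List[float]] = []
    for feat, dur in zip(phoneme_feats, durations):
        if dur < 0:
            raise ValueError("durations must be non-negative")
        frames.extend(list(feat) for _ in range(dur))
    return frames


def onset_track(durations: Sequence[int]) -> List[float]:
    """Mark the first frame of each phoneme with 1.0 and all others with 0.0."""
    track: List[float] = []
    for dur in durations:
        if dur > 0:
            track.extend([1.0] + [0.0] * (dur - 1))
    return track


# Hypothetical per-phoneme [pitch, energy] values and durations in frames.
phoneme_feats = [[120.0, 0.8], [180.0, 0.5], [150.0, 0.9]]
durations = [2, 3, 1]

frames = length_regulate(phoneme_feats, durations)
onsets = onset_track(durations)

# Frame-aligned [pitch, energy, onset] rows that a gesture decoder could consume.
gesture_input = [feat + [onset] for feat, onset in zip(frames, onsets)]

for row in gesture_input:
    print(row)
```

Because the timing comes directly from the durations used for synthesis, the onset markers are exact by construction, whereas onsets re-extracted from a generated waveform would have to be estimated.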

Taxonomy

Topics: Speech and dialogue systems · Natural Language Processing Techniques