FastTalker: Jointly Generating Speech and Conversational Gestures from Text
Zixin Guo, Jian Zhang

TL;DR
FastTalker is a framework that generates synchronized speech and 3D gestures from text in real time by reusing intermediate speech-synthesis features for gesture generation and optimizing the model architecture.
Contribution
It introduces an end-to-end model that jointly produces speech and gestures, using intermediate speech features and neural architecture search (NAS) to improve speed and quality.
Findings
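Reports state-of-the-art performance in both gesture generation and speech synthesis.
Generates speech and gestures at 0.17 seconds of processing per second of output on an NVIDIA 3090 GPU.
Reuses intermediate speech features (pitch, onset, energy, duration) for gesture generation, which improves speech-gesture alignment.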
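The 0.17 seconds-per-second figure reads most naturally as a real-time factor, that is, processing time divided by the duration of the generated output. That reading is an assumption, as are the function name and the sample timings in this minimal sketch:

```python
def real_time_factor(processing_seconds: float, output_seconds: float) -> float:
    """Seconds of compute spent per second of generated output."""
    if output_seconds <= 0:
        raise ValueError("output_seconds must be positive")
    return processing_seconds / output_seconds


# Hypothetical timings (not from the paper): 1.7 s of compute for a 10 s clip.
rtf = real_time_factor(1.7, 10.0)

# A real-time factor below 1.0 means generation is faster than playback.
speedup = 1.0 / rtf
```

Under this reading, a value of 0.17 corresponds to generation running a little under six times faster than playback.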
Abstract
Generating 3D human gestures and speech from a text script is critical for creating realistic talking avatars. One solution is to leverage separate pipelines for text-to-speech (TTS) and speech-to-gesture (STG), but this approach suffers from poor alignment of speech and gestures and slow inference times. In this paper, we introduce FastTalker, an efficient and effective framework that simultaneously generates high-quality speech audio and 3D human gestures at high inference speeds. Our key insight is reusing the intermediate features from speech synthesis for gesture generation, as these features contain more precise rhythmic information than features re-extracted from generated speech. Specifically, 1) we propose an end-to-end framework that concurrently generates speech waveforms and full-body gestures, using intermediate speech features such as pitch, onset, energy, and duration…
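The idea of reusing intermediate speech features can be illustrated with a duration-based length regulator of the kind used in FastSpeech-style TTS models. This is only a sketch under stated assumptions: the source does not say that FastTalker uses this exact mechanism, and `length_regulate`, the feature layout, and the sample values are all hypothetical.

```python
from typing import List, Sequence


def length_regulate(
    features: Sequence[Sequence[float]], durations: Sequence[int]
) -> List[List[float]]:
    """Repeat each per-phoneme feature vector for its predicted frame count."""
    if len(features) != len(durations):
        raise ValueError("features and durations must have the same length")
    frames: List[List[float]] = []
    for vec, n_frames in zip(features, durations):
        if n_frames < 0:
            raise ValueError("durations must be non-negative")
        frames.extend(list(vec) for _ in range(n_frames))
    return frames


# Hypothetical per-phoneme features laid out as [pitch, energy, onset].
phoneme_feats = [[120.0, 0.8, 1.0], [135.0, 0.6, 0.0], [110.0, 0.9, 1.0]]
durations = [2, 3, 1]  # predicted frames per phoneme (assumed values)

shared_frames = length_regulate(phoneme_feats, durations)

# Both branches read the same frame-aligned sequence, so speech and gesture
# share one timeline instead of re-extracting rhythm from generated audio.
speech_input = shared_frames
gesture_input = shared_frames
```

Because the gesture branch consumes the same expanded sequence as the speech branch, no feature extraction pass over the synthesized waveform is needed, which is the alignment and speed argument the abstract makes.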
Taxonomy
Topics: Speech and dialogue systems · Natural Language Processing Techniques
