Generating coherent spontaneous speech and gesture from text
Simon Alexanderson, Éva Székely, Gustav Eje Henter, Taras Kucherenko, Jonas Beskow

TL;DR
This paper presents a novel system that jointly generates coherent speech and full-body gestures from text, integrating recent advances in text-to-speech and motion generation for more natural embodied communication.
Contribution
The system is the first to combine speech synthesis and gesture generation trained on spontaneous speech and motion data from the same speaker, with the aim of producing better-synchronized and more realistic output.
Findings
Successful joint generation of speech and gestures from text
Visualization of gesture spaces and alignments
Demonstration of coherent speech-gesture synthesis
Abstract
Embodied human communication encompasses both verbal (speech) and non-verbal information (e.g., gesture and head movements). Recent advances in machine learning have substantially improved the technologies for generating synthetic versions of both of these types of data: On the speech side, text-to-speech systems are now able to generate highly convincing, spontaneous-sounding speech using unscripted speech audio as the source material. On the motion side, probabilistic motion-generation methods can now synthesise vivid and lifelike speech-driven 3D gesticulation. In this paper, we put these two state-of-the-art technologies together in a coherent fashion for the first time. Concretely, we demonstrate a proof-of-concept system trained on a single-speaker audio and motion-capture dataset, that is able to generate both speech and full-body gestures together from text input. In contrast to…
