Generating coherent spontaneous speech and gesture from text

Simon Alexanderson; Éva Székely; Gustav Eje Henter; Taras Kucherenko; Jonas Beskow

arXiv:2101.05684 · cs.LG · January 15, 2021

TL;DR

This paper presents a novel system that jointly generates coherent speech and full-body gestures from text, integrating recent advances in text-to-speech and motion generation for more natural embodied communication.

Contribution

The paper is the first to combine text-to-speech and gesture generation in a single system, with both components trained on the same speaker's spontaneous speech and motion-capture data so that the generated voice and motion are coherent with each other.

Findings

01

Successful joint generation of speech and gestures from text

02

Visualization of gesture spaces and alignments

03

Demonstration of coherent speech-gesture synthesis

Abstract

Embodied human communication encompasses both verbal (speech) and non-verbal information (e.g., gesture and head movements). Recent advances in machine learning have substantially improved the technologies for generating synthetic versions of both of these types of data: On the speech side, text-to-speech systems are now able to generate highly convincing, spontaneous-sounding speech using unscripted speech audio as the source material. On the motion side, probabilistic motion-generation methods can now synthesise vivid and lifelike speech-driven 3D gesticulation. In this paper, we put these two state-of-the-art technologies together in a coherent fashion for the first time. Concretely, we demonstrate a proof-of-concept system trained on a single-speaker audio and motion-capture dataset, that is able to generate both speech and full-body gestures together from text input. In contrast to…
