Fake it to make it: Using synthetic data to remedy the data shortage in joint multimodal speech-and-gesture synthesis

Shivam Mehta; Anna Deichler; Jim O'Regan; Birger Moëll; Jonas Beskow; Gustav Eje Henter; Simon Alexanderson

arXiv:2404.19622 · cs.HC · May 1, 2024

TL;DR

This paper uses unimodal synthesis models to generate synthetic parallel training data for joint speech-and-gesture synthesis from text, addressing data scarcity, and proposes a new synthesis architecture with better prosody control.

Contribution

It presents a novel method of synthesizing large-scale multimodal training data from unimodal models and proposes an improved synthesis architecture with better prosody control.

Findings

01. Pre-training on synthetic data enhances speech and gesture quality.

02. The new architecture improves controllability and synthesis performance.

03. Synthetic data pre-training yields significant benefits over limited real data.

Abstract

Although humans engaged in face-to-face conversation simultaneously communicate both verbally and non-verbally, methods for joint and unified synthesis of speech audio and co-speech 3D gesture motion from text are a new and emerging field. These technologies hold great promise for more human-like, efficient, expressive, and robust synthetic communication, but are currently held back by the lack of suitably large datasets, as existing methods are trained on parallel data from all constituent modalities. Inspired by student-teacher methods, we propose a straightforward solution to the data shortage, by simply synthesising additional training material. Specifically, we use unimodal synthesis models trained on large datasets to create multimodal (but synthetic) parallel training data, and then pre-train a joint synthesis model on that material. In addition, we propose a new synthesis…
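
The data-generation idea in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `toy_tts` and `toy_gesture` are hypothetical stand-ins for the unimodal synthesis models, and the choice to condition gestures on both text and audio, the `ParallelExample` container, and the fine-tuning stage on real data are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class ParallelExample:
    """One (text, audio, motion) training triple; all fields are toy placeholders."""
    text: str
    audio: List[float]          # stand-in for a waveform or acoustic features
    motion: List[List[float]]   # stand-in for a sequence of 3D pose frames
    synthetic: bool


def build_synthetic_corpus(
    texts: Sequence[str],
    tts: Callable[[str], List[float]],
    gesture_model: Callable[[str, List[float]], List[List[float]]],
) -> List[ParallelExample]:
    """Create parallel multimodal data by chaining two unimodal models (assumed setup)."""
    corpus = []
    for text in texts:
        audio = tts(text)                    # unimodal speech synthesis
        motion = gesture_model(text, audio)  # unimodal gesture synthesis
        corpus.append(ParallelExample(text, audio, motion, synthetic=True))
    return corpus


def training_schedule(
    synthetic: List[ParallelExample], real: List[ParallelExample]
) -> List[Tuple[str, List[ParallelExample]]]:
    """Pre-train on synthetic data first; fine-tuning on real data is an assumption."""
    return [("pretrain", synthetic), ("finetune", real)]


# Hypothetical toy models so the sketch runs end to end.
def toy_tts(text: str) -> List[float]:
    return [0.0] * (len(text.split()) * 4)   # 4 audio frames per word


def toy_gesture(text: str, audio: List[float]) -> List[List[float]]:
    return [[0.0, 0.0, 0.0] for _ in range(len(audio) // 2)]  # 1 pose per 2 audio frames
```

In this sketch, calling `build_synthetic_corpus(["hello there"], toy_tts, toy_gesture)` yields one synthetic triple, and `training_schedule` simply orders the two data sources so that the joint model sees the synthetic material first.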

Taxonomy

Topics: Speech and dialogue systems · Hearing Impairment and Communication · Natural Language Processing Techniques