Synthesis of Tongue Motion and Acoustics from Text using a Multimodal Articulatory Database

Ingmar Steiner; Sébastien Le Maguer; Alexander Hewer

arXiv:1612.09352·cs.HC·April 17, 2018

TL;DR

This paper introduces an end-to-end text-to-speech (TTS) system that synthesizes audio and synchronized tongue motion directly from text. It adapts a 3D model of the tongue surface to an articulatory dataset and trains a statistical parametric speech synthesis system on the tongue model parameters.

Contribution

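The paper presents a method for generating synchronized tongue motion alongside speech directly from text, by training a statistical parametric speech synthesis system on the parameters of a 3D tongue model adapted to articulatory data. The approach can add an articulatory modality to conventional TTS applications without the need for extra data.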

Findings

01

Global mean Euclidean distance of less than 2.8 mm between predicted articulatory movements and the reference data

02

The approach can add an articulatory modality to conventional TTS applications without the need for extra data

03

Demonstrated potential for multimodal speech applications

Abstract

We present an end-to-end text-to-speech (TTS) synthesis system that generates audio and synchronized tongue motion directly from text. This is achieved by adapting a 3D model of the tongue surface to an articulatory dataset and training a statistical parametric speech synthesis system directly on the tongue model parameters. We evaluate the model at every step by comparing the spatial coordinates of predicted articulatory movements against the reference data. The results indicate a global mean Euclidean distance of less than 2.8 mm, and our approach can be adapted to add an articulatory modality to conventional TTS applications without the need for extra data.
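
The evaluation metric named in the abstract, the mean Euclidean distance between predicted and reference spatial coordinates, can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function name `mean_euclidean_distance`, the point layout (one 3D coordinate in millimetres per sample), and the sample values are all assumptions chosen for illustration.

```python
import math


def mean_euclidean_distance(predicted, reference):
    """Average the point-wise Euclidean distance between two trajectories.

    Both arguments are sequences of coordinate tuples of equal length;
    the units of the result match the units of the inputs (here, mm).
    """
    if len(predicted) != len(reference):
        raise ValueError("trajectories must have the same number of points")
    if not predicted:
        raise ValueError("trajectories must not be empty")
    distances = [math.dist(p, r) for p, r in zip(predicted, reference)]
    return sum(distances) / len(distances)


# Hypothetical 3D positions in mm (illustrative values, not data from the paper).
reference = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0), (10.0, 5.0, 0.0)]
predicted = [(3.0, 4.0, 0.0), (10.0, 0.0, 1.0), (10.0, 5.0, 0.0)]

error_mm = mean_euclidean_distance(predicted, reference)
print(f"{error_mm:.2f} mm")  # per-point distances are 5, 1, and 0 mm, so this prints 2.00 mm
```

Under this reading, a "global" mean would average such distances over all evaluated points and utterances; how the paper aggregates them exactly is not stated in the abstract.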
