Extending Text-to-Speech Synthesis with Articulatory Movement Prediction using Ultrasound Tongue Imaging

Tamás Gábor Csapó

arXiv:2107.05550·eess.AS·July 13, 2021

TL;DR

This study predicts ultrasound tongue images from text as an extension of DNN-based speech synthesis. It shows that neural networks can generate tongue movements close to natural ones, which could benefit audiovisual speech applications.

Contribution

The paper extends a vocoder-based DNN-TTS framework with articulatory movement prediction from ultrasound tongue images, shows that the approach is feasible with limited data, and compares fully connected and recurrent neural network architectures.

Findings

01 FC-DNNs outperform LSTMs for sequential prediction with limited data

02 Generated ultrasound videos closely resemble natural tongue movements

03 The method is feasible for audiovisual speech synthesis applications

Abstract

In this paper, we present our first experiments in text-to-articulation prediction, using ultrasound tongue image targets. We extend a traditional (vocoder-based) DNN-TTS framework to predict PCA-compressed ultrasound images, from which the continuous tongue motion can be reconstructed in synchrony with the synthesized speech. We use the data of eight speakers, train fully connected and recurrent neural networks, and show that FC-DNNs are more suitable than LSTMs for predicting sequential data when training data are limited. Objective experiments and visualized predictions show that the proposed solution is feasible and that the generated ultrasound videos are close to natural tongue movement. Articulatory movement prediction from text input can be useful for audiovisual speech synthesis or computer-assisted pronunciation training.
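
The PCA compression step mentioned in the abstract can be illustrated with a minimal sketch. It is not taken from the paper or its repository: the frame size, the number of components, the synthetic stand-in data, and the helper names `fit_pca`, `compress`, and `reconstruct` are all assumptions chosen for illustration. The idea is that each ultrasound frame is flattened to a vector and projected onto a few principal components, a network would then predict those coefficients instead of raw pixels, and frames are recovered by the inverse projection.

```python
import numpy as np


def fit_pca(frames, n_components):
    """Fit PCA on flattened frames of shape (n_frames, n_pixels)."""
    mean = frames.mean(axis=0)
    centered = frames - mean
    # Rows of vt are the principal axes, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:n_components]


def compress(frames, mean, components):
    """Project frames onto the principal axes (these would be the prediction targets)."""
    return (frames - mean) @ components.T


def reconstruct(coeffs, mean, components):
    """Map PCA coefficients back to pixel space."""
    return coeffs @ components + mean


# Synthetic stand-in for ultrasound frames (assumption: 16x16 pixels, 200 frames).
rng = np.random.default_rng(0)
n_frames, height, width, rank = 200, 16, 16, 5
basis = rng.normal(size=(rank, height * width))
weights = rng.normal(size=(n_frames, rank))
frames = weights @ basis + 0.01 * rng.normal(size=(n_frames, height * width))

mean, components = fit_pca(frames, n_components=rank)
coeffs = compress(frames, mean, components)        # shape (n_frames, rank)
restored = reconstruct(coeffs, mean, components)   # shape (n_frames, n_pixels)
mse = float(np.mean((frames - restored) ** 2))
print(f"compressed shape: {coeffs.shape}, reconstruction MSE: {mse:.6f}")
```

In a text-to-articulation setup of the kind described, `coeffs` would play the role of the regression target and `reconstruct` would turn predicted coefficients back into a frame sequence; how many components the paper actually keeps is not stated in this listing.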

Code & Models

Repositories

BME-SmartLab/txt2ult · TensorFlow · Official
