Adaptation of Tacotron2-based Text-To-Speech for Articulatory-to-Acoustic Mapping using Ultrasound Tongue Imaging

Csaba Zainkó; László Tóth; Amin Honarmandi Shandiz; Gábor Gosztolya; Alexandra Markó; Géza Németh; Tamás Gábor Csapó

arXiv:2107.12051·eess.AS·July 27, 2021

TL;DR

This paper adapts a pre-trained Tacotron2 TTS model for articulatory-to-acoustic mapping using ultrasound tongue imaging, employing transfer learning to improve speech synthesis quality with limited data.

Contribution

It introduces a transfer learning approach that adapts Tacotron2 for ultrasound-based articulatory-to-acoustic conversion, combining CNN, Tacotron2, and WaveGlow for improved naturalness.

Findings

01 Enhanced speech naturalness over previous models

02 Effective transfer learning with limited ultrasound data

03 Successful integration of CNN, Tacotron2, and WaveGlow

Abstract

For articulatory-to-acoustic mapping, typically only limited parallel training data is available, making it impossible to apply fully end-to-end solutions like Tacotron2. In this paper, we experimented with transfer learning and adaptation of a Tacotron2 text-to-speech model to improve the final synthesis quality of ultrasound-based articulatory-to-acoustic mapping with a limited database. We use a multi-speaker pre-trained Tacotron2 TTS model and a pre-trained WaveGlow neural vocoder. The articulatory-to-acoustic conversion contains three steps: 1) from a sequence of ultrasound tongue image recordings, a 3D convolutional neural network predicts the inputs of the pre-trained Tacotron2 model, 2) the Tacotron2 model converts this intermediate representation to an 80-dimensional mel-spectrogram, and 3) the WaveGlow model is applied for final inference. This generated speech contains the…
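
The three-step conversion described in the abstract can be pictured as a chain of array transformations. The following is a minimal sketch, not the authors' implementation: every function is a hypothetical stand-in that only reproduces plausible input and output shapes, and `EMBED_DIM`, `HOP_LENGTH`, and the ultrasound frame size are assumptions made for illustration. Only the 80-dimensional mel-spectrogram comes from the abstract.

```python
import numpy as np

N_MELS = 80        # mel-spectrogram size stated in the abstract
EMBED_DIM = 512    # assumption: size of the Tacotron2 input representation
HOP_LENGTH = 256   # assumption: audio samples generated per mel frame


def cnn3d_stub(uti_frames: np.ndarray) -> np.ndarray:
    """Step 1 stand-in: (T, H, W) ultrasound frames -> (T, EMBED_DIM).

    A real 3D CNN convolves over time and space; this stub uses a fixed
    random per-frame projection just to produce the right shape.
    """
    rng = np.random.default_rng(0)
    flat = uti_frames.reshape(uti_frames.shape[0], -1)
    weights = rng.standard_normal((flat.shape[1], EMBED_DIM)) * 0.01
    return flat @ weights


def tacotron2_stub(inputs: np.ndarray) -> np.ndarray:
    """Step 2 stand-in: (T, EMBED_DIM) -> (T, N_MELS).

    The real model is an attention-based autoregressive decoder; this
    stub only reproduces the output dimensionality.
    """
    rng = np.random.default_rng(1)
    weights = rng.standard_normal((inputs.shape[1], N_MELS)) * 0.01
    return inputs @ weights


def waveglow_stub(mel: np.ndarray) -> np.ndarray:
    """Step 3 stand-in: (T, N_MELS) mel frames -> (T * HOP_LENGTH,) samples."""
    frame_level = np.tanh(mel.mean(axis=1))   # one bounded value per frame
    return np.repeat(frame_level, HOP_LENGTH)  # hold it for HOP_LENGTH samples


def uti_to_speech(uti_frames: np.ndarray):
    """Chain the three stages and return (mel, waveform)."""
    tacotron_inputs = cnn3d_stub(uti_frames)
    mel = tacotron2_stub(tacotron_inputs)
    waveform = waveglow_stub(mel)
    return mel, waveform


if __name__ == "__main__":
    # 10 dummy ultrasound frames of an assumed 64 x 128 resolution
    frames = np.random.default_rng(42).random((10, 64, 128))
    mel, wav = uti_to_speech(frames)
    print(mel.shape, wav.shape)
```

In this sketch, replacing a stub with a trained model would leave `uti_to_speech` unchanged as long as the shapes at each stage boundary are preserved, which mirrors how the abstract describes reusing pre-trained Tacotron2 and WaveGlow components behind a newly trained CNN front end.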

Code & Models

Repositories

BME-SmartLab/UTI-to-STFT-Tacotron2
tf · Official

Taxonomy

Methods: Invertible 1x1 Convolution · Affine Coupling · Normalizing Flows · WaveGlow