TL;DR
This paper adapts a pre-trained Tacotron2 TTS model for articulatory-to-acoustic mapping using ultrasound tongue imaging, employing transfer learning to improve speech synthesis quality with limited data.
Contribution
It introduces a transfer learning approach that adapts Tacotron2 for ultrasound-based articulatory-to-acoustic conversion, chaining a 3D CNN, a pre-trained multi-speaker Tacotron2 model, and a pre-trained WaveGlow vocoder to improve naturalness.
Findings
Enhanced speech naturalness over previous models
Effective transfer learning with limited ultrasound data
Successful integration of CNN, Tacotron2, and WaveGlow
Abstract
For articulatory-to-acoustic mapping, typically only limited parallel training data is available, making it impossible to apply fully end-to-end solutions like Tacotron2. In this paper, we experimented with transfer learning and adaptation of a Tacotron2 text-to-speech model to improve the final synthesis quality of ultrasound-based articulatory-to-acoustic mapping with a limited database. We use a multi-speaker pre-trained Tacotron2 TTS model and a pre-trained WaveGlow neural vocoder. The articulatory-to-acoustic conversion contains three steps: 1) from a sequence of ultrasound tongue image recordings, a 3D convolutional neural network predicts the inputs of the pre-trained Tacotron2 model, 2) the Tacotron2 model converts this intermediate representation to an 80-dimensional mel-spectrogram, and 3) the WaveGlow model is applied for final inference. This generated speech contains the…
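The three steps described in the abstract can be sketched as a data-flow example. This is a shape-level illustration only, not the authors' implementation: every function is a hypothetical stand-in built from random projections, and the embedding width, frame window, frames per step, hop length, and image size are all assumptions. Only the 80-dimensional mel-spectrogram comes from the abstract.

```python
import numpy as np

MEL_BINS = 80      # from the abstract: 80-dimensional mel-spectrogram
EMBED_DIM = 512    # assumption: width of the representation fed to Tacotron2
HOP_LENGTH = 256   # assumption: audio samples generated per mel frame

rng = np.random.default_rng(0)


def cnn3d_stub(ultrasound, window=5):
    """Hypothetical stand-in for the 3D CNN.

    Groups `window` consecutive ultrasound frames into one block and
    projects each block to an EMBED_DIM vector.
    """
    n_frames, height, width = ultrasound.shape
    n_blocks = n_frames // window
    blocks = ultrasound[: n_blocks * window].reshape(n_blocks, -1)
    projection = rng.standard_normal((blocks.shape[1], EMBED_DIM)) * 0.01
    return blocks @ projection  # shape: (n_blocks, EMBED_DIM)


def tacotron2_stub(embeddings, frames_per_step=4):
    """Hypothetical stand-in for the pre-trained Tacotron2 model.

    Maps each input vector to `frames_per_step` identical mel frames.
    """
    projection = rng.standard_normal((EMBED_DIM, MEL_BINS)) * 0.01
    mel = embeddings @ projection  # shape: (n_blocks, MEL_BINS)
    return np.repeat(mel, frames_per_step, axis=0).T  # shape: (MEL_BINS, n_mel_frames)


def waveglow_stub(mel):
    """Hypothetical stand-in for the WaveGlow vocoder.

    Produces HOP_LENGTH noise samples per mel frame, scaled by frame energy.
    """
    envelope = np.repeat(np.abs(mel).mean(axis=0), HOP_LENGTH)
    return envelope * rng.standard_normal(envelope.shape[0])


def articulatory_to_acoustic(ultrasound):
    embeddings = cnn3d_stub(ultrasound)   # step 1: ultrasound -> Tacotron2 inputs
    mel = tacotron2_stub(embeddings)      # step 2: inputs -> mel-spectrogram
    waveform = waveglow_stub(mel)         # step 3: mel-spectrogram -> waveform
    return embeddings, mel, waveform


# 100 ultrasound frames of 16x32 pixels (hypothetical size, kept small on purpose)
ultrasound = rng.random((100, 16, 32))
embeddings, mel, waveform = articulatory_to_acoustic(ultrasound)
print(embeddings.shape, mel.shape, waveform.shape)
```

In a real system, the stubs for steps 2 and 3 would be replaced by the pre-trained Tacotron2 and WaveGlow models. With limited parallel data, the first stage is the natural place for most of the training effort, since it is the only component that is not pre-trained.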
Taxonomy
Methods: Invertible 1x1 Convolution · Affine Coupling · Normalizing Flows · WaveGlow
