Speaker Adaptation with Continuous Vocoder-based DNN-TTS

Ali Raheem Mandeel; Mohammed Salah Al-Radhi; Tamás Gábor Csapó

arXiv:2108.01154·cs.SD·August 4, 2021

TL;DR

This paper applies the authors' earlier continuous vocoder to DNN-TTS: an average voice is trained on data from 9 speakers and then adapted to a target speaker with 400 utterances (about 14 minutes). Objective quality is similar to that of a WORLD vocoder-based baseline.

Contribution

It brings a continuous vocoder, whose excitation is modeled with two one-dimensional parameters (continuous F0 and Maximum Voiced Frequency), to average-voice DNN-TTS and shows that speaker adaptation is feasible with only 400 utterances.

Findings

01 Speaker adaptation is feasible with 400 utterances (about 14 minutes)

02 Objective quality is similar to a WORLD vocoder-based baseline

03 Targets low-complexity synthesis, where high-naturalness neural vocoders still cannot run in real time

Abstract

Traditional vocoder-based statistical parametric speech synthesis can be advantageous in applications that require low computational complexity. Recent neural vocoders, which can produce high naturalness, still cannot fulfill the requirement of being real-time during synthesis. In this paper, we experiment with our earlier continuous vocoder, in which the excitation is modeled with two one-dimensional parameters: continuous F0 and Maximum Voiced Frequency. We show on the data of 9 speakers that an average voice can be trained for DNN-TTS, and speaker adaptation is feasible with 400 utterances (about 14 minutes). Objective experiments support that the quality of speaker adaptation with Continuous Vocoder-based DNN-TTS is similar to the quality of the speaker adaptation with a WORLD Vocoder-based baseline.
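
To make the two-parameter excitation more concrete, the sketch below builds a per-frame (continuous F0, Maximum Voiced Frequency) stream in Python. It assumes that "continuous F0" means a contour defined in every frame, with no unvoiced gaps. The gap filling here is plain linear interpolation, swapped in for illustration only; it is not the pitch tracker used in the paper. The function name `interpolate_f0` and all numeric values are hypothetical.

```python
def interpolate_f0(f0):
    """Fill unvoiced frames (marked 0.0) by linear interpolation.

    Illustrative stand-in only: a real continuous pitch tracker estimates
    the contour directly instead of patching gaps afterwards.
    """
    voiced = [i for i, value in enumerate(f0) if value > 0.0]
    if not voiced:
        return list(f0)  # no voiced frame to anchor the contour on

    out = list(f0)
    # Hold the first and last voiced values at the edges.
    for i in range(voiced[0]):
        out[i] = f0[voiced[0]]
    for i in range(voiced[-1] + 1, len(f0)):
        out[i] = f0[voiced[-1]]
    # Interpolate linearly across each gap between voiced neighbours.
    for a, b in zip(voiced, voiced[1:]):
        for i in range(a + 1, b):
            t = (i - a) / (b - a)
            out[i] = f0[a] + t * (f0[b] - f0[a])
    return out


# Hypothetical frame-level values in Hz; 0.0 marks an unvoiced frame.
f0_with_gaps = [0.0, 120.0, 0.0, 140.0, 0.0]
mvf_hz = [1000.0, 4000.0, 2500.0, 4200.0, 1200.0]

cont_f0 = interpolate_f0(f0_with_gaps)
# Two one-dimensional parameters per frame: (continuous F0, MVF).
excitation = list(zip(cont_f0, mvf_hz))
print(excitation[:3])
```

Each frame then carries exactly two scalar excitation values, which is the kind of compact target an acoustic model would predict alongside the spectral features.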
