Msdtron: a high-capability multi-speaker speech synthesis system for diverse data using characteristic information

Qinghua Wu; Quanbo Shen; Jian Luan; YuJun Wang

arXiv:2107.03065·cs.SD·February 14, 2022·1 cites

Open Access

TL;DR

This paper introduces Msdtron, a multi-speaker speech synthesis system that leverages characteristic speech information and novel neural components to better model diverse speaker data, significantly improving synthesis quality.

Contribution

The paper presents Msdtron, a novel multi-speaker speech synthesis system with a new excitation spectrogram representation and conditional gated LSTM for enhanced modeling of speaker diversity.

Findings

01 Reduced mel-spectrogram reconstruction error

02 Significant improvement in subjective speaker adaptation quality

03 Effective handling of diverse speaker data

Abstract

In multi-speaker speech synthesis, data from many speakers is usually highly diverse, because the speakers may differ widely in age, speaking style, emotion, and so on. Improving the modeling capability of multi-speaker speech synthesis is therefore important but challenging. To address this issue, this paper proposes a high-capability speech synthesis system, called Msdtron, in which 1) a representation of the harmonic structure of speech, called the excitation spectrogram, is designed to directly guide the learning of harmonics in the mel-spectrogram, and 2) a conditional gated LSTM (CGLSTM) is proposed to control the flow of text content information through the network by re-weighting the gates of the LSTM using speaker information. The experiments show a significant reduction in the reconstruction error of the mel-spectrogram in the training of the multi-speaker model, and a…
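
The abstract describes CGLSTM only as re-weighting LSTM gates with speaker information and does not give the equations. The following is a minimal sketch of one way such a cell step could look, not the paper's formulation: the per-gate sigmoid scale derived from a speaker embedding, the parameter names (`W`, `U`, `b`, `V`), and the helper functions are all assumptions made for illustration.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    # Plain matrix-vector product on nested lists.
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def init_params(input_dim, hidden, spk_dim, seed=0):
    # Hypothetical parameter layout: W, U, b are the usual LSTM weights;
    # V maps the speaker embedding to one scale per gate unit (assumed).
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {
        "W": mat(4 * hidden, input_dim),
        "U": mat(4 * hidden, hidden),
        "b": [0.0] * (4 * hidden),
        "V": mat(3 * hidden, spk_dim),
    }


def cglstm_step(x, h_prev, c_prev, spk, p):
    """One step of a speaker-conditioned LSTM cell (illustrative only)."""
    n = len(h_prev)
    z = [a + b + c for a, b, c in
         zip(matvec(p["W"], x), matvec(p["U"], h_prev), p["b"])]

    # Standard LSTM gates and candidate cell state.
    i = [sigmoid(v) for v in z[:n]]
    f = [sigmoid(v) for v in z[n:2 * n]]
    o = [sigmoid(v) for v in z[2 * n:3 * n]]
    g = [math.tanh(v) for v in z[3 * n:]]

    # Assumed conditioning: speaker-dependent scales in (0, 1) that
    # re-weight the input, forget, and output gates element-wise.
    s = [sigmoid(v) for v in matvec(p["V"], spk)]
    i = [a * b for a, b in zip(i, s[:n])]
    f = [a * b for a, b in zip(f, s[n:2 * n])]
    o = [a * b for a, b in zip(o, s[2 * n:])]

    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c


params = init_params(input_dim=3, hidden=4, spk_dim=2)
x = [0.5, -0.2, 0.1]            # one frame of text-side features (made up)
h0, c0 = [0.0] * 4, [0.0] * 4
h_a, c_a = cglstm_step(x, h0, c0, [1.0, 0.0], params)  # speaker A embedding
h_b, c_b = cglstm_step(x, h0, c0, [0.0, 1.0], params)  # speaker B embedding
```

With the same text input and state, the two speaker embeddings produce different hidden states, which is the intended effect of letting speaker information gate the flow of content information. A real system would use a tensor library and learned embeddings; the pure-Python lists here only keep the sketch dependency-free.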

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing