Improving Trajectory Modelling for DNN-based Speech Synthesis by using Stacked Bottleneck Features and Minimum Generation Error Training

Zhizheng Wu; Simon King

arXiv:1602.06727·cs.SD·November 17, 2016

TL;DR

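This paper introduces stacked bottleneck features and minimum generation error training for DNN-based speech synthesis. Stacked bottleneck features supply more detailed linguistic context at the input, and minimum generation error training optimises the whole output trajectory so that the interaction between static and dynamic features is taken into account. Together they yield more natural synthetic speech.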

Contribution

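The paper presents two novel techniques that improve speech synthesis quality: stacked bottleneck features, which add more detailed linguistic context at the input, and a minimum generation error training criterion, which minimises trajectory error over the entire utterance rather than frame by frame.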
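The stacking idea can be illustrated with a minimal sketch, assuming that bottleneck features have already been extracted frame by frame from a first network. The function name `stack_context`, the window size, and the edge padding are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def stack_context(bottleneck, context=2):
    """Concatenate each frame with `context` neighbouring frames on each side.

    bottleneck: array of shape (T, D), one bottleneck vector per frame.
    Returns an array of shape (T, (2 * context + 1) * D).
    Utterance edges are padded by repeating the first/last frame (an assumption).
    """
    T, _ = bottleneck.shape
    padded = np.pad(bottleneck, ((context, context), (0, 0)), mode="edge")
    # Window k holds, for every frame t, the frame at offset (k - context).
    windows = [padded[k:k + T] for k in range(2 * context + 1)]
    return np.concatenate(windows, axis=1)

# Toy data: 6 frames, 2-dimensional bottleneck features, 3 linguistic features.
bottleneck = np.arange(12, dtype=float).reshape(6, 2)
linguistic = np.zeros((6, 3))

stacked = stack_context(bottleneck, context=1)
# Hypothetical input to a second network: linguistic features plus stacked context.
second_input = np.concatenate([linguistic, stacked], axis=1)
```

Each row of `second_input` then carries information from neighbouring frames, which is one way to relax the frame-by-frame independence of the input representation.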

Findings

01 Significant improvement in the naturalness of synthetic speech

02 Combining the two techniques improves performance further

03 Both objective and subjective evaluations confirm the benefits

Abstract

We propose two novel techniques, stacking bottleneck features and a minimum generation error training criterion, to improve the performance of deep neural network (DNN)-based speech synthesis. The techniques address two related issues within current typical DNN-based synthesis frameworks: frame-by-frame independence and neglect of the relationship between static and dynamic features. Stacking bottleneck features, which are an acoustically-informed linguistic representation, provides an efficient way to include more detailed linguistic context at the input. The minimum generation error training criterion minimises overall output trajectory error across an utterance, rather than minimising the error per frame independently, and thus takes into account the interaction between static and dynamic features. The two techniques can be easily combined to further improve performance. We…
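The difference between a per-frame error and a trajectory-level error can be sketched as follows. This is a simplified illustration for a single feature dimension, assuming unit variances and a simple delta window of 0.5 * (c[t+1] - c[t-1]); the helper names and the window are assumptions, and the code only evaluates the loss rather than reproducing the paper's training procedure.

```python
import numpy as np

def delta_window_matrix(T):
    """Build W of shape (2T, T) mapping a static trajectory c to interleaved
    [static_0, delta_0, static_1, delta_1, ...] observations."""
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0                        # static row
        W[2 * t + 1, max(t - 1, 0)] -= 0.5       # delta row, clamped at edges
        W[2 * t + 1, min(t + 1, T - 1)] += 0.5
    return W

def generate_trajectory(pred, W):
    """Least-squares trajectory c for W c ~= pred (unit variances assumed)."""
    return np.linalg.solve(W.T @ W, W.T @ pred)

def generation_error(pred, target_static, W):
    """Squared error between the generated trajectory and the natural one,
    summed over the whole utterance."""
    c = generate_trajectory(pred, W)
    return float(np.sum((c - target_static) ** 2))

T = 5
W = delta_window_matrix(T)
target = np.array([0.0, 1.0, 4.0, 9.0, 16.0])

consistent_pred = W @ target          # static and delta predictions agree
perturbed_pred = consistent_pred.copy()
perturbed_pred[0] += 1.0              # disturb one static prediction

err_consistent = generation_error(consistent_pred, target, W)
err_perturbed = generation_error(perturbed_pred, target, W)
```

Because the trajectory is obtained by solving for all frames jointly, an error in one predicted static or delta value spreads to neighbouring frames of the generated trajectory, which is the interaction a per-frame loss does not see.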
