Improving Trajectory Modelling for DNN-based Speech Synthesis by using Stacked Bottleneck Features and Minimum Generation Error Training
Zhizheng Wu, Simon King

TL;DR
This paper introduces stacked bottleneck features and minimum generation error training for DNN-based speech synthesis. The first adds wider linguistic context at the input; the second accounts for the interaction between static and dynamic features. Together they yield more natural synthetic speech.
Contribution
The paper presents two novel techniques that improve speech synthesis quality: one incorporates more detailed linguistic context at the input, and the other minimizes trajectory error over the whole utterance instead of frame by frame.
Findings
Significant improvement in naturalness of synthetic speech
Combining the two techniques further improves performance
Objective and subjective evaluations confirm the benefits
Abstract
We propose two novel techniques, stacking bottleneck features and minimum generation error training criterion, to improve the performance of deep neural network (DNN)-based speech synthesis. The techniques address the related issues of frame-by-frame independence and ignorance of the relationship between static and dynamic features, within current typical DNN-based synthesis frameworks. Stacking bottleneck features, which are an acoustically-informed linguistic representation, provides an efficient way to include more detailed linguistic context at the input. The minimum generation error training criterion minimises overall output trajectory error across an utterance, rather than minimising the error per frame independently, and thus takes into account the interaction between static and dynamic features. The two techniques can be easily combined to further improve performance. We…
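The two ideas in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation. The function names, the `context` width, the three-point delta window, the edge padding, and the unit-variance least-squares trajectory generation are all assumptions made for illustration. `stack_bottleneck` concatenates each frame's bottleneck vector with those of its neighbours, and `trajectory_error` measures error on the smooth trajectory generated from predicted static and delta features over the whole utterance, not on each frame separately.

```python
import numpy as np

def stack_bottleneck(bn, context=4):
    """Concatenate each frame's bottleneck vector with its neighbours.

    bn: (T, D) per-frame bottleneck activations.
    Returns (T, (2*context+1)*D). Edges are padded by repeating the
    first/last frame (an assumption of this sketch).
    """
    T, _ = bn.shape
    padded = np.pad(bn, ((context, context), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + T] for i in range(2 * context + 1)])

def delta_matrix(T, window=(-0.5, 0.0, 0.5)):
    """(T, T) matrix that applies a delta window along time (edges clamped)."""
    M = np.zeros((T, T))
    half = len(window) // 2
    for t in range(T):
        for k, w in enumerate(window):
            tau = min(max(t + k - half, 0), T - 1)
            M[t, tau] += w
    return M

def generate_trajectory(static, delta):
    """Trajectory c minimising ||W c - o||^2, where o = [static; delta].

    Unit variances are assumed, so this is plain least squares.
    """
    T = static.shape[0]
    W = np.vstack([np.eye(T), delta_matrix(T)])  # (2T, T)
    o = np.vstack([static, delta])               # (2T, D)
    return np.linalg.solve(W.T @ W, W.T @ o)     # (T, D)

def trajectory_error(pred_static, pred_delta, natural_static):
    """Squared error between generated and natural trajectories, summed over the utterance."""
    c = generate_trajectory(pred_static, pred_delta)
    return float(np.sum((c - natural_static) ** 2))
```

In this sketch a network trained with `trajectory_error` is penalised when its static and delta predictions disagree, because the disagreement pulls the generated trajectory away from the natural one; a per-frame loss would treat the two streams independently.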
