A Waveform Representation Framework for High-quality Statistical Parametric Speech Synthesis

Bo Fan; Siu Wa Lee; Xiaohai Tian; Lei Xie; Minghui Dong

arXiv:1510.01443 · cs.SD · October 8, 2015

TL;DR

This paper introduces a waveform representation framework that incorporates phase information into statistical parametric speech synthesis, significantly improving speech quality over traditional vocoder-based methods.

Contribution

It proposes a novel phase-embedded waveform representation and joint modeling platform that enhances speech synthesis quality beyond existing vocoded and neural network approaches.

Findings

01. Outperforms STRAIGHT in waveform reconstruction quality.

02. Surpasses a vocoded DBLSTM-RNN baseline in multiple objective metrics.

03. Demonstrates the importance of phase information in high-quality speech synthesis.

Abstract

State-of-the-art statistical parametric speech synthesis (SPSS) generally uses a vocoder to represent speech signals and parameterize them into features for subsequent modeling. Magnitude spectrum has been a dominant feature over the years. Although perceptual studies have shown that phase spectrum is essential to the quality of synthesized speech, it is often ignored by using a minimum phase filter during synthesis and the speech quality suffers. To bypass this bottleneck in vocoded speech, this paper proposes a phase-embedded waveform representation framework and establishes a magnitude-phase joint modeling platform for high-quality SPSS. Our experiments on waveform reconstruction show that the performance is better than that of the widely-used STRAIGHT. Furthermore, the proposed modeling and synthesis platform outperforms a leading-edge, vocoded, deep bidirectional long short-term…
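The abstract's point about minimum-phase synthesis can be illustrated with a small sketch. This is not the paper's framework or its vocoder: it is a dependency-free toy example, assuming a single hypothetical 64-sample frame and a naive DFT, in which the helper names (`dft`, `idft`, `minimum_phase_spectrum`) and the test signal are all made up for illustration. It rebuilds the frame once from magnitude plus the original phase and once from magnitude plus a minimum phase derived by real-cepstrum folding.

```python
import cmath
import math


def dft(x):
    """Naive O(N^2) discrete Fourier transform (kept simple on purpose)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spec):
    """Naive inverse DFT; returns complex samples."""
    n = len(spec)
    return [sum(spec[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def minimum_phase_spectrum(magnitude, floor=1e-10):
    """Minimum-phase spectrum with the given magnitude (even length assumed)."""
    n = len(magnitude)
    log_mag = [math.log(max(m, floor)) for m in magnitude]
    cepstrum = [c.real for c in idft(log_mag)]
    # Fold the real cepstrum: keep quefrencies 0 and n/2, double the
    # positive quefrencies, and zero the negative ones.
    folded = [0.0] * n
    folded[0] = cepstrum[0]
    folded[n // 2] = cepstrum[n // 2]
    for q in range(1, n // 2):
        folded[q] = 2.0 * cepstrum[q]
    return [cmath.exp(v) for v in dft(folded)]


# Hypothetical frame: a damped sinusoid that starts after an 8-sample delay,
# so the frame itself is not minimum phase.
N = 64
frame = [0.0] * N
for t in range(8, N):
    frame[t] = math.exp(-0.1 * (t - 8)) * math.sin(0.6 * (t - 8) + 0.3)

spectrum = dft(frame)
magnitude = [abs(v) for v in spectrum]
phase = [cmath.phase(v) for v in spectrum]

# Reconstruction 1: magnitude combined with the original phase.
with_phase = [v.real for v in idft([cmath.rect(m, p)
                                    for m, p in zip(magnitude, phase)])]

# Reconstruction 2: same magnitude, phase replaced by the minimum phase.
min_spec = minimum_phase_spectrum(magnitude)
without_phase = [v.real for v in idft(min_spec)]

err_with_phase = max(abs(a - b) for a, b in zip(frame, with_phase))
err_min_phase = max(abs(a - b) for a, b in zip(frame, without_phase))
mag_mismatch = max(abs(abs(s) - m) for s, m in zip(min_spec, magnitude))

print("max error, original phase:", err_with_phase)
print("max error, minimum phase: ", err_min_phase)
print("max magnitude mismatch:   ", mag_mismatch)
```

Both reconstructions share the same magnitude spectrum, yet only the one that keeps the original phase returns the original samples. In this toy setting, that is the sense in which a magnitude-only representation discards waveform information.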

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing