A Mel Spectrogram Enhancement Paradigm Based on CWT in Speech Synthesis
Guoqiang Hu, Huaning Tan, Ruilai Li

TL;DR
This paper introduces a novel Mel spectrogram enhancement method based on continuous wavelet transform (CWT) to improve speech synthesis clarity, validated across different models with consistent quality improvements.
Contribution
The paper proposes a new enhancement paradigm using CWT to produce more detailed Mel spectrograms, improving speech synthesis quality across various architectures.
Findings
Higher MOS scores with the enhancement paradigm
Effective in both autoregressive and non-autoregressive models
Demonstrates universality across different speech synthesis architectures
Abstract
Acoustic features play an important role in improving the quality of synthesised speech. Currently, the Mel spectrogram is a widely employed acoustic feature in most acoustic models. However, because its Fourier transform step loses fine-grained detail, speech synthesised from the Mel spectrogram loses clarity in abruptly changing signals. To obtain a more detailed Mel spectrogram, we propose a Mel spectrogram enhancement paradigm based on the continuous wavelet transform (CWT). This paradigm introduces an additional task: a more detailed wavelet spectrogram, which, like the post-processing network, takes as input the Mel spectrogram output by the decoder. We choose Tacotron2 and FastSpeech2 for experimental validation in order to test autoregressive (AR) and non-autoregressive (NAR) speech systems, respectively. The experimental results demonstrate that the speech…
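The abstract's motivation is that a Fourier-based feature loses fine detail that a wavelet analysis can retain. The sketch below illustrates this on a toy signal. It is not the authors' implementation: it is a NumPy-only Morlet CWT, and the function names (`morlet`, `cwt_scalogram`), the wavelet choice, the scales, and the synthetic signal are all assumptions made for illustration.

```python
import numpy as np


def morlet(t, w0=6.0):
    """Complex Morlet mother wavelet (small admissibility correction omitted)."""
    return np.pi ** -0.25 * np.exp(1j * w0 * t) * np.exp(-0.5 * t ** 2)


def cwt_scalogram(signal, scales, w0=6.0):
    """Return |CWT| with shape (len(scales), len(signal)).

    Each row is the signal convolved with a scaled, conjugated,
    time-reversed Morlet wavelet, normalised by 1/sqrt(scale).
    """
    signal = np.asarray(signal, dtype=float)
    out = np.empty((len(scales), len(signal)))
    for i, s in enumerate(scales):
        # Sample the wavelet over +/- 4 envelope widths at this scale.
        half = int(np.ceil(4 * s))
        if 2 * half + 1 > len(signal):
            raise ValueError("scale too large for this signal length")
        t = np.arange(-half, half + 1) / s
        kernel = np.conj(morlet(t, w0))[::-1] / np.sqrt(s)
        out[i] = np.abs(np.convolve(signal, kernel, mode="same"))
    return out


# Toy signal (assumed for illustration): a weak steady tone plus one click.
n = np.arange(1024)
click_pos = 512
x = 0.1 * np.sin(2 * np.pi * 0.06 * n)
x[click_pos] += 1.0

scales = [2, 4, 8, 16]          # small scale = fine time resolution
scalogram = cwt_scalogram(x, scales)

# At the finest scale the strongest response sits at the click itself.
print(int(np.argmax(scalogram[0])))
```

Small scales use short wavelets, so the click stays sharply localised in time, while larger scales resolve the steady tone. A single fixed STFT window has to trade one of these off against the other, which is the limitation the enhancement paradigm targets.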
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis
