A Mel Spectrogram Enhancement Paradigm Based on CWT in Speech Synthesis

Guoqiang Hu; Huaning Tan; Ruilai Li

arXiv:2406.12164·cs.SD·July 11, 2024

TL;DR

This paper introduces a Mel spectrogram enhancement method based on the continuous wavelet transform (CWT) to improve the clarity of synthesised speech. It is validated on Tacotron2 (autoregressive) and Fastspeech2 (non-autoregressive), with quality improvements reported for both.

Contribution

The paper proposes a new enhancement paradigm using CWT to produce more detailed Mel spectrograms, improving speech synthesis quality across various architectures.

Findings

01. Higher MOS scores with the enhancement paradigm

02. Effective in both autoregressive and non-autoregressive models

03. Demonstrates universality across different speech synthesis architectures

Abstract

Acoustic features play an important role in improving the quality of synthesised speech. Currently, the Mel spectrogram is a widely employed acoustic feature in most acoustic models. However, because its Fourier transform step loses fine-grained detail, speech synthesised from Mel spectrograms loses clarity in abruptly changing signals. To obtain a more detailed Mel spectrogram, we propose a Mel spectrogram enhancement paradigm based on the continuous wavelet transform (CWT). This paradigm introduces an additional task: predicting a more detailed wavelet spectrogram, which, like the post-processing network, takes as input the Mel spectrogram output by the decoder. We choose Tacotron2 and Fastspeech2 for experimental validation in order to test autoregressive (AR) and non-autoregressive (NAR) speech systems, respectively. The experimental results demonstrate that the speech…
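
As a rough illustration of what a CWT-based wavelet spectrogram is, the sketch below computes a small scalogram with a complex Morlet wavelet using only the Python standard library. The choice of Morlet wavelet, the parameter `w0 = 6.0`, the three analysis frequencies, and the helper names `morlet`, `scale_for_frequency`, and `cwt` are all assumptions made for this example; the summary above does not state which wavelet, scales, or implementation the authors use.

```python
import cmath
import math


def morlet(t, w0=6.0):
    """Complex Morlet mother wavelet (assumed choice; small correction term omitted)."""
    return math.pi ** -0.25 * cmath.exp(1j * w0 * t) * math.exp(-0.5 * t * t)


def scale_for_frequency(freq_hz, w0=6.0):
    """Scale (in seconds) whose Morlet wavelet peaks near freq_hz."""
    return w0 / (2.0 * math.pi * freq_hz)


def cwt(signal, scales, dt, w0=6.0):
    """Return |W(s, b)| as a list of rows, one row per scale.

    Direct O(N^2) evaluation per scale of
        W(s, b) = (dt / sqrt(s)) * sum_n x[n] * conj(psi((n - b) * dt / s)).
    """
    n = len(signal)
    rows = []
    for s in scales:
        norm = dt / math.sqrt(s)
        # Conjugated wavelet sampled at every possible offset n - b.
        kernel = [morlet(k * dt / s, w0).conjugate() for k in range(-(n - 1), n)]
        row = []
        for b in range(n):
            acc = 0j
            for i in range(n):
                acc += signal[i] * kernel[i - b + n - 1]
            row.append(abs(acc) * norm)
        rows.append(row)
    return rows


# Toy input: a 50 Hz sine sampled at 1 kHz (hypothetical test signal).
fs = 1000.0
dt = 1.0 / fs
signal = [math.sin(2.0 * math.pi * 50.0 * i * dt) for i in range(256)]

freqs = [25.0, 50.0, 100.0]
scales = [scale_for_frequency(f) for f in freqs]
scalogram = cwt(signal, scales, dt)

# Pick the analysis frequency with the largest response at the signal midpoint.
mid = len(signal) // 2
strongest = max(range(len(freqs)), key=lambda k: scalogram[k][mid])
print("strongest analysis frequency:", freqs[strongest])
```

Unlike a fixed-window Fourier analysis, each scale here uses a wavelet whose duration shrinks as frequency rises, which is the property usually cited for resolving abrupt changes. In the paradigm described above, a spectrogram of this kind would serve as an extra prediction target alongside the Mel spectrogram. How the two are combined in training is not specified in this summary.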

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis