Training a Neural Speech Waveform Model using Spectral Losses of Short-Time Fourier Transform and Continuous Wavelet Transform

Shinji Takaki; Hirokazu Kameoka; Junichi Yamagishi

arXiv:1903.12392 · eess.AS · April 9, 2019 · 1 cites

Open Access

TL;DR

This paper introduces a training scheme for neural speech waveform models using spectral losses derived from both STFT and CWT, leveraging their complementary time-frequency resolutions to improve speech quality.

Contribution

It generalizes previous spectral loss frameworks by incorporating CWT, enabling more human-auditory-like time-frequency analysis in training neural speech models.

Findings

01

CWT-based spectral loss achieves comparable speech quality to STFT-based loss.

02

Combining STFT and CWT losses could provide complementary information about speech signals during training.

03

The proposed losses can use time and frequency resolutions different from those of STFT, including ones closer to human auditory scales.

Abstract

Recently, we proposed short-time Fourier transform (STFT)-based loss functions for training a neural speech waveform model. In this paper, we generalize that framework and propose a training scheme for such models based on spectral amplitude and phase losses obtained by STFT, continuous wavelet transform (CWT), or both. Since CWT can have time and frequency resolutions different from those of STFT, and can use resolutions closer to human auditory scales, the proposed loss functions could provide complementary information on speech signals. Experimental results showed that a high-quality model can be trained with the proposed CWT spectral loss, and that this model is as good as one trained with the STFT-based loss.
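
The idea of spectral amplitude and phase losses computed on either transform can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration, since this page does not give the paper's formulas: the squared-error amplitude loss, the `1 - cos` phase loss, the Hann-windowed STFT settings, the complex Morlet wavelet with log-spaced scales, and the function names `stft`, `cwt_morlet`, and `spectral_losses` are all hypothetical choices, not the authors' implementation.

```python
import numpy as np

def stft(x, frame_len=256, hop=64):
    """Hann-windowed STFT; returns a complex array of shape (frames, bins)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def cwt_morlet(x, scales, w0=6.0):
    """CWT with an analytic Morlet wavelet (assumed choice), computed by
    multiplication in the frequency domain; returns shape (scales, samples)."""
    omega = 2.0 * np.pi * np.fft.fftfreq(len(x))  # angular frequency, rad/sample
    x_hat = np.fft.fft(x)
    rows = []
    for s in scales:
        # Frequency response of the scaled wavelet, positive frequencies only.
        psi_hat = np.sqrt(s) * np.exp(-0.5 * (s * omega - w0) ** 2) * (omega > 0)
        rows.append(np.fft.ifft(x_hat * psi_hat))
    return np.stack(rows)

def spectral_losses(spec_ref, spec_gen):
    """Amplitude and phase losses between two complex spectra (assumed forms)."""
    # Amplitude loss: mean squared error between magnitude spectra.
    amp = np.mean((np.abs(spec_ref) - np.abs(spec_gen)) ** 2)
    # Phase loss: 1 - cos(phase difference), which is insensitive to 2*pi wraps.
    phase = np.mean(1.0 - np.cos(np.angle(spec_ref) - np.angle(spec_gen)))
    return amp, phase

# Toy signals standing in for natural and generated waveforms.
rng = np.random.default_rng(0)
t = np.arange(2048)
ref = np.sin(2.0 * np.pi * 0.05 * t)
gen = ref + 0.1 * rng.standard_normal(t.size)

# Log-spaced scales (an assumption) give finer time resolution at high
# frequencies and finer frequency resolution at low frequencies.
scales = 2.0 * 2.0 ** (np.arange(24) / 4.0)

stft_amp, stft_phase = spectral_losses(stft(ref), stft(gen))
cwt_amp, cwt_phase = spectral_losses(cwt_morlet(ref, scales),
                                     cwt_morlet(gen, scales))

# Unweighted sum of all four terms; any weighting scheme is an assumption.
total = stft_amp + stft_phase + cwt_amp + cwt_phase
```

Because `spectral_losses` only needs two complex arrays of the same shape, the same function serves both transforms, which mirrors the "STFT, CWT, or both" framing of the abstract. In an actual training setup these operations would be written in a framework with automatic differentiation so the losses can be backpropagated to the waveform model.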

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Neural Networks and Applications