STFT spectral loss for training a neural speech waveform model
Shinji Takaki, Toru Nakashika, Xin Wang, Junichi Yamagishi

TL;DR
This paper introduces an STFT spectral loss incorporating both amplitude and phase spectra for training neural speech waveform models, leading to high-quality speech synthesis.
Contribution
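It proposes a loss computed from both the amplitude and phase STFT spectra of generated speech waveforms, together with a simple autoregressive waveform model built from uni-directional LSTMs that is trained with this loss.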
Findings
The proposed loss enhances speech synthesis quality.
The model achieves high-fidelity waveform generation.
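Training with the proposed loss can be interpreted as maximum likelihood training in which the amplitude spectra are assumed to follow a Gaussian distribution and the phase spectra a von Mises distribution.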
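The maximum likelihood reading in the last finding can be sketched as follows. This is a sketch under stated assumptions, not the paper's own derivation: the symbols $A$, $\hat{A}$, $\theta$, $\hat{\theta}$, $\sigma$, and $\kappa$ are introduced here for illustration, and the variance and concentration are assumed to be fixed constants.

```latex
% Assumed notation: A, \theta are the amplitude and phase of one natural STFT bin,
% \hat{A}, \hat{\theta} are those of the generated waveform.
\begin{align}
  -\log \mathcal{N}(A \mid \hat{A}, \sigma^2)
    &= \frac{(A - \hat{A})^2}{2\sigma^2} + \tfrac{1}{2}\log(2\pi\sigma^2), \\
  -\log \mathrm{vM}(\theta \mid \hat{\theta}, \kappa)
    &= -\kappa \cos(\theta - \hat{\theta}) + \log\bigl(2\pi I_0(\kappa)\bigr).
\end{align}
% With \sigma and \kappa fixed, the log terms are constants, so minimizing the
% summed negative log-likelihood over all bins is equivalent to minimizing
% a squared amplitude error plus a (1 - \cos) phase error.
```

Here $I_0$ is the modified Bessel function of the first kind of order zero, which normalizes the von Mises density.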
Abstract
This paper proposes a new loss using short-time Fourier transform (STFT) spectra for training a high-performance neural speech waveform model that predicts raw continuous speech waveform samples directly. Not only amplitude spectra but also phase spectra obtained from generated speech waveforms are used to calculate the proposed loss. We also mathematically show that training the waveform model on the basis of the proposed loss can be interpreted as maximum likelihood training that assumes the amplitude and phase spectra of generated speech waveforms follow Gaussian and von Mises distributions, respectively. Furthermore, this paper presents a simple network architecture as the speech waveform model, which is composed of uni-directional long short-term memories (LSTMs) and an auto-regressive structure. Experimental results showed that the proposed neural model…
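To make the idea of a loss over both amplitude and phase spectra concrete, here is a minimal NumPy sketch. It is an illustration under assumptions, not the authors' implementation: the function names, the frame length, hop size, FFT size, Hann window, and the `phase_weight` parameter are all hypothetical choices, and the squared-error and 1 - cos terms are the forms that Gaussian and von Mises negative log-likelihoods reduce to when their variance and concentration are held fixed.

```python
import numpy as np


def stft(x, frame_len=400, hop=80, n_fft=512):
    """Frame the signal, apply a Hann window, and take the real FFT of each frame.

    All sizes are illustrative defaults, not values taken from the paper.
    """
    n_frames = 1 + (len(x) - frame_len) // hop
    # Index matrix of shape (n_frames, frame_len) selecting overlapping frames.
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = x[idx] * np.hanning(frame_len)
    return np.fft.rfft(frames, n=n_fft, axis=-1)


def stft_spectral_loss(generated, natural, phase_weight=1.0):
    """Return (total, amplitude_loss, phase_loss) for two equal-length waveforms.

    Assumed forms: squared error on amplitudes (Gaussian with fixed variance)
    and 1 - cos on phase differences (von Mises with fixed concentration).
    """
    spec_gen = stft(generated)
    spec_nat = stft(natural)
    amp_loss = 0.5 * np.mean((np.abs(spec_gen) - np.abs(spec_nat)) ** 2)
    phase_diff = np.angle(spec_gen) - np.angle(spec_nat)
    # 1 - cos is periodic, so wrapped phase differences need no special handling.
    phase_loss = np.mean(1.0 - np.cos(phase_diff))
    return amp_loss + phase_weight * phase_loss, amp_loss, phase_loss


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    natural = rng.standard_normal(4000)
    # A sign-flipped copy has identical amplitude spectra but opposite phase,
    # so only the phase term can tell the two waveforms apart.
    total, amp, phase = stft_spectral_loss(-natural, natural)
    print(f"amplitude loss = {amp:.3f}, phase loss = {phase:.3f}")
```

The sign-flip case shows why an amplitude-only loss is insufficient for a model that predicts raw samples: two waveforms can match exactly in amplitude while differing in phase. NumPy is used here only to show the arithmetic; actual training would need the same computation in a framework with automatic differentiation.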
Taxonomy
Topics: Neural Networks and Applications · Speech and Audio Processing · Speech Recognition and Synthesis
