STFT spectral loss for training a neural speech waveform model

Shinji Takaki; Toru Nakashika; Xin Wang; Junichi Yamagishi

arXiv:1810.11945·eess.AS·October 31, 2018·1 cites

TL;DR

This paper introduces an STFT spectral loss incorporating both amplitude and phase spectra for training neural speech waveform models, leading to high-quality speech synthesis.

Contribution

It proposes a novel spectral loss function based on STFT spectra that improves neural speech waveform training and quality.

Findings

01

The proposed loss enhances speech synthesis quality.

02

The model achieves high-fidelity waveform generation.

03

Training based on the loss aligns with maximum likelihood principles.
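
The maximum-likelihood reading in finding 03 can be sketched from the standard Gaussian and von Mises densities. This is an illustration only: the symbols (natural amplitude A and phase θ, generated amplitude Â and phase θ̂, fixed variance σ² and concentration κ) are assumptions introduced here, and the exact form of the paper's loss is not given on this page.

```latex
\begin{align}
-\log \mathcal{N}(A \mid \hat{A}, \sigma^2)
  &= \frac{(A - \hat{A})^2}{2\sigma^2} + \frac{1}{2}\log\bigl(2\pi\sigma^2\bigr), \\
-\log \mathcal{M}(\theta \mid \hat{\theta}, \kappa)
  &= -\kappa \cos(\theta - \hat{\theta}) + \log\bigl(2\pi I_0(\kappa)\bigr).
\end{align}
```

With σ² and κ held fixed, the additive terms are constants, so maximizing the likelihood amounts to minimizing a squared error on amplitudes and a term of the form 1 − cos(θ − θ̂) on phases.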

Abstract

This paper proposes a new loss based on short-time Fourier transform (STFT) spectra for training a high-performance neural speech waveform model that directly predicts raw, continuous speech waveform samples. The proposed loss is calculated from both the amplitude spectra and the phase spectra of the generated speech waveforms. We also show mathematically that training the waveform model with the proposed loss can be interpreted as maximum likelihood training under the assumption that the amplitude and phase spectra of generated speech waveforms follow Gaussian and von Mises distributions, respectively. Furthermore, this paper presents a simple network architecture for the speech waveform model, composed of uni-directional long short-term memories (LSTMs) and an auto-regressive structure. Experimental results showed that the proposed neural model…
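
A loss that combines amplitude and phase spectra, as the abstract describes, could be computed roughly as follows. This is a minimal NumPy sketch, not the authors' implementation: the function names `stft` and `stft_spectral_loss`, the Hann window, the frame length, the hop size, and the `phase_weight` parameter are all assumptions made for illustration, and the phase term uses 1 − cos of the phase difference as one plausible form consistent with a von Mises assumption.

```python
import numpy as np


def stft(x, frame_len=512, hop=128):
    """Hann-windowed STFT; returns a complex array of shape (frames, bins)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack(
        [x[i * hop : i * hop + frame_len] for i in range(n_frames)]
    )
    return np.fft.rfft(frames * window, axis=-1)


def stft_spectral_loss(generated, natural, frame_len=512, hop=128,
                       phase_weight=1.0):
    """Return (total, amplitude_loss, phase_loss) for two waveforms.

    amplitude_loss: mean squared error between amplitude spectra
                    (Gaussian assumption with fixed variance).
    phase_loss:     mean of 1 - cos(phase difference)
                    (von Mises assumption with fixed concentration).
    phase_weight is a hypothetical balancing factor, not from the paper.
    """
    G = stft(generated, frame_len, hop)
    N = stft(natural, frame_len, hop)
    amp_loss = 0.5 * np.mean((np.abs(G) - np.abs(N)) ** 2)
    phase_diff = np.angle(G) - np.angle(N)
    phase_loss = np.mean(1.0 - np.cos(phase_diff))
    return amp_loss + phase_weight * phase_loss, amp_loss, phase_loss


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    natural = rng.standard_normal(4000)
    generated = natural + 0.1 * rng.standard_normal(4000)
    total, amp, phase = stft_spectral_loss(generated, natural)
    print(f"total={total:.4f} amplitude={amp:.4f} phase={phase:.4f}")
```

Because the cosine is periodic, the phase term is insensitive to 2π wrapping of the phase difference. For actual training the same computation would need to be written in an automatic-differentiation framework so that gradients can flow back to the waveform model.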

Code & Models

Repositories

nii-yamagishilab/TSNetVocoder (Official)

Taxonomy

Topics: Neural Networks and Applications · Speech and Audio Processing · Speech Recognition and Synthesis