iSTFTNet: Fast and Lightweight Mel-Spectrogram Vocoder Incorporating Inverse Short-Time Fourier Transform

Takuhiro Kaneko; Kou Tanaka; Hirokazu Kameoka; Shogo Seki

arXiv:2203.02395 · cs.SD · March 7, 2022

TL;DR

iSTFTNet is a fast, lightweight mel-spectrogram vocoder that incorporates the inverse short-time Fourier transform (iSTFT) to reduce computational cost while maintaining reasonable speech quality in text-to-speech systems.

Contribution

The paper replaces parts of the neural vocoder with the inverse short-time Fourier transform (iSTFT), which reduces computational cost and lets the model exploit the time-frequency structure of the mel-spectrogram.

Findings

01 Models became faster and more lightweight.

02 Maintained reasonable speech quality.

03 Applicable to multiple HiFi-GAN variants.

Abstract

In recent text-to-speech synthesis and voice conversion systems, a mel-spectrogram is commonly applied as an intermediate representation, and the necessity for a mel-spectrogram vocoder is increasing. A mel-spectrogram vocoder must solve three inverse problems: recovery of the original-scale magnitude spectrogram, phase reconstruction, and frequency-to-time conversion. A typical convolutional mel-spectrogram vocoder solves these problems jointly and implicitly using a convolutional neural network, including temporal upsampling layers, when directly calculating a raw waveform. Such an approach allows skipping redundant processes during waveform synthesis (e.g., the direct reconstruction of high-dimensional original-scale spectrograms). By contrast, the approach solves all problems in a black box and cannot effectively employ the time-frequency structures existing in a mel-spectrogram. We…
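Two of the three inverse problems named in the abstract, phase reconstruction and frequency-to-time conversion, meet in the iSTFT itself: once a magnitude and a phase are available for every time-frequency bin, the waveform follows from a fixed, non-learned transform. The sketch below illustrates only that final step using the Python standard library. It is not the authors' implementation: the function names (`hann`, `stft`, `istft`), the frame size of 16, and the hop of 4 are assumptions chosen for illustration, and the magnitude and phase come from a forward STFT of a toy signal rather than from a neural network.

```python
import cmath
import math


def hann(n_fft):
    """Periodic Hann window of length n_fft."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / n_fft) for n in range(n_fft)]


def stft(signal, n_fft, hop):
    """Naive STFT: windowed frames, DFT per frame, non-negative bins only."""
    window = hann(n_fft)
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        seg = [signal[start + n] * window[n] for n in range(n_fft)]
        spec = [
            sum(seg[n] * cmath.exp(-2j * math.pi * k * n / n_fft) for n in range(n_fft))
            for k in range(n_fft // 2 + 1)
        ]
        frames.append(spec)
    return frames


def istft(magnitude, phase, n_fft, hop):
    """Inverse STFT from per-frame magnitude and phase via weighted overlap-add."""
    window = hann(n_fft)
    n_frames = len(magnitude)
    length = n_fft + hop * (n_frames - 1)
    out = [0.0] * length
    norm = [0.0] * length
    for t in range(n_frames):
        # Combine magnitude and phase into complex bins.
        half = [m * cmath.exp(1j * p) for m, p in zip(magnitude[t], phase[t])]
        # Rebuild the negative-frequency bins by Hermitian symmetry (real signal).
        full = half + [half[n_fft - k].conjugate() for k in range(n_fft // 2 + 1, n_fft)]
        for n in range(n_fft):
            # Inverse DFT for sample n of this frame.
            x = sum(
                full[k] * cmath.exp(2j * math.pi * k * n / n_fft) for k in range(n_fft)
            ).real / n_fft
            out[t * hop + n] += x * window[n]
            norm[t * hop + n] += window[n] ** 2
    # Divide by the summed squared window; leave uncovered samples at zero.
    return [o / w if w > 1e-8 else 0.0 for o, w in zip(out, norm)]


# Toy signal: two sinusoids, 64 samples.
signal = [
    math.sin(2 * math.pi * 3 * n / 64) + 0.5 * math.cos(2 * math.pi * 7 * n / 64)
    for n in range(64)
]
frames = stft(signal, n_fft=16, hop=4)
magnitude = [[abs(c) for c in frame] for frame in frames]
phase = [[cmath.phase(c) for c in frame] for frame in frames]
reconstructed = istft(magnitude, phase, n_fft=16, hop=4)
```

With exact magnitude and phase, the overlap-add reconstruction matches the input away from the signal edges. In a vocoder the magnitude and phase would instead be predicted from the mel-spectrogram, so the output quality would depend on how well they are estimated.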

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Advanced Data Compression Techniques

Methods: HiFi-GAN