iSTFTNet: Fast and Lightweight Mel-Spectrogram Vocoder Incorporating Inverse Short-Time Fourier Transform
Takuhiro Kaneko, Kou Tanaka, Hirokazu Kameoka, Shogo Seki

TL;DR
iSTFTNet is a fast, lightweight mel-spectrogram vocoder that incorporates the inverse short-time Fourier transform (iSTFT) to cut computational cost while keeping reasonable speech quality in text-to-speech and voice conversion systems.
Contribution
The paper replaces part of a convolutional neural vocoder with the inverse STFT, which reduces computational cost and lets the model exploit the time-frequency structure already present in a mel-spectrogram.
Findings
Models became faster and more lightweight.
Maintained reasonable speech quality.
Applicable to multiple HiFi-GAN variants.
Abstract
In recent text-to-speech synthesis and voice conversion systems, a mel-spectrogram is commonly applied as an intermediate representation, and the necessity for a mel-spectrogram vocoder is increasing. A mel-spectrogram vocoder must solve three inverse problems: recovery of the original-scale magnitude spectrogram, phase reconstruction, and frequency-to-time conversion. A typical convolutional mel-spectrogram vocoder solves these problems jointly and implicitly using a convolutional neural network, including temporal upsampling layers, when directly calculating a raw waveform. Such an approach allows skipping redundant processes during waveform synthesis (e.g., the direct reconstruction of high-dimensional original-scale spectrograms). By contrast, the approach solves all problems in a black box and cannot effectively employ the time-frequency structures existing in a mel-spectrogram. We…
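Of the three inverse problems named in the abstract, frequency-to-time conversion is the one that needs no learned parameters once a magnitude and a phase spectrogram are available. The following is a minimal sketch of that step only, not the authors' implementation: the magnitude and phase come from an analysis STFT of a toy signal rather than from a network, and the Hann window, `n_fft = 16`, and `hop = 4` are illustrative assumptions.

```python
import cmath
import math


def hann(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]


def stft(signal, n_fft, hop):
    """Naive one-sided STFT; returns (magnitude, phase) as [frames][n_fft // 2 + 1]."""
    window = hann(n_fft)
    magnitude, phase = [], []
    for start in range(0, len(signal) - n_fft + 1, hop):
        seg = [signal[start + i] * window[i] for i in range(n_fft)]
        spec = [
            sum(seg[n] * cmath.exp(-2j * math.pi * k * n / n_fft) for n in range(n_fft))
            for k in range(n_fft // 2 + 1)
        ]
        magnitude.append([abs(c) for c in spec])
        phase.append([cmath.phase(c) for c in spec])
    return magnitude, phase


def istft(magnitude, phase, n_fft, hop):
    """Inverse STFT by windowed overlap-add (n_fft is assumed to be even)."""
    window = hann(n_fft)
    length = n_fft + hop * (len(magnitude) - 1)
    out = [0.0] * length
    norm = [0.0] * length
    for t, (mags, phs) in enumerate(zip(magnitude, phase)):
        # Combine magnitude and phase into complex bins.
        half = [cmath.rect(m, p) for m, p in zip(mags, phs)]
        # Rebuild the negative-frequency bins via Hermitian symmetry.
        full = half + [half[n_fft - k].conjugate() for k in range(n_fft // 2 + 1, n_fft)]
        # Inverse DFT of one frame.
        frame = [
            sum(full[k] * cmath.exp(2j * math.pi * k * n / n_fft) for k in range(n_fft)).real / n_fft
            for n in range(n_fft)
        ]
        # Overlap-add with the synthesis window; track the squared-window sum.
        for i in range(n_fft):
            out[t * hop + i] += frame[i] * window[i]
            norm[t * hop + i] += window[i] ** 2
    # Samples with no window coverage (the very first one here) are set to zero.
    return [o / w if w > 1e-8 else 0.0 for o, w in zip(out, norm)]


n_fft, hop = 16, 4
signal = [math.sin(0.3 * i) + 0.5 * math.sin(1.1 * i) for i in range(64)]
mag, ph = stft(signal, n_fft, hop)
recon = istft(mag, ph, n_fft, hop)
err = max(abs(a - b) for a, b in zip(signal[n_fft:-n_fft], recon[n_fft:-n_fft]))
print(f"max interior reconstruction error: {err:.2e}")
```

Because this step is a fixed transform, a vocoder that ends in an iSTFT only has to predict magnitude and phase; the conversion to a waveform adds no trainable layers.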
Code & Models
- 🤗 hexgrad/Kokoro-82M · model · 9.6M dl · ♡ 5866
- 🤗 geneing/Kokoro · model · 7 dl · ♡ 16
- 🤗 MaziyarPanahi/Kokoro-82M · model · 1 dl · ♡ 5
- 🤗 AliceJohnson/Darwin-AI · model · 2 dl
- 🤗 ctranslate2-4you/Kokoro-82M-light · model · 9 dl · ♡ 9
- 🤗 prince-canuma/Kokoro-82M · model · 723 dl · ♡ 5
- 🤗 prince-canuma/Kokoro-82M-4bit · model · 15 dl
- 🤗 prince-canuma/Kokoro-82M-6bit · model
- 🤗 prince-canuma/Kokoro-82M-8bit · model · 4 dl · ♡ 1
- 🤗 prince-canuma/Kokoro-82M-3bit · model
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Advanced Data Compression Techniques
Methods: HiFi-GAN
