Lightweight and High-Fidelity End-to-End Text-to-Speech with Multi-Band Generation and Inverse Short-Time Fourier Transform
Masaya Kawamura, Yuma Shirahata, Ryuichi Yamamoto, Kentaro Tachibana

TL;DR
This paper introduces a lightweight end-to-end text-to-speech model that combines multi-band waveform generation with an inverse short-time Fourier transform. It matches VITS in naturalness and synthesizes faster than real time on an Intel Core i7 CPU (real-time factor 0.066).
Contribution
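The paper presents an efficient TTS model based on VITS that partially replaces the most computationally expensive component with a simple inverse STFT and generates waveforms with multi-band generation, using fixed or trainable synthesis filters. The model stays fully end-to-end trainable and runs inference 4.1 times faster than VITS.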
Findings
Achieves speech naturalness comparable to VITS.
Real-time factor of 0.066 on an Intel Core i7 CPU, 4.1 times faster than VITS.
A smaller version of the model outperforms lightweight baselines in both naturalness and speed.
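To make the speed figures concrete, here is a minimal sketch of the usual real-time factor definition (synthesis time divided by the duration of the audio produced). The function name and the 10-second example are hypothetical, and the implied VITS value is only derived from the two numbers stated above, not quoted from the paper.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Standard RTF definition: values below 1.0 mean faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


# Figures stated in the summary above.
rtf_proposed = 0.066
speedup_over_vits = 4.1

# Derived, not reported here: the VITS RTF implied by the two figures.
implied_vits_rtf = rtf_proposed * speedup_over_vits

# Hypothetical example: time needed to synthesize 10 s of audio at this RTF.
seconds_for_10s_audio = rtf_proposed * 10.0

print(f"implied VITS RTF: {implied_vits_rtf:.3f}")
print(f"time for 10 s of audio: {seconds_for_10s_audio:.2f} s")
```

An RTF of 0.066 means roughly 0.66 seconds of compute for every 10 seconds of speech under the measurement conditions used by the authors.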
Abstract
We propose a lightweight end-to-end text-to-speech model using multi-band generation and inverse short-time Fourier transform. Our model is based on VITS, a high-quality end-to-end text-to-speech model, but adopts two changes for more efficient inference: 1) the most computationally expensive component is partially replaced with a simple inverse short-time Fourier transform, and 2) multi-band generation, with fixed or trainable synthesis filters, is used to generate waveforms. Unlike conventional lightweight models, which employ optimization or knowledge distillation separately to train two cascaded components, our method enjoys the full benefits of end-to-end optimization. Experimental results show that our model synthesized speech as natural as that synthesized by VITS, while achieving a real-time factor of 0.066 on an Intel Core i7 CPU, 4.1 times faster than VITS. Moreover, a smaller…
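The first change in the abstract relies on the fact that a waveform can be rebuilt from per-frame magnitude and phase with an inverse STFT and overlap-add, which is cheap compared with a deep upsampling network. The following is a minimal NumPy sketch of that step alone, not the authors' implementation. The function names, the Hann window, and the values `n_fft=16` and `hop=4` are illustrative assumptions. In the model, magnitude and phase would be predicted by the network; here they are taken from a reference signal so the round trip can be checked. The multi-band synthesis filters are not shown.

```python
import numpy as np


def stft(x, n_fft=16, hop=4):
    """Analysis: windowed frames -> complex spectra, shape (frames, n_fft//2 + 1)."""
    window = np.hanning(n_fft + 1)[:-1]  # periodic Hann window
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)


def istft(magnitude, phase, n_fft=16, hop=4):
    """Synthesis: magnitude and phase -> waveform via weighted overlap-add."""
    window = np.hanning(n_fft + 1)[:-1]
    spec = magnitude * np.exp(1j * phase)
    frames = np.fft.irfft(spec, n=n_fft, axis=1) * window
    length = n_fft + hop * (frames.shape[0] - 1)
    out = np.zeros(length)
    norm = np.zeros(length)
    for i in range(frames.shape[0]):
        out[i * hop:i * hop + n_fft] += frames[i]
        norm[i * hop:i * hop + n_fft] += window ** 2
    # Divide by the accumulated squared window; guard samples with no coverage.
    return out / np.maximum(norm, 1e-8)


# Round trip on a toy signal (assumed example, not data from the paper).
x = np.sin(2 * np.pi * 0.05 * np.arange(64))
spec = stft(x)
y = istft(np.abs(spec), np.angle(spec))

# Interior samples, covered by a full set of overlapping frames, are recovered.
print(np.allclose(x[16:48], y[16:48]))
```

The very first sample is not recovered in this sketch because the periodic Hann window is zero there; practical implementations usually pad the signal to avoid such edge effects.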
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Speech and dialogue systems
Methods: Knowledge Distillation
