Lightweight and High-Fidelity End-to-End Text-to-Speech with Multi-Band Generation and Inverse Short-Time Fourier Transform

Masaya Kawamura; Yuma Shirahata; Ryuichi Yamamoto; Kentaro Tachibana

arXiv:2210.15975 · eess.AS · February 22, 2023

TL;DR

This paper introduces a lightweight end-to-end text-to-speech model that combines multi-band waveform generation with the inverse short-time Fourier transform, achieving naturalness comparable to VITS and faster-than-real-time synthesis on an Intel Core i7 CPU.

Contribution

It presents an efficient TTS model, based on VITS, that integrates multi-band generation and inverse STFT while remaining fully end-to-end trainable, with significantly faster inference than VITS.

Findings

01

Achieves speech naturalness comparable to VITS.

02

Real-time factor of 0.066 on an Intel Core i7 CPU, 4.1 times faster than VITS.

03

Smaller model version outperforms lightweight baselines in naturalness and speed.
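
To make the speed finding concrete, the sketch below works through the real-time factor (RTF) arithmetic. It assumes the usual definition of RTF as synthesis time divided by the duration of the generated audio, and it assumes "4.1 times faster" means a ratio of RTFs. The function name `real_time_factor` is hypothetical, and the implied VITS figure is derived here from the two reported numbers rather than quoted from the summary.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of the audio produced.

    Values below 1.0 mean synthesis runs faster than real time.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


# Reported figures from the summary above.
reported_rtf = 0.066
reported_speedup_over_vits = 4.1

# At RTF 0.066, ten seconds of speech would take about 0.66 s to synthesize.
synthesis_time_for_10s = reported_rtf * 10.0

# Assumption: speedup = VITS RTF / proposed RTF, so the implied VITS RTF
# is the product of the two reported numbers (a derived value, not a
# measurement stated on this page).
implied_vits_rtf = reported_rtf * reported_speedup_over_vits

print(f"10 s of audio in ~{synthesis_time_for_10s:.2f} s")
print(f"implied VITS RTF ~{implied_vits_rtf:.2f}")
```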

Abstract

We propose a lightweight end-to-end text-to-speech model using multi-band generation and inverse short-time Fourier transform. Our model is based on VITS, a high-quality end-to-end text-to-speech model, but adopts two changes for more efficient inference: 1) the most computationally expensive component is partially replaced with a simple inverse short-time Fourier transform, and 2) multi-band generation, with fixed or trainable synthesis filters, is used to generate waveforms. Unlike conventional lightweight models, which employ optimization or knowledge distillation separately to train two cascaded components, our method enjoys the full benefits of end-to-end optimization. Experimental results show that our model synthesized speech as natural as that synthesized by VITS, while achieving a real-time factor of 0.066 on an Intel Core i7 CPU, 4.1 times faster than VITS. Moreover, a smaller…
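
The abstract's first change replaces part of the decoder with a plain inverse STFT. The sketch below illustrates only that operation: turning per-frame magnitude and phase into a waveform by inverse DFT and windowed overlap-add. It is not the authors' implementation. In the model the magnitude and phase would come from a network, whereas here they come from a forward STFT of a test tone so the round trip can be checked. The function names, frame size, and hop size are illustrative assumptions; multi-band generation and the synthesis filters are omitted.

```python
import cmath
import math


def hann(n):
    # Periodic Hann window of length n.
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]


def stft(signal, n_fft, hop):
    # Naive forward STFT (full spectrum, O(n_fft^2) per frame), used here
    # only to produce magnitude/phase inputs for the demo.
    win = hann(n_fft)
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        seg = [signal[start + t] * win[t] for t in range(n_fft)]
        frames.append([
            sum(seg[t] * cmath.exp(-2j * math.pi * k * t / n_fft)
                for t in range(n_fft))
            for k in range(n_fft)
        ])
    return frames


def istft(magnitude, phase, n_fft, hop):
    # Inverse STFT: rebuild each complex frame from magnitude and phase,
    # apply an inverse DFT, then overlap-add with window normalization.
    win = hann(n_fft)
    length = n_fft + hop * (len(magnitude) - 1)
    out = [0.0] * length
    norm = [0.0] * length
    for f, (mags, phs) in enumerate(zip(magnitude, phase)):
        spec = [m * cmath.exp(1j * p) for m, p in zip(mags, phs)]
        for t in range(n_fft):
            x = sum(spec[k] * cmath.exp(2j * math.pi * k * t / n_fft)
                    for k in range(n_fft)).real / n_fft
            out[f * hop + t] += x * win[t]
            norm[f * hop + t] += win[t] ** 2
    # Samples with no window coverage (e.g. the very first one) are set to 0.
    return [o / w if w > 1e-8 else 0.0 for o, w in zip(out, norm)]


# Round-trip demo on a short test tone (illustrative sizes only).
n_fft, hop = 16, 4
signal = [math.sin(2 * math.pi * 3 * i / 64) for i in range(64)]
frames = stft(signal, n_fft, hop)
mag = [[abs(c) for c in fr] for fr in frames]
ph = [[cmath.phase(c) for c in fr] for fr in frames]
recon = istft(mag, ph, n_fft, hop)
```

Because the inverse transform is a fixed, parameter-free operation, the cost of this stage does not grow with model size, which is the intuition behind using it in place of learned upsampling layers.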

Code & Models

Repositories

masayakawamura/mb-istft-vits
PyTorch · Official

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Speech and dialogue systems

Methods: Knowledge Distillation