ItôTTS and ItôWave: Linear Stochastic Differential Equation Is All You Need For Audio Generation

Shoule Wu; Ziqiang Shi

arXiv:2105.07583 · cs.SD · February 1, 2022 · 1 citation

TL;DR

This paper introduces ItôTTS and ItôWave, two models built on one framework of linear stochastic differential equations: ItôTTS generates mel spectrograms from text, and ItôWave acts as a vocoder that generates waveforms. The reported MOS exceeds that of current state-of-the-art methods.

Contribution

The paper presents a unified SDE-based approach to both the TTS and vocoder tasks, simplifying audio synthesis and improving its quality.

Findings

01. Reported mean opinion scores (MOS) exceed those of state-of-the-art methods

02. ItôTTS and ItôWave generate realistic speech and audio

03. One framework covers both TTS and vocoder, with a separate model for each (ItôTTS and ItôWave)

Abstract

In this paper, we propose to unify the two aspects of voice synthesis, namely text-to-speech (TTS) and vocoder, into one framework based on a pair of forward and reverse-time linear stochastic differential equations (SDEs). The solutions of this SDE pair are two stochastic processes. One turns the distribution of the mel spectrogram (or wave) that we want to generate into a simple and tractable distribution. The other is the generation procedure, which turns this simple, tractable signal into the target mel spectrogram (or wave). The model that generates mel spectrograms is called ItôTTS, and the model that generates waves is called ItôWave. ItôTTS and ItôWave use the Wiener process as a driver to gradually subtract the excess signal from the noise signal to generate realistic, meaningful mel spectrograms and audio, respectively, under the conditional inputs of…
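The forward and reverse-time SDE pair described in the abstract can be illustrated with a toy one-dimensional sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the drift and diffusion coefficients (a constant-rate linear SDE, dx = -0.5·β·x dt + √β dW), the constants `BETA` and `T`, the Gaussian "data" distribution, and the closed-form `score` function, which stands in for the learned, conditioned score network a real model would need. Both processes are discretized with plain Euler-Maruyama steps.

```python
import numpy as np

# All constants below are illustrative assumptions, not values from the paper.
BETA = 1.0                       # constant noise rate of the toy linear SDE
T = 10.0                         # time horizon; large enough that x_T is close to N(0, 1)
DATA_MEAN, DATA_STD = 3.0, 0.5   # toy 1-D "data" distribution N(3, 0.5^2)


def marginal(t):
    """Mean and variance of x_t under the forward SDE when x_0 ~ N(DATA_MEAN, DATA_STD^2)."""
    decay = np.exp(-BETA * t)
    mean = DATA_MEAN * np.sqrt(decay)
    var = DATA_STD**2 * decay + (1.0 - decay)
    return mean, var


def score(x, t):
    """Analytic score, grad_x log p_t(x). A real model would learn this with a network."""
    mean, var = marginal(t)
    return -(x - mean) / var


def forward(x0, n_steps, rng):
    """Forward SDE: dx = -0.5*BETA*x dt + sqrt(BETA) dW. Turns data into near-Gaussian noise."""
    dt = T / n_steps
    x = x0.copy()
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        x = x - 0.5 * BETA * x * dt + np.sqrt(BETA * dt) * noise
    return x


def reverse(xT, n_steps, rng):
    """Reverse-time SDE: dx = [f(x, t) - g^2 * score(x, t)] dt + g dW, integrated from T down to 0."""
    dt = T / n_steps
    x = xT.copy()
    for i in range(n_steps):
        t = T - i * dt
        drift = -0.5 * BETA * x - BETA * score(x, t)
        noise = rng.standard_normal(x.shape)
        # Stepping backward in time, so the drift term is subtracted.
        x = x - drift * dt + np.sqrt(BETA * dt) * noise
    return x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = DATA_MEAN + DATA_STD * rng.standard_normal(5000)
    noised = forward(data, 1000, rng)
    generated = reverse(rng.standard_normal(5000), 1000, rng)
    print("forward end  mean/std:", noised.mean(), noised.std())
    print("generated    mean/std:", generated.mean(), generated.std())
```

In this sketch, `forward` drives samples toward a standard normal, while `reverse` starts from standard normal noise and should recover samples whose mean and spread are close to those of the toy data distribution. Replacing `score` with a trained network conditioned on text or mel spectrograms is the step this toy leaves out.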

Taxonomy

Topics: Speech and Audio Processing