Mellotron: Multispeaker expressive voice synthesis by conditioning on rhythm, pitch and global style tokens

Rafael Valle; Jason Li; Ryan Prenger; Bryan Catanzaro

arXiv:1910.11997·cs.SD·October 29, 2019

TL;DR

Mellotron is a multispeaker voice synthesis model that generates expressive speech and singing by conditioning on rhythm, pitch, and style tokens, trained solely on read speech data without explicit alignments.

Contribution

It introduces a method to synthesize expressive and singing voices without specialized training data, using explicit conditioning on rhythm and pitch from audio or music scores.
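
As a rough illustration of what "pitch from a music score" can mean in practice, the sketch below turns a list of notes into a frame-level pitch contour in Hz. It is not taken from the paper: the helper names (`midi_to_hz`, `score_to_f0`), the note representation, the 100 frames-per-second rate, and the use of 0.0 for rests are all assumptions made for this example. The only standard fact used is the equal-temperament mapping from MIDI note number to frequency (A4 = note 69 = 440 Hz).

```python
def midi_to_hz(note):
    """Equal-temperament frequency of a MIDI note number (69 -> 440 Hz)."""
    return 440.0 * 2.0 ** ((note - 69) / 12.0)


def score_to_f0(notes, frames_per_second=100):
    """Expand (midi_note_or_None, duration_seconds) pairs into one F0 value per frame.

    Hypothetical representation: None marks a rest, encoded as 0.0 Hz (unvoiced).
    """
    contour = []
    for midi_note, duration in notes:
        n_frames = round(duration * frames_per_second)
        hz = 0.0 if midi_note is None else midi_to_hz(midi_note)
        contour.extend([hz] * n_frames)
    return contour


# A4 for two frames, then a one-frame rest.
example = score_to_f0([(69, 0.02), (None, 0.01)])
```

A contour like this would still need to be paired with rhythm information before it could condition a synthesis model; how Mellotron does that is described in the paper itself, not here.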

Findings

01 Effective style transfer across speakers and styles

02 High-quality singing synthesis without singing training data

03 Ability to manipulate rhythm and pitch procedurally
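
One simple form of procedural pitch manipulation is transposing a pitch contour by a fixed musical interval. The sketch below assumes a contour stored as a list of per-frame frequencies in Hz with 0.0 marking unvoiced frames; that representation and the function name `transpose_f0` are illustrative assumptions, not the paper's interface.

```python
def transpose_f0(f0, semitones):
    """Shift every voiced frame by `semitones`; unvoiced frames (0.0) stay unvoiced."""
    ratio = 2.0 ** (semitones / 12.0)
    return [hz * ratio if hz > 0 else 0.0 for hz in f0]


# Raise a contour by one octave (12 semitones doubles each frequency).
octave_up = transpose_f0([0.0, 220.0, 110.0], 12)
```

Multiplying by a constant ratio preserves the melodic shape, which is why transposition is expressed in semitones rather than as an additive offset in Hz.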

Abstract

Mellotron is a multispeaker voice synthesis model based on Tacotron 2 GST that can make a voice emote and sing without emotive or singing training data. By explicitly conditioning on rhythm and continuous pitch contours from an audio signal or music score, Mellotron is able to generate speech in a variety of styles ranging from read speech to expressive speech, from slow drawls to rap and from monotonous voice to singing voice. Unlike other methods, we train Mellotron using only read speech data without alignments between text and audio. We evaluate our models using the LJSpeech and LibriTTS datasets. We provide F0 Frame Errors and synthesized samples that include style transfer from other speakers, singers and styles not seen during training, procedural manipulation of rhythm and pitch and choir synthesis.
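
The abstract reports F0 Frame Errors without defining the metric. A minimal sketch under the commonly used definition follows: a frame counts as an error if the voicing decisions of the reference and synthesized contours disagree, or if both are voiced and the pitch deviates by more than 20% of the reference. The function name, the 0.0-means-unvoiced convention, and the assumption that both contours are already time-aligned and equally long are choices made for this example; the paper's exact evaluation setup may differ.

```python
def f0_frame_error(ref_f0, syn_f0, tolerance=0.2):
    """Fraction of frames with a voicing error or a gross pitch error.

    Assumes equal-length, frame-aligned contours in Hz, with 0.0 for unvoiced frames.
    """
    if len(ref_f0) != len(syn_f0) or not ref_f0:
        raise ValueError("contours must be non-empty and the same length")
    errors = 0
    for ref, syn in zip(ref_f0, syn_f0):
        ref_voiced, syn_voiced = ref > 0, syn > 0
        if ref_voiced != syn_voiced:
            errors += 1  # voicing decision error
        elif ref_voiced and abs(syn - ref) > tolerance * ref:
            errors += 1  # gross pitch error
    return errors / len(ref_f0)


# Frame 3 is a gross pitch error, frame 4 a voicing error.
ffe = f0_frame_error([0.0, 100.0, 200.0, 0.0], [0.0, 110.0, 300.0, 50.0])
```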

Taxonomy

Methods: Dilated Causal Convolution · Zoneout · Long Short-Term Memory · WaveNet · Mixture of Logistic Distributions · Location Sensitive Attention · Bidirectional LSTM · Linear Layer · Tacotron2