R-MelNet: Reduced Mel-Spectral Modeling for Neural TTS
Kyle Kastner, Aaron Courville

TL;DR
R-MelNet is a neural TTS model that pairs a frontend based on the first tier of MelNet with a WaveRNN-style decoder. The frontend predicts low-resolution mel-spectral features, which are interpolated and decoded to audio, giving varied, controllable speech synthesis on a single commodity GPU.
Contribution
The paper introduces R-MelNet, a two-part autoregressive TTS architecture, together with the implementation details needed for stable half precision training, including an approximate, numerically stable mixture of logistics attention.
Findings
With half precision training, uses under 11 GB of GPU memory on a single NVIDIA 2080Ti.
Produces highly varied audio through stochastic, multi-sample per step inference.
Effective for single-speaker TTS with text and audio based control of the output.
Abstract
This paper introduces R-MelNet, a two-part autoregressive architecture with a frontend based on the first tier of MelNet and a backend WaveRNN-style audio decoder for neural text-to-speech synthesis. Taking as input a mixed sequence of characters and phonemes, with an optional audio priming sequence, this model produces low-resolution mel-spectral features which are interpolated and used by a WaveRNN decoder to produce an audio waveform. Coupled with half precision training, R-MelNet uses under 11 gigabytes of GPU memory on a single commodity GPU (NVIDIA 2080Ti). We detail a number of critical implementation details for stable half precision training, including an approximate, numerically stable mixture of logistics attention. Using a stochastic, multi-sample per step inference scheme, the resulting model generates highly varied audio, while enabling text and audio based controls to…
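The abstract mentions a numerically stable mixture of logistics attention but does not give its form here. As a rough illustration only, the sketch below shows one common way such attention weights can be computed: each input position j receives the probability mass that a mixture of logistic distributions places on the interval [j - 0.5, j + 0.5], using a sigmoid written so that it never exponentiates a large positive number. The function names, the interval discretization, and the example parameters are all assumptions made for this sketch; it is not the paper's approximation and it does not reproduce the half precision details.

```python
import math


def stable_sigmoid(x):
    """Logistic function that only ever exponentiates non-positive values."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def logistic_mixture_attention(weights, means, scales, num_positions):
    """Hypothetical helper: attention weight per input position.

    Position j gets the mixture's probability mass on [j - 0.5, j + 0.5],
    computed as a difference of logistic CDFs for each component.
    """
    attention = []
    for j in range(num_positions):
        mass = 0.0
        for w, mu, s in zip(weights, means, scales):
            upper = stable_sigmoid((j + 0.5 - mu) / s)
            lower = stable_sigmoid((j - 0.5 - mu) / s)
            mass += w * (upper - lower)
        attention.append(mass)
    return attention


# Illustrative parameters (assumed): two components over eight text positions.
attn = logistic_mixture_attention(
    weights=[0.7, 0.3], means=[2.0, 5.0], scales=[0.5, 1.0], num_positions=8
)
peak = max(range(len(attn)), key=lambda j: attn[j])
print(peak, round(sum(attn), 4))
```

Because each weight is a difference of CDF values, the weights are non-negative and sum to at most one. Any mass falling outside the text range is simply dropped in this sketch.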
Taxonomy
Topics: Speech Recognition and Synthesis · Neural Networks and Applications · Music and Audio Processing
Methods: Sigmoid Activation · Softmax · Tanh Activation · WaveRNN
