Energy-Based Models For Speech Synthesis
Wanli Sun, Zehai Tu, Anton Ragni

TL;DR
This paper introduces energy-based models (EBMs) for speech synthesis, utilizing noise contrastive estimation and Langevin MCMC, demonstrating improved performance over Tacotron 2 on the LJSpeech dataset.
Contribution
It expands the range of non-autoregressive speech synthesis models by applying EBMs with novel training and sampling strategies, and links EBMs to diffusion models.
Findings
EBMs trained with noise contrastive estimation improve speech synthesis quality.
Langevin MCMC enables effective sampling from EBMs.
The proposed approach outperforms Tacotron 2 on the LJSpeech dataset.
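As a rough illustration of Langevin MCMC sampling from an energy-based model, the sketch below draws samples from p(x) ∝ exp(-E(x)) by repeatedly stepping against the energy gradient and adding Gaussian noise. The quadratic energy, the names `energy_grad` and `langevin_sample`, and the step size and chain length are all assumptions made for illustration; they stand in for a learned network and are not the paper's configuration.

```python
import numpy as np

def energy_grad(x, mu):
    # Gradient of a toy energy E(x) = 0.5 * ||x - mu||^2.
    # Hypothetical stand-in for the gradient of a learned energy network.
    return x - mu

def langevin_sample(x0, mu, step_size=0.02, n_steps=1000, seed=0):
    # Unadjusted Langevin dynamics:
    #   x <- x - (step_size / 2) * grad E(x) + sqrt(step_size) * noise
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        x = x - 0.5 * step_size * energy_grad(x, mu) + np.sqrt(step_size) * noise
    return x

# Run 200 independent chains in parallel, all started from zero.
mu = np.array([1.0, -2.0])
samples = langevin_sample(np.zeros((200, 2)), mu)
sample_mean = samples.mean(axis=0)
```

For this toy energy the target distribution is a unit-variance Gaussian centred at `mu`, so `sample_mean` should land close to `mu`. With a neural energy function the gradient would typically come from automatic differentiation instead.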
Abstract
Recently there has been a lot of interest in non-autoregressive (non-AR) models for speech synthesis, such as FastSpeech 2 and diffusion models. Unlike AR models, these models do not have autoregressive dependencies among outputs, which makes inference efficient. This paper expands the range of available non-AR models with another member called energy-based models (EBMs). The paper describes how noise contrastive estimation, which relies on the comparison between positive and negative samples, can be used to train EBMs. It proposes a number of strategies for generating effective negative samples, including using high-performing AR models. It also describes how sampling from EBMs can be performed using Langevin Markov Chain Monte-Carlo (MCMC). The use of Langevin MCMC makes it possible to draw connections between EBMs and currently popular diffusion models. Experiments on the LJSpeech dataset show that…
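The comparison between positive and negative samples that noise contrastive estimation relies on can be sketched as a binary classification loss. This is the generic textbook form of NCE, not necessarily the exact objective used in the paper. The function name `nce_loss`, the learned log-normaliser `log_z`, and the availability of noise log-densities `log_q_pos` and `log_q_neg` are assumptions for illustration.

```python
import numpy as np

def log_sigmoid(z):
    # Numerically stable log(sigmoid(z)).
    return -np.logaddexp(0.0, -z)

def nce_loss(energy_pos, energy_neg, log_q_pos, log_q_neg, log_z=0.0):
    # Classifier logit: log p_model(x) - log q_noise(x),
    # with log p_model(x) = -E(x) - log_z (log_z treated as a free parameter).
    logit_pos = -np.asarray(energy_pos) - log_z - np.asarray(log_q_pos)
    logit_neg = -np.asarray(energy_neg) - log_z - np.asarray(log_q_neg)
    # Positives should be classified as data, negatives as noise;
    # log(1 - sigmoid(z)) equals log_sigmoid(-z).
    return -(np.mean(log_sigmoid(logit_pos)) + np.mean(log_sigmoid(-logit_neg)))

zeros = np.zeros(2)
# Low energy on positives and high energy on negatives gives a small loss.
good = nce_loss(np.array([-3.0, -3.0]), np.array([3.0, 3.0]), zeros, zeros)
# The reverse assignment is penalised.
bad = nce_loss(np.array([3.0, 3.0]), np.array([-3.0, -3.0]), zeros, zeros)
# An uninformative model sits at chance level.
chance = nce_loss(zeros, zeros, zeros, zeros)
```

Minimising this loss pushes the energy down on real speech features and up on negatives, which is why the quality of the negative samples matters for training.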
Taxonomy
Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques
Methods: Multi-Head Attention · Attention Is All You Need · Sigmoid Activation · Long Short-Term Memory · Tanh Activation · Highway Layer · Linear Layer · Dilated Causal Convolution
