Energy-Based Models For Speech Synthesis

Wanli Sun; Zehai Tu; Anton Ragni

arXiv:2310.12765·cs.SD·October 20, 2023·2 cites

Open Access

TL;DR

This paper introduces energy-based models (EBMs) for speech synthesis, trained with noise contrastive estimation and sampled with Langevin MCMC, and reports improved performance over Tacotron 2 on the LJSpeech dataset.

Contribution

It adds EBMs to the range of non-autoregressive speech synthesis models, describing noise contrastive training with several negative-sampling strategies and Langevin MCMC sampling, which links EBMs to diffusion models.

Findings

01

EBMs trained with noise contrastive estimation improve speech synthesis quality.

02

Langevin MCMC enables effective sampling from EBMs.

03

The proposed approach outperforms Tacotron 2 on the LJSpeech dataset.

Abstract

Recently there has been a lot of interest in non-autoregressive (non-AR) models for speech synthesis, such as FastSpeech 2 and diffusion models. Unlike AR models, these models do not have autoregressive dependencies among outputs, which makes inference efficient. This paper expands the range of available non-AR models with another member called energy-based models (EBMs). The paper describes how noise contrastive estimation, which relies on the comparison between positive and negative samples, can be used to train EBMs. It proposes a number of strategies for generating effective negative samples, including using high-performing AR models. It also describes how sampling from EBMs can be performed using Langevin Markov Chain Monte-Carlo (MCMC). The use of Langevin MCMC makes it possible to draw connections between EBMs and currently popular diffusion models. Experiments on the LJSpeech dataset show that…
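
The two ingredients named in the abstract, noise contrastive estimation for training and Langevin MCMC for sampling, can be illustrated on a toy problem. The sketch below is not the paper's model: it assumes a hypothetical quadratic energy over 2-D vectors (so the EBM is simply a unit-variance Gaussian), uses the standard binary form of the NCE objective with equal numbers of positive and negative samples, and uses unadjusted Langevin dynamics. All function names and parameters (`energy`, `langevin_sample`, `nce_loss`, `step`, `log_z`) are illustrative assumptions.

```python
import numpy as np


def energy(x, mu):
    """Toy energy E(x) = 0.5 * ||x - mu||^2 (an assumption, not the paper's network)."""
    return 0.5 * np.sum((x - mu) ** 2, axis=-1)


def energy_grad(x, mu):
    """Gradient of the toy energy with respect to x."""
    return x - mu


def langevin_sample(mu, n_chains=2000, n_steps=500, step=0.05, seed=0):
    """Unadjusted Langevin dynamics targeting p(x) proportional to exp(-E(x)).

    Update: x <- x - (step / 2) * grad E(x) + sqrt(step) * standard normal noise.
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n_chains, mu.shape[0]))  # start every chain from N(0, I)
    for _ in range(n_steps):
        noise = rng.normal(size=x.shape)
        x = x - 0.5 * step * energy_grad(x, mu) + np.sqrt(step) * noise
    return x


def gaussian_logpdf(x, mean, std):
    """Log-density of an isotropic Gaussian, used here as the noise distribution."""
    d = x.shape[-1]
    sq = np.sum(((x - mean) / std) ** 2, axis=-1)
    return -0.5 * sq - d * np.log(std) - 0.5 * d * np.log(2.0 * np.pi)


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    return -np.logaddexp(0.0, -z)


def nce_loss(pos, neg, mu, log_z, noise_logpdf):
    """Binary NCE loss: classify positive (data) samples against negative (noise) samples.

    The classifier logit is log p_model(x) - log q(x), where
    log p_model(x) = -E(x) - log_z and log_z is treated as a free parameter.
    """
    logit_pos = -energy(pos, mu) - log_z - noise_logpdf(pos)
    logit_neg = -energy(neg, mu) - log_z - noise_logpdf(neg)
    return -(np.mean(log_sigmoid(logit_pos)) + np.mean(log_sigmoid(-logit_neg)))
```

In this toy setting, calling `langevin_sample(np.array([2.0, -1.0]))` yields chains whose sample mean lies close to `mu`, and `nce_loss` is lower when the model's `mu` matches the positives than when it does not. In a speech model the quadratic energy would be replaced by a learned network conditioned on text, and the negatives by whichever negative-sampling strategy is chosen; neither of those details is reproduced here.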

Taxonomy

Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques

Methods: Multi-Head Attention · Attention Is All You Need · Sigmoid Activation · Long Short-Term Memory · Tanh Activation · Highway Layer · Linear Layer · Dilated Causal Convolution