VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment

Bing Han; Long Zhou; Shujie Liu; Sanyuan Chen; Lingwei Meng; Yanming Qian; Yanqing Liu; Sheng Zhao; Jinyu Li; Furu Wei

arXiv:2406.07855·cs.CL·June 13, 2024·1 cites

TL;DR

VALL-E R is a zero-shot TTS system that enhances robustness and efficiency by using phoneme alignment and codec-merging, achieving high-quality speech with reduced inference time.

Contribution

It introduces a phoneme monotonic alignment strategy and a codec-merging approach to improve robustness and speed in zero-shot TTS.

Findings

01 Word error rate (WER) approaches that of ground-truth recordings, indicating improved robustness.

02 Inference time is reduced by over 60%.

03 Speech quality remains high despite fewer autoregressive steps.
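To make the link between merging and fewer autoregressive steps concrete, here is a minimal sketch. It assumes that merging means averaging groups of adjacent frame vectors; the function name `merge_frames`, the `rate` parameter, and the toy frame values are hypothetical and are not taken from the paper. The reported 60% figure is a measured result and is not derived from this arithmetic.

```python
def merge_frames(frames, rate=2):
    """Average every `rate` consecutive frame vectors into one (assumed merge rule)."""
    merged = []
    for start in range(0, len(frames), rate):
        group = frames[start:start + rate]  # the last group may be shorter
        merged.append([sum(column) / len(group) for column in zip(*group)])
    return merged


# Toy input: five 2-dimensional frame vectors (illustrative values only).
frames = [[0.0, 2.0], [2.0, 4.0], [4.0, 0.0], [6.0, 2.0], [1.0, 1.0]]
merged = merge_frames(frames, rate=2)

# One autoregressive step per merged frame instead of one per original frame.
print(len(frames), "->", len(merged))
```

With `rate=2` the decoder in this sketch would take roughly half as many steps for the merged sequence, since each step now covers two original frames.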

Abstract

With the help of discrete neural audio codecs, large language models (LLMs) have increasingly been recognized as a promising methodology for zero-shot text-to-speech (TTS) synthesis. However, sampling-based decoding strategies, while bringing astonishing diversity to generation, also pose robustness issues such as typos, omissions, and repetition. In addition, the high sampling rate of audio imposes a large computational overhead on autoregressive inference. To address these issues, we propose VALL-E R, a robust and efficient zero-shot TTS system built upon the foundation of VALL-E. Specifically, we introduce a phoneme monotonic alignment strategy to strengthen the connection between phonemes and the acoustic sequence, ensuring a more precise alignment by constraining the acoustic tokens to match their associated phonemes. Furthermore, we employ a codec-merging approach to…
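The monotonic constraint described in the abstract can be illustrated with a small decoding loop. This is a sketch under stated assumptions, not the authors' implementation: a pointer into the phoneme sequence either stays where it is or advances by exactly one after each acoustic token, so no phoneme can be skipped or revisited. The names `monotonic_decode`, `toy_step`, and `TOY_DURATIONS`, as well as the stay/advance probability interface, are hypothetical stand-ins for a real model.

```python
import random


def monotonic_decode(phonemes, step_fn, max_steps=1000, seed=0):
    """Generate tokens while a phoneme pointer moves only forward, one step at a time.

    step_fn(phoneme, held) -> (token, p_advance) is a hypothetical model interface:
    `held` is how many tokens were already emitted for the current phoneme.
    """
    if not phonemes:
        return [], []
    rng = random.Random(seed)
    pointer, held = 0, 0
    tokens, alignment = [], []
    for _ in range(max_steps):
        token, p_advance = step_fn(phonemes[pointer], held)
        tokens.append(token)
        alignment.append(pointer)  # every token is tied to exactly one phoneme
        held += 1
        if rng.random() < p_advance:  # the only allowed move is "advance by one"
            pointer += 1
            held = 0
            if pointer == len(phonemes):
                break  # all phonemes have been spoken
    return tokens, alignment


# Toy stand-in for a model: each phoneme lasts a fixed number of frames.
TOY_DURATIONS = {"HH": 2, "AH": 3, "L": 1, "OW": 2}


def toy_step(phoneme, held):
    token = f"{phoneme}_{held}"
    finished = held + 1 >= TOY_DURATIONS[phoneme]
    return token, 1.0 if finished else 0.0


tokens, alignment = monotonic_decode(["HH", "AH", "L", "OW"], toy_step)
print(alignment)  # [0, 0, 1, 1, 1, 2, 3, 3]
```

Because the pointer can never jump or move backward, omissions and repetitions of phonemes are ruled out by construction in this sketch, which is the kind of robustness issue the abstract attributes to unconstrained sampling.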

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings