Bayesian Speech Synthesizers Can Learn from Multiple Teachers
Ziyang Zhang, Yifan Gao, Xuenan Xu, Baoxiang Li, Wen Wu, Chao Zhang

TL;DR
This paper introduces BELLE, a Bayesian TTS framework that models speech uncertainty, improves naturalness, and outperforms larger models by leveraging a novel training strategy and probabilistic modeling.
Contribution
BELLE is the first Bayesian TTS model that captures data-dependent uncertainty without increasing inference latency or model size.
Findings
BELLE reduces word error rate (WER) by 25.8% compared to larger models.
It effectively models speech variability using a Normal-Inverse-Gamma distribution.
It supports high-quality streaming speech synthesis.
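To make the Normal-Inverse-Gamma (NIG) finding concrete, here is a minimal sketch of how NIG parameters are commonly turned into a prediction and uncertainty estimates in evidential regression. It is an illustration under assumptions, not BELLE's implementation: the function names `nig_moments` and `nig_nll` and the parameter names `gamma`, `nu`, `alpha`, `beta` are hypothetical, and the formulas are the standard ones for a scalar target with a NIG prior over its mean and variance. How BELLE parameterizes or trains these quantities is not stated in this summary.

```python
import math


def nig_moments(gamma, nu, alpha, beta):
    """Return (mean, aleatoric variance, epistemic variance) for NIG parameters.

    Assumed (hypothetical) parameterization: mu ~ N(gamma, sigma^2 / nu),
    sigma^2 ~ Inverse-Gamma(alpha, beta).
    """
    if nu <= 0 or alpha <= 1 or beta <= 0:
        raise ValueError("requires nu > 0, alpha > 1, beta > 0")
    aleatoric = beta / (alpha - 1.0)         # E[sigma^2]: noise inherent in the data
    epistemic = beta / (nu * (alpha - 1.0))  # Var[mu]: uncertainty about the mean
    return gamma, aleatoric, epistemic


def nig_nll(y, gamma, nu, alpha, beta):
    """Negative log marginal likelihood of y (a Student-t) under the NIG prior."""
    omega = 2.0 * beta * (1.0 + nu)
    return (
        0.5 * math.log(math.pi / nu)
        - alpha * math.log(omega)
        + (alpha + 0.5) * math.log(nu * (y - gamma) ** 2 + omega)
        + math.lgamma(alpha)
        - math.lgamma(alpha + 0.5)
    )


mean, aleatoric, epistemic = nig_moments(gamma=0.0, nu=2.0, alpha=3.0, beta=4.0)
print(mean, aleatoric, epistemic)  # 0.0 2.0 1.0
```

In this sketch the predicted variance depends on the parameters produced for each input, which is what "data-dependent" uncertainty means here, in contrast to a fixed-variance prior. The loss grows as the target moves away from `gamma`, so minimizing it pulls the predicted mean toward the data.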
Abstract
Text-to-Speech (TTS) is inherently a "one-to-many" mapping characterized by intrinsic uncertainty, yet current paradigms often oversimplify it into a deterministic regression task. While continuous-valued autoregressive (AR) models have recently emerged as a promising alternative to discrete codec-based approaches, they typically rely on a fixed-variance prior, fundamentally constraining generation to a static point estimate that ignores the dynamic variability of natural speech. To bridge this gap, we propose BELLE (Bayesian evidential learning with language modelling), a framework that shifts from deterministic prediction to principled Bayesian inference without increasing model parameters or inference latency. By modeling the acoustic target as a Normal-Inverse-Gamma distribution, BELLE captures data-dependent aleatoric uncertainty. To enable accurate variance estimation on standard…
