Bayesian Speech Synthesizers Can Learn from Multiple Teachers

Ziyang Zhang; Yifan Gao; Xuenan Xu; Baoxiang Li; Wen Wu; Chao Zhang

arXiv:2510.24372 · cs.SD · February 11, 2026

TL;DR

This paper introduces BELLE, a Bayesian TTS framework that models speech uncertainty, improves naturalness, and outperforms larger models by leveraging a novel training strategy and probabilistic modeling.

Contribution

BELLE is the first Bayesian TTS model that captures data-dependent uncertainty without increasing inference latency or model size.

Findings

01. BELLE reduces word error rate (WER) by 25.8% compared to larger models.

02. It effectively models speech variability using a Normal-Inverse-Gamma distribution.

03. It supports high-quality streaming speech synthesis.

Abstract

Text-to-Speech (TTS) is inherently a "one-to-many" mapping characterized by intrinsic uncertainty, yet current paradigms often oversimplify it into a deterministic regression task. While continuous-valued autoregressive (AR) models have recently emerged as a promising alternative to discrete codec-based approaches, they typically rely on a fixed-variance prior, fundamentally constraining generation to a static point estimate that ignores the dynamic variability of natural speech. To bridge this gap, we propose BELLE (Bayesian evidential learning with language modelling), a framework that shifts from deterministic prediction to principled Bayesian inference without increasing model parameters or inference latency. By modeling the acoustic target as a Normal-Inverse-Gamma distribution, BELLE captures data-dependent aleatoric uncertainty. To enable accurate variance estimation on standard…
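
The Normal-Inverse-Gamma (NIG) modeling mentioned in the abstract can be made concrete with a small sketch. The excerpt does not show BELLE's actual loss or parameterization, so the block below assumes the standard evidential-regression form: a network predicts four values per target dimension, and the marginal likelihood of the target is a Student-t distribution. The parameter names (`gamma`, `nu`, `alpha`, `beta`) and the function names are illustrative assumptions, not taken from the paper.

```python
import math


def nig_moments(gamma, nu, alpha, beta):
    """Summarize a Normal-Inverse-Gamma belief over (mu, sigma^2).

    Assumed parameterization (not confirmed by the paper excerpt):
        mu      ~ Normal(gamma, sigma^2 / nu)
        sigma^2 ~ InverseGamma(alpha, beta)
    """
    if nu <= 0 or alpha <= 1 or beta <= 0:
        raise ValueError("need nu > 0, alpha > 1, beta > 0")
    mean = gamma                           # E[mu]
    aleatoric = beta / (alpha - 1)         # E[sigma^2], data noise
    epistemic = beta / (nu * (alpha - 1))  # Var[mu], model uncertainty
    return mean, aleatoric, epistemic


def nig_nll(y, gamma, nu, alpha, beta):
    """Negative log-likelihood of y under the NIG marginal (a Student-t)."""
    omega = 2.0 * beta * (1.0 + nu)
    return (
        0.5 * math.log(math.pi / nu)
        - alpha * math.log(omega)
        + (alpha + 0.5) * math.log(nu * (y - gamma) ** 2 + omega)
        + math.lgamma(alpha)
        - math.lgamma(alpha + 0.5)
    )


# Hypothetical parameters for a single acoustic feature dimension.
mean, aleatoric, epistemic = nig_moments(gamma=0.0, nu=2.0, alpha=3.0, beta=4.0)
loss_at_mean = nig_nll(0.0, gamma=0.0, nu=2.0, alpha=3.0, beta=4.0)
loss_off_mean = nig_nll(1.0, gamma=0.0, nu=2.0, alpha=3.0, beta=4.0)
```

In this form the predicted variance `beta / (alpha - 1)` varies with the input, in contrast to the fixed-variance prior the abstract criticizes. The loss is lowest when the target equals the predicted mean `gamma` and grows as the target moves away from it.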
