Neural networks for Text-to-Speech evaluation
Ilya Trofimenko, David Kocharyan, Aleksandr Zaitsev, Pavel Repnikov, Mark Levin, Nikita Shevtsov

TL;DR
This paper develops neural models that automatically evaluate Text-to-Speech (TTS) quality as a faster, cheaper approximation of human assessment, and explores several architectures and ensemble methods.
Contribution
The paper introduces NeuralSBS, an enhanced MOSNet, and WhisperBert to better approximate human judgments in TTS evaluation, addressing the cost, slowness, and assessor bias of traditional subjective methods.
Findings
NeuralSBS achieves 73.7% accuracy on the SOMOS dataset.
The best MOS models reach an RMSE of ~0.40, outperforming the human inter-rater baseline.
Stacking ensembles outperform direct latent fusion.
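The two headline metrics can be illustrated with a minimal sketch. It assumes SBS accuracy is the fraction of pairs where the model's preferred sample matches the human preference, and that RMSE is computed between predicted and reference MOS values. The function names and toy data are hypothetical and are not taken from the paper.

```python
import math


def sbs_accuracy(predicted, human):
    """Fraction of pairs where the predicted preference matches the human one."""
    if len(predicted) != len(human) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(p == h for p, h in zip(predicted, human))
    return hits / len(predicted)


def rmse(predicted, reference):
    """Root-mean-square error between predicted and reference scores."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - r) ** 2 for p, r in zip(predicted, reference)]
    return math.sqrt(sum(squared) / len(squared))


# Toy data (made up for illustration): 0 = first sample preferred, 1 = second.
pred_pref = [0, 1, 1, 0]
human_pref = [0, 1, 0, 0]
acc = sbs_accuracy(pred_pref, human_pref)

# Toy MOS values on a 1-5 scale (made up for illustration).
pred_mos = [3.0, 4.0, 2.0]
ref_mos = [3.0, 3.0, 4.0]
err = rmse(pred_mos, ref_mos)

print(f"SBS accuracy: {acc:.2f}")
print(f"MOS RMSE: {err:.3f}")
```

Under this reading, a lower RMSE than the human inter-rater baseline would mean the model's scores deviate from the reference MOS less than individual raters do; how the paper defines that baseline is not stated in this summary.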
Abstract
Ensuring that Text-to-Speech (TTS) systems deliver human-perceived quality at scale is a central challenge for modern speech technologies. Human subjective evaluation protocols such as Mean Opinion Score (MOS) and Side-by-Side (SBS) comparisons remain the de facto gold standards, yet they are expensive, slow, and sensitive to pervasive assessor biases. This study addresses these barriers by formulating and implementing a suite of novel neural models designed to approximate expert judgments in both relative (SBS) and absolute (MOS) settings. For relative assessment, we propose NeuralSBS, a HuBERT-backed model achieving 73.7% accuracy on the SOMOS dataset. For absolute assessment, we introduce enhancements to MOSNet using custom sequence-length batching, as well as WhisperBert, a multimodal stacking ensemble that combines Whisper audio features and BERT textual embeddings via weak…
