SQuId: Measuring Speech Naturalness in Many Languages

Thibault Sellam; Ankur Bapna; Joshua Camp; Diana Mackinnon; Ankur P. Parikh; Jason Riesa

arXiv:2210.06324·cs.CL·June 2, 2023

TL;DR

SQuId is a multilingual speech naturalness prediction model trained on over a million ratings and tested in 65 locales. It aims to reduce reliance on costly human evaluation in text-to-speech research and shows that training one model on many locales outperforms mono-locale baselines.

Contribution

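The paper introduces SQuId (Speech Quality Identification), a multilingual naturalness prediction model trained on over a million ratings and tested in 65 locales, which the authors describe as the largest effort of this type to date. It outperforms a competitive baseline and demonstrates effective cross-locale transfer.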

Findings

01

Outperforms a competitive baseline based on w2v-BERT and VoiceMOS by 50.0%

02

Cross-locale transfer during fine-tuning carries over to zero-shot locales, i.e., locales with no fine-tuning data

03

Model benefits from diverse pre-training and balanced language data

Abstract

Much of text-to-speech research relies on human evaluation, which incurs heavy costs and slows down the development process. The problem is particularly acute in heavily multilingual applications, where recruiting and polling judges can take weeks. We introduce SQuId (Speech Quality Identification), a multilingual naturalness prediction model trained on over a million ratings and tested in 65 locales, the largest effort of this type to date. The main insight is that training one model on many locales consistently outperforms mono-locale baselines. We present our task, the model, and show that it outperforms a competitive baseline based on w2v-BERT and VoiceMOS by 50.0%. We then demonstrate the effectiveness of cross-locale transfer during fine-tuning and highlight its effect on zero-shot locales, i.e., locales for which there is no fine-tuning data. Through a series of analyses, we…

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Speech and Dialogue Systems