Resource-Efficient Fine-Tuning Strategies for Automatic MOS Prediction in Text-to-Speech for Low-Resource Languages

Phat Do; Matt Coler; Jelske Dijkstra; Esther Klabbers

arXiv:2305.19396 · eess.AS · June 1, 2023 · 2 cites


Open Access

TL;DR

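This paper studies resource-efficient fine-tuning strategies for automatic MOS prediction in TTS for low-resource languages. Tested on West Frisian, the strategies improve zero-shot prediction, and fine-tuning on a fraction of the data or on a single listener's ratings already gives useful accuracy for early-stage system evaluation.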

Contribution

It trains a wav2vec 2.0-based MOS prediction model on the open-access data sets BVCC and SOMOS, then evaluates fine-tuning strategies on neural TTS data in West Frisian, a low-resource language.

Findings

01

Pre-training on BVCC before fine-tuning on SOMOS gives the best accuracy on the West Frisian data, for both fine-tuned and zero-shot prediction.

02

Fine-tuning with more than 30% of the total data does not lead to significant improvements.

03

Fine-tuning with data from a single listener shows promising system-level accuracy, supporting the viability of one-participant pilot tests.

Abstract

We train a MOS prediction model based on wav2vec 2.0 using the open-access data sets BVCC and SOMOS. Our test with neural TTS data in the low-resource language (LRL) West Frisian shows that pre-training on BVCC before fine-tuning on SOMOS leads to the best accuracy for both fine-tuned and zero-shot prediction. Further fine-tuning experiments show that using more than 30 percent of the total data does not lead to significant improvements. In addition, fine-tuning with data from a single listener shows promising system-level accuracy, supporting the viability of one-participant pilot tests. These findings can all assist the resource-conscious development of TTS for LRLs by progressing towards better zero-shot MOS prediction and informing the design of listening tests, especially in early-stage evaluation.
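The abstract refers to system-level accuracy without spelling out how it is computed. A common approach, sketched below under stated assumptions, is to average utterance-level scores per TTS system and then compare predicted and reference system means. The function names, the toy scores, and the choice of mean squared error and Pearson correlation as metrics are illustrative assumptions, not details taken from the paper.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean


def system_level_scores(utterances):
    """Average utterance-level scores per system.

    `utterances` is an iterable of (system_id, score) pairs.
    """
    by_system = defaultdict(list)
    for system_id, score in utterances:
        by_system[system_id].append(score)
    return {system_id: mean(scores) for system_id, scores in by_system.items()}


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def system_level_metrics(true_utts, pred_utts):
    """Compare reference and predicted MOS after averaging per system."""
    true_sys = system_level_scores(true_utts)
    pred_sys = system_level_scores(pred_utts)
    systems = sorted(true_sys)  # fixed order so the two lists align
    t = [true_sys[s] for s in systems]
    p = [pred_sys[s] for s in systems]
    mse = mean((a - b) ** 2 for a, b in zip(t, p))
    return {"mse": mse, "pearson": pearson(t, p)}


# Toy data (hypothetical): three systems, two rated utterances each.
true_utts = [("A", 4.0), ("A", 5.0), ("B", 2.0), ("B", 3.0), ("C", 3.0), ("C", 4.0)]
pred_utts = [("A", 4.0), ("A", 4.0), ("B", 2.0), ("B", 2.0), ("C", 3.0), ("C", 3.0)]
metrics = system_level_metrics(true_utts, pred_utts)
```

In this toy case the predictions are uniformly 0.5 below the reference system means, so the correlation is perfect while the error is non-zero. This illustrates why a predictor can rank systems well even when its absolute scores are offset.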


Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Speech and dialogue systems

Methods: Test