MooseNet: A Trainable Metric for Synthesized Speech with a PLDA Module
Ondřej Plátek, Ondřej Dušek

TL;DR
MooseNet is a trainable speech quality metric that applies a PLDA model on top of SSL embeddings to predict listeners' Mean Opinion Score (MOS). It needs very little training data and improves on existing neural MOS prediction models.
Contribution
Introduces MooseNet, a speech quality assessment method that places a PLDA module on top of SSL embeddings, and shows that PLDA training outperforms SSL model fine-tuning in a low-resource scenario.
Findings
PLDA improves MOS prediction across models
MooseNet outperforms baseline models on VoiceMOS data
PLDA training outperforms SSL model fine-tuning in low-resource scenarios
Abstract
We present MooseNet, a trainable speech metric that predicts the listeners' Mean Opinion Score (MOS). We propose a novel approach where the Probabilistic Linear Discriminant Analysis (PLDA) generative model is used on top of an embedding obtained from a self-supervised learning (SSL) neural network (NN) model. We show that PLDA works well with a non-finetuned SSL model when trained on only 136 utterances (ca. one minute of training time) and that PLDA consistently improves various neural MOS prediction models, even state-of-the-art models with task-specific fine-tuning. Our ablation study shows the superiority of PLDA training over SSL model fine-tuning in a low-resource scenario. We also improve SSL model fine-tuning using a convenient optimizer choice and additional contrastive and multi-task training objectives. The fine-tuned MooseNet NN with the PLDA module achieves the best results,…
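To make the idea of a generative model on top of fixed embeddings more concrete, here is a minimal sketch in Python. It is not the authors' implementation and it is not full PLDA: it fits a simpler shared-covariance Gaussian model with one mean per MOS bin, then predicts MOS as the posterior-weighted average of the bin centers. The random vectors standing in for SSL embeddings, the rounding of MOS to integer bins, and the function names `fit_shared_cov_gaussians` and `predict_mos` are all assumptions made for illustration.

```python
import numpy as np


def fit_shared_cov_gaussians(X, y_bins, n_bins, reg=1e-3):
    """Fit one Gaussian mean per MOS bin with a single shared covariance.

    Simplified stand-in for PLDA (an assumption, not the paper's model).
    """
    n, dim = X.shape
    means = np.zeros((n_bins, dim))
    priors = np.zeros(n_bins)
    scatter = np.zeros((dim, dim))
    for k in range(n_bins):
        Xk = X[y_bins == k]
        priors[k] = len(Xk) / n
        if len(Xk) == 0:
            continue  # empty bin keeps prior 0 and is never predicted
        means[k] = Xk.mean(axis=0)
        centered = Xk - means[k]
        scatter += centered.T @ centered
    # Regularize so the covariance stays invertible with few utterances.
    cov = scatter / n + reg * np.eye(dim)
    return means, priors, np.linalg.inv(cov)


def predict_mos(X, means, priors, precision, bin_centers):
    """Return the posterior-weighted mean of the bin centers for each row."""
    diff = X[:, None, :] - means[None, :, :]  # shape (n, bins, dim)
    maha = np.einsum("nkd,de,nke->nk", diff, precision, diff)
    with np.errstate(divide="ignore"):
        log_post = -0.5 * maha + np.log(priors)[None, :]
    log_post -= log_post.max(axis=1, keepdims=True)  # numerical stability
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)
    return post @ bin_centers


# Synthetic data: random vectors play the role of SSL embeddings (assumption).
rng = np.random.default_rng(0)
dim, n_train, n_test = 16, 136, 50
direction = rng.normal(size=dim)


def make_data(n):
    mos = rng.uniform(1.0, 5.0, size=n)
    emb = mos[:, None] * direction[None, :] + 0.5 * rng.normal(size=(n, dim))
    return emb, mos


train_X, train_mos = make_data(n_train)
test_X, test_mos = make_data(n_test)

bin_centers = np.arange(1.0, 6.0)  # MOS values 1..5
train_bins = np.clip(np.rint(train_mos).astype(int) - 1, 0, 4)

means, priors, precision = fit_shared_cov_gaussians(train_X, train_bins, 5)
preds = predict_mos(test_X, means, priors, precision, bin_centers)
corr = np.corrcoef(preds, test_mos)[0, 1]
print(f"Pearson correlation on synthetic test data: {corr:.3f}")
```

Because the model only estimates a few means and one covariance matrix, fitting is a closed-form computation over the training embeddings, which is consistent with the short training time the abstract reports for the PLDA module. The correlation printed here describes the synthetic data only and says nothing about the paper's results.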
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Phonetics and Phonology Research
