MooseNet: A Trainable Metric for Synthesized Speech with a PLDA Module

Ondřej Plátek; Ondřej Dušek

arXiv:2301.07087 · cs.CL · October 27, 2023

TL;DR

MooseNet is a trainable speech quality metric that applies a PLDA model on top of SSL embeddings to predict listeners' MOS. It works with as few as 136 training utterances and outperforms baseline models on VoiceMOS data.

Contribution

Introduces MooseNet, a speech quality assessment method that applies a PLDA model on top of SSL embeddings and outperforms SSL model fine-tuning in low-resource training.

Findings

01 PLDA improves MOS prediction across models

02 MooseNet outperforms baseline models on VoiceMOS data

03 Effective in low-resource training scenarios

Abstract

We present MooseNet, a trainable speech metric that predicts the listeners' Mean Opinion Score (MOS). We propose a novel approach where the Probabilistic Linear Discriminant Analysis (PLDA) generative model is used on top of an embedding obtained from a self-supervised learning (SSL) neural network (NN) model. We show that PLDA works well with a non-finetuned SSL model when trained on only 136 utterances (ca. one minute of training time) and that PLDA consistently improves various neural MOS prediction models, even state-of-the-art models with task-specific fine-tuning. Our ablation study shows that PLDA training is superior to SSL model fine-tuning in a low-resource scenario. We also improve SSL model fine-tuning using a convenient optimizer choice and additional contrastive and multi-task training objectives. The fine-tuned MooseNet NN with the PLDA module achieves the best results,…
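
The idea of placing a generative model on top of fixed utterance embeddings can be illustrated with a rough sketch. The code below is not the authors' implementation and is not full PLDA: as a simplified stand-in, it fits one Gaussian per MOS class with a shared diagonal variance and predicts the expected MOS under the class posterior. The integer MOS classes 1 to 5, the helper names (make_toy_data, fit_shared_gaussian, predict_mos), and the synthetic 8-dimensional "embeddings" are all assumptions made for illustration; real embeddings would come from an SSL model.

```python
import math
import random
from collections import defaultdict


def make_toy_data(n, dim, seed=0):
    """Hypothetical helper: synthetic 'embeddings' clustered around their MOS class."""
    rng = random.Random(seed)
    embeddings, classes = [], []
    for i in range(n):
        c = 1 + i % 5  # assumed integer MOS classes 1..5
        embeddings.append([c + rng.gauss(0.0, 0.3) for _ in range(dim)])
        classes.append(c)
    return embeddings, classes


def fit_shared_gaussian(embeddings, classes):
    """Fit per-class means, a pooled within-class diagonal variance, and class priors.

    This is a simplified stand-in for PLDA, not PLDA itself.
    """
    dim = len(embeddings[0])
    n = len(embeddings)
    by_class = defaultdict(list)
    for x, c in zip(embeddings, classes):
        by_class[c].append(x)
    means = {
        c: [sum(col) / len(xs) for col in zip(*xs)] for c, xs in by_class.items()
    }
    var = [0.0] * dim
    for x, c in zip(embeddings, classes):
        for d in range(dim):
            var[d] += (x[d] - means[c][d]) ** 2
    var = [v / n + 1e-6 for v in var]  # small floor avoids division by zero
    priors = {c: len(xs) / n for c, xs in by_class.items()}
    return means, var, priors


def predict_mos(x, means, var, priors):
    """Return the expected MOS class under the posterior over classes."""
    log_post = {}
    for c, mu in means.items():
        # The Gaussian normalizer is shared across classes, so it cancels out.
        log_lik = -0.5 * sum((xd - md) ** 2 / vd for xd, md, vd in zip(x, mu, var))
        log_post[c] = math.log(priors[c]) + log_lik
    top = max(log_post.values())
    weights = {c: math.exp(lp - top) for c, lp in log_post.items()}
    total = sum(weights.values())
    return sum(c * w for c, w in weights.items()) / total


# 136 training items mirrors the low-resource setting mentioned in the abstract.
train_x, train_c = make_toy_data(136, 8, seed=0)
means, var, priors = fit_shared_gaussian(train_x, train_c)
print(round(predict_mos([4.0] * 8, means, var, priors), 2))
```

Fitting here only computes class means and a pooled variance, which is why a model of this kind can be trained on very few utterances without touching the embedding network.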

Code & Models

Repositories

oplatek/moosenet-plda (PyTorch, official)

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Phonetics and Phonology Research