Improving Self-Supervised Learning-based MOS Prediction Networks

Bálint Gyires-Tóth; Csaba Zainkó

arXiv:2204.11030 · eess.AS · April 26, 2022 · 1 citation

TL;DR

This paper improves a self-supervised learning-based model that predicts Mean Opinion Scores (MOS) for speech, using changes to data processing, training, and post-training. The goal is to reduce reliance on costly human evaluations.

Contribution

The paper introduces data-, training-, and post-training-specific techniques that improve the MOS prediction accuracy of a self-supervised model based on wav2vec 2.0.

Findings

01. Improved prediction accuracy on the VoiceMOS Challenge dataset

02. Effective use of transfer learning and data preprocessing

03. Enhanced model robustness with dropout accumulation and quantization

Abstract

MOS (Mean Opinion Score) is a subjective method used to evaluate a system's quality. Telecommunications (for voice and video) and speech synthesis systems (for generated speech) are two of its many applications. While MOS tests are widely accepted, they are time-consuming and costly because they require human input. In addition, because the systems and the subjects differ from test to test, the results are not directly comparable. On the other hand, the large number of previous tests makes it possible to train machine learning models that predict MOS values. Predicting MOS values automatically can resolve both of these issues. The present work introduces data-, training-, and post-training-specific improvements to a previous self-supervised learning-based MOS prediction model. We used a wav2vec 2.0 model pre-trained on LibriSpeech, extended with LSTM…
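To make the quantity being predicted concrete, the following minimal sketch shows how a MOS could be computed from listener ratings by simple averaging, both per utterance and per system. The rating data, the 1-to-5 scale, and the helper name `mos_by` are illustrative assumptions and are not taken from the paper or its dataset.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical listener ratings (not from the paper):
# (system_id, utterance_id, opinion score on an assumed 1-5 scale)
ratings = [
    ("sysA", "utt1", 4), ("sysA", "utt1", 5), ("sysA", "utt2", 3),
    ("sysB", "utt3", 2), ("sysB", "utt3", 3), ("sysB", "utt4", 1),
]


def mos_by(key_index, rows):
    """Average the opinion scores of all rows that share the same key.

    key_index selects the grouping column: 0 for system, 1 for utterance.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row[key_index]].append(row[2])
    return {key: mean(scores) for key, scores in groups.items()}


# Utterance-level MOS: mean over the listeners who rated one utterance.
utterance_mos = mos_by(1, ratings)

# System-level MOS: mean over every rating collected for one system.
system_mos = mos_by(0, ratings)

for name, score in sorted(system_mos.items()):
    print(f"{name}: MOS = {score:.2f}")
```

A MOS prediction model of the kind described in the abstract would be trained to estimate values such as those in `utterance_mos` directly from audio, so that new listening tests are not needed for every system.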

Code & Models

Repositories

BME-SmartLab/DeepMOS (PyTorch, official)

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Speech and dialogue systems

Methods: Tanh Activation · Sigmoid Activation · Long Short-Term Memory · Dropout