Improving Self-Supervised Learning-based MOS Prediction Networks
Bálint Gyires-Tóth, Csaba Zainkó

TL;DR
This paper enhances self-supervised learning models for predicting Mean Opinion Scores (MOS) in speech systems, aiming to reduce reliance on costly human evaluations through various training and data processing improvements.
Contribution
It introduces specific data, training, and post-training techniques to improve MOS prediction accuracy of a self-supervised model based on wav2vec 2.0.
Findings
Improved prediction accuracy on the Voice MOS challenge dataset
Effective use of transfer learning and data preprocessing
Enhanced model robustness with dropout accumulation and quantization
Abstract
MOS (Mean Opinion Score) is a subjective method for evaluating a system's quality. Telecommunications (for voice and video) and speech synthesis systems (for generated speech) are two of the many applications of the method. While MOS tests are widely accepted, they are time-consuming and costly because they require human input. In addition, since the systems and subjects of the tests differ, results from different tests are not directly comparable. On the other hand, the large number of previous tests allows us to train machine learning models capable of predicting MOS values. By automatically predicting MOS values, both of the aforementioned issues can be resolved. The present work introduces data-, training-, and post-training-specific improvements to a previous self-supervised learning-based MOS prediction model. We used a wav2vec 2.0 model pre-trained on LibriSpeech, extended with LSTM…
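For readers unfamiliar with the quantity being predicted, the following minimal sketch shows how a MOS value is obtained from human ratings: it is the arithmetic mean of the listeners' scores for one sample. The 1 to 5 rating scale, the function name `mean_opinion_score`, and the example ratings are assumptions made for illustration; none of them are taken from the paper.

```python
def mean_opinion_score(ratings, low=1, high=5):
    """Return the MOS of one sample: the arithmetic mean of its ratings.

    The default 1..5 scale is an assumption (a common choice for
    listening tests), not something stated in the abstract.
    """
    if not ratings:
        raise ValueError("at least one rating is required")
    for r in ratings:
        if not low <= r <= high:
            raise ValueError(f"rating {r} is outside [{low}, {high}]")
    return sum(ratings) / len(ratings)


# Hypothetical ratings given by five listeners to one utterance.
example_ratings = [4, 5, 3, 4, 4]
print(mean_opinion_score(example_ratings))
```

A prediction model of the kind described in the abstract would be trained to output this averaged value directly from the audio, so no listeners are needed at inference time.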
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Speech and dialogue systems
Methods: Tanh Activation · Sigmoid Activation · Long Short-Term Memory · Dropout
