Comparison of Speech Representations for the MOS Prediction System
Aki Kunikoshi, Jaebok Kim, Wonsuk Jun, Kåre Sjölander (ReadSpeaker)

TL;DR
This study compares different speech representations, including self-supervised wav2vec features and spectral features, for predicting the quality of Text-to-Speech systems, and proposes combining them to improve generalization.
Contribution
It evaluates the generalization of SSL and spectral features for MOS prediction and introduces a combined feature approach to enhance performance.
Findings
Wav2vec features showed superior generalization.
Combining features yielded the best MOS prediction results.
Combining spectral and SSL features lets each compensate for the other's drawbacks.
Abstract
Automatic methods for predicting the Mean Opinion Score (MOS) of listeners have been researched to assure the quality of Text-to-Speech systems. Many previous studies focus on architectural advances (e.g. MBNet, LDNet) to capture the relation between spectral features and MOS more effectively, and have achieved high accuracy. However, the optimal representation in terms of generalization capability remains largely unknown. To this end, we compare the performance of Self-Supervised Learning (SSL) features obtained with the wav2vec framework to that of spectral features such as the magnitude spectrogram and the mel-spectrogram. Moreover, we propose to combine the SSL features with features that we believe retain information essential to automatic MOS prediction, so that each compensates for the other's drawbacks. We conduct comprehensive experiments on a large-scale listening test corpus collected from…
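The abstract does not say how the two feature types are combined. As a minimal sketch, assuming simple frame-wise concatenation, the idea could look like the following. The function names, the 512-point FFT, the 320-sample hop, and the 768-dimensional random array standing in for wav2vec-style features are all illustrative assumptions, not details from the paper.

```python
import numpy as np

def magnitude_spectrogram(wave, n_fft=512, hop=320):
    """Frame the signal, apply a Hann window, and take |rFFT| of each frame."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack(
        [wave[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=1))  # shape: (n_frames, n_fft // 2 + 1)

def combine_features(ssl_feats, spec_feats):
    """Trim both streams to the shorter one and concatenate them per frame."""
    n = min(len(ssl_feats), len(spec_feats))
    return np.concatenate([ssl_feats[:n], spec_feats[:n]], axis=1)

rng = np.random.default_rng(0)
wave = rng.standard_normal(16000)      # placeholder: 1 s of noise at 16 kHz
spec = magnitude_spectrogram(wave)     # spectral features
ssl = rng.standard_normal((49, 768))   # placeholder for SSL frame features (assumed shape)
combined = combine_features(ssl, spec) # one joint feature vector per frame
print(spec.shape, combined.shape)
```

In a real system the placeholder array would be replaced by the output of a pretrained SSL encoder, and the two streams would need a matching frame rate before concatenation; the joint features would then feed a MOS regressor.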
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
Methods: Test
