Comparison of Speech Representations for the MOS Prediction System

Aki Kunikoshi, Jaebok Kim, Wonsuk Jun, Kåre Sjölander (ReadSpeaker)

arXiv:2206.13817 · cs.SD · June 29, 2022 · 1 citation

TL;DR

This study compares different speech representations, including self-supervised wav2vec features and spectral features, for predicting the quality of Text-to-Speech systems, and proposes combining them to improve generalization.

Contribution

It evaluates the generalization of SSL and spectral features for MOS prediction and introduces a combined feature approach to enhance performance.

Findings

01

Wav2vec features generalized better than the spectral features.

02

Combining features yielded the best MOS prediction results.

03

Combining spectral and SSL features lets each compensate for the other's drawbacks.

Abstract

Automatic methods for predicting the Mean Opinion Score (MOS) of listeners have been studied as a way to assure the quality of Text-to-Speech systems. Many previous studies focus on architectural advances (e.g. MBNet, LDNet) that capture the relation between spectral features and MOS more effectively, and they have achieved high accuracy. However, which representation is optimal in terms of generalization capability remains largely unknown. To this end, we compare the performance of Self-Supervised Learning (SSL) features obtained with the wav2vec framework to that of spectral features such as the magnitude spectrogram and the mel-spectrogram. Moreover, we propose combining the SSL features with features that we believe retain information essential to automatic MOS prediction, so that the two compensate for each other's drawbacks. We conduct comprehensive experiments on a large-scale listening test corpus collected from…
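
The abstract does not say how the SSL and spectral features are combined, so the following is only a minimal sketch of one common option: align the two frame sequences in time, concatenate them frame by frame, average over time, and map the pooled vector to a score. Every function name, the nearest-neighbour alignment, the linear prediction head, and the 1 to 5 rating scale are assumptions made for illustration, not the authors' method. The toy numbers stand in for real wav2vec and mel-spectrogram frames.

```python
from statistics import mean


def resample_frames(frames, target_len):
    """Nearest-neighbour resampling of a frame sequence to target_len frames.

    Assumption: the two feature streams have different frame rates and
    need to be aligned before they can be concatenated.
    """
    n = len(frames)
    return [frames[min(n - 1, (i * n) // target_len)] for i in range(target_len)]


def combine_features(ssl_frames, spectral_frames):
    """Concatenate SSL and spectral vectors frame by frame (hypothetical scheme)."""
    aligned = resample_frames(spectral_frames, len(ssl_frames))
    return [s + m for s, m in zip(ssl_frames, aligned)]


def mean_pool(frames):
    """Average over time to get one utterance-level vector."""
    return [mean(column) for column in zip(*frames)]


def predict_mos(utterance_vec, weights, bias):
    """Hypothetical linear head; the output is clipped to an assumed 1..5 scale."""
    score = sum(w * x for w, x in zip(weights, utterance_vec)) + bias
    return max(1.0, min(5.0, score))


# Toy data: 2 SSL frames of dimension 2, 4 spectral frames of dimension 1.
ssl_frames = [[0.1, 0.2], [0.3, 0.4]]
spectral_frames = [[1.0], [2.0], [3.0], [4.0]]

combined = combine_features(ssl_frames, spectral_frames)
pooled = mean_pool(combined)
score = predict_mos(pooled, weights=[0.0, 0.0, 1.0], bias=1.0)
```

In a real system the linear head would be replaced by a trained network, and the weights here are arbitrary placeholders rather than learned values.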

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Test