Fusion of Self-supervised Learned Models for MOS Prediction

Zhengdong Yang; Wangjin Zhou; Chenhui Chu; Sheng Li; Raj Dabre; Raphael Rubino; Yi Zhao

arXiv:2204.04855 · cs.SD · April 12, 2022

TL;DR

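This paper presents a framework that fuses seven pretrained self-supervised learned (SSL) models for MOS prediction. In the 2022 MOS prediction challenge, the system ranked 1st on 6 of 16 metrics and in the top 3 on 13 of 16, with its largest gains over basic SSL models on out-of-domain (OOD) speech.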

Contribution

The paper introduces a fusion approach that combines seven pretrained SSL models and, on the OOD track, adds semi-supervised learning to exploit unlabeled data. The aim is to improve MOS prediction accuracy, particularly for out-of-domain data.
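The source does not say which semi-supervised method was used. As one common possibility, the sketch below shows a single round of pseudo-labeling: fit on the labeled set, predict scores for the unlabeled pool, then refit on both. The ridge regressor, the feature matrices, and all function names are hypothetical stand-ins for illustration, not the authors' system.

```python
import numpy as np

def fit_ridge(X, y, lam=1.0):
    """Closed-form ridge regression; a bias column is appended to X."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    A = Xb.T @ Xb + lam * np.eye(Xb.shape[1])
    return np.linalg.solve(A, Xb.T @ y)

def predict(X, w):
    """Apply ridge weights (last entry of w is the bias)."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return Xb @ w

def pseudo_label_round(X_lab, y_lab, X_unlab, lam=1.0):
    """One round: fit on labeled data, label the unlabeled pool, refit on both."""
    w0 = fit_ridge(X_lab, y_lab, lam)
    # Pseudo-labels are clipped to the 1-5 MOS range.
    y_pseudo = np.clip(predict(X_unlab, w0), 1.0, 5.0)
    X_all = np.vstack([X_lab, X_unlab])
    y_all = np.concatenate([y_lab, y_pseudo])
    return fit_ridge(X_all, y_all, lam)

# Synthetic data (hypothetical): a small labeled set and a larger unlabeled pool.
rng = np.random.default_rng(0)
X_lab = rng.normal(size=(30, 4))
y_lab = np.clip(3.0 + X_lab @ np.array([0.5, -0.3, 0.2, 0.1]), 1.0, 5.0)
X_unlab = rng.normal(size=(300, 4))

w = pseudo_label_round(X_lab, y_lab, X_unlab)
scores = predict(X_unlab, w)
```

In practice the pseudo-labels would more likely come from a fine-tuned SSL model than from a linear regressor, and the round could be repeated or restricted to confident predictions.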

Findings

01. Ranked 1st on 6 of the 16 challenge metrics.

02. Improved significantly over basic SSL models, especially on OOD data.

03. Placed among the top 3 systems on 13 of the 16 metrics, covering the main and OOD tracks.

Abstract

We participated in the mean opinion score (MOS) prediction challenge 2022. This challenge aims to predict the MOS of synthetic speech on two tracks: the main track and a more challenging out-of-domain (OOD) sub-track. To improve the accuracy of the predicted scores, we explored several model fusion strategies and proposed a fused framework that combines seven pretrained self-supervised learned (SSL) models. These pretrained SSL models are derived from three ASR frameworks: Wav2Vec, HuBERT, and WavLM. For the OOD track, we kept the seven SSL models selected on the main track and adopted a semi-supervised learning method to exploit the unlabeled data. According to the official analysis results, our system achieved 1st rank in 6 out of 16 metrics and is one of the top 3 systems for 13 out of 16 metrics. Specifically, we have achieved the highest…
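The abstract does not specify how the seven models are fused. As a minimal sketch of one plausible option, the code below fits least-squares weights that combine per-model MOS predictions on a held-out split. The function names and the synthetic data are assumptions for illustration and do not come from the paper.

```python
import numpy as np

def fit_fusion_weights(preds, targets):
    """Least-squares weights (plus bias) mapping per-model scores to MOS.

    preds:   (n_utterances, n_models) scores from each base predictor
    targets: (n_utterances,) ground-truth MOS
    """
    X = np.hstack([preds, np.ones((preds.shape[0], 1))])  # append bias column
    w, *_ = np.linalg.lstsq(X, targets, rcond=None)
    return w

def fuse(preds, w):
    """Apply the fitted weights, then clip to the 1-5 MOS range."""
    X = np.hstack([preds, np.ones((preds.shape[0], 1))])
    return np.clip(X @ w, 1.0, 5.0)

# Synthetic stand-in for seven SSL-based predictors (hypothetical data):
# each one sees the true score plus independent noise.
rng = np.random.default_rng(0)
true_mos = rng.uniform(1.0, 5.0, size=200)
base_preds = true_mos[:, None] + rng.normal(0.0, 0.5, size=(200, 7))

# Fit on the first half, evaluate on the second half.
w = fit_fusion_weights(base_preds[:100], true_mos[:100])
fused = fuse(base_preds[100:], w)
mse_single = np.mean((base_preds[100:, 0] - true_mos[100:]) ** 2)
mse_fused = np.mean((fused - true_mos[100:]) ** 2)
```

With independent errors, the fused score has lower error than any single predictor, which is the general motivation for fusing several models. Real SSL models have correlated errors, so the gain is usually smaller than in this toy setting.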

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Phonetics and Phonology Research

Methods: Lipschitz Constant Constraint