TL;DR
This paper uses self-supervised pre-trained models to predict Mean Opinion Score (MOS) for speech quality assessment, removing the need for clean references or parallel ground truth data and outperforming previous models on Voice Conversion Challenge datasets.
Contribution
The study demonstrates that self-supervised pre-trained models can be effectively fine-tuned for MOS prediction, achieving superior performance without requiring clean references or parallel ground truth data.
Findings
Outperforms previous state-of-the-art models on Voice Conversion Challenge datasets.
Self-supervised representations can distinguish between clean and noisy speech.
The approach is reproducible with publicly available tools.
Abstract
Speech quality assessment has been a critical issue in speech processing for decades. Existing automatic evaluations usually require clean references or parallel ground truth data, which is infeasible when the amount of data soars. Subjective tests, on the other hand, do not need any additional clean or parallel data and correlate better with human perception. However, such tests are expensive and time-consuming because crowd work is necessary. It is thus highly desirable to develop an automatic evaluation approach that correlates well with human perception while not requiring ground truth data. In this paper, we use self-supervised pre-trained models for MOS prediction. We show that their representations can distinguish between clean and noisy audio. Then, we fine-tune these pre-trained models followed by simple linear layers in an end-to-end manner. The experiment results showed that…
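The prediction head described in the abstract (a pre-trained encoder followed by simple linear layers) might look roughly like the sketch below. This is an illustration only, not the authors' implementation: mean pooling over time, the function names (`mean_pool`, `linear_head`, `predict_mos`), and the toy numbers are all assumptions. In the paper's setting the frame features would come from a self-supervised encoder, and the weights would be learned jointly with that encoder during fine-tuning.

```python
from statistics import fmean


def mean_pool(frames):
    """Average frame-level feature vectors over time into one utterance vector.

    Assumption: mean pooling is used here for illustration; the abstract does
    not say how frame-level features are aggregated.
    """
    if not frames:
        raise ValueError("need at least one frame")
    dim = len(frames[0])
    return [fmean(frame[i] for frame in frames) for i in range(dim)]


def linear_head(vector, weights, bias):
    """Single linear layer mapping an utterance vector to a scalar score."""
    return sum(w * x for w, x in zip(weights, vector)) + bias


def predict_mos(frames, weights, bias):
    """Pool the (hypothetical) encoder output, then apply the linear layer."""
    return linear_head(mean_pool(frames), weights, bias)


# Hypothetical encoder output: 3 frames, 2 feature dimensions each.
frames = [[0.2, 0.4], [0.4, 0.6], [0.6, 0.8]]
# Hypothetical weights and bias; in practice these are learned.
weights = [1.0, 2.0]
bias = 2.0

score = predict_mos(frames, weights, bias)
print(f"predicted MOS: {score:.2f}")
```

Only the head is shown; end-to-end fine-tuning would also update the encoder's parameters from the same training signal.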
