Analysis of XLS-R for Speech Quality Assessment
Bastiaan Tamm, Rik Vandenberghe, Hugo Van hamme

TL;DR
This paper analyzes pre-trained XLS-R embeddings for automated speech quality assessment, examining how the choice of layer and model size affects MOS prediction and how the extracted features relate to noise and speech content.
Contribution
It provides an in-depth analysis of XLS-R embeddings across layers and model sizes, revealing optimal feature regions and their roles in speech quality prediction.
Findings
Lower-level features capture noise and acoustics
High-level features focus on speech content
Fusion of features improves prediction accuracy
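The fusion finding can be illustrated with a minimal sketch. It assumes the frame-level embeddings of every layer are already available as one array, and it uses random data as a stand-in for real XLS-R outputs. The function names, the layer indices (5 and 20), and fusion by simple concatenation are illustrative assumptions, not the authors' method.

```python
import numpy as np


def pool_layer(hidden_states, layer):
    """Mean-pool one layer's frame-level embeddings over time.

    hidden_states: array of shape (num_layers, num_frames, dim)
    returns: array of shape (dim,)
    """
    return hidden_states[layer].mean(axis=0)


def fuse_layers(hidden_states, low_layer, high_layer):
    """Concatenate pooled embeddings from a lower and a higher layer.

    Concatenation is an assumed fusion strategy chosen for illustration.
    """
    low = pool_layer(hidden_states, low_layer)
    high = pool_layer(hidden_states, high_layer)
    return np.concatenate([low, high])


# Synthetic stand-in for per-layer outputs: 24 layers, 50 frames, 16 dims.
# Real XLS-R embeddings are much wider; these sizes are hypothetical.
rng = np.random.default_rng(0)
hidden_states = rng.standard_normal((24, 50, 16))

fused = fuse_layers(hidden_states, low_layer=5, high_layer=20)
print(fused.shape)  # (32,)
```

The fused vector would then be fed to a downstream MOS regressor, so that features from the lower layer and features from the higher layer are both available to it.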
Abstract
In online conferencing applications, estimating the perceived quality of an audio signal is crucial to ensure high quality of experience for the end user. The most reliable way to assess the quality of a speech signal is through human judgments in the form of the mean opinion score (MOS) metric. However, such an approach is labor intensive and not feasible for large-scale applications. The focus has therefore shifted towards automated speech quality assessment through end-to-end training of deep neural networks. Recently, it was shown that leveraging pre-trained wav2vec-based XLS-R embeddings leads to state-of-the-art performance for the task of speech quality prediction. In this paper, we perform an in-depth analysis of the pre-trained model. First, we analyze the performance of embeddings extracted from each layer of XLS-R and also for each size of the model (300M, 1B, 2B parameters).…
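As a small illustration of the MOS metric mentioned in the abstract, the sketch below averages listener ratings per clip and compares them with predicted scores. The ratings and predictions are made-up values, and the use of Pearson correlation as the agreement measure is an assumption, since the excerpt above does not state which evaluation metric the paper uses.

```python
import numpy as np


def mean_opinion_score(ratings):
    """MOS for one clip: the arithmetic mean of its listener ratings."""
    return float(np.mean(ratings))


def pearson(x, y):
    """Pearson correlation between two equal-length score sequences."""
    return float(np.corrcoef(x, y)[0, 1])


# Hypothetical listener ratings (1 to 5 scale) for three clips.
ratings_per_clip = [[4, 5, 4, 3], [2, 1, 2, 2], [3, 3, 4, 3]]
mos = [mean_opinion_score(r) for r in ratings_per_clip]
print(mos)  # [4.0, 1.75, 3.25]

# Hypothetical model predictions for the same three clips.
predicted = [3.8, 2.0, 3.0]
print(pearson(mos, predicted))
```

A correlation close to 1 would indicate that the predictor ranks clips in nearly the same way as the human listeners.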
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing
