TL;DR
This paper explores using pre-trained wav2vec-based XLS-R speech representations as features for automated speech quality assessment in online conferencing, showing improved accuracy over traditional features without fine-tuning.
Contribution
It introduces a feature extraction approach using XLS-R embeddings for speech quality prediction, reducing model complexity and enhancing performance compared to MFCC-based methods.
Findings
XLS-R features outperform MFCC features in MOS prediction accuracy.
Using pre-trained embeddings reduces the number of trainable parameters.
The approach achieves lower RMSE on the ConferencingSpeech 2022 dataset.
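For readers unfamiliar with the metric, RMSE here is the root-mean-square error between predicted and human-rated MOS values. The following is a minimal sketch; the function name and the sample scores are illustrative assumptions and are not taken from the paper or the dataset.

```python
import math

def rmse(predicted, target):
    """Root-mean-square error between two equal-length score lists."""
    if len(predicted) != len(target) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - t) ** 2 for p, t in zip(predicted, target)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up MOS values on the usual 1-5 scale, for illustration only.
human_mos = [4.0, 2.0, 3.0, 5.0]
predicted_mos = [3.0, 3.0, 3.0, 4.0]
error = rmse(predicted_mos, human_mos)
print(f"RMSE: {error:.4f}")
```

A lower value means the predicted scores sit closer to the human ratings on average, with large individual misses penalised more heavily than small ones.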
Abstract
Speech quality in online conferencing applications is typically assessed through human judgements in the form of the mean opinion score (MOS) metric. Since such a labor-intensive approach is not feasible for large-scale speech quality assessments in most settings, the focus has shifted towards automated MOS prediction through end-to-end training of deep neural networks (DNN). Instead of training a network from scratch, we propose to leverage the speech representations from the pre-trained wav2vec-based XLS-R model. However, the number of parameters of such a model exceeds task-specific DNNs by several orders of magnitude, which poses a challenge for resulting fine-tuning procedures on smaller datasets. Therefore, we opt to use pre-trained speech representations from XLS-R in a feature extraction rather than a fine-tuning setting, thereby significantly reducing the number of trainable…
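The feature-extraction setting described in the abstract can be illustrated with a toy sketch: a fixed encoder turns each clip into frame-level features, and only a small head on top is trained. Everything below is an assumption made for illustration. `frozen_features` is a hypothetical stand-in that computes simple frame statistics, not XLS-R, and the mean pooling, linear head, and sample clips are not the paper's architecture or data.

```python
def frozen_features(waveform, frame_len=4):
    """Hypothetical stand-in for a frozen pre-trained encoder.

    Maps a waveform to per-frame feature vectors using fixed statistics.
    Nothing in this function is updated during training.
    """
    frames = [waveform[i:i + frame_len]
              for i in range(0, len(waveform) - frame_len + 1, frame_len)]
    return [[sum(f) / frame_len, max(f), min(f),
             sum(x * x for x in f) / frame_len] for f in frames]

def mean_pool(frames):
    """Average frame-level features into one clip-level vector."""
    return [sum(column) / len(frames) for column in zip(*frames)]

class LinearHead:
    """The only trainable part: a linear map from pooled features to a MOS."""

    def __init__(self, dim):
        self.w = [0.0] * dim
        self.b = 0.0

    def predict(self, x):
        return sum(wi * xi for wi, xi in zip(self.w, x)) + self.b

    def sgd_step(self, x, target, lr=0.05):
        # Gradient of the squared error with respect to w and b.
        err = self.predict(x) - target
        self.w = [wi - lr * 2 * err * xi for wi, xi in zip(self.w, x)]
        self.b -= lr * 2 * err

    def num_trainable(self):
        return len(self.w) + 1

# Made-up (waveform, MOS) pairs, for illustration only.
clips = [
    ([0.1, 0.2, 0.1, 0.0, 0.2, 0.1, 0.0, 0.1], 4.5),
    ([0.9, -0.8, 0.7, -0.9, 0.8, -0.7, 0.9, -0.8], 1.5),
]

# Features are computed once; the encoder is never touched again.
pooled = [(mean_pool(frozen_features(w)), mos) for w, mos in clips]

head = LinearHead(dim=4)
first_loss = sum((head.predict(x) - y) ** 2 for x, y in pooled)
for _ in range(200):
    for x, y in pooled:
        head.sgd_step(x, y)
final_loss = sum((head.predict(x) - y) ** 2 for x, y in pooled)

print("trainable parameters:", head.num_trainable())
print("loss before/after:", first_loss, final_loss)
```

Because the encoder output is fixed, features can be cached once per clip, and the optimiser only ever sees the head's handful of parameters. That is the sense in which feature extraction keeps the trainable parameter count small relative to fine-tuning the whole model.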
Taxonomy
Methods: Attention Pooling · OPT
