Pre-trained Speech Representations as Feature Extractors for Speech Quality Assessment in Online Conferencing Applications

Bastiaan Tamm; Helena Balabin; Rik Vandenberghe; Hugo Van hamme

arXiv:2210.00259 · eess.AS · October 4, 2022

TL;DR

This paper uses speech representations from the pre-trained, wav2vec-based XLS-R model as fixed features for automated speech quality assessment in online conferencing, and reports better accuracy than traditional features without fine-tuning XLS-R.

Contribution

The paper introduces a feature-extraction approach that uses XLS-R embeddings for speech quality prediction, reducing the number of trainable parameters and improving performance compared to MFCC-based methods.

Findings

01. XLS-R features outperform MFCC features in MOS prediction accuracy.

02. Using pre-trained embeddings as fixed features reduces the number of trainable parameters.

03. The approach achieves a lower RMSE than MFCC-based features on the ConferencingSpeech 2022 dataset.

Abstract

Speech quality in online conferencing applications is typically assessed through human judgements in the form of the mean opinion score (MOS) metric. Since such a labor-intensive approach is not feasible for large-scale speech quality assessments in most settings, the focus has shifted towards automated MOS prediction through end-to-end training of deep neural networks (DNN). Instead of training a network from scratch, we propose to leverage the speech representations from the pre-trained wav2vec-based XLS-R model. However, the number of parameters of such a model exceeds task-specific DNNs by several orders of magnitude, which poses a challenge for resulting fine-tuning procedures on smaller datasets. Therefore, we opt to use pre-trained speech representations from XLS-R in a feature extraction rather than a fine-tuning setting, thereby significantly reducing the number of trainable…
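To make the evaluation terms concrete, the following is a minimal sketch of how a MOS label and an RMSE score could be computed. It is not taken from the paper or its repository: the function names, the 1-to-5 rating scale, and all numbers are hypothetical and chosen only for illustration.

```python
import math


def mean_opinion_score(ratings):
    """Average one clip's listener ratings into a single MOS value.

    Assumption: ratings are numeric scores on a 1-5 scale.
    """
    if not ratings:
        raise ValueError("need at least one rating")
    return sum(ratings) / len(ratings)


def rmse(predicted, target):
    """Root-mean-square error between predicted and reference MOS values."""
    if len(predicted) != len(target) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - t) ** 2 for p, t in zip(predicted, target)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical listener ratings for two clips (not real data).
ratings_per_clip = {
    "clip_a": [4, 5, 4, 3],
    "clip_b": [2, 1, 2, 3],
}
labels = [mean_opinion_score(r) for r in ratings_per_clip.values()]

# Hypothetical model outputs for the same two clips.
predictions = [3.5, 2.5]
score = rmse(predictions, labels)
```

A lower RMSE means the predicted MOS values sit closer to the human-derived labels, which is the sense in which one feature extractor is said to outperform another.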

Code & Models

Repositories

lcn-kul/conferencing-speech-2022
PyTorch · Official

Taxonomy

Methods: Attention Pooling · OPT