Exploring the influence of fine-tuning data on wav2vec 2.0 model for blind speech quality prediction

Helard Becerra; Alessandro Ragano; Andrew Hines

arXiv:2204.02135 · eess.AS · April 6, 2022 · Interspeech · 1 citation

TL;DR

This study investigates how different fine-tuning datasets, varying in language and size, influence wav2vec 2.0's effectiveness in predicting speech quality across diverse conferencing scenarios, highlighting the importance of data diversity and volume.

Contribution

It systematically analyzes the impact of fine-tuning data characteristics on wav2vec 2.0's speech quality prediction performance across multiple languages and dataset sizes.

Findings

01 Larger fine-tuning datasets improve performance.

02 Language diversity enhances model adaptability.

03 Fine-tuned models compete with baseline models.

Abstract

Recent studies have shown that self-supervised models can produce accurate speech quality predictions. Speech representations generated by the pre-trained wav2vec 2.0 model allow robust prediction models to be built from small amounts of annotated data. This opens the possibility of developing strong models in scenarios where labelled data is scarce. Fine-tuning is known to improve the model's performance; however, it is unclear how the data used for fine-tuning (e.g., language, number of samples) influences that performance. In this paper, we explore how fine-tuning wav2vec 2.0 on different speech corpora influences its performance. We took four speech datasets containing degradations found in common conferencing applications and fine-tuned wav2vec 2.0 targeting different languages and data size scenarios. The fine-tuned models were tested across all four…
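The experimental design described in the abstract (fine-tune on one corpus, then test on every corpus) can be sketched as a train-by-test grid. The sketch below is illustrative only and is not the authors' code: `fine_tune` is a hypothetical callable standing in for wav2vec 2.0 fine-tuning, the corpus names are placeholders, and scoring with Pearson correlation is an assumption, since the visible abstract does not name the evaluation metric.

```python
from math import sqrt
from typing import Callable, Dict, List, Sequence, Tuple

# A corpus is a pair: (clips, human quality labels). What a "clip" is
# (waveform, file path, feature vector) is left open in this sketch.
Corpus = Tuple[List[object], List[float]]
Predictor = Callable[[object], float]


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between predictions and labels (assumed metric)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for constant input
    return cov / sqrt(var_x * var_y)


def cross_corpus_grid(
    corpora: Dict[str, Corpus],
    fine_tune: Callable[[Corpus], Predictor],
) -> Dict[Tuple[str, str], float]:
    """Fine-tune once per corpus, then score that model on every corpus.

    `fine_tune` is a hypothetical stand-in: it takes a training corpus and
    returns a function mapping one clip to a predicted quality score.
    """
    grid: Dict[Tuple[str, str], float] = {}
    for train_name, train_corpus in corpora.items():
        predict = fine_tune(train_corpus)
        for test_name, (clips, labels) in corpora.items():
            predictions = [predict(clip) for clip in clips]
            grid[(train_name, test_name)] = pearson(predictions, labels)
    return grid
```

With four corpora this yields a 4 x 4 grid keyed by `(train_name, test_name)`; the diagonal holds in-corpus scores and the off-diagonal cells show how well a model fine-tuned on one corpus transfers to another.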

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Voice and Speech Disorders