A Fine-tuned Wav2vec 2.0/HuBERT Benchmark For Speech Emotion Recognition, Speaker Verification and Spoken Language Understanding

Yingzhi Wang; Abdelmoumene Boumadane; Abdelwahab Heba

arXiv:2111.02735 · cs.CL · October 5, 2022 · 93 cites

TL;DR

This paper evaluates the effectiveness of fine-tuned wav2vec 2.0 and HuBERT models on non-ASR speech tasks, demonstrating their strong performance in emotion recognition, speaker verification, and spoken language understanding.

Contribution

It introduces a comprehensive benchmark for fine-tuning wav2vec 2.0 and HuBERT on diverse speech tasks beyond ASR, with simple frameworks and detailed performance analysis.

Findings

01

Achieved 79.58% weighted accuracy (speaker-dependent setting) in Speech Emotion Recognition on IEMOCAP

02

Reached a 2.36% equal error rate (EER) in Speaker Verification on VoxCeleb1

03

Attained 89.38% accuracy in Intent Classification on SLURP

Abstract

Speech self-supervised models such as wav2vec 2.0 and HuBERT are making revolutionary progress in Automatic Speech Recognition (ASR). However, they have not been totally proven to produce better performance on tasks other than ASR. In this work, we explored partial fine-tuning and entire fine-tuning on wav2vec 2.0 and HuBERT pre-trained models for three non-ASR speech tasks: Speech Emotion Recognition, Speaker Verification and Spoken Language Understanding. With simple proposed downstream frameworks, the best scores reached 79.58% weighted accuracy on speaker-dependent setting and 73.01% weighted accuracy on speaker-independent setting for Speech Emotion Recognition on IEMOCAP, 2.36% equal error rate for Speaker Verification on VoxCeleb1, 89.38% accuracy for Intent Classification and 78.92% F1 for Slot Filling on SLURP, showing the strength of fine-tuned wav2vec 2.0 and HuBERT on…
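
For readers unfamiliar with the equal error rate reported for Speaker Verification, the following is a minimal sketch of how such a figure can be computed from trial scores. It is not the authors' evaluation code: the function name `equal_error_rate`, the threshold sweep, and the toy scores are all assumptions made for illustration. The EER is the operating point at which the false acceptance rate (impostor trials accepted) equals the false rejection rate (genuine trials rejected).

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by trying every observed score as a threshold.

    target_scores: similarity scores for same-speaker trials.
    nontarget_scores: similarity scores for different-speaker trials.
    (Hypothetical helper for illustration only.)
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # False acceptance: an impostor trial scores at or above the threshold.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        # False rejection: a genuine trial scores below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the threshold where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy, made-up scores (not taken from the paper).
target = [0.9, 0.8, 0.7, 0.4]
nontarget = [0.6, 0.3, 0.2, 0.1]
eer = equal_error_rate(target, nontarget)
print(f"EER = {eer:.2%}")  # prints "EER = 25.00%"
```

On a real trial list the sweep is usually replaced by interpolation along the ROC curve, but the quantity being estimated is the same. A lower EER means better separation between genuine and impostor trials.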

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing