Multi-task Voice Activated Framework using Self-supervised Learning
Shehzeen Hussain, Van Nguyen, Shuhua Zhang, Erik Visser

TL;DR
This paper presents a versatile framework that adapts self-supervised wav2vec 2.0 speech representations for multiple voice-activated tasks, achieving state-of-the-art results in speaker verification and keyword spotting.
Contribution
It introduces a general-purpose, multi-task learning framework that fine-tunes wav2vec 2.0 for various voice tasks using shared transformer architectures.
Findings
Achieved 1.98% EER on VoxCeleb1 for speaker verification.
Achieved 98.23% accuracy on Google Speech Commands.
Demonstrated multi-task learning across voice-activated tasks with a shared transformer backbone.
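For readers unfamiliar with the speaker verification metric above: the equal error rate (EER) is the operating point at which the false acceptance rate and the false rejection rate are equal. The sketch below is a minimal illustration of that definition, not the paper's evaluation code. The function name `equal_error_rate`, the threshold sweep, and the toy trial list are all assumptions made for this example.

```python
def equal_error_rate(scores, labels):
    """Estimate the EER for a list of verification trials.

    scores: similarity scores, higher means "more likely the same speaker".
    labels: 1 for a target (same-speaker) trial, 0 for an impostor trial.
    """
    n_target = sum(labels)
    n_impostor = len(labels) - n_target
    if n_target == 0 or n_impostor == 0:
        raise ValueError("need at least one target and one impostor trial")

    best_gap, eer = float("inf"), 1.0
    # Candidate thresholds: every distinct score, plus "reject everything".
    thresholds = sorted(set(scores)) + [float("inf")]
    for t in thresholds:
        # A trial is accepted when its score is at or above the threshold.
        false_accepts = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= t)
        false_rejects = sum(1 for s, y in zip(scores, labels) if y == 1 and s < t)
        far = false_accepts / n_impostor
        frr = false_rejects / n_target
        # Keep the threshold where the two error rates are closest.
        if abs(far - frr) < best_gap:
            best_gap = abs(far - frr)
            eer = (far + frr) / 2
    return eer


# Toy example (made-up scores): 4 target and 4 impostor trials.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 0, 1, 0, 1, 0, 0]
print(equal_error_rate(toy_scores, toy_labels))
```

The sweep is quadratic in the number of trials, which is fine for illustration; a real evaluation over a full trial list would sort the scores once and accumulate the error counts in a single pass.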
Abstract
Self-supervised learning methods such as wav2vec 2.0 have shown promising results in learning speech representations from unlabelled and untranscribed speech data that are useful for speech recognition. Since these representations are learned without any task-specific supervision, they can also be useful for other voice-activated tasks such as speaker verification, keyword spotting, and emotion classification. In our work, we propose a general-purpose framework for adapting a pre-trained wav2vec 2.0 model for different voice-activated tasks. We develop downstream network architectures that operate on the contextualized speech representations of wav2vec 2.0 to adapt the representations for solving a given task. Finally, we extend our framework to perform multi-task learning by jointly optimizing the network parameters on multiple voice-activated tasks using a shared transformer backbone.…
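The joint optimization described in the abstract can be pictured as several task heads reading the same shared features, with the per-task losses combined into one objective. The sketch below is a forward-pass illustration only, under stated assumptions: `frames` stands in for the frame-level output of the shared backbone, and the mean pooling, single linear heads, task names, and weighted-sum loss are hypothetical choices for this example, not the downstream architectures or loss used in the paper.

```python
import math


def mean_pool(frames):
    """Average frame-level vectors (T x D) into one utterance-level vector (D)."""
    dim = len(frames[0])
    return [sum(frame[d] for frame in frames) / len(frames) for d in range(dim)]


def linear(weights, bias, x):
    """Dense layer: one logit per output class."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def cross_entropy(logits, target):
    """Softmax cross-entropy for one example, using log-sum-exp for stability."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]


def joint_loss(frames, targets, heads, task_weights):
    """Weighted sum of per-task losses, all computed from the same shared features.

    frames:       placeholder for the shared backbone output (assumption).
    targets:      {task_name: class_index} for this utterance.
    heads:        {task_name: (weights, bias)}, one hypothetical linear head per task.
    task_weights: {task_name: float}, relative importance of each task (assumption).
    """
    pooled = mean_pool(frames)  # shared representation reused by every head
    total = 0.0
    for task, target in targets.items():
        weights, bias = heads[task]
        logits = linear(weights, bias, pooled)
        total += task_weights[task] * cross_entropy(logits, target)
    return total


# Toy usage: two frames of 2-dimensional features, two hypothetical tasks.
frames = [[1.0, 2.0], [3.0, 4.0]]
heads = {
    "speaker": ([[0.1, -0.2], [0.0, 0.3], [-0.1, 0.1]], [0.0, 0.0, 0.0]),
    "keyword": ([[0.2, 0.0], [-0.2, 0.1]], [0.0, 0.0]),
}
loss = joint_loss(frames, {"speaker": 1, "keyword": 0}, heads,
                  {"speaker": 1.0, "keyword": 1.0})
print(loss)
```

In a trainable version, the gradient of this single scalar would flow back through every head into the shared backbone, which is what lets one set of backbone parameters serve several tasks.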
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
Methods: Test
