Efficient Adapter Transfer of Self-Supervised Speech Models for Automatic Speech Recognition
Bethan Thomas, Samuel Kessler, Salah Karout

TL;DR
This paper applies adapter modules to self-supervised speech models such as wav2vec 2.0, enabling efficient transfer to automatic speech recognition by substantially reducing the number of trained parameters with little loss in performance.
Contribution
It demonstrates that applying adapters to wav2vec 2.0 reduces the number of parameters trained per downstream ASR task, making the model more scalable across tasks and languages with minimal performance loss.
Findings
Adapters enable ASR while training fewer than 10% of the parameters per task that full fine-tuning requires.
Applying adapters to top layers yields similar performance to full transfer.
Using adapters improves scalability for multi-task and multilingual speech recognition.
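The parameter budget behind these findings can be illustrated with some back-of-the-envelope arithmetic. The sketch below is not the paper's configuration: the layer count (12), hidden size (768), and baseline of roughly 95M parameters are assumed values commonly cited for wav2vec 2.0 BASE, and the bottleneck sizes, the two-adapters-per-layer placement, and the function names are hypothetical choices made for illustration.

```python
def adapter_params(d_model, d_bottleneck):
    """Weights and biases of one down-projection plus one up-projection."""
    down = d_model * d_bottleneck + d_bottleneck  # d_model -> d_bottleneck
    up = d_bottleneck * d_model + d_model         # d_bottleneck -> d_model
    return down + up


def trainable_fraction(n_layers, adapters_per_layer, d_model, d_bottleneck,
                       baseline_params):
    """Adapter parameters as a fraction of a fully fine-tuned baseline."""
    added = n_layers * adapters_per_layer * adapter_params(d_model, d_bottleneck)
    return added / baseline_params


# Assumption: approximate parameter count of the model being adapted.
ASSUMED_BASELINE = 95_000_000

# Hypothetical bottleneck sizes, two adapters in each of 12 layers.
for d_b in (64, 128, 256):
    frac = trainable_fraction(12, 2, 768, d_b, ASSUMED_BASELINE)
    print(f"bottleneck={d_b}: {frac:.2%} of baseline parameters")
```

Passing a smaller `n_layers` models the case where adapters are inserted only into the top layers, which shrinks the per-task budget further.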
Abstract
Self-supervised learning (SSL) is a powerful tool that allows learning of underlying representations from unlabeled data. Transformer-based models such as wav2vec 2.0 and HuBERT are leading the field in the speech domain. Generally, these models are fine-tuned on a small amount of labeled data for a downstream task such as Automatic Speech Recognition (ASR). This involves re-training the majority of the model for each task. Adapters are small, lightweight modules which are commonly used in Natural Language Processing (NLP) to adapt pre-trained models to new tasks. In this paper we propose applying adapters to wav2vec 2.0 to reduce the number of parameters required for downstream ASR tasks, and increase scalability of the model to multiple tasks or languages. Using adapters we can perform ASR while training fewer than 10% of parameters per task compared to full fine-tuning with little…
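To make the idea of an adapter concrete, here is a minimal sketch of the bottleneck design common in the NLP adapter literature: a down-projection, a nonlinearity, an up-projection, and a residual connection. Whether the paper uses exactly this layout is not stated in the text above, so treat the class name `BottleneckAdapter`, the ReLU nonlinearity, and the zero-initialized up-projection as assumptions. It is written in plain Python for readability rather than speed.

```python
import random


def relu(v):
    return [max(0.0, a) for a in v]


def matvec(W, x, b):
    """Return W @ x + b, where W is a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


class BottleneckAdapter:
    """Hypothetical adapter: x + W_up @ relu(W_down @ x + b_down) + b_up."""

    def __init__(self, d_model, d_bottleneck, seed=0):
        rng = random.Random(seed)
        # Small random down-projection (d_model -> d_bottleneck).
        self.W_down = [[rng.gauss(0.0, 0.02) for _ in range(d_model)]
                       for _ in range(d_bottleneck)]
        self.b_down = [0.0] * d_bottleneck
        # Assumption: zero-initialized up-projection, so the adapter starts
        # as an identity map and leaves the pre-trained model unchanged.
        self.W_up = [[0.0] * d_bottleneck for _ in range(d_model)]
        self.b_up = [0.0] * d_model

    def num_params(self):
        d_b, d_m = len(self.W_down), len(self.W_up)
        return 2 * d_m * d_b + d_b + d_m

    def __call__(self, x):
        h = relu(matvec(self.W_down, x, self.b_down))
        delta = matvec(self.W_up, h, self.b_up)
        # Residual connection: the adapter only learns a correction to x.
        return [xi + di for xi, di in zip(x, delta)]


adapter = BottleneckAdapter(d_model=8, d_bottleneck=2)
x = [0.5, -1.0, 2.0, 0.0, 1.5, -0.5, 3.0, 1.0]
y = adapter(x)
```

During transfer, only the adapter weights would be updated while the pre-trained Transformer weights stay frozen, which is what keeps the per-task parameter count small.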
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Label Smoothing · Layer Normalization · Byte Pair Encoding · Dense Connections · Residual Connection · Absolute Position Encodings · Softmax
