Domain Adaptation of low-resource Target-Domain models using well-trained ASR Conformer Models
Vrunda N. Sukhadia, S. Umesh

TL;DR
This paper presents a domain adaptation method for low-resource ASR that leverages embeddings from well-trained models' encoder layers, combined with Spectral Augmentation, to significantly improve target-domain recognition performance.
Contribution
The study proposes feeding embeddings tapped from the encoder layers of a well-trained ASR model as input features to a downstream target-domain model, which improves low-resource ASR performance.
Findings
30% relative improvement on LibriSpeech-100-clean data
50% relative improvement on WSJ data
Effective combination of encoder embeddings and Spectral Augmentation
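The relative-improvement figures above can be read with a small worked example. This is a minimal sketch, assuming the underlying metric is word error rate (WER), which the summary does not name; the function name and the input numbers are hypothetical and are not taken from the paper.

```python
def relative_improvement(baseline_wer: float, adapted_wer: float) -> float:
    """Fractional error reduction relative to the baseline (assumed metric: WER)."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - adapted_wer) / baseline_wer


# Hypothetical error rates, chosen only to illustrate the arithmetic.
example = relative_improvement(baseline_wer=10.0, adapted_wer=7.0)
print(f"{example:.0%}")
```

Under this reading, a 30% relative improvement means the adapted model removes 30% of the baseline's errors, not 30 absolute percentage points.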
Abstract
In this paper, we investigate domain adaptation for low-resource Automatic Speech Recognition (ASR) of target-domain data, when a well-trained ASR model trained with a large dataset is available. We argue that in the encoder-decoder framework, the decoder of the well-trained ASR model is largely tuned towards the source-domain, hurting the performance of target-domain models in vanilla transfer-learning. On the other hand, the encoder layers of the well-trained ASR model mostly capture the acoustic characteristics. We, therefore, propose to use the embeddings tapped from these encoder layers as features for a downstream Conformer target-domain model and show that they provide significant improvements. We do ablation studies on which encoder layer is optimal to tap the embeddings, as well as the effect of freezing or updating the well-trained ASR model's encoder layers. We further show…
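The idea of tapping an intermediate encoder layer can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the layers are hypothetical stand-ins that only scale their input, and `tap_encoder`, `make_toy_layer`, and `tap_index` are names invented here. In practice each layer would be a Conformer block in a deep-learning framework, and the tapped output would feed the downstream target-domain model.

```python
from typing import Callable, List, Sequence

Frame = List[float]
Layer = Callable[[List[Frame]], List[Frame]]


def tap_encoder(layers: Sequence[Layer], frames: List[Frame], tap_index: int) -> List[Frame]:
    """Run frames through layers[0..tap_index] and return that layer's output.

    Later encoder layers and the source-domain decoder are never evaluated.
    """
    if not 0 <= tap_index < len(layers):
        raise IndexError("tap_index out of range")
    hidden = frames
    for layer in layers[: tap_index + 1]:
        hidden = layer(hidden)
    return hidden


def make_toy_layer(scale: float) -> Layer:
    """Hypothetical stand-in for an encoder block: scales every feature value."""
    return lambda frames: [[scale * x for x in frame] for frame in frames]


# Three toy "encoder layers" of a source model (assumption for illustration).
source_encoder = [make_toy_layer(2.0), make_toy_layer(3.0), make_toy_layer(5.0)]
acoustic_frames = [[1.0, 2.0], [0.5, 0.0]]

# Tap the output of the second layer (index 1) as features for a target model.
embeddings = tap_encoder(source_encoder, acoustic_frames, tap_index=1)
```

Varying `tap_index` corresponds to the ablation over which encoder layer to tap. The sketch does not model the choice between freezing and updating the source encoder, which only matters once the layers have trainable parameters.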
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
