Efficient infusion of self-supervised representations in Automatic Speech Recognition
Darshan Prabhu, Sai Ganesh Mirishkar, Pankaj Wasnik

TL;DR
This paper introduces two efficient methods for incorporating self-supervised speech representations into ASR systems. Training is faster and accuracy improves, while model size stays comparable to a standard encoder-decoder conformer.
Contribution
The paper proposes two approaches, one based on framewise addition and one on cross-attention, for integrating representations from SSL models into an ASR architecture without using the SSL models during training.
Findings
Faster training compared to baseline models
Significant performance improvements on Librispeech and Tedlium datasets
Integration of SSL representations with model size comparable to standard encoder-decoder conformer systems
Abstract
Self-supervised learned (SSL) models such as Wav2vec and HuBERT yield state-of-the-art results on speech-related tasks. Given the effectiveness of such models, it is advantageous to use them in conventional ASR systems. While some approaches suggest incorporating these models as a trainable encoder or a learnable frontend, training such systems is extremely slow and requires a lot of computation cycles. In this work, we propose two simple approaches that use (1) framewise addition and (2) cross-attention mechanisms to efficiently incorporate the representations from the SSL model(s) into the ASR architecture, resulting in models that are comparable in size with standard encoder-decoder conformer systems while also avoiding the usage of SSL models during training. Our approach results in faster training and yields significant performance gains on the Librispeech and Tedlium datasets…
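The two fusion mechanisms named in the abstract can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names, the projection matrices, the residual connection in the cross-attention variant, and the assumption that SSL features are already available as a plain array are all hypothetical choices made for the example.

```python
import numpy as np


def framewise_add(enc, ssl, w_proj):
    """Add projected SSL features to encoder features, frame by frame.

    enc:    (T, d_model) encoder features
    ssl:    (T, d_ssl)   SSL features (assumed to share the frame count T)
    w_proj: (d_ssl, d_model) hypothetical projection to the encoder width
    """
    if enc.shape[0] != ssl.shape[0]:
        raise ValueError("framewise addition assumes equal frame counts")
    return enc + ssl @ w_proj


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attend(enc, ssl, w_q, w_k, w_v):
    """Single-head cross-attention: encoder frames query the SSL frames.

    enc: (T, d_model), ssl: (S, d_ssl); T and S may differ.
    w_q: (d_model, d_k), w_k: (d_ssl, d_k), w_v: (d_ssl, d_model)
    Returns the fused features (T, d_model) and the attention map (T, S).
    """
    q = enc @ w_q                              # (T, d_k)
    k = ssl @ w_k                              # (S, d_k)
    v = ssl @ w_v                              # (S, d_model)
    scores = q @ k.T / np.sqrt(q.shape[-1])    # scaled dot product, (T, S)
    attn = softmax(scores, axis=-1)            # each row sums to 1
    return enc + attn @ v, attn                # residual add is an assumption


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    enc = rng.standard_normal((5, 4))          # 5 encoder frames, width 4
    ssl = rng.standard_normal((5, 6))          # 5 SSL frames, width 6
    fused = framewise_add(enc, ssl, rng.standard_normal((6, 4)))
    print(fused.shape)

    ssl_long = rng.standard_normal((9, 6))     # different frame count is fine
    out, attn = cross_attend(
        enc, ssl_long,
        rng.standard_normal((4, 3)),
        rng.standard_normal((6, 3)),
        rng.standard_normal((6, 4)),
    )
    print(out.shape, attn.shape)
```

In this sketch, framewise addition requires the two sequences to be aligned frame for frame, whereas cross-attention lets each encoder frame weight all SSL frames and so tolerates a length mismatch.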
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Advanced Data Compression Techniques
