Efficient infusion of self-supervised representations in Automatic Speech Recognition

Darshan Prabhu; Sai Ganesh Mirishkar; Pankaj Wasnik

arXiv:2404.12628·cs.CL·April 22, 2024

Open Access

TL;DR

This paper introduces two efficient methods for incorporating self-supervised speech representations into ASR systems, achieving faster training and improved performance while keeping model size comparable to a standard encoder-decoder conformer system.

Contribution

The paper proposes two simple approaches, framewise addition and cross-attention, for integrating representations from SSL models into the ASR architecture, resulting in faster training and significant performance gains.

Findings

01 Faster training than baseline models

02 Significant performance improvements on the Librispeech and Tedlium datasets

03 Effective integration of SSL representations with model size comparable to standard encoder-decoder conformer systems

Abstract

Self-supervised learned (SSL) models such as Wav2vec and HuBERT yield state-of-the-art results on speech-related tasks. Given the effectiveness of such models, it is advantageous to use them in conventional ASR systems. While some approaches suggest incorporating these models as a trainable encoder or a learnable frontend, training such systems is extremely slow and requires a lot of computation cycles. In this work, we propose two simple approaches that use (1) framewise addition and (2) cross-attention mechanisms to efficiently incorporate the representations from the SSL model(s) into the ASR architecture, resulting in models that are comparable in size with standard encoder-decoder conformer systems while also avoiding the usage of SSL models during training. Our approach results in faster training and yields significant performance gains on the Librispeech and Tedlium datasets…
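
The two fusion mechanisms named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the projection matrices, the single attention head, the residual connection, and all tensor shapes are assumptions made for illustration. The SSL features are treated as plain arrays, as if extracted ahead of time, which is one possible reading of "avoiding the usage of SSL models during training".

```python
import numpy as np


def framewise_add(acoustic, ssl, proj):
    """Framewise addition (illustrative): project SSL features to the
    model dimension and add them to the acoustic features frame by frame.

    acoustic: (T, d_model), ssl: (T, d_ssl), proj: (d_ssl, d_model).
    Assumes both streams have the same number of frames T.
    """
    if acoustic.shape[0] != ssl.shape[0]:
        raise ValueError("framewise addition needs equal frame counts")
    return acoustic + ssl @ proj


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def cross_attend(acoustic, ssl, w_q, w_k, w_v):
    """Single-head cross-attention (illustrative): acoustic frames act as
    queries over SSL frames, so the two frame counts may differ.

    Returns the fused features (T_a, d_model) and the attention
    weights (T_a, T_s).
    """
    q = acoustic @ w_q            # (T_a, d_k)
    k = ssl @ w_k                 # (T_s, d_k)
    v = ssl @ w_v                 # (T_s, d_model)
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    # Residual connection keeps the original acoustic stream (an assumption).
    return acoustic + weights @ v, weights


# Hypothetical sizes: 6 acoustic frames, 10 SSL frames.
rng = np.random.default_rng(0)
acoustic = rng.standard_normal((6, 16))
ssl_same_rate = rng.standard_normal((6, 32))
ssl_other_rate = rng.standard_normal((10, 32))

added = framewise_add(acoustic, ssl_same_rate, rng.standard_normal((32, 16)))
fused, attn = cross_attend(
    acoustic,
    ssl_other_rate,
    rng.standard_normal((16, 8)),
    rng.standard_normal((32, 8)),
    rng.standard_normal((32, 16)),
)
print(added.shape, fused.shape, attn.shape)
```

In this sketch, framewise addition requires the two streams to be aligned frame for frame, whereas cross-attention lets each acoustic frame weight SSL frames freely, at the cost of extra projection parameters.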

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Advanced Data Compression Techniques